[RFC,v2,0/4] mm: Introduce MAP_BELOW_HINT

Message ID 20240829-patches-below_hint_mmap-v2-0-638a28d9eae0@rivosinc.com

Charlie Jenkins Aug. 29, 2024, 7:15 a.m. UTC
Some applications rely on placing data in the free bits of addresses
allocated by mmap. Various architectures (e.g. x86, arm64, powerpc) restrict the
address returned by mmap to be less than the 48-bit address space,
unless the hint address uses more than 47 bits (the 48th bit is reserved
for the kernel address space).

The riscv architecture needs a way to similarly restrict the virtual
address space. The riscv port of OpenJDK throws an error if it is
run on the 57-bit address space, called sv57 [1]. golang
has a comment that sv57 support is not complete, but there are some
workarounds to get it to mostly work [2].

These applications work on x86 because x86 implicitly restricts the
addresses returned by mmap() to 47 bits when the hint address is less
than 48 bits.

Instead of implicitly restricting the address space on riscv (or any
current/future architecture), a flag would allow users to opt-in to this
behavior rather than opt-out as is done on other architectures. This is
desirable because it is a small class of applications that do pointer
masking.

This flag will also allow seamless compatibility between all
architectures, so applications like Go and OpenJDK that use bits in a
virtual address can request the exact number of bits they need in a
generic way. The flag can be checked inside of vm_unmapped_area() so
that this flag does not have to be handled individually by each
architecture. 

Link: https://github.com/openjdk/jdk/blob/f080b4bb8a75284db1b6037f8c00ef3b1ef1add1/src/hotspot/cpu/riscv/vm_version_riscv.cpp#L79 [1]
Link: https://github.com/golang/go/blob/9e8ea567c838574a0f14538c0bbbd83c3215aa55/src/runtime/tagptr_64bit.go#L47 [2]

To: Arnd Bergmann <arnd@arndb.de>
To: Richard Henderson <richard.henderson@linaro.org>
To: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
To: Matt Turner <mattst88@gmail.com>
To: Vineet Gupta <vgupta@kernel.org>
To: Russell King <linux@armlinux.org.uk>
To: Guo Ren <guoren@kernel.org>
To: Huacai Chen <chenhuacai@kernel.org>
To: WANG Xuerui <kernel@xen0n.name>
To: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
To: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
To: Helge Deller <deller@gmx.de>
To: Michael Ellerman <mpe@ellerman.id.au>
To: Nicholas Piggin <npiggin@gmail.com>
To: Christophe Leroy <christophe.leroy@csgroup.eu>
To: Naveen N Rao <naveen@kernel.org>
To: Alexander Gordeev <agordeev@linux.ibm.com>
To: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
To: Heiko Carstens <hca@linux.ibm.com>
To: Vasily Gorbik <gor@linux.ibm.com>
To: Christian Borntraeger <borntraeger@linux.ibm.com>
To: Sven Schnelle <svens@linux.ibm.com>
To: Yoshinori Sato <ysato@users.sourceforge.jp>
To: Rich Felker <dalias@libc.org>
To: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
To: David S. Miller <davem@davemloft.net>
To: Andreas Larsson <andreas@gaisler.com>
To: Thomas Gleixner <tglx@linutronix.de>
To: Ingo Molnar <mingo@redhat.com>
To: Borislav Petkov <bp@alien8.de>
To: Dave Hansen <dave.hansen@linux.intel.com>
To: x86@kernel.org
To: H. Peter Anvin <hpa@zytor.com>
To: Andy Lutomirski <luto@kernel.org>
To: Peter Zijlstra <peterz@infradead.org>
To: Muchun Song <muchun.song@linux.dev>
To: Andrew Morton <akpm@linux-foundation.org>
To: Liam R. Howlett <Liam.Howlett@oracle.com>
To: Vlastimil Babka <vbabka@suse.cz>
To: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
To: Shuah Khan <shuah@kernel.org>
Cc: linux-arch@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-alpha@vger.kernel.org
Cc: linux-snps-arc@lists.infradead.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-csky@vger.kernel.org
Cc: loongarch@lists.linux.dev
Cc: linux-mips@vger.kernel.org
Cc: linux-parisc@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s390@vger.kernel.org
Cc: linux-sh@vger.kernel.org
Cc: sparclinux@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-kselftest@vger.kernel.org
Signed-off-by: Charlie Jenkins <charlie@rivosinc.com>

Changes in v2:
- Added much greater detail to cover letter
- Removed all code that touched architecture specific code and was able
  to factor this out into all generic functions, except for flags that
  needed to be added to vm_unmapped_area_info
- Made this an RFC since I have only tested it on riscv and x86
- Link to v1: https://lore.kernel.org/r/20240827-patches-below_hint_mmap-v1-0-46ff2eb9022d@rivosinc.com

---
Charlie Jenkins (4):
      mm: Add MAP_BELOW_HINT
      mm: Add hint and mmap_flags to struct vm_unmapped_area_info
      mm: Support MAP_BELOW_HINT in vm_unmapped_area()
      selftests/mm: Create MAP_BELOW_HINT test

 arch/alpha/kernel/osf_sys.c                  |  2 ++
 arch/arc/mm/mmap.c                           |  3 +++
 arch/arm/mm/mmap.c                           |  7 ++++++
 arch/csky/abiv1/mmap.c                       |  3 +++
 arch/loongarch/mm/mmap.c                     |  3 +++
 arch/mips/mm/mmap.c                          |  3 +++
 arch/parisc/kernel/sys_parisc.c              |  3 +++
 arch/powerpc/mm/book3s64/slice.c             |  7 ++++++
 arch/s390/mm/hugetlbpage.c                   |  4 ++++
 arch/s390/mm/mmap.c                          |  6 ++++++
 arch/sh/mm/mmap.c                            |  6 ++++++
 arch/sparc/kernel/sys_sparc_32.c             |  3 +++
 arch/sparc/kernel/sys_sparc_64.c             |  6 ++++++
 arch/sparc/mm/hugetlbpage.c                  |  4 ++++
 arch/x86/kernel/sys_x86_64.c                 |  6 ++++++
 arch/x86/mm/hugetlbpage.c                    |  4 ++++
 fs/hugetlbfs/inode.c                         |  4 ++++
 include/linux/mm.h                           |  2 ++
 include/uapi/asm-generic/mman-common.h       |  1 +
 mm/mmap.c                                    |  9 ++++++++
 tools/include/uapi/asm-generic/mman-common.h |  1 +
 tools/testing/selftests/mm/Makefile          |  1 +
 tools/testing/selftests/mm/map_below_hint.c  | 32 ++++++++++++++++++++++++++++
 23 files changed, 120 insertions(+)
---
base-commit: 5be63fc19fcaa4c236b307420483578a56986a37
change-id: 20240827-patches-below_hint_mmap-b13d79ae1c55

Comments

Charlie Jenkins Aug. 29, 2024, 10:16 p.m. UTC | #1
On Thu, Aug 29, 2024 at 10:54:25AM +0100, Lorenzo Stoakes wrote:
> On Thu, Aug 29, 2024 at 09:42:22AM GMT, Lorenzo Stoakes wrote:
> > On Thu, Aug 29, 2024 at 12:15:57AM GMT, Charlie Jenkins wrote:
> > > Some applications rely on placing data in free bits addresses allocated
> > > by mmap. Various architectures (eg. x86, arm64, powerpc) restrict the
> > > address returned by mmap to be less than the 48-bit address space,
> > > unless the hint address uses more than 47 bits (the 48th bit is reserved
> > > for the kernel address space).
> >
> > I'm still confused as to why, if an mmap flag is desired, and thus programs
> > are having to be heavily modified and controlled to be able to do this, why
> > you can't just do an mmap() with PROT_NONE early, around a hinted address
> > that, sits below the required limit, and then mprotect() or mmap() over it?
> >
> > Your feature is a major adjustment to mmap(), it needs to be pretty
> > significantly justified, especially if taking up a new flag.
> >
> > >
> > > The riscv architecture needs a way to similarly restrict the virtual
> > > address space. On the riscv port of OpenJDK an error is thrown if
> > > attempted to run on the 57-bit address space, called sv57 [1].  golang
> > > has a comment that sv57 support is not complete, but there are some
> > > workarounds to get it to mostly work [2].
> > >
> > > These applications work on x86 because x86 does an implicit 47-bit
> > > restriction of mmap() address that contain a hint address that is less
> > > than 48 bits.
> >
> > You mean x86 _has_ to limit to physically available bits in a canonical
> > format :) this will not be the case for 5-page table levels though...

I might be misunderstanding but I am not talking about pointer masking
or canonical addresses here. I am referring to the pattern of:

1. Getting an address from mmap()
2. Writing data into bits assumed to be unused in the address
3. Using the data stored in the address
4. Clearing the data from the address and sign extending
5. Dereferencing the now sign-extended address to conform to canonical
   addresses

I am just talking about step 1 and 2 here -- getting an address from
mmap() that only uses bits that will allow your application to not
break. How canonicalization happens is a separate conversation that
can be handled by LAM for x86, TBI for arm64, or Ssnpm for riscv.
While LAM for x86 is only capable of masking addresses to 48 or 57 bits,
Ssnpm for riscv allows an arbitrary number of bits to be masked out.
A design goal here is to be able to support all of the pointer masking
flavors, and not just x86.
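
To make the pattern concrete, here is a rough sketch of steps 2 through 5
in plain C. The 48-bit width, the 16-bit tag, and all of the names are
assumptions picked for illustration; they are not taken from OpenJDK, Go,
or any of the pointer masking extensions.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for illustration: the application wants the top 16 bits free,
 * so every address from step 1 must fit in the low 48 bits. */
#define ADDR_BITS 48
#define ADDR_MASK ((UINT64_C(1) << ADDR_BITS) - 1)

/* Step 2: store a tag in the bits assumed to be unused. */
static inline uint64_t tag_pointer(uint64_t addr, uint16_t tag)
{
	/* This only works if mmap() returned an address below 2^ADDR_BITS. */
	assert((addr & ~ADDR_MASK) == 0);
	return addr | ((uint64_t)tag << ADDR_BITS);
}

/* Step 3: read the stored data back out of the address. */
static inline uint16_t pointer_tag(uint64_t tagged)
{
	return (uint16_t)(tagged >> ADDR_BITS);
}

/* Steps 4 and 5: clear the tag, then sign-extend from bit ADDR_BITS - 1
 * so the result is a canonical address again. */
static inline uint64_t untag_pointer(uint64_t tagged)
{
	uint64_t addr = tagged & ADDR_MASK;
	uint64_t sign = UINT64_C(1) << (ADDR_BITS - 1);

	return (addr ^ sign) - sign;
}
```

The assert in tag_pointer() is exactly the assumption that breaks when
step 1 hands back an address above the expected limit.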

> >
> > >
> > > Instead of implicitly restricting the address space on riscv (or any
> > > current/future architecture), a flag would allow users to opt-in to this
> > > behavior rather than opt-out as is done on other architectures. This is
> > > desirable because it is a small class of applications that do pointer
> > > masking.
> >
> > I raised this last time and you didn't seem to address it so to be more
> > blunt:
> >
> > I don't understand why this needs to be an mmap() flag. From this it seems
> > the whole process needs allocations to be below a certain limit.

Yeah making it per-process does seem logical, as it would help with
pointer masking.

> >
> > That _could_ be achieved through a 'personality' or similar (though a
> > personality is on/off, rather than allowing configuration so maybe
> > something else would be needed).
> >
> > From what you're saying 57-bit is all you really need right? So maybe
> > ADDR_LIMIT_57BIT?

Addresses will always be limited to 57 bits on riscv and x86 (but not
necessarily on other architectures). A flag like that would have no
impact, I do not understand what you are suggesting. This patch is to
have a configurable number of bits be restricted.

If anything, a personality that was ADDR_LIMIT_48BIT would be the
closest to what I am trying to achieve. Since the issue is that
applications fail to work when the address space is greater than 48
bits.

> >
> > I don't see how you're going to actually enforce this in a process either
> > via an mmap flag, as a library might decide not to use it, so you'd need to
> > control the allocator, the thread library implementation, and everything
> > that might allocate.

It is reasonable to change the implementation to be per-process but that
is not the current proposal.

This flag was designed for applications which already directly manage
all of their addresses like OpenJDK and Go.

This flag implementation was an attempt to make this feature as
minimally invasive as possible, to reduce maintenance burden and
implementation complexity.

> >
> > Liam also raised various points about VMA particulars that I'm not sure are
> > addressed either.
> >
> > I just find it hard to believe that everything will fit together.
> >
> > I'd _really_ need to be convinced that this MAP_ flag is justified, and I"m
> > just not.
> >
> > >
> > > This flag will also allow seemless compatibility between all
> > > architectures, so applications like Go and OpenJDK that use bits in a
> > > virtual address can request the exact number of bits they need in a
> > > generic way. The flag can be checked inside of vm_unmapped_area() so
> > > that this flag does not have to be handled individually by each
> > > architecture.
> >
> > I'm still very unconvinced and feel the bar needs to be high for making
> > changes like this that carry maintainership burden.
> >

I may be naive but what is the burden here? It's two lines of code to
check MAP_BELOW_HINT and restrict the address. There are the additional
hint and mmap_flags fields in struct vm_unmapped_area_info, but those
are also trivial to implement.
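
As a rough userspace model of that check (not the code from the patch):
the struct, the flag value, and the function name below are all
hypothetical stand-ins, and whether the bound should be the hint itself or
hint plus length is an assumption made only for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag value for this model only; not a real UAPI constant. */
#define MODEL_MAP_BELOW_HINT 0x1u

/* Simplified stand-in for the search parameters; not the kernel's struct. */
struct model_unmapped_area_info {
	uint64_t low_limit;
	uint64_t high_limit;
	uint64_t length;
	uint64_t hint;
	unsigned int mmap_flags;
};

/* If the flag is set and a hint was given, lower the top of the search
 * window so that the mapping cannot end above hint + length. */
static void model_apply_below_hint(struct model_unmapped_area_info *info)
{
	if ((info->mmap_flags & MODEL_MAP_BELOW_HINT) && info->hint != 0) {
		uint64_t limit = info->hint + info->length;

		/* Ignore the clamp if hint + length wrapped around. */
		if (limit >= info->hint && limit < info->high_limit)
			info->high_limit = limit;
	}
}
```

Everything else in the search would stay as it is today; only the upper
bound changes, and only for callers that pass the flag.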

> > So for me, it's a no really as an overall concept.
> >
> > Happy to be convinced otherwise, however... (I may be missing details or
> > context that provide more justification).
> >
> 
> Some more thoughts:
> 
> * If you absolutely must keep allocations below a certain limit, you'd
>   probably need to actually associate this information with the VMA so the
>   memory can't be mremap()'d somewhere invalid (you might not control all
>   code so you can't guarantee this won't happen).
> * Keeping a map limit associated with a VMA would be horrid and keeping
>   VMAs as small as possible is a key aim, so that'd be a no go. VMA flags
>   are in limited supply also.

Yes that does seem like it would be challenging.

> * If we did implement a per-process thing, but it were arbitrary, we'd then
>   have to handle all kinds of corner cases forever (this is UAPI, can't
>   break it etc.) with crazy-low values, or determine a minimum that might
>   vary by arch...

Throwing an error if the value is determined to be "too low" seems
reasonable.

> * If we did this we'd absolutely have to implement a check in the brk()
>   implementation, which is a very very sensitive bit of code. And of
>   course, in mmap() and mremap()... and any arch-specific code that might
>   interface with this stuff (these functions are hooked).
> * A fixed address limit would make more sense, but it seems difficult to
>   know what would work for everybody, and again we'd have to deal with edge
>   cases and having a permanent maintenance burden.

A fixed value is not ideal, since a single size probably would not be
sufficient for every application. However if necessary we could fix it
to 48-bits since arm64 and x86 already do that, and that would still
allow a generic way of defining this behavior.

> * If you did have a map flag what about merging between VMAs above the
>   limit and below it? To avoid that you'd need to implement some kind of a
>   'VMA flag that has an arbitrary characteristic' or a 'limit' field,
>   adjust all the 'can VMA merge' functions and write extensive testing and
>   none of that is frankly acceptable.
> * We have some 'weird' arches that might have problem with certain virtual
>   address ranges or require arbitrary mappings at a certain address range
>   that a limit might not be able to account for.
> 
> I'm absolutely opposed to a new MAP_ flag for this, but even if you
> implemented that, it implies a lot of complexity.
> 
> It implies even more complexity if you implement something per-process
> except if it were a fixed limit.
> 
> And if you implement a fixed limit, it's hard to see that it'll be
> acceptable to everybody, and I suspect we'd still run into some possible
> weirdness.
> 
> So again, I'm struggling to see how this concept can be justified in any
> form.

The piece I am missing here is that this idea is already being used by
x86 and arm64. They implicitly force all allocations to be below the
47-bit boundary if the hint address is below 47 bits. This flag is much
less invasive because it is opt-in and will not impact any existing
code. I am not familiar enough with all of the interactions spread
throughout mm to know how these architectures have managed to ensure
that this 48-bit limit is enforced across things like mremap() as well.

Are you against the idea that there should be a standard way for
applications to consistently obtain addresses that have free bits, or are
you just against this implementation? From your statement I assume you
mean that every architecture should continue to have varying behavior
and separate implementations for supporting larger address spaces.

- Charlie
Dave Hansen Aug. 30, 2024, 3:03 p.m. UTC | #2
On 8/29/24 01:42, Lorenzo Stoakes wrote:
>> These applications work on x86 because x86 does an implicit 47-bit
>> restriction of mmap() address that contain a hint address that is less
>> than 48 bits.
> You mean x86 _has_ to limit to physically available bits in a canonical
> format 🙂 this will not be the case for 5-page table levels though...

By "physically available bits" are you referring to the bits that can be
used as a part of the virtual address?  "Physically" may not have been
the best choice of words. ;)

There's a canonical hole in 4-level paging and 5-level paging on x86.
The 5-level canonical hole is just smaller.
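
As an illustration of what "canonical" means for the two widths, a small
sketch follows. The helper name is made up for this example; 48 and 57 are
the virtual address widths for 4-level and 5-level paging.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if addr is canonical for a virtual address width of va_bits:
 * bits 63..va_bits must all be copies of bit va_bits - 1. */
static bool is_canonical(uint64_t addr, unsigned int va_bits)
{
	assert(va_bits >= 1 && va_bits <= 64);

	uint64_t sign = UINT64_C(1) << (va_bits - 1);
	uint64_t mask = (va_bits == 64) ? UINT64_MAX
					: (UINT64_C(1) << va_bits) - 1;
	uint64_t low = addr & mask;

	/* Sign-extend the low va_bits and compare with the original. */
	return ((low ^ sign) - sign) == addr;
}
```

With this helper, 0x0000800000000000 falls inside the hole for a 48-bit
width but is a valid lower-half address for a 57-bit width.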

Also, I should probably say that the >47-bit mmap() access hint was more
of a crutch than something that we wanted to make ABI forever.  We knew
that high addresses might break some apps and we hoped that the list of
things it would break would go down over time so that we could
eventually just let mmap() access the whole address space by default.

That optimism may have been misplaced.
Lorenzo Stoakes Aug. 30, 2024, 3:18 p.m. UTC | #3
On Fri, Aug 30, 2024 at 08:03:25AM GMT, Dave Hansen wrote:
> On 8/29/24 01:42, Lorenzo Stoakes wrote:
> >> These applications work on x86 because x86 does an implicit 47-bit
> >> restriction of mmap() address that contain a hint address that is less
> >> than 48 bits.
> > You mean x86 _has_ to limit to physically available bits in a canonical
> > format 🙂 this will not be the case for 5-page table levels though...
>
> By "physically available bits" are you referring to the bits that can be
> used as a part of the virtual address?  "Physically" may not have been
> the best choice of words. ;)
>
> There's a canonical hole in 4-level paging and 5-level paging on x86.
> The 5-level canonical hole is just smaller.

Yeah sorry this is what I meant!

>
> Also, I should probably say that the >47-bit mmap() access hint was more
> of a crutch than something that we wanted to make ABI forever.  We knew
> that high addresses might break some apps and we hoped that the list of
> things it would break would go down over time so that we could
> eventually just let mmap() access the whole address space by default.
>
> That optimism may have been misplaced.

Interesting, thanks. This speaks again I think to it being unwise to rely
on these things.

I do think the only workable form of this series is a fixed
personality-based mapping limit.
Kirill A. Shutemov Sept. 9, 2024, 9:46 a.m. UTC | #4
On Thu, Sep 05, 2024 at 10:26:52AM -0700, Charlie Jenkins wrote:
> On Thu, Sep 05, 2024 at 09:47:47AM +0300, Kirill A. Shutemov wrote:
> > On Thu, Aug 29, 2024 at 12:15:57AM -0700, Charlie Jenkins wrote:
> > > Some applications rely on placing data in free bits addresses allocated
> > > by mmap. Various architectures (eg. x86, arm64, powerpc) restrict the
> > > address returned by mmap to be less than the 48-bit address space,
> > > unless the hint address uses more than 47 bits (the 48th bit is reserved
> > > for the kernel address space).
> > > 
> > > The riscv architecture needs a way to similarly restrict the virtual
> > > address space. On the riscv port of OpenJDK an error is thrown if
> > > attempted to run on the 57-bit address space, called sv57 [1].  golang
> > > has a comment that sv57 support is not complete, but there are some
> > > workarounds to get it to mostly work [2].

I also saw libmozjs crashing with 57-bit address space on x86.

> > > These applications work on x86 because x86 does an implicit 47-bit
> > > restriction of mmap() address that contain a hint address that is less
> > > than 48 bits.
> > > 
> > > Instead of implicitly restricting the address space on riscv (or any
> > > current/future architecture), a flag would allow users to opt-in to this
> > > behavior rather than opt-out as is done on other architectures. This is
> > > desirable because it is a small class of applications that do pointer
> > > masking.

You reiterate the argument about "small class of applications". But it
makes no sense to me.

With full address space by default, this small class of applications is
going to be *broken* unless they handle the RISC-V case specifically.

On the other hand, if you limit VA to 128TiB by default (like many
architectures do[1]) everything would work without intervention.
And if an app needs wider address space it would get it with hint opt-in,
because it is required on x86-64 anyway. Again, no RISC-V-specific code.

I see no upside with your approach. Just worse user experience.

[1] See va_high_addr_switch test case in https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/mm/Makefile#n115
Steven Price Oct. 21, 2024, 1:22 p.m. UTC | #5
On 09/09/2024 10:46, Kirill A. Shutemov wrote:
> On Thu, Sep 05, 2024 at 10:26:52AM -0700, Charlie Jenkins wrote:
>> On Thu, Sep 05, 2024 at 09:47:47AM +0300, Kirill A. Shutemov wrote:
>>> On Thu, Aug 29, 2024 at 12:15:57AM -0700, Charlie Jenkins wrote:
>>>> Some applications rely on placing data in free bits addresses allocated
>>>> by mmap. Various architectures (eg. x86, arm64, powerpc) restrict the
>>>> address returned by mmap to be less than the 48-bit address space,
>>>> unless the hint address uses more than 47 bits (the 48th bit is reserved
>>>> for the kernel address space).
>>>>
>>>> The riscv architecture needs a way to similarly restrict the virtual
>>>> address space. On the riscv port of OpenJDK an error is thrown if
>>>> attempted to run on the 57-bit address space, called sv57 [1].  golang
>>>> has a comment that sv57 support is not complete, but there are some
>>>> workarounds to get it to mostly work [2].
> 
> I also saw libmozjs crashing with 57-bit address space on x86.
> 
>>>> These applications work on x86 because x86 does an implicit 47-bit
>>>> restriction of mmap() address that contain a hint address that is less
>>>> than 48 bits.
>>>>
>>>> Instead of implicitly restricting the address space on riscv (or any
>>>> current/future architecture), a flag would allow users to opt-in to this
>>>> behavior rather than opt-out as is done on other architectures. This is
>>>> desirable because it is a small class of applications that do pointer
>>>> masking.
> 
> You reiterate the argument about "small class of applications". But it
> makes no sense to me.

Sorry to chime in late on this - I had been considering implementing
something like MAP_BELOW_HINT and found this thread.

While the examples of applications that want to use high VA bits and get
bitten by future upgrades are not very persuasive, it's worth pointing
out that there are a variety of somewhat horrid hacks out there to work
around this feature not existing.

E.g. from my brief research into other code:

  * Box64 seems to have a custom allocator based on reading 
    /proc/self/maps to allocate a block of VA space with a low enough 
    address [1]

  * PHP has code reading /proc/self/maps - I think this is to find a 
    segment which is close enough to the text segment [2]

  * FEX-Emu mmap()s the upper 128TB of VA on Arm to avoid full 48 bit
    addresses [3][4]

  * pmdk has some funky code to find the lowest address that meets 
    certain requirements - this does look like an ASLR alternative and 
    probably couldn't directly use MAP_BELOW_HINT, although maybe this 
    suggests we need a mechanism to map without a VA-range? [5]

  * MIT-Scheme parses /proc/self/maps to find the lowest mapping within 
    a range [6]

  * LuaJIT uses an approach to 'probe' to find a suitable low address 
    for allocation [7]

The biggest benefit I see of MAP_BELOW_HINT is that it would allow a
library to get low addresses without causing any problems for the rest
of the application. The use case I'm looking at is in a library and 
therefore a personality mode wouldn't be appropriate (because I don't 
want to affect the rest of the application). Reading /proc/self/maps
is also problematic because other threads could be allocating/freeing
at the same time.
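
For reference, a probing workaround of the kind listed above can be
sketched roughly as follows. This is not the code from LuaJIT or any of
the other projects; the function name, the start address, and the step
are assumptions chosen for illustration, and it assumes a 64-bit Linux
target.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Try hints from `start` upward in `step` increments and keep the first
 * mapping that lies entirely below `limit`. Returns MAP_FAILED if no
 * probe succeeds. */
static void *mmap_below(uint64_t limit, size_t len, uint64_t start,
			uint64_t step)
{
	assert(step != 0);
	if (len == 0 || len > limit)
		return MAP_FAILED;

	for (uint64_t hint = start; hint <= limit - len; hint += step) {
		void *p = mmap((void *)(uintptr_t)hint, len,
			       PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (p == MAP_FAILED)
			return MAP_FAILED;
		if ((uint64_t)(uintptr_t)p <= limit - len)
			return p;

		/* The hint was not honored and the mapping is too high. */
		munmap(p, len);

		/* Stop before `hint += step` could overflow. */
		if (limit - len - hint < step)
			break;
	}
	return MAP_FAILED;
}
```

Each failed probe costs an mmap() and a munmap(), and the loop can still
come up empty if the low range is crowded, which is the sort of thing a
flag would let the kernel handle in a single call.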

Thanks,
Steve


[1] https://sources.debian.org/src/box64/0.3.0+dfsg-1/src/custommem.c/
[2] https://sources.debian.org/src/php8.2/8.2.24-1/ext/opcache/shared_alloc_mmap.c/#L62
[3] https://github.com/FEX-Emu/FEX/blob/main/FEXCore/Source/Utils/Allocator.cpp
[4] https://github.com/FEX-Emu/FEX/commit/df2f1ad074e5cdfb19a0bd4639b7604f777fb05c
[5] https://sources.debian.org/src/pmdk/1.13.1-1.1/src/common/mmap_posix.c/?hl=29#L29
[6] https://sources.debian.org/src/mit-scheme/12.1-3/src/microcode/ux.c/#L826
[7] https://sources.debian.org/src/luajit/2.1.0+openresty20240815-1/src/lj_alloc.c/

> With full address space by default, this small class of applications is
> going to *broken* unless they would handle RISC-V case specifically.
> 
> On other hand, if you limit VA to 128TiB by default (like many
> architectures do[1]) everything would work without intervention.
> And if an app needs wider address space it would get it with hint opt-in,
> because it is required on x86-64 anyway. Again, no RISC-V-specific code.
> 
> I see no upside with your approach. Just worse user experience.
> 
> [1] See va_high_addr_switch test case in https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/mm/Makefile#n115
>
Liam R. Howlett Oct. 21, 2024, 7:48 p.m. UTC | #6
* Steven Price <steven.price@arm.com> [241021 09:23]:
> On 09/09/2024 10:46, Kirill A. Shutemov wrote:
> > On Thu, Sep 05, 2024 at 10:26:52AM -0700, Charlie Jenkins wrote:
> >> On Thu, Sep 05, 2024 at 09:47:47AM +0300, Kirill A. Shutemov wrote:
> >>> On Thu, Aug 29, 2024 at 12:15:57AM -0700, Charlie Jenkins wrote:
> >>>> Some applications rely on placing data in free bits addresses allocated
> >>>> by mmap. Various architectures (eg. x86, arm64, powerpc) restrict the
> >>>> address returned by mmap to be less than the 48-bit address space,
> >>>> unless the hint address uses more than 47 bits (the 48th bit is reserved
> >>>> for the kernel address space).
> >>>>
> >>>> The riscv architecture needs a way to similarly restrict the virtual
> >>>> address space. On the riscv port of OpenJDK an error is thrown if
> >>>> attempted to run on the 57-bit address space, called sv57 [1].  golang
> >>>> has a comment that sv57 support is not complete, but there are some
> >>>> workarounds to get it to mostly work [2].
> > 
> > I also saw libmozjs crashing with 57-bit address space on x86.
> > 
> >>>> These applications work on x86 because x86 does an implicit 47-bit
> >>>> restriction of mmap() address that contain a hint address that is less
> >>>> than 48 bits.
> >>>>
> >>>> Instead of implicitly restricting the address space on riscv (or any
> >>>> current/future architecture), a flag would allow users to opt-in to this
> >>>> behavior rather than opt-out as is done on other architectures. This is
> >>>> desirable because it is a small class of applications that do pointer
> >>>> masking.
> > 
> > You reiterate the argument about "small class of applications". But it
> > makes no sense to me.
> 
> Sorry to chime in late on this - I had been considering implementing
> something like MAP_BELOW_HINT and found this thread.
> 
> While the examples of applications that want to use high VA bits and get
> bitten by future upgrades is not very persuasive. It's worth pointing
> out that there are a variety of somewhat horrid hacks out there to work
> around this feature not existing.
> 
> E.g. from my brief research into other code:
> 
>   * Box64 seems to have a custom allocator based on reading 
>     /proc/self/maps to allocate a block of VA space with a low enough 
>     address [1]
> 
>   * PHP has code reading /proc/self/maps - I think this is to find a 
>     segment which is close enough to the text segment [2]
> 
>   * FEX-Emu mmap()s the upper 128TB of VA on Arm to avoid full 48 bit
>     addresses [3][4]

Can't the limited number of applications that need to restrict the upper
bound use an LD_PRELOAD compatible library to do this?

> 
>   * pmdk has some funky code to find the lowest address that meets 
>     certain requirements - this does look like an ALSR alternative and 
>     probably couldn't directly use MAP_BELOW_HINT, although maybe this 
>     suggests we need a mechanism to map without a VA-range? [5]
> 
>   * MIT-Scheme parses /proc/self/maps to find the lowest mapping within 
>     a range [6]
> 
>   * LuaJIT uses an approach to 'probe' to find a suitable low address 
>     for allocation [7]
> 

Although I did not take a deep dive into each example above, some very
odd things are being done, and we will never cover all the use cases
with an exact API match.  What we have today can be made to work for
these users as they have figured out ways to do it.

Are they pretty? no.  Are they common? no.  I'm not sure it's worth
plumbing in new MM code for these users.

> The biggest benefit I see of MAP_BELOW_HINT is that it would allow a
> library to get low addresses without causing any problems for the rest
> of the application. The use case I'm looking at is in a library and 
> therefore a personality mode wouldn't be appropriate (because I don't 
> want to affect the rest of the application). Reading /proc/self/maps
> is also problematic because other threads could be allocating/freeing
> at the same time.

As long as you don't exhaust the lower limit you are trying to allocate
within - which is exactly the issue riscv is hitting.

I understand that you are providing examples to prove that this is
needed, but I feel like you are better demonstrating the flexibility
exists to implement solutions in different ways using today's API.

I think it would be best to use the existing methods and work around the
issue that was created in riscv while future changes could mirror amd64
and arm64.

...
> 
> 
> [1] https://sources.debian.org/src/box64/0.3.0+dfsg-1/src/custommem.c/
> [2] https://sources.debian.org/src/php8.2/8.2.24-1/ext/opcache/shared_alloc_mmap.c/#L62
> [3] https://github.com/FEX-Emu/FEX/blob/main/FEXCore/Source/Utils/Allocator.cpp
> [4] https://github.com/FEX-Emu/FEX/commit/df2f1ad074e5cdfb19a0bd4639b7604f777fb05c
> [5] https://sources.debian.org/src/pmdk/1.13.1-1.1/src/common/mmap_posix.c/?hl=29#L29
> [6] https://sources.debian.org/src/mit-scheme/12.1-3/src/microcode/ux.c/#L826
> [7] https://sources.debian.org/src/luajit/2.1.0+openresty20240815-1/src/lj_alloc.c/
> 
...

Thanks,
Liam
Steven Price Oct. 23, 2024, 9:31 a.m. UTC | #7
Hi Liam,

On 21/10/2024 20:48, Liam R. Howlett wrote:
> * Steven Price <steven.price@arm.com> [241021 09:23]:
>> On 09/09/2024 10:46, Kirill A. Shutemov wrote:
>>> On Thu, Sep 05, 2024 at 10:26:52AM -0700, Charlie Jenkins wrote:
>>>> On Thu, Sep 05, 2024 at 09:47:47AM +0300, Kirill A. Shutemov wrote:
>>>>> On Thu, Aug 29, 2024 at 12:15:57AM -0700, Charlie Jenkins wrote:
>>>>>> Some applications rely on placing data in free bits addresses allocated
>>>>>> by mmap. Various architectures (eg. x86, arm64, powerpc) restrict the
>>>>>> address returned by mmap to be less than the 48-bit address space,
>>>>>> unless the hint address uses more than 47 bits (the 48th bit is reserved
>>>>>> for the kernel address space).
>>>>>>
>>>>>> The riscv architecture needs a way to similarly restrict the virtual
>>>>>> address space. On the riscv port of OpenJDK an error is thrown if
>>>>>> attempted to run on the 57-bit address space, called sv57 [1].  golang
>>>>>> has a comment that sv57 support is not complete, but there are some
>>>>>> workarounds to get it to mostly work [2].
>>>
>>> I also saw libmozjs crashing with 57-bit address space on x86.
>>>
>>>>>> These applications work on x86 because x86 does an implicit 47-bit
>>>>>> restriction of mmap() address that contain a hint address that is less
>>>>>> than 48 bits.
>>>>>>
>>>>>> Instead of implicitly restricting the address space on riscv (or any
>>>>>> current/future architecture), a flag would allow users to opt-in to this
>>>>>> behavior rather than opt-out as is done on other architectures. This is
>>>>>> desirable because it is a small class of applications that do pointer
>>>>>> masking.
>>>
>>> You reiterate the argument about "small class of applications". But it
>>> makes no sense to me.
>>
>> Sorry to chime in late on this - I had been considering implementing
>> something like MAP_BELOW_HINT and found this thread.
>>
>> While the examples of applications that want to use high VA bits and get
>> bitten by future upgrades are not very persuasive, it's worth pointing
>> out that there are a variety of somewhat horrid hacks out there to work
>> around this feature not existing.
>>
>> E.g. from my brief research into other code:
>>
>>   * Box64 seems to have a custom allocator based on reading 
>>     /proc/self/maps to allocate a block of VA space with a low enough 
>>     address [1]
>>
>>   * PHP has code reading /proc/self/maps - I think this is to find a 
>>     segment which is close enough to the text segment [2]
>>
>>   * FEX-Emu mmap()s the upper 128TB of VA on Arm to avoid full 48 bit
>>     addresses [3][4]
> 
> Can't the limited number of applications that need to restrict the upper
> bound use an LD_PRELOAD compatible library to do this?

I'm not entirely sure what point you are making here. Yes an LD_PRELOAD
approach could be used instead of a personality type as a 'hack' to
preallocate the upper address space. The obvious disadvantage is that
you can't (easily) layer LD_PRELOAD so it won't work in the general case.

>>
>>   * pmdk has some funky code to find the lowest address that meets 
>>     certain requirements - this does look like an ASLR alternative and
>>     probably couldn't directly use MAP_BELOW_HINT, although maybe this 
>>     suggests we need a mechanism to map without a VA-range? [5]
>>
>>   * MIT-Scheme parses /proc/self/maps to find the lowest mapping within 
>>     a range [6]
>>
>>   * LuaJIT uses an approach to 'probe' to find a suitable low address 
>>     for allocation [7]
>>
> 
> Although I did not take a deep dive into each example above, there are
> some very odd things being done, we will never cover all the use cases
> with an exact API match.  What we have today can be made to work for
> these users as they have figured ways to do it.
> 
> Are they pretty? no.  Are they common? no.  I'm not sure it's worth
> plumbing in new MM code for these users.

My issue with the existing 'solutions' is that they all seem to be fragile:

 * Using /proc/self/maps is inherently racy if there could be any other
code running in the process at the same time.

 * Attempting to map the upper part of the address space only works if
done early enough - once an allocation arrives there, there's very
little you can robustly do (because the stray allocation might be freed).

 * LuaJIT's probing mechanism is probably robust, but it's inefficient -
LuaJIT has a fallback of linear probing, followed by no hint (ASLR),
followed by pseudo-random probing. I don't know the history of the code
but it looks like it's probably been tweaked to try to avoid performance
issues.

>> The biggest benefit I see of MAP_BELOW_HINT is that it would allow a
>> library to get low addresses without causing any problems for the rest
>> of the application. The use case I'm looking at is in a library and 
>> therefore a personality mode wouldn't be appropriate (because I don't 
>> want to affect the rest of the application). Reading /proc/self/maps
>> is also problematic because other threads could be allocating/freeing
>> at the same time.
> 
> As long as you don't exhaust the lower limit you are trying to allocate
> within - which is exactly the issue riscv is hitting.

Obviously if you actually exhaust the lower limit then any
MAP_BELOW_HINT API would also fail - there's really not much that can be
done in that case.

> I understand that you are providing examples to prove that this is
> needed, but I feel like you are better demonstrating the flexibility
> exists to implement solutions in different ways using today's API.

My intention is to show that today's API doesn't provide a robust way of
doing this. Although I'm quite happy if you can point me at a robust way
with the current API. As I mentioned my goal is to be able to map memory
in a (multithreaded) library with a (ideally configurable) number of VA
bits. I don't particularly want to restrict the whole process, just
specific allocations.

I had typed up a series similar to this one as a MAP_BELOW flag would
fit my use-case well.

> I think it would be best to use the existing methods and work around the
> issue that was created in riscv while future changes could mirror amd64
> and arm64.

The riscv issue is a different issue to the one I'm trying to solve. I
agree MAP_BELOW_HINT isn't a great fix for that because we already have
differences between amd64 and arm64 and obviously no software currently
out there uses this new flag.

However, if we had introduced this flag in the past (e.g. if MAP_32BIT
had been implemented more generically, across architectures and with a
hint value, like this new flag) then we probably wouldn't be in this
situation. Applications that want to restrict the VA space would be able
to opt-in and be portable across architectures.

Another potential option is a mmap3() which actually allows the caller
to place constraints on the VA space (e.g. minimum, maximum and
alignment). There's plenty of code out there that has to over-allocate
and munmap() the unneeded part for alignment reasons. But I don't have a
specific need for that, and I'm guessing you wouldn't be in favour.
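
For reference, the over-allocate-and-munmap() pattern mentioned above
usually looks something like the sketch below. This is only an
illustration: map_aligned() is a hypothetical helper name, and it assumes
Linux, anonymous private mappings, and that both len and align are
multiples of the page size (with align a power of two).

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: return len bytes aligned to align, or NULL.
 * Assumes len and align are page-size multiples and align is a power of two.
 */
void *map_aligned(size_t len, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);

    /* Over-allocate so an aligned start is guaranteed to fit. */
    size_t span = len + align;
    if (span < len)
        return NULL; /* size overflow */

    uint8_t *raw = mmap(NULL, span, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED)
        return NULL;

    /* Round the start up to the requested alignment. */
    uintptr_t start = ((uintptr_t)raw + align - 1) & ~(uintptr_t)(align - 1);
    size_t head = start - (uintptr_t)raw;
    size_t tail = span - head - len;

    /* Give back the unneeded parts before and after the aligned block. */
    if (head)
        munmap(raw, head);
    if (tail)
        munmap((uint8_t *)start + len, tail);

    return (void *)start;
}
```

The extra mmap()/munmap() traffic and the temporary over-reservation are
the cost that a caller-supplied alignment constraint would avoid.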

Thanks,
Steve

> ...
>>
>>
>> [1] https://sources.debian.org/src/box64/0.3.0+dfsg-1/src/custommem.c/
>> [2] https://sources.debian.org/src/php8.2/8.2.24-1/ext/opcache/shared_alloc_mmap.c/#L62
>> [3] https://github.com/FEX-Emu/FEX/blob/main/FEXCore/Source/Utils/Allocator.cpp
>> [4] https://github.com/FEX-Emu/FEX/commit/df2f1ad074e5cdfb19a0bd4639b7604f777fb05c
>> [5] https://sources.debian.org/src/pmdk/1.13.1-1.1/src/common/mmap_posix.c/?hl=29#L29
>> [6] https://sources.debian.org/src/mit-scheme/12.1-3/src/microcode/ux.c/#L826
>> [7] https://sources.debian.org/src/luajit/2.1.0+openresty20240815-1/src/lj_alloc.c/
>>
> ...
> 
> Thanks,
> Liam
Liam R. Howlett Oct. 23, 2024, 6:10 p.m. UTC | #8
* Steven Price <steven.price@arm.com> [241023 05:31]:
> >>   * Box64 seems to have a custom allocator based on reading 
> >>     /proc/self/maps to allocate a block of VA space with a low enough 
> >>     address [1]
> >>
> >>   * PHP has code reading /proc/self/maps - I think this is to find a 
> >>     segment which is close enough to the text segment [2]
> >>
> >>   * FEX-Emu mmap()s the upper 128TB of VA on Arm to avoid full 48 bit
> >>     addresses [3][4]
> > 
> > Can't the limited number of applications that need to restrict the upper
> > bound use an LD_PRELOAD compatible library to do this?
> 
> I'm not entirely sure what point you are making here. Yes an LD_PRELOAD
> approach could be used instead of a personality type as a 'hack' to
> preallocate the upper address space. The obvious disadvantage is that
> you can't (easily) layer LD_PRELOAD so it won't work in the general case.

My point is that riscv could work around the limited number of
applications that require this.  It's not really viable for you.

> 
> >>
> >>   * pmdk has some funky code to find the lowest address that meets 
> >>     certain requirements - this does look like an ASLR alternative and
> >>     probably couldn't directly use MAP_BELOW_HINT, although maybe this 
> >>     suggests we need a mechanism to map without a VA-range? [5]
> >>
> >>   * MIT-Scheme parses /proc/self/maps to find the lowest mapping within 
> >>     a range [6]
> >>
> >>   * LuaJIT uses an approach to 'probe' to find a suitable low address 
> >>     for allocation [7]
> >>
> > 
> > Although I did not take a deep dive into each example above, there are
> > some very odd things being done, we will never cover all the use cases
> > with an exact API match.  What we have today can be made to work for
> > these users as they have figured ways to do it.
> > 
> > Are they pretty? no.  Are they common? no.  I'm not sure it's worth
> > plumbing in new MM code for these users.
> 
> My issue with the existing 'solutions' is that they all seem to be fragile:
> 
>  * Using /proc/self/maps is inherently racy if there could be any other
> code running in the process at the same time.

Yes, it is not thread safe.  Parsing text is also undesirable.

> 
>  * Attempting to map the upper part of the address space only works if
> done early enough - once an allocation arrives there, there's very
> little you can robustly do (because the stray allocation might be freed).
> 
>  * LuaJIT's probing mechanism is probably robust, but it's inefficient -
> LuaJIT has a fallback of linear probing, followed by no hint (ASLR),
> followed by pseudo-random probing. I don't know the history of the code
> but it looks like it's probably been tweaked to try to avoid performance
> issues.
> 
> >> The biggest benefit I see of MAP_BELOW_HINT is that it would allow a
> >> library to get low addresses without causing any problems for the rest
> >> of the application. The use case I'm looking at is in a library and 
> >> therefore a personality mode wouldn't be appropriate (because I don't 
> >> want to affect the rest of the application). Reading /proc/self/maps
> >> is also problematic because other threads could be allocating/freeing
> >> at the same time.
> > 
> > As long as you don't exhaust the lower limit you are trying to allocate
> > within - which is exactly the issue riscv is hitting.
> 
> Obviously if you actually exhaust the lower limit then any
> MAP_BELOW_HINT API would also fail - there's really not much that can be
> done in that case.

Today we reverse the search, so you end up in the higher address
(bottom-up vs top-down) - although the direction is arch dependent.

If the allocation is too high/low then you could detect, free, and
handle the failure.
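
A rough sketch of that detect-and-free approach with the current API,
for illustration only. map_below() is a hypothetical helper name, not an
existing interface; the sketch assumes Linux and an anonymous private
mapping, and it makes a single attempt with no retry or probing policy.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: map len bytes that end at or below limit,
 * or return NULL. The hint is only advisory, so the result is checked.
 */
void *map_below(uintptr_t limit, size_t len)
{
    assert(len > 0 && len <= limit);

    /* Page-align a hint that would place the mapping just below limit. */
    uintptr_t page = (uintptr_t)sysconf(_SC_PAGESIZE);
    uintptr_t hint = (limit - len) & ~(page - 1);

    void *p = mmap((void *)hint, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;

    /* Detect: the kernel was free to pick an address above the limit. */
    if ((uintptr_t)p > limit - len) {
        munmap(p, len); /* free, and let the caller handle the failure */
        return NULL;
    }
    return p;
}
```

What the caller does after a NULL return (retry with a lower hint, probe,
or give up) is left open here.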

> 
> > I understand that you are providing examples to prove that this is
> > needed, but I feel like you are better demonstrating the flexibility
> > exists to implement solutions in different ways using today's API.
> 
> My intention is to show that today's API doesn't provide a robust way of
> doing this. Although I'm quite happy if you can point me at a robust way
> with the current API. As I mentioned my goal is to be able to map memory
> in a (multithreaded) library with a (ideally configurable) number of VA
> bits. I don't particularly want to restrict the whole process, just
> specific allocations.

If you don't need to restrict everything, won't the hint work for your
usecase?  I must be missing something from your requirements.

> 
> I had typed up a series similar to this one as a MAP_BELOW flag would
> fit my use-case well.
> 
> > I think it would be best to use the existing methods and work around the
> > issue that was created in riscv while future changes could mirror amd64
> > and arm64.
> 
> The riscv issue is a different issue to the one I'm trying to solve. I
> agree MAP_BELOW_HINT isn't a great fix for that because we already have
> differences between amd64 and arm64 and obviously no software currently
> out there uses this new flag.
> 
> However, if we had introduced this flag in the past (e.g. if MAP_32BIT
> had been implemented more generically, across architectures and with a
> hint value, like this new flag) then we probably wouldn't be in this
> situation. Applications that want to restrict the VA space would be able
> to opt-in and be portable across architectures.

I don't think that's true.  Some of the applications want all of the
allocations below a certain threshold and by the time they are adding
flags to allocations, it's too late.  What you are looking for is a
counterpart to mmap_min_addr, but for higher addresses?  This would have
to be set before any of the allocations occur for a specific binary (ie:
existing libraries need to be below that threshold too), I think?

> 
> Another potential option is a mmap3() which actually allows the caller
> to place constraints on the VA space (e.g. minimum, maximum and
> alignment). There's plenty of code out there that has to over-allocate
> and munmap() the unneeded part for alignment reasons. But I don't have a
> specific need for that, and I'm guessing you wouldn't be in favour.

You'd probably want control of the direction of the search too.

I think mmap3() would be difficult to have accepted as well.

...

Thanks,
Liam