
[v5sub2,1/8] arm64: add support for module PLTs

Message ID 1454332178-4414-2-git-send-email-ard.biesheuvel@linaro.org
State Accepted
Commit fd045f6cd98ec4953147b318418bd45e441e52a3

Commit Message

Ard Biesheuvel Feb. 1, 2016, 1:09 p.m. UTC
This adds support for emitting PLTs at module load time for relative
branches that are out of range. This is a prerequisite for KASLR, which
may place the kernel and the modules anywhere in the vmalloc area,
making it more likely that branch target offsets exceed the maximum
range of +/- 128 MB.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>

---

In this version, I removed the distinction between relocations against
.init executable sections and ordinary executable sections. The reason
is that it is hardly worth the trouble, given that .init.text usually
does not contain that many far branches, and this version now only
reserves PLT entry space for jump and call relocations against undefined
symbols (since symbols defined in the same module can be assumed to be
within +/- 128 MB)

For example, the mac80211.ko module (which is fairly sizable at ~400 KB)
built with -mcmodel=large gives the following relocation counts:

                    relocs    branches   unique     !local
  .text              3925       3347       518        219
  .init.text           11          8         7          1
  .exit.text            4          4         4          1
  .text.unlikely       81         67        36         17

('unique' means branches to unique type/symbol/addend combos, of which
!local is the subset referring to undefined symbols)

IOW, we are only emitting a single PLT entry for the .init sections, and
we are better off just adding it to the core PLT section instead.
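For reference, a veneer in this scheme amounts to loading the target
address into x16 with a movn/movk/movk sequence and branching through it.
A minimal sketch (illustrative only -- the helper and the instruction
encodings below are not lifted from the patch, whose real implementation
is in arch/arm64/kernel/module-plts.c; it assumes a kernel VA whose top
16 bits are all ones):

#include <stdint.h>

struct plt_entry {
	uint32_t mov0;	/* movn	x16, #0x....			*/
	uint32_t mov1;	/* movk	x16, #0x...., lsl #16		*/
	uint32_t mov2;	/* movk	x16, #0x...., lsl #32		*/
	uint32_t br;	/* br	x16				*/
};

/*
 * Build a veneer that loads 'dst' into x16 and branches to it. MOVN
 * writes the bitwise NOT of its immediate, so handing it the inverted
 * low 16 bits also sets bits [63:48] to all ones, which is what kernel
 * virtual addresses in the vmalloc area look like (an assumption of
 * this sketch).
 */
static struct plt_entry make_plt_entry(uint64_t dst)
{
	return (struct plt_entry){
		.mov0 = 0x92800010 | (uint32_t)((~dst        & 0xffff) << 5),
		.mov1 = 0xf2a00010 | (uint32_t)(((dst >> 16) & 0xffff) << 5),
		.mov2 = 0xf2c00010 | (uint32_t)(((dst >> 32) & 0xffff) << 5),
		.br   = 0xd61f0200,	/* br x16 */
	};
}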
---
 arch/arm64/Kconfig              |   9 +
 arch/arm64/Makefile             |   6 +-
 arch/arm64/include/asm/module.h |  11 ++
 arch/arm64/kernel/Makefile      |   1 +
 arch/arm64/kernel/module-plts.c | 201 ++++++++++++++++++++
 arch/arm64/kernel/module.c      |  12 ++
 arch/arm64/kernel/module.lds    |   3 +
 7 files changed, 242 insertions(+), 1 deletion(-)

-- 
2.5.0


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

Comments

Catalin Marinas Feb. 4, 2016, 3:13 p.m. UTC | #1
On Mon, Feb 01, 2016 at 02:09:31PM +0100, Ard Biesheuvel wrote:
> This adds support for emitting PLTs at module load time for relative
> branches that are out of range. This is a prerequisite for KASLR, which
> may place the kernel and the modules anywhere in the vmalloc area,
> making it more likely that branch target offsets exceed the maximum
> range of +/- 128 MB.

Any downside to trying to keep the kernel+modules coupled together so
that we avoid the PLT?

-- 
Catalin

Ard Biesheuvel Feb. 4, 2016, 3:31 p.m. UTC | #2
On 4 February 2016 at 16:13, Catalin Marinas <catalin.marinas@arm.com> wrote:
> On Mon, Feb 01, 2016 at 02:09:31PM +0100, Ard Biesheuvel wrote:
>> This adds support for emitting PLTs at module load time for relative
>> branches that are out of range. This is a prerequisite for KASLR, which
>> may place the kernel and the modules anywhere in the vmalloc area,
>> making it more likely that branch target offsets exceed the maximum
>> range of +/- 128 MB.
>
> Any downside to trying to keep the kernel+modules coupled together so
> that we avoid the PLT?

First of all, note that it is unlikely that the PLTs are ever required
in practice, considering that either
a) the kernel is loaded at the default location right at the base of
the vmalloc range, and in this case, the module space is reserved for
modules only, or
b) the kernel is loaded at some random offset in the 240+ GB vmalloc
space, and it is unlikely that all VMA space around the kernel happens
to be given out to non-randomized vmalloc/ioremap allocations

So while this patch was a requirement at first (since in the first
version of the series, the modules always remained below the vmalloc
area so they would always be out of range when KASLR was enabled), in
the current series, it is simply a last resort (although still
required) to make sure that modules can be loaded far away from the
kernel in the unlikely event that all VMA space in close proximity is
given to another user.

Reserving a dedicated module region around the kernel while it is
loaded at a random offset is not straightforward, since it affects
all other users of the vmalloc area. So generic vmalloc/vmap/ioremap
code would need to be updated to disregard the area around the kernel,
since simply reserving the VMA area for the modules would make it
inaccessible to module_alloc() as well (unless we reserve the whole
region and manage the module allocations separately)
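To make that "last resort" concrete, the fallback being relied on here
looks roughly like the following (a hedged sketch only: module_alloc_base
stands in for wherever the 128 MB module window ends up, and this is not
necessarily the exact code in this series):

#include <linux/moduleloader.h>
#include <linux/sizes.h>
#include <linux/vmalloc.h>

extern u64 module_alloc_base;	/* assumed to be set up by the KASLR code */

void *module_alloc(unsigned long size)
{
	void *p;

	/* first try the 128 MB window within branching range of the kernel */
	p = __vmalloc_node_range(size, MODULE_ALIGN, module_alloc_base,
				 module_alloc_base + SZ_128M, GFP_KERNEL,
				 PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
				 __builtin_return_address(0));

	/* last resort: anywhere in the vmalloc space, far branches via PLTs */
	if (!p && IS_ENABLED(CONFIG_ARM64_MODULE_PLTS))
		p = __vmalloc_node_range(size, MODULE_ALIGN, VMALLOC_START,
					 VMALLOC_END, GFP_KERNEL,
					 PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
					 __builtin_return_address(0));

	return p;
}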

So the bottom line is that it may be possible, but it is unlikely to
be worth the effort imo

-- 
Ard.

Catalin Marinas Feb. 5, 2016, 3:42 p.m. UTC | #3
On Thu, Feb 04, 2016 at 04:31:59PM +0100, Ard Biesheuvel wrote:
> On 4 February 2016 at 16:13, Catalin Marinas <catalin.marinas@arm.com> wrote:
> > On Mon, Feb 01, 2016 at 02:09:31PM +0100, Ard Biesheuvel wrote:
> >> This adds support for emitting PLTs at module load time for relative
> >> branches that are out of range. [...]
> >
> > Any downside to trying to keep the kernel+modules coupled together so
> > that we avoid the PLT?
>
> First of all, note that it is unlikely that the PLTs are ever required
> in practice, considering that either
> a) the kernel is loaded at the default location right at the base of
> the vmalloc range, and in this case, the module space is reserved for
> modules only, or
> b) the kernel is loaded at some random offset in the 240+ GB vmalloc
> space, and it is unlikely that all VMA space around the kernel happens
> to be given out to non-randomized vmalloc/ioremap allocations

My worry is that we merge some code that's rarely tested.

> So the bottom line is that it may be possible, but it is unlikely to
> be worth the effort imo

The alternative of having our own allocator for modules isn't that
appealing either, so let's stick with the PLTs.

-- 
Catalin

Ard Biesheuvel Feb. 5, 2016, 3:53 p.m. UTC | #4
On 5 February 2016 at 16:42, Catalin Marinas <catalin.marinas@arm.com> wrote:
> On Thu, Feb 04, 2016 at 04:31:59PM +0100, Ard Biesheuvel wrote:
>> On 4 February 2016 at 16:13, Catalin Marinas <catalin.marinas@arm.com> wrote:
>> > Any downside to trying to keep the kernel+modules coupled together so
>> > that we avoid the PLT?
>>
>> First of all, note that it is unlikely that the PLTs are ever required
>> in practice, considering that either [...]
>
> My worry is that we merge some code that's rarely tested.

I understand. But unfortunately, having corner cases that are unlikely
but not impossible comes with the territory of randomization.

Alternatively, we could take the performance hit if KASLR is in effect
and allocate each module completely randomly as well. This way, the
code is always exercised (for now), and we can always backpedal later
if the performance is measurably worse.

>> So the bottom line is that it may be possible, but it is unlikely to
>> be worth the effort imo
>
> The alternative of having our own allocator for modules isn't that
> appealing either, so let's stick with the PLTs.

Catalin Marinas Feb. 5, 2016, 4 p.m. UTC | #5
On Fri, Feb 05, 2016 at 04:53:10PM +0100, Ard Biesheuvel wrote:
> On 5 February 2016 at 16:42, Catalin Marinas <catalin.marinas@arm.com> wrote:
> > On Thu, Feb 04, 2016 at 04:31:59PM +0100, Ard Biesheuvel wrote:
> >> [...]
> >
> > My worry is that we merge some code that's rarely tested.
>
> I understand. But unfortunately, having corner cases that are unlikely
> but not impossible comes with the territory of randomization.
>
> Alternatively, we could take the performance hit if KASLR is in effect
> and allocate each module completely randomly as well. This way, the
> code is always exercised (for now), and we can always backpedal later
> if the performance is measurably worse.

I'm fine with this. You can post it as a separate patch that we can
easily revert/modify later (like turning it into a config option).

-- 
Catalin

Ard Biesheuvel Feb. 5, 2016, 4:20 p.m. UTC | #6
On 5 February 2016 at 17:00, Catalin Marinas <catalin.marinas@arm.com> wrote:
> On Fri, Feb 05, 2016 at 04:53:10PM +0100, Ard Biesheuvel wrote:
>> [...]
>> Alternatively, we could take the performance hit if KASLR is in effect
>> and allocate each module completely randomly as well. This way, the
>> code is always exercised (for now), and we can always backpedal later
>> if the performance is measurably worse.
>
> I'm fine with this. You can post it as a separate patch that we can
> easily revert/modify later (like turning it into a config option).

OK, I will hack something up

Catalin Marinas Feb. 5, 2016, 4:46 p.m. UTC | #7
On Fri, Feb 05, 2016 at 05:20:14PM +0100, Ard Biesheuvel wrote:
> On 5 February 2016 at 17:00, Catalin Marinas <catalin.marinas@arm.com> wrote:
> > [...]
> > I'm fine with this. You can post it as a separate patch that we can
> > easily revert/modify later (like turning it into a config option).
>
> OK, I will hack something up

If it's simpler, you can just add a config option but defaulting to the
full vmalloc space for modules.

-- 
Catalin

Ard Biesheuvel Feb. 5, 2016, 4:54 p.m. UTC | #8
On 5 February 2016 at 17:46, Catalin Marinas <catalin.marinas@arm.com> wrote:
> On Fri, Feb 05, 2016 at 05:20:14PM +0100, Ard Biesheuvel wrote:
>> [...]
>> OK, I will hack something up
>
> If it's simpler, you can just add a config option but defaulting to the
> full vmalloc space for modules.

What would be simplest is to randomize the 128 MB module region as
a whole, and either put it close to the kernel (which is what I am
doing now), or put it at a random offset inside the vmalloc space, in
which case all branches will be resolved via PLTs. My suggestion to
randomize each module_alloc() call separately is actually not that
straightforward.

So what I propose now is to keep a single module_load_offset that
randomizes the base of the region, and add a default-n config option
that shrinks the interval it is chosen from so that PLTs are usually
not needed.
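Roughly like this, in other words -- an illustrative sketch only, where
the config symbol name and the seed handling are assumptions rather than
code from this series (the option, when enabled, shrinks the interval so
the window stays within branching range and PLTs are rarely exercised;
the default leaves the full vmalloc space in play):

/*
 * Pick a randomized base for the 128 MB module window, once at boot.
 */
static u64 __init pick_module_alloc_base(u64 kernel_start, u16 seed)
{
	u64 base, range;

	if (IS_ENABLED(CONFIG_ARM64_MODULE_REGION_NEAR_KERNEL)) {
		/* shrunk interval: stay within +/- 128 MB of the kernel */
		base = kernel_start - SZ_64M;
		range = SZ_64M;
	} else {
		/* default: anywhere in vmalloc, far branches go via PLTs */
		base = VMALLOC_START;
		range = VMALLOC_END - VMALLOC_START - SZ_128M;
	}

	return PAGE_ALIGN(base + ((range * seed) >> 16));
}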

-- 
Ard.

Catalin Marinas Feb. 5, 2016, 5:21 p.m. UTC | #9
On Fri, Feb 05, 2016 at 05:54:46PM +0100, Ard Biesheuvel wrote:
> On 5 February 2016 at 17:46, Catalin Marinas <catalin.marinas@arm.com> wrote:
> > [...]
> > If it's simpler, you can just add a config option but defaulting to the
> > full vmalloc space for modules.
>
> What would be simplest is to randomize the 128 MB module region as
> a whole, and either put it close to the kernel (which is what I am
> doing now), or put it at a random offset inside the vmalloc space, in
> which case all branches will be resolved via PLTs. My suggestion to
> randomize each module_alloc() call separately is actually not that
> straightforward.
>
> So what I propose now is to keep a single module_load_offset that
> randomizes the base of the region, and add a default-n config option
> that shrinks the interval it is chosen from so that PLTs are usually
> not needed.

Sounds fine.

-- 
Catalin

Will Deacon Feb. 25, 2016, 4:07 p.m. UTC | #10
Hi Ard,

On Mon, Feb 01, 2016 at 02:09:31PM +0100, Ard Biesheuvel wrote:
> This adds support for emitting PLTs at module load time for relative
> branches that are out of range. This is a prerequisite for KASLR, which
> may place the kernel and the modules anywhere in the vmalloc area,
> making it more likely that branch target offsets exceed the maximum
> range of +/- 128 MB.

[...]

> +struct plt_entry {
> +	/*
> +	 * A program that conforms to the AArch64 Procedure Call Standard
> +	 * (AAPCS64) must assume that a veneer that alters IP0 (x16) and/or
> +	 * IP1 (x17) may be inserted at any branch instruction that is
> +	 * exposed to a relocation that supports long branches. Since that
> +	 * is exactly what we are dealing with here, we are free to use x16
> +	 * as a scratch register in the PLT veneers.
> +	 */
> +	__le32	mov0;	/* movn	x16, #0x....			*/
> +	__le32	mov1;	/* movk	x16, #0x...., lsl #16		*/
> +	__le32	mov2;	/* movk	x16, #0x...., lsl #32		*/
> +	__le32	br;	/* br	x16				*/
> +};

I'm worried about this code when CONFIG_ARM64_LSE_ATOMICS=y, but we don't
detect them on the CPU at runtime. In this case, all atomic operations
are moved out-of-line and called using a bl instruction from inline asm.

The out-of-line code is compiled with magic GCC options to force the
explicit save/restore of all used registers (see arch/arm64/lib/Makefile),
otherwise we'd have to clutter the inline asm with constraints that
wouldn't be needed had we managed to patch the bl with an LSE atomic
instruction.

If you're emitting a PLT, couldn't we end up with silent corruption of
x16 for modules using out-of-line atomics like this?

Will

Ard Biesheuvel Feb. 25, 2016, 4:12 p.m. UTC | #11
On 25 February 2016 at 17:07, Will Deacon <will.deacon@arm.com> wrote:
> Hi Ard,
>
> On Mon, Feb 01, 2016 at 02:09:31PM +0100, Ard Biesheuvel wrote:
>> [...]
>
> I'm worried about this code when CONFIG_ARM64_LSE_ATOMICS=y, but we don't
> detect them on the CPU at runtime. In this case, all atomic operations
> are moved out-of-line and called using a bl instruction from inline asm.
>
> The out-of-line code is compiled with magic GCC options

Which options are those exactly?

> to force the
> explicit save/restore of all used registers (see arch/arm64/lib/Makefile),
> otherwise we'd have to clutter the inline asm with constraints that
> wouldn't be needed had we managed to patch the bl with an LSE atomic
> instruction.
>
> If you're emitting a PLT, couldn't we end up with silent corruption of
> x16 for modules using out-of-line atomics like this?

If you violate the AAPCS64 ABI, then obviously the claim above does not hold.

Ard Biesheuvel Feb. 25, 2016, 4:13 p.m. UTC | #12
On 25 February 2016 at 17:12, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> On 25 February 2016 at 17:07, Will Deacon <will.deacon@arm.com> wrote:
>> [...]
>> If you're emitting a PLT, couldn't we end up with silent corruption of
>> x16 for modules using out-of-line atomics like this?
>
> If you violate the AAPCS64 ABI, then obviously the claim above does not hold.

IOW, yes, x16 will be corrupted unexpectedly if you are expecting
it to be preserved across a bl or b instruction that is exposed to a
jump/call relocation

Will Deacon Feb. 25, 2016, 4:26 p.m. UTC | #13
On Thu, Feb 25, 2016 at 05:12:01PM +0100, Ard Biesheuvel wrote:
> On 25 February 2016 at 17:07, Will Deacon <will.deacon@arm.com> wrote:
> > [...]
> > The out-of-line code is compiled with magic GCC options
>
> Which options are those exactly?

# Tell the compiler to treat all general purpose registers as
# callee-saved, which allows for efficient runtime patching of the bl
# instruction in the caller with an atomic instruction when supported by
# the CPU. Result and argument registers are handled correctly, based on
# the function prototype.
lib-$(CONFIG_ARM64_LSE_ATOMICS) += atomic_ll_sc.o
CFLAGS_atomic_ll_sc.o	:= -fcall-used-x0 -ffixed-x1 -ffixed-x2		\
		   -ffixed-x3 -ffixed-x4 -ffixed-x5 -ffixed-x6		\
		   -ffixed-x7 -fcall-saved-x8 -fcall-saved-x9		\
		   -fcall-saved-x10 -fcall-saved-x11 -fcall-saved-x12	\
		   -fcall-saved-x13 -fcall-saved-x14 -fcall-saved-x15	\
		   -fcall-saved-x16 -fcall-saved-x17 -fcall-saved-x18
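
(To show what those flags buy at the call site: a simplified sketch of the
pattern used in atomic_lse.h, with the bl spelled out. Treat the details
as an approximation rather than a quote of the real source.)

static inline void atomic_add(int i, atomic_t *v)
{
	register int w0 asm ("w0") = i;
	register atomic_t *x1 asm ("x1") = v;

	/*
	 * ALTERNATIVE patching replaces the bl with the LSE instruction at
	 * boot when the CPU supports it. Because the out-of-line LL/SC
	 * routine is built with the flags above, the only clobber the
	 * caller has to declare is the link register.
	 */
	asm volatile(ARM64_LSE_ATOMIC_INSN(
	/* LL/SC fallback: out-of-line call */
	"	bl	__ll_sc_atomic_add",
	/* LSE atomic that replaces the bl at runtime */
	"	stadd	%w[i], %[v]")
	: [i] "+r" (w0), [v] "+Q" (v->counter)
	: "r" (x1)
	: "x30");
}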

> > to force the
> > explicit save/restore of all used registers (see arch/arm64/lib/Makefile),
> > otherwise we'd have to clutter the inline asm with constraints that
> > wouldn't be needed had we managed to patch the bl with an LSE atomic
> > instruction.
> >
> > If you're emitting a PLT, couldn't we end up with silent corruption of
> > x16 for modules using out-of-line atomics like this?
>
> If you violate the AAPCS64 ABI, then obviously the claim above does not hold.

Indeed, but this is what mainline is doing today and I'm not keen on
breaking it. One way to fix it would be to generate a different type of
plt for branches to the atomic functions that would use the stack
instead of x16.

Will

Ard Biesheuvel Feb. 25, 2016, 4:33 p.m. UTC | #14
On 25 February 2016 at 17:26, Will Deacon <will.deacon@arm.com> wrote:
> On Thu, Feb 25, 2016 at 05:12:01PM +0100, Ard Biesheuvel wrote:
>> [...]
>> Which options are those exactly?
>
> # Tell the compiler to treat all general purpose registers as
> # callee-saved, which allows for efficient runtime patching of the bl
> # instruction in the caller with an atomic instruction when supported by
> # the CPU. Result and argument registers are handled correctly, based on
> # the function prototype.
> lib-$(CONFIG_ARM64_LSE_ATOMICS) += atomic_ll_sc.o
> CFLAGS_atomic_ll_sc.o   := -fcall-used-x0 -ffixed-x1 -ffixed-x2         \
>                    -ffixed-x3 -ffixed-x4 -ffixed-x5 -ffixed-x6          \
>                    -ffixed-x7 -fcall-saved-x8 -fcall-saved-x9           \
>                    -fcall-saved-x10 -fcall-saved-x11 -fcall-saved-x12   \
>                    -fcall-saved-x13 -fcall-saved-x14 -fcall-saved-x15   \
>                    -fcall-saved-x16 -fcall-saved-x17 -fcall-saved-x18

Yikes. Is that safe?

>> > [...]
>>
>> If you violate the AAPCS64 ABI, then obviously the claim above does not hold.
>
> Indeed, but this is what mainline is doing today and I'm not keen on
> breaking it. One way to fix it would be to generate a different type of
> plt for branches to the atomic functions that would use the stack
> instead of x16.

AFAIK, gcc never uses x18 (the platform register) so we may be able to
use that instead. We'd need confirmation from the toolchain guys,
though ...

Will Deacon Feb. 25, 2016, 4:42 p.m. UTC | #15
On Thu, Feb 25, 2016 at 05:33:25PM +0100, Ard Biesheuvel wrote:
> On 25 February 2016 at 17:26, Will Deacon <will.deacon@arm.com> wrote:
> > [...]
>
> Yikes. Is that safe?

It seems to work, and x86 uses a similar trick for its hweight code.

> >> > [...]
> >>
> >> If you violate the AAPCS64 ABI, then obviously the claim above does not hold.
> >
> > Indeed, but this is what mainline is doing today and I'm not keen on
> > breaking it. One way to fix it would be to generate a different type of
> > plt for branches to the atomic functions that would use the stack
> > instead of x16.
>
> AFAIK, gcc never uses x18 (the platform register) so we may be able to
> use that instead. We'd need confirmation from the toolchain guys,
> though ...

In fact, a better thing to do is probably for the atomic code to
save/restore those registers explicitly and then remove them from the
cflags above.

I'll try hacking something together...

Will

Ard Biesheuvel Feb. 25, 2016, 4:43 p.m. UTC | #16
On 25 February 2016 at 17:42, Will Deacon <will.deacon@arm.com> wrote:
> On Thu, Feb 25, 2016 at 05:33:25PM +0100, Ard Biesheuvel wrote:
>> [...]
>> AFAIK, gcc never uses x18 (the platform register) so we may be able to
>> use that instead. We'd need confirmation from the toolchain guys,
>> though ...
>
> In fact, a better thing to do is probably for the atomic code to
> save/restore those registers explicitly and then remove them from the
> cflags above.
>
> I'll try hacking something together...

That would be more correct, indeed. Also, the PLT can only use br
<reg> so it will always enter the callee with the preserved value
stacked, and so the callee needs to be modified in any case.

Will Deacon Feb. 25, 2016, 4:46 p.m. UTC | #17
On Thu, Feb 25, 2016 at 05:43:35PM +0100, Ard Biesheuvel wrote:
> On 25 February 2016 at 17:42, Will Deacon <will.deacon@arm.com> wrote:
> > [...]
> > I'll try hacking something together...
>
> That would be more correct, indeed. Also, the PLT can only use br
> <reg> so it will always enter the callee with the preserved value
> stacked, and so the callee needs to be modified in any case.

Wait -- why does the callee need to be modified here? It has no idea
about whether or not it was invoked via a PLT.

Also, can I rely on the PLT not clobbering the condition code?

Will

Ard Biesheuvel Feb. 25, 2016, 4:49 p.m. UTC | #18
On 25 February 2016 at 17:46, Will Deacon <will.deacon@arm.com> wrote:
> On Thu, Feb 25, 2016 at 05:43:35PM +0100, Ard Biesheuvel wrote:
>> [...]
>> That would be more correct, indeed. Also, the PLT can only use br
>> <reg> so it will always enter the callee with the preserved value
>> stacked, and so the callee needs to be modified in any case.
>
> Wait -- why does the callee need to be modified here? It has no idea
> about whether or not it was invoked via a PLT.

No, it doesn't, and that is exactly the problem: the PLT must end in a
br <reg> instruction because that is the only branch instruction with
unlimited range. That means you will always enter the callee with its
address in some register rather than the value that should have been
preserved. I don't see a way to manage that from the callee

> Also, can I rely on the PLT not clobbering the condition code?

For my particular implementation, yes.

Ard Biesheuvel Feb. 25, 2016, 4:50 p.m. UTC | #19
On 25 February 2016 at 17:49, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> On 25 February 2016 at 17:46, Will Deacon <will.deacon@arm.com> wrote:
>> [...]
>
> No, it doesn't, and that is exactly the problem: the PLT must end in a
> br <reg> instruction because that is the only branch instruction with
> unlimited range. That means you will always enter the callee with its
> address in some register rather than the value that should have been
> preserved. I don't see a way to manage that from the callee

*caller! dammit

>> Also, can I rely on the PLT not clobbering the condition code?
>
> For my particular implementation, yes.

Will Deacon Feb. 25, 2016, 4:56 p.m. UTC | #20
On Thu, Feb 25, 2016 at 05:50:17PM +0100, Ard Biesheuvel wrote:
> On 25 February 2016 at 17:49, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> > [...]
> > No, it doesn't, and that is exactly the problem: the PLT must end in a
> > br <reg> instruction because that is the only branch instruction with
> > unlimited range. That means you will always enter the callee with its
> > address in some register rather than the value that should have been
> > preserved. I don't see a way to manage that from the callee
>
> *caller! dammit

The caller can do:

save x16, x17
bl plt
restore x16, x17

the plt will do:

get_addr_of_callee_into_x16_and_clobber_x17_or_something
br callee

then the callee will be compiled with all those weird options, but *not*
the ones specifying x16 and x17. That means it can happily use those guys
as scratch, because the caller will take care of them.
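
(Concretely, that ends up looking something like this at the call site --
an illustrative sketch only; equivalently, x16/x17 could simply be listed
as clobbers of the asm instead of the explicit stp/ldp:)

static inline void atomic_add(int i, atomic_t *v)
{
	register int w0 asm ("w0") = i;
	register atomic_t *x1 asm ("x1") = v;

	asm volatile(ARM64_LSE_ATOMIC_INSN(
	/* LL/SC path: preserve the registers a PLT veneer may clobber */
	"	stp	x16, x17, [sp, #-16]!\n"
	"	bl	__ll_sc_atomic_add\n"
	"	ldp	x16, x17, [sp], #16",
	/* LSE path: padded with nops so both alternatives stay the same size */
	"	nop\n"
	"	stadd	%w[i], %[v]\n"
	"	nop")
	: [i] "+r" (w0), [v] "+Q" (v->counter)
	: "r" (x1)
	: "x30");
}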

Will

Ard Biesheuvel Feb. 25, 2016, 5:31 p.m. UTC | #21
On 25 February 2016 at 17:56, Will Deacon <will.deacon@arm.com> wrote:
> On Thu, Feb 25, 2016 at 05:50:17PM +0100, Ard Biesheuvel wrote:
>> [...]
>
> The caller can do:
>
> save x16, x17
> bl plt
> restore x16, x17

This is __LL_SC_CALL, right? So you just fold a stp/ldp in there?

> the plt will do:

>

> get_addr_of_callee_into_x16_and_clobber_x17_or_something

> br callee

>

> then the callee will be compiled with all those weird options, but *not*

> the ones specifying x16 and x17. That means it can happily use those guys

> as scratch, because the caller will take care of them.

>


Yes, that makes sense. But if you don't relax that restriction, you
only need the alternative __LL_SC_CALL for modules.
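
A minimal sketch of the caller-side spill discussed above, assuming a hypothetical wrapper macro rather than the actual __LL_SC_CALL definition: the stp/ldp pair keeps x16/x17 safe across a bl that may be redirected through a PLT veneer, so the out-of-line LL/SC code can keep treating them as fixed registers. As noted, only the module build of the callers would actually need the extra pair of instructions.

/*
 * Hypothetical illustration, not the in-tree macro: expands to an
 * assembly template fragment that preserves x16/x17 around a call
 * that may go through a PLT veneer.
 */
#define CALL_MAYBE_VIA_PLT(sym)					\
	"	stp	x16, x17, [sp, #-16]!\n"		\
	"	bl	" sym "\n"				\
	"	ldp	x16, x17, [sp], #16\n"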


Patch

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index cd767fa3037a..141f65ab0ed5 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -394,6 +394,7 @@  config ARM64_ERRATUM_843419
 	bool "Cortex-A53: 843419: A load or store might access an incorrect address"
 	depends on MODULES
 	default y
+	select ARM64_MODULE_CMODEL_LARGE
 	help
 	  This option builds kernel modules using the large memory model in
 	  order to avoid the use of the ADRP instruction, which can cause
@@ -753,6 +754,14 @@  config ARM64_LSE_ATOMICS
 
 endmenu
 
+config ARM64_MODULE_CMODEL_LARGE
+	bool
+
+config ARM64_MODULE_PLTS
+	bool
+	select ARM64_MODULE_CMODEL_LARGE
+	select HAVE_MOD_ARCH_SPECIFIC
+
 endmenu
 
 menu "Boot options"
diff --git a/arch/arm64/Makefile b/arch/arm64/Makefile
index 307237cfe728..a6bba9623836 100644
--- a/arch/arm64/Makefile
+++ b/arch/arm64/Makefile
@@ -43,10 +43,14 @@  endif
 
 CHECKFLAGS	+= -D__aarch64__
 
-ifeq ($(CONFIG_ARM64_ERRATUM_843419), y)
+ifeq ($(CONFIG_ARM64_MODULE_CMODEL_LARGE), y)
 KBUILD_CFLAGS_MODULE	+= -mcmodel=large
 endif
 
+ifeq ($(CONFIG_ARM64_MODULE_PLTS),y)
+KBUILD_LDFLAGS_MODULE	+= -T $(srctree)/arch/arm64/kernel/module.lds
+endif
+
 # Default value
 head-y		:= arch/arm64/kernel/head.o
 
diff --git a/arch/arm64/include/asm/module.h b/arch/arm64/include/asm/module.h
index e80e232b730e..8652fb613304 100644
--- a/arch/arm64/include/asm/module.h
+++ b/arch/arm64/include/asm/module.h
@@ -20,4 +20,15 @@ 
 
 #define MODULE_ARCH_VERMAGIC	"aarch64"
 
+#ifdef CONFIG_ARM64_MODULE_PLTS
+struct mod_arch_specific {
+	struct elf64_shdr	*plt;
+	int			plt_num_entries;
+	int			plt_max_entries;
+};
+#endif
+
+u64 module_emit_plt_entry(struct module *mod, const Elf64_Rela *rela,
+			  Elf64_Sym *sym);
+
 #endif /* __ASM_MODULE_H */
diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile
index 83cd7e68e83b..e2f0a755beaa 100644
--- a/arch/arm64/kernel/Makefile
+++ b/arch/arm64/kernel/Makefile
@@ -30,6 +30,7 @@  arm64-obj-$(CONFIG_COMPAT)		+= sys32.o kuser32.o signal32.o 	\
 					   ../../arm/kernel/opcodes.o
 arm64-obj-$(CONFIG_FUNCTION_TRACER)	+= ftrace.o entry-ftrace.o
 arm64-obj-$(CONFIG_MODULES)		+= arm64ksyms.o module.o
+arm64-obj-$(CONFIG_ARM64_MODULE_PLTS)	+= module-plts.o
 arm64-obj-$(CONFIG_PERF_EVENTS)		+= perf_regs.o perf_callchain.o
 arm64-obj-$(CONFIG_HW_PERF_EVENTS)	+= perf_event.o
 arm64-obj-$(CONFIG_HAVE_HW_BREAKPOINT)	+= hw_breakpoint.o
diff --git a/arch/arm64/kernel/module-plts.c b/arch/arm64/kernel/module-plts.c
new file mode 100644
index 000000000000..1ce90d8450ae
--- /dev/null
+++ b/arch/arm64/kernel/module-plts.c
@@ -0,0 +1,201 @@ 
+/*
+ * Copyright (C) 2014-2016 Linaro Ltd. <ard.biesheuvel@linaro.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/elf.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/sort.h>
+
+struct plt_entry {
+	/*
+	 * A program that conforms to the AArch64 Procedure Call Standard
+	 * (AAPCS64) must assume that a veneer that alters IP0 (x16) and/or
+	 * IP1 (x17) may be inserted at any branch instruction that is
+	 * exposed to a relocation that supports long branches. Since that
+	 * is exactly what we are dealing with here, we are free to use x16
+	 * as a scratch register in the PLT veneers.
+	 */
+	__le32	mov0;	/* movn	x16, #0x....			*/
+	__le32	mov1;	/* movk	x16, #0x...., lsl #16		*/
+	__le32	mov2;	/* movk	x16, #0x...., lsl #32		*/
+	__le32	br;	/* br	x16				*/
+};
+
+u64 module_emit_plt_entry(struct module *mod, const Elf64_Rela *rela,
+			  Elf64_Sym *sym)
+{
+	struct plt_entry *plt = (struct plt_entry *)mod->arch.plt->sh_addr;
+	int i = mod->arch.plt_num_entries;
+	u64 val = sym->st_value + rela->r_addend;
+
+	/*
+	 * We only emit PLT entries against undefined (SHN_UNDEF) symbols,
+	 * which are listed in the ELF symtab section, but without a type
+	 * or a size.
+	 * So, similar to how the module loader uses the Elf64_Sym::st_value
+	 * field to store the resolved addresses of undefined symbols, let's
+	 * borrow the Elf64_Sym::st_size field (whose value is never used by
+	 * the module loader, even for symbols that are defined) to record
+	 * the address of a symbol's associated PLT entry as we emit it for a
+	 * zero addend relocation (which is the only kind we have to deal with
+	 * in practice). This allows us to find duplicates without having to
+	 * go through the table every time.
+	 */
+	if (rela->r_addend == 0 && sym->st_size != 0) {
+		BUG_ON(sym->st_size < (u64)plt || sym->st_size >= (u64)&plt[i]);
+		return sym->st_size;
+	}
+
+	mod->arch.plt_num_entries++;
+	BUG_ON(mod->arch.plt_num_entries > mod->arch.plt_max_entries);
+
+	/*
+	 * MOVK/MOVN/MOVZ opcode:
+	 * +--------+------------+--------+-----------+-------------+---------+
+	 * | sf[31] | opc[30:29] | 100101 | hw[22:21] | imm16[20:5] | Rd[4:0] |
+	 * +--------+------------+--------+-----------+-------------+---------+
+	 *
+	 * Rd     := 0x10 (x16)
+	 * hw     := 0b00 (no shift), 0b01 (lsl #16), 0b10 (lsl #32)
+	 * opc    := 0b11 (MOVK), 0b00 (MOVN), 0b10 (MOVZ)
+	 * sf     := 1 (64-bit variant)
+	 */
+	plt[i] = (struct plt_entry){
+		cpu_to_le32(0x92800010 | (((~val      ) & 0xffff)) << 5),
+		cpu_to_le32(0xf2a00010 | ((( val >> 16) & 0xffff)) << 5),
+		cpu_to_le32(0xf2c00010 | ((( val >> 32) & 0xffff)) << 5),
+		cpu_to_le32(0xd61f0200)
+	};
+
+	if (rela->r_addend == 0)
+		sym->st_size = (u64)&plt[i];
+
+	return (u64)&plt[i];
+}
+
+#define cmp_3way(a,b)	((a) < (b) ? -1 : (a) > (b))
+
+static int cmp_rela(const void *a, const void *b)
+{
+	const Elf64_Rela *x = a, *y = b;
+	int i;
+
+	/* sort by type, symbol index and addend */
+	i = cmp_3way(ELF64_R_TYPE(x->r_info), ELF64_R_TYPE(y->r_info));
+	if (i == 0)
+		i = cmp_3way(ELF64_R_SYM(x->r_info), ELF64_R_SYM(y->r_info));
+	if (i == 0)
+		i = cmp_3way(x->r_addend, y->r_addend);
+	return i;
+}
+
+static bool duplicate_rel(const Elf64_Rela *rela, int num)
+{
+	/*
+	 * Entries are sorted by type, symbol index and addend. That means
+	 * that, if a duplicate entry exists, it must be in the preceding
+	 * slot.
+	 */
+	return num > 0 && cmp_rela(rela + num, rela + num - 1) == 0;
+}
+
+static unsigned int count_plts(Elf64_Sym *syms, Elf64_Rela *rela, int num)
+{
+	unsigned int ret = 0;
+	Elf64_Sym *s;
+	int i;
+
+	for (i = 0; i < num; i++) {
+		switch (ELF64_R_TYPE(rela[i].r_info)) {
+		case R_AARCH64_JUMP26:
+		case R_AARCH64_CALL26:
+			/*
+			 * We only have to consider branch targets that resolve
+			 * to undefined symbols. This is not simply a heuristic,
+			 * it is a fundamental limitation, since the PLT itself
+			 * is part of the module, and needs to be within 128 MB
+			 * as well, so modules can never grow beyond that limit.
+			 */
+			s = syms + ELF64_R_SYM(rela[i].r_info);
+			if (s->st_shndx != SHN_UNDEF)
+				break;
+
+			/*
+			 * Jump relocations with non-zero addends against
+			 * undefined symbols are supported by the ELF spec, but
+			 * do not occur in practice (e.g., 'jump n bytes past
+			 * the entry point of undefined function symbol f').
+			 * So we need to support them, but there is no need to
+			 * take them into consideration when trying to optimize
+			 * this code. So let's only check for duplicates when
+			 * the addend is zero: this allows us to record the PLT
+			 * entry address in the symbol table itself, rather than
+			 * having to search the list for duplicates each time we
+			 * emit one.
+			 */
+			if (rela[i].r_addend != 0 || !duplicate_rel(rela, i))
+				ret++;
+			break;
+		}
+	}
+	return ret;
+}
+
+int module_frob_arch_sections(Elf_Ehdr *ehdr, Elf_Shdr *sechdrs,
+			      char *secstrings, struct module *mod)
+{
+	unsigned long plt_max_entries = 0;
+	Elf64_Sym *syms = NULL;
+	int i;
+
+	/*
+	 * Find the empty .plt section so we can expand it to store the PLT
+	 * entries. Record the symtab address as well.
+	 */
+	for (i = 0; i < ehdr->e_shnum; i++) {
+		if (strcmp(".plt", secstrings + sechdrs[i].sh_name) == 0)
+			mod->arch.plt = sechdrs + i;
+		else if (sechdrs[i].sh_type == SHT_SYMTAB)
+			syms = (Elf64_Sym *)sechdrs[i].sh_addr;
+	}
+
+	if (!mod->arch.plt) {
+		pr_err("%s: module PLT section missing\n", mod->name);
+		return -ENOEXEC;
+	}
+	if (!syms) {
+		pr_err("%s: module symtab section missing\n", mod->name);
+		return -ENOEXEC;
+	}
+
+	for (i = 0; i < ehdr->e_shnum; i++) {
+		Elf64_Rela *rels = (void *)ehdr + sechdrs[i].sh_offset;
+		int numrels = sechdrs[i].sh_size / sizeof(Elf64_Rela);
+		Elf64_Shdr *dstsec = sechdrs + sechdrs[i].sh_info;
+
+		if (sechdrs[i].sh_type != SHT_RELA)
+			continue;
+
+		/* ignore relocations that operate on non-exec sections */
+		if (!(dstsec->sh_flags & SHF_EXECINSTR))
+			continue;
+
+		/* sort by type, symbol index and addend */
+		sort(rels, numrels, sizeof(Elf64_Rela), cmp_rela, NULL);
+
+		plt_max_entries += count_plts(syms, rels, numrels);
+	}
+
+	mod->arch.plt->sh_type = SHT_NOBITS;
+	mod->arch.plt->sh_flags = SHF_EXECINSTR | SHF_ALLOC;
+	mod->arch.plt->sh_addralign = L1_CACHE_BYTES;
+	mod->arch.plt->sh_size = plt_max_entries * sizeof(struct plt_entry);
+	mod->arch.plt_num_entries = 0;
+	mod->arch.plt_max_entries = plt_max_entries;
+	return 0;
+}
diff --git a/arch/arm64/kernel/module.c b/arch/arm64/kernel/module.c
index 93e970231ca9..84113d3e1df1 100644
--- a/arch/arm64/kernel/module.c
+++ b/arch/arm64/kernel/module.c
@@ -38,6 +38,11 @@  void *module_alloc(unsigned long size)
 				GFP_KERNEL, PAGE_KERNEL_EXEC, 0,
 				NUMA_NO_NODE, __builtin_return_address(0));
 
+	if (!p && IS_ENABLED(CONFIG_ARM64_MODULE_PLTS))
+		p = __vmalloc_node_range(size, MODULE_ALIGN, VMALLOC_START,
+				VMALLOC_END, GFP_KERNEL, PAGE_KERNEL_EXEC, 0,
+				NUMA_NO_NODE, __builtin_return_address(0));
+
 	if (p && (kasan_module_alloc(p, size) < 0)) {
 		vfree(p);
 		return NULL;
@@ -361,6 +366,13 @@  int apply_relocate_add(Elf64_Shdr *sechdrs,
 		case R_AARCH64_CALL26:
 			ovf = reloc_insn_imm(RELOC_OP_PREL, loc, val, 2, 26,
 					     AARCH64_INSN_IMM_26);
+
+			if (IS_ENABLED(CONFIG_ARM64_MODULE_PLTS) &&
+			    ovf == -ERANGE) {
+				val = module_emit_plt_entry(me, &rel[i], sym);
+				ovf = reloc_insn_imm(RELOC_OP_PREL, loc, val, 2,
+						     26, AARCH64_INSN_IMM_26);
+			}
 			break;
 
 		default:
diff --git a/arch/arm64/kernel/module.lds b/arch/arm64/kernel/module.lds
new file mode 100644
index 000000000000..8949f6c6f729
--- /dev/null
+++ b/arch/arm64/kernel/module.lds
@@ -0,0 +1,3 @@ 
+SECTIONS {
+	.plt (NOLOAD) : { BYTE(0) }
+}
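
As a stand-alone sanity check of the movn/movk/movk composition used by module_emit_plt_entry() above (not part of the patch): the movn leaves bits 63:48 all-ones, which already matches any arm64 kernel or module virtual address, so three instructions are enough to rebuild the full 64-bit target.

#include <assert.h>
#include <stdint.h>

/* Reproduce, in plain C, the value the veneer leaves in x16. */
static uint64_t veneer_result(uint64_t val)
{
	uint64_t x16;

	x16 = ~((~val) & 0xffffULL);		/* movn x16, #(~val & 0xffff) */
	x16 = (x16 & ~0xffff0000ULL) |		/* movk x16, #((val >> 16) & 0xffff), lsl #16 */
	      (val &  0xffff0000ULL);
	x16 = (x16 & ~0xffff00000000ULL) |	/* movk x16, #((val >> 32) & 0xffff), lsl #32 */
	      (val &  0xffff00000000ULL);
	return x16;
}

int main(void)
{
	uint64_t target = 0xffffffbffc000123ULL;	/* arbitrary kernel-space example VA */

	assert(veneer_result(target) == target);
	return 0;
}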