Message ID | 20241028160917.1380714-6-alexander.shishkin@linux.intel.com
---|---
State | New
Series | Enable Linear Address Space Separation support
On 10/28/2024 9:07 AM, Alexander Shishkin wrote:
> In order to map the EFI runtime services, set_virtual_address_map
> needs to be called, which resides in the lower half of the address
> space. This means that LASS needs to be temporarily disabled around
> this call. This can only be done before the CR pinning is set up.
> ...
>
>  	/*
>  	 * This needs to follow the FPU initializtion, since EFI depends on it.
> +	 * It also needs to precede the CR pinning setup, because we need to be
> +	 * able to temporarily clear the CR4.LASS bit in order to execute the
> +	 * set_virtual_address_map call, which resides in lower addresses and
> +	 * would trip LASS if enabled.
>  	 */

It would be helpful to describe why lass_stac()/clac() doesn't work here
and instead the heavy-handed CR4 toggling is needed.

>  	if (efi_enabled(EFI_RUNTIME_SERVICES))
>  		efi_enter_virtual_mode();
>
> +	setup_cr_pinning();
> +
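For context, a minimal sketch of the kind of CR4 toggling being referred to,
written against the existing cr4_clear_bits()/cr4_set_bits() helpers.
X86_FEATURE_LASS and X86_CR4_LASS are the definitions introduced earlier in
the series; the placement below is illustrative only, not the series' actual
EFI patch:

/*
 * Illustration only: the "heavy-handed" alternative to lass_stac()/lass_clac()
 * is to clear CR4.LASS entirely around the EFI call, which is why this has to
 * run before setup_cr_pinning() locks CR4 down.
 */
static void __init efi_virtual_mode_sketch(void)
{
	/*
	 * set_virtual_address_map executes from the low 1:1 mapping and
	 * would trip LASS, so drop CR4.LASS for the duration of the call.
	 */
	if (cpu_feature_enabled(X86_FEATURE_LASS))
		cr4_clear_bits(X86_CR4_LASS);

	if (efi_enabled(EFI_RUNTIME_SERVICES))
		efi_enter_virtual_mode();

	if (cpu_feature_enabled(X86_FEATURE_LASS))
		cr4_set_bits(X86_CR4_LASS);

	/* Only now is it safe to pin CR4. */
	setup_cr_pinning();
}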
>  	/*
>  	 * This needs to follow the FPU initializtion, since EFI depends on it.
> +	 * It also needs to precede the CR pinning setup, because we need to be
> +	 * able to temporarily clear the CR4.LASS bit in order to execute the
> +	 * set_virtual_address_map call, which resides in lower addresses and
> +	 * would trip LASS if enabled.
>  	 */

Why are the temporary mappings used to patch kernel code in the lower half
of the virtual address space? The comments in front of use_temporary_mm() say:

 * Using a temporary mm allows to set temporary mappings that are not accessible
 * by other CPUs. Such mappings are needed to perform sensitive memory writes
 * that override the kernel memory protections (e.g., W^X), without exposing the
 * temporary page-table mappings that are required for these write operations to
 * other CPUs. Using a temporary mm also allows to avoid TLB shootdowns when the
 * mapping is torn down.

But couldn't we map into upper half and do some/all of:

1) Trust that there aren't stupid bugs that dereference random pointers into the
   temporary mapping?
2) Make a "this CPU only" mapping
3) Avoid preemption while patching so there is no need for TLB shootdown
   by other CPUs when the temporary mapping is torn down, just flush local TLB.

-Tony
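For reference, a heavily abbreviated sketch of the text-poking flow in
question, loosely following __text_poke() in arch/x86/kernel/alternative.c.
use_temporary_mm()/unuse_temporary_mm(), temp_mm_state_t, poking_mm and
poking_addr are the helpers and globals defined in that file; error handling,
cross-page writes and KASAN details are omitted:

/* Abbreviated sketch of the lower-half text-poking pattern under discussion. */
static void text_poke_sketch(void *addr, const void *src, size_t len)
{
	struct page *page = core_kernel_text((unsigned long)addr) ?
			    virt_to_page(addr) : vmalloc_to_page(addr);
	temp_mm_state_t prev;
	unsigned long flags;
	spinlock_t *ptl;
	pte_t *ptep;

	ptep = get_locked_pte(poking_mm, poking_addr, &ptl);

	local_irq_save(flags);

	/*
	 * Alias the target page at poking_addr, a low (user-half) address
	 * that exists only in the dedicated poking_mm.
	 */
	set_pte_at(poking_mm, poking_addr, ptep, mk_pte(page, PAGE_KERNEL));

	/* Switch this CPU to poking_mm; no other CPU can see the alias. */
	prev = use_temporary_mm(poking_mm);

	/*
	 * With LASS enabled, this low-address write is what needs to sit
	 * inside the stac()/clac() window.
	 */
	memcpy((u8 *)poking_addr + offset_in_page(addr), src, len);

	/*
	 * Tear the alias down again: clear the PTE, switch back to the
	 * previous mm and flush the (effectively local) TLB.
	 */
	pte_clear(poking_mm, poking_addr, ptep);
	unuse_temporary_mm(prev);
	flush_tlb_mm_range(poking_mm, poking_addr, poking_addr + PAGE_SIZE,
			   PAGE_SHIFT, false);

	local_irq_restore(flags);
	pte_unmap_unlock(ptep, ptl);
}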
On 10/29/24 15:26, Luck, Tony wrote:
>>  	/*
>>  	 * This needs to follow the FPU initializtion, since EFI depends on it.
>> +	 * It also needs to precede the CR pinning setup, because we need to be
>> +	 * able to temporarily clear the CR4.LASS bit in order to execute the
>> +	 * set_virtual_address_map call, which resides in lower addresses and
>> +	 * would trip LASS if enabled.
>>  	 */
>
> Why are the temporary mappings used to patch kernel code in the lower half
> of the virtual address space?

I was just asking myself the same thing.

The upper half is always mapped uniformly. When you create an mm, you copy
the 256th->511th pgd entries verbatim from the init_mm's pgd. If you map
something into a <=255th pgd entry, it isn't (by default) visible to other
mm's. That's why a new mm also tends to get you a new process.

> But couldn't we map into upper half and do some/all of:
>
> 1) Trust that there aren't stupid bugs that dereference random pointers into the
>    temporary mapping?
> 2) Make a "this CPU only" mapping
> 3) Avoid preemption while patching so there is no need for TLB shootdown
>    by other CPUs when the temporary mapping is torn down, just flush local TLB.

It's about enforcing W^X semantics. We should limit the time and scope where
mappings have some data both writeable and executable. If we poke text in the
upper half of the address space, any kernel thread might be exploited to write
to what will soon be executable. If we do it in the lower half in its own mm,
you have to compromise the thread doing the text poking after the mapping is
created but before it is invalidated.

With LASS you *ALSO* need to do it in the STAC/CLAC window, which is smaller
than the window when the TLB is valid.

*IF* we switched things to do text poking in the upper half of the address
space, we'd probably want to find a completely unused PGD entry. I'm not sure
off the top of my head if we have a good one for that or if it's worth the
trouble.
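The "copy the 256th->511th pgd entries verbatim" point corresponds, roughly,
to the following lines from pgd_ctor() in arch/x86/mm/pgtable.c (simplified:
the CONFIG_PGTABLE_LEVELS guard around the clone is dropped):

/* Simplified excerpt: every new mm inherits the kernel-half PGD entries. */
static void pgd_ctor(struct mm_struct *mm, pgd_t *pgd)
{
	/*
	 * Clone the kernel-half entries (256..511 on 64-bit) from the
	 * init_mm page tables; the user half stays private to this mm.
	 */
	clone_pgd_range(pgd + KERNEL_PGD_BOUNDARY,
			swapper_pg_dir + KERNEL_PGD_BOUNDARY,
			KERNEL_PGD_PTRS);
}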
> *IF* we switched things to do text poking in the upper half of the
> address space, we'd probably want to find a completely unused PGD entry.
> I'm not sure off the top of my head if we have a good one for that or
> if it's worth the trouble.

I expect that would be easy on 64-bit (no way the kernel needs all the
PGD entries from 256..511) but hard for 32-bit (where kernel address
space is in critically short supply on any machine with >1GB RAM).

-Tony
On 10/29/24 15:59, Luck, Tony wrote:
>> *IF* we switched things to do text poking in the upper half of the
>> address space, we'd probably want to find a completely unused PGD entry.
>> I'm not sure off the top of my head if we have a good one for that or
>> if it's worth the trouble.
>
> I expect that would be easy on 64-bit (no way the kernel needs all the
> PGD entries from 256..511) but hard for 32-bit (where kernel address
> space is in critically short supply on any machine with >1GB RAM).

No LASS on 32 bits...

	-hpa
On 10/29/24 15:59, Luck, Tony wrote:
>> *IF* we switched things to do text poking in the upper half of the
>> address space, we'd probably want to find a completely unused PGD entry.
>> I'm not sure off the top of my head if we have a good one for that or
>> if it's worth the trouble.
>
> I expect that would be easy on 64-bit (no way the kernel needs all the
> PGD entries from 256..511) but hard for 32-bit (where kernel address
> space is in critically short supply on any machine with >1GB RAM).

Yeah, I was talking about 64-bit only. On 32-bit PAE a PGD maps 1/4 of
the address space, which is totally unworkable for stealing.
On 10/29/24 16:03, Dave Hansen wrote:
> On 10/29/24 15:59, Luck, Tony wrote:
>>> *IF* we switched things to do text poking in the upper half of the
>>> address space, we'd probably want to find a completely unused PGD entry.
>>> I'm not sure off the top of my head if we have a good one for that or
>>> if it's worth the trouble.
>> I expect that would be easy on 64-bit (no way the kernel needs all the
>> PGD entries from 256..511) but hard for 32-bit (where kernel address
>> space is in critically short supply on any machine with >1GB RAM).
>
> Yeah, I was talking about 64-bit only. On 32-bit PAE a PGD maps 1/4 of
> the address space which is totally unworkable for stealing.

But it is also not necessary.

	-hpa
>> Yeah, I was talking about 64-bit only. On 32-bit PAE a PGD maps 1/4 of
>> the address space which is totally unworkable for stealing.
>
> But it is also not necessary.

So maybe we could make the 64-bit version of use_temporary_mm() use some
reserved address mapping to a reserved PGD in the upper half of address
space, and the 32-bit version continue to use "user" addresses. It's
unclear to me whether adding complexity here would be worth it to remove
the 64-bit STAC/CLAC text patching issues.

-Tony
On 10/29/24 16:18, Luck, Tony wrote:
>>> Yeah, I was talking about 64-bit only. On 32-bit PAE a PGD maps 1/4 of
>>> the address space which is totally unworkable for stealing.
>>
>> But it is also not necessary.
>
> So maybe we could make the 64-bit version of use_temporary_mm()
> use some reserved address mapping to a reserved PGD in the upper
> half of address space, and the 32-bit version continue to use "user"
> addresses. It's unclear to me whether adding complexity here would be
> worth it to remove the 64-bit STAC/CLAC text patching issues.

For 32 bits we can also simply use something further down in the hierarchy.
It's not like we can afford to have the PGD be anything other than RWX on
32 bits.

	-hpa
On Tue, Oct 29, 2024 at 11:18:29PM +0000, Luck, Tony wrote:
> >> Yeah, I was talking about 64-bit only. On 32-bit PAE a PGD maps 1/4 of
> >> the address space which is totally unworkable for stealing.
> >
> > But it is also not necessary.
>
> So maybe we could make the 64-bit version of use_temporary_mm()
> use some reserved address mapping to a reserved PGD in the upper
> half of address space, and the 32-bit version continue to use "user"
> addresses. It's unclear to me whether adding complexity here would be
> worth it to remove the 64-bit STAC/CLAC text patching issues.

Redesigning use_temporary_mm() is an interesting experiment, but it is
out of scope for the series. LASS blocks LAM enabling. It would be nice
to get LAM re-enabled soonish.

Maybe we can look at use_temporary_mm() again after LASS gets upstream?
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index b24ad418536e..c249fd0aa3fb 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1953,7 +1953,6 @@ static __init void identify_boot_cpu(void)
 	enable_sep_cpu();
 #endif
 	cpu_detect_tlb(&boot_cpu_data);
-	setup_cr_pinning();
 
 	tsx_init();
 	tdx_init();
@@ -2385,10 +2384,16 @@ void __init arch_cpu_finalize_init(void)
 
 	/*
 	 * This needs to follow the FPU initializtion, since EFI depends on it.
+	 * It also needs to precede the CR pinning setup, because we need to be
+	 * able to temporarily clear the CR4.LASS bit in order to execute the
+	 * set_virtual_address_map call, which resides in lower addresses and
+	 * would trip LASS if enabled.
 	 */
 	if (efi_enabled(EFI_RUNTIME_SERVICES))
 		efi_enter_virtual_mode();
 
+	setup_cr_pinning();
+
 	/*
 	 * Ensure that access to the per CPU representation has the initial
 	 * boot CPU configuration.
In order to map the EFI runtime services, set_virtual_address_map needs
to be called, which resides in the lower half of the address space. This
means that LASS needs to be temporarily disabled around this call. This
can only be done before the CR pinning is set up.

Move CR pinning setup behind the EFI initialization.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/kernel/cpu/common.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)