Message ID: 20250429113242.998312-1-vschneid@redhat.com
Series: context_tracking,x86: Defer some IPIs until a user->kernel transition
On Wed, 30 Apr 2025 11:07:35 -0700
Dave Hansen <dave.hansen@intel.com> wrote:

> On 4/30/25 10:20, Steven Rostedt wrote:
> > On Tue, 29 Apr 2025 09:11:57 -0700
> > Dave Hansen <dave.hansen@intel.com> wrote:
> >
> >> I don't think we should do this series.
> >
> > Could you provide more rationale for your decision?
>
> I talked about it a bit in here:
>
> https://lore.kernel.org/all/408ebd8b-4bfb-4c4f-b118-7fe853c6e897@intel.com/

Hmm, that's easily missed. But thanks for linking it.

> But, basically, this series puts a new onus on the entry code: it can't
> touch the vmalloc() area ... except the LDT ... and except the PEBS
> buffers. If anyone touches vmalloc()'d memory (or anything else that
> eventually gets deferred), they crash. They _only_ crash on these
> NOHZ_FULL systems.
>
> Putting new restrictions on the entry code is really nasty. Let's say a
> new hardware feature showed up that touched vmalloc()'d memory in the
> entry code. Probably, nobody would notice until they got that new
> hardware and tried to do a NOHZ_FULL workload. It might take years to
> uncover, once that hardware was out in the wild.
>
> I have a substantial number of gray hairs from dealing with corner cases
> in the entry code.
>
> You _could_ make it more debuggable. Could you make this work for all
> tasks, not just NOHZ_FULL? The same logic _should_ apply. It would be
> inefficient, but would provide good debugging coverage.
>
> I also mentioned this earlier, but PTI could be leveraged here to ensure
> that the TLB is flushed properly. You could have the rule that anything
> mapped into the user page table can't have a deferred flush and then do
> deferred flushes at SWITCH_TO_KERNEL_CR3 time. Yeah, that's in
> arch-specific assembly, but it's a million times easier to reason about
> because the window where a deferred-flush allocation might bite you is
> so small.
>
> Look at the syscall code for instance:
>
> > SYM_CODE_START(entry_SYSCALL_64)
> > 	swapgs
> > 	movq	%rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)
> > 	SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
>
> You can _trivially_ audit this and know that swapgs doesn't touch memory
> and that as long as PER_CPU_VAR()s and the process stack don't have
> their mappings munged and flushes deferred that this would be correct.

Hmm, so there is still a path for this?

At least if it added more ways to debug it, and some other changes to make
the locations where vmalloc is dangerous smaller?

> >> If folks want this functionality, they should get a new CPU that can
> >> flush the TLB without IPIs.
> >
> > That's a pretty heavy handed response. I'm not sure that's always a
> > feasible solution.
> >
> > From my experience in the world, software has always been around to fix
> > the hardware, not the other way around ;-)
>
> Both AMD and Intel have hardware to do it. ARM CPUs do it too, I think.
> You can go buy the Intel hardware off the shelf today.

Sure, but changing CPUs on machines is not always that feasible either.

-- Steve
On 4/30/25 12:42, Steven Rostedt wrote:
>> Look at the syscall code for instance:
>>
>>> SYM_CODE_START(entry_SYSCALL_64)
>>> 	swapgs
>>> 	movq	%rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)
>>> 	SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
>> You can _trivially_ audit this and know that swapgs doesn't touch memory
>> and that as long as PER_CPU_VAR()s and the process stack don't have
>> their mappings munged and flushes deferred that this would be correct.
> Hmm, so there is still a path for this?
>
> At least if it added more ways to debug it, and some other changes to make
> the locations where vmalloc is dangerous smaller?

Being able to debug it would be a good start. But, more generally, what
we need is for more people to be able to run the code in the first
place. Would a _normal_ system (without setups that are trying to do
NOHZ_FULL) ever be able to defer TLB flush IPIs?

If the answer is no, then, yeah, I'll settle for some debugging options.

But if you shrink the window as small as I'm talking about, it would
look very different from this series.

For instance, imagine when a CPU goes into the NOHZ mode. Could it just
unconditionally flush the TLB on the way back into the kernel (in the
same SWITCH_TO_KERNEL_CR3 spot)? Yeah, it'll make entry into the kernel
expensive for NOHZ tasks, but it's not *THAT* bad. And if the entire
point of a NOHZ_FULL task is to minimize the number of kernel entries
then a little extra overhead there doesn't sound too bad.

Also, about the new hardware, I suspect there's some mystery customer
lurking in the shadows asking folks for this functionality. Could you at
least go _talk_ to the mystery customer(s) and see which hardware they
care about? They might already even have the magic CPUs they need for
this, or have them on the roadmap. If they've got Intel CPUs, I'd be
happy to help figure it out.
On 30/04/25 13:00, Dave Hansen wrote:
> On 4/30/25 12:42, Steven Rostedt wrote:
>>> Look at the syscall code for instance:
>>>
>>>> SYM_CODE_START(entry_SYSCALL_64)
>>>> 	swapgs
>>>> 	movq	%rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)
>>>> 	SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
>>> You can _trivially_ audit this and know that swapgs doesn't touch memory
>>> and that as long as PER_CPU_VAR()s and the process stack don't have
>>> their mappings munged and flushes deferred that this would be correct.
>> Hmm, so there is still a path for this?
>>
>> At least if it added more ways to debug it, and some other changes to make
>> the locations where vmalloc is dangerous smaller?
>
> Being able to debug it would be a good start. But, more generally, what
> we need is for more people to be able to run the code in the first
> place. Would a _normal_ system (without setups that are trying to do
> NOHZ_FULL) ever be able to defer TLB flush IPIs?
>
> If the answer is no, then, yeah, I'll settle for some debugging options.
>
> But if you shrink the window as small as I'm talking about, it would
> look very different from this series.
>
> For instance, imagine when a CPU goes into the NOHZ mode. Could it just
> unconditionally flush the TLB on the way back into the kernel (in the
> same SWITCH_TO_KERNEL_CR3 spot)? Yeah, it'll make entry into the kernel
> expensive for NOHZ tasks, but it's not *THAT* bad. And if the entire
> point of a NOHZ_FULL task is to minimize the number of kernel entries
> then a little extra overhead there doesn't sound too bad.

Right, so my thought per your previous comments was to special case the
TLB flush, depend on kPTI and do it unconditionally in
SWITCH_TO_KERNEL_CR3 just like you've described - but keep the context
tracking mechanism for other deferrable operations.

My gripe with that was having two separate mechanisms:
- super early entry around SWITCH_TO_KERNEL_CR3
- later entry at context tracking

Shifting everything to SWITCH_TO_KERNEL_CR3 means we lose the
context_tracking infra to dynamically defer operations (atomically
reading and writing to context_tracking.state), which means we
unconditionally run all possible deferrable operations. This doesn't
scream scalable, even though as you say NOHZ_FULL kernel entry is
already a "you lose" situation.

Yet another option is to duplicate the context tracking state
specifically for IPI deferral and have it driven in/by
SWITCH_TO_KERNEL_CR3, which is also not super savoury.

I suppose I can start poking around running deferred ops in that
SWITCH_TO_KERNEL_CR3 region, and add state/infra on top. Let's see where
this gets me :-)

Again, thanks for the insight and the suggestions Dave!

> Also, about the new hardware, I suspect there's some mystery customer
> lurking in the shadows asking folks for this functionality. Could you at
> least go _talk_ to the mystery customer(s) and see which hardware they
> care about? They might already even have the magic CPUs they need for
> this, or have them on the roadmap. If they've got Intel CPUs, I'd be
> happy to help figure it out.
On Wed, Apr 30, 2025 at 11:07:35AM -0700, Dave Hansen wrote:
> Both AMD and Intel have hardware to do it. ARM CPUs do it too, I think.
> You can go buy the Intel hardware off the shelf today.

To be fair, the Intel RAR thing is pretty horrific :-( Definitely
sub-par compared to the AMD and ARM things.

Furthermore, the paper states it is a uarch feature for SPR with no
guarantee future uarchs will get it (and to be fair, I'd prefer it if
they didn't).

Furthermore, I suspect it will actually be slower than IPIs for anything
with more than 64 logical CPUs due to reduced parallelism.
On 5/2/25 02:55, Valentin Schneider wrote:
> My gripe with that was having two separate mechanisms
> - super early entry around SWITCH_TO_KERNEL_CR3
> - later entry at context tracking

What do you mean by "later entry"?

All of the paths to enter the kernel from userspace have some
SWITCH_TO_KERNEL_CR3 variant. If they didn't, the userspace that they
entered from could have attacked the kernel with Meltdown.

I'm theorizing that if this is _just_ about avoiding TLB flush IPIs that
you can get away with a single mechanism.
On 5/2/25 04:22, Peter Zijlstra wrote:
> On Wed, Apr 30, 2025 at 11:07:35AM -0700, Dave Hansen wrote:
>
>> Both AMD and Intel have hardware to do it. ARM CPUs do it too, I think.
>> You can go buy the Intel hardware off the shelf today.
> To be fair, the Intel RAR thing is pretty horrific 🙁 Definitely
> sub-par compared to the AMD and ARM things.
>
> Furthermore, the paper states it is a uarch feature for SPR with no
> guarantee future uarchs will get it (and to be fair, I'd prefer it if
> they didn't).

I don't think any of that is set in stone, fwiw. It should be entirely
possible to obtain a longer promise about its availability.

Or ask that AMD and Intel put their heads together in their fancy new
x86 advisory group and figure out a single way forward. If you're right
that RAR stinks and INVLPGB rocks, then it'll be an easy thing to
advise.

> Furthermore, I suspect it will actually be slower than IPIs for anything
> with more than 64 logical CPUs due to reduced parallelism.

Maybe my brain is crusty and I need to go back and read the spec, but I
remember RAR using the normal old APIC programming that normal old TLB
flush IPIs use. So they have similar restrictions. If it's inefficient
to program a wide IPI, it's also inefficient to program a RAR operation.
So the (theoretical) pro is that you program it like an IPI and it slots
into the IPI code fairly easily. But the con is that it has the same
limitations as IPIs.

I was actually concerned that INVLPGB won't be scalable. Since it
doesn't have the ability to target specific CPUs in the ISA, it
fundamentally needs to either have a mechanism to reach all CPUs, or
some way to know which TLB entries each CPU might have.

Maybe AMD has something super duper clever to limit the broadcast scope.
But if they don't, then a small range flush on a small number of CPUs
might end up being pretty expensive, relatively.

I don't think this is a big problem in Rik's series because he had a
floor on the size of processes that get INVLPGB applied. Also, if it
turns out to be a problem, it's dirt simple to revert back to IPIs for
problematic TLB flushes. But I am deeply curious how the system will
behave if there are a boatload of processes doing modestly-sized
INVLPGBs that only apply to a handful of CPUs on a very large system.

AMD and Intel came at this from very different angles (go figure). The
designs are prioritizing different things for sure. I can't wait to see
both of them fighting it out under real workloads.
On Fri, May 02, 2025 at 07:33:55AM -0700, Dave Hansen wrote:
> On 5/2/25 04:22, Peter Zijlstra wrote:
> > On Wed, Apr 30, 2025 at 11:07:35AM -0700, Dave Hansen wrote:
> >
> >> Both AMD and Intel have hardware to do it. ARM CPUs do it too, I think.
> >> You can go buy the Intel hardware off the shelf today.
> > To be fair, the Intel RAR thing is pretty horrific 🙁 Definitely
> > sub-par compared to the AMD and ARM things.
> >
> > Furthermore, the paper states it is a uarch feature for SPR with no
> > guarantee future uarchs will get it (and to be fair, I'd prefer it if
> > they didn't).
>
> I don't think any of that is set in stone, fwiw. It should be entirely
> possible to obtain a longer promise about its availability.
>
> Or ask that AMD and Intel put their heads together in their fancy new
> x86 advisory group and figure out a single way forward.

This might be a good thing regardless.

> > Furthermore, I suspect it will actually be slower than IPIs for anything
> > with more than 64 logical CPUs due to reduced parallelism.
>
> Maybe my brain is crusty and I need to go back and read the spec, but I
> remember RAR using the normal old APIC programming that normal old TLB
> flush IPIs use. So they have similar restrictions. If it's inefficient
> to program a wide IPI, it's also inefficient to program a RAR operation.
> So the (theoretical) pro is that you program it like an IPI and it slots
> into the IPI code fairly easily. But the con is that it has the same
> limitations as IPIs.

The problem is in the request structure. Sending an IPI is an async
action. You do, done. OTOH RAR has a request buffer where pending
requests are put and 'polled' for completion. This buffer does not have
room for more than 64 CPUs. This means that if you want to invalidate
across more, you need to do it in multiple batches.

So where IPI is:

 - IPI all CPUs
 - local invalidate
 - wait for completion

This then becomes:

 for ()
   - RAR some CPUs
   - wait for completion

Or so I thought to have understood; the paper isn't the easiest to read.

> I was actually concerned that INVLPGB won't be scalable. Since it
> doesn't have the ability to target specific CPUs in the ISA, it
> fundamentally needs to either have a mechanism to reach all CPUs, or
> some way to know which TLB entries each CPU might have.
>
> Maybe AMD has something super duper clever to limit the broadcast scope.
> But if they don't, then a small range flush on a small number of CPUs
> might end up being pretty expensive, relatively.

So the way I understand things:

Sending IPIs is sending a message on the interconnect. Mostly this is a
cacheline in size (because MESI). Sparc (v9?) has a fun feature where
you can actually put data payload in an IPI.

Now, we can target an IPI to a single CPU or to a (limited) set of CPUs
or broadcast to all CPUs. In fact, targeted IPIs might still be
broadcast IPIs, except most CPUs will ignore it because it doesn't
match them.

TLBI broadcast is like sending IPIs to all CPUs: the message goes out,
everybody sees it. Much like how snoop filters and the like function, a
CPU can process these messages async -- your CPU doesn't stall for a
cacheline invalidate message either (except of course if it is actively
using that line).

Same for TLBI: if the local TLB does not have anything that matches,
it's done. Even if it does match, as long as nothing makes active use
of it, it can just drop the TLB entry without disturbing the actual
core.

Only if the CPU has a matching TLB entry *and* it is active, then we
have options. One option is to interrupt the core, another option is to
wait for it to stop using it. IIUC the current AMD implementation does
the 'interrupt' thing.

One thing to consider in all this is that if we TLBI for an executable
page, we should very much also wipe the u-ops cache and all such related
structures -- ARM might have an 'issue' here.

That is, I think the TLBI problem is very similar to the I in MESI --
except possibly simpler, because E must not happen until all CPUs
acknowledge I etc. TLBI does not have this; it has until the next
TLBSYNC.

Anyway, I'm not a hardware person, but this is how I understand these
things to work.
On 02/05/25 06:53, Dave Hansen wrote:
> On 5/2/25 02:55, Valentin Schneider wrote:
>> My gripe with that was having two separate mechanisms
>> - super early entry around SWITCH_TO_KERNEL_CR3
>> - later entry at context tracking
>
> What do you mean by "later entry"?

I meant the point at which the deferred operation is run in the current
patches, i.e. ct_kernel_enter() - kernel entry from the PoV of context
tracking.

> All of the paths to enter the kernel from userspace have some
> SWITCH_TO_KERNEL_CR3 variant. If they didn't, the userspace that they
> entered from could have attacked the kernel with Meltdown.
>
> I'm theorizing that if this is _just_ about avoiding TLB flush IPIs that
> you can get away with a single mechanism.

So right now there would indeed be the TLB flush IPIs, but also the
text_poke() ones (sync_core() after patching text).

These are the two NOHZ-breaking IPIs that show up on my HP box, and that
I also got reports for from folks using NOHZ_FULL + CPU isolation in
production, mostly on SPR "edge enhanced" type of systems.

There have been some other sources of IPIs that have been fixed with an
ad-hoc solution - disable the mechanism for NOHZ_FULL CPUs or do it
differently such that an IPI isn't required, e.g.

  https://lore.kernel.org/lkml/ZJtBrybavtb1x45V@tpad/

While I don't expect the list to grow much, it's unfortunately not just
the TLB flush IPIs.
gah, the cc list here is rotund...

On 5/2/25 09:38, Valentin Schneider wrote:
...
>> All of the paths to enter the kernel from userspace have some
>> SWITCH_TO_KERNEL_CR3 variant. If they didn't, the userspace that they
>> entered from could have attacked the kernel with Meltdown.
>>
>> I'm theorizing that if this is _just_ about avoiding TLB flush IPIs that
>> you can get away with a single mechanism.
>
> So right now there would indeed be the TLB flush IPIs, but also the
> text_poke() ones (sync_core() after patching text).
>
> These are the two NOHZ-breaking IPIs that show up on my HP box, and that I
> also got reports for from folks using NOHZ_FULL + CPU isolation in
> production, mostly on SPR "edge enhanced" type of systems.
...
> While I don't expect the list to grow much, it's unfortunately not just the
> TLB flush IPIs.

Isn't text patching way easier than TLB flushes? You just need *some*
serialization. Heck, since TLB flushes are architecturally serializing,
you could probably even reuse the exact same mechanism: implement
deferred text patch serialization operations as a deferred TLB flush.

The hardest part is figuring out which CPUs are in the state where they
can be deferred or not. But you have to solve that in any case, and you
already have an algorithm to do it.
On 02/05/25 10:57, Dave Hansen wrote:
> gah, the cc list here is rotund...
>
> On 5/2/25 09:38, Valentin Schneider wrote:
> ...
>>> All of the paths to enter the kernel from userspace have some
>>> SWITCH_TO_KERNEL_CR3 variant. If they didn't, the userspace that they
>>> entered from could have attacked the kernel with Meltdown.
>>>
>>> I'm theorizing that if this is _just_ about avoiding TLB flush IPIs that
>>> you can get away with a single mechanism.
>>
>> So right now there would indeed be the TLB flush IPIs, but also the
>> text_poke() ones (sync_core() after patching text).
>>
>> These are the two NOHZ-breaking IPIs that show up on my HP box, and that I
>> also got reports for from folks using NOHZ_FULL + CPU isolation in
>> production, mostly on SPR "edge enhanced" type of systems.
> ...
>> While I don't expect the list to grow much, it's unfortunately not just the
>> TLB flush IPIs.
>
> Isn't text patching way easier than TLB flushes? You just need *some*
> serialization. Heck, since TLB flushes are architecturally serializing,
> you could probably even reuse the exact same mechanism: implement
> deferred text patch serialization operations as a deferred TLB flush.
>
> The hardest part is figuring out which CPUs are in the state where they
> can be deferred or not. But you have to solve that in any case, and you
> already have an algorithm to do it.

Alright, off to mess around with SWITCH_TO_KERNEL_CR3 to see how shoving
deferred operations there would look then.