
[v5,00/25] context_tracking,x86: Defer some IPIs until a user->kernel transition

Message ID 20250429113242.998312-1-vschneid@redhat.com

Message

Valentin Schneider April 29, 2025, 11:32 a.m. UTC
Context
=======

We've observed within Red Hat that isolated, NOHZ_FULL CPUs running a
pure-userspace application get regularly interrupted by IPIs sent from
housekeeping CPUs. Those IPIs are caused by activity on the housekeeping CPUs
leading to various on_each_cpu() calls, e.g.:

  64359.052209596    NetworkManager       0    1405     smp_call_function_many_cond (cpu=0, func=do_kernel_range_flush)
    smp_call_function_many_cond+0x1
    smp_call_function+0x39
    on_each_cpu+0x2a
    flush_tlb_kernel_range+0x7b
    __purge_vmap_area_lazy+0x70
    _vm_unmap_aliases.part.42+0xdf
    change_page_attr_set_clr+0x16a
    set_memory_ro+0x26
    bpf_int_jit_compile+0x2f9
    bpf_prog_select_runtime+0xc6
    bpf_prepare_filter+0x523
    sk_attach_filter+0x13
    sock_setsockopt+0x92c
    __sys_setsockopt+0x16a
    __x64_sys_setsockopt+0x20
    do_syscall_64+0x87
    entry_SYSCALL_64_after_hwframe+0x65

The heart of this series is the thought that while we cannot remove NOHZ_FULL
CPUs from the list of CPUs targeted by these IPIs, they may not have to execute
the callbacks immediately. Anything that only affects kernelspace can wait
until the next user->kernel transition, provided it can be executed "early
enough" in the entry code.

The original implementation is from Peter [1]. Nicolas then added kernel TLB
invalidation deferral to that [2], and I picked it up from there.

Deferral approach
=================

Storing each and every callback, as in a secondary call_single_queue, turned
out to be a no-go: the whole point of deferral is to keep NOHZ_FULL CPUs in
userspace for as long as possible, so no signal of any form would be sent
when deferring an IPI. This means that any form of queueing for deferred
callbacks would end up as a convoluted memory leak.

Deferred IPIs must thus be coalesced, which this series achieves by assigning
IPIs a "type" and having a mapping of IPI type to callback, leveraged upon
kernel entry.
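
For illustration, here is a minimal sketch of that mapping, assuming a dedicated
per-CPU word for the deferred-work bits; ct_work_flush() is where the series runs
this on kernel entry (see the call chain below), while the CT_WORK_* values and
ct_defer_work() are names made up for the example - the actual patches fold the
work bits into the existing context_tracking state instead:

  /* Sketch only, not the series' actual API. */
  #include <linux/atomic.h>
  #include <linux/bits.h>
  #include <linux/percpu.h>
  #include <asm/sync_core.h>
  #include <asm/tlbflush.h>

  /* One bit per deferrable IPI "type"; note that the callbacks take no argument. */
  enum ct_work {
          CT_WORK_SYNC_CORE = BIT(0), /* serialize after a remote text_poke() */
          CT_WORK_TLB_FLUSH = BIT(1), /* kernel TLB flush (still being worked on) */
  };

  static DEFINE_PER_CPU(atomic_t, ct_deferred_work);

  /* Sender side: coalesce the request instead of IPIing a NOHZ_FULL CPU. */
  static void ct_defer_work(int cpu, enum ct_work work)
  {
          atomic_or(work, per_cpu_ptr(&ct_deferred_work, cpu));
  }

  /* Entry side: run early in context tracking on each user->kernel transition. */
  static void ct_work_flush(void)
  {
          int work = atomic_xchg(this_cpu_ptr(&ct_deferred_work), 0);

          if (work & CT_WORK_SYNC_CORE)
                  sync_core();
          if (work & CT_WORK_TLB_FLUSH)
                  __flush_tlb_all(); /* local flush, no argument to carry */
  }

The point being that any number of deferred requests of the same type collapse
into a single bit, so nothing needs to be queued and nothing can leak.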

What about IPIs whose callbacks take a parameter, you may ask?

Peter suggested during OSPM23 [3] that since on_each_cpu() targets
housekeeping CPUs *and* isolated CPUs, isolated CPUs can access either global or
housekeeping-CPU-local state to "reconstruct" the data that would have been sent
via the IPI.

This series does not affect any IPI callback that requires an argument, but the
approach would remain the same (one coalescable callback executed on kernel
entry).
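
As a purely hypothetical illustration, take the kernel range flush from the trace
above: each request carries a range, but a deferred CPU can simply over-flush with
no argument at all and still end up correct. Simplified from arch/x86/mm/tlb.c
(details may differ), with deferred_kernel_tlb_flush() being a made-up name:

  #include <asm/tlbflush.h>

  /* Today: the IPI callback needs the range that was queued alongside the CSD. */
  static void do_kernel_range_flush(void *info)
  {
          struct flush_tlb_info *f = info;
          unsigned long addr;

          for (addr = f->start; addr < f->end; addr += PAGE_SIZE)
                  flush_tlb_one_kernel(addr);
  }

  /*
   * Deferred flavour: a single coalesced, argument-free operation run at
   * kernel entry, "reconstructing" the request as "some kernel range
   * changed, flush everything" - conservative but correct, modulo the
   * entry-code caveats discussed in the next section.
   */
  static void deferred_kernel_tlb_flush(void)
  {
          __flush_tlb_all();
  }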

Kernel entry vs execution of the deferred operation
===================================================

This is what I've referred to as the "Danger Zone" during my LPC24 talk [4].

There is a non-zero amount of code that is executed upon kernel entry before the
deferred operation can itself be executed (before we start getting into
context_tracking.c proper), i.e.:

  idtentry_func_foo()                  <--- we're in the kernel
    irqentry_enter()
      enter_from_user_mode()
        __ct_user_exit()
          ct_kernel_enter_state()
            ct_work_flush()            <--- deferred operation is executed here

This means one must take extra care about what can happen in the early entry code,
and ensure that <bad things> cannot happen. For instance, we really don't want to hit
instructions that have been modified by a remote text_poke() while we're on our
way to execute a deferred sync_core(). Patches doing the actual deferral have
more detail on this.
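
The flip side is that work may only be deferred for a CPU that is verifiably out
of the kernel at that moment; otherwise the sender must fall back to a regular
IPI. A rough sketch of that check follows, this time with the work bits living in
context_tracking.state as the actual patches do; ct_work_try_defer() and
CT_WORK_START are illustrative names, but the shape matches the series, which
keys the decision off CT_RCU_WATCHING (see the v5 changelog below):

  #include <linux/atomic.h>
  #include <linux/context_tracking_state.h>
  #include <linux/percpu.h>

  #define CT_WORK_START 16 /* illustrative bit offset for the work bits */

  /* Returns true if the work was deferred, false if a real IPI is needed. */
  static bool ct_work_try_defer(int cpu, int work)
  {
          struct context_tracking *ct = per_cpu_ptr(&context_tracking, cpu);
          int old = atomic_read(&ct->state);
          int new;

          do {
                  /* RCU is watching, i.e. the CPU is in the kernel: IPI it. */
                  if (old & CT_RCU_WATCHING)
                          return false;
                  new = old | (work << CT_WORK_START);
          } while (!atomic_try_cmpxchg(&ct->state, &old, new));

          /* ct_work_flush() will run the work on the next kernel entry. */
          return true;
  }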

Where are we at with this whole thing?
======================================

Dave has been incredibly helpful wrt figuring out what would and (mostly)
wouldn't be safe to do for deferring kernel range TLB flush IPIs, see [5].

Long story short, there are ugly things I can still do to (safely) defer the TLB
flush IPIs, but it's going to be a long session of pulling my own hair out, and
I got plenty so I won't be done for a while.

In the meantime, I think everything leading up to deferring text poke IPIs is
sane-ish and could get in. I'm not the biggest fan of adding an API with a
single user, but hey, I've been working on this for "a little while" now and
I'll still need to get the other IPIs sorted out.

TL;DR: Text patching IPI deferral LGTM so here it is for now; I'm still working
on the TLB flush thing.

Patches
=======

o Patches 1-2 are standalone objtool cleanups.

o Patches 3-4 add an RCU testing feature.

o Patches 5-6 add infrastructure for annotating static keys and static calls
  that may be used in noinstr code (courtesy of Josh).
o Patches 7-20 use said annotations on relevant keys / calls.
o Patch 21 enforces proper usage of said annotations (courtesy of Josh).

o Patches 22-23 deal with detecting NOINSTR text in modules.

o Patches 24-25 add the actual IPI deferral faff.

Patches are also available at:
https://gitlab.com/vschneid/linux.git -b redhat/isolirq/defer/v5

Testing
=======

Xeon E5-2699 system with SMT off, NOHZ_FULL, isolated CPUs.
RHEL10 userspace.

Workload is using rteval (kernel compilation + hackbench) on housekeeping CPUs
and a dummy stay-in-userspace loop on the isolated CPUs. The main invocation is:

$ trace-cmd record -e "csd_queue_cpu"     -f "cpu & CPUS{$ISOL_CPUS}" \
                   -e "ipi_send_cpumask"  -f "cpumask & CPUS{$ISOL_CPUS}" \
                   -e "ipi_send_cpu"      -f "cpu & CPUS{$ISOL_CPUS}" \
                   rteval --onlyload --loads-cpulist=$HK_CPUS \
                          --hackbench-runlowmem=True --duration=$DURATION

This only records IPIs sent to isolated CPUs, so any event there is interference
(with a bit of fuzz at the start/end of the workload when spawning the
processes). All tests were done with a duration of 3 hours.

v6.14
# This is the actual IPI count
$ trace-cmd report | grep callback | awk '{ print $(NF) }' | sort | uniq -c | sort -nr
     93 callback=generic_smp_call_function_single_interrupt+0x0
     22 callback=nohz_full_kick_func+0x0
     
# These are the different CSDs that caused IPIs
$ trace-cmd report | grep csd_queue | awk '{ print $(NF-1) }' | sort | uniq -c | sort -nr
   1456 func=do_flush_tlb_all
     78 func=do_sync_core
     33 func=nohz_full_kick_func
     26 func=do_kernel_range_flush

v6.14 + patches
# This is the actual IPI count
$ trace-cmd report | grep callback | awk '{ print $(NF) }' | sort | uniq -c | sort -nr
     86 callback=generic_smp_call_function_single_interrupt+0x0
     41 callback=nohz_full_kick_func+0x0

# These are the different CSDs that caused IPIs
$ trace-cmd report | grep csd_queue | awk '{ print $(NF-1) }' | sort | uniq -c | sort -nr
   1378 func=do_flush_tlb_all
     33 func=nohz_full_kick_func

So the TLB flush is still there driving most of the IPIs, but at least the
instruction patching IPIs are gone. With kernel TLB flushes deferred, there are
no IPIs sent to isolated CPUs in that 3hr window, but as stated above that still
needs some more work.
     
Also note that tlb_remove_table_smp_sync() showed up during testing of v3, and
has gone away as mysteriously as it showed up. Yair had a series addressing this [6]
which per these results would be worth revisiting.

Acknowledgements
================

Special thanks to:
o Clark Williams for listening to my ramblings about this and throwing ideas my way
o Josh Poimboeuf for all his help with everything objtool-related
o All of the folks who attended various (too many?) talks about this and
  provided precious feedback.  
o The mm folks for pointing out what I can and can't do with TLB flushes

Links
=====

[1]: https://lore.kernel.org/all/20210929151723.162004989@infradead.org/
[2]: https://github.com/vianpl/linux.git -b ct-work-defer-wip
[3]: https://youtu.be/0vjE6fjoVVE
[4]: https://lpc.events/event/18/contributions/1889/
[5]: http://lore.kernel.org/r/eef09bdc-7546-462b-9ac0-661a44d2ceae@intel.com
[6]: https://lore.kernel.org/lkml/20230620144618.125703-1-ypodemsk@redhat.com/

Revisions
=========

v4 -> v5
++++++++

o Rebased onto v6.15-rc3
o Collected Reviewed-by

o Annotated a few more static keys
o Added proper checking of noinstr sections that are in loadable code such as
  KVM early entry (Sean Christopherson)

o Switched to checking for CT_RCU_WATCHING instead of CT_STATE_KERNEL or
  CT_STATE_IDLE, which means deferral is now behaving sanely for IRQ/NMI
  entry from idle (thanks to Frederic!)

o Ditched the vmap TLB flush deferral (for now)  
  

RFCv3 -> v4
+++++++++++

o Rebased onto v6.13-rc6

o New objtool patches from Josh
o More .noinstr static key/call patches
o Static calls now handled as well (again thanks to Josh)

o Fixed clearing the work bits on kernel exit
o Messed with IRQ hitting an idle CPU vs context tracking
o Various comment and naming cleanups

o Made RCU_DYNTICKS_TORTURE depend on !COMPILE_TEST (PeterZ)
o Fixed the CT_STATE_KERNEL check when setting a deferred work (Frederic)
o Cleaned up the __flush_tlb_all() mess thanks to PeterZ

RFCv2 -> RFCv3
++++++++++++++

o Rebased onto v6.12-rc6

o Added objtool documentation for the new warning (Josh)
o Added low-size RCU watching counter to TREE04 torture scenario (Paul)
o Added FORCEFUL jump label and static key types
o Added noinstr-compliant helpers for tlb flush deferral


RFCv1 -> RFCv2
++++++++++++++

o Rebased onto v6.5-rc1

o Updated the trace filter patches (Steven)

o Fixed __ro_after_init keys used in modules (Peter)
o Dropped the extra context_tracking atomic, squashed the new bits in the
  existing .state field (Peter, Frederic)
  
o Added an RCU_EXPERT config for the RCU dynticks counter size, and added an
  rcutorture case for a low-size counter (Paul) 

o Fixed flush_tlb_kernel_range_deferrable() definition

Josh Poimboeuf (3):
  jump_label: Add annotations for validating noinstr usage
  static_call: Add read-only-after-init static calls
  objtool: Add noinstr validation for static branches/calls

Valentin Schneider (22):
  objtool: Make validate_call() recognize indirect calls to pv_ops[]
  objtool: Flesh out warning related to pv_ops[] calls
  rcu: Add a small-width RCU watching counter debug option
  rcutorture: Make TREE04 use CONFIG_RCU_DYNTICKS_TORTURE
  x86/paravirt: Mark pv_sched_clock static call as __ro_after_init
  x86/idle: Mark x86_idle static call as __ro_after_init
  x86/paravirt: Mark pv_steal_clock static call as __ro_after_init
  riscv/paravirt: Mark pv_steal_clock static call as __ro_after_init
  loongarch/paravirt: Mark pv_steal_clock static call as __ro_after_init
  arm64/paravirt: Mark pv_steal_clock static call as __ro_after_init
  arm/paravirt: Mark pv_steal_clock static call as __ro_after_init
  perf/x86/amd: Mark perf_lopwr_cb static call as __ro_after_init
  sched/clock: Mark sched_clock_running key as __ro_after_init
  KVM: VMX: Mark __kvm_is_using_evmcs static key as __ro_after_init
  x86/speculation/mds: Mark mds_idle_clear key as allowed in .noinstr
  sched/clock, x86: Mark __sched_clock_stable key as allowed in .noinstr
  KVM: VMX: Mark vmx_l1d_should_flush and vmx_l1d_flush_cond keys as
    allowed in .noinstr
  stackleak: Mark stack_erasing_bypass key as allowed in .noinstr
  module: Remove outdated comment about text_size
  module: Add MOD_NOINSTR_TEXT mem_type
  context-tracking: Introduce work deferral infrastructure
  context_tracking,x86: Defer kernel text patching IPIs

 arch/Kconfig                                  |   9 ++
 arch/arm/kernel/paravirt.c                    |   2 +-
 arch/arm64/kernel/paravirt.c                  |   2 +-
 arch/loongarch/kernel/paravirt.c              |   2 +-
 arch/riscv/kernel/paravirt.c                  |   2 +-
 arch/x86/Kconfig                              |   1 +
 arch/x86/events/amd/brs.c                     |   2 +-
 arch/x86/include/asm/context_tracking_work.h  |  18 +++
 arch/x86/include/asm/text-patching.h          |   1 +
 arch/x86/kernel/alternative.c                 |  39 ++++++-
 arch/x86/kernel/cpu/bugs.c                    |   2 +-
 arch/x86/kernel/kprobes/core.c                |   4 +-
 arch/x86/kernel/kprobes/opt.c                 |   4 +-
 arch/x86/kernel/module.c                      |   2 +-
 arch/x86/kernel/paravirt.c                    |   4 +-
 arch/x86/kernel/process.c                     |   2 +-
 arch/x86/kvm/vmx/vmx.c                        |  11 +-
 arch/x86/kvm/vmx/vmx_onhyperv.c               |   2 +-
 include/asm-generic/sections.h                |  15 +++
 include/linux/context_tracking.h              |  21 ++++
 include/linux/context_tracking_state.h        |  54 +++++++--
 include/linux/context_tracking_work.h         |  26 +++++
 include/linux/jump_label.h                    |  30 ++++-
 include/linux/module.h                        |   6 +-
 include/linux/objtool.h                       |   7 ++
 include/linux/static_call.h                   |  19 ++++
 kernel/context_tracking.c                     |  69 +++++++++++-
 kernel/kprobes.c                              |   8 +-
 kernel/module/main.c                          |  85 ++++++++++----
 kernel/rcu/Kconfig.debug                      |  15 +++
 kernel/sched/clock.c                          |   7 +-
 kernel/stackleak.c                            |   6 +-
 kernel/time/Kconfig                           |   5 +
 tools/objtool/Documentation/objtool.txt       |  34 ++++++
 tools/objtool/check.c                         | 106 +++++++++++++++---
 tools/objtool/include/objtool/check.h         |   1 +
 tools/objtool/include/objtool/elf.h           |   1 +
 tools/objtool/include/objtool/special.h       |   1 +
 tools/objtool/special.c                       |  15 ++-
 .../selftests/rcutorture/configs/rcu/TREE04   |   1 +
 40 files changed, 557 insertions(+), 84 deletions(-)
 create mode 100644 arch/x86/include/asm/context_tracking_work.h
 create mode 100644 include/linux/context_tracking_work.h

--
2.49.0

Comments

Steven Rostedt April 30, 2025, 7:42 p.m. UTC | #1
On Wed, 30 Apr 2025 11:07:35 -0700
Dave Hansen <dave.hansen@intel.com> wrote:

> On 4/30/25 10:20, Steven Rostedt wrote:
> > On Tue, 29 Apr 2025 09:11:57 -0700
> > Dave Hansen <dave.hansen@intel.com> wrote:
> >   
> >> I don't think we should do this series.  
> > 
> > Could you provide more rationale for your decision?
> 
> I talked about it a bit in here:
> 
> > https://lore.kernel.org/all/408ebd8b-4bfb-4c4f-b118-7fe853c6e897@intel.com/  

Hmm, that's easily missed. But thanks for linking it.

> 
> But, basically, this series puts a new onus on the entry code: it can't
> touch the vmalloc() area ... except the LDT ... and except the PEBS
> buffers. If anyone touches vmalloc()'d memory (or anything else that
> eventually gets deferred), they crash. They _only_ crash on these
> NOHZ_FULL systems.
> 
> Putting new restrictions on the entry code is really nasty. Let's say a
> new hardware feature showed up that touched vmalloc()'d memory in the
> entry code. Probably, nobody would notice until they got that new
> hardware and tried to do a NOHZ_FULL workload. It might take years to
> uncover, once that hardware was out in the wild.
> 
> I have a substantial number of gray hairs from dealing with corner cases
> in the entry code.
> 
> You _could_ make it more debuggable. Could you make this work for all
> tasks, not just NOHZ_FULL? The same logic _should_ apply. It would be
> inefficient, but would provide good debugging coverage.
> 
> I also mentioned this earlier, but PTI could be leveraged here to ensure
> that the TLB is flushed properly. You could have the rule that anything
> mapped into the user page table can't have a deferred flush and then do
> deferred flushes at SWITCH_TO_KERNEL_CR3 time. Yeah, that's in
> arch-specific assembly, but it's a million times easier to reason about
> because the window where a deferred-flush allocation might bite you is
> so small.
> 
> Look at the syscall code for instance:
> 
> > SYM_CODE_START(entry_SYSCALL_64)
> >         swapgs
> >         movq    %rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)
> >         SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp  
> 
> You can _trivially_ audit this and know that swapgs doesn't touch memory
> and that as long as PER_CPU_VAR()s and the process stack don't have
> their mappings munged and flushes deferred that this would be correct.

Hmm, so there is still a path for this?

At least if it added more ways to debug it, and some other changes to make
the locations where vmalloc is dangerous smaller?

> 
> >> If folks want this functionality, they should get a new CPU that can
> >> flush the TLB without IPIs.  
> > 
> > That's a pretty heavy handed response. I'm not sure that's always a
> > feasible solution.
> > 
> > From my experience in the world, software has always been around to fix the
> > hardware, not the other way around ;-)  
> 
> Both AMD and Intel have hardware to do it. ARM CPUs do it too, I think.
> You can go buy the Intel hardware off the shelf today.

Sure, but changing CPUs on machines is not always that feasible either.

-- Steve
Dave Hansen April 30, 2025, 8 p.m. UTC | #2
On 4/30/25 12:42, Steven Rostedt wrote:
>> Look at the syscall code for instance:
>>
>>> SYM_CODE_START(entry_SYSCALL_64)
>>>         swapgs
>>>         movq    %rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)
>>>         SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp  
>> You can _trivially_ audit this and know that swapgs doesn't touch memory
>> and that as long as PER_CPU_VAR()s and the process stack don't have
>> their mappings munged and flushes deferred that this would be correct.
> Hmm, so there is still a path for this?
> 
> At least if it added more ways to debug it, and some other changes to make
> the locations where vmalloc is dangerous smaller?

Being able to debug it would be a good start. But, more generally, what
we need is for more people to be able to run the code in the first
place. Would a _normal_ system (without setups that are trying to do
NOHZ_FULL) ever be able to defer TLB flush IPIs?

If the answer is no, then, yeah, I'll settle for some debugging options.

But if you shrink the window as small as I'm talking about, it would
look very different from this series.

For instance, imagine when a CPU goes into the NOHZ mode. Could it just
unconditionally flush the TLB on the way back into the kernel (in the
same SWITCH_TO_KERNEL_CR3 spot)? Yeah, it'll make entry into the kernel
expensive for NOHZ tasks, but it's not *THAT* bad. And if the entire
point of a NOHZ_FULL task is to minimize the number of kernel entries
then a little extra overhead there doesn't sound too bad.

Also, about the new hardware, I suspect there's some mystery customer
lurking in the shadows asking folks for this functionality. Could you at
least go _talk_ to the mystery customer(s) and see which hardware they
care about? They might already even have the magic CPUs they need for
this, or have them on the roadmap. If they've got Intel CPUs, I'd be
happy to help figure it out.
Valentin Schneider May 2, 2025, 9:55 a.m. UTC | #3
On 30/04/25 13:00, Dave Hansen wrote:
> On 4/30/25 12:42, Steven Rostedt wrote:
>>> Look at the syscall code for instance:
>>>
>>>> SYM_CODE_START(entry_SYSCALL_64)
>>>>         swapgs
>>>>         movq    %rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)
>>>>         SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
>>> You can _trivially_ audit this and know that swapgs doesn't touch memory
>>> and that as long as PER_CPU_VAR()s and the process stack don't have
>>> their mappings munged and flushes deferred that this would be correct.
>> Hmm, so there is still a path for this?
>>
>> At least if it added more ways to debug it, and some other changes to make
>> the locations where vmalloc is dangerous smaller?
>
> Being able to debug it would be a good start. But, more generally, what
> we need is for more people to be able to run the code in the first
> place. Would a _normal_ system (without setups that are trying to do
> NOHZ_FULL) ever be able to defer TLB flush IPIs?
>
> If the answer is no, then, yeah, I'll settle for some debugging options.
>
> But if you shrink the window as small as I'm talking about, it would
> look very different from this series.
>
> For instance, imagine when a CPU goes into the NOHZ mode. Could it just
> unconditionally flush the TLB on the way back into the kernel (in the
> same SWITCH_TO_KERNEL_CR3 spot)? Yeah, it'll make entry into the kernel
> expensive for NOHZ tasks, but it's not *THAT* bad. And if the entire
> point of a NOHZ_FULL task is to minimize the number of kernel entries
> then a little extra overhead there doesn't sound too bad.
>

Right, so my thought per your previous comments was to special-case the
TLB flush, depend on kPTI and do it unconditionally in SWITCH_TO_KERNEL_CR3
just like you've described - but keep the context tracking mechanism for
other deferrable operations.

My gripe with that was having two separate mechanisms:
- super early entry (around SWITCH_TO_KERNEL_CR3)
- later entry (at context tracking)

Shifting everything to SWITCH_TO_KERNEL_CR3 means we lose the
context_tracking infra to dynamically defer operations (atomically reading
and writing to context_tracking.state), which means we unconditionally run
all possible deferrable operations. This doesn't scream scalable, even
though as you say NOHZ_FULL kernel entry is already a "you lose" situation.

Yet another option is to duplicate the context tracking state specifically
for IPI deferral and have it driven in/by SWITCH_TO_KERNEL_CR3, which is
also not super savoury.

I suppose I can start poking around running deferred ops in that
SWITCH_TO_KERNEL_CR3 region, and add state/infra on top. Let's see where
this gets me :-)

Again, thanks for the insight and the suggestions Dave!

> Also, about the new hardware, I suspect there's some mystery customer
> lurking in the shadows asking folks for this functionality. Could you at
> least go _talk_ to the mystery customer(s) and see which hardware they
> care about? They might already even have the magic CPUs they need for
> this, or have them on the roadmap. If they've got Intel CPUs, I'd be
> happy to help figure it out.
Peter Zijlstra May 2, 2025, 11:22 a.m. UTC | #4
On Wed, Apr 30, 2025 at 11:07:35AM -0700, Dave Hansen wrote:

> Both AMD and Intel have hardware to do it. ARM CPUs do it too, I think.
> You can go buy the Intel hardware off the shelf today.

To be fair, the Intel RAR thing is pretty horrific :-( Definitely
sub-par compared to the AMD and ARM things.

Furthermore, the paper states it is a uarch feature for SPR with no
guarantee future uarchs will get it (and to be fair, I'd prefer it if
they didn't).

Furthermore, I suspect it will actually be slower than IPIs for anything
with more than 64 logical CPUs due to reduced parallelism.
Dave Hansen May 2, 2025, 1:53 p.m. UTC | #5
On 5/2/25 02:55, Valentin Schneider wrote:
> My gripe with that was having two separate mechanisms:
> - super early entry (around SWITCH_TO_KERNEL_CR3)
> - later entry (at context tracking)

What do you mean by "later entry"?

All of the paths to enter the kernel from userspace have some
SWITCH_TO_KERNEL_CR3 variant. If they didn't, the userspace that they
entered from could have attacked the kernel with Meltdown.

I'm theorizing that if this is _just_ about avoiding TLB flush IPIs that
you can get away with a single mechanism.
Dave Hansen May 2, 2025, 2:33 p.m. UTC | #6
On 5/2/25 04:22, Peter Zijlstra wrote:
> On Wed, Apr 30, 2025 at 11:07:35AM -0700, Dave Hansen wrote:
> 
>> Both AMD and Intel have hardware to do it. ARM CPUs do it too, I think.
>> You can go buy the Intel hardware off the shelf today.
> To be fair, the Intel RAR thing is pretty horrific :-( Definitely
> sub-par compared to the AMD and ARM things.
> 
> Furthermore, the paper states it is a uarch feature for SPR with no
> guarantee future uarchs will get it (and to be fair, I'd prefer it if
> they didn't).

I don't think any of that is set in stone, fwiw. It should be entirely
possible to obtain a longer promise about its availability.

Or ask that AMD and Intel put their heads together in their fancy new
x86 advisory group and figure out a single way forward. If you're right
that RAR stinks and INVLPGB rocks, then it'll be an easy thing to advise.

> Furthermore, I suspect it will actually be slower than IPIs for anything
> with more than 64 logical CPUs due to reduced parallelism.

Maybe my brain is crusty and I need to go back and read the spec, but I
remember RAR using the normal old APIC programming that normal old TLB
flush IPIs use. So they have similar restrictions. If it's inefficient
to program a wide IPI, it's also inefficient to program a RAR operation.
So the (theoretical) pro is that you program it like an IPI and it slots
into the IPI code fairly easily. But the con is that it has the same
limitations as IPIs.

I was actually concerned that INVLPGB won't be scalable. Since it
doesn't have the ability to target specific CPUs in the ISA, it
fundamentally need to either have a mechanism to reach all CPUs, or some
way to know which TLB entries each CPU might have.

Maybe AMD has something super duper clever to limit the broadcast scope.
But if they don't, then a small range flush on a small number of CPUs
might end up being pretty expensive, relatively.

I don't think this is a big problem in Rik's series because he had a
floor on the size of processes that get INVLPGB applied. Also, if it
turns out to be a problem, it's dirt simple to revert back to IPIs for
problematic TLB flushes.

But I am deeply curious how the system will behave if there are a
boatload of processes doing modestly-sized INVLPGBs that only apply to a
handful of CPUs on a very large system.

AMD and Intel came at this from very different angles (go figure). The
designs are prioritizing different things for sure. I can't wait to see
both of them fighting it out under real workloads.
Peter Zijlstra May 2, 2025, 3:20 p.m. UTC | #7
On Fri, May 02, 2025 at 07:33:55AM -0700, Dave Hansen wrote:
> On 5/2/25 04:22, Peter Zijlstra wrote:
> > On Wed, Apr 30, 2025 at 11:07:35AM -0700, Dave Hansen wrote:
> > 
> >> Both AMD and Intel have hardware to do it. ARM CPUs do it too, I think.
> >> You can go buy the Intel hardware off the shelf today.
> >> To be fair, the Intel RAR thing is pretty horrific :-( Definitely
> > sub-par compared to the AMD and ARM things.
> > 
> > Furthermore, the paper states it is a uarch feature for SPR with no
> > guarantee future uarchs will get it (and to be fair, I'd prefer it if
> > they didn't).
> 
> I don't think any of that is set in stone, fwiw. It should be entirely
> possible to obtain a longer promise about its availability.
> 
> Or ask that AMD and Intel put their heads together in their fancy new
> x86 advisory group and figure out a single way forward. 

This might be a good thing regardless.

> > Furthermore, I suspect it will actually be slower than IPIs for anything
> > with more than 64 logical CPUs due to reduced parallelism.
> 
> Maybe my brain is crusty and I need to go back and read the spec, but I
> remember RAR using the normal old APIC programming that normal old TLB
> flush IPIs use. So they have similar restrictions. If it's inefficient
> to program a wide IPI, it's also inefficient to program a RAR operation.
> So the (theoretical) pro is that you program it like an IPI and it slots
> into the IPI code fairly easily. But the con is that it has the same
> limitations as IPIs.

The problem is in the request structure. Sending an IPI is an async
action. You do, done.

OTOH RAR has a request buffer where pending requests are put and 'polled'
for completion. This buffer does not have room for more than 64 CPUs.

This means that if you want to invalidate across more, you need to do it
in multiple batches.

So where IPI is:

 - IPI all CPUs
 - local invalidate
 - wait for completion

This then becomes:

 for ()
   - RAR some CPUs
   - wait for completion

Or so I thought to have understood; the paper isn't the easiest to read.

> I was actually concerned that INVLPGB won't be scalable. Since it
> doesn't have the ability to target specific CPUs in the ISA, it
> fundamentally need to either have a mechanism to reach all CPUs, or some
> way to know which TLB entries each CPU might have.
> 
> Maybe AMD has something super duper clever to limit the broadcast scope.
> But if they don't, then a small range flush on a small number of CPUs
> might end up being pretty expensive, relatively.

So the way I understand things:

Sending IPIs is sending a message on the interconnect. Mostly this is a
cacheline in size (because MESI). Sparc (v9?) has a fun feature where
you can actually put data payload in an IPI.

Now, we can target an IPI to a single CPU or to a (limited) set of CPUs,
or broadcast it to all CPUs. In fact, targeted IPIs might still be
broadcast IPIs, except most CPUs will ignore them because they don't match.

TLBI broadcast is like sending IPIs to all CPUs, the message goes out,
everybody sees it.

Much like how snoop filters and the like function, a CPU can process
these messages async -- your CPU doesn't stall for a cacheline
invalidate message either (except of course if it is actively using that
line). Same for TLBI, if the local TLB does not have anything that
matches, its done. Even if it does match, as long as nothing makes
active use of it, it can just drop the TLB entry without disturbing the
actual core.

Only if the CPU has a matching TLB entry *and* it is active, then we
have options. One option is to interrupt the core, another option is to
wait for it to stop using it.

IIUC the current AMD implementation does the 'interrupt' thing.

One thing to consider in all this is that if we TLBI for an executable
page, we should very much also wipe the u-ops cache and all such related
structures -- ARM might have an 'issue' here.

That is, I think the TLBI problem is very similar to the I in MESI --
except possibly simpler, because E must not happen until all CPUs
acknowledge I etc. TLBI does not have this, it has until the next
TLBSYNC.

Anyway, I'm not a hardware person, but this is how I understand these
things to work.
Valentin Schneider May 2, 2025, 4:38 p.m. UTC | #8
On 02/05/25 06:53, Dave Hansen wrote:
> On 5/2/25 02:55, Valentin Schneider wrote:
>> My gripe with that was having two separate mechanisms:
>> - super early entry (around SWITCH_TO_KERNEL_CR3)
>> - later entry (at context tracking)
>
> What do you mean by "later entry"?
>

I meant the point at which the deferred operation is run in the current
patches, i.e. ct_kernel_enter() - kernel entry from the PoV of context
tracking.

> All of the paths to enter the kernel from userspace have some
> SWITCH_TO_KERNEL_CR3 variant. If they didn't, the userspace that they
> entered from could have attacked the kernel with Meltdown.
>
> I'm theorizing that if this is _just_ about avoiding TLB flush IPIs that
> you can get away with a single mechanism.

So right now there would indeed be the TLB flush IPIs, but also the
text_poke() ones (sync_core() after patching text).

These are the two NOHZ-breaking IPIs that show up on my HP box, and that I
also got reports for from folks using NOHZ_FULL + CPU isolation in
production, mostly on SPR "edge enhanced" type of systems.

There have been other sources of IPIs that were fixed with ad-hoc solutions -
disabling the mechanism for NOHZ_FULL CPUs, or doing things differently such
that an IPI isn't required, e.g.

  https://lore.kernel.org/lkml/ZJtBrybavtb1x45V@tpad/

While I don't expect the list to grow much, it's unfortunately not just the
TLB flush IPIs.
Dave Hansen May 2, 2025, 5:57 p.m. UTC | #9
gah, the cc list here is rotund...

On 5/2/25 09:38, Valentin Schneider wrote:
...
>> All of the paths to enter the kernel from userspace have some
>> SWITCH_TO_KERNEL_CR3 variant. If they didn't, the userspace that they
>> entered from could have attacked the kernel with Meltdown.
>>
>> I'm theorizing that if this is _just_ about avoiding TLB flush IPIs that
>> you can get away with a single mechanism.
> 
> So right now there would indeed be the TLB flush IPIs, but also the
> text_poke() ones (sync_core() after patching text).
> 
> These are the two NOHZ-breaking IPIs that show up on my HP box, and that I
> also got reports for from folks using NOHZ_FULL + CPU isolation in
> production, mostly on SPR "edge enhanced" type of systems.
...
> While I don't expect the list to grow much, it's unfortunately not just the
> TLB flush IPIs.

Isn't text patching way easier than TLB flushes? You just need *some*
serialization. Heck, since TLB flushes are architecturally serializing,
you could probably even reuse the exact same mechanism: implement
deferred text patch serialization operations as a deferred TLB flush.

The hardest part is figuring out which CPUs are in the state where they
can be deferred or not. But you have to solve that in any case, and you
already have an algorithm to do it.
Valentin Schneider May 5, 2025, 3:45 p.m. UTC | #10
On 02/05/25 10:57, Dave Hansen wrote:
> gah, the cc list here is rotund...
>
> On 5/2/25 09:38, Valentin Schneider wrote:
> ...
>>> All of the paths to enter the kernel from userspace have some
>>> SWITCH_TO_KERNEL_CR3 variant. If they didn't, the userspace that they
>>> entered from could have attacked the kernel with Meltdown.
>>>
>>> I'm theorizing that if this is _just_ about avoiding TLB flush IPIs that
>>> you can get away with a single mechanism.
>>
>> So right now there would indeed be the TLB flush IPIs, but also the
>> text_poke() ones (sync_core() after patching text).
>>
>> These are the two NOHZ-breaking IPIs that show up on my HP box, and that I
>> also got reports for from folks using NOHZ_FULL + CPU isolation in
>> production, mostly on SPR "edge enhanced" type of systems.
> ...
>> While I don't expect the list to grow much, it's unfortunately not just the
>> TLB flush IPIs.
>
> Isn't text patching way easier than TLB flushes? You just need *some*
> serialization. Heck, since TLB flushes are architecturally serializing,
> you could probably even reuse the exact same mechanism: implement
> deferred text patch serialization operations as a deferred TLB flush.
>
> The hardest part is figuring out which CPUs are in the state where they
> can be deferred or not. But you have to solve that in any case, and you
> already have an algorithm to do it.

Alright, off to mess around with SWITCH_TO_KERNEL_CR3 to see how shoving
deferred operations in there would look then.