Message ID | 20230720163056.2564824-19-vschneid@redhat.com
State      | New
Series     | context_tracking,x86: Defer some IPIs until a user->kernel transition
On Tue, Jul 25, 2023 at 06:49:45AM -0400, Joel Fernandes wrote:
> Interesting series Valentin. Some high-level question/comments on this one:
>
> > On Jul 20, 2023, at 12:34 PM, Valentin Schneider <vschneid@redhat.com> wrote:
> >
> > text_poke_bp_batch() sends IPIs to all online CPUs to synchronize
> > them vs the newly patched instruction. CPUs that are executing in userspace
> > do not need this synchronization to happen immediately, and this is
> > actually harmful interference for NOHZ_FULL CPUs.
>
> Does the amount of harm not correspond to the practical frequency of text_poke?
> How often does instruction patching really happen? If it is very infrequent
> then I am not sure if it is that harmful.

Well, it can happen quite a bit, also from things people would not
typically 'expect' it.

For instance, the moment you create the first per-task perf event we
frob some jump-labels (and again some seconds after the last one goes
away).

The same for a bunch of runtime network configurations.

> > As the synchronization IPIs are sent using a blocking call, returning from
> > text_poke_bp_batch() implies all CPUs will observe the patched
> > instruction(s), and this should be preserved even if the IPI is deferred.
> > In other words, to safely defer this synchronization, any kernel
> > instruction leading to the execution of the deferred instruction
> > sync (ct_work_flush()) must *not* be mutable (patchable) at runtime.
>
> If it is not infrequent, then are you handling the case where userland
> spends multiple seconds before entering the kernel, and all this while
> the blocking call waits? Perhaps in such a situation you want the real IPI
> to be sent out instead of the deferred one?

Please re-read what Valentin wrote -- nobody is waiting on anything.
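For readers unfamiliar with the mechanism Peter describes: enabling or disabling a static key rewrites every inlined branch site through text_poke_bp_batch(), which is what triggers the sync IPIs under discussion. A minimal sketch of the pattern, with generic names rather than the actual perf symbols:

#include <linux/jump_label.h>

/* Illustrative names only; perf's real keys and helpers differ. */
DEFINE_STATIC_KEY_FALSE(my_feature_key);

static void do_slow_accounting(void) { /* hypothetical slow path */ }

void my_feature_hot_path(void)
{
        /* Compiles to a NOP until the key is enabled. */
        if (static_branch_unlikely(&my_feature_key))
                do_slow_accounting();
}

void my_feature_first_user(void)
{
        /*
         * The 0 -> 1 transition patches every branch site at runtime,
         * IPIing all online CPUs to synchronize their instruction view.
         */
        static_branch_inc(&my_feature_key);
}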
On 7/25/23 09:39, Peter Zijlstra wrote:
> On Tue, Jul 25, 2023 at 06:49:45AM -0400, Joel Fernandes wrote:
>> Interesting series Valentin. Some high-level question/comments on this one:
>>
>>> On Jul 20, 2023, at 12:34 PM, Valentin Schneider <vschneid@redhat.com> wrote:
>>>
>>> text_poke_bp_batch() sends IPIs to all online CPUs to synchronize
>>> them vs the newly patched instruction. CPUs that are executing in userspace
>>> do not need this synchronization to happen immediately, and this is
>>> actually harmful interference for NOHZ_FULL CPUs.
>>
>> Does the amount of harm not correspond to the practical frequency of text_poke?
>> How often does instruction patching really happen? If it is very infrequent
>> then I am not sure if it is that harmful.
>
> Well, it can happen quite a bit, also from things people would not
> typically 'expect' it.
>
> For instance, the moment you create the first per-task perf event we
> frob some jump-labels (and again some seconds after the last one goes
> away).
>
> The same for a bunch of runtime network configurations.

Ok cool. I guess I still have memories of that old ARM device I had where
modifications to kernel text were forbidden by hardware (it was a security
feature). That made kprobes unusable...

>>> As the synchronization IPIs are sent using a blocking call, returning from
>>> text_poke_bp_batch() implies all CPUs will observe the patched
>>> instruction(s), and this should be preserved even if the IPI is deferred.
>>> In other words, to safely defer this synchronization, any kernel
>>> instruction leading to the execution of the deferred instruction
>>> sync (ct_work_flush()) must *not* be mutable (patchable) at runtime.
>>
>> If it is not infrequent, then are you handling the case where userland
>> spends multiple seconds before entering the kernel, and all this while
>> the blocking call waits? Perhaps in such a situation you want the real IPI
>> to be sent out instead of the deferred one?
>
> Please re-read what Valentin wrote -- nobody is waiting on anything.

Makes sense. To be fair, I received his email 3 minutes before yours ;-).
But thank you both for clarifying!

- Joel
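The invariant Valentin states above is what makes the deferral safe: the IPI can be skipped for a userspace CPU only if "is this CPU in userspace?" and "queue the work" happen atomically, so the CPU cannot slip into the kernel and miss the sync. A minimal sketch of that idea, assuming (as earlier patches in the series arrange) that the work bits share an atomic word with the context-tracking state; all names and bit layouts below are illustrative, not the series' exact code:

#include <linux/atomic.h>

#define CT_STATE_USER   0x1     /* assumed state encoding */
#define CT_WORK_SHIFT   8       /* assumed position of the work bits */

static bool ct_set_cpu_work_sketch(atomic_t *ct_state, unsigned int work)
{
        int old = atomic_read(ct_state);

        /* Retry for as long as the remote CPU stays in userspace. */
        while (old & CT_STATE_USER) {
                if (atomic_try_cmpxchg(ct_state, &old,
                                       old | (work << CT_WORK_SHIFT)))
                        return true;    /* queued; the IPI can be skipped */
        }
        return false;   /* CPU is in the kernel; fall back to the IPI */
}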
diff --git a/arch/x86/include/asm/context_tracking_work.h b/arch/x86/include/asm/context_tracking_work.h
index 5bc29e6b2ed38..2c66687ce00e2 100644
--- a/arch/x86/include/asm/context_tracking_work.h
+++ b/arch/x86/include/asm/context_tracking_work.h
@@ -2,11 +2,13 @@
 #ifndef _ASM_X86_CONTEXT_TRACKING_WORK_H
 #define _ASM_X86_CONTEXT_TRACKING_WORK_H
 
+#include <asm/sync_core.h>
+
 static __always_inline void arch_context_tracking_work(int work)
 {
         switch (work) {
-        case CONTEXT_WORK_n:
-                // Do work...
+        case CONTEXT_WORK_SYNC:
+                sync_core();
                 break;
         }
 }
diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
index 29832c338cdc5..b6939e965e69d 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -43,6 +43,7 @@ extern void text_poke_early(void *addr, const void *opcode, size_t len);
  */
 extern void *text_poke(void *addr, const void *opcode, size_t len);
 extern void text_poke_sync(void);
+extern void text_poke_sync_deferrable(void);
 extern void *text_poke_kgdb(void *addr, const void *opcode, size_t len);
 extern void *text_poke_copy(void *addr, const void *opcode, size_t len);
 extern void *text_poke_copy_locked(void *addr, const void *opcode, size_t len, bool core_ok);
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 72646d75b6ffe..fcce480e1919e 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -18,6 +18,7 @@
 #include <linux/mmu_context.h>
 #include <linux/bsearch.h>
 #include <linux/sync_core.h>
+#include <linux/context_tracking.h>
 #include <asm/text-patching.h>
 #include <asm/alternative.h>
 #include <asm/sections.h>
@@ -1933,9 +1934,24 @@ static void do_sync_core(void *info)
         sync_core();
 }
 
+static bool do_sync_core_defer_cond(int cpu, void *info)
+{
+        return !ct_set_cpu_work(cpu, CONTEXT_WORK_SYNC);
+}
+
+static void __text_poke_sync(smp_cond_func_t cond_func)
+{
+        on_each_cpu_cond(cond_func, do_sync_core, NULL, 1);
+}
+
 void text_poke_sync(void)
 {
-        on_each_cpu(do_sync_core, NULL, 1);
+        __text_poke_sync(NULL);
+}
+
+void text_poke_sync_deferrable(void)
+{
+        __text_poke_sync(do_sync_core_defer_cond);
 }
 
 /*
@@ -2145,7 +2161,7 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
                 text_poke(text_poke_addr(&tp[i]), &int3, INT3_INSN_SIZE);
         }
 
-        text_poke_sync();
+        text_poke_sync_deferrable();
 
         /*
          * Second step: update all but the first byte of the patched range.
@@ -2207,7 +2223,7 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
                  * not necessary and we'd be safe even without it. But
                  * better safe than sorry (plus there's not only Intel).
                  */
-                text_poke_sync();
+                text_poke_sync_deferrable();
         }
 
         /*
@@ -2228,7 +2244,7 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
         }
 
         if (do_sync)
-                text_poke_sync();
+                text_poke_sync_deferrable();
 
         /*
          * Remove and wait for refs to be zero.
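A note on the mechanics: on_each_cpu_cond() invokes the condition function for each online CPU and only IPIs those for which it returns true. ct_set_cpu_work() returns true when it managed to queue the work remotely, so do_sync_core_defer_cond() skips exactly the CPUs that will pick the sync up on their next user->kernel transition, while text_poke_sync() (passing a NULL condition) keeps the old IPI-everyone behaviour. The consuming side is the ct_work_flush() hook referenced in the thread above; a rough sketch, assuming it dispatches queued bits to the arch handler this patch fills in:

static void ct_work_flush_sketch(unsigned long work)
{
        unsigned int bit;

        /* One handler invocation per queued item, e.g. CONTEXT_WORK_SYNC. */
        for_each_set_bit(bit, &work, CONTEXT_WORK_MAX_OFFSET)
                arch_context_tracking_work(BIT(bit));
}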
diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
index f7f6042eb7e6c..a38c914753397 100644
--- a/arch/x86/kernel/kprobes/core.c
+++ b/arch/x86/kernel/kprobes/core.c
@@ -735,7 +735,7 @@ void arch_arm_kprobe(struct kprobe *p)
         u8 int3 = INT3_INSN_OPCODE;
 
         text_poke(p->addr, &int3, 1);
-        text_poke_sync();
+        text_poke_sync_deferrable();
         perf_event_text_poke(p->addr, &p->opcode, 1, &int3, 1);
 }
 
@@ -745,7 +745,7 @@ void arch_disarm_kprobe(struct kprobe *p)
         perf_event_text_poke(p->addr, &int3, 1, &p->opcode, 1);
 
         text_poke(p->addr, &p->opcode, 1);
-        text_poke_sync();
+        text_poke_sync_deferrable();
 }
 
 void arch_remove_kprobe(struct kprobe *p)
diff --git a/arch/x86/kernel/kprobes/opt.c b/arch/x86/kernel/kprobes/opt.c
index 57b0037d0a996..88451a744ceda 100644
--- a/arch/x86/kernel/kprobes/opt.c
+++ b/arch/x86/kernel/kprobes/opt.c
@@ -521,11 +521,11 @@ void arch_unoptimize_kprobe(struct optimized_kprobe *op)
                JMP32_INSN_SIZE - INT3_INSN_SIZE);
 
         text_poke(addr, new, INT3_INSN_SIZE);
-        text_poke_sync();
+        text_poke_sync_deferrable();
         text_poke(addr + INT3_INSN_SIZE, new + INT3_INSN_SIZE,
                   JMP32_INSN_SIZE - INT3_INSN_SIZE);
-        text_poke_sync();
+        text_poke_sync_deferrable();
 
         perf_event_text_poke(op->kp.addr, old, JMP32_INSN_SIZE, new, JMP32_INSN_SIZE);
 }
diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
index b05f62ee2344b..8b4542dc51b6d 100644
--- a/arch/x86/kernel/module.c
+++ b/arch/x86/kernel/module.c
@@ -242,7 +242,7 @@ static int write_relocate_add(Elf64_Shdr *sechdrs,
                                    write, apply);
 
         if (!early) {
-                text_poke_sync();
+                text_poke_sync_deferrable();
                 mutex_unlock(&text_mutex);
         }
 
diff --git a/include/linux/context_tracking_work.h b/include/linux/context_tracking_work.h
index fb74db8876dd2..13fc97b395030 100644
--- a/include/linux/context_tracking_work.h
+++ b/include/linux/context_tracking_work.h
@@ -5,12 +5,12 @@
 #include <linux/bitops.h>
 
 enum {
-        CONTEXT_WORK_n_OFFSET,
+        CONTEXT_WORK_SYNC_OFFSET,
         CONTEXT_WORK_MAX_OFFSET
 };
 
 enum ct_work {
-        CONTEXT_WORK_n = BIT(CONTEXT_WORK_n_OFFSET),
+        CONTEXT_WORK_SYNC = BIT(CONTEXT_WORK_SYNC_OFFSET),
         CONTEXT_WORK_MAX = BIT(CONTEXT_WORK_MAX_OFFSET)
 };
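The CONTEXT_WORK_n placeholder from the earlier patch becomes the first real work item here. For illustration, a hypothetical second deferrable item (not part of this patch) would only need a new offset/bit pair plus a matching case in arch_context_tracking_work():

enum {
        CONTEXT_WORK_SYNC_OFFSET,
        CONTEXT_WORK_TLBI_OFFSET,       /* hypothetical new item */
        CONTEXT_WORK_MAX_OFFSET
};

enum ct_work {
        CONTEXT_WORK_SYNC = BIT(CONTEXT_WORK_SYNC_OFFSET),
        CONTEXT_WORK_TLBI = BIT(CONTEXT_WORK_TLBI_OFFSET),      /* hypothetical */
        CONTEXT_WORK_MAX  = BIT(CONTEXT_WORK_MAX_OFFSET)
};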