Message ID | 20230326062039.341479-2-sathyanarayanan.kuppuswamy@linux.intel.com |
---|---|
State | Superseded |
Headers | show |
Series | TDX Guest Quote generation support | expand |
Hi Kai, On 3/27/23 7:38 PM, Huang, Kai wrote: >> +/* Reserve an IRQ from x86_vector_domain for TD event notification */ >> +static int __init tdx_event_irq_init(void) >> +{ >> + struct irq_alloc_info info; >> + cpumask_t saved_cpumask; >> + struct irq_cfg *cfg; >> + int cpu, irq; >> + >> + if (!cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) >> + return 0; >> + >> + init_irq_alloc_info(&info, NULL); >> + >> + /* >> + * Event notification vector will be delivered to the CPU >> + * in which TDVMCALL_SETUP_NOTIFY_INTR hypercall is requested. >> + * So set the IRQ affinity to the current CPU. >> + */ >> + cpu = get_cpu(); >> + cpumask_copy(&saved_cpumask, current->cpus_ptr); >> + info.mask = cpumask_of(cpu); >> + put_cpu(); > The 'saved_cpumask' related code is ugly. If you move put_cpu() to the end of > this function, I think you can remove all related code: > > cpu = get_cpu(); > > /* > * Set @info->mask to local cpu to make sure a valid vector is > * pre-allocated when TDX event notification IRQ is allocated > * from x86_vector_domain. > */ > init_irq_alloc_info(&info, cpumask_of(cpu)); > > // rest staff: request_irq(), hypercall ... > > put_cpu(); > init_irq_alloc_info() is a sleeping function. Since get_cpu() disables preemption, we cannot call sleeping function after it. Initially, I have implemented it like you have mentioned. However, I discovered the following error. [ 2.400755] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:580 [ 2.404664] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1, name: swapper/0 [ 2.408671] preempt_count: 1, expected: 0 [ 2.412650] RCU nest depth: 0, expected: 0 [ 2.412666] no locks held by swapper/0/1. [ 2.416650] Preemption disabled at: [ 2.416650] [<ffffffff83b8089f>] tdx_arch_init+0x38/0x117 [ 2.420670] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.1.0-rc4-00117-g672ca073d9f9-dirty #2527 [ 2.424650] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 [ 2.424650] Call Trace: [ 2.424650] <TASK> [ 2.424650] dump_stack_lvl+0x6a/0x86 [ 2.424650] __might_resched.cold+0xf4/0x12f [ 2.424650] __mutex_lock+0x50/0x810 [ 2.424650] ? lock_is_held_type+0xd8/0x130 [ 2.424650] ? __irq_alloc_descs+0xcf/0x310 [ 2.424650] ? find_held_lock+0x2b/0x80 [ 2.424650] ? __irq_alloc_descs+0xcf/0x310 [ 2.424650] __irq_alloc_descs+0xcf/0x310 [ 2.424650] irq_domain_alloc_descs.part.0+0x49/0xa0 [ 2.424650] __irq_domain_alloc_irqs+0x2a0/0x4f0 [ 2.424650] ? next_arg+0x129/0x1f0 [ 2.424650] ? tdx_guest_init+0x5b/0x5b [ 2.424650] tdx_arch_init+0x8e/0x117 [ 2.424650] do_one_initcall+0x137/0x2ec [ 2.424650] ? rcu_read_lock_sched_held+0x36/0x60 [ 2.424650] kernel_init_freeable+0x1e3/0x241 [ 2.424650] ? rest_init+0x1a0/0x1a0 [ 2.424650] kernel_init+0x17/0x170 [ 2.424650] ? rest_init+0x1a0/0x1a0 [ 2.424650] ret_from_fork+0x1f/0x30 [ 2.424650] </TASK>
On 3/27/23 9:02 PM, Huang, Kai wrote: > On Mon, 2023-03-27 at 19:50 -0700, Sathyanarayanan Kuppuswamy wrote: >> Hi Kai, >> >> On 3/27/23 7:38 PM, Huang, Kai wrote: >>>> +/* Reserve an IRQ from x86_vector_domain for TD event notification */ >>>> +static int __init tdx_event_irq_init(void) >>>> +{ >>>> + struct irq_alloc_info info; >>>> + cpumask_t saved_cpumask; >>>> + struct irq_cfg *cfg; >>>> + int cpu, irq; >>>> + >>>> + if (!cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) >>>> + return 0; >>>> + >>>> + init_irq_alloc_info(&info, NULL); >>>> + >>>> + /* >>>> + * Event notification vector will be delivered to the CPU >>>> + * in which TDVMCALL_SETUP_NOTIFY_INTR hypercall is requested. >>>> + * So set the IRQ affinity to the current CPU. >>>> + */ >>>> + cpu = get_cpu(); >>>> + cpumask_copy(&saved_cpumask, current->cpus_ptr); >>>> + info.mask = cpumask_of(cpu); >>>> + put_cpu(); >>> The 'saved_cpumask' related code is ugly. If you move put_cpu() to the end of >>> this function, I think you can remove all related code: >>> >>> cpu = get_cpu(); >>> >>> /* >>> * Set @info->mask to local cpu to make sure a valid vector is >>> * pre-allocated when TDX event notification IRQ is allocated >>> * from x86_vector_domain. >>> */ >>> init_irq_alloc_info(&info, cpumask_of(cpu)); >>> >>> // rest staff: request_irq(), hypercall ... >>> >>> put_cpu(); >>> >> >> init_irq_alloc_info() is a sleeping function. Since get_cpu() disables >> preemption, we cannot call sleeping function after it. Initially, I >> have implemented it like you have mentioned. However, I discovered the >> following error. > > Oh sorry I forgot this. So I think we should use migrate_disable() instead: > > migrate_disable(); > > init_irq_alloc_info(&info, cpumask_of(smp_processor_id())); > > ... > > migrate_enable(); > > Or, should we just use early_initcall() so that only BSP is running? IMHO it's > OK to always allocate the vector from BSP. > > Anyway, either way is fine to me. Final version looks like below. static int __init tdx_event_irq_init(void) { struct irq_alloc_info info; struct irq_cfg *cfg; int irq; if (!cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) return 0; init_irq_alloc_info(&info, NULL); /* * Event notification vector will be delivered to the CPU * in which TDVMCALL_SETUP_NOTIFY_INTR hypercall is requested. * So set the IRQ affinity to the current CPU. */ info.mask = cpumask_of(0); irq = irq_domain_alloc_irqs(x86_vector_domain, 1, cpu_to_node(0), &info); if (irq <= 0) { pr_err("Event notification IRQ allocation failed %d\n", irq); return -EIO; } irq_set_handler(irq, handle_edge_irq); /* Since the IRQ affinity is set, it cannot be balanced */ if (request_irq(irq, tdx_event_irq_handler, IRQF_NOBALANCING, "tdx_event_irq", NULL)) { pr_err("Event notification IRQ request failed\n"); goto err_free_domain_irqs; } cfg = irq_cfg(irq); /* * Since tdx_event_irq_init() is triggered via early_initcall(), * it will called before secondary CPUs bringup. Since there is * only one CPU, it complies with the requirement of executing * the TDVMCALL_SETUP_NOTIFY_INTR hypercall on the same CPU where * the IRQ vector is allocated. * * Register callback vector address with VMM. More details * about the ABI can be found in TDX Guest-Host-Communication * Interface (GHCI), sec titled * "TDG.VP.VMCALL<SetupEventNotifyInterrupt>". */ if (_tdx_hypercall(TDVMCALL_SETUP_NOTIFY_INTR, cfg->vector, 0, 0, 0)) { pr_err("Event notification hypercall failed\n"); goto err_free_irqs; } tdx_event_irq = irq; return 0; err_free_irqs: free_irq(irq, NULL); err_free_domain_irqs: irq_domain_free_irqs(irq, 1); return -EIO; } early_initcall(tdx_event_irq_init) > >> >> [ 2.400755] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:580 >> [ 2.404664] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1, name: swapper/0 >> [ 2.408671] preempt_count: 1, expected: 0 >> [ 2.412650] RCU nest depth: 0, expected: 0 >> [ 2.412666] no locks held by swapper/0/1. >> [ 2.416650] Preemption disabled at: >> [ 2.416650] [<ffffffff83b8089f>] tdx_arch_init+0x38/0x117 >> [ 2.420670] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.1.0-rc4-00117-g672ca073d9f9-dirty #2527 >> [ 2.424650] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 >> [ 2.424650] Call Trace: >> [ 2.424650] <TASK> >> [ 2.424650] dump_stack_lvl+0x6a/0x86 >> [ 2.424650] __might_resched.cold+0xf4/0x12f >> [ 2.424650] __mutex_lock+0x50/0x810 >> [ 2.424650] ? lock_is_held_type+0xd8/0x130 >> [ 2.424650] ? __irq_alloc_descs+0xcf/0x310 >> [ 2.424650] ? find_held_lock+0x2b/0x80 >> [ 2.424650] ? __irq_alloc_descs+0xcf/0x310 >> [ 2.424650] __irq_alloc_descs+0xcf/0x310 >> [ 2.424650] irq_domain_alloc_descs.part.0+0x49/0xa0 >> [ 2.424650] __irq_domain_alloc_irqs+0x2a0/0x4f0 >> [ 2.424650] ? next_arg+0x129/0x1f0 >> [ 2.424650] ? tdx_guest_init+0x5b/0x5b >> [ 2.424650] tdx_arch_init+0x8e/0x117 >> [ 2.424650] do_one_initcall+0x137/0x2ec >> [ 2.424650] ? rcu_read_lock_sched_held+0x36/0x60 >> [ 2.424650] kernel_init_freeable+0x1e3/0x241 >> [ 2.424650] ? rest_init+0x1a0/0x1a0 >> [ 2.424650] kernel_init+0x17/0x170 >> [ 2.424650] ? rest_init+0x1a0/0x1a0 >> [ 2.424650] ret_from_fork+0x1f/0x30 >> [ 2.424650] </TASK> >
On Tue, 2023-04-04 at 18:02 -0700, Sathyanarayanan Kuppuswamy wrote: > > On 3/27/23 9:02 PM, Huang, Kai wrote: > > On Mon, 2023-03-27 at 19:50 -0700, Sathyanarayanan Kuppuswamy wrote: > > > Hi Kai, > > > > > > On 3/27/23 7:38 PM, Huang, Kai wrote: > > > > > +/* Reserve an IRQ from x86_vector_domain for TD event notification */ > > > > > +static int __init tdx_event_irq_init(void) > > > > > +{ > > > > > + struct irq_alloc_info info; > > > > > + cpumask_t saved_cpumask; > > > > > + struct irq_cfg *cfg; > > > > > + int cpu, irq; > > > > > + > > > > > + if (!cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) > > > > > + return 0; > > > > > + > > > > > + init_irq_alloc_info(&info, NULL); > > > > > + > > > > > + /* > > > > > + * Event notification vector will be delivered to the CPU > > > > > + * in which TDVMCALL_SETUP_NOTIFY_INTR hypercall is requested. > > > > > + * So set the IRQ affinity to the current CPU. > > > > > + */ > > > > > + cpu = get_cpu(); > > > > > + cpumask_copy(&saved_cpumask, current->cpus_ptr); > > > > > + info.mask = cpumask_of(cpu); > > > > > + put_cpu(); > > > > The 'saved_cpumask' related code is ugly. If you move put_cpu() to the end of > > > > this function, I think you can remove all related code: > > > > > > > > cpu = get_cpu(); > > > > > > > > /* > > > > * Set @info->mask to local cpu to make sure a valid vector is > > > > * pre-allocated when TDX event notification IRQ is allocated > > > > * from x86_vector_domain. > > > > */ > > > > init_irq_alloc_info(&info, cpumask_of(cpu)); > > > > > > > > // rest staff: request_irq(), hypercall ... > > > > > > > > put_cpu(); > > > > > > > > > > init_irq_alloc_info() is a sleeping function. Since get_cpu() disables > > > preemption, we cannot call sleeping function after it. Initially, I > > > have implemented it like you have mentioned. However, I discovered the > > > following error. > > > > Oh sorry I forgot this. So I think we should use migrate_disable() instead: > > > > migrate_disable(); > > > > init_irq_alloc_info(&info, cpumask_of(smp_processor_id())); > > > > ... > > > > migrate_enable(); > > > > Or, should we just use early_initcall() so that only BSP is running? IMHO it's > > OK to always allocate the vector from BSP. > > > > Anyway, either way is fine to me. > > Final version looks like below. > > static int __init tdx_event_irq_init(void) > { > struct irq_alloc_info info; > struct irq_cfg *cfg; > int irq; > > if (!cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) > return 0; > > init_irq_alloc_info(&info, NULL); > > /* > * Event notification vector will be delivered to the CPU > * in which TDVMCALL_SETUP_NOTIFY_INTR hypercall is requested. > * So set the IRQ affinity to the current CPU. > */ > info.mask = cpumask_of(0); > > irq = irq_domain_alloc_irqs(x86_vector_domain, 1, cpu_to_node(0), &info); > if (irq <= 0) { > pr_err("Event notification IRQ allocation failed %d\n", irq); > return -EIO; > } > > irq_set_handler(irq, handle_edge_irq); > > /* Since the IRQ affinity is set, it cannot be balanced */ > if (request_irq(irq, tdx_event_irq_handler, IRQF_NOBALANCING, > "tdx_event_irq", NULL)) { > pr_err("Event notification IRQ request failed\n"); > goto err_free_domain_irqs; > } > > cfg = irq_cfg(irq); > > /* > * Since tdx_event_irq_init() is triggered via early_initcall(), > * it will called before secondary CPUs bringup. Since there is > * only one CPU, it complies with the requirement of executing > * the TDVMCALL_SETUP_NOTIFY_INTR hypercall on the same CPU where > * the IRQ vector is allocated. > * > * Register callback vector address with VMM. More details > * about the ABI can be found in TDX Guest-Host-Communication > * Interface (GHCI), sec titled > * "TDG.VP.VMCALL<SetupEventNotifyInterrupt>". > */ > if (_tdx_hypercall(TDVMCALL_SETUP_NOTIFY_INTR, cfg->vector, 0, 0, 0)) { > pr_err("Event notification hypercall failed\n"); > goto err_free_irqs; > } > > tdx_event_irq = irq; > > return 0; > > err_free_irqs: > free_irq(irq, NULL); > err_free_domain_irqs: > irq_domain_free_irqs(irq, 1); > > return -EIO; > } > early_initcall(tdx_event_irq_init) I found there's another series also doing similar thing, and it seems Thomas wasn't happy about using x86_vector_domain directly: https://lore.kernel.org/lkml/877cv99k0y.ffs@tglx/ An alternative was also posted (creating IRQ domain on top of x86_vector_domain): https://lore.kernel.org/lkml/20230328182933.GA1403032@vm02.guest.corp.microsoft.com/ I think we should monitor that and hear from others more.
diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c index 055300e08fb3..d03985952d45 100644 --- a/arch/x86/coco/tdx/tdx.c +++ b/arch/x86/coco/tdx/tdx.c @@ -7,12 +7,18 @@ #include <linux/cpufeature.h> #include <linux/export.h> #include <linux/io.h> +#include <linux/string.h> +#include <linux/uaccess.h> +#include <linux/interrupt.h> +#include <linux/irq.h> +#include <linux/numa.h> #include <asm/coco.h> #include <asm/tdx.h> #include <asm/vmx.h> #include <asm/insn.h> #include <asm/insn-eval.h> #include <asm/pgtable.h> +#include <asm/irqdomain.h> /* TDX module Call Leaf IDs */ #define TDX_GET_INFO 1 @@ -27,6 +33,7 @@ /* TDX hypercall Leaf IDs */ #define TDVMCALL_MAP_GPA 0x10001 #define TDVMCALL_REPORT_FATAL_ERROR 0x10003 +#define TDVMCALL_SETUP_NOTIFY_INTR 0x10004 /* MMIO direction */ #define EPT_READ 0 @@ -51,6 +58,16 @@ #define TDREPORT_SUBTYPE_0 0 +struct event_irq_entry { + tdx_event_irq_cb_t handler; + void *data; + struct list_head head; +}; + +static int tdx_event_irq; +static LIST_HEAD(event_irq_cb_list); +static DEFINE_SPINLOCK(event_irq_cb_lock); + /* * Wrapper for standard use of __tdx_hypercall with no output aside from * return code. @@ -873,3 +890,149 @@ void __init tdx_early_init(void) pr_info("Guest detected\n"); } + +static irqreturn_t tdx_event_irq_handler(int irq, void *dev_id) +{ + struct event_irq_entry *entry; + + spin_lock(&event_irq_cb_lock); + list_for_each_entry(entry, &event_irq_cb_list, head) { + if (entry->handler) + entry->handler(entry->data); + } + spin_unlock(&event_irq_cb_lock); + + return IRQ_HANDLED; +} + +/* Reserve an IRQ from x86_vector_domain for TD event notification */ +static int __init tdx_event_irq_init(void) +{ + struct irq_alloc_info info; + cpumask_t saved_cpumask; + struct irq_cfg *cfg; + int cpu, irq; + + if (!cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) + return 0; + + init_irq_alloc_info(&info, NULL); + + /* + * Event notification vector will be delivered to the CPU + * in which TDVMCALL_SETUP_NOTIFY_INTR hypercall is requested. + * So set the IRQ affinity to the current CPU. + */ + cpu = get_cpu(); + cpumask_copy(&saved_cpumask, current->cpus_ptr); + info.mask = cpumask_of(cpu); + put_cpu(); + + irq = irq_domain_alloc_irqs(x86_vector_domain, 1, NUMA_NO_NODE, &info); + if (irq <= 0) { + pr_err("Event notification IRQ allocation failed %d\n", irq); + return -EIO; + } + + irq_set_handler(irq, handle_edge_irq); + + cfg = irq_cfg(irq); + if (!cfg) { + pr_err("Event notification IRQ config not found\n"); + goto err_free_irqs; + } + + if (request_irq(irq, tdx_event_irq_handler, IRQF_NOBALANCING, + "tdx_event_irq", NULL)) { + pr_err("Event notification IRQ request failed\n"); + goto err_free_irqs; + } + + set_cpus_allowed_ptr(current, cpumask_of(cpu)); + + /* + * Register callback vector address with VMM. More details + * about the ABI can be found in TDX Guest-Host-Communication + * Interface (GHCI), sec titled + * "TDG.VP.VMCALL<SetupEventNotifyInterrupt>". + */ + if (_tdx_hypercall(TDVMCALL_SETUP_NOTIFY_INTR, cfg->vector, 0, 0, 0)) { + pr_err("Event notification hypercall failed\n"); + goto err_restore_cpus; + } + + set_cpus_allowed_ptr(current, &saved_cpumask); + + tdx_event_irq = irq; + + return 0; + +err_restore_cpus: + set_cpus_allowed_ptr(current, &saved_cpumask); + free_irq(irq, NULL); +err_free_irqs: + irq_domain_free_irqs(irq, 1); + + return -EIO; +} +arch_initcall(tdx_event_irq_init) + +/** + * tdx_register_event_irq_cb() - Register TDX event IRQ callback handler. + * @handler: Address of driver specific event IRQ callback handler. Handler + * will be called in IRQ context and hence cannot sleep. + * @data: Context data to be passed to the callback handler. + * + * Return: 0 on success or standard error code on other failures. + */ +int tdx_register_event_irq_cb(tdx_event_irq_cb_t handler, void *data) +{ + struct event_irq_entry *entry; + unsigned long flags; + + if (tdx_event_irq <= 0) + return -EIO; + + entry = kzalloc(sizeof(*entry), GFP_KERNEL); + if (!entry) + return -ENOMEM; + + entry->data = data; + entry->handler = handler; + + spin_lock_irqsave(&event_irq_cb_lock, flags); + list_add_tail(&entry->head, &event_irq_cb_list); + spin_unlock_irqrestore(&event_irq_cb_lock, flags); + + return 0; +} +EXPORT_SYMBOL_GPL(tdx_register_event_irq_cb); + +/** + * tdx_unregister_event_irq_cb() - Unregister TDX event IRQ callback handler. + * @handler: Address of driver specific event IRQ callback handler. + * @data: Context data to be passed to the callback handler. + * + * Return: 0 on success or -EIO if event IRQ is not allocated. + */ +int tdx_unregister_event_irq_cb(tdx_event_irq_cb_t handler, void *data) +{ + struct event_irq_entry *entry; + unsigned long flags; + + if (tdx_event_irq <= 0) + return -EIO; + + spin_lock_irqsave(&event_irq_cb_lock, flags); + list_for_each_entry(entry, &event_irq_cb_list, head) { + if (entry->handler == handler && entry->data == data) { + list_del(&entry->head); + kfree(entry); + break; + } + } + spin_unlock_irqrestore(&event_irq_cb_lock, flags); + + return 0; +} +EXPORT_SYMBOL_GPL(tdx_unregister_event_irq_cb); diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h index 28d889c9aa16..8807fe1b1f3f 100644 --- a/arch/x86/include/asm/tdx.h +++ b/arch/x86/include/asm/tdx.h @@ -53,6 +53,8 @@ struct ve_info { #ifdef CONFIG_INTEL_TDX_GUEST +typedef int (*tdx_event_irq_cb_t)(void *); + void __init tdx_early_init(void); /* Used to communicate with the TDX module */ @@ -69,6 +71,10 @@ bool tdx_early_handle_ve(struct pt_regs *regs); int tdx_mcall_get_report0(u8 *reportdata, u8 *tdreport); +int tdx_register_event_irq_cb(tdx_event_irq_cb_t handler, void *data); + +int tdx_unregister_event_irq_cb(tdx_event_irq_cb_t handler, void *data); + #else static inline void tdx_early_init(void) { };