Message ID | 1405501458-2822-1-git-send-email-anup.patel@linaro.org |
---|---|
State | New |
Headers | show |
On Wed, 2014-07-16 at 14:34 +0530, Anup Patel wrote: > If we have a Guest/DomU with two or more of its VCPUs running > on same host CPU then it can quite likely happen that these > VCPUs fight for same spinlock and one of them will waste CPU > cycles in WFE instruction. This patch makes WFE instruction > trap for VCPU and forces VCPU to yield its timeslice. > > The KVM ARM/ARM64 also does similar thing for handling WFE > instructions. (Please refer, > https://lists.cs.columbia.edu/pipermail/kvmarm/2013-November/006259.html) > > In general, this patch is more of an optimization for an > oversubscribed system having number of VCPUs more than > underlying host CPUs. Sounds good. Do you have any numbers for the improvement? > Signed-off-by: Anup Patel <anup.patel@linaro.org> > Signed-off-by: Pranavkumar Sawargaonkar <pranavkumar@linaro.org> > Tested-by: Pranavkumar Sawargaonkar <pranavkumar@linaro.org> > --- > xen/arch/arm/traps.c | 27 ++++++++++++++++----------- > 1 file changed, 16 insertions(+), 11 deletions(-) > > diff --git a/xen/arch/arm/traps.c b/xen/arch/arm/traps.c > index 686d8b7..bc9f67a 100644 > --- a/xen/arch/arm/traps.c > +++ b/xen/arch/arm/traps.c > @@ -90,7 +90,7 @@ void __cpuinit init_traps(void) > > /* Setup hypervisor traps */ > WRITE_SYSREG(HCR_PTW|HCR_BSU_INNER|HCR_AMO|HCR_IMO|HCR_FMO|HCR_VM| > - HCR_TWI|HCR_TSC|HCR_TAC|HCR_SWIO|HCR_TIDCP, HCR_EL2); > + HCR_TWE|HCR_TWI|HCR_TSC|HCR_TAC|HCR_SWIO|HCR_TIDCP, HCR_EL2); > isb(); > } > > @@ -1803,16 +1803,21 @@ asmlinkage void do_trap_hypervisor(struct cpu_user_regs *regs) > advance_pc(regs, hsr); > return; > } > - /* at the moment we only trap WFI */ > - vcpu_block(); > - /* The ARM spec declares that even if local irqs are masked in > - * the CPSR register, an irq should wake up a cpu from WFI anyway. > - * For this reason we need to check for irqs that need delivery, > - * ignoring the CPSR register, *after* calling SCHEDOP_block to > - * avoid races with vgic_vcpu_inject_irq. > - */ > - if ( local_events_need_delivery_nomask() ) > - vcpu_unblock(current); > + if ( hsr.bits & 0x1 ) { Please can you add an appropriate struct hsr_wfe_wfi to union hsr and use that instead of opencoding bit patterns in hsr.bits. > + /* Yield the VCPU for WFE */ > + vcpu_force_reschedule(current); This is OK fine by me, but I'm wondering: Does the HW have any support for actually blocking until an event occurs in another vcpu (some sort of trap on sev)? I don't think so, nor am I really sure how such a h/w feature might work... Ian.
On 16 July 2014 15:06, Ian Campbell <Ian.Campbell@citrix.com> wrote: > On Wed, 2014-07-16 at 14:34 +0530, Anup Patel wrote: >> If we have a Guest/DomU with two or more of its VCPUs running >> on same host CPU then it can quite likely happen that these >> VCPUs fight for same spinlock and one of them will waste CPU >> cycles in WFE instruction. This patch makes WFE instruction >> trap for VCPU and forces VCPU to yield its timeslice. >> >> The KVM ARM/ARM64 also does similar thing for handling WFE >> instructions. (Please refer, >> https://lists.cs.columbia.edu/pipermail/kvmarm/2013-November/006259.html) >> >> In general, this patch is more of an optimization for an >> oversubscribed system having number of VCPUs more than >> underlying host CPUs. > > Sounds good. Do you have any numbers for the improvement? I did not try any benchmarks myself but hackbench shows good improvement for KVM hence it is a worthy optimization for Xen too. I found this change missing for Xen hence this patch. > >> Signed-off-by: Anup Patel <anup.patel@linaro.org> >> Signed-off-by: Pranavkumar Sawargaonkar <pranavkumar@linaro.org> >> Tested-by: Pranavkumar Sawargaonkar <pranavkumar@linaro.org> >> --- >> xen/arch/arm/traps.c | 27 ++++++++++++++++----------- >> 1 file changed, 16 insertions(+), 11 deletions(-) >> >> diff --git a/xen/arch/arm/traps.c b/xen/arch/arm/traps.c >> index 686d8b7..bc9f67a 100644 >> --- a/xen/arch/arm/traps.c >> +++ b/xen/arch/arm/traps.c >> @@ -90,7 +90,7 @@ void __cpuinit init_traps(void) >> >> /* Setup hypervisor traps */ >> WRITE_SYSREG(HCR_PTW|HCR_BSU_INNER|HCR_AMO|HCR_IMO|HCR_FMO|HCR_VM| >> - HCR_TWI|HCR_TSC|HCR_TAC|HCR_SWIO|HCR_TIDCP, HCR_EL2); >> + HCR_TWE|HCR_TWI|HCR_TSC|HCR_TAC|HCR_SWIO|HCR_TIDCP, HCR_EL2); >> isb(); >> } >> >> @@ -1803,16 +1803,21 @@ asmlinkage void do_trap_hypervisor(struct cpu_user_regs *regs) >> advance_pc(regs, hsr); >> return; >> } >> - /* at the moment we only trap WFI */ >> - vcpu_block(); >> - /* The ARM spec declares that even if local irqs are masked in >> - * the CPSR register, an irq should wake up a cpu from WFI anyway. >> - * For this reason we need to check for irqs that need delivery, >> - * ignoring the CPSR register, *after* calling SCHEDOP_block to >> - * avoid races with vgic_vcpu_inject_irq. >> - */ >> - if ( local_events_need_delivery_nomask() ) >> - vcpu_unblock(current); >> + if ( hsr.bits & 0x1 ) { > > Please can you add an appropriate struct hsr_wfe_wfi to union hsr and > use that instead of opencoding bit patterns in hsr.bits. Sure, I will update this and send it right away. > >> + /* Yield the VCPU for WFE */ >> + vcpu_force_reschedule(current); > > This is OK fine by me, but I'm wondering: Does the HW have any support > for actually blocking until an event occurs in another vcpu (some sort > of trap on sev)? I don't think so, nor am I really sure how such a h/w > feature might work... Typically, on HW the WFE instruction forces cores to go in low power mode. The core will come out of low power mode due to some interrupt or SEV executed from another core. > > Ian. > -- Anup
On Wed, 2014-07-16 at 15:37 +0530, Anup Patel wrote: > On 16 July 2014 15:06, Ian Campbell <Ian.Campbell@citrix.com> wrote: > > On Wed, 2014-07-16 at 14:34 +0530, Anup Patel wrote: > >> If we have a Guest/DomU with two or more of its VCPUs running > >> on same host CPU then it can quite likely happen that these > >> VCPUs fight for same spinlock and one of them will waste CPU > >> cycles in WFE instruction. This patch makes WFE instruction > >> trap for VCPU and forces VCPU to yield its timeslice. > >> > >> The KVM ARM/ARM64 also does similar thing for handling WFE > >> instructions. (Please refer, > >> https://lists.cs.columbia.edu/pipermail/kvmarm/2013-November/006259.html) > >> > >> In general, this patch is more of an optimization for an > >> oversubscribed system having number of VCPUs more than > >> underlying host CPUs. > > > > Sounds good. Do you have any numbers for the improvement? > > I did not try any benchmarks myself but hackbench shows good > improvement for KVM hence it is a worthy optimization for Xen too. > > I found this change missing for Xen hence this patch. > > > > >> Signed-off-by: Anup Patel <anup.patel@linaro.org> > >> Signed-off-by: Pranavkumar Sawargaonkar <pranavkumar@linaro.org> > >> Tested-by: Pranavkumar Sawargaonkar <pranavkumar@linaro.org> > >> --- > >> xen/arch/arm/traps.c | 27 ++++++++++++++++----------- > >> 1 file changed, 16 insertions(+), 11 deletions(-) > >> > >> diff --git a/xen/arch/arm/traps.c b/xen/arch/arm/traps.c > >> index 686d8b7..bc9f67a 100644 > >> --- a/xen/arch/arm/traps.c > >> +++ b/xen/arch/arm/traps.c > >> @@ -90,7 +90,7 @@ void __cpuinit init_traps(void) > >> > >> /* Setup hypervisor traps */ > >> WRITE_SYSREG(HCR_PTW|HCR_BSU_INNER|HCR_AMO|HCR_IMO|HCR_FMO|HCR_VM| > >> - HCR_TWI|HCR_TSC|HCR_TAC|HCR_SWIO|HCR_TIDCP, HCR_EL2); > >> + HCR_TWE|HCR_TWI|HCR_TSC|HCR_TAC|HCR_SWIO|HCR_TIDCP, HCR_EL2); > >> isb(); > >> } > >> > >> @@ -1803,16 +1803,21 @@ asmlinkage void do_trap_hypervisor(struct cpu_user_regs *regs) > >> advance_pc(regs, hsr); > >> return; > >> } > >> - /* at the moment we only trap WFI */ > >> - vcpu_block(); > >> - /* The ARM spec declares that even if local irqs are masked in > >> - * the CPSR register, an irq should wake up a cpu from WFI anyway. > >> - * For this reason we need to check for irqs that need delivery, > >> - * ignoring the CPSR register, *after* calling SCHEDOP_block to > >> - * avoid races with vgic_vcpu_inject_irq. > >> - */ > >> - if ( local_events_need_delivery_nomask() ) > >> - vcpu_unblock(current); > >> + if ( hsr.bits & 0x1 ) { > > > > Please can you add an appropriate struct hsr_wfe_wfi to union hsr and > > use that instead of opencoding bit patterns in hsr.bits. > > Sure, I will update this and send it right away. Thanks. > >> + /* Yield the VCPU for WFE */ > >> + vcpu_force_reschedule(current); > > > > This is OK fine by me, but I'm wondering: Does the HW have any support > > for actually blocking until an event occurs in another vcpu (some sort > > of trap on sev)? I don't think so, nor am I really sure how such a h/w > > feature might work... > > Typically, on HW the WFE instruction forces cores to go in low power > mode. The core will come out of low power mode due to some interrupt > or SEV executed from another core. Right, but for virt in order to put a wfe-ing vcpu to sleep we would need to be able to get some sort of hypervisor event when another vcpu in the same guest executed sev. Otherwise the best we can do is to yield as you've done here, but that will still keep spinning if there are no other vcpus to run. Ian.
On 16 July 2014 15:39, Ian Campbell <Ian.Campbell@citrix.com> wrote: > On Wed, 2014-07-16 at 15:37 +0530, Anup Patel wrote: >> On 16 July 2014 15:06, Ian Campbell <Ian.Campbell@citrix.com> wrote: >> > On Wed, 2014-07-16 at 14:34 +0530, Anup Patel wrote: >> >> If we have a Guest/DomU with two or more of its VCPUs running >> >> on same host CPU then it can quite likely happen that these >> >> VCPUs fight for same spinlock and one of them will waste CPU >> >> cycles in WFE instruction. This patch makes WFE instruction >> >> trap for VCPU and forces VCPU to yield its timeslice. >> >> >> >> The KVM ARM/ARM64 also does similar thing for handling WFE >> >> instructions. (Please refer, >> >> https://lists.cs.columbia.edu/pipermail/kvmarm/2013-November/006259.html) >> >> >> >> In general, this patch is more of an optimization for an >> >> oversubscribed system having number of VCPUs more than >> >> underlying host CPUs. >> > >> > Sounds good. Do you have any numbers for the improvement? >> >> I did not try any benchmarks myself but hackbench shows good >> improvement for KVM hence it is a worthy optimization for Xen too. >> >> I found this change missing for Xen hence this patch. >> >> > >> >> Signed-off-by: Anup Patel <anup.patel@linaro.org> >> >> Signed-off-by: Pranavkumar Sawargaonkar <pranavkumar@linaro.org> >> >> Tested-by: Pranavkumar Sawargaonkar <pranavkumar@linaro.org> >> >> --- >> >> xen/arch/arm/traps.c | 27 ++++++++++++++++----------- >> >> 1 file changed, 16 insertions(+), 11 deletions(-) >> >> >> >> diff --git a/xen/arch/arm/traps.c b/xen/arch/arm/traps.c >> >> index 686d8b7..bc9f67a 100644 >> >> --- a/xen/arch/arm/traps.c >> >> +++ b/xen/arch/arm/traps.c >> >> @@ -90,7 +90,7 @@ void __cpuinit init_traps(void) >> >> >> >> /* Setup hypervisor traps */ >> >> WRITE_SYSREG(HCR_PTW|HCR_BSU_INNER|HCR_AMO|HCR_IMO|HCR_FMO|HCR_VM| >> >> - HCR_TWI|HCR_TSC|HCR_TAC|HCR_SWIO|HCR_TIDCP, HCR_EL2); >> >> + HCR_TWE|HCR_TWI|HCR_TSC|HCR_TAC|HCR_SWIO|HCR_TIDCP, HCR_EL2); >> >> isb(); >> >> } >> >> >> >> @@ -1803,16 +1803,21 @@ asmlinkage void do_trap_hypervisor(struct cpu_user_regs *regs) >> >> advance_pc(regs, hsr); >> >> return; >> >> } >> >> - /* at the moment we only trap WFI */ >> >> - vcpu_block(); >> >> - /* The ARM spec declares that even if local irqs are masked in >> >> - * the CPSR register, an irq should wake up a cpu from WFI anyway. >> >> - * For this reason we need to check for irqs that need delivery, >> >> - * ignoring the CPSR register, *after* calling SCHEDOP_block to >> >> - * avoid races with vgic_vcpu_inject_irq. >> >> - */ >> >> - if ( local_events_need_delivery_nomask() ) >> >> - vcpu_unblock(current); >> >> + if ( hsr.bits & 0x1 ) { >> > >> > Please can you add an appropriate struct hsr_wfe_wfi to union hsr and >> > use that instead of opencoding bit patterns in hsr.bits. >> >> Sure, I will update this and send it right away. > > Thanks. > >> >> + /* Yield the VCPU for WFE */ >> >> + vcpu_force_reschedule(current); >> > >> > This is OK fine by me, but I'm wondering: Does the HW have any support >> > for actually blocking until an event occurs in another vcpu (some sort >> > of trap on sev)? I don't think so, nor am I really sure how such a h/w >> > feature might work... >> >> Typically, on HW the WFE instruction forces cores to go in low power >> mode. The core will come out of low power mode due to some interrupt >> or SEV executed from another core. > > Right, but for virt in order to put a wfe-ing vcpu to sleep we would > need to be able to get some sort of hypervisor event when another vcpu > in the same guest executed sev. > > Otherwise the best we can do is to yield as you've done here, but that > will still keep spinning if there are no other vcpus to run. Yes, we cannot trap SEV instruction. Yield seems to be the only option here, unless we modify the guest and replace SEV instruction with a hypercall. -- Anup > > Ian. >
On Wed, 2014-07-16 at 15:47 +0530, Anup Patel wrote: > On 16 July 2014 15:39, Ian Campbell <Ian.Campbell@citrix.com> wrote: > > On Wed, 2014-07-16 at 15:37 +0530, Anup Patel wrote: > >> On 16 July 2014 15:06, Ian Campbell <Ian.Campbell@citrix.com> wrote: > >> > On Wed, 2014-07-16 at 14:34 +0530, Anup Patel wrote: > >> >> If we have a Guest/DomU with two or more of its VCPUs running > >> >> on same host CPU then it can quite likely happen that these > >> >> VCPUs fight for same spinlock and one of them will waste CPU > >> >> cycles in WFE instruction. This patch makes WFE instruction > >> >> trap for VCPU and forces VCPU to yield its timeslice. > >> >> > >> >> The KVM ARM/ARM64 also does similar thing for handling WFE > >> >> instructions. (Please refer, > >> >> https://lists.cs.columbia.edu/pipermail/kvmarm/2013-November/006259.html) > >> >> > >> >> In general, this patch is more of an optimization for an > >> >> oversubscribed system having number of VCPUs more than > >> >> underlying host CPUs. > >> > > >> > Sounds good. Do you have any numbers for the improvement? > >> > >> I did not try any benchmarks myself but hackbench shows good > >> improvement for KVM hence it is a worthy optimization for Xen too. > >> > >> I found this change missing for Xen hence this patch. > >> > >> > > >> >> Signed-off-by: Anup Patel <anup.patel@linaro.org> > >> >> Signed-off-by: Pranavkumar Sawargaonkar <pranavkumar@linaro.org> > >> >> Tested-by: Pranavkumar Sawargaonkar <pranavkumar@linaro.org> > >> >> --- > >> >> xen/arch/arm/traps.c | 27 ++++++++++++++++----------- > >> >> 1 file changed, 16 insertions(+), 11 deletions(-) > >> >> > >> >> diff --git a/xen/arch/arm/traps.c b/xen/arch/arm/traps.c > >> >> index 686d8b7..bc9f67a 100644 > >> >> --- a/xen/arch/arm/traps.c > >> >> +++ b/xen/arch/arm/traps.c > >> >> @@ -90,7 +90,7 @@ void __cpuinit init_traps(void) > >> >> > >> >> /* Setup hypervisor traps */ > >> >> WRITE_SYSREG(HCR_PTW|HCR_BSU_INNER|HCR_AMO|HCR_IMO|HCR_FMO|HCR_VM| > >> >> - HCR_TWI|HCR_TSC|HCR_TAC|HCR_SWIO|HCR_TIDCP, HCR_EL2); > >> >> + HCR_TWE|HCR_TWI|HCR_TSC|HCR_TAC|HCR_SWIO|HCR_TIDCP, HCR_EL2); > >> >> isb(); > >> >> } > >> >> > >> >> @@ -1803,16 +1803,21 @@ asmlinkage void do_trap_hypervisor(struct cpu_user_regs *regs) > >> >> advance_pc(regs, hsr); > >> >> return; > >> >> } > >> >> - /* at the moment we only trap WFI */ > >> >> - vcpu_block(); > >> >> - /* The ARM spec declares that even if local irqs are masked in > >> >> - * the CPSR register, an irq should wake up a cpu from WFI anyway. > >> >> - * For this reason we need to check for irqs that need delivery, > >> >> - * ignoring the CPSR register, *after* calling SCHEDOP_block to > >> >> - * avoid races with vgic_vcpu_inject_irq. > >> >> - */ > >> >> - if ( local_events_need_delivery_nomask() ) > >> >> - vcpu_unblock(current); > >> >> + if ( hsr.bits & 0x1 ) { > >> > > >> > Please can you add an appropriate struct hsr_wfe_wfi to union hsr and > >> > use that instead of opencoding bit patterns in hsr.bits. > >> > >> Sure, I will update this and send it right away. > > > > Thanks. > > > >> >> + /* Yield the VCPU for WFE */ > >> >> + vcpu_force_reschedule(current); > >> > > >> > This is OK fine by me, but I'm wondering: Does the HW have any support > >> > for actually blocking until an event occurs in another vcpu (some sort > >> > of trap on sev)? I don't think so, nor am I really sure how such a h/w > >> > feature might work... > >> > >> Typically, on HW the WFE instruction forces cores to go in low power > >> mode. The core will come out of low power mode due to some interrupt > >> or SEV executed from another core. > > > > Right, but for virt in order to put a wfe-ing vcpu to sleep we would > > need to be able to get some sort of hypervisor event when another vcpu > > in the same guest executed sev. > > > > Otherwise the best we can do is to yield as you've done here, but that > > will still keep spinning if there are no other vcpus to run. > > Yes, we cannot trap SEV instruction. Plus it would be unwieldy to have to enable that for all *other* vcpus when one did a WFE. > Yield seems to be the only option here, unless we modify the guest and > replace SEV instruction with a hypercall. Perhaps one day we can add PARAVIRT_SPINLOCKS to arm. But not now ;-) Ian.
diff --git a/xen/arch/arm/traps.c b/xen/arch/arm/traps.c index 686d8b7..bc9f67a 100644 --- a/xen/arch/arm/traps.c +++ b/xen/arch/arm/traps.c @@ -90,7 +90,7 @@ void __cpuinit init_traps(void) /* Setup hypervisor traps */ WRITE_SYSREG(HCR_PTW|HCR_BSU_INNER|HCR_AMO|HCR_IMO|HCR_FMO|HCR_VM| - HCR_TWI|HCR_TSC|HCR_TAC|HCR_SWIO|HCR_TIDCP, HCR_EL2); + HCR_TWE|HCR_TWI|HCR_TSC|HCR_TAC|HCR_SWIO|HCR_TIDCP, HCR_EL2); isb(); } @@ -1803,16 +1803,21 @@ asmlinkage void do_trap_hypervisor(struct cpu_user_regs *regs) advance_pc(regs, hsr); return; } - /* at the moment we only trap WFI */ - vcpu_block(); - /* The ARM spec declares that even if local irqs are masked in - * the CPSR register, an irq should wake up a cpu from WFI anyway. - * For this reason we need to check for irqs that need delivery, - * ignoring the CPSR register, *after* calling SCHEDOP_block to - * avoid races with vgic_vcpu_inject_irq. - */ - if ( local_events_need_delivery_nomask() ) - vcpu_unblock(current); + if ( hsr.bits & 0x1 ) { + /* Yield the VCPU for WFE */ + vcpu_force_reschedule(current); + } else { + /* Block the VCPU for WFI */ + vcpu_block(); + /* The ARM spec declares that even if local irqs are masked in + * the CPSR register, an irq should wake up a cpu from WFI anyway. + * For this reason we need to check for irqs that need delivery, + * ignoring the CPSR register, *after* calling SCHEDOP_block to + * avoid races with vgic_vcpu_inject_irq. + */ + if ( local_events_need_delivery_nomask() ) + vcpu_unblock(current); + } advance_pc(regs, hsr); break; case HSR_EC_CP15_32: