Message ID | 1410274423-9461-1-git-send-email-peter.maydell@linaro.org |
---|---|
State | Superseded |
Headers | show |
On Tue, Sep 09, 2014 at 03:53:43PM +0100, Peter Maydell wrote: > The pl011 and pl031 devices both use level triggered interrupts, > but the device tree we construct was incorrectly telling the > kernel to configure the GIC to treat them as edge triggered. > This meant that output from the pl011 would hang after a while. > > Signed-off-by: Peter Maydell <peter.maydell@linaro.org> > Cc: qemu-stable@nongnu.org > --- > Thanks to Christoffer Dall for figuring out the cause of the hangs here. > > hw/arm/virt.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/hw/arm/virt.c b/hw/arm/virt.c > index e8f231e..1b343f0 100644 > --- a/hw/arm/virt.c > +++ b/hw/arm/virt.c > @@ -371,7 +371,7 @@ static void create_uart(const VirtBoardInfo *vbi, qemu_irq *pic) > 2, base, 2, size); > qemu_fdt_setprop_cells(vbi->fdt, nodename, "interrupts", > GIC_FDT_IRQ_TYPE_SPI, irq, > - GIC_FDT_IRQ_FLAGS_EDGE_LO_HI); > + GIC_FDT_IRQ_FLAGS_LEVEL_HI); > qemu_fdt_setprop_cells(vbi->fdt, nodename, "clocks", > vbi->clock_phandle, vbi->clock_phandle); > qemu_fdt_setprop(vbi->fdt, nodename, "clock-names", > @@ -398,7 +398,7 @@ static void create_rtc(const VirtBoardInfo *vbi, qemu_irq *pic) > 2, base, 2, size); > qemu_fdt_setprop_cells(vbi->fdt, nodename, "interrupts", > GIC_FDT_IRQ_TYPE_SPI, irq, > - GIC_FDT_IRQ_FLAGS_EDGE_LO_HI); > + GIC_FDT_IRQ_FLAGS_LEVEL_HI); > qemu_fdt_setprop_cell(vbi->fdt, nodename, "clocks", vbi->clock_phandle); > qemu_fdt_setprop_string(vbi->fdt, nodename, "clock-names", "apb_pclk"); > g_free(nodename); > -- > 1.9.1 > I've been trying to figure out why we would see this particular hang with SMP and not UP (or maybe it is just very much more unlikely to happen with UP), but haven't been able to come up with a sequence of events to support this yet. It also worries me that we weren't seeing this with KVM, since it indicates that we're either doing something wrong in the KVM or QEMU GIC emulation code, potentially. In any case, this patch is correct, so: Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
On 10 September 2014 12:43, Christoffer Dall <christoffer.dall@linaro.org> wrote: > On Tue, Sep 09, 2014 at 03:53:43PM +0100, Peter Maydell wrote: >> The pl011 and pl031 devices both use level triggered interrupts, >> but the device tree we construct was incorrectly telling the >> kernel to configure the GIC to treat them as edge triggered. >> This meant that output from the pl011 would hang after a while. >> >> Signed-off-by: Peter Maydell <peter.maydell@linaro.org> >> Cc: qemu-stable@nongnu.org >> --- >> Thanks to Christoffer Dall for figuring out the cause of the hangs here. >> >> hw/arm/virt.c | 4 ++-- >> 1 file changed, 2 insertions(+), 2 deletions(-) >> >> diff --git a/hw/arm/virt.c b/hw/arm/virt.c >> index e8f231e..1b343f0 100644 >> --- a/hw/arm/virt.c >> +++ b/hw/arm/virt.c >> @@ -371,7 +371,7 @@ static void create_uart(const VirtBoardInfo *vbi, qemu_irq *pic) >> 2, base, 2, size); >> qemu_fdt_setprop_cells(vbi->fdt, nodename, "interrupts", >> GIC_FDT_IRQ_TYPE_SPI, irq, >> - GIC_FDT_IRQ_FLAGS_EDGE_LO_HI); >> + GIC_FDT_IRQ_FLAGS_LEVEL_HI); >> qemu_fdt_setprop_cells(vbi->fdt, nodename, "clocks", >> vbi->clock_phandle, vbi->clock_phandle); >> qemu_fdt_setprop(vbi->fdt, nodename, "clock-names", >> @@ -398,7 +398,7 @@ static void create_rtc(const VirtBoardInfo *vbi, qemu_irq *pic) >> 2, base, 2, size); >> qemu_fdt_setprop_cells(vbi->fdt, nodename, "interrupts", >> GIC_FDT_IRQ_TYPE_SPI, irq, >> - GIC_FDT_IRQ_FLAGS_EDGE_LO_HI); >> + GIC_FDT_IRQ_FLAGS_LEVEL_HI); >> qemu_fdt_setprop_cell(vbi->fdt, nodename, "clocks", vbi->clock_phandle); >> qemu_fdt_setprop_string(vbi->fdt, nodename, "clock-names", "apb_pclk"); >> g_free(nodename); >> -- >> 1.9.1 >> > I've been trying to figure out why we would see this particular hang > with SMP and not UP (or maybe it is just very much more unlikely to > happen with UP), but haven't been able to come up with a sequence of > events to support this yet. It also worries me that we weren't seeing > this with KVM, since it indicates that we're either doing something > wrong in the KVM or QEMU GIC emulation code, potentially. > > In any case, this patch is correct, so: > > Acked-by: Christoffer Dall <christoffer.dall@linaro.org> I am still seeing lockups and self detected stalls even with this patch, and sometimes just hitting a key into the console will get things moving again. So I am not convinced yet whether this fixes something fundamentally, or just works around it by taking an alternative code path through the kernel which doesn't trigger the same root bug.
On Wed, Sep 10, 2014 at 12:48 PM, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote: > On 10 September 2014 12:43, Christoffer Dall > <christoffer.dall@linaro.org> wrote: >> On Tue, Sep 09, 2014 at 03:53:43PM +0100, Peter Maydell wrote: >>> The pl011 and pl031 devices both use level triggered interrupts, >>> but the device tree we construct was incorrectly telling the >>> kernel to configure the GIC to treat them as edge triggered. >>> This meant that output from the pl011 would hang after a while. >>> >>> Signed-off-by: Peter Maydell <peter.maydell@linaro.org> >>> Cc: qemu-stable@nongnu.org >>> --- >>> Thanks to Christoffer Dall for figuring out the cause of the hangs here. >>> >>> hw/arm/virt.c | 4 ++-- >>> 1 file changed, 2 insertions(+), 2 deletions(-) >>> >>> diff --git a/hw/arm/virt.c b/hw/arm/virt.c >>> index e8f231e..1b343f0 100644 >>> --- a/hw/arm/virt.c >>> +++ b/hw/arm/virt.c >>> @@ -371,7 +371,7 @@ static void create_uart(const VirtBoardInfo *vbi, qemu_irq *pic) >>> 2, base, 2, size); >>> qemu_fdt_setprop_cells(vbi->fdt, nodename, "interrupts", >>> GIC_FDT_IRQ_TYPE_SPI, irq, >>> - GIC_FDT_IRQ_FLAGS_EDGE_LO_HI); >>> + GIC_FDT_IRQ_FLAGS_LEVEL_HI); >>> qemu_fdt_setprop_cells(vbi->fdt, nodename, "clocks", >>> vbi->clock_phandle, vbi->clock_phandle); >>> qemu_fdt_setprop(vbi->fdt, nodename, "clock-names", >>> @@ -398,7 +398,7 @@ static void create_rtc(const VirtBoardInfo *vbi, qemu_irq *pic) >>> 2, base, 2, size); >>> qemu_fdt_setprop_cells(vbi->fdt, nodename, "interrupts", >>> GIC_FDT_IRQ_TYPE_SPI, irq, >>> - GIC_FDT_IRQ_FLAGS_EDGE_LO_HI); >>> + GIC_FDT_IRQ_FLAGS_LEVEL_HI); >>> qemu_fdt_setprop_cell(vbi->fdt, nodename, "clocks", vbi->clock_phandle); >>> qemu_fdt_setprop_string(vbi->fdt, nodename, "clock-names", "apb_pclk"); >>> g_free(nodename); >>> -- >>> 1.9.1 >>> >> I've been trying to figure out why we would see this particular hang >> with SMP and not UP (or maybe it is just very much more unlikely to >> happen with UP), but haven't been able to come up with a sequence of >> events to support this yet. It also worries me that we weren't seeing >> this with KVM, since it indicates that we're either doing something >> wrong in the KVM or QEMU GIC emulation code, potentially. >> >> In any case, this patch is correct, so: >> >> Acked-by: Christoffer Dall <christoffer.dall@linaro.org> > > I am still seeing lockups and self detected stalls even with this > patch, and sometimes just hitting a key into the console will get > things moving again. > So I am not convinced yet whether this fixes something fundamentally, > or just works around it by taking an alternative code path through the > kernel which doesn't trigger the same root bug. As far as I can tell, this fix won't change any other behavior in the kernel than configuring the GICD for that IRQ to level/edge on system boot. I also saw the lockups of the system that went away when giving the pl011 some inputs, but I thought I only saw this before the fix - admittedly I didn't test this very thoroughly. I wonder if we are losing timer interrupts due to the same artifact....? -Christoffer
On Wed, Sep 10, 2014 at 9:32 PM, Christoffer Dall <christoffer.dall@linaro.org> wrote: > On Wed, Sep 10, 2014 at 12:48 PM, Ard Biesheuvel > <ard.biesheuvel@linaro.org> wrote: >> On 10 September 2014 12:43, Christoffer Dall >> <christoffer.dall@linaro.org> wrote: >>> On Tue, Sep 09, 2014 at 03:53:43PM +0100, Peter Maydell wrote: >>>> The pl011 and pl031 devices both use level triggered interrupts, >>>> but the device tree we construct was incorrectly telling the >>>> kernel to configure the GIC to treat them as edge triggered. >>>> This meant that output from the pl011 would hang after a while. >>>> >>>> Signed-off-by: Peter Maydell <peter.maydell@linaro.org> >>>> Cc: qemu-stable@nongnu.org >>>> --- >>>> Thanks to Christoffer Dall for figuring out the cause of the hangs here. >>>> >>>> hw/arm/virt.c | 4 ++-- >>>> 1 file changed, 2 insertions(+), 2 deletions(-) >>>> >>>> diff --git a/hw/arm/virt.c b/hw/arm/virt.c >>>> index e8f231e..1b343f0 100644 >>>> --- a/hw/arm/virt.c >>>> +++ b/hw/arm/virt.c >>>> @@ -371,7 +371,7 @@ static void create_uart(const VirtBoardInfo *vbi, qemu_irq *pic) >>>> 2, base, 2, size); >>>> qemu_fdt_setprop_cells(vbi->fdt, nodename, "interrupts", >>>> GIC_FDT_IRQ_TYPE_SPI, irq, >>>> - GIC_FDT_IRQ_FLAGS_EDGE_LO_HI); >>>> + GIC_FDT_IRQ_FLAGS_LEVEL_HI); >>>> qemu_fdt_setprop_cells(vbi->fdt, nodename, "clocks", >>>> vbi->clock_phandle, vbi->clock_phandle); >>>> qemu_fdt_setprop(vbi->fdt, nodename, "clock-names", >>>> @@ -398,7 +398,7 @@ static void create_rtc(const VirtBoardInfo *vbi, qemu_irq *pic) >>>> 2, base, 2, size); >>>> qemu_fdt_setprop_cells(vbi->fdt, nodename, "interrupts", >>>> GIC_FDT_IRQ_TYPE_SPI, irq, >>>> - GIC_FDT_IRQ_FLAGS_EDGE_LO_HI); >>>> + GIC_FDT_IRQ_FLAGS_LEVEL_HI); >>>> qemu_fdt_setprop_cell(vbi->fdt, nodename, "clocks", vbi->clock_phandle); >>>> qemu_fdt_setprop_string(vbi->fdt, nodename, "clock-names", "apb_pclk"); >>>> g_free(nodename); >>>> -- >>>> 1.9.1 >>>> >>> I've been trying to figure out why we would see this particular hang >>> with SMP and not UP (or maybe it is just very much more unlikely to >>> happen with UP), but haven't been able to come up with a sequence of >>> events to support this yet. It also worries me that we weren't seeing >>> this with KVM, since it indicates that we're either doing something >>> wrong in the KVM or QEMU GIC emulation code, potentially. >>> >>> In any case, this patch is correct, so: >>> >>> Acked-by: Christoffer Dall <christoffer.dall@linaro.org> >> >> I am still seeing lockups and self detected stalls even with this >> patch, and sometimes just hitting a key into the console will get >> things moving again. >> So I am not convinced yet whether this fixes something fundamentally, >> or just works around it by taking an alternative code path through the >> kernel which doesn't trigger the same root bug. > > As far as I can tell, this fix won't change any other behavior in the > kernel than configuring the GICD for that IRQ to level/edge on system > boot. > > I also saw the lockups of the system that went away when giving the > pl011 some inputs, but I thought I only saw this before the fix - > admittedly I didn't test this very thoroughly. > > I wonder if we are losing timer interrupts due to the same artifact....? > I recently tried to debug SMP lockups on vexpress with bisection pointing at: commit 6d327171551a12b937c5718073b984 8d0274c74d Author: Alex Bligh <alex@alex.org.uk> Date: Wed Aug 21 16:02:59 2013 +0100 aio / timers: Remove alarm timers Remove alarm timers from qemu-timers.c now we use g_poll / ppoll instead. Signed-off-by: Alex Bligh <alex@alex.org.uk> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Unfortunately this patch predates the virt machine, but I am highly suspicious of the same root cause - self detected stalling was amongst a range of symptoms I was seeing. My best theory is a TCG mainloop bug which is consistent with your works-on-KVM observation. I had a whole bunch of patches that added traces to arm timer and interrupts etc. I should probably send them through. My basic conclusion was that that CPU exit_request were comming in pairs with only a short delay between them (for the timer most likely). Looking at the tcg mainloop: static void tcg_exec_all(void) { int r; /* Account partial waits to QEMU_CLOCK_VIRTUAL. */ qemu_clock_warp(QEMU_CLOCK_VIRTUAL); if (next_cpu == NULL) { next_cpu = first_cpu; } for (; next_cpu != NULL && !exit_request; next_cpu = CPU_NEXT(next_cpu)) { CPUState *cpu = next_cpu; CPUArchState *env = cpu->env_ptr; qemu_clock_enable(QEMU_CLOCK_VIRTUAL, (cpu->singlestep_enabled & SSTEP_NOTIMER) == 0); if (cpu_can_run(cpu)) { r = tcg_cpu_exec(env); if (r == EXCP_DEBUG) { cpu_handle_guest_debug(cpu); break; } } else if (cpu->stop || cpu->stopped) { break; } } exit_request = 0; } The for loop implements a round robin schedule. But with the double exit requests I was seeing, the first exit_request would cause the currently executing CPU to exit, then the second exit_request would cause the second CPU to exit straight away with no code execution moving schedule back to the first. This pattern occured consistently for many cycles starving the second CPU long term. This often leading to secondary CPU probe failure. That qemu_clock_enable and the qemu_clock_warp are a little suspicous to me. Can they be the source of exit_requests causing CPU starvation? Regards, Peter > -Christoffer >
diff --git a/hw/arm/virt.c b/hw/arm/virt.c index e8f231e..1b343f0 100644 --- a/hw/arm/virt.c +++ b/hw/arm/virt.c @@ -371,7 +371,7 @@ static void create_uart(const VirtBoardInfo *vbi, qemu_irq *pic) 2, base, 2, size); qemu_fdt_setprop_cells(vbi->fdt, nodename, "interrupts", GIC_FDT_IRQ_TYPE_SPI, irq, - GIC_FDT_IRQ_FLAGS_EDGE_LO_HI); + GIC_FDT_IRQ_FLAGS_LEVEL_HI); qemu_fdt_setprop_cells(vbi->fdt, nodename, "clocks", vbi->clock_phandle, vbi->clock_phandle); qemu_fdt_setprop(vbi->fdt, nodename, "clock-names", @@ -398,7 +398,7 @@ static void create_rtc(const VirtBoardInfo *vbi, qemu_irq *pic) 2, base, 2, size); qemu_fdt_setprop_cells(vbi->fdt, nodename, "interrupts", GIC_FDT_IRQ_TYPE_SPI, irq, - GIC_FDT_IRQ_FLAGS_EDGE_LO_HI); + GIC_FDT_IRQ_FLAGS_LEVEL_HI); qemu_fdt_setprop_cell(vbi->fdt, nodename, "clocks", vbi->clock_phandle); qemu_fdt_setprop_string(vbi->fdt, nodename, "clock-names", "apb_pclk"); g_free(nodename);
The pl011 and pl031 devices both use level triggered interrupts, but the device tree we construct was incorrectly telling the kernel to configure the GIC to treat them as edge triggered. This meant that output from the pl011 would hang after a while. Signed-off-by: Peter Maydell <peter.maydell@linaro.org> Cc: qemu-stable@nongnu.org --- Thanks to Christoffer Dall for figuring out the cause of the hangs here. hw/arm/virt.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)