Message ID | 1502095997-31219-1-git-send-email-bhupinder.thakur@linaro.org |
---|---|
Headers | show |
Series | SBSA UART emulation support in Xen | expand |
Hi Bhupinder, I gave another and I have a couple of comments. Booting Linux with earlycon enabled take quite a while. I can see the characters coming slower than on the minitel. It seems to be a bit better after switching off the bootconsole. Overall Linux is taking ~20 times to boot with pl011 vs HVC console. I do agree that pl011 is emulated and therefore you have to trap after each character. But 20 times sounds far too much. After that I tried to stress the emulation a bit with "find ." to get a lot of output. And I noticed a lot of message similar to the one below on xen console: d6v0 vpl011: Unexpected OUT ring buffer full Associated to that the character have been eaten resulting to non-sense log. A bit above the printk printing this message, there are a comment saying: /* * It is expected that the ring is not full when this function is called * as the guest is expected to write to the data register only when the * TXFF flag is not set. * In case the guest does write even when the TXFF flag is set then the * data will be silently dropped. */ I am quite surprised that Linux is not looking at the TXFF flags. So this needs some investigation. Cheers, On 07/08/17 09:52, Bhupinder Thakur wrote: > SBSA UART emulation for guests in Xen > ====================================== > Linaro has published VM System specification for ARM Processors, which > provides a set of guidelines for both guest OS and hypervisor implementations, > such that building OS images according to these guidelines guarantees > that those images can also run on hypervisors compliant with this specification. > > One of the spec requirements is that the hypervisor must provide an > emulated SBSA UART as a serial console which meets the minimum requirements in > SBSA UART as defined in appendix B of the following > ARM Server Base Architecture Document: > > https://static.docs.arm.com/den0029/a/Server_Base_System_Architecture_v3_1_ARM_DEN_0029A.pdf. > > This feature allows the Xen guests to use SBSA compliant UART as > as a console. > > Note that SBSA UART is a subset of full featured ARM pl011 UART and > supports only a subset of registers as mentioned below. It does not support > rx/tx DMA. > > Currently, Xen supports paravirtualized (aka PV console) and an emulated serial > consoles. This feature will expose an emulated SBSA UART console to the > guest, which a user can access using xenconsole. > > The device tree passed to the guest VM will contain the SBSA UART MMIO address > range and an irq for receiving rx/tx interrupts. The device tree format > is specified in Documentation/devicetree/bindings/serial/arm_sbsa_uart.txt. > > The Xen hypervisor will expose two types of interfaces to the backend and domU. > > The interface exposed to domU will be an emulated SBSA UART by emulating the > access to the following registers by the guest. > > - Data register (DR) - RW > - Raw interrupt status register (RIS) - RO > - Masked interrupt status register (MIS)- RO > - Interrupt Mask (IMSC) - RW > - Interrupt Clear (ICR) - WO > > It will also inject the interrupts to the guest in the following > conditions: > > - incoming data in the rx buffer for the guest > - there is space in the tx buffer for the guest to write more data > > The interface exposed to the backend will be the same PV console interface, > which minimizes the changes required in xenconsole to support a new SBSA UART > console. > > This interface has rx and tx ring buffers and an event channel for > sending/receiving events from the backend. > > So essentially Xen handles the data on behalf of domU and the backend. Any data > written by domU is captured by Xen and written to the TX (OUT) ring buffer > and an event is raised to the backend to read the TX ring buffer. > > Similarly on reciving an event from xenconsole, Xen injects an interrupt to guest to > indicate there is data available in the RX (IN) ring buffer. > > The SBSA UART state is completely captured in the set of registers > mentioned above and this state is updated everytime there is an event from > the backend or there is register read/write access from domU. > > For example, if domU has masked the rx interrupt in the IMSC register, then Xen > will not inject an interrupt to guest and will just update the RIS register. > Once the interrupt is unmasked by guest, the interrupt will be delivered to the > guest. > > Changes summary: > > Xen Hypervisor > =============== > > 1. Add emulation code to emulate read/write access to SBSA UART registers and > interrupts: > - It emulates DR read/write by reading and writing from/to the IN and > OUT ring buffers and raising an event to dom0 when there is data in > the OUT ring buffer and injecting an interrupt to the guest when there > is data in the IN ring buffer. > - Other registers are related to interrupt management and essentially > control when interrupts are delivered to the guest. > > 2. Add a new domctl API to initialize SBSA UART emulation in Xen. > > 3. Enable SBSA UART emulation for a domain based on a libxl option passed during > domain creation. > > Toolstack > ========== > > 1. Add a new option "vuart" in the domU configuration file to enable/disable vuart. > > 2. Create a SBSA UART DT node in the guest device tree. It uses a fixed > SPI IRQ number and MMIO address range for SBSA UART. > > 3. Call vuart init DOMCTL API to enable SBSA UART emulation. > > 5. Add a new vuart xenstore node, which contains: > - ring-ref > - event channel > - buffer limit > - type > > Xenconsoled > ============ > > 1. Split the domain structure to support multiple consoles. > > 2. Modify different APIs such as buffer_append() etc. to operate on the > console structure. > > 3. Add support for handling multiple consoles. > > 4. Add support for vuart console: > > The vpl011 changes available at the following repo: > > url: https://git@git.linaro.org:/people/bhupinder.thakur/xen.git > branch: vpl011_v6 > > Kindly wait for one day to checkout the code from the above URL. > > There are some TBD items which need to be looked at in the future: > > 1. Currently UEFI firmware logs the output to hvc console only. How can > UEFI firmware be made aware of pl011 console and how it can use it > as a console instead of hvc. > > There was a discussion on this and it was decided that SBSA UART should > be used as a debug port by the UEFI firmware so that all debug output > is redirected to this port. > > 2. Linux seems to have hvc console as the default console i.e. if no > console is specified then it uses hvc as the console. How can an > option be provided in Linux to select either hvc or pl011 as the > default console. > > It was suggeted to use the SPCR in ACPI and the stdout-path option in the > device tree to specify the default console. However, currently hvc console > is not describable in the ACPI/device tree. This support will have to be > added to allow the user to specify the default console. > > 3. ACPI support for pl011 device. > > CC: Andrew Cooper <andrew.cooper3@citrix.com> > CC: George Dunlap <George.Dunlap@eu.citrix.com> > CC: Ian Jackson <ian.jackson@eu.citrix.com> > CC: Jan Beulich <jbeulich@suse.com> > CC: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> > CC: Stefano Stabellini <sstabellini@kernel.org> > CC: Tim Deegan <tim@xen.org> > CC: Wei Liu <wei.liu2@citrix.com> > CC: Julien Grall <julien.grall@arm.com> > CC: Andre Przywara <andre.przywara@arm.com> > > Bhupinder Thakur (25): > xen/arm: vpl011: Define common ring buffer helper functions in > console.h > xen/arm: vpl011: Add SBSA UART emulation in Xen > xen/arm: vpl011: Allocate a new GFN in the toolstack for vuart > xen/arm: vpl011: Add support for vuart in libxl > xen/arm: vpl011: Rearrange xen header includes in alphabetical order > in domctl.c > xen/arm: vpl011: Add a new domctl API to initialize vpl011 > xen/arm: vpl011: Add a new vuart node in the xenstore > xen/arm: vpl011: Modify xenconsole to define and use a new console > structure > xen/arm: vpl011: Rename the console structure field conspath to xspath > xen/arm: vpl011: Modify xenconsole functions to take console structure > as input > xen/arm: vpl011: Add a new console_init function in xenconsole > xen/arm: vpl011: Add a new buffer_available function in xenconsole > xen/arm: vpl011: Add a new maybe_add_console_evtchn_fd function in > xenconsole > xen/arm: vpl011: Add a new maybe_add_console_tty_fd function in > xenconsole > xen/arm: vpl011: Add a new console_evtchn_unmask function in > xenconsole > xen/arm: vpl011: Add a new handle_console_ring function in xenconsole > xen/arm: vpl011: Add a new handle_console_tty function in xenconsole > xen/arm: vpl011: Add a new console_cleanup function in xenconsole > xen/arm: vpl011: Add a new console_open_log function in xenconsole > xen/arm: vpl011: Add a new console_close_evtchn function in xenconsole > xen/arm: vpl011: Add support for multiple consoles in xenconsole > xen/arm: vpl011: Add support for vuart console in xenconsole > xen/arm: vpl011: Add a new vuart console type to xenconsole client > xen/arm: vpl011: Add a pl011 uart DT node in the guest device tree > xen/arm: vpl011: Update documentation for vuart console support > > config/arm32.mk | 1 + > config/arm64.mk | 1 + > docs/man/xl.cfg.pod.5.in | 12 + > docs/misc/console.txt | 44 ++- > tools/console/Makefile | 3 +- > tools/console/client/main.c | 13 +- > tools/console/daemon/io.c | 659 +++++++++++++++++++++++------------ > tools/libxc/include/xc_dom.h | 2 + > tools/libxc/include/xenctrl.h | 20 ++ > tools/libxc/xc_dom_arm.c | 5 +- > tools/libxc/xc_dom_boot.c | 2 + > tools/libxc/xc_domain.c | 25 ++ > tools/libxl/libxl.h | 5 + > tools/libxl/libxl_arch.h | 7 + > tools/libxl/libxl_arm.c | 74 ++++ > tools/libxl/libxl_console.c | 47 +++ > tools/libxl/libxl_create.c | 9 +- > tools/libxl/libxl_device.c | 9 +- > tools/libxl/libxl_dom.c | 5 + > tools/libxl/libxl_internal.h | 6 + > tools/libxl/libxl_types.idl | 7 + > tools/libxl/libxl_types_internal.idl | 1 + > tools/libxl/libxl_x86.c | 8 + > tools/xl/xl_cmdtable.c | 2 +- > tools/xl/xl_console.c | 5 +- > tools/xl/xl_parse.c | 8 + > xen/arch/arm/Kconfig | 7 + > xen/arch/arm/Makefile | 1 + > xen/arch/arm/domain.c | 6 + > xen/arch/arm/domctl.c | 48 ++- > xen/arch/arm/vpl011.c | 454 ++++++++++++++++++++++++ > xen/include/asm-arm/domain.h | 6 + > xen/include/asm-arm/pl011-uart.h | 2 + > xen/include/asm-arm/vpl011.h | 72 ++++ > xen/include/public/arch-arm.h | 6 + > xen/include/public/domctl.h | 21 ++ > xen/include/public/io/console.h | 4 + > 37 files changed, 1366 insertions(+), 241 deletions(-) > create mode 100644 xen/arch/arm/vpl011.c > create mode 100644 xen/include/asm-arm/vpl011.h >
Hi Julien, Thanks for the testing. On 8 August 2017 at 21:29, Julien Grall <julien.grall@arm.com> wrote: > Hi Bhupinder, > > I gave another and I have a couple of comments. > > Booting Linux with earlycon enabled take quite a while. I can see the > characters coming slower than on the minitel. It seems to be a bit better > after switching off the bootconsole. Overall Linux is taking ~20 times to > boot with pl011 vs HVC console. > > I do agree that pl011 is emulated and therefore you have to trap after each > character. But 20 times sounds far too much. > I think this slowness could be due to ratelimiting of the pl011 events in xenconosle. Currently, the rate limit is set to 30 events per 200 msecs (see RATE_LIMIT_ALLOWANCE/RATE_LIMIT_PERIOD). I increased the rate limit to 600 events (30 * 20) per 200 msecs. With this change, I see that the the find command is running faster and smoother. Earlier the find output would be jerky. > After that I tried to stress the emulation a bit with "find ." to get a lot > of output. And I noticed a lot of message similar to the one below on xen > console: > > d6v0 vpl011: Unexpected OUT ring buffer full > > Associated to that the character have been eaten resulting to non-sense log. > > A bit above the printk printing this message, there are a comment saying: > > /* > * It is expected that the ring is not full when this function is called > * as the guest is expected to write to the data register only when the > * TXFF flag is not set. > * In case the guest does write even when the TXFF flag is set then the > * data will be silently dropped. > */ > > I am quite surprised that Linux is not looking at the TXFF flags. So this > needs some investigation. I ran 'find' but could not reproduce the issue. I will try to reproduce this issue by reducing the OUT buffer size. Regards, Bhupinder
On Wed, Aug 09, 2017 at 04:28:14PM +0530, Bhupinder Thakur wrote: > Hi Julien, > > Thanks for the testing. > > On 8 August 2017 at 21:29, Julien Grall <julien.grall@arm.com> wrote: > > Hi Bhupinder, > > > > I gave another and I have a couple of comments. > > > > Booting Linux with earlycon enabled take quite a while. I can see the > > characters coming slower than on the minitel. It seems to be a bit better > > after switching off the bootconsole. Overall Linux is taking ~20 times to > > boot with pl011 vs HVC console. > > > > I do agree that pl011 is emulated and therefore you have to trap after each > > character. But 20 times sounds far too much. > > > I think this slowness could be due to ratelimiting of the pl011 events > in xenconosle. Currently, the rate limit is > set to 30 events per 200 msecs (see RATE_LIMIT_ALLOWANCE/RATE_LIMIT_PERIOD). > > I increased the rate limit to 600 events (30 * 20) per 200 msecs. With > this change, > I see that the the find command is running faster and smoother. > Earlier the find output would be jerky. Please consider batching the events instead of bumping the limit.
Hi Wei, On 9 August 2017 at 16:33, Wei Liu <wei.liu2@citrix.com> wrote: > On Wed, Aug 09, 2017 at 04:28:14PM +0530, Bhupinder Thakur wrote: >> Hi Julien, >> >> Thanks for the testing. >> >> On 8 August 2017 at 21:29, Julien Grall <julien.grall@arm.com> wrote: >> > Hi Bhupinder, >> > >> > I gave another and I have a couple of comments. >> > >> > Booting Linux with earlycon enabled take quite a while. I can see the >> > characters coming slower than on the minitel. It seems to be a bit better >> > after switching off the bootconsole. Overall Linux is taking ~20 times to >> > boot with pl011 vs HVC console. >> > >> > I do agree that pl011 is emulated and therefore you have to trap after each >> > character. But 20 times sounds far too much. >> > >> I think this slowness could be due to ratelimiting of the pl011 events >> in xenconosle. Currently, the rate limit is >> set to 30 events per 200 msecs (see RATE_LIMIT_ALLOWANCE/RATE_LIMIT_PERIOD). >> >> I increased the rate limit to 600 events (30 * 20) per 200 msecs. With >> this change, >> I see that the the find command is running faster and smoother. >> Earlier the find output would be jerky. > > Please consider batching the events instead of bumping the limit. If I try to batch the events then it may delay per character processing (if no further characters are coming) since we will wait for more events to come to batch them together. By keeping the rate limit high, it is ensured that it can handle large burst of events. If there is sporadic data then the rate limit is not hit. I am thinking of making the rate limit configuration per console. Regards, Bhupinder
On Thu, Aug 10, 2017 at 01:29:04PM +0530, Bhupinder Thakur wrote: > Hi Wei, > > On 9 August 2017 at 16:33, Wei Liu <wei.liu2@citrix.com> wrote: > > On Wed, Aug 09, 2017 at 04:28:14PM +0530, Bhupinder Thakur wrote: > >> Hi Julien, > >> > >> Thanks for the testing. > >> > >> On 8 August 2017 at 21:29, Julien Grall <julien.grall@arm.com> wrote: > >> > Hi Bhupinder, > >> > > >> > I gave another and I have a couple of comments. > >> > > >> > Booting Linux with earlycon enabled take quite a while. I can see the > >> > characters coming slower than on the minitel. It seems to be a bit better > >> > after switching off the bootconsole. Overall Linux is taking ~20 times to > >> > boot with pl011 vs HVC console. > >> > > >> > I do agree that pl011 is emulated and therefore you have to trap after each > >> > character. But 20 times sounds far too much. > >> > > >> I think this slowness could be due to ratelimiting of the pl011 events > >> in xenconosle. Currently, the rate limit is > >> set to 30 events per 200 msecs (see RATE_LIMIT_ALLOWANCE/RATE_LIMIT_PERIOD). > >> > >> I increased the rate limit to 600 events (30 * 20) per 200 msecs. With > >> this change, > >> I see that the the find command is running faster and smoother. > >> Earlier the find output would be jerky. > > > > Please consider batching the events instead of bumping the limit. > > If I try to batch the events then it may delay per character > processing (if no further characters are coming) since we will wait > for more events to come to batch them together. By keeping the rate > limit high, it is ensured that it can handle large burst of events. If > there is sporadic data then the rate limit is not hit. But do you really need to send an event for every character (that's my reading of your patch, please correct me if I'm wrong)? Suppose you have a buffer you can send one event when the buffer is processed. > > I am thinking of making the rate limit configuration per console. > There is the question why vuart is special with regard to other types of console. If something is mandated by the spec it should be specifically called out to justify the decision to bump the rate limit. > Regards, > Bhupinder
Hi Wei, On 10/08/17 12:40, Wei Liu wrote: > On Thu, Aug 10, 2017 at 01:29:04PM +0530, Bhupinder Thakur wrote: >> Hi Wei, >> >> On 9 August 2017 at 16:33, Wei Liu <wei.liu2@citrix.com> wrote: >>> On Wed, Aug 09, 2017 at 04:28:14PM +0530, Bhupinder Thakur wrote: >>>> Hi Julien, >>>> >>>> Thanks for the testing. >>>> >>>> On 8 August 2017 at 21:29, Julien Grall <julien.grall@arm.com> wrote: >>>>> Hi Bhupinder, >>>>> >>>>> I gave another and I have a couple of comments. >>>>> >>>>> Booting Linux with earlycon enabled take quite a while. I can see the >>>>> characters coming slower than on the minitel. It seems to be a bit better >>>>> after switching off the bootconsole. Overall Linux is taking ~20 times to >>>>> boot with pl011 vs HVC console. >>>>> >>>>> I do agree that pl011 is emulated and therefore you have to trap after each >>>>> character. But 20 times sounds far too much. >>>>> >>>> I think this slowness could be due to ratelimiting of the pl011 events >>>> in xenconosle. Currently, the rate limit is >>>> set to 30 events per 200 msecs (see RATE_LIMIT_ALLOWANCE/RATE_LIMIT_PERIOD). >>>> >>>> I increased the rate limit to 600 events (30 * 20) per 200 msecs. With >>>> this change, >>>> I see that the the find command is running faster and smoother. >>>> Earlier the find output would be jerky. >>> >>> Please consider batching the events instead of bumping the limit. >> >> If I try to batch the events then it may delay per character >> processing (if no further characters are coming) since we will wait >> for more events to come to batch them together. By keeping the rate >> limit high, it is ensured that it can handle large burst of events. If >> there is sporadic data then the rate limit is not hit. > > But do you really need to send an event for every character (that's my > reading of your patch, please correct me if I'm wrong)? Suppose you have > a buffer you can send one event when the buffer is processed. I am not sure what you mean by buffer here. The interface with the guest is a single data register. So characters are written one by one. > >> >> I am thinking of making the rate limit configuration per console. >> > > There is the question why vuart is special with regard to other types of > console. > > If something is mandated by the spec it should be specifically called > out to justify the decision to bump the rate limit. I might have other ideas how to improve the speed. I will answer on Bhupinder's e-mail. Cheers,
On Thu, Aug 10, 2017 at 01:40:07PM +0100, Julien Grall wrote: > Hi Wei, > > On 10/08/17 12:40, Wei Liu wrote: > > On Thu, Aug 10, 2017 at 01:29:04PM +0530, Bhupinder Thakur wrote: > > > Hi Wei, > > > > > > On 9 August 2017 at 16:33, Wei Liu <wei.liu2@citrix.com> wrote: > > > > On Wed, Aug 09, 2017 at 04:28:14PM +0530, Bhupinder Thakur wrote: > > > > > Hi Julien, > > > > > > > > > > Thanks for the testing. > > > > > > > > > > On 8 August 2017 at 21:29, Julien Grall <julien.grall@arm.com> wrote: > > > > > > Hi Bhupinder, > > > > > > > > > > > > I gave another and I have a couple of comments. > > > > > > > > > > > > Booting Linux with earlycon enabled take quite a while. I can see the > > > > > > characters coming slower than on the minitel. It seems to be a bit better > > > > > > after switching off the bootconsole. Overall Linux is taking ~20 times to > > > > > > boot with pl011 vs HVC console. > > > > > > > > > > > > I do agree that pl011 is emulated and therefore you have to trap after each > > > > > > character. But 20 times sounds far too much. > > > > > > > > > > > I think this slowness could be due to ratelimiting of the pl011 events > > > > > in xenconosle. Currently, the rate limit is > > > > > set to 30 events per 200 msecs (see RATE_LIMIT_ALLOWANCE/RATE_LIMIT_PERIOD). > > > > > > > > > > I increased the rate limit to 600 events (30 * 20) per 200 msecs. With > > > > > this change, > > > > > I see that the the find command is running faster and smoother. > > > > > Earlier the find output would be jerky. > > > > > > > > Please consider batching the events instead of bumping the limit. > > > > > > If I try to batch the events then it may delay per character > > > processing (if no further characters are coming) since we will wait > > > for more events to come to batch them together. By keeping the rate > > > limit high, it is ensured that it can handle large burst of events. If > > > there is sporadic data then the rate limit is not hit. > > > > But do you really need to send an event for every character (that's my > > reading of your patch, please correct me if I'm wrong)? Suppose you have > > a buffer you can send one event when the buffer is processed. > > I am not sure what you mean by buffer here. The interface with the guest is > a single data register. So characters are written one by one. Yes, that's what the guest sees, but that is not necessarily what xenconsoled sees. > > > > > > > > > I am thinking of making the rate limit configuration per console. > > > > > > > There is the question why vuart is special with regard to other types of > > console. > > > > If something is mandated by the spec it should be specifically called > > out to justify the decision to bump the rate limit. > > I might have other ideas how to improve the speed. I will answer on > Bhupinder's e-mail. > Sure. > Cheers, > > -- > Julien Grall
On 09/08/17 11:58, Bhupinder Thakur wrote: > Hi Julien, Hi Bhupinder, > Thanks for the testing. > > On 8 August 2017 at 21:29, Julien Grall <julien.grall@arm.com> wrote: >> Hi Bhupinder, >> >> I gave another and I have a couple of comments. >> >> Booting Linux with earlycon enabled take quite a while. I can see the >> characters coming slower than on the minitel. It seems to be a bit better >> after switching off the bootconsole. Overall Linux is taking ~20 times to >> boot with pl011 vs HVC console. >> >> I do agree that pl011 is emulated and therefore you have to trap after each >> character. But 20 times sounds far too much. >> > I think this slowness could be due to ratelimiting of the pl011 events > in xenconosle. Currently, the rate limit is > set to 30 events per 200 msecs (see RATE_LIMIT_ALLOWANCE/RATE_LIMIT_PERIOD). > > I increased the rate limit to 600 events (30 * 20) per 200 msecs. With > this change, > I see that the the find command is running faster and smoother. > Earlier the find output would be jerky. I think there might be another solution avoiding increasing the rate limit. If you look at the earlycon code for pl011 in Linux: static void pl011_putc(struct uart_port *port, int c) { while (readl(port->membase + UART01x_FR) & UART01x_FR_TXFF) cpu_relax(); if (port->iotype == UPIO_MEM32) writel(c, port->membase + UART01x_DR); else writeb(c, port->membase + UART01x_DR); while (readl(port->membase + UART01x_FR) & UART01x_FR_BUSY) cpu_relax(); } Linux will wait the UART to be idle before sending a new character. Now looking at vpl011 emulation, the busy bit set when a new character is queued (see vpl011_write_data). This bit will only be cleared when the console daemon will raise an event and the queue is empty (see vpl011_data_avail). This means for earlycon, you will need a round trip Guest -> Xen -> Dom0 -> Xen -> Guest for each single character. This is a bit counterproductive and combined with the limit it makes it worse. I would take a different approach on the BUSY bit. We can consider the queue between Xen and xenconsoled as outside of the UART. If the character is queued, then job done. I think this would improve quite a lot of the performance. Also, I would append a new patch at then end of the series rather modify patch #1. This would avoid to do more review :). > >> After that I tried to stress the emulation a bit with "find ." to get a lot >> of output. And I noticed a lot of message similar to the one below on xen >> console: >> >> d6v0 vpl011: Unexpected OUT ring buffer full >> >> Associated to that the character have been eaten resulting to non-sense log. >> >> A bit above the printk printing this message, there are a comment saying: >> >> /* >> * It is expected that the ring is not full when this function is called >> * as the guest is expected to write to the data register only when the >> * TXFF flag is not set. >> * In case the guest does write even when the TXFF flag is set then the >> * data will be silently dropped. >> */ >> >> I am quite surprised that Linux is not looking at the TXFF flags. So this >> needs some investigation. > > I ran 'find' but could not reproduce the issue. Sorry I forgot to precise that you need to run find in a directory with a lot of files. A good solution would be: find / Cheers,
Hi, On 10/08/17 14:01, Wei Liu wrote: > On Thu, Aug 10, 2017 at 01:40:07PM +0100, Julien Grall wrote: >> Hi Wei, >> >> On 10/08/17 12:40, Wei Liu wrote: >>> On Thu, Aug 10, 2017 at 01:29:04PM +0530, Bhupinder Thakur wrote: >>>> Hi Wei, >>>> >>>> On 9 August 2017 at 16:33, Wei Liu <wei.liu2@citrix.com> wrote: >>>>> On Wed, Aug 09, 2017 at 04:28:14PM +0530, Bhupinder Thakur wrote: >>>>>> Hi Julien, >>>>>> >>>>>> Thanks for the testing. >>>>>> >>>>>> On 8 August 2017 at 21:29, Julien Grall <julien.grall@arm.com> wrote: >>>>>>> Hi Bhupinder, >>>>>>> >>>>>>> I gave another and I have a couple of comments. >>>>>>> >>>>>>> Booting Linux with earlycon enabled take quite a while. I can see the >>>>>>> characters coming slower than on the minitel. It seems to be a bit better >>>>>>> after switching off the bootconsole. Overall Linux is taking ~20 times to >>>>>>> boot with pl011 vs HVC console. >>>>>>> >>>>>>> I do agree that pl011 is emulated and therefore you have to trap after each >>>>>>> character. But 20 times sounds far too much. >>>>>>> >>>>>> I think this slowness could be due to ratelimiting of the pl011 events >>>>>> in xenconosle. Currently, the rate limit is >>>>>> set to 30 events per 200 msecs (see RATE_LIMIT_ALLOWANCE/RATE_LIMIT_PERIOD). >>>>>> >>>>>> I increased the rate limit to 600 events (30 * 20) per 200 msecs. With >>>>>> this change, >>>>>> I see that the the find command is running faster and smoother. >>>>>> Earlier the find output would be jerky. >>>>> >>>>> Please consider batching the events instead of bumping the limit. >>>> >>>> If I try to batch the events then it may delay per character >>>> processing (if no further characters are coming) since we will wait >>>> for more events to come to batch them together. By keeping the rate >>>> limit high, it is ensured that it can handle large burst of events. If >>>> there is sporadic data then the rate limit is not hit. >>> >>> But do you really need to send an event for every character (that's my >>> reading of your patch, please correct me if I'm wrong)? Suppose you have >>> a buffer you can send one event when the buffer is processed. >> >> I am not sure what you mean by buffer here. The interface with the guest is >> a single data register. So characters are written one by one. > > Yes, that's what the guest sees, but that is not necessarily what > xenconsoled sees. I am a bit confused here. What are you suggesting? To send a notification every N characters? In that case when do you assume that the buffer need to be flushed before N characters (e.g for the prompt)? Cheers,
On Thu, Aug 10, 2017 at 03:31:57PM +0100, Julien Grall wrote: > Hi, > > On 10/08/17 14:01, Wei Liu wrote: > > On Thu, Aug 10, 2017 at 01:40:07PM +0100, Julien Grall wrote: > > > Hi Wei, > > > > > > On 10/08/17 12:40, Wei Liu wrote: > > > > On Thu, Aug 10, 2017 at 01:29:04PM +0530, Bhupinder Thakur wrote: > > > > > Hi Wei, > > > > > > > > > > On 9 August 2017 at 16:33, Wei Liu <wei.liu2@citrix.com> wrote: > > > > > > On Wed, Aug 09, 2017 at 04:28:14PM +0530, Bhupinder Thakur wrote: > > > > > > > Hi Julien, > > > > > > > > > > > > > > Thanks for the testing. > > > > > > > > > > > > > > On 8 August 2017 at 21:29, Julien Grall <julien.grall@arm.com> wrote: > > > > > > > > Hi Bhupinder, > > > > > > > > > > > > > > > > I gave another and I have a couple of comments. > > > > > > > > > > > > > > > > Booting Linux with earlycon enabled take quite a while. I can see the > > > > > > > > characters coming slower than on the minitel. It seems to be a bit better > > > > > > > > after switching off the bootconsole. Overall Linux is taking ~20 times to > > > > > > > > boot with pl011 vs HVC console. > > > > > > > > > > > > > > > > I do agree that pl011 is emulated and therefore you have to trap after each > > > > > > > > character. But 20 times sounds far too much. > > > > > > > > > > > > > > > I think this slowness could be due to ratelimiting of the pl011 events > > > > > > > in xenconosle. Currently, the rate limit is > > > > > > > set to 30 events per 200 msecs (see RATE_LIMIT_ALLOWANCE/RATE_LIMIT_PERIOD). > > > > > > > > > > > > > > I increased the rate limit to 600 events (30 * 20) per 200 msecs. With > > > > > > > this change, > > > > > > > I see that the the find command is running faster and smoother. > > > > > > > Earlier the find output would be jerky. > > > > > > > > > > > > Please consider batching the events instead of bumping the limit. > > > > > > > > > > If I try to batch the events then it may delay per character > > > > > processing (if no further characters are coming) since we will wait > > > > > for more events to come to batch them together. By keeping the rate > > > > > limit high, it is ensured that it can handle large burst of events. If > > > > > there is sporadic data then the rate limit is not hit. > > > > > > > > But do you really need to send an event for every character (that's my > > > > reading of your patch, please correct me if I'm wrong)? Suppose you have > > > > a buffer you can send one event when the buffer is processed. > > > > > > I am not sure what you mean by buffer here. The interface with the guest is > > > a single data register. So characters are written one by one. > > > > Yes, that's what the guest sees, but that is not necessarily what > > xenconsoled sees. > > I am a bit confused here. What are you suggesting? To send a notification I think the other email you just sent is quite close to what I would suggest. I will reply there.
On Thu, Aug 10, 2017 at 03:26:07PM +0100, Julien Grall wrote: > > > On 09/08/17 11:58, Bhupinder Thakur wrote: > > Hi Julien, > > Hi Bhupinder, > > > Thanks for the testing. > > > > On 8 August 2017 at 21:29, Julien Grall <julien.grall@arm.com> wrote: > > > Hi Bhupinder, > > > > > > I gave another and I have a couple of comments. > > > > > > Booting Linux with earlycon enabled take quite a while. I can see the > > > characters coming slower than on the minitel. It seems to be a bit better > > > after switching off the bootconsole. Overall Linux is taking ~20 times to > > > boot with pl011 vs HVC console. > > > > > > I do agree that pl011 is emulated and therefore you have to trap after each > > > character. But 20 times sounds far too much. > > > > > I think this slowness could be due to ratelimiting of the pl011 events > > in xenconosle. Currently, the rate limit is > > set to 30 events per 200 msecs (see RATE_LIMIT_ALLOWANCE/RATE_LIMIT_PERIOD). > > > > I increased the rate limit to 600 events (30 * 20) per 200 msecs. With > > this change, > > I see that the the find command is running faster and smoother. > > Earlier the find output would be jerky. > > I think there might be another solution avoiding increasing the rate limit. > > If you look at the earlycon code for pl011 in Linux: > > static void pl011_putc(struct uart_port *port, int c) > { > while (readl(port->membase + UART01x_FR) & UART01x_FR_TXFF) > cpu_relax(); > if (port->iotype == UPIO_MEM32) > writel(c, port->membase + UART01x_DR); > else > writeb(c, port->membase + UART01x_DR); > while (readl(port->membase + UART01x_FR) & UART01x_FR_BUSY) > cpu_relax(); > } > > Linux will wait the UART to be idle before sending a new character. > > Now looking at vpl011 emulation, the busy bit set when a new character is > queued (see vpl011_write_data). This bit will only be cleared when the > console daemon will raise an event and the queue is empty (see > vpl011_data_avail). > > This means for earlycon, you will need a round trip Guest -> Xen -> Dom0 -> > Xen -> Guest for each single character. This is a bit counterproductive and > combined with the limit it makes it worse. > > I would take a different approach on the BUSY bit. We can consider the queue > between Xen and xenconsoled as outside of the UART. If the character is > queued, then job done. I think this would improve quite a lot of the > performance. Yes. This. The guest sees a register, which is essentially a synchronous interface to the guest. The current code, as you already see, will issue one event for every character. That's excessive. The interface between Xen and xenconsoled can be asynchronous, it can opt to queue X characters before sending an event, also setup a oneshot timer to avoid hanging. This however has some other implications -- it might not be as reliable as the original method because data is not guaranteed to hit backend. If the guest crashes very early on, depending the actual implementation you might not be able get the data.
On 10/08/17 17:00, Wei Liu wrote: > On Thu, Aug 10, 2017 at 03:26:07PM +0100, Julien Grall wrote: >> >> >> On 09/08/17 11:58, Bhupinder Thakur wrote: >>> Hi Julien, >> >> Hi Bhupinder, >> >>> Thanks for the testing. >>> >>> On 8 August 2017 at 21:29, Julien Grall <julien.grall@arm.com> wrote: >>>> Hi Bhupinder, >>>> >>>> I gave another and I have a couple of comments. >>>> >>>> Booting Linux with earlycon enabled take quite a while. I can see the >>>> characters coming slower than on the minitel. It seems to be a bit better >>>> after switching off the bootconsole. Overall Linux is taking ~20 times to >>>> boot with pl011 vs HVC console. >>>> >>>> I do agree that pl011 is emulated and therefore you have to trap after each >>>> character. But 20 times sounds far too much. >>>> >>> I think this slowness could be due to ratelimiting of the pl011 events >>> in xenconosle. Currently, the rate limit is >>> set to 30 events per 200 msecs (see RATE_LIMIT_ALLOWANCE/RATE_LIMIT_PERIOD). >>> >>> I increased the rate limit to 600 events (30 * 20) per 200 msecs. With >>> this change, >>> I see that the the find command is running faster and smoother. >>> Earlier the find output would be jerky. >> >> I think there might be another solution avoiding increasing the rate limit. >> >> If you look at the earlycon code for pl011 in Linux: >> >> static void pl011_putc(struct uart_port *port, int c) >> { >> while (readl(port->membase + UART01x_FR) & UART01x_FR_TXFF) >> cpu_relax(); >> if (port->iotype == UPIO_MEM32) >> writel(c, port->membase + UART01x_DR); >> else >> writeb(c, port->membase + UART01x_DR); >> while (readl(port->membase + UART01x_FR) & UART01x_FR_BUSY) >> cpu_relax(); >> } >> >> Linux will wait the UART to be idle before sending a new character. >> >> Now looking at vpl011 emulation, the busy bit set when a new character is >> queued (see vpl011_write_data). This bit will only be cleared when the >> console daemon will raise an event and the queue is empty (see >> vpl011_data_avail). >> >> This means for earlycon, you will need a round trip Guest -> Xen -> Dom0 -> >> Xen -> Guest for each single character. This is a bit counterproductive and >> combined with the limit it makes it worse. >> >> I would take a different approach on the BUSY bit. We can consider the queue >> between Xen and xenconsoled as outside of the UART. If the character is >> queued, then job done. I think this would improve quite a lot of the >> performance. > > Yes. This. > > The guest sees a register, which is essentially a synchronous interface > to the guest. The current code, as you already see, will issue one event > for every character. That's excessive. I am actually not suggesting to modify that at the moment. I think you may have other trouble with the interaction between the user and th console by doing that. Imagine you want to print the prompt, it may lag a bit before getting it. The only thing I suggest is to not set the BUSY bit in the UART everytime a character is queued. > > The interface between Xen and xenconsoled can be asynchronous, it can > opt to queue X characters before sending an event, also setup a oneshot > timer to avoid hanging. > > This however has some other implications -- it might not be as reliable > as the original method because data is not guaranteed to hit backend. If > the guest crashes very early on, depending the actual implementation you > might not be able get the data. Would it be possible to ask xenconsoled to dump everything on domain crash? Some kind of synchronization. Cheers,
On Thu, Aug 10, 2017 at 05:11:52PM +0100, Julien Grall wrote: > > > On 10/08/17 17:00, Wei Liu wrote: > > On Thu, Aug 10, 2017 at 03:26:07PM +0100, Julien Grall wrote: > > > > > > > > > On 09/08/17 11:58, Bhupinder Thakur wrote: > > > > Hi Julien, > > > > > > Hi Bhupinder, > > > > > > > Thanks for the testing. > > > > > > > > On 8 August 2017 at 21:29, Julien Grall <julien.grall@arm.com> wrote: > > > > > Hi Bhupinder, > > > > > > > > > > I gave another and I have a couple of comments. > > > > > > > > > > Booting Linux with earlycon enabled take quite a while. I can see the > > > > > characters coming slower than on the minitel. It seems to be a bit better > > > > > after switching off the bootconsole. Overall Linux is taking ~20 times to > > > > > boot with pl011 vs HVC console. > > > > > > > > > > I do agree that pl011 is emulated and therefore you have to trap after each > > > > > character. But 20 times sounds far too much. > > > > > > > > > I think this slowness could be due to ratelimiting of the pl011 events > > > > in xenconosle. Currently, the rate limit is > > > > set to 30 events per 200 msecs (see RATE_LIMIT_ALLOWANCE/RATE_LIMIT_PERIOD). > > > > > > > > I increased the rate limit to 600 events (30 * 20) per 200 msecs. With > > > > this change, > > > > I see that the the find command is running faster and smoother. > > > > Earlier the find output would be jerky. > > > > > > I think there might be another solution avoiding increasing the rate limit. > > > > > > If you look at the earlycon code for pl011 in Linux: > > > > > > static void pl011_putc(struct uart_port *port, int c) > > > { > > > while (readl(port->membase + UART01x_FR) & UART01x_FR_TXFF) > > > cpu_relax(); > > > if (port->iotype == UPIO_MEM32) > > > writel(c, port->membase + UART01x_DR); > > > else > > > writeb(c, port->membase + UART01x_DR); > > > while (readl(port->membase + UART01x_FR) & UART01x_FR_BUSY) > > > cpu_relax(); > > > } > > > > > > Linux will wait the UART to be idle before sending a new character. > > > > > > Now looking at vpl011 emulation, the busy bit set when a new character is > > > queued (see vpl011_write_data). This bit will only be cleared when the > > > console daemon will raise an event and the queue is empty (see > > > vpl011_data_avail). > > > > > > This means for earlycon, you will need a round trip Guest -> Xen -> Dom0 -> > > > Xen -> Guest for each single character. This is a bit counterproductive and > > > combined with the limit it makes it worse. > > > > > > I would take a different approach on the BUSY bit. We can consider the queue > > > between Xen and xenconsoled as outside of the UART. If the character is > > > queued, then job done. I think this would improve quite a lot of the > > > performance. > > > > Yes. This. > > > > The guest sees a register, which is essentially a synchronous interface > > to the guest. The current code, as you already see, will issue one event > > for every character. That's excessive. > > I am actually not suggesting to modify that at the moment. I think you may > have other trouble with the interaction between the user and th console by > doing that. Imagine you want to print the prompt, it may lag a bit before > getting it. > > The only thing I suggest is to not set the BUSY bit in the UART everytime a > character is queued. > Did you come to that conclusion that this would work by looking at the spec or Linux source code? I think it should conform to the spec, not a specific guest. But you're the maintainer, you have the final say. > > > > The interface between Xen and xenconsoled can be asynchronous, it can > > opt to queue X characters before sending an event, also setup a oneshot > > timer to avoid hanging. > > > > This however has some other implications -- it might not be as reliable > > as the original method because data is not guaranteed to hit backend. If > > the guest crashes very early on, depending the actual implementation you > > might not be able get the data. > > Would it be possible to ask xenconsoled to dump everything on domain crash? > Some kind of synchronization. > No, not at the moment. If the data is still in Xen and destroyed, xenconsoled can't do anything.
On 10/08/17 17:38, Wei Liu wrote: > On Thu, Aug 10, 2017 at 05:11:52PM +0100, Julien Grall wrote: >> >> >> On 10/08/17 17:00, Wei Liu wrote: >>> On Thu, Aug 10, 2017 at 03:26:07PM +0100, Julien Grall wrote: >>>> >>>> >>>> On 09/08/17 11:58, Bhupinder Thakur wrote: >>>>> Hi Julien, >>>> >>>> Hi Bhupinder, >>>> >>>>> Thanks for the testing. >>>>> >>>>> On 8 August 2017 at 21:29, Julien Grall <julien.grall@arm.com> wrote: >>>>>> Hi Bhupinder, >>>>>> >>>>>> I gave another and I have a couple of comments. >>>>>> >>>>>> Booting Linux with earlycon enabled take quite a while. I can see the >>>>>> characters coming slower than on the minitel. It seems to be a bit better >>>>>> after switching off the bootconsole. Overall Linux is taking ~20 times to >>>>>> boot with pl011 vs HVC console. >>>>>> >>>>>> I do agree that pl011 is emulated and therefore you have to trap after each >>>>>> character. But 20 times sounds far too much. >>>>>> >>>>> I think this slowness could be due to ratelimiting of the pl011 events >>>>> in xenconosle. Currently, the rate limit is >>>>> set to 30 events per 200 msecs (see RATE_LIMIT_ALLOWANCE/RATE_LIMIT_PERIOD). >>>>> >>>>> I increased the rate limit to 600 events (30 * 20) per 200 msecs. With >>>>> this change, >>>>> I see that the the find command is running faster and smoother. >>>>> Earlier the find output would be jerky. >>>> >>>> I think there might be another solution avoiding increasing the rate limit. >>>> >>>> If you look at the earlycon code for pl011 in Linux: >>>> >>>> static void pl011_putc(struct uart_port *port, int c) >>>> { >>>> while (readl(port->membase + UART01x_FR) & UART01x_FR_TXFF) >>>> cpu_relax(); >>>> if (port->iotype == UPIO_MEM32) >>>> writel(c, port->membase + UART01x_DR); >>>> else >>>> writeb(c, port->membase + UART01x_DR); >>>> while (readl(port->membase + UART01x_FR) & UART01x_FR_BUSY) >>>> cpu_relax(); >>>> } >>>> >>>> Linux will wait the UART to be idle before sending a new character. >>>> >>>> Now looking at vpl011 emulation, the busy bit set when a new character is >>>> queued (see vpl011_write_data). This bit will only be cleared when the >>>> console daemon will raise an event and the queue is empty (see >>>> vpl011_data_avail). >>>> >>>> This means for earlycon, you will need a round trip Guest -> Xen -> Dom0 -> >>>> Xen -> Guest for each single character. This is a bit counterproductive and >>>> combined with the limit it makes it worse. >>>> >>>> I would take a different approach on the BUSY bit. We can consider the queue >>>> between Xen and xenconsoled as outside of the UART. If the character is >>>> queued, then job done. I think this would improve quite a lot of the >>>> performance. >>> >>> Yes. This. >>> >>> The guest sees a register, which is essentially a synchronous interface >>> to the guest. The current code, as you already see, will issue one event >>> for every character. That's excessive. >> >> I am actually not suggesting to modify that at the moment. I think you may >> have other trouble with the interaction between the user and th console by >> doing that. Imagine you want to print the prompt, it may lag a bit before >> getting it. >> >> The only thing I suggest is to not set the BUSY bit in the UART everytime a >> character is queued. >> > > Did you come to that conclusion that this would work by looking at the > spec or Linux source code? I think it should conform to the spec, not a > specific guest. But you're the maintainer, you have the final say. I read both the spec and the code. From the spec: "UART busy. If this bit is set to 1, the UART is busy transmitting data. This bit remains set until the complete byte, including all the stop bits, has been sent from the shift register. This bit is set as soon as the transmit FIFO becomes non-empty, regardless of whether the UART is enabled or not." Currently, we considered that the shared ring is the FIFO of the UART. Meaning that the BUSY bit is set until xenconsoled read everything. I don't think implementing a FIFO is highly critical in an emulation (QEMU does not implement it for instance). And definitely using the shared ring brings slow down (involve multiple context switch). I would suggest to take a different approach where the BUSY is only set if we can't add more data in the shared ring. This would be clear as soon as the ring has space. If we really we could implement is small FIFO (the SBSA requested a least 32-entry separate for transmit and receive). But I don't think this is critically for a first approach. > >>> >>> The interface between Xen and xenconsoled can be asynchronous, it can >>> opt to queue X characters before sending an event, also setup a oneshot >>> timer to avoid hanging. >>> >>> This however has some other implications -- it might not be as reliable >>> as the original method because data is not guaranteed to hit backend. If >>> the guest crashes very early on, depending the actual implementation you >>> might not be able get the data. >> >> Would it be possible to ask xenconsoled to dump everything on domain crash? >> Some kind of synchronization. >> > > No, not at the moment. If the data is still in Xen and destroyed, > xenconsoled can't do anything. The vUART emulation is directly queuing the data, there are no intermediate buffer. So all the data would be in the shared ring available for xenconsoled to go through. It would be quite a useful enhancement for when the guest crash. Cheers,
Hi Julien, >> After that I tried to stress the emulation a bit with "find ." to get a lot >> of output. And I noticed a lot of message similar to the one below on xen >> console: >> >> d6v0 vpl011: Unexpected OUT ring buffer full >> >> Associated to that the character have been eaten resulting to non-sense log. >> >> A bit above the printk printing this message, there are a comment saying: >> >> /* >> * It is expected that the ring is not full when this function is called >> * as the guest is expected to write to the data register only when the >> * TXFF flag is not set. >> * In case the guest does write even when the TXFF flag is set then the >> * data will be silently dropped. >> */ >> >> I am quite surprised that Linux is not looking at the TXFF flags. So this >> needs some investigation. > > I ran 'find' but could not reproduce the issue. > > I will try to reproduce this issue by reducing the OUT buffer size. I could not reproduce the issue with the reduced buffer of 16 bytes also. I think it may not be reproducible on the foundation model. I am trying to bring up xen on an ARM machine here to reproduce the issue. While looking at the pl011 driver in linux, I see one potential case where the the driver may send more data even when the TX FIFO is full. It seems the pl011 driver expects that the TX interrupt must be raised only when at least half of TX FIFO queue is empty. pl011_tx_chars() calls pl011_tx_char() in a loop for (fifosize/2) number of times. Since these APIs are getting called in the interrupt context, pl011_tx_char() does not check for TX FIFO full status because it expects that fifosize/2 space is available. In the emulation logic, we should set the TX bit in the status register only if at least space for 16 bytes is available (since SBSA fifosize is 32). Currently, we may be setting the TX bit even if there is space for 1 byte. In that scenario, the driver may write more data then emulation logic can queue up. Regards, Bhupinder
On 14/08/17 08:52, Bhupinder Thakur wrote: > Hi Julien, Hi Bhupinder, > > >>> After that I tried to stress the emulation a bit with "find ." to get a lot >>> of output. And I noticed a lot of message similar to the one below on xen >>> console: >>> >>> d6v0 vpl011: Unexpected OUT ring buffer full >>> >>> Associated to that the character have been eaten resulting to non-sense log. >>> >>> A bit above the printk printing this message, there are a comment saying: >>> >>> /* >>> * It is expected that the ring is not full when this function is called >>> * as the guest is expected to write to the data register only when the >>> * TXFF flag is not set. >>> * In case the guest does write even when the TXFF flag is set then the >>> * data will be silently dropped. >>> */ >>> >>> I am quite surprised that Linux is not looking at the TXFF flags. So this >>> needs some investigation. >> >> I ran 'find' but could not reproduce the issue. >> >> I will try to reproduce this issue by reducing the OUT buffer size. > > I could not reproduce the issue with the reduced buffer of 16 bytes > also. I think it may not be reproducible on the foundation model. I am > trying to bring up xen on an ARM machine here to reproduce the issue. > > While looking at the pl011 driver in linux, I see one potential case > where the the driver may send more data even when the TX FIFO is full. > It seems the pl011 driver expects that the TX interrupt must be raised > only when at least half of TX FIFO queue is empty. > > pl011_tx_chars() calls pl011_tx_char() in a loop for (fifosize/2) > number of times. Since these APIs are getting called in the interrupt > context, pl011_tx_char() does not check for TX FIFO full status > because it expects that fifosize/2 space is available. > > In the emulation logic, we should set the TX bit in the status > register only if at least space for 16 bytes is available (since SBSA > fifosize is 32). Currently, we may be setting the TX bit even if there > is space for 1 byte. In that scenario, the driver may write more data > then emulation logic can queue up. I understand that it is the behavior expected by Linux. However did you check it was compliant with the spec? If I looked at the PL011 (ARM DDI 0183G), the interrupt FIFO level is set via UARTIFLS. The reset value is half-way mark (i.e 16 byte). However, looking at the SBSA, this register is inexistent and I can't find anything mentioning the half-way mark. So we need some clarification here. Let me ask it. Meanwhile, trying the half-way mark would be good to see if it helps. Cheers,
On Thu, Aug 10, 2017 at 06:51:24PM +0100, Julien Grall wrote: > > > On 10/08/17 17:38, Wei Liu wrote: > > On Thu, Aug 10, 2017 at 05:11:52PM +0100, Julien Grall wrote: > > > > > > > > > On 10/08/17 17:00, Wei Liu wrote: > > > > On Thu, Aug 10, 2017 at 03:26:07PM +0100, Julien Grall wrote: > > > > > > > > > > > > > > > On 09/08/17 11:58, Bhupinder Thakur wrote: > > > > > > Hi Julien, > > > > > > > > > > Hi Bhupinder, > > > > > > > > > > > Thanks for the testing. > > > > > > > > > > > > On 8 August 2017 at 21:29, Julien Grall <julien.grall@arm.com> wrote: > > > > > > > Hi Bhupinder, > > > > > > > > > > > > > > I gave another and I have a couple of comments. > > > > > > > > > > > > > > Booting Linux with earlycon enabled take quite a while. I can see the > > > > > > > characters coming slower than on the minitel. It seems to be a bit better > > > > > > > after switching off the bootconsole. Overall Linux is taking ~20 times to > > > > > > > boot with pl011 vs HVC console. > > > > > > > > > > > > > > I do agree that pl011 is emulated and therefore you have to trap after each > > > > > > > character. But 20 times sounds far too much. > > > > > > > > > > > > > I think this slowness could be due to ratelimiting of the pl011 events > > > > > > in xenconosle. Currently, the rate limit is > > > > > > set to 30 events per 200 msecs (see RATE_LIMIT_ALLOWANCE/RATE_LIMIT_PERIOD). > > > > > > > > > > > > I increased the rate limit to 600 events (30 * 20) per 200 msecs. With > > > > > > this change, > > > > > > I see that the the find command is running faster and smoother. > > > > > > Earlier the find output would be jerky. > > > > > > > > > > I think there might be another solution avoiding increasing the rate limit. > > > > > > > > > > If you look at the earlycon code for pl011 in Linux: > > > > > > > > > > static void pl011_putc(struct uart_port *port, int c) > > > > > { > > > > > while (readl(port->membase + UART01x_FR) & UART01x_FR_TXFF) > > > > > cpu_relax(); > > > > > if (port->iotype == UPIO_MEM32) > > > > > writel(c, port->membase + UART01x_DR); > > > > > else > > > > > writeb(c, port->membase + UART01x_DR); > > > > > while (readl(port->membase + UART01x_FR) & UART01x_FR_BUSY) > > > > > cpu_relax(); > > > > > } > > > > > > > > > > Linux will wait the UART to be idle before sending a new character. > > > > > > > > > > Now looking at vpl011 emulation, the busy bit set when a new character is > > > > > queued (see vpl011_write_data). This bit will only be cleared when the > > > > > console daemon will raise an event and the queue is empty (see > > > > > vpl011_data_avail). > > > > > > > > > > This means for earlycon, you will need a round trip Guest -> Xen -> Dom0 -> > > > > > Xen -> Guest for each single character. This is a bit counterproductive and > > > > > combined with the limit it makes it worse. > > > > > > > > > > I would take a different approach on the BUSY bit. We can consider the queue > > > > > between Xen and xenconsoled as outside of the UART. If the character is > > > > > queued, then job done. I think this would improve quite a lot of the > > > > > performance. > > > > > > > > Yes. This. > > > > > > > > The guest sees a register, which is essentially a synchronous interface > > > > to the guest. The current code, as you already see, will issue one event > > > > for every character. That's excessive. > > > > > > I am actually not suggesting to modify that at the moment. I think you may > > > have other trouble with the interaction between the user and th console by > > > doing that. Imagine you want to print the prompt, it may lag a bit before > > > getting it. > > > > > > The only thing I suggest is to not set the BUSY bit in the UART everytime a > > > character is queued. > > > > > > > Did you come to that conclusion that this would work by looking at the > > spec or Linux source code? I think it should conform to the spec, not a > > specific guest. But you're the maintainer, you have the final say. > > I read both the spec and the code. From the spec: > > "UART busy. If this bit is set to 1, the UART is busy transmitting data. > This bit remains set until the > complete byte, including all the stop bits, has been sent from the shift > register. > This bit is set as soon as the transmit FIFO becomes non-empty, regardless > of whether the UART is > enabled or not." > > Currently, we considered that the shared ring is the FIFO of the UART. > Meaning that the BUSY bit is set until xenconsoled read everything. > > I don't think implementing a FIFO is highly critical in an emulation (QEMU > does not implement it for instance). And definitely using the shared ring > brings slow down (involve multiple context switch). > > I would suggest to take a different approach where the BUSY is only set if > we can't add more data in the shared ring. This would be clear as soon as > the ring has space. > > If we really we could implement is small FIFO (the SBSA requested a least > 32-entry separate for transmit and receive). But I don't think this is > critically for a first approach. > Sure, I think this is reasonable. > > > > > > > > > > The interface between Xen and xenconsoled can be asynchronous, it can > > > > opt to queue X characters before sending an event, also setup a oneshot > > > > timer to avoid hanging. > > > > > > > > This however has some other implications -- it might not be as reliable > > > > as the original method because data is not guaranteed to hit backend. If > > > > the guest crashes very early on, depending the actual implementation you > > > > might not be able get the data. > > > > > > Would it be possible to ask xenconsoled to dump everything on domain crash? > > > Some kind of synchronization. > > > > > > > No, not at the moment. If the data is still in Xen and destroyed, > > xenconsoled can't do anything. > > The vUART emulation is directly queuing the data, there are no intermediate > buffer. So all the data would be in the shared ring available for > xenconsoled to go through. > > It would be quite a useful enhancement for when the guest crash. > What I meant was actually "If the data is still in the ring". There is no synchronisation between Xen and xenconsoled to let the latter pull out all data before destroying the guest. As it stands, if we go with the fully-synchronised approach for now, that won't be a problem. But when you want to fiddle with the BUSY bit or any other optimisation -- leaving more data in the ring -- that might cause data loss. I'm not against bumping the rate-limit, but I want a justification and want to consider as many approaches as possible. Ultimately the final decision is up to you and Stefano.
Hi Wei, On 15/08/17 10:49, Wei Liu wrote: > On Thu, Aug 10, 2017 at 06:51:24PM +0100, Julien Grall wrote: >>>>> The interface between Xen and xenconsoled can be asynchronous, it can >>>>> opt to queue X characters before sending an event, also setup a oneshot >>>>> timer to avoid hanging. >>>>> >>>>> This however has some other implications -- it might not be as reliable >>>>> as the original method because data is not guaranteed to hit backend. If >>>>> the guest crashes very early on, depending the actual implementation you >>>>> might not be able get the data. >>>> >>>> Would it be possible to ask xenconsoled to dump everything on domain crash? >>>> Some kind of synchronization. >>>> >>> >>> No, not at the moment. If the data is still in Xen and destroyed, >>> xenconsoled can't do anything. >> >> The vUART emulation is directly queuing the data, there are no intermediate >> buffer. So all the data would be in the shared ring available for >> xenconsoled to go through. >> >> It would be quite a useful enhancement for when the guest crash. >> > > What I meant was actually "If the data is still in the ring". There is > no synchronisation between Xen and xenconsoled to let the latter pull > out all data before destroying the guest. I don't think you would need synchronisation between Xen and xenconsoled at domain destruction. Given that xenconsoled would have to unmap the page, I was suggesting that it makes sure that cons == prod before destroying the instance. So all the character queued would be displayed on the console. > > As it stands, if we go with the fully-synchronised approach for now, > that won't be a problem. But when you want to fiddle with the BUSY bit > or any other optimisation -- leaving more data in the ring -- that might > cause data loss. The fully-synchronised solution is far too slow, we are looking at optimization to avoid the synchronization for every single character. This would leave some potential data loss (max a few characters) if the domain crashed. > > I'm not against bumping the rate-limit, but I want a justification and > want to consider as many approaches as possible. Ultimately the final > decision is up to you and Stefano. I am not in favor of bumping the rate-limit at the moment. I believe we can solve the problem directly in the emulation. I spoke with Bhupinder about various solution that seem promising. Cheers,
On Fri, Aug 18, 2017 at 02:30:03PM +0100, Julien Grall wrote: > Hi Wei, > > On 15/08/17 10:49, Wei Liu wrote: > > On Thu, Aug 10, 2017 at 06:51:24PM +0100, Julien Grall wrote: > > > > > > The interface between Xen and xenconsoled can be asynchronous, it can > > > > > > opt to queue X characters before sending an event, also setup a oneshot > > > > > > timer to avoid hanging. > > > > > > > > > > > > This however has some other implications -- it might not be as reliable > > > > > > as the original method because data is not guaranteed to hit backend. If > > > > > > the guest crashes very early on, depending the actual implementation you > > > > > > might not be able get the data. > > > > > > > > > > Would it be possible to ask xenconsoled to dump everything on domain crash? > > > > > Some kind of synchronization. > > > > > > > > > > > > > No, not at the moment. If the data is still in Xen and destroyed, > > > > xenconsoled can't do anything. > > > > > > The vUART emulation is directly queuing the data, there are no intermediate > > > buffer. So all the data would be in the shared ring available for > > > xenconsoled to go through. > > > > > > It would be quite a useful enhancement for when the guest crash. > > > > > > > What I meant was actually "If the data is still in the ring". There is > > no synchronisation between Xen and xenconsoled to let the latter pull > > out all data before destroying the guest. > > I don't think you would need synchronisation between Xen and xenconsoled at > domain destruction. Given that xenconsoled would have to unmap the page, I > was suggesting that it makes sure that cons == prod before destroying the > instance. So all the character queued would be displayed on the console. So this is relying on Dom0 holding a reference to the page so that the page won't be freed. This sounds plausible.
Hi, A quick update below. On 14/08/17 14:54, Julien Grall wrote: > On 14/08/17 08:52, Bhupinder Thakur wrote: >> I could not reproduce the issue with the reduced buffer of 16 bytes >> also. I think it may not be reproducible on the foundation model. I am >> trying to bring up xen on an ARM machine here to reproduce the issue. >> >> While looking at the pl011 driver in linux, I see one potential case >> where the the driver may send more data even when the TX FIFO is full. >> It seems the pl011 driver expects that the TX interrupt must be raised >> only when at least half of TX FIFO queue is empty. >> >> pl011_tx_chars() calls pl011_tx_char() in a loop for (fifosize/2) >> number of times. Since these APIs are getting called in the interrupt >> context, pl011_tx_char() does not check for TX FIFO full status >> because it expects that fifosize/2 space is available. >> >> In the emulation logic, we should set the TX bit in the status >> register only if at least space for 16 bytes is available (since SBSA >> fifosize is 32). Currently, we may be setting the TX bit even if there >> is space for 1 byte. In that scenario, the driver may write more data >> then emulation logic can queue up. > > I understand that it is the behavior expected by Linux. However did you > check it was compliant with the spec? > > If I looked at the PL011 (ARM DDI 0183G), the interrupt FIFO level is > set via UARTIFLS. The reset value is half-way mark (i.e 16 byte). > > However, looking at the SBSA, this register is inexistent and I can't > find anything mentioning the half-way mark. So we need some > clarification here. Let me ask it. > > Meanwhile, trying the half-way mark would be good to see if it helps. I got the confirmation that some clarification is needed in the spec. Until this is done, we should assume the halfway mark in our emulation. Cheers, > > Cheers, >