Message ID | 1430417667-4245-5-git-send-email-christopher.covington@linaro.org |
---|---|
State | New |
Headers | show |
On 30 April 2015 at 19:14, Christopher Covington <christopher.covington@linaro.org> wrote: > Present a system with an instructions per cycle of exactly one. > This makes it less likely a user will mistake the cycle counter > values as meaningful and makes calculations involving cycles > trivial while preserving the necessary property of the cycle > counter register as monotonically increasing. > > Signed-off-by: Christopher Covington <christopher.covington@linaro.org> > --- > target-arm/helper.c | 9 +++------ > 1 file changed, 3 insertions(+), 6 deletions(-) > > diff --git a/target-arm/helper.c b/target-arm/helper.c > index 3e6fb0b..a027a19 100644 > --- a/target-arm/helper.c > +++ b/target-arm/helper.c > @@ -648,8 +648,7 @@ void pmccntr_sync(CPUARMState *env) > { > uint64_t temp_ticks; > > - temp_ticks = muldiv64(qemu_clock_get_us(QEMU_CLOCK_VIRTUAL), > - get_ticks_per_sec(), 1000000); > + temp_ticks = cpu_get_icount_raw(); Are you really really sure the _raw function is the correct one? Nowhere else in the codebase calls it except the other wrappers in cpu.c which provide a sane view of the instruction count... (I suspect cpu_get_icount_raw() should really be static.) PS: it would be helpful if you could make sure you include a cover letter for future patchseries: patchsets without a cover letter are awkward for my workflow... (A single standalone patch doesn't need a coverletter.) thanks -- PMM
Hi Peter, Thanks for taking a look. On Apr 30, 2015 2:28 PM, "Peter Maydell" <peter.maydell@linaro.org> wrote: > > On 30 April 2015 at 19:14, Christopher Covington > <christopher.covington@linaro.org> wrote: > > Present a system with an instructions per cycle of exactly one. > > This makes it less likely a user will mistake the cycle counter > > values as meaningful and makes calculations involving cycles > > trivial while preserving the necessary property of the cycle > > counter register as monotonically increasing. > > > > Signed-off-by: Christopher Covington <christopher.covington@linaro.org> > > --- > > target-arm/helper.c | 9 +++------ > > 1 file changed, 3 insertions(+), 6 deletions(-) > > > > diff --git a/target-arm/helper.c b/target-arm/helper.c > > index 3e6fb0b..a027a19 100644 > > --- a/target-arm/helper.c > > +++ b/target-arm/helper.c > > @@ -648,8 +648,7 @@ void pmccntr_sync(CPUARMState *env) > > { > > uint64_t temp_ticks; > > > > - temp_ticks = muldiv64(qemu_clock_get_us(QEMU_CLOCK_VIRTUAL), > > - get_ticks_per_sec(), 1000000); > > + temp_ticks = cpu_get_icount_raw(); > > Are you really really sure the _raw function is the correct one? > Nowhere else in the codebase calls it except the other wrappers > in cpu.c which provide a sane view of the instruction count... > (I suspect cpu_get_icount_raw() should really be static.) I thought it wasn't being used because it was new, having been introduced by Pavel Dovgalyuk's determinism patches. When you talk about sanity, what would this patch make insane? The instructions per second and cycles per second that would result? If so, what if seconds/timer ticks were changed in the same patch to be derived from the instruction count. All of this could potentially only apply with -icount specified. > PS: it would be helpful if you could make sure you include > a cover letter for future patchseries: patchsets without a > cover letter are awkward for my workflow... (A single > standalone patch doesn't need a coverletter.) Will do. For this series it should have been that the first four patches are hopefully not controversial, but so far have only been tested by some very simple bare metal code (using PSCI exit instead of Angel exit at the end) and booting the Linux kernel and looking at the printks, but not yet by running `perf stat`. The last patch (this one) I realize may be controversial as it tries to make QEMU conform to certain expectations such as determinism that it has not historically met. Thanks, Chris
On 30 April 2015 at 22:33, Christopher Covington <christopher.covington@linaro.org> wrote: > On Apr 30, 2015 2:28 PM, "Peter Maydell" <peter.maydell@linaro.org> wrote: >> Are you really really sure the _raw function is the correct one? >> Nowhere else in the codebase calls it except the other wrappers >> in cpu.c which provide a sane view of the instruction count... >> (I suspect cpu_get_icount_raw() should really be static.) > > I thought it wasn't being used because it was new, having been introduced by > Pavel Dovgalyuk's determinism patches. > > When you talk about sanity, what would this patch make insane? The > instructions per second and cycles per second that would result? If so, what > if seconds/timer ticks were changed in the same patch to be derived from the > instruction count. All of this could potentially only apply with -icount > specified. I don't really know for certain how the code here works, but it makes me very dubious when I see a function that is being used literally nowhere else in any other target CPU, and where every other code path to it goes via taking a lock. Paolo: can you suggest what the right function to call to implement a cycle counter is? thanks -- PMM
On Thu, Apr 30, 2015 at 9:24 PM, Peter Crosthwaite <peter.crosthwaite@xilinx.com> wrote: > On Thu, Apr 30, 2015 at 11:14 AM, Christopher Covington > <christopher.covington@linaro.org> wrote: >> Present a system with an instructions per cycle of exactly one. >> This makes it less likely a user will mistake the cycle counter >> values as meaningful and makes calculations involving cycles >> trivial while preserving the necessary property of the cycle >> counter register as monotonically increasing. >> >> Signed-off-by: Christopher Covington <christopher.covington@linaro.org> >> --- >> target-arm/helper.c | 9 +++------ >> 1 file changed, 3 insertions(+), 6 deletions(-) >> >> diff --git a/target-arm/helper.c b/target-arm/helper.c >> index 3e6fb0b..a027a19 100644 >> --- a/target-arm/helper.c >> +++ b/target-arm/helper.c >> @@ -648,8 +648,7 @@ void pmccntr_sync(CPUARMState *env) >> { >> uint64_t temp_ticks; >> >> - temp_ticks = muldiv64(qemu_clock_get_us(QEMU_CLOCK_VIRTUAL), >> - get_ticks_per_sec(), 1000000); >> + temp_ticks = cpu_get_icount_raw(); > > So I have guests (for better or worse) that make assumptions about the > rate of the PMCCNTR WRT real time. But isn't the PMCCNTR really a > clock cycle counter rather than an instruction counter? That clock > rate is well defined even if it is just the trivial > get_ticks_per_sec() at the moment. Ideally we should have a > configurable clock rate in there instead of get_ticks_per_sec(). This > is a major change in definition. Is this with KVM, user emulation, or system emulation mode? I'm mainly looking at things from the TCG (and mostly system mode) perspective. If not using KVM, would the assumptions of your guests hold if you ran them on a hardware device instead of QEMU TCG? I'm still trying to understand all the code involved, but I don't see get_ticks_per_sec() or any other constants as problematic. > I can see your use case though, where you actually want this to mean > something WRT program performance. Should we add a switch between the > two behaviours? This switch already exists, in the form of -icount, and I really should have already been using it. What initially scared me away was qemu_icount_bias and icount_adjust(). Accounting for halting seems like a good thing but the accounting for "speed" by referencing a host clock doesn't make sense to me. If I have an ARM development board hooked up to an x86 desktop, it's not querying the desktop for time. So why does it make sense for QEMU system emulation mode behave like that? The -icount help text makes it sound like adjustment for speed reasons should only take place when "auto" is specified, but I have yet to understand the code well enough to verify that. Chris
On 6 May 2015 at 15:05, Peter Crosthwaite <peter.crosthwaite@xilinx.com> wrote: > What I am most worried about (and I need to run some tests to really > confirm facts) is what happens if a CPU WFIs. Should the PMCCNTR > continue on or hold its value? If we match instruction execution to > PMCCNTR to the PMCCNTR will freeze. See the v8 ARM ARM D5.1.1: this doesn't count PE clock cycles, it's linked to the hardware processor clock. The exact relationship is IMPDEF so we have some leeway for doing whatever seems reasonable here. Permitted things: * counter can stop on WFI (see D5.1.3) * counter can continue to run on WFI (it's impdef whether the PE clock is gated when in WFI, though I would expect that to be a popular implementation) * counter can read the same value if read twice in a row * counter can run forward a lot even if no insns executed (the example given is of a hyperthreading implementation) So we should do whatever seems most convenient implementation-wise I think... -- PMM
On Thu, Sep 24, 2015 at 12:43 PM, Christopher Covington <cov@codeaurora.org> wrote: > cpu_get_ticks() provides a common interface across targets for > calculating CPU cycles. Using this fixes PMCCNTR reads when -icount > is specified (previously a non-increasing value was returned). > > Signed-off-by: Christopher Covington <cov@codeaurora.org> > --- > target-arm/helper.c | 9 +++------ > 1 file changed, 3 insertions(+), 6 deletions(-) > > diff --git a/target-arm/helper.c b/target-arm/helper.c > index 7dc49cb..32923fb 100644 > --- a/target-arm/helper.c > +++ b/target-arm/helper.c > @@ -729,8 +729,7 @@ void pmccntr_sync(CPUARMState *env) > { > uint64_t temp_ticks; > > - temp_ticks = muldiv64(qemu_clock_get_us(QEMU_CLOCK_VIRTUAL), > - get_ticks_per_sec(), 1000000); > + temp_ticks = cpu_get_ticks(); This patch doesn't apply anymore, you will need to rebase it. Also I don't think this is correct. cpu_get_ticks() returns the host CPU cycle counter, when in this case we want the guest cycles. Thanks, Alistair > > if (env->cp15.c9_pmcr & PMCRD) { > /* Increment once every 64 processor clock cycles */ > @@ -768,8 +767,7 @@ static uint64_t pmccntr_read(CPUARMState *env, const ARMCPRegInfo *ri) > return env->cp15.c15_ccnt; > } > > - total_ticks = muldiv64(qemu_clock_get_us(QEMU_CLOCK_VIRTUAL), > - get_ticks_per_sec(), 1000000); > + total_ticks = cpu_get_ticks(); > > if (env->cp15.c9_pmcr & PMCRD) { > /* Increment once every 64 processor clock cycles */ > @@ -789,8 +787,7 @@ static void pmccntr_write(CPUARMState *env, const ARMCPRegInfo *ri, > return; > } > > - total_ticks = muldiv64(qemu_clock_get_us(QEMU_CLOCK_VIRTUAL), > - get_ticks_per_sec(), 1000000); > + total_ticks = cpu_get_ticks(); > > if (env->cp15.c9_pmcr & PMCRD) { > /* Increment once every 64 processor clock cycles */ > -- > Qualcomm Innovation Center, Inc. > The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, > a Linux Foundation Collaborative Project > >
On 09/28/2015 06:05 PM, Alistair Francis wrote: > On Thu, Sep 24, 2015 at 12:43 PM, Christopher Covington > <cov@codeaurora.org> wrote: >> cpu_get_ticks() provides a common interface across targets for >> calculating CPU cycles. Using this fixes PMCCNTR reads when -icount >> is specified (previously a non-increasing value was returned). >> >> Signed-off-by: Christopher Covington <cov@codeaurora.org> >> --- >> target-arm/helper.c | 9 +++------ >> 1 file changed, 3 insertions(+), 6 deletions(-) >> >> diff --git a/target-arm/helper.c b/target-arm/helper.c >> index 7dc49cb..32923fb 100644 >> --- a/target-arm/helper.c >> +++ b/target-arm/helper.c >> @@ -729,8 +729,7 @@ void pmccntr_sync(CPUARMState *env) >> { >> uint64_t temp_ticks; >> >> - temp_ticks = muldiv64(qemu_clock_get_us(QEMU_CLOCK_VIRTUAL), >> - get_ticks_per_sec(), 1000000); >> + temp_ticks = cpu_get_ticks(); > Also I don't think this is correct. cpu_get_ticks() returns the host > CPU cycle counter, when in this case we want the guest cycles. I too find the use of host CPU cycles quite perplexing. Paolo suggested it [1]. Maybe there are timeouts in some software that tend to work better in such a mode. Perhaps it is faster, although my intuition is that it's just providing a facade of frequency scaling to the guest. 1. https://lists.gnu.org/archive/html/qemu-devel/2015-05/msg00162.html I like to declare guest instructions per guest CPU cycles = 1. As I understand it, an "-icount 0" pair of parameters is how to do this in QEMU for x86. I'd like it to work for ARM. I wrote a simple assembly test case which I'm working on porting it to the kvm-unit-tests framework. In the non-icount case, I saw roughly the same order of magnitude for guest IPC before and after the patch. I'd like to also write CPU frequency (guest CPU cycles per generic timer guest seconds) and (M)IPS (guest instructions per generic timer guest seconds) tests. Thanks, Christopher Covington
On 09/29/2015 10:07 AM, Christopher Covington wrote: > On 09/28/2015 06:05 PM, Alistair Francis wrote: >> On Thu, Sep 24, 2015 at 12:43 PM, Christopher Covington >> <cov@codeaurora.org> wrote: >>> cpu_get_ticks() provides a common interface across targets for >>> calculating CPU cycles. Using this fixes PMCCNTR reads when -icount >>> is specified (previously a non-increasing value was returned). >>> >>> Signed-off-by: Christopher Covington <cov@codeaurora.org> >>> --- >>> target-arm/helper.c | 9 +++------ >>> 1 file changed, 3 insertions(+), 6 deletions(-) >>> >>> diff --git a/target-arm/helper.c b/target-arm/helper.c >>> index 7dc49cb..32923fb 100644 >>> --- a/target-arm/helper.c >>> +++ b/target-arm/helper.c >>> @@ -729,8 +729,7 @@ void pmccntr_sync(CPUARMState *env) >>> { >>> uint64_t temp_ticks; >>> >>> - temp_ticks = muldiv64(qemu_clock_get_us(QEMU_CLOCK_VIRTUAL), >>> - get_ticks_per_sec(), 1000000); >>> + temp_ticks = cpu_get_ticks(); > >> Also I don't think this is correct. cpu_get_ticks() returns the host >> CPU cycle counter, when in this case we want the guest cycles. > > I too find the use of host CPU cycles quite perplexing. Paolo suggested it > [1]. Maybe there are timeouts in some software that tend to work better in > such a mode. Perhaps it is faster, although my intuition is that it's just > providing a facade of frequency scaling to the guest. > > 1. https://lists.gnu.org/archive/html/qemu-devel/2015-05/msg00162.html > > I like to declare guest instructions per guest CPU cycles = 1. As I understand > it, an "-icount 0" pair of parameters is how to do this in QEMU for x86. I'd > like it to work for ARM. > > I wrote a simple assembly test case which I'm working on porting it to the > kvm-unit-tests framework. In the non-icount case, I saw roughly the same order > of magnitude for guest IPC before and after the patch. I'd like to also write > CPU frequency (guest CPU cycles per generic timer guest seconds) and (M)IPS > (guest instructions per generic timer guest seconds) tests. I've sent out the CPI test case and while exercising it I noticed that Laurent's patch fixed -icount. So my original goal has been accomplished. I'm happy to rebase this patch if the authorities (Peter Maydell?) think using cpu_get_ticks() is the right thing to do. Otherwise I'll probably try to move on to support for the instructions event in the ARM PMU. Thanks, Christopher Covington
On 2 October 2015 at 17:44, Christopher Covington <cov@codeaurora.org> wrote: > I've sent out the CPI test case and while exercising it I noticed that > Laurent's patch fixed -icount. So my original goal has been accomplished. I'm > happy to rebase this patch if the authorities (Peter Maydell?) think using > cpu_get_ticks() is the right thing to do. Otherwise I'll probably try to move > on to support for the instructions event in the ARM PMU. Authority here is probably Peter Crosthwaite. I can produce an opinion but I'd have to go and research a bunch of stuff to do that, so I'm hoping to avoid it... thanks -- PMM
On Fri, Oct 2, 2015 at 9:56 AM, Peter Maydell <peter.maydell@linaro.org> wrote: > On 2 October 2015 at 17:44, Christopher Covington <cov@codeaurora.org> wrote: >> I've sent out the CPI test case and while exercising it I noticed that >> Laurent's patch fixed -icount. So my original goal has been accomplished. I'm >> happy to rebase this patch if the authorities (Peter Maydell?) think using >> cpu_get_ticks() is the right thing to do. Otherwise I'll probably try to move >> on to support for the instructions event in the ARM PMU. > > Authority here is probably Peter Crosthwaite. I can produce an > opinion but I'd have to go and research a bunch of stuff to do > that, so I'm hoping to avoid it... > So my idea here is the CPU input frequency should be a property of the CPU. Some experimental results confirm that the PMCCNTR on many common ARM implementations is directly connected to the input clock and can be relied on as a straight free-running counter. I think a genuine instruction counter is something else and this timer should be independent of any core provider of cycle count. Regards, Peter > thanks > -- PMM
On 2 October 2015 at 18:25, Peter Crosthwaite <crosthwaitepeter@gmail.com> wrote: > So my idea here is the CPU input frequency should be a property of the CPU. > > Some experimental results confirm that the PMCCNTR on many common ARM > implementations is directly connected to the input clock and can be > relied on as a straight free-running counter. I think a genuine > instruction counter is something else and this timer should be > independent of any core provider of cycle count. Architecturally, the PMCCNTR counter is counting the hardware processor clock. It's definitely not an instruction counter. (It's also not counting Processor Element clock cycles, though that only makes a difference if you have a multi-threaded hw implementation.) It is generally subject to any hw changes in clock frequency (including if your WFI/WFE do clock stopping). What that means for QEMU I'm not totally sure :-) thanks -- PMM
On Fri, Oct 2, 2015 at 11:08 AM, Peter Maydell <peter.maydell@linaro.org> wrote: > On 2 October 2015 at 18:25, Peter Crosthwaite > <crosthwaitepeter@gmail.com> wrote: >> So my idea here is the CPU input frequency should be a property of the CPU. >> >> Some experimental results confirm that the PMCCNTR on many common ARM >> implementations is directly connected to the input clock and can be >> relied on as a straight free-running counter. I think a genuine >> instruction counter is something else and this timer should be >> independent of any core provider of cycle count. > > Architecturally, the PMCCNTR counter is counting the hardware processor > clock. It's definitely not an instruction counter. (It's also not > counting Processor Element clock cycles, though that only makes a > difference if you have a multi-threaded hw implementation.) It is > generally subject to any hw changes in clock frequency (including if > your WFI/WFE do clock stopping). > WFI/WFE halting could be easily implemented directly as that, in similar way to EL filtering. > What that means for QEMU I'm not totally sure :-) I think this all points to it being just another normal timer (like those in hw/timer). Regards, Peter > > thanks > -- PMM
On 10/02/2015 01:25 PM, Peter Crosthwaite wrote: > On Fri, Oct 2, 2015 at 9:56 AM, Peter Maydell <peter.maydell@linaro.org> wrote: >> On 2 October 2015 at 17:44, Christopher Covington <cov@codeaurora.org> wrote: >>> I've sent out the CPI test case and while exercising it I noticed that >>> Laurent's patch fixed -icount. So my original goal has been accomplished. I'm >>> happy to rebase this patch if the authorities (Peter Maydell?) think using >>> cpu_get_ticks() is the right thing to do. Otherwise I'll probably try to move >>> on to support for the instructions event in the ARM PMU. >> >> Authority here is probably Peter Crosthwaite. I can produce an >> opinion but I'd have to go and research a bunch of stuff to do >> that, so I'm hoping to avoid it... > > So my idea here is the CPU input frequency should be a property of the CPU. > > Some experimental results confirm that the PMCCNTR on many common ARM > implementations is directly connected to the input clock and can be > relied on as a straight free-running counter. I think a genuine > instruction counter is something else Yes, the "genuine" instruction counter is something else. The instruction count is only relevant for folks trying to get deterministic execution by using the -icount option. QEMU TCG mode does not emulate a cycle-level input clock for the guest (the whole class of functional models skip this time-consuming step) but rather operates a block at a time. By doing a little extra, I think it also interpolates the exact instruction count. Specifying a fixed IPC = n is the most sensible way of deterministically calculating a PMCCNTR_EL0 value that I know of. The -icount option allows users to choose such deterministic behavior. > and this timer should be independent of any core provider of cycle count. What, if anything, do you think should be hooked up to the core provider of cycle count? Christopher Covington
On Fri, Oct 2, 2015 at 12:25 PM, Christopher Covington <cov@codeaurora.org> wrote: > On 10/02/2015 01:25 PM, Peter Crosthwaite wrote: >> On Fri, Oct 2, 2015 at 9:56 AM, Peter Maydell <peter.maydell@linaro.org> wrote: >>> On 2 October 2015 at 17:44, Christopher Covington <cov@codeaurora.org> wrote: >>>> I've sent out the CPI test case and while exercising it I noticed that >>>> Laurent's patch fixed -icount. So my original goal has been accomplished. I'm >>>> happy to rebase this patch if the authorities (Peter Maydell?) think using >>>> cpu_get_ticks() is the right thing to do. Otherwise I'll probably try to move >>>> on to support for the instructions event in the ARM PMU. >>> >>> Authority here is probably Peter Crosthwaite. I can produce an >>> opinion but I'd have to go and research a bunch of stuff to do >>> that, so I'm hoping to avoid it... >> >> So my idea here is the CPU input frequency should be a property of the CPU. >> >> Some experimental results confirm that the PMCCNTR on many common ARM >> implementations is directly connected to the input clock and can be >> relied on as a straight free-running counter. I think a genuine >> instruction counter is something else > > Yes, the "genuine" instruction counter is something else. The instruction > count is only relevant for folks trying to get deterministic execution by > using the -icount option. QEMU TCG mode does not emulate a cycle-level input > clock for the guest (the whole class of functional models skip this > time-consuming step) but rather operates a block at a time. By doing a little > extra, I think it also interpolates the exact instruction count. Specifying a > fixed IPC = n is the most sensible way of deterministically calculating a > PMCCNTR_EL0 value that I know of. The -icount option allows users to choose > such deterministic behavior. > >> and this timer should be independent of any core provider of cycle count. > > What, if anything, do you think should be hooked up to the core provider of > cycle count? > Depends, Is this a virtual-machine only concept, or do you have something with a real-hardware analogue? Regards, Peter > Christopher Covington > > -- > Qualcomm Innovation Center, Inc. > The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, > a Linux Foundation Collaborative Project
On 10/02/2015 03:56 PM, Peter Crosthwaite wrote: > On Fri, Oct 2, 2015 at 12:25 PM, Christopher Covington > <cov@codeaurora.org> wrote: >> On 10/02/2015 01:25 PM, Peter Crosthwaite wrote: >>> On Fri, Oct 2, 2015 at 9:56 AM, Peter Maydell <peter.maydell@linaro.org> wrote: >>>> On 2 October 2015 at 17:44, Christopher Covington <cov@codeaurora.org> wrote: >>>>> I've sent out the CPI test case and while exercising it I noticed that >>>>> Laurent's patch fixed -icount. So my original goal has been accomplished. I'm >>>>> happy to rebase this patch if the authorities (Peter Maydell?) think using >>>>> cpu_get_ticks() is the right thing to do. Otherwise I'll probably try to move >>>>> on to support for the instructions event in the ARM PMU. >>>> >>>> Authority here is probably Peter Crosthwaite. I can produce an >>>> opinion but I'd have to go and research a bunch of stuff to do >>>> that, so I'm hoping to avoid it... >>> >>> So my idea here is the CPU input frequency should be a property of the CPU. >>> >>> Some experimental results confirm that the PMCCNTR on many common ARM >>> implementations is directly connected to the input clock and can be >>> relied on as a straight free-running counter. I think a genuine >>> instruction counter is something else >> >> Yes, the "genuine" instruction counter is something else. The instruction >> count is only relevant for folks trying to get deterministic execution by >> using the -icount option. QEMU TCG mode does not emulate a cycle-level input >> clock for the guest (the whole class of functional models skip this >> time-consuming step) but rather operates a block at a time. By doing a little >> extra, I think it also interpolates the exact instruction count. Specifying a >> fixed IPC = n is the most sensible way of deterministically calculating a >> PMCCNTR_EL0 value that I know of. The -icount option allows users to choose >> such deterministic behavior. >> >>> and this timer should be independent of any core provider of cycle count. >> >> What, if anything, do you think should be hooked up to the core provider of >> cycle count? > > Depends, Is this a virtual-machine only concept, or do you have > something with a real-hardware analogue? What I meant to ask was, do you see any reason for cpu_get_ticks() to exist? If no architecture besides i386 wants to use it, perhaps the code should be moved there. Thanks, Christopher Covington
On 2 October 2015 at 21:48, Christopher Covington <cov@codeaurora.org> wrote: > What I meant to ask was, do you see any reason for cpu_get_ticks() to exist? > If no architecture besides i386 wants to use it, perhaps the code should be > moved there. OTOH various non-x86 things do use the closely related cpu_get_real_ticks(), and the implementation of cpu_get_ticks() is very closely related to the other clock code in cpus.c. -- PMM
On 03/10/2015 00:41, Peter Maydell wrote: > > What I meant to ask was, do you see any reason for cpu_get_ticks() to exist? > > If no architecture besides i386 wants to use it, perhaps the code should be > > moved there. > > OTOH various non-x86 things do use the closely related cpu_get_real_ticks(), > and the implementation of cpu_get_ticks() is very closely related to > the other clock code in cpus.c. cpu_get_real_ticks() is returning the host cycle counter; cpu_get_ticks() is stopping/resuming it when the VM is stopped/resumed. In other words, cpu_get_real_ticks() is to cpu_get_ticks() what QEMU_CLOCK_REALTIME is to QEMU_CLOCK_VIRTUAL. They are also similar in that both cpu_get_ticks() and QEMU_CLOCK_VIRTUAL "morph" to the instruction count in icount mode. cpu_get_ticks() should be the right implementation for the ARM PMCCNTR cycle counter, since: 1) PMCCNTR is roughly the same as the x86 RDTSC; 2) cpu_get_ticks() is the only monotonically increasing value outside icount mode. Paolo
On 5 October 2015 at 15:09, Paolo Bonzini <pbonzini@redhat.com> wrote: > > > On 03/10/2015 00:41, Peter Maydell wrote: >> > What I meant to ask was, do you see any reason for cpu_get_ticks() to exist? >> > If no architecture besides i386 wants to use it, perhaps the code should be >> > moved there. >> >> OTOH various non-x86 things do use the closely related cpu_get_real_ticks(), >> and the implementation of cpu_get_ticks() is very closely related to >> the other clock code in cpus.c. > > cpu_get_real_ticks() is returning the host cycle counter; > cpu_get_ticks() is stopping/resuming it when the VM is stopped/resumed. > In other words, cpu_get_real_ticks() is to cpu_get_ticks() what > QEMU_CLOCK_REALTIME is to QEMU_CLOCK_VIRTUAL. ...but it seems wrong to have anything in the simulation care about the host cycle counter, especially since on some hosts the underlying implementation is terrible. thanks -- PMM
On 05/10/2015 16:11, Peter Maydell wrote: > > > OTOH various non-x86 things do use the closely related cpu_get_real_ticks(), > > > and the implementation of cpu_get_ticks() is very closely related to > > > the other clock code in cpus.c. > > > > cpu_get_real_ticks() is returning the host cycle counter; > > cpu_get_ticks() is stopping/resuming it when the VM is stopped/resumed. > > In other words, cpu_get_real_ticks() is to cpu_get_ticks() what > > QEMU_CLOCK_REALTIME is to QEMU_CLOCK_VIRTUAL. > > ...but it seems wrong to have anything in the simulation care > about the host cycle counter, I agree; instruction count is much better. Unfortunately, -icount has a relatively hefty performance penalty. It is also common that code using PMCCNTR/RDTSC wants the increment between two reads to be large-ish and at least remotely related to the number of instructions that were executed in between. I've also used rdtsc in the guest from time to time to benchmark changes to TCG. Having it map to the host cycle counter is somewhat useful for that purpose, though one might say this usecase is niche. > especially since on some hosts > the underlying implementation is terrible. I remember seeing a bug where this terrible implementation caused divisions by zero on the host. The default implementation in include/qemu/timer.h should be changed to qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL). Note that in practice this terrible implementation is only used on ARM/AArch64. Is PMCCNTR or something similar accessible from userspace? I guess no, since even write access is enabled via PMUSERENR (and in general PMUSERENR ought to be false on a preemptive-multitasking OS). In the end I guess qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) would work just as well for PMCCNTR, but I think non-Linux OSes still have a relatively slow clock_gettime() so there is still an advantage in using cpu_get_ticks(). Paolo ps: on x86, a long time ago rdtsc was reliable and clock_gettime() was slow, so it was a no-brainer for benchmarks and the like; then rdtsc became unreliable and clock_gettime() became fast on Linux; but now at least on new machines rdtsc is usually reliable again.
On 5 October 2015 at 15:27, Paolo Bonzini <pbonzini@redhat.com> wrote: > I remember seeing a bug where this terrible implementation caused > divisions by zero on the host. The default implementation in > include/qemu/timer.h should be changed to > qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL). > > Note that in practice this terrible implementation is only used on > ARM/AArch64. Is PMCCNTR or something similar accessible from userspace? > I guess no, since even write access is enabled via PMUSERENR (and in > general PMUSERENR ought to be false on a preemptive-multitasking OS). Yeah, there's no guarantee on userspace access. I think the fastest way to get some kind of count is to use a library function that boils down to gettimeofday(2), which we will do purely in userspace in the kernel VDSO if possible. thanks -- PMM
On 06/10/2015 14:49, Peter Maydell wrote: > Yeah, there's no guarantee on userspace access. I think the > fastest way to get some kind of count is to use a library > function that boils down to gettimeofday(2), which we will > do purely in userspace in the kernel VDSO if possible. Could we just use CNTVCT_EL0? Which cores have it? Paolo
On 6 October 2015 at 13:58, Paolo Bonzini <pbonzini@redhat.com> wrote: > > > On 06/10/2015 14:49, Peter Maydell wrote: >> Yeah, there's no guarantee on userspace access. I think the >> fastest way to get some kind of count is to use a library >> function that boils down to gettimeofday(2), which we will >> do purely in userspace in the kernel VDSO if possible. Looking closer, clock_gettime() also has a userspace only fastpath in the VDSO. > Could we just use CNTVCT_EL0? Which cores have it? Not guaranteed to be enabled for userspace access by the kernel, and even if it is enabled, the kernel folks don't (last time I checked) consider this userspace ABI -- it's only there for the benefit of the VDSO. (CNTVCT_EL0 is how the fast clock_gettime and getimeofday paths are implemented.) thanks -- PMM
On 06/10/2015 15:06, Peter Maydell wrote: > Looking closer, clock_gettime() also has a userspace > only fastpath in the VDSO. Yup, hence the patch that changed the default cpu_get_real_ticks() implementation to get_clock(). >> > Could we just use CNTVCT_EL0? Which cores have it? > > Not guaranteed to be enabled for userspace access by the kernel, > and even if it is enabled, the kernel folks don't (last time I > checked) consider this userspace ABI -- it's only there for > the benefit of the VDSO. Ok, thanks. > (CNTVCT_EL0 is how the fast clock_gettime > and getimeofday paths are implemented.) Yes. Paolo
On 24 September 2015 at 20:43, Christopher Covington <cov@codeaurora.org> wrote: > cpu_get_ticks() provides a common interface across targets for > calculating CPU cycles. Using this fixes PMCCNTR reads when -icount > is specified (previously a non-increasing value was returned). > > Signed-off-by: Christopher Covington <cov@codeaurora.org> > --- > target-arm/helper.c | 9 +++------ > 1 file changed, 3 insertions(+), 6 deletions(-) So, I think the conclusion from this thread was that we should (a) change the default cpu_get_host_ticks() implementation to call get_clock() (b) rebase this patch and apply it Is anybody planning to do that? In any case, I'm taking this thread off my "must-review" list :-) thanks -- PMM
On 10/13/2015 04:53 PM, Peter Maydell wrote: > On 24 September 2015 at 20:43, Christopher Covington <cov@codeaurora.org> wrote: >> cpu_get_ticks() provides a common interface across targets for >> calculating CPU cycles. Using this fixes PMCCNTR reads when -icount >> is specified (previously a non-increasing value was returned). >> >> Signed-off-by: Christopher Covington <cov@codeaurora.org> >> --- >> target-arm/helper.c | 9 +++------ >> 1 file changed, 3 insertions(+), 6 deletions(-) > > So, I think the conclusion from this thread was that we should > (a) change the default cpu_get_host_ticks() implementation to > call get_clock() > (b) rebase this patch and apply it > > Is anybody planning to do that? It's not in my plans at the moment. Thanks, Christopher Covington
diff --git a/target-arm/helper.c b/target-arm/helper.c index 3e6fb0b..a027a19 100644 --- a/target-arm/helper.c +++ b/target-arm/helper.c @@ -648,8 +648,7 @@ void pmccntr_sync(CPUARMState *env) { uint64_t temp_ticks; - temp_ticks = muldiv64(qemu_clock_get_us(QEMU_CLOCK_VIRTUAL), - get_ticks_per_sec(), 1000000); + temp_ticks = cpu_get_icount_raw(); if (env->cp15.c9_pmcr & PMCRD) { /* Increment once every 64 processor clock cycles */ @@ -687,8 +686,7 @@ static uint64_t pmccntr_read(CPUARMState *env, const ARMCPRegInfo *ri) return env->cp15.c15_ccnt; } - total_ticks = muldiv64(qemu_clock_get_us(QEMU_CLOCK_VIRTUAL), - get_ticks_per_sec(), 1000000); + total_ticks = cpu_get_icount_raw(); if (env->cp15.c9_pmcr & PMCRD) { /* Increment once every 64 processor clock cycles */ @@ -708,8 +706,7 @@ static void pmccntr_write(CPUARMState *env, const ARMCPRegInfo *ri, return; } - total_ticks = muldiv64(qemu_clock_get_us(QEMU_CLOCK_VIRTUAL), - get_ticks_per_sec(), 1000000); + total_ticks = cpu_get_icount_raw(); if (env->cp15.c9_pmcr & PMCRD) { /* Increment once every 64 processor clock cycles */
Present a system with an instructions per cycle of exactly one. This makes it less likely a user will mistake the cycle counter values as meaningful and makes calculations involving cycles trivial while preserving the necessary property of the cycle counter register as monotonically increasing. Signed-off-by: Christopher Covington <christopher.covington@linaro.org> --- target-arm/helper.c | 9 +++------ 1 file changed, 3 insertions(+), 6 deletions(-)