[RFC,5/5] arm: Simplify cycle counter

Message ID	1430417667-4245-5-git-send-email-christopher.covington@linaro.org
State	New
Headers	show Return-Path: <patchwork-forward+bncBCLP3ZHUTYLBBPPDRGVAKGQEIR557LY@linaro.org> MIME-Version: 1.0 Received-SPF: pass (google.com: domain of patch+caf_=patchwork-forward=linaro.org@linaro.org designates 209.85.217.173 as permitted sender) client-ip=209.85.217.173; Received-SPF: pass (google.com: domain of qemu-devel-bounces+patch=linaro.org@nongnu.org designates 2001:4830:134:3::11 as permitted sender) client-ip=2001:4830:134:3::11; From: Christopher Covington <christopher.covington@linaro.org> To: peter.maydell@linaro.org, qemu-devel@nongnu.org Date: Thu, 30 Apr 2015 14:14:27 -0400 Message-Id: <1430417667-4245-5-git-send-email-christopher.covington@linaro.org> In-Reply-To: <1430417667-4245-1-git-send-email-christopher.covington@linaro.org> References: <1430417667-4245-1-git-send-email-christopher.covington@linaro.org> Cc: Christopher Covington <christopher.covington@linaro.org> Subject: [Qemu-devel] [RFC 5/5] arm: Simplify cycle counter Precedence: list Errors-To: qemu-devel-bounces+patch=linaro.org@nongnu.org Sender: qemu-devel-bounces+patch=linaro.org@nongnu.org Mailing-list: list patchwork-forward@linaro.org; contact patchwork-forward+owners@linaro.org

Christopher Covington April 30, 2015, 6:14 p.m. UTC

Present a system with an instructions per cycle of exactly one.
This makes it less likely a user will mistake the cycle counter
values as meaningful and makes calculations involving cycles
trivial while preserving the necessary property of the cycle
counter register as monotonically increasing.

Signed-off-by: Christopher Covington <christopher.covington@linaro.org>
---
 target-arm/helper.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

Peter Maydell April 30, 2015, 6:27 p.m. UTC | #1

On 30 April 2015 at 19:14, Christopher Covington
<christopher.covington@linaro.org> wrote:
> Present a system with an instructions per cycle of exactly one.
> This makes it less likely a user will mistake the cycle counter
> values as meaningful and makes calculations involving cycles
> trivial while preserving the necessary property of the cycle
> counter register as monotonically increasing.
>
> Signed-off-by: Christopher Covington <christopher.covington@linaro.org>
> ---
>  target-arm/helper.c | 9 +++------
>  1 file changed, 3 insertions(+), 6 deletions(-)
>
> diff --git a/target-arm/helper.c b/target-arm/helper.c
> index 3e6fb0b..a027a19 100644
> --- a/target-arm/helper.c
> +++ b/target-arm/helper.c
> @@ -648,8 +648,7 @@ void pmccntr_sync(CPUARMState *env)
>  {
>      uint64_t temp_ticks;
>
> -    temp_ticks = muldiv64(qemu_clock_get_us(QEMU_CLOCK_VIRTUAL),
> -                          get_ticks_per_sec(), 1000000);
> +    temp_ticks = cpu_get_icount_raw();

Are you really really sure the _raw function is the correct one?
Nowhere else in the codebase calls it except the other wrappers
in cpu.c which provide a sane view of the instruction count...
(I suspect cpu_get_icount_raw() should really be static.)

PS: it would be helpful if you could make sure you include
a cover letter for future patchseries: patchsets without a
cover letter are awkward for my workflow... (A single
standalone patch doesn't need a coverletter.)

thanks
-- PMM

Christopher Covington April 30, 2015, 9:33 p.m. UTC | #2

Hi Peter,

Thanks for taking a look.

On Apr 30, 2015 2:28 PM, "Peter Maydell" <peter.maydell@linaro.org> wrote:
>
> On 30 April 2015 at 19:14, Christopher Covington
> <christopher.covington@linaro.org> wrote:
> > Present a system with an instructions per cycle of exactly one.
> > This makes it less likely a user will mistake the cycle counter
> > values as meaningful and makes calculations involving cycles
> > trivial while preserving the necessary property of the cycle
> > counter register as monotonically increasing.
> >
> > Signed-off-by: Christopher Covington <christopher.covington@linaro.org>
> > ---
> >  target-arm/helper.c | 9 +++------
> >  1 file changed, 3 insertions(+), 6 deletions(-)
> >
> > diff --git a/target-arm/helper.c b/target-arm/helper.c
> > index 3e6fb0b..a027a19 100644
> > --- a/target-arm/helper.c
> > +++ b/target-arm/helper.c
> > @@ -648,8 +648,7 @@ void pmccntr_sync(CPUARMState *env)
> >  {
> >      uint64_t temp_ticks;
> >
> > -    temp_ticks = muldiv64(qemu_clock_get_us(QEMU_CLOCK_VIRTUAL),
> > -                          get_ticks_per_sec(), 1000000);
> > +    temp_ticks = cpu_get_icount_raw();
>
> Are you really really sure the _raw function is the correct one?
> Nowhere else in the codebase calls it except the other wrappers
> in cpu.c which provide a sane view of the instruction count...
> (I suspect cpu_get_icount_raw() should really be static.)

I thought it wasn't being used because it was new, having been introduced
by Pavel Dovgalyuk's determinism patches.

When you talk about sanity, what would this patch make insane? The
instructions per second and cycles per second that would result? If so,
what if seconds/timer ticks were changed in the same patch to be derived
from the instruction count. All of this could potentially only apply with
-icount specified.

> PS: it would be helpful if you could make sure you include
> a cover letter for future patchseries: patchsets without a
> cover letter are awkward for my workflow... (A single
> standalone patch doesn't need a coverletter.)

Will do. For this series it should have been that the first four patches
are hopefully not controversial, but so far have only been tested by some
very simple bare metal code (using PSCI exit instead of Angel exit at the
end) and booting the Linux kernel and looking at the printks, but not yet
by running `perf stat`. The last patch (this one) I realize may be
controversial as it tries to make QEMU conform to certain expectations such
as determinism that it has not historically met.

Thanks,
Chris

Peter Maydell April 30, 2015, 10:02 p.m. UTC | #3

On 30 April 2015 at 22:33, Christopher Covington
<christopher.covington@linaro.org> wrote:
> On Apr 30, 2015 2:28 PM, "Peter Maydell" <peter.maydell@linaro.org> wrote:
>> Are you really really sure the _raw function is the correct one?
>> Nowhere else in the codebase calls it except the other wrappers
>> in cpu.c which provide a sane view of the instruction count...
>> (I suspect cpu_get_icount_raw() should really be static.)
>
> I thought it wasn't being used because it was new, having been introduced by
> Pavel Dovgalyuk's determinism patches.
>
> When you talk about sanity, what would this patch make insane? The
> instructions per second and cycles per second that would result? If so, what
> if seconds/timer ticks were changed in the same patch to be derived from the
> instruction count. All of this could potentially only apply with -icount
> specified.

I don't really know for certain how the code here works, but
it makes me very dubious when I see a function that is being
used literally nowhere else in any other target CPU, and
where every other code path to it goes via taking a lock.

Paolo: can you suggest what the right function to call to implement
a cycle counter is?

thanks
-- PMM

Christopher Covington May 1, 2015, 2:35 p.m. UTC | #4

On Thu, Apr 30, 2015 at 9:24 PM, Peter Crosthwaite
<peter.crosthwaite@xilinx.com> wrote:
> On Thu, Apr 30, 2015 at 11:14 AM, Christopher Covington
> <christopher.covington@linaro.org> wrote:
>> Present a system with an instructions per cycle of exactly one.
>> This makes it less likely a user will mistake the cycle counter
>> values as meaningful and makes calculations involving cycles
>> trivial while preserving the necessary property of the cycle
>> counter register as monotonically increasing.
>>
>> Signed-off-by: Christopher Covington <christopher.covington@linaro.org>
>> ---
>>  target-arm/helper.c | 9 +++------
>>  1 file changed, 3 insertions(+), 6 deletions(-)
>>
>> diff --git a/target-arm/helper.c b/target-arm/helper.c
>> index 3e6fb0b..a027a19 100644
>> --- a/target-arm/helper.c
>> +++ b/target-arm/helper.c
>> @@ -648,8 +648,7 @@ void pmccntr_sync(CPUARMState *env)
>>  {
>>      uint64_t temp_ticks;
>>
>> -    temp_ticks = muldiv64(qemu_clock_get_us(QEMU_CLOCK_VIRTUAL),
>> -                          get_ticks_per_sec(), 1000000);
>> +    temp_ticks = cpu_get_icount_raw();
>
> So I have guests (for better or worse) that make assumptions about the
> rate of the PMCCNTR WRT real time. But isn't the PMCCNTR really a
> clock cycle counter rather than an instruction counter? That clock
> rate is well defined even if it is just the trivial
> get_ticks_per_sec() at the moment. Ideally we should have a
> configurable clock rate in there instead of get_ticks_per_sec(). This
> is a major change in definition.

Is this with KVM, user emulation, or system emulation mode? I'm mainly
looking at things from the TCG (and mostly system mode) perspective.
If not using KVM, would the assumptions of your guests hold if you ran
them on a hardware device instead of QEMU TCG? I'm still trying to
understand all the code involved, but I don't see get_ticks_per_sec()
or any other constants as problematic.

> I can see your use case though, where you actually want this to mean
> something WRT program performance. Should we add a switch between the
> two behaviours?

This switch already exists, in the form of -icount, and I really
should have already been using it. What initially scared me away was
qemu_icount_bias and icount_adjust(). Accounting for halting seems
like a good thing but the accounting for "speed" by referencing a host
clock doesn't make sense to me. If I have an ARM development board
hooked up to an x86 desktop, it's not querying the desktop for time.
So why does it make sense for QEMU system emulation mode behave like
that? The -icount help text makes it sound like adjustment for speed
reasons should only take place when "auto" is specified, but I have
yet to understand the code well enough to verify that.

Chris

Peter Maydell May 6, 2015, 3:38 p.m. UTC | #5

On 6 May 2015 at 15:05, Peter Crosthwaite <peter.crosthwaite@xilinx.com> wrote:
> What I am most worried about (and I need to run some tests to really
> confirm facts) is what happens if a CPU WFIs. Should the PMCCNTR
> continue on or hold its value? If we match instruction execution to
> PMCCNTR to the PMCCNTR will freeze.

See the v8 ARM ARM D5.1.1: this doesn't count PE clock cycles, it's
linked to the hardware processor clock. The exact relationship is
IMPDEF so we have some leeway for doing whatever seems reasonable here.
Permitted things:
 * counter can stop on WFI (see D5.1.3)
 * counter can continue to run on WFI
   (it's impdef whether the PE clock is gated when in WFI, though
   I would expect that to be a popular implementation)
 * counter can read the same value if read twice in a row
 * counter can run forward a lot even if no insns executed
   (the example given is of a hyperthreading implementation)

So we should do whatever seems most convenient implementation-wise
I think...

-- PMM

Alistair Francis Sept. 28, 2015, 10:05 p.m. UTC | #6

On Thu, Sep 24, 2015 at 12:43 PM, Christopher Covington
<cov@codeaurora.org> wrote:
> cpu_get_ticks() provides a common interface across targets for
> calculating CPU cycles. Using this fixes PMCCNTR reads when -icount
> is specified (previously a non-increasing value was returned).
>
> Signed-off-by: Christopher Covington <cov@codeaurora.org>
> ---
>  target-arm/helper.c | 9 +++------
>  1 file changed, 3 insertions(+), 6 deletions(-)
>
> diff --git a/target-arm/helper.c b/target-arm/helper.c
> index 7dc49cb..32923fb 100644
> --- a/target-arm/helper.c
> +++ b/target-arm/helper.c
> @@ -729,8 +729,7 @@ void pmccntr_sync(CPUARMState *env)
>  {
>      uint64_t temp_ticks;
>
> -    temp_ticks = muldiv64(qemu_clock_get_us(QEMU_CLOCK_VIRTUAL),
> -                          get_ticks_per_sec(), 1000000);
> +    temp_ticks = cpu_get_ticks();

This patch doesn't apply anymore, you will need to rebase it.

Also I don't think this is correct. cpu_get_ticks() returns the host
CPU cycle counter, when in this case we want the guest cycles.

Thanks,

Alistair

>
>      if (env->cp15.c9_pmcr & PMCRD) {
>          /* Increment once every 64 processor clock cycles */
> @@ -768,8 +767,7 @@ static uint64_t pmccntr_read(CPUARMState *env, const ARMCPRegInfo *ri)
>          return env->cp15.c15_ccnt;
>      }
>
> -    total_ticks = muldiv64(qemu_clock_get_us(QEMU_CLOCK_VIRTUAL),
> -                           get_ticks_per_sec(), 1000000);
> +    total_ticks = cpu_get_ticks();
>
>      if (env->cp15.c9_pmcr & PMCRD) {
>          /* Increment once every 64 processor clock cycles */
> @@ -789,8 +787,7 @@ static void pmccntr_write(CPUARMState *env, const ARMCPRegInfo *ri,
>          return;
>      }
>
> -    total_ticks = muldiv64(qemu_clock_get_us(QEMU_CLOCK_VIRTUAL),
> -                           get_ticks_per_sec(), 1000000);
> +    total_ticks = cpu_get_ticks();
>
>      if (env->cp15.c9_pmcr & PMCRD) {
>          /* Increment once every 64 processor clock cycles */
> --
> Qualcomm Innovation Center, Inc.
> The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
> a Linux Foundation Collaborative Project
>
>

Christopher Covington Sept. 29, 2015, 2:07 p.m. UTC | #7

On 09/28/2015 06:05 PM, Alistair Francis wrote:
> On Thu, Sep 24, 2015 at 12:43 PM, Christopher Covington
> <cov@codeaurora.org> wrote:
>> cpu_get_ticks() provides a common interface across targets for
>> calculating CPU cycles. Using this fixes PMCCNTR reads when -icount
>> is specified (previously a non-increasing value was returned).
>>
>> Signed-off-by: Christopher Covington <cov@codeaurora.org>
>> ---
>>  target-arm/helper.c | 9 +++------
>>  1 file changed, 3 insertions(+), 6 deletions(-)
>>
>> diff --git a/target-arm/helper.c b/target-arm/helper.c
>> index 7dc49cb..32923fb 100644
>> --- a/target-arm/helper.c
>> +++ b/target-arm/helper.c
>> @@ -729,8 +729,7 @@ void pmccntr_sync(CPUARMState *env)
>>  {
>>      uint64_t temp_ticks;
>>
>> -    temp_ticks = muldiv64(qemu_clock_get_us(QEMU_CLOCK_VIRTUAL),
>> -                          get_ticks_per_sec(), 1000000);
>> +    temp_ticks = cpu_get_ticks();

> Also I don't think this is correct. cpu_get_ticks() returns the host
> CPU cycle counter, when in this case we want the guest cycles.

I too find the use of host CPU cycles quite perplexing. Paolo suggested it
[1]. Maybe there are timeouts in some software that tend to work better in
such a mode. Perhaps it is faster, although my intuition is that it's just
providing a facade of frequency scaling to the guest.

1. https://lists.gnu.org/archive/html/qemu-devel/2015-05/msg00162.html

I like to declare guest instructions per guest CPU cycles = 1. As I understand
it, an "-icount 0" pair of parameters is how to do this in QEMU for x86. I'd
like it to work for ARM.

I wrote a simple assembly test case which I'm working on porting it to the
kvm-unit-tests framework. In the non-icount case, I saw roughly the same order
of magnitude for guest IPC before and after the patch. I'd like to also write
CPU frequency (guest CPU cycles per generic timer guest seconds) and (M)IPS
(guest instructions per generic timer guest seconds) tests.

Thanks,
Christopher Covington

Christopher Covington Oct. 2, 2015, 4:44 p.m. UTC | #8

On 09/29/2015 10:07 AM, Christopher Covington wrote:
> On 09/28/2015 06:05 PM, Alistair Francis wrote:
>> On Thu, Sep 24, 2015 at 12:43 PM, Christopher Covington
>> <cov@codeaurora.org> wrote:
>>> cpu_get_ticks() provides a common interface across targets for
>>> calculating CPU cycles. Using this fixes PMCCNTR reads when -icount
>>> is specified (previously a non-increasing value was returned).
>>>
>>> Signed-off-by: Christopher Covington <cov@codeaurora.org>
>>> ---
>>>  target-arm/helper.c | 9 +++------
>>>  1 file changed, 3 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/target-arm/helper.c b/target-arm/helper.c
>>> index 7dc49cb..32923fb 100644
>>> --- a/target-arm/helper.c
>>> +++ b/target-arm/helper.c
>>> @@ -729,8 +729,7 @@ void pmccntr_sync(CPUARMState *env)
>>>  {
>>>      uint64_t temp_ticks;
>>>
>>> -    temp_ticks = muldiv64(qemu_clock_get_us(QEMU_CLOCK_VIRTUAL),
>>> -                          get_ticks_per_sec(), 1000000);
>>> +    temp_ticks = cpu_get_ticks();
> 
>> Also I don't think this is correct. cpu_get_ticks() returns the host
>> CPU cycle counter, when in this case we want the guest cycles.
> 
> I too find the use of host CPU cycles quite perplexing. Paolo suggested it
> [1]. Maybe there are timeouts in some software that tend to work better in
> such a mode. Perhaps it is faster, although my intuition is that it's just
> providing a facade of frequency scaling to the guest.
> 
> 1. https://lists.gnu.org/archive/html/qemu-devel/2015-05/msg00162.html
> 
> I like to declare guest instructions per guest CPU cycles = 1. As I understand
> it, an "-icount 0" pair of parameters is how to do this in QEMU for x86. I'd
> like it to work for ARM.
> 
> I wrote a simple assembly test case which I'm working on porting it to the
> kvm-unit-tests framework. In the non-icount case, I saw roughly the same order
> of magnitude for guest IPC before and after the patch. I'd like to also write
> CPU frequency (guest CPU cycles per generic timer guest seconds) and (M)IPS
> (guest instructions per generic timer guest seconds) tests.

I've sent out the CPI test case and while exercising it I noticed that
Laurent's patch fixed -icount. So my original goal has been accomplished. I'm
happy to rebase this patch if the authorities (Peter Maydell?) think using
cpu_get_ticks() is the right thing to do. Otherwise I'll probably try to move
on to support for the instructions event in the ARM PMU.

Thanks,
Christopher Covington

Peter Maydell Oct. 2, 2015, 4:56 p.m. UTC | #9

On 2 October 2015 at 17:44, Christopher Covington <cov@codeaurora.org> wrote:
> I've sent out the CPI test case and while exercising it I noticed that
> Laurent's patch fixed -icount. So my original goal has been accomplished. I'm
> happy to rebase this patch if the authorities (Peter Maydell?) think using
> cpu_get_ticks() is the right thing to do. Otherwise I'll probably try to move
> on to support for the instructions event in the ARM PMU.

Authority here is probably Peter Crosthwaite. I can produce an
opinion but I'd have to go and research a bunch of stuff to do
that, so I'm hoping to avoid it...

thanks
-- PMM

Peter Crosthwaite Oct. 2, 2015, 5:25 p.m. UTC | #10

On Fri, Oct 2, 2015 at 9:56 AM, Peter Maydell <peter.maydell@linaro.org> wrote:
> On 2 October 2015 at 17:44, Christopher Covington <cov@codeaurora.org> wrote:
>> I've sent out the CPI test case and while exercising it I noticed that
>> Laurent's patch fixed -icount. So my original goal has been accomplished. I'm
>> happy to rebase this patch if the authorities (Peter Maydell?) think using
>> cpu_get_ticks() is the right thing to do. Otherwise I'll probably try to move
>> on to support for the instructions event in the ARM PMU.
>
> Authority here is probably Peter Crosthwaite. I can produce an
> opinion but I'd have to go and research a bunch of stuff to do
> that, so I'm hoping to avoid it...
>

So my idea here is the CPU input frequency should be a property of the CPU.

Some experimental results confirm that the PMCCNTR on many common ARM
implementations is directly connected to the input clock and can be
relied on as a straight free-running counter. I think a genuine
instruction counter is something else and this timer should be
independent of any core provider of cycle count.

Regards,
Peter

> thanks
> -- PMM

Peter Maydell Oct. 2, 2015, 6:08 p.m. UTC | #11

On 2 October 2015 at 18:25, Peter Crosthwaite
<crosthwaitepeter@gmail.com> wrote:
> So my idea here is the CPU input frequency should be a property of the CPU.
>
> Some experimental results confirm that the PMCCNTR on many common ARM
> implementations is directly connected to the input clock and can be
> relied on as a straight free-running counter. I think a genuine
> instruction counter is something else and this timer should be
> independent of any core provider of cycle count.

Architecturally, the PMCCNTR counter is counting the hardware processor
clock. It's definitely not an instruction counter. (It's also not
counting Processor Element clock cycles, though that only makes a
difference if you have a multi-threaded hw implementation.) It is
generally subject to any hw changes in clock frequency (including if
your WFI/WFE do clock stopping).

What that means for QEMU I'm not totally sure :-)

thanks
-- PMM

Peter Crosthwaite Oct. 2, 2015, 6:14 p.m. UTC | #12

On Fri, Oct 2, 2015 at 11:08 AM, Peter Maydell <peter.maydell@linaro.org> wrote:
> On 2 October 2015 at 18:25, Peter Crosthwaite
> <crosthwaitepeter@gmail.com> wrote:
>> So my idea here is the CPU input frequency should be a property of the CPU.
>>
>> Some experimental results confirm that the PMCCNTR on many common ARM
>> implementations is directly connected to the input clock and can be
>> relied on as a straight free-running counter. I think a genuine
>> instruction counter is something else and this timer should be
>> independent of any core provider of cycle count.
>
> Architecturally, the PMCCNTR counter is counting the hardware processor
> clock. It's definitely not an instruction counter. (It's also not
> counting Processor Element clock cycles, though that only makes a
> difference if you have a multi-threaded hw implementation.) It is
> generally subject to any hw changes in clock frequency (including if
> your WFI/WFE do clock stopping).
>

WFI/WFE halting could be easily implemented directly as that, in
similar way to EL filtering.

> What that means for QEMU I'm not totally sure :-)

I think this all points to it being just another normal timer (like
those in hw/timer).

Regards,
Peter

>
> thanks
> -- PMM

Christopher Covington Oct. 2, 2015, 7:25 p.m. UTC | #13

On 10/02/2015 01:25 PM, Peter Crosthwaite wrote:
> On Fri, Oct 2, 2015 at 9:56 AM, Peter Maydell <peter.maydell@linaro.org> wrote:
>> On 2 October 2015 at 17:44, Christopher Covington <cov@codeaurora.org> wrote:
>>> I've sent out the CPI test case and while exercising it I noticed that
>>> Laurent's patch fixed -icount. So my original goal has been accomplished. I'm
>>> happy to rebase this patch if the authorities (Peter Maydell?) think using
>>> cpu_get_ticks() is the right thing to do. Otherwise I'll probably try to move
>>> on to support for the instructions event in the ARM PMU.
>>
>> Authority here is probably Peter Crosthwaite. I can produce an
>> opinion but I'd have to go and research a bunch of stuff to do
>> that, so I'm hoping to avoid it...
> 
> So my idea here is the CPU input frequency should be a property of the CPU.
> 
> Some experimental results confirm that the PMCCNTR on many common ARM
> implementations is directly connected to the input clock and can be
> relied on as a straight free-running counter. I think a genuine
> instruction counter is something else

Yes, the "genuine" instruction counter is something else. The instruction
count is only relevant for folks trying to get deterministic execution by
using the -icount option. QEMU TCG mode does not emulate a cycle-level input
clock for the guest (the whole class of functional models skip this
time-consuming step) but rather operates a block at a time. By doing a little
extra, I think it also interpolates the exact instruction count. Specifying a
fixed IPC = n is the most sensible way of deterministically calculating a
PMCCNTR_EL0 value that I know of. The -icount option allows users to choose
such deterministic behavior.

> and this timer should be independent of any core provider of cycle count.

What, if anything, do you think should be hooked up to the core provider of
cycle count?

Christopher Covington

Peter Crosthwaite Oct. 2, 2015, 7:56 p.m. UTC | #14

On Fri, Oct 2, 2015 at 12:25 PM, Christopher Covington
<cov@codeaurora.org> wrote:
> On 10/02/2015 01:25 PM, Peter Crosthwaite wrote:
>> On Fri, Oct 2, 2015 at 9:56 AM, Peter Maydell <peter.maydell@linaro.org> wrote:
>>> On 2 October 2015 at 17:44, Christopher Covington <cov@codeaurora.org> wrote:
>>>> I've sent out the CPI test case and while exercising it I noticed that
>>>> Laurent's patch fixed -icount. So my original goal has been accomplished. I'm
>>>> happy to rebase this patch if the authorities (Peter Maydell?) think using
>>>> cpu_get_ticks() is the right thing to do. Otherwise I'll probably try to move
>>>> on to support for the instructions event in the ARM PMU.
>>>
>>> Authority here is probably Peter Crosthwaite. I can produce an
>>> opinion but I'd have to go and research a bunch of stuff to do
>>> that, so I'm hoping to avoid it...
>>
>> So my idea here is the CPU input frequency should be a property of the CPU.
>>
>> Some experimental results confirm that the PMCCNTR on many common ARM
>> implementations is directly connected to the input clock and can be
>> relied on as a straight free-running counter. I think a genuine
>> instruction counter is something else
>
> Yes, the "genuine" instruction counter is something else. The instruction
> count is only relevant for folks trying to get deterministic execution by
> using the -icount option. QEMU TCG mode does not emulate a cycle-level input
> clock for the guest (the whole class of functional models skip this
> time-consuming step) but rather operates a block at a time. By doing a little
> extra, I think it also interpolates the exact instruction count. Specifying a
> fixed IPC = n is the most sensible way of deterministically calculating a
> PMCCNTR_EL0 value that I know of. The -icount option allows users to choose
> such deterministic behavior.
>
>> and this timer should be independent of any core provider of cycle count.
>
> What, if anything, do you think should be hooked up to the core provider of
> cycle count?
>

Depends, Is this a virtual-machine only concept, or do you have
something with a real-hardware analogue?

Regards,
Peter

> Christopher Covington
>
> --
> Qualcomm Innovation Center, Inc.
> The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
> a Linux Foundation Collaborative Project

Christopher Covington Oct. 2, 2015, 8:48 p.m. UTC | #15

On 10/02/2015 03:56 PM, Peter Crosthwaite wrote:
> On Fri, Oct 2, 2015 at 12:25 PM, Christopher Covington
> <cov@codeaurora.org> wrote:
>> On 10/02/2015 01:25 PM, Peter Crosthwaite wrote:
>>> On Fri, Oct 2, 2015 at 9:56 AM, Peter Maydell <peter.maydell@linaro.org> wrote:
>>>> On 2 October 2015 at 17:44, Christopher Covington <cov@codeaurora.org> wrote:
>>>>> I've sent out the CPI test case and while exercising it I noticed that
>>>>> Laurent's patch fixed -icount. So my original goal has been accomplished. I'm
>>>>> happy to rebase this patch if the authorities (Peter Maydell?) think using
>>>>> cpu_get_ticks() is the right thing to do. Otherwise I'll probably try to move
>>>>> on to support for the instructions event in the ARM PMU.
>>>>
>>>> Authority here is probably Peter Crosthwaite. I can produce an
>>>> opinion but I'd have to go and research a bunch of stuff to do
>>>> that, so I'm hoping to avoid it...
>>>
>>> So my idea here is the CPU input frequency should be a property of the CPU.
>>>
>>> Some experimental results confirm that the PMCCNTR on many common ARM
>>> implementations is directly connected to the input clock and can be
>>> relied on as a straight free-running counter. I think a genuine
>>> instruction counter is something else
>>
>> Yes, the "genuine" instruction counter is something else. The instruction
>> count is only relevant for folks trying to get deterministic execution by
>> using the -icount option. QEMU TCG mode does not emulate a cycle-level input
>> clock for the guest (the whole class of functional models skip this
>> time-consuming step) but rather operates a block at a time. By doing a little
>> extra, I think it also interpolates the exact instruction count. Specifying a
>> fixed IPC = n is the most sensible way of deterministically calculating a
>> PMCCNTR_EL0 value that I know of. The -icount option allows users to choose
>> such deterministic behavior.
>>
>>> and this timer should be independent of any core provider of cycle count.
>>
>> What, if anything, do you think should be hooked up to the core provider of
>> cycle count?
> 
> Depends, Is this a virtual-machine only concept, or do you have
> something with a real-hardware analogue?

What I meant to ask was, do you see any reason for cpu_get_ticks() to exist?
If no architecture besides i386 wants to use it, perhaps the code should be
moved there.

Thanks,
Christopher Covington

Peter Maydell Oct. 2, 2015, 10:41 p.m. UTC | #16

On 2 October 2015 at 21:48, Christopher Covington <cov@codeaurora.org> wrote:
> What I meant to ask was, do you see any reason for cpu_get_ticks() to exist?
> If no architecture besides i386 wants to use it, perhaps the code should be
> moved there.

OTOH various non-x86 things do use the closely related cpu_get_real_ticks(),
and the implementation of cpu_get_ticks() is very closely related to
the other clock code in cpus.c.

-- PMM

Paolo Bonzini Oct. 5, 2015, 2:09 p.m. UTC | #17

On 03/10/2015 00:41, Peter Maydell wrote:
> > What I meant to ask was, do you see any reason for cpu_get_ticks() to exist?
> > If no architecture besides i386 wants to use it, perhaps the code should be
> > moved there.
>
> OTOH various non-x86 things do use the closely related cpu_get_real_ticks(),
> and the implementation of cpu_get_ticks() is very closely related to
> the other clock code in cpus.c.

cpu_get_real_ticks() is returning the host cycle counter;
cpu_get_ticks() is stopping/resuming it when the VM is stopped/resumed.
 In other words, cpu_get_real_ticks() is to cpu_get_ticks() what
QEMU_CLOCK_REALTIME is to QEMU_CLOCK_VIRTUAL.

They are also similar in that both cpu_get_ticks() and
QEMU_CLOCK_VIRTUAL "morph" to the instruction count in icount mode.

cpu_get_ticks() should be the right implementation for the ARM PMCCNTR
cycle counter, since: 1) PMCCNTR is roughly the same as the x86 RDTSC;
2) cpu_get_ticks() is the only monotonically increasing value outside
icount mode.

Paolo

Peter Maydell Oct. 5, 2015, 2:11 p.m. UTC | #18

On 5 October 2015 at 15:09, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>
> On 03/10/2015 00:41, Peter Maydell wrote:
>> > What I meant to ask was, do you see any reason for cpu_get_ticks() to exist?
>> > If no architecture besides i386 wants to use it, perhaps the code should be
>> > moved there.
>>
>> OTOH various non-x86 things do use the closely related cpu_get_real_ticks(),
>> and the implementation of cpu_get_ticks() is very closely related to
>> the other clock code in cpus.c.
>
> cpu_get_real_ticks() is returning the host cycle counter;
> cpu_get_ticks() is stopping/resuming it when the VM is stopped/resumed.
>  In other words, cpu_get_real_ticks() is to cpu_get_ticks() what
> QEMU_CLOCK_REALTIME is to QEMU_CLOCK_VIRTUAL.

...but it seems wrong to have anything in the simulation care
about the host cycle counter, especially since on some hosts
the underlying implementation is terrible.

thanks
-- PMM

Paolo Bonzini Oct. 5, 2015, 2:27 p.m. UTC | #19

On 05/10/2015 16:11, Peter Maydell wrote:
> > > OTOH various non-x86 things do use the closely related cpu_get_real_ticks(),
> > > and the implementation of cpu_get_ticks() is very closely related to
> > > the other clock code in cpus.c.
> >
> > cpu_get_real_ticks() is returning the host cycle counter;
> > cpu_get_ticks() is stopping/resuming it when the VM is stopped/resumed.
> >  In other words, cpu_get_real_ticks() is to cpu_get_ticks() what
> > QEMU_CLOCK_REALTIME is to QEMU_CLOCK_VIRTUAL.
>
> ...but it seems wrong to have anything in the simulation care
> about the host cycle counter,

I agree; instruction count is much better.  Unfortunately, -icount has a
relatively hefty performance penalty.  It is also common that code using
PMCCNTR/RDTSC wants the increment between two reads to be large-ish and
at least remotely related to the number of instructions that were
executed in between.

I've also used rdtsc in the guest from time to time to benchmark changes
to TCG. Having it map to the host cycle counter is somewhat useful for
that purpose, though one might say this usecase is niche.

> especially since on some hosts
> the underlying implementation is terrible.

I remember seeing a bug where this terrible implementation caused
divisions by zero on the host.  The default implementation in
include/qemu/timer.h should be changed to
qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL).

Note that in practice this terrible implementation is only used on
ARM/AArch64.  Is PMCCNTR or something similar accessible from userspace?
 I guess no, since even write access is enabled via PMUSERENR (and in
general PMUSERENR ought to be false on a preemptive-multitasking OS).

In the end I guess qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) would work just
as well for PMCCNTR, but I think non-Linux OSes still have a relatively
slow clock_gettime() so there is still an advantage in using
cpu_get_ticks().

Paolo

ps: on x86, a long time ago rdtsc was reliable and clock_gettime() was
slow, so it was a no-brainer for benchmarks and the like; then rdtsc
became unreliable and clock_gettime() became fast on Linux; but now at
least on new machines rdtsc is usually reliable again.

Peter Maydell Oct. 6, 2015, 12:49 p.m. UTC | #20

On 5 October 2015 at 15:27, Paolo Bonzini <pbonzini@redhat.com> wrote:
> I remember seeing a bug where this terrible implementation caused
> divisions by zero on the host.  The default implementation in
> include/qemu/timer.h should be changed to
> qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL).
>
> Note that in practice this terrible implementation is only used on
> ARM/AArch64.  Is PMCCNTR or something similar accessible from userspace?
>  I guess no, since even write access is enabled via PMUSERENR (and in
> general PMUSERENR ought to be false on a preemptive-multitasking OS).

Yeah, there's no guarantee on userspace access. I think the
fastest way to get some kind of count is to use a library
function that boils down to gettimeofday(2), which we will
do purely in userspace in the kernel VDSO if possible.

thanks
-- PMM

Paolo Bonzini Oct. 6, 2015, 12:58 p.m. UTC | #21

On 06/10/2015 14:49, Peter Maydell wrote:
> Yeah, there's no guarantee on userspace access. I think the
> fastest way to get some kind of count is to use a library
> function that boils down to gettimeofday(2), which we will
> do purely in userspace in the kernel VDSO if possible.

Could we just use CNTVCT_EL0?  Which cores have it?

Paolo

Peter Maydell Oct. 6, 2015, 1:06 p.m. UTC | #22

On 6 October 2015 at 13:58, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>
> On 06/10/2015 14:49, Peter Maydell wrote:
>> Yeah, there's no guarantee on userspace access. I think the
>> fastest way to get some kind of count is to use a library
>> function that boils down to gettimeofday(2), which we will
>> do purely in userspace in the kernel VDSO if possible.

Looking closer, clock_gettime() also has a userspace
only fastpath in the VDSO.

> Could we just use CNTVCT_EL0?  Which cores have it?

Not guaranteed to be enabled for userspace access by the kernel,
and even if it is enabled, the kernel folks don't (last time I
checked) consider this userspace ABI -- it's only there for
the benefit of the VDSO. (CNTVCT_EL0 is how the fast clock_gettime
and getimeofday paths are implemented.)

thanks
-- PMM

Paolo Bonzini Oct. 6, 2015, 1:10 p.m. UTC | #23

On 06/10/2015 15:06, Peter Maydell wrote:
> Looking closer, clock_gettime() also has a userspace
> only fastpath in the VDSO.

Yup, hence the patch that changed the default cpu_get_real_ticks()
implementation to get_clock().

>> > Could we just use CNTVCT_EL0?  Which cores have it?
>
> Not guaranteed to be enabled for userspace access by the kernel,
> and even if it is enabled, the kernel folks don't (last time I
> checked) consider this userspace ABI -- it's only there for
> the benefit of the VDSO.

Ok, thanks.

> (CNTVCT_EL0 is how the fast clock_gettime
> and getimeofday paths are implemented.)

Yes.

Paolo

Peter Maydell Oct. 13, 2015, 8:53 p.m. UTC | #24

On 24 September 2015 at 20:43, Christopher Covington <cov@codeaurora.org> wrote:
> cpu_get_ticks() provides a common interface across targets for
> calculating CPU cycles. Using this fixes PMCCNTR reads when -icount
> is specified (previously a non-increasing value was returned).
>
> Signed-off-by: Christopher Covington <cov@codeaurora.org>
> ---
>  target-arm/helper.c | 9 +++------
>  1 file changed, 3 insertions(+), 6 deletions(-)

So, I think the conclusion from this thread was that we should
(a) change the default cpu_get_host_ticks() implementation to
call get_clock()
(b) rebase this patch and apply it

Is anybody planning to do that?

In any case, I'm taking this thread off my "must-review" list :-)

thanks
-- PMM

Christopher Covington Oct. 14, 2015, 12:10 p.m. UTC | #25

On 10/13/2015 04:53 PM, Peter Maydell wrote:
> On 24 September 2015 at 20:43, Christopher Covington <cov@codeaurora.org> wrote:
>> cpu_get_ticks() provides a common interface across targets for
>> calculating CPU cycles. Using this fixes PMCCNTR reads when -icount
>> is specified (previously a non-increasing value was returned).
>>
>> Signed-off-by: Christopher Covington <cov@codeaurora.org>
>> ---
>>  target-arm/helper.c | 9 +++------
>>  1 file changed, 3 insertions(+), 6 deletions(-)
> 
> So, I think the conclusion from this thread was that we should
> (a) change the default cpu_get_host_ticks() implementation to
> call get_clock()
> (b) rebase this patch and apply it
> 
> Is anybody planning to do that?

It's not in my plans at the moment.

Thanks,
Christopher Covington

[RFC,5/5] arm: Simplify cycle counter

Commit Message

Comments

Patch