Message ID: 20211223030149.1947418-1-richard.henderson@linaro.org
Series: Reorg ppc64 pmu insn counting
On 12/23/21 00:01, Richard Henderson wrote:
> In contrast to Daniel's version, the code stays in power8-pmu.c,
> but is better organized to not take so much overhead.
>
> Before:
>
>   32.97%  qemu-system-ppc  qemu-system-ppc64  [.] pmc_get_event
>   20.22%  qemu-system-ppc  qemu-system-ppc64  [.] helper_insns_inc
>    4.52%  qemu-system-ppc  qemu-system-ppc64  [.] hreg_compute_hflags_value
>    3.30%  qemu-system-ppc  qemu-system-ppc64  [.] helper_lookup_tb_ptr
>    2.68%  qemu-system-ppc  qemu-system-ppc64  [.] tcg_gen_code
>    2.28%  qemu-system-ppc  qemu-system-ppc64  [.] cpu_exec
>    1.84%  qemu-system-ppc  qemu-system-ppc64  [.] pmu_insn_cnt_enabled
>
> After:
>
>    8.42%  qemu-system-ppc  qemu-system-ppc64  [.] hreg_compute_hflags_value
>    6.65%  qemu-system-ppc  qemu-system-ppc64  [.] cpu_exec
>    6.63%  qemu-system-ppc  qemu-system-ppc64  [.] helper_insns_inc

Thanks for looking this up. I had no idea the original C code was that slow.

This reorg is breaking PMU-EBB tests, unfortunately. These tests are run from the kernel
tree [1] and I test them inside a pSeries TCG guest. You'll need to apply patches 9 and
10 of [2] beforehand (they apply cleanly on current master) because they aren't upstream
yet and EBB needs them.

The tests that are breaking consistently with this reorg are:

back_to_back_ebbs_test.c
cpu_event_pinned_vs_ebb_test.c
cycles_test.c
task_event_pinned_vs_ebb_test.c

The issue here is that these tests exercise different Perf events and aspects of branching
(e.g. how fast we're detecting a counter overflow, how many times, etc.) and I haven't
been able to find a fix on top of your C reorg yet.

With that in mind I decided to post a new version of my TCG rework, with less repetition
and more concise code, to have an alternative that can be used upstream to fix the
Avocado tests. Meanwhile I'll see if I can get your reorg working with all the EBB tests
we need. All things being equal - similar performance, all EBB tests passing - I'd rather
stay with your C code than my TCG rework, since yours doesn't rely on TCG ops knowledge
to maintain it.

Thanks,

Daniel

[1] https://github.com/torvalds/linux/tree/master/tools/testing/selftests/powerpc/pmu/ebb
[2] https://lists.gnu.org/archive/html/qemu-devel/2021-12/msg00073.html

>
> r~
>
>
> Richard Henderson (3):
>   target/ppc: Cache per-pmc insn and cycle count settings
>   target/ppc: Rewrite pmu_increment_insns
>   target/ppc: Use env->pnc_cyc_cnt
>
>  target/ppc/cpu.h         |   3 +
>  target/ppc/power8-pmu.h  |  14 +--
>  target/ppc/cpu_init.c    |   1 +
>  target/ppc/helper_regs.c |   2 +-
>  target/ppc/machine.c     |   2 +
>  target/ppc/power8-pmu.c  | 230 ++++++++++++++++-----------------------
>  6 files changed, 108 insertions(+), 144 deletions(-)
On 12/23/21 12:36 PM, Daniel Henrique Barboza wrote:
> This reorg is breaking PMU-EBB tests, unfortunately. These tests are run from the kernel
> tree [1] and I test them inside a pSeries TCG guest. You'll need to apply patches 9 and
> 10 of [2] beforehand (they apply cleanly on current master) because they aren't upstream
> yet and EBB needs them.
>
> The tests that are breaking consistently with this reorg are:
>
> back_to_back_ebbs_test.c
> cpu_event_pinned_vs_ebb_test.c
> cycles_test.c
> task_event_pinned_vs_ebb_test.c

In which case perhaps drop my first patch for now, and instead simply replicate your TCG
algorithm in C exactly -- using none of the helpers that currently exist.

We can improve the code, and the use of pmc_get_event from hreg_compute_hflags_value, as
a second step.

r~
On 12/23/21 18:19, Richard Henderson wrote:
> On 12/23/21 12:36 PM, Daniel Henrique Barboza wrote:
>> This reorg is breaking PMU-EBB tests, unfortunately. These tests are run from the kernel
>> tree [1] and I test them inside a pSeries TCG guest. You'll need to apply patches 9 and
>> 10 of [2] beforehand (they apply cleanly on current master) because they aren't upstream
>> yet and EBB needs them.
>>
>> The tests that are breaking consistently with this reorg are:
>>
>> back_to_back_ebbs_test.c
>> cpu_event_pinned_vs_ebb_test.c
>> cycles_test.c
>> task_event_pinned_vs_ebb_test.c
>
> In which case perhaps drop my first patch for now, and instead simply replicate your TCG
> algorithm in C exactly -- using none of the helpers that currently exist.
>
> We can improve the code, and the use of pmc_get_event from hreg_compute_hflags_value, as
> a second step.

I'll try this out. I'll let you know how that goes.

Thanks,

Daniel

>
> r~
On 12/23/21 18:19, Richard Henderson wrote:
> On 12/23/21 12:36 PM, Daniel Henrique Barboza wrote:
>> This reorg is breaking PMU-EBB tests, unfortunately. These tests are run from the kernel
>> tree [1] and I test them inside a pSeries TCG guest. You'll need to apply patches 9 and
>> 10 of [2] beforehand (they apply cleanly on current master) because they aren't upstream
>> yet and EBB needs them.
>>
>> The tests that are breaking consistently with this reorg are:
>>
>> back_to_back_ebbs_test.c
>> cpu_event_pinned_vs_ebb_test.c
>> cycles_test.c
>> task_event_pinned_vs_ebb_test.c
>
> In which case perhaps drop my first patch for now, and instead simply replicate your TCG
> algorithm in C exactly -- using none of the helpers that currently exist.
>
> We can improve the code, and the use of pmc_get_event from hreg_compute_hflags_value, as
> a second step.

While attempting to do that I figured out what was off with this series and ended up
fixing it.

It's now working with the event-based branch interrupt tests, and Avocado seems happy
as well. It took some small adjustments/fixes in patches 1/2 and an extra patch of mine
tuning the existing logic after the reorg.

I'll clean it up and re-send it next week/year.

Thanks,

Daniel

>
> r~
On 12/30/21 23:12, Daniel Henrique Barboza wrote:
> On 12/23/21 18:19, Richard Henderson wrote:
>> On 12/23/21 12:36 PM, Daniel Henrique Barboza wrote:
>>> This reorg is breaking PMU-EBB tests, unfortunately. These tests are run from the kernel
>>> tree [1] and I test them inside a pSeries TCG guest. You'll need to apply patches 9 and
>>> 10 of [2] beforehand (they apply cleanly on current master) because they aren't upstream
>>> yet and EBB needs them.
>>>
>>> The tests that are breaking consistently with this reorg are:
>>>
>>> back_to_back_ebbs_test.c
>>> cpu_event_pinned_vs_ebb_test.c
>>> cycles_test.c
>>> task_event_pinned_vs_ebb_test.c
>>
>> In which case perhaps drop my first patch for now, and instead simply replicate your TCG
>> algorithm in C exactly -- using none of the helpers that currently exist.
>>
>> We can improve the code, and the use of pmc_get_event from hreg_compute_hflags_value, as
>> a second step.
>
> While attempting to do that I figured out what was off with this series and ended up
> fixing it.
>
> It's now working with the event-based branch interrupt tests, and Avocado seems happy
> as well. It took some small adjustments/fixes in patches 1/2 and an extra patch of mine
> tuning the existing logic after the reorg.
>
> I'll clean it up and re-send it next week/year.

Shouldn't we merge this series first? It is really improving emulation and keeps the
check-avocado tests under the timeout limit (which I find important).

Thanks,

C.
On 1/3/22 03:48, Cédric Le Goater wrote:
> On 12/30/21 23:12, Daniel Henrique Barboza wrote:
>> On 12/23/21 18:19, Richard Henderson wrote:
>>> On 12/23/21 12:36 PM, Daniel Henrique Barboza wrote:
>>>> This reorg is breaking PMU-EBB tests, unfortunately. These tests are run from the kernel
>>>> tree [1] and I test them inside a pSeries TCG guest. You'll need to apply patches 9 and
>>>> 10 of [2] beforehand (they apply cleanly on current master) because they aren't upstream
>>>> yet and EBB needs them.
>>>>
>>>> The tests that are breaking consistently with this reorg are:
>>>>
>>>> back_to_back_ebbs_test.c
>>>> cpu_event_pinned_vs_ebb_test.c
>>>> cycles_test.c
>>>> task_event_pinned_vs_ebb_test.c
>>>
>>> In which case perhaps drop my first patch for now, and instead simply replicate your TCG
>>> algorithm in C exactly -- using none of the helpers that currently exist.
>>>
>>> We can improve the code, and the use of pmc_get_event from hreg_compute_hflags_value, as
>>> a second step.
>>
>> While attempting to do that I figured out what was off with this series and ended up
>> fixing it.
>>
>> It's now working with the event-based branch interrupt tests, and Avocado seems happy
>> as well. It took some small adjustments/fixes in patches 1/2 and an extra patch of mine
>> tuning the existing logic after the reorg.
>>
>> I'll clean it up and re-send it next week/year.
>
> Shouldn't we merge this series first? It is really improving emulation and keeps the
> check-avocado tests under the timeout limit (which I find important).

As it is, this series breaks the EBB tests if we apply the EBB interrupt patches on top
of it. If your concern is strictly with the Avocado tests, then we'd be better off
pushing my TCG version, which fixes Avocado and doesn't break anything else.

Besides, I'll send v2 of this series today, tomorrow at the latest. We're better off
pushing the fixed version.

Thanks,

Daniel

> Thanks,
>
> C.
Daniel Henrique Barboza <danielhb413@gmail.com> writes:

> On 12/23/21 00:01, Richard Henderson wrote:
>> In contrast to Daniel's version, the code stays in power8-pmu.c,
>> but is better organized to not take so much overhead.
>>
>> Before:
>>
>>   32.97%  qemu-system-ppc  qemu-system-ppc64  [.] pmc_get_event
>>   20.22%  qemu-system-ppc  qemu-system-ppc64  [.] helper_insns_inc
>>    4.52%  qemu-system-ppc  qemu-system-ppc64  [.] hreg_compute_hflags_value
>>    3.30%  qemu-system-ppc  qemu-system-ppc64  [.] helper_lookup_tb_ptr
>>    2.68%  qemu-system-ppc  qemu-system-ppc64  [.] tcg_gen_code
>>    2.28%  qemu-system-ppc  qemu-system-ppc64  [.] cpu_exec
>>    1.84%  qemu-system-ppc  qemu-system-ppc64  [.] pmu_insn_cnt_enabled
>>
>> After:
>>
>>    8.42%  qemu-system-ppc  qemu-system-ppc64  [.] hreg_compute_hflags_value
>>    6.65%  qemu-system-ppc  qemu-system-ppc64  [.] cpu_exec
>>    6.63%  qemu-system-ppc  qemu-system-ppc64  [.] helper_insns_inc
>
> Thanks for looking this up. I had no idea the original C code was that slow.
>
<snip>
>
> With that in mind I decided to post a new version of my TCG rework, with less repetition
> and more concise code, to have an alternative that can be used upstream to fix the
> Avocado tests. Meanwhile I'll see if I can get your reorg working with all the EBB tests
> we need. All things being equal - similar performance, all EBB tests passing - I'd rather
> stay with your C code than my TCG rework, since yours doesn't rely on TCG ops knowledge
> to maintain it.

Reading this series did make me wonder if we need a more generic service from TCG for
helping with "internal" instrumentation needed for things like decent PMU emulation. We
haven't gone as far with this on ARM yet, but it would be nice to. It would be even nicer
if such a facility could be used by things like icount as well, so we don't end up doing
the same thing twice.

> Thanks,
>
> Daniel
>
> [1] https://github.com/torvalds/linux/tree/master/tools/testing/selftests/powerpc/pmu/ebb
> [2] https://lists.gnu.org/archive/html/qemu-devel/2021-12/msg00073.html
>
>> r~
>>
>> Richard Henderson (3):
>>   target/ppc: Cache per-pmc insn and cycle count settings
>>   target/ppc: Rewrite pmu_increment_insns
>>   target/ppc: Use env->pnc_cyc_cnt
>>
>>  target/ppc/cpu.h         |   3 +
>>  target/ppc/power8-pmu.h  |  14 +--
>>  target/ppc/cpu_init.c    |   1 +
>>  target/ppc/helper_regs.c |   2 +-
>>  target/ppc/machine.c     |   2 +
>>  target/ppc/power8-pmu.c  | 230 ++++++++++++++++-----------------------
>>  6 files changed, 108 insertions(+), 144 deletions(-)
On 1/3/22 12:07, Alex Bennée wrote:
> Daniel Henrique Barboza <danielhb413@gmail.com> writes:
>
>> On 12/23/21 00:01, Richard Henderson wrote:
>>> In contrast to Daniel's version, the code stays in power8-pmu.c,
>>> but is better organized to not take so much overhead.
>>>
>>> Before:
>>>
>>>   32.97%  qemu-system-ppc  qemu-system-ppc64  [.] pmc_get_event
>>>   20.22%  qemu-system-ppc  qemu-system-ppc64  [.] helper_insns_inc
>>>    4.52%  qemu-system-ppc  qemu-system-ppc64  [.] hreg_compute_hflags_value
>>>    3.30%  qemu-system-ppc  qemu-system-ppc64  [.] helper_lookup_tb_ptr
>>>    2.68%  qemu-system-ppc  qemu-system-ppc64  [.] tcg_gen_code
>>>    2.28%  qemu-system-ppc  qemu-system-ppc64  [.] cpu_exec
>>>    1.84%  qemu-system-ppc  qemu-system-ppc64  [.] pmu_insn_cnt_enabled
>>>
>>> After:
>>>
>>>    8.42%  qemu-system-ppc  qemu-system-ppc64  [.] hreg_compute_hflags_value
>>>    6.65%  qemu-system-ppc  qemu-system-ppc64  [.] cpu_exec
>>>    6.63%  qemu-system-ppc  qemu-system-ppc64  [.] helper_insns_inc
>>
>> Thanks for looking this up. I had no idea the original C code was that slow.
>>
> <snip>
>>
>> With that in mind I decided to post a new version of my TCG rework, with less repetition
>> and more concise code, to have an alternative that can be used upstream to fix the
>> Avocado tests. Meanwhile I'll see if I can get your reorg working with all the EBB tests
>> we need. All things being equal - similar performance, all EBB tests passing - I'd rather
>> stay with your C code than my TCG rework, since yours doesn't rely on TCG ops knowledge
>> to maintain it.
>
> Reading this series did make me wonder if we need a more generic service from TCG for
> helping with "internal" instrumentation needed for things like decent PMU emulation. We
> haven't gone as far with this on ARM yet, but it would be nice to. It would be even nicer
> if such a facility could be used by things like icount as well, so we don't end up doing
> the same thing twice.

Back in May 2021, when I first started working on this code, I tried to base it on the
ARM PMU code. In fact, the cycle and insn calculation done in the very first version of
this work was based on what ARM does in target/arm/helper.c, cycles_get_count() and
instructions_get_count(). The cycle calculation got simplified because our PPC64 CPU has
a 1 GHz clock, so it's easier to just consider 1 ns = 1 cycle.

For instruction counting, aside from my 2-3 weeks of spectacular failures trying to count
instructions inside translate.c, I also looked into how TCG plugins work and tried to do
something similar to what plugin_gen_tb_end() does at the end of translator_loop() in
accel/tcg/translator.c. For some reason I wasn't able to replicate the behavior I would
get if I used the TCG plugin framework in the 'canonical' way. I ended up doing something
similar to what instructions_get_count() from ARM does, which relies on icount. Richard
then aided me in figuring out that I could count instructions directly by tapping into
the end of each TB.

So, for a generic service of sorts, I believe it would be nice to re-use the TCG plugin
API in the internal instrumentation (I tried it once, failed, and I'm not sure whether I
messed up or it's not possible ATM). That would be a good start toward getting all this
logic into generic code for internal translate code to use.
Thanks,

Daniel

>
>> Thanks,
>>
>> Daniel
>>
>> [1] https://github.com/torvalds/linux/tree/master/tools/testing/selftests/powerpc/pmu/ebb
>> [2] https://lists.gnu.org/archive/html/qemu-devel/2021-12/msg00073.html
>>
>>> r~
>>>
>>> Richard Henderson (3):
>>>   target/ppc: Cache per-pmc insn and cycle count settings
>>>   target/ppc: Rewrite pmu_increment_insns
>>>   target/ppc: Use env->pnc_cyc_cnt
>>>
>>>  target/ppc/cpu.h         |   3 +
>>>  target/ppc/power8-pmu.h  |  14 +--
>>>  target/ppc/cpu_init.c    |   1 +
>>>  target/ppc/helper_regs.c |   2 +-
>>>  target/ppc/machine.c     |   2 +
>>>  target/ppc/power8-pmu.c  | 230 ++++++++++++++++-----------------------
>>>  6 files changed, 108 insertions(+), 144 deletions(-)
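[Editor's sketch of the "canonical" plugin approach mentioned above: counting executed
instructions through the public plugin API (qemu-plugin.h) with an inline add at TB
execution time, in the style of contrib/plugins/insn.c. Illustrative only: it keeps one
global counter and skips per-vCPU accounting.]

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <qemu-plugin.h>

QEMU_PLUGIN_EXPORT int qemu_plugin_version = QEMU_PLUGIN_VERSION;

static uint64_t insn_count;

static void vcpu_tb_trans(qemu_plugin_id_t id, struct qemu_plugin_tb *tb)
{
    /* Arrange for "insn_count += n_insns" each time this TB executes. */
    qemu_plugin_register_vcpu_tb_exec_inline(tb, QEMU_PLUGIN_INLINE_ADD_U64,
                                             &insn_count,
                                             qemu_plugin_tb_n_insns(tb));
}

static void plugin_exit(qemu_plugin_id_t id, void *p)
{
    fprintf(stderr, "executed %" PRIu64 " instructions\n", insn_count);
}

QEMU_PLUGIN_EXPORT int qemu_plugin_install(qemu_plugin_id_t id,
                                           const qemu_info_t *info,
                                           int argc, char **argv)
{
    qemu_plugin_register_vcpu_tb_trans_cb(id, vcpu_tb_trans);
    qemu_plugin_register_atexit_cb(id, plugin_exit, NULL);
    return 0;
}

Built as a shared object and loaded with -plugin, this counts from outside the guest; the
question in the discussion below is whether the same machinery can serve QEMU's own
internal counters.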
Daniel Henrique Barboza <danielhb413@gmail.com> writes:

> On 1/3/22 12:07, Alex Bennée wrote:
>> Daniel Henrique Barboza <danielhb413@gmail.com> writes:
>>
>>> On 12/23/21 00:01, Richard Henderson wrote:
>>>> In contrast to Daniel's version, the code stays in power8-pmu.c,
>>>> but is better organized to not take so much overhead.
>>>>
>>>> Before:
>>>>
>>>>   32.97%  qemu-system-ppc  qemu-system-ppc64  [.] pmc_get_event
>>>>   20.22%  qemu-system-ppc  qemu-system-ppc64  [.] helper_insns_inc
>>>>    4.52%  qemu-system-ppc  qemu-system-ppc64  [.] hreg_compute_hflags_value
>>>>    3.30%  qemu-system-ppc  qemu-system-ppc64  [.] helper_lookup_tb_ptr
>>>>    2.68%  qemu-system-ppc  qemu-system-ppc64  [.] tcg_gen_code
>>>>    2.28%  qemu-system-ppc  qemu-system-ppc64  [.] cpu_exec
>>>>    1.84%  qemu-system-ppc  qemu-system-ppc64  [.] pmu_insn_cnt_enabled
>>>>
>>>> After:
>>>>
>>>>    8.42%  qemu-system-ppc  qemu-system-ppc64  [.] hreg_compute_hflags_value
>>>>    6.65%  qemu-system-ppc  qemu-system-ppc64  [.] cpu_exec
>>>>    6.63%  qemu-system-ppc  qemu-system-ppc64  [.] helper_insns_inc
>>>
>>> Thanks for looking this up. I had no idea the original C code was that slow.
>>>
>> <snip>
>>>
>>> With that in mind I decided to post a new version of my TCG rework, with less repetition
>>> and more concise code, to have an alternative that can be used upstream to fix the
>>> Avocado tests. Meanwhile I'll see if I can get your reorg working with all the EBB tests
>>> we need. All things being equal - similar performance, all EBB tests passing - I'd rather
>>> stay with your C code than my TCG rework, since yours doesn't rely on TCG ops knowledge
>>> to maintain it.
>>
>> Reading this series did make me wonder if we need a more generic service from TCG for
>> helping with "internal" instrumentation needed for things like decent PMU emulation. We
>> haven't gone as far with this on ARM yet, but it would be nice to. It would be even nicer
>> if such a facility could be used by things like icount as well, so we don't end up doing
>> the same thing twice.
>
> Back in May 2021, when I first started working on this code, I tried to base it on the
> ARM PMU code. In fact, the cycle and insn calculation done in the very first version of
> this work was based on what ARM does in target/arm/helper.c, cycles_get_count() and
> instructions_get_count(). The cycle calculation got simplified because our PPC64 CPU has
> a 1 GHz clock, so it's easier to just consider 1 ns = 1 cycle.
>
> For instruction counting, aside from my 2-3 weeks of spectacular failures trying to count
> instructions inside translate.c, I also looked into how TCG plugins work and tried to do
> something similar to what plugin_gen_tb_end() does at the end of translator_loop() in
> accel/tcg/translator.c. For some reason I wasn't able to replicate the behavior I would
> get if I used the TCG plugin framework in the 'canonical' way.

plugin_gen_tb_end is probably overkill, because we should already know how many
instructions there are in a translated block on account of the insn_start and insn_end
ops that mark them. In fact, see gen_tb_end(), which is where icount updates the value
used in the decrement at the start of each block. Assuming no synchronous exceptions
occur, you could just increment a counter at the end of the block, as no async IRQs will
occur until we have executed all of those instructions.

Of course it's never quite so simple: when running in full icount mode we have to take
into account exceptions that can be triggered by IO accesses.
This involves doing a re-translation to ensure the IO instruction is always the last one
we execute.

I'm guessing that for PMU counters to be somewhat correct we would want to ensure updates
throughout the block (before each memory op and helper call). This would hopefully avoid
the cost of "full" icount support, which is only single-threaded. However, this is the
opposite of icount's budget and pre-decrement approach, which feels messier than it could
be.

> I ended up doing something similar to what instructions_get_count() from ARM does, which
> relies on icount. Richard then aided me in figuring out that I could count instructions
> directly by tapping into the end of each TB.

instructions_get_count will also work without icount, but in that case it is affected by
wall-clock time distortions.

> So, for a generic service of sorts, I believe it would be nice to re-use the TCG plugin
> API in the internal instrumentation (I tried it once, failed, and I'm not sure whether I
> messed up or it's not possible ATM). That would be a good start toward getting all this
> logic into generic code for internal translate code to use.

Agreed - although the plugin-specific stuff is really just focused on our limited
visibility API. Unless you are referring to accel/tcg/plugin-gen.c, which is just helpers
for manipulating the TCG ops after the initial translation.
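[Editor's sketch of the TB-end increment described above: what a target's translator
could emit when closing a block. The env field name insn_count is hypothetical, made up
for illustration, and as noted above real code must also cope with mid-block exits that
execute fewer instructions than were translated.]

static void gen_insn_count_update(DisasContext *ctx)
{
    /* Emit "env->insn_count += num_insns" once at the end of the block. */
    TCGv_i64 t = tcg_temp_new_i64();

    tcg_gen_ld_i64(t, cpu_env, offsetof(CPUPPCState, insn_count));
    tcg_gen_addi_i64(t, t, ctx->base.num_insns);
    tcg_gen_st_i64(t, cpu_env, offsetof(CPUPPCState, insn_count));
    tcg_temp_free_i64(t);
}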
On 1/4/22 07:32, Alex Bennée wrote:
> Daniel Henrique Barboza <danielhb413@gmail.com> writes:
>
>> On 1/3/22 12:07, Alex Bennée wrote:
>>> Daniel Henrique Barboza <danielhb413@gmail.com> writes:
>>>
>>>> On 12/23/21 00:01, Richard Henderson wrote:
>>>>> In contrast to Daniel's version, the code stays in power8-pmu.c,
>>>>> but is better organized to not take so much overhead.
>>>>>
>>>>> Before:
>>>>>
>>>>>   32.97%  qemu-system-ppc  qemu-system-ppc64  [.] pmc_get_event
>>>>>   20.22%  qemu-system-ppc  qemu-system-ppc64  [.] helper_insns_inc
>>>>>    4.52%  qemu-system-ppc  qemu-system-ppc64  [.] hreg_compute_hflags_value
>>>>>    3.30%  qemu-system-ppc  qemu-system-ppc64  [.] helper_lookup_tb_ptr
>>>>>    2.68%  qemu-system-ppc  qemu-system-ppc64  [.] tcg_gen_code
>>>>>    2.28%  qemu-system-ppc  qemu-system-ppc64  [.] cpu_exec
>>>>>    1.84%  qemu-system-ppc  qemu-system-ppc64  [.] pmu_insn_cnt_enabled
>>>>>
>>>>> After:
>>>>>
>>>>>    8.42%  qemu-system-ppc  qemu-system-ppc64  [.] hreg_compute_hflags_value
>>>>>    6.65%  qemu-system-ppc  qemu-system-ppc64  [.] cpu_exec
>>>>>    6.63%  qemu-system-ppc  qemu-system-ppc64  [.] helper_insns_inc
>>>>
>>>> Thanks for looking this up. I had no idea the original C code was that slow.
>>>>
>>> <snip>
>>>>
>>>> With that in mind I decided to post a new version of my TCG rework, with less repetition
>>>> and more concise code, to have an alternative that can be used upstream to fix the
>>>> Avocado tests. Meanwhile I'll see if I can get your reorg working with all the EBB tests
>>>> we need. All things being equal - similar performance, all EBB tests passing - I'd rather
>>>> stay with your C code than my TCG rework, since yours doesn't rely on TCG ops knowledge
>>>> to maintain it.
>>>
>>> Reading this series did make me wonder if we need a more generic service from TCG for
>>> helping with "internal" instrumentation needed for things like decent PMU emulation. We
>>> haven't gone as far with this on ARM yet, but it would be nice to. It would be even nicer
>>> if such a facility could be used by things like icount as well, so we don't end up doing
>>> the same thing twice.
>>
>> Back in May 2021, when I first started working on this code, I tried to base it on the
>> ARM PMU code. In fact, the cycle and insn calculation done in the very first version of
>> this work was based on what ARM does in target/arm/helper.c, cycles_get_count() and
>> instructions_get_count(). The cycle calculation got simplified because our PPC64 CPU has
>> a 1 GHz clock, so it's easier to just consider 1 ns = 1 cycle.
>>
>> For instruction counting, aside from my 2-3 weeks of spectacular failures trying to count
>> instructions inside translate.c, I also looked into how TCG plugins work and tried to do
>> something similar to what plugin_gen_tb_end() does at the end of translator_loop() in
>> accel/tcg/translator.c. For some reason I wasn't able to replicate the behavior I would
>> get if I used the TCG plugin framework in the 'canonical' way.
>
> plugin_gen_tb_end is probably overkill, because we should already know how many
> instructions there are in a translated block on account of the insn_start and insn_end
> ops that mark them. In fact, see gen_tb_end(), which is where icount updates the value
> used in the decrement at the start of each block. Assuming no synchronous exceptions
> occur, you could just increment a counter at the end of the block, as no async IRQs will
> occur until we have executed all of those instructions.
> Of course it's never quite so simple: when running in full icount mode we have to take
> into account exceptions that can be triggered by IO accesses. This involves doing a
> re-translation to ensure the IO instruction is always the last one we execute.
>
> I'm guessing that for PMU counters to be somewhat correct we would want to ensure updates
> throughout the block (before each memory op and helper call). This would hopefully avoid
> the cost of "full" icount support, which is only single-threaded. However, this is the
> opposite of icount's budget and pre-decrement approach, which feels messier than it could
> be.

What about cycle counting without icount? With icount it is a rather simple matter of
making some assumptions about the CPU freq and relying on the shift parameter to get
reasonably good precision. Without icount the cycle count, at least in the current
implementation of the ppc64 PMU, is erratic.

The problem is that, at least as far as I've read the pSeries and powernv code (guest and
bare-metal IBM Power emulation), the CPU freq is a 1 GHz value that we write into the FDT
and then do nothing else with. We do not enforce (or throttle) the CPU freq in the
emulation.

A quick look into the ARM code shows it makes similar assumptions:

static uint64_t cycles_get_count(CPUARMState *env)
{
#ifndef CONFIG_USER_ONLY
    return muldiv64(qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL),
                    ARM_CPU_FREQ, NANOSECONDS_PER_SECOND);
#else
    return cpu_get_host_ticks();
#endif
}

#ifndef CONFIG_USER_ONLY
static int64_t cycles_ns_per(uint64_t cycles)
{
    return (ARM_CPU_FREQ / NANOSECONDS_PER_SECOND) * cycles;
}

$ git grep 'ARM_CPU_FREQ'
target/arm/helper.c:#define ARM_CPU_FREQ 1000000000 /* FIXME: 1 GHz, should be configurable */
target/arm/helper.c:                    ARM_CPU_FREQ, NANOSECONDS_PER_SECOND);
target/arm/helper.c:    return (ARM_CPU_FREQ / NANOSECONDS_PER_SECOND) * cycles;

But I digress. Having a generic way of counting instructions across all the boards would
be a fine improvement. Cycle calculation can wait.

>> I ended up doing something similar to what instructions_get_count() from ARM does, which
>> relies on icount. Richard then aided me in figuring out that I could count instructions
>> directly by tapping into the end of each TB.
>
> instructions_get_count will also work without icount, but in that case it is affected by
> wall-clock time distortions.
>
>> So, for a generic service of sorts, I believe it would be nice to re-use the TCG plugin
>> API in the internal instrumentation (I tried it once, failed, and I'm not sure whether I
>> messed up or it's not possible ATM). That would be a good start toward getting all this
>> logic into generic code for internal translate code to use.
>
> Agreed - although the plugin-specific stuff is really just focused on our limited
> visibility API. Unless you are referring to accel/tcg/plugin-gen.c, which is just helpers
> for manipulating the TCG ops after the initial translation.

TCG plugins came to mind because they operate like a generic API that can be used across
multiple archs, but any way of placing generic instrumentation code where it can be used
internally everywhere would do. TCG plugins seem to be a good candidate for that, since
the infrastructure is already in place.

Thanks,

Daniel
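[Editor's sketch: the 1 GHz shortcut Daniel describes collapses the ARM-style muldiv64()
entirely. If the advertised clock is exactly 1 GHz, one nanosecond of QEMU_CLOCK_VIRTUAL
time is one cycle. The function name is illustrative, and the code assumes that 1 GHz
value from the FDT.]

#include "qemu/timer.h"

/*
 * Sketch only: valid only while the advertised CPU freq is 1 GHz,
 * so that 1 ns of virtual time == 1 cycle and no scaling is needed.
 */
static uint64_t ppc_cycles_get_count(void)
{
    return qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL);
}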