Message ID | 1496676177-29356-1-git-send-email-will.deacon@arm.com |
---|---|
Headers | show |
Series | Add support for the ARMv8.2 Statistical Profiling Extension | expand |
On Mon, Jun 05, 2017 at 04:22:52PM +0100, Will Deacon wrote: > Hi all, > > This is the sixth posting of the patches previously posted here: > > rfcv1: http://lists.infradead.org/pipermail/linux-arm-kernel/2017-January/476450.html > rfcv2: http://lists.infradead.org/pipermail/linux-arm-kernel/2017-January/479387.html > v1: http://lists.infradead.org/pipermail/linux-arm-kernel/2017-January/483684.html > v2: http://lists.infradead.org/pipermail/linux-arm-kernel/2017-April/499938.html > v3: http://lists.infradead.org/pipermail/linux-arm-kernel/2017-May/507132.html > > The main change since v3 is that I have reworked and fixed the CPU hotplug > and notifier bits, in light of review comments from tglx. > > The architecture documentation is available here: > > https://developer.arm.com/products/architecture/a-profile/docs/ddi0586/latest/arm-architecture-reference-manual-supplement-statistical-profiling-extension-for-armv8-a Kim, do you have any version of the userspace side that we could look at? For review, it would be really helpful to have something that can poke the PMU, even if it's incomplete or lacking polish. Thanks, Mark. IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.
On Mon, 12 Jun 2017 12:08:23 +0100 Mark Rutland <mark.rutland@arm.com> wrote: > On Mon, Jun 05, 2017 at 04:22:52PM +0100, Will Deacon wrote: > > This is the sixth posting of the patches previously posted here: > > > > rfcv1: http://lists.infradead.org/pipermail/linux-arm-kernel/2017-January/476450.html > > rfcv2: http://lists.infradead.org/pipermail/linux-arm-kernel/2017-January/479387.html > > v1: http://lists.infradead.org/pipermail/linux-arm-kernel/2017-January/483684.html > > v2: http://lists.infradead.org/pipermail/linux-arm-kernel/2017-April/499938.html > > v3: http://lists.infradead.org/pipermail/linux-arm-kernel/2017-May/507132.html > > > > The main change since v3 is that I have reworked and fixed the CPU hotplug > > and notifier bits, in light of review comments from tglx. > > > > The architecture documentation is available here: > > > > https://developer.arm.com/products/architecture/a-profile/docs/ddi0586/latest/arm-architecture-reference-manual-supplement-statistical-profiling-extension-for-armv8-a > > Kim, do you have any version of the userspace side that we could look > at? > > For review, it would be really helpful to have something that can poke > the PMU, even if it's incomplete or lacking polish. Here's the latest push, based on a a couple of prior versions of this driver: http://linux-arm.org/git?p=linux-kp.git;a=shortlog;h=refs/heads/armspev0.1 I don't seem to be able to get any SPE data output after rebasing on this version of the driver. Still don't know why at the moment... Kim
On Mon, 12 Jun 2017 11:20:48 -0500 Kim Phillips <kim.phillips@arm.com> wrote: > On Mon, 12 Jun 2017 12:08:23 +0100 > Mark Rutland <mark.rutland@arm.com> wrote: > > > On Mon, Jun 05, 2017 at 04:22:52PM +0100, Will Deacon wrote: > > > This is the sixth posting of the patches previously posted here: ... > > Kim, do you have any version of the userspace side that we could look > > at? > > > > For review, it would be really helpful to have something that can poke > > the PMU, even if it's incomplete or lacking polish. > > Here's the latest push, based on a a couple of prior versions of this > driver: > > http://linux-arm.org/git?p=linux-kp.git;a=shortlog;h=refs/heads/armspev0.1 > > I don't seem to be able to get any SPE data output after rebasing on > this version of the driver. Still don't know why at the moment... Bisected to commit e38ba76deef "perf tools: force uncore events to system wide monitoring". So, using record with specifying a -C <cpu> explicitly now produces SPE data, but only a couple of valid records at the beginning of each buffer; the rest is filled with PADding (0's). I see Mark's latest comments have found a possible issue in the perf aux buffer handling code in the driver, and that the driver does some memset of padding (0's) itself; could that be responsible for the above behaviour? Thanks, Kim
On Thu, Jun 15, 2017 at 10:57:35AM -0500, Kim Phillips wrote: > On Mon, 12 Jun 2017 11:20:48 -0500 > Kim Phillips <kim.phillips@arm.com> wrote: > > > On Mon, 12 Jun 2017 12:08:23 +0100 > > Mark Rutland <mark.rutland@arm.com> wrote: > > > > > On Mon, Jun 05, 2017 at 04:22:52PM +0100, Will Deacon wrote: > > > > This is the sixth posting of the patches previously posted here: > ... > > > Kim, do you have any version of the userspace side that we could look > > > at? > > > > > > For review, it would be really helpful to have something that can poke > > > the PMU, even if it's incomplete or lacking polish. > > > > Here's the latest push, based on a a couple of prior versions of this > > driver: > > > > http://linux-arm.org/git?p=linux-kp.git;a=shortlog;h=refs/heads/armspev0.1 > > > > I don't seem to be able to get any SPE data output after rebasing on > > this version of the driver. Still don't know why at the moment... > > Bisected to commit e38ba76deef "perf tools: force uncore events to > system wide monitoring". So, using record with specifying a -C > <cpu> explicitly now produces SPE data, but only a couple of valid > records at the beginning of each buffer; the rest is filled with > PADding (0's). > > I see Mark's latest comments have found a possible issue in the perf > aux buffer handling code in the driver, and that the driver does some > memset of padding (0's) itself; could that be responsible for the above > behaviour? Possibly. Do you know how big you're mapping the aux buffer and what (if any) value you're passing as aux_watermark? Will
On Wed, 21 Jun 2017 16:31:09 +0100 Will Deacon <will.deacon@arm.com> wrote: > On Thu, Jun 15, 2017 at 10:57:35AM -0500, Kim Phillips wrote: > > On Mon, 12 Jun 2017 11:20:48 -0500 > > Kim Phillips <kim.phillips@arm.com> wrote: > > > > > On Mon, 12 Jun 2017 12:08:23 +0100 > > > Mark Rutland <mark.rutland@arm.com> wrote: > > > > > > > On Mon, Jun 05, 2017 at 04:22:52PM +0100, Will Deacon wrote: > > > > > This is the sixth posting of the patches previously posted here: > > ... > > > > Kim, do you have any version of the userspace side that we could look > > > > at? > > > > > > > > For review, it would be really helpful to have something that can poke > > > > the PMU, even if it's incomplete or lacking polish. > > > > > > Here's the latest push, based on a a couple of prior versions of this > > > driver: > > > > > > http://linux-arm.org/git?p=linux-kp.git;a=shortlog;h=refs/heads/armspev0.1 > > > > > > I don't seem to be able to get any SPE data output after rebasing on > > > this version of the driver. Still don't know why at the moment... > > > > Bisected to commit e38ba76deef "perf tools: force uncore events to > > system wide monitoring". So, using record with specifying a -C > > <cpu> explicitly now produces SPE data, but only a couple of valid > > records at the beginning of each buffer; the rest is filled with > > PADding (0's). > > > > I see Mark's latest comments have found a possible issue in the perf > > aux buffer handling code in the driver, and that the driver does some > > memset of padding (0's) itself; could that be responsible for the above > > behaviour? > > Possibly. Do you know how big you're mapping the aux buffer 4MiB. > and what (if any) value you're passing as aux_watermark? None passed, but it looks like 4KiB was used since the AUXTRACE size was 4MiB - 4KiB. I'm not seeing the issue with a simple bts-based version I'm working on...yet. We can revisit if I'm able to reproduce again; the problem could have been on the userspace side. Meanwhile, when using fvp-base.dtb, my model setup stops booting the kernel after "smp: Bringing up secondary CPUs ...". If I however take the second SPE node from fvp-base.dts and add it to my working device tree, I get this during the driver probe: [ 1.042063] arm_spe_pmu spe-pmu@0: probed for CPUs 0-7 [max_record_sz 64, align 1, features 0xf] [ 1.043582] arm_spe_pmu spe-pmu@1: probed for CPUs 0-7 [max_record_sz 64, align 1, features 0xf] [ 1.043631] genirq: Flags mismatch irq 6. 00004404 (arm_spe_pmu) vs. 00004404 (arm_spe_pmu) [ 1.043784] arm_spe_pmu: probe of spe-pmu@1 failed with error -16 spe-pmu@0 is useable, but doubt spe-pmu@1 is. btw, that 16 is EBUSY "Device or resource busy". Kim
On Thu, Jun 22, 2017 at 10:56:40AM -0500, Kim Phillips wrote: > On Wed, 21 Jun 2017 16:31:09 +0100 > Will Deacon <will.deacon@arm.com> wrote: > > > On Thu, Jun 15, 2017 at 10:57:35AM -0500, Kim Phillips wrote: > > > On Mon, 12 Jun 2017 11:20:48 -0500 > > > Kim Phillips <kim.phillips@arm.com> wrote: > > > > > > > On Mon, 12 Jun 2017 12:08:23 +0100 > > > > Mark Rutland <mark.rutland@arm.com> wrote: > > > > > > > > > On Mon, Jun 05, 2017 at 04:22:52PM +0100, Will Deacon wrote: > > > > > > This is the sixth posting of the patches previously posted here: > > > ... > > > > > Kim, do you have any version of the userspace side that we could look > > > > > at? > > > > > > > > > > For review, it would be really helpful to have something that can poke > > > > > the PMU, even if it's incomplete or lacking polish. > > > > > > > > Here's the latest push, based on a a couple of prior versions of this > > > > driver: > > > > > > > > http://linux-arm.org/git?p=linux-kp.git;a=shortlog;h=refs/heads/armspev0.1 > > > > > > > > I don't seem to be able to get any SPE data output after rebasing on > > > > this version of the driver. Still don't know why at the moment... > > > > > > Bisected to commit e38ba76deef "perf tools: force uncore events to > > > system wide monitoring". So, using record with specifying a -C > > > <cpu> explicitly now produces SPE data, but only a couple of valid > > > records at the beginning of each buffer; the rest is filled with > > > PADding (0's). > > > > > > I see Mark's latest comments have found a possible issue in the perf > > > aux buffer handling code in the driver, and that the driver does some > > > memset of padding (0's) itself; could that be responsible for the above > > > behaviour? > > > > Possibly. Do you know how big you're mapping the aux buffer > > 4MiB. > > > and what (if any) value you're passing as aux_watermark? > > None passed, but it looks like 4KiB was used since the AUXTRACE size > was 4MiB - 4KiB. > > I'm not seeing the issue with a simple bts-based version I'm > working on...yet. We can revisit if I'm able to reproduce again; the > problem could have been on the userspace side. > > Meanwhile, when using fvp-base.dtb, my model setup stops booting the > kernel after "smp: Bringing up secondary CPUs ...". If I however take > the second SPE node from fvp-base.dts and add it to my working device > tree, I get this during the driver probe: > > [ 1.042063] arm_spe_pmu spe-pmu@0: probed for CPUs 0-7 [max_record_sz 64, align 1, features 0xf] > [ 1.043582] arm_spe_pmu spe-pmu@1: probed for CPUs 0-7 [max_record_sz 64, align 1, features 0xf] > [ 1.043631] genirq: Flags mismatch irq 6. 00004404 (arm_spe_pmu) vs. 00004404 (arm_spe_pmu) Looks like you've screwed up your IRQ partitions, so you are effectively registering the same device twice, which then blows up due to lack of shared irqs. Either remove one of the devices, or use IRQ partitions to restrict them to unique sets of CPUs. Will
On Thu, 22 Jun 2017 19:36:20 +0100 Will Deacon <will.deacon@arm.com> wrote: > On Thu, Jun 22, 2017 at 10:56:40AM -0500, Kim Phillips wrote: > > On Wed, 21 Jun 2017 16:31:09 +0100 > > Will Deacon <will.deacon@arm.com> wrote: > > > > > On Thu, Jun 15, 2017 at 10:57:35AM -0500, Kim Phillips wrote: > > > > On Mon, 12 Jun 2017 11:20:48 -0500 > > > > Kim Phillips <kim.phillips@arm.com> wrote: > > > > > > > > > On Mon, 12 Jun 2017 12:08:23 +0100 > > > > > Mark Rutland <mark.rutland@arm.com> wrote: > > > > > > > > > > > On Mon, Jun 05, 2017 at 04:22:52PM +0100, Will Deacon wrote: > > > > > > > This is the sixth posting of the patches previously posted here: > > > > ... > > > > > > Kim, do you have any version of the userspace side that we could look > > > > > > at? > > > > > > > > > > > > For review, it would be really helpful to have something that can poke > > > > > > the PMU, even if it's incomplete or lacking polish. > > > > > > > > > > Here's the latest push, based on a a couple of prior versions of this > > > > > driver: > > > > > > > > > > http://linux-arm.org/git?p=linux-kp.git;a=shortlog;h=refs/heads/armspev0.1 > > > > > > > > > > I don't seem to be able to get any SPE data output after rebasing on > > > > > this version of the driver. Still don't know why at the moment... > > > > > > > > Bisected to commit e38ba76deef "perf tools: force uncore events to > > > > system wide monitoring". So, using record with specifying a -C > > > > <cpu> explicitly now produces SPE data, but only a couple of valid > > > > records at the beginning of each buffer; the rest is filled with > > > > PADding (0's). > > > > > > > > I see Mark's latest comments have found a possible issue in the perf > > > > aux buffer handling code in the driver, and that the driver does some > > > > memset of padding (0's) itself; could that be responsible for the above > > > > behaviour? > > > > > > Possibly. Do you know how big you're mapping the aux buffer > > > > 4MiB. > > > > > and what (if any) value you're passing as aux_watermark? > > > > None passed, but it looks like 4KiB was used since the AUXTRACE size > > was 4MiB - 4KiB. > > > > I'm not seeing the issue with a simple bts-based version I'm > > working on...yet. We can revisit if I'm able to reproduce again; the > > problem could have been on the userspace side. I'm close to finishing the bts version of userspace, and have been testing a bit more thoroughly, so now I consistently see the excessive PADding when recording a CPU that's idle. I.e., when I taskset the perf record to the same CPU I specify to record's -C (taskset -c n perf record -C n), I get max. twenty-odd number of PAD bytes at the end of the AUX buffers in the perf.data file. If, OTOH, I taskset -c n perf record -C m, where m != n, I get a couple of valid event records in the buffer, and the rest of the buffer is filled with PADding. It wouldn't be a problem except that it's wastes too much space sometimes. Here is a good output buffer sample from a --mmap-pages=,12 run, with only 4 PADs tacked onto the end: 0xd190 [0x30]: PERF_RECORD_AUXTRACE size: 0x48 offset: 0 ref: 0xe914f7e3ce idx: 0 tid: -1 cpu: 2 . . ... ARM SPE data: size 72 bytes . 00000000: 4a 01 B COND . 00000002: b1 00 00 00 00 00 00 00 c0 TGT 0 el2 ns=1 . 0000000b: 42 42 RETIRED NOT-TAKEN . 0000000d: b0 f4 4e 10 08 00 00 ff ff PC ff000008104ef4 el3 ns=1 . 00000016: 98 00 00 LAT 0 TOT . 00000019: 71 a5 39 e1 14 e9 00 00 00 TS 1001077684645 . 00000022: 4a 02 B IND . 00000024: b1 54 51 11 08 00 00 ff ff TGT ff000008115154 el3 ns=1 . 0000002d: 42 02 RETIRED . 0000002f: b0 68 36 11 08 00 00 ff ff PC ff000008113668 el3 ns=1 . 00000038: 98 00 00 LAT 0 TOT . 0000003b: 71 a5 39 e1 14 e9 00 00 00 TS 1001077684645 . 00000044: 00 PAD . 00000045: 00 PAD . 00000046: 00 PAD . 00000047: 00 PAD whereas this one - from later on in the same run - is over 99% PADs: 0xd250 [0x30]: PERF_RECORD_AUXTRACE size: 0x5fc0 offset: 0xfffff4ae0044 ref: 0xe91cead1dd idx: 0 tid: -1 cpu: 2 . . ... ARM SPE data: size 24512 bytes . 00000000: 4a 00 B . 00000002: b1 cc 4e 10 08 00 00 ff ff TGT ff000008104ecc el3 ns=1 . 0000000b: 42 02 RETIRED . 0000000d: b0 90 4e 10 08 00 00 ff ff PC ff000008104e90 el3 ns=1 . 00000016: 98 00 00 LAT 0 TOT . 00000019: 71 a5 39 e1 14 e9 00 00 00 TS 1001077684645 . 00000022: 49 01 ST . 00000024: b2 e0 2e f5 7d 00 80 ff ff VA ffff80007df52ee0 . 0000002d: b3 e0 2e f5 fd 00 00 00 80 PA fdf52ee0 ns=1 . 00000036: 9a 00 00 LAT 0 XLAT . 00000039: 42 16 RETIRED L1D-ACCESS TLB-ACCESS . 0000003b: b0 e8 41 39 08 00 00 ff ff PC ff0000083941e8 el3 ns=1 . 00000044: 98 00 00 LAT 0 TOT . 00000047: 71 a5 39 e1 14 e9 00 00 00 TS 1001077684645 . 00000050: 4a 00 B . 00000052: b1 58 f2 0f 08 00 00 ff ff TGT ff0000080ff258 el3 ns=1 . 0000005b: 42 02 RETIRED . 0000005d: b0 90 de 0d 08 00 00 ff ff PC ff0000080dde90 el3 ns=1 . 00000066: 98 00 00 LAT 0 TOT . 00000069: 71 8f 4e e1 14 e9 00 00 00 TS 1001077689999 . 00000072: 48 00 INSN-OTHER . 00000074: 42 02 RETIRED . 00000076: b0 f8 16 61 08 00 00 ff ff PC ff0000086116f8 el3 ns=1 . 0000007f: 98 00 00 LAT 0 TOT . 00000082: 71 8f 4e e1 14 e9 00 00 00 TS 1001077689999 . 0000008b: 49 00 LD . 0000008d: b2 10 34 ba 7b 00 80 ff ff VA ffff80007bba3410 . 00000096: b3 10 34 ba fb 00 00 00 80 PA fbba3410 ns=1 . 0000009f: 9a 00 00 LAT 0 XLAT . 000000a2: 42 16 RETIRED L1D-ACCESS TLB-ACCESS . 000000a4: b0 8c be 7a 08 00 00 ff ff PC ff0000087abe8c el3 ns=1 . 000000ad: 98 00 00 LAT 0 TOT . 000000b0: 71 8f 4e e1 14 e9 00 00 00 TS 1001077689999 . 000000b9: 00 PAD ...ALL PADs...ALL PADs...ALL PADs...ALL PADs...ALL PADs...ALL PADs... . 00005fbf: 00 PAD So maybe there's an offset counter that isn't being reset properly; hopefully the parallel discussion with Mark will help find the problem. FWIW, there is also this one I saw with mmap-pages set to 5 (pages), which gets rounded up to 8 pages: the driver started memsetting places it shouldn't?: $ sudo ./perf record -c 512 -C 0 -e arm_spe/branch_filter=0,ts_enable=1,pa_enable=1,event_filter=0,load_filter=0,jitter=1,store_filter=0,min_latency=0/ --mmap-pages=,5 dd if=/dev/urandom of=/dev/null count=10000 rounding mmap pages size to 32K (8 pages) 10000+0 records in 10000+0 records out 5120000 bytes (5.1 MB) copied, 1.3391 s, 3.8 MB/s [ 1885.042803] Unable to handle kernel paging request at virtual address ffff00000adac000 [ 1885.042873] pgd = ffff00000ad48000 [ 1885.042899] [ffff00000adac000] *pgd=00000000fdffe003, *pud=00000000fdffd003, *pmd=00000000fdff8003, *pte=0000000000000000 [ 1885.043083] Internal error: Oops: 96000047 [#1] PREEMPT SMP [ 1885.043131] Modules linked in: [ 1885.043200] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G W 4.12.0-rc5-00039-gf1d4a187881e #34 [ 1885.043299] Hardware name: FVP_Base_AEMv8A-AEMv8A (DT) [ 1885.043364] task: ffff000008c21a80 task.stack: ffff000008c10000 [ 1885.043436] PC is at __memset+0x1ac/0x1d0 [ 1885.043499] LR is at __arm_spe_pmu_next_off+0x6c/0xd8 [ 1885.043600] pc : [<ffff00000837dbac>] lr : [<ffff0000086771e4>] pstate: 204001c9 [ 1885.043600] sp : ffff80007df22d10 [ 1885.043733] x29: ffff80007df22d10 x28: ffff000008c21a80 [ 1885.043819] x27: ffff000008c37768 x26: ffff80007df30280 [ 1885.043910] x25: ffff80007a109800 x24: 0000001d507d1906 [ 1885.044012] x23: ffff80007a601018 x22: ffff80007a3ebb00 [ 1885.044102] x21: ffff80007df36ab0 x20: ffff80007a3ebb00 [ 1885.044196] x19: ffff80007df36ab0 x18: 0000000000000000 [ 1885.044253] x17: 0000000000000000 x16: 0000000000000000 [ 1885.044339] x15: 0000000000000000 x14: ffff000008c21a80 [ 1885.044456] x13: 000080007532d000 x12: 000000003445d91d [ 1885.044557] x11: 0000000000000000 x10: 0000000000001000 [ 1885.044600] x9 : 0000000000000000 x8 : ffff00000adac000 [ 1885.044729] x7 : 0000000000000000 x6 : 00000000000003ff [ 1885.044800] x5 : 0000000000000400 x4 : 0000000000000000 [ 1885.044911] x3 : 0000000000000008 x2 : 0000000000003bff [ 1885.045000] x1 : 0000000000000000 x0 : ffff00000ada8000 [ 1885.045100] Process swapper/0 (pid: 0, stack limit = 0xffff000008c10000) [ 1885.045179] Stack: (0xffff80007df22d10 to 0xffff000008c14000) [ 1885.045239] Call trace: [ 1885.045300] Exception stack(0xffff80007df22b40 to 0xffff80007df22c70) [ 1885.045400] 2b40: ffff80007df36ab0 0001000000000000 ffff80007df22d10 ffff00000837dbac [ 1885.045505] 2b60: 0000000000000000 0000000000000000 ffff80007bb4b520 ffff00000837eac0 [ 1885.045605] 2b80: ffff80007df22d10 ffff0000080d6b58 0000000100060b21 ffff80007bb4afa8 [ 1885.045712] 2ba0: ffff80007bb4af20 ffff80007bb4af00 0000000000000000 ffff000008c19f04 [ 1885.045815] 2bc0: ffff000008bff000 ffff000008c17000 ffff80007bb53f00 000000000002fe89 [ 1885.045916] 2be0: ffff00000ada8000 0000000000000000 0000000000003bff 0000000000000008 [ 1885.046013] 2c00: 0000000000000000 0000000000000400 00000000000003ff 0000000000000000 [ 1885.046126] 2c20: ffff00000adac000 0000000000000000 0000000000001000 0000000000000000 [ 1885.046224] 2c40: 000000003445d91d 000080007532d000 ffff000008c21a80 0000000000000000 [ 1885.046326] 2c60: 0000000000000000 0000000000000000 [ 1885.046408] [<ffff00000837dbac>] __memset+0x1ac/0x1d0 [ 1885.046499] [<ffff00000867729c>] arm_spe_perf_aux_output_begin+0x4c/0x1b8 [ 1885.046599] [<ffff000008678424>] arm_spe_pmu_start+0x34/0xf0 [ 1885.046695] [<ffff000008678548>] arm_spe_pmu_add+0x68/0x98 [ 1885.046733] [<ffff00000814da54>] event_sched_in.isra.61+0xcc/0x218 [ 1885.046879] [<ffff00000814dc08>] group_sched_in+0x68/0x1a0 [ 1885.046981] [<ffff00000814dfd0>] ctx_sched_in+0x290/0x468 [ 1885.047080] [<ffff00000814e23c>] perf_event_sched_in+0x94/0xa8 [ 1885.047179] [<ffff00000814e2b4>] ctx_resched+0x64/0xb0 [ 1885.047268] [<ffff00000814e500>] __perf_event_enable+0x200/0x238 [ 1885.047366] [<ffff000008147118>] event_function+0x90/0xf0 [ 1885.047452] [<ffff0000081499e8>] remote_function+0x60/0x70 [ 1885.047514] [<ffff0000081194fc>] flush_smp_call_function_queue+0x9c/0x168 [ 1885.047637] [<ffff00000811a2a0>] generic_smp_call_function_single_interrupt+0x10/0x18 [ 1885.047733] [<ffff00000808e928>] handle_IPI+0xc0/0x1d8 [ 1885.047799] [<ffff000008081700>] gic_handle_irq+0x80/0x178 [ 1885.047799] Exception stack(0xffff000008c13d80 to 0xffff000008c13eb0) [ 1885.047984] 3d80: 0000000000000000 ffff000008c21a80 00000000000003e8 ffff000008651430 [ 1885.048087] 3da0: 000000001999999a 0000000000000020 0000002bedec501b 0000000000000000 [ 1885.048190] 3dc0: 000001b2b5103510 ffff000008081800 0000000000001000 0000000000000000 [ 1885.048300] 3de0: 000000003445d91d 000080007532d000 ffff000008c21a80 0000000000000000 [ 1885.048400] 3e00: 0000000000000000 0000000000000000 0000000000000000 000001b6e54dc796 [ 1885.048505] 3e20: 0000000000000002 ffff80007a983c00 0000000000000002 ffff000008cdc130 [ 1885.048600] 3e40: 000001b6e5424132 ffff000008c21a80 0000000000000000 00000000fef7c684 [ 1885.048716] 3e60: 0000000080b10018 ffff000008c13eb0 ffff00000861014c ffff000008c13eb0 [ 1885.048817] 3e80: ffff00000861018c 0000000040c00149 00000000fef7c684 0000000000000002 [ 1885.048900] 3ea0: ffffffffffffffff ffff00000861014c [ 1885.048900] [<ffff0000080827f4>] el1_irq+0xb4/0x128 [ 1885.049009] [<ffff00000861018c>] cpuidle_enter_state+0x154/0x200 [ 1885.049126] [<ffff000008610270>] cpuidle_enter+0x18/0x20 [ 1885.049207] [<ffff0000080ddd08>] call_cpuidle+0x18/0x30 [ 1885.049332] [<ffff0000080ddf44>] do_idle+0x19c/0x1d8 [ 1885.049400] [<ffff0000080de114>] cpu_startup_entry+0x24/0x28 [ 1885.049453] [<ffff0000087a6b30>] rest_init+0x80/0x90 [ 1885.049500] [<ffff000008b10b3c>] start_kernel+0x374/0x388 [ 1885.049617] [<ffff000008b101e0>] __primary_switched+0x64/0x6c [ 1885.049699] Code: 91010108 54ffff4a 8b040108 cb050042 (d50b7428) [ 1885.049800] ---[ end trace 31b9a9f27da95f58 ]--- [ 1885.049900] Kernel panic - not syncing: Fatal exception in interrupt [ 1885.050000] SMP: stopping secondary CPUs [ 1885.050204] Kernel Offset: disabled [ 1885.050240] Memory Limit: none [ 1885.050327] ---[ end Kernel panic - not syncing: Fatal exception in interrupt It's not easily reproduced. > > Meanwhile, when using fvp-base.dtb, my model setup stops booting the > > kernel after "smp: Bringing up secondary CPUs ...". If I however take > > the second SPE node from fvp-base.dts and add it to my working device > > tree, I get this during the driver probe: > > > > [ 1.042063] arm_spe_pmu spe-pmu@0: probed for CPUs 0-7 [max_record_sz 64, align 1, features 0xf] > > [ 1.043582] arm_spe_pmu spe-pmu@1: probed for CPUs 0-7 [max_record_sz 64, align 1, features 0xf] > > [ 1.043631] genirq: Flags mismatch irq 6. 00004404 (arm_spe_pmu) vs. 00004404 (arm_spe_pmu) > > Looks like you've screwed up your IRQ partitions, so you are effectively > registering the same device twice, which then blows up due to lack of shared > irqs. > > Either remove one of the devices, or use IRQ partitions to restrict them > to unique sets of CPUs. Right, but since I want to get parity with what you're running - fvp_base.dtb - I tried to debug the hang after "smp: Bringing up secondary CPUs ..." problem, and could only debug it to the PSCI driver hitting one of these cases: case PSCI_RET_INVALID_PARAMS: case PSCI_RET_INVALID_ADDRESS: Note: it's yet another place I have to manually instrument the error path in a kernel driver in lieu of it being more naturally verbose by itself; I *implore* you to reconsider adding proper user messaging to arm_spe_pmu_event_init(). I can't tell which part of the fvp-base device tree is not liked by the firmware; I tried different combinations of the PSCI node, different CPU enumerations (cpu@100 vs cpu@1), removing idle-states properties...any hints appreciated. Thanks, Kim
On Tue, Jun 27, 2017 at 04:07:58PM -0500, Kim Phillips wrote: > I'm close to finishing the bts version of userspace, and have been > testing a bit more thoroughly, so now I consistently see the excessive > PADding when recording a CPU that's idle. I.e., when I taskset the perf > record to the same CPU I specify to record's -C (taskset -c n perf > record -C n), I get max. twenty-odd number of PAD bytes at the end of > the AUX buffers in the perf.data file. If, OTOH, I taskset -c n perf > record -C m, where m != n, I get a couple of valid event records in the > buffer, and the rest of the buffer is filled with PADding. > > It wouldn't be a problem except that it's wastes too much space > sometimes. Here is a good output buffer sample from a --mmap-pages=,12 > run, with only 4 PADs tacked onto the end: > > 0xd190 [0x30]: PERF_RECORD_AUXTRACE size: 0x48 offset: 0 ref: 0xe914f7e3ce idx: 0 tid: -1 cpu: 2 > . > . ... ARM SPE data: size 72 bytes > . 00000000: 4a 01 B COND [...] > . 0000003b: 71 a5 39 e1 14 e9 00 00 00 TS 1001077684645 > . 00000044: 00 PAD > . 00000045: 00 PAD > . 00000046: 00 PAD > . 00000047: 00 PAD > > whereas this one - from later on in the same run - is over 99% PADs: > > 0xd250 [0x30]: PERF_RECORD_AUXTRACE size: 0x5fc0 offset: 0xfffff4ae0044 ref: 0xe91cead1dd idx: 0 tid: -1 cpu: 2 > . > . ... ARM SPE data: size 24512 bytes > . 00000000: 4a 00 B [...] > . 000000b0: 71 8f 4e e1 14 e9 00 00 00 TS 1001077689999 > . 000000b9: 00 PAD > ...ALL PADs...ALL PADs...ALL PADs...ALL PADs...ALL PADs...ALL PADs... > . 00005fbf: 00 PAD Interesting. If you cat /proc/interrupts, do you see many more SPE interrupts on CPU n than on m? Otherwise, I wonder if this is some odd interaction with idle. Can you try to forcefully load that other CPU? e.g. run something like: taskset -c <n> sh -c 'while true; do done' ... in parallel with the tracer. For reference, what was your event sample period (i.e. the value of perf_event_attr::sample_period)? Did you modify that at all with PERF_EVENT_IOC_PERIOD? > So maybe there's an offset counter that isn't being reset properly; > hopefully the parallel discussion with Mark will help find the problem. > > FWIW, there is also this one I saw with mmap-pages set to 5 > (pages), which gets rounded up to 8 pages: Sorry, *what* does the rounding upwards? Userspace, perf core, or the driver? Where? > the driver started memsetting places it shouldn't?: > > $ sudo ./perf record -c 512 -C 0 -e arm_spe/branch_filter=0,ts_enable=1,pa_enable=1,event_filter=0,load_filter=0,jitter=1,store_filter=0,min_latency=0/ --mmap-pages=,5 dd if=/dev/urandom of=/dev/null count=10000 > rounding mmap pages size to 32K (8 pages) > 10000+0 records in > 10000+0 records out > 5120000 bytes (5.1 MB) copied, 1.3391 s, 3.8 MB/s > [ 1885.042803] Unable to handle kernel paging request at virtual address ffff00000adac000 > [ 1885.042873] pgd = ffff00000ad48000 > [ 1885.042899] [ffff00000adac000] *pgd=00000000fdffe003, *pud=00000000fdffd003, *pmd=00000000fdff8003, *pte=0000000000000000 > [ 1885.043083] Internal error: Oops: 96000047 [#1] PREEMPT SMP > [ 1885.043131] Modules linked in: > [ 1885.043200] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G W 4.12.0-rc5-00039-gf1d4a187881e #34 > [ 1885.043299] Hardware name: FVP_Base_AEMv8A-AEMv8A (DT) > [ 1885.043364] task: ffff000008c21a80 task.stack: ffff000008c10000 > [ 1885.043436] PC is at __memset+0x1ac/0x1d0 > [ 1885.043499] LR is at __arm_spe_pmu_next_off+0x6c/0xd8 > [ 1885.043600] pc : [<ffff00000837dbac>] lr : [<ffff0000086771e4>] pstate: 204001c9 > [ 1885.043600] sp : ffff80007df22d10 > [ 1885.043733] x29: ffff80007df22d10 x28: ffff000008c21a80 > [ 1885.043819] x27: ffff000008c37768 x26: ffff80007df30280 > [ 1885.043910] x25: ffff80007a109800 x24: 0000001d507d1906 > [ 1885.044012] x23: ffff80007a601018 x22: ffff80007a3ebb00 > [ 1885.044102] x21: ffff80007df36ab0 x20: ffff80007a3ebb00 > [ 1885.044196] x19: ffff80007df36ab0 x18: 0000000000000000 > [ 1885.044253] x17: 0000000000000000 x16: 0000000000000000 > [ 1885.044339] x15: 0000000000000000 x14: ffff000008c21a80 > [ 1885.044456] x13: 000080007532d000 x12: 000000003445d91d > [ 1885.044557] x11: 0000000000000000 x10: 0000000000001000 > [ 1885.044600] x9 : 0000000000000000 x8 : ffff00000adac000 > [ 1885.044729] x7 : 0000000000000000 x6 : 00000000000003ff > [ 1885.044800] x5 : 0000000000000400 x4 : 0000000000000000 > [ 1885.044911] x3 : 0000000000000008 x2 : 0000000000003bff > [ 1885.045000] x1 : 0000000000000000 x0 : ffff00000ada8000 > [ 1885.045100] Process swapper/0 (pid: 0, stack limit = 0xffff000008c10000) > [ 1885.045179] Stack: (0xffff80007df22d10 to 0xffff000008c14000) > [ 1885.045239] Call trace: > [ 1885.046408] [<ffff00000837dbac>] __memset+0x1ac/0x1d0 > [ 1885.046499] [<ffff00000867729c>] arm_spe_perf_aux_output_begin+0x4c/0x1b8 > [ 1885.046599] [<ffff000008678424>] arm_spe_pmu_start+0x34/0xf0 > [ 1885.046695] [<ffff000008678548>] arm_spe_pmu_add+0x68/0x98 > [ 1885.046733] [<ffff00000814da54>] event_sched_in.isra.61+0xcc/0x218 > [ 1885.046879] [<ffff00000814dc08>] group_sched_in+0x68/0x1a0 > [ 1885.046981] [<ffff00000814dfd0>] ctx_sched_in+0x290/0x468 > [ 1885.047080] [<ffff00000814e23c>] perf_event_sched_in+0x94/0xa8 > [ 1885.047179] [<ffff00000814e2b4>] ctx_resched+0x64/0xb0 > [ 1885.047268] [<ffff00000814e500>] __perf_event_enable+0x200/0x238 > [ 1885.047366] [<ffff000008147118>] event_function+0x90/0xf0 > [ 1885.047452] [<ffff0000081499e8>] remote_function+0x60/0x70 > [ 1885.047514] [<ffff0000081194fc>] flush_smp_call_function_queue+0x9c/0x168 > [ 1885.047637] [<ffff00000811a2a0>] generic_smp_call_function_single_interrupt+0x10/0x18 > [ 1885.047733] [<ffff00000808e928>] handle_IPI+0xc0/0x1d8 > [ 1885.047799] [<ffff000008081700>] gic_handle_irq+0x80/0x178 > [ 1885.048900] [<ffff0000080827f4>] el1_irq+0xb4/0x128 > [ 1885.049009] [<ffff00000861018c>] cpuidle_enter_state+0x154/0x200 > [ 1885.049126] [<ffff000008610270>] cpuidle_enter+0x18/0x20 > [ 1885.049207] [<ffff0000080ddd08>] call_cpuidle+0x18/0x30 > [ 1885.049332] [<ffff0000080ddf44>] do_idle+0x19c/0x1d8 > [ 1885.049400] [<ffff0000080de114>] cpu_startup_entry+0x24/0x28 > [ 1885.049453] [<ffff0000087a6b30>] rest_init+0x80/0x90 > [ 1885.049500] [<ffff000008b10b3c>] start_kernel+0x374/0x388 > [ 1885.049617] [<ffff000008b101e0>] __primary_switched+0x64/0x6c > [ 1885.049699] Code: 91010108 54ffff4a 8b040108 cb050042 (d50b7428) > [ 1885.049800] ---[ end trace 31b9a9f27da95f58 ]--- > [ 1885.049900] Kernel panic - not syncing: Fatal exception in interrupt > [ 1885.050000] SMP: stopping secondary CPUs > [ 1885.050204] Kernel Offset: disabled > [ 1885.050240] Memory Limit: none > [ 1885.050327] ---[ end Kernel panic - not syncing: Fatal exception in interrupt That's worrying. I'll see if I can reproduce this. > It's not easily reproduced. > > > > Meanwhile, when using fvp-base.dtb, my model setup stops booting the > > > kernel after "smp: Bringing up secondary CPUs ...". If I however take > > > the second SPE node from fvp-base.dts and add it to my working device > > > tree, I get this during the driver probe: > > > > > > [ 1.042063] arm_spe_pmu spe-pmu@0: probed for CPUs 0-7 [max_record_sz 64, align 1, features 0xf] > > > [ 1.043582] arm_spe_pmu spe-pmu@1: probed for CPUs 0-7 [max_record_sz 64, align 1, features 0xf] > > > [ 1.043631] genirq: Flags mismatch irq 6. 00004404 (arm_spe_pmu) vs. 00004404 (arm_spe_pmu) > > > > Looks like you've screwed up your IRQ partitions, so you are effectively > > registering the same device twice, which then blows up due to lack of shared > > irqs. > > > > Either remove one of the devices, or use IRQ partitions to restrict them > > to unique sets of CPUs. > > Right, but since I want to get parity with what you're running - > fvp_base.dtb - I tried to debug the hang after "smp: Bringing up > secondary CPUs ..." problem, and could only debug it to the PSCI driver > hitting one of these cases: > > case PSCI_RET_INVALID_PARAMS: > case PSCI_RET_INVALID_ADDRESS: Sounds like your DT is describing CPUs that don't exist (or perhaps the same CPU several times). Certainly, PSCI and the kernel disagree on which CPUS exist. What exact DT are you using? Are you using the bootwrapper, or ATF? I'm guessing you're using the bootwrapper. Which version of the bootwrapepr are you using? If it doesn't have commit: ccdc936924b3682d ("Dynamically determine the set of CPUs") ... have you configured it appropriately with --with-cpu-ids? How is your model configured? Which CPU IDs does it think exist? > Note: it's yet another place I have to manually instrument the error > path in a kernel driver in lieu of it being more naturally verbose by > itself; I *implore* you to reconsider adding proper user messaging to > arm_spe_pmu_event_init(). Given this is a FW configuration issue (i.e. a system-level error), I'm more than happy to make the PSCI driver messages more helpful where possible. That's completely orthogonal to the SPE debug messages for requests made by the user. > I can't tell which part of the fvp-base device tree is not liked by the > firmware; I tried different combinations of the PSCI node, different CPU > enumerations (cpu@100 vs cpu@1), removing idle-states properties...any > hints appreciated. The bootwrapper doesn't support idle. So no idle-states should be in the DT. If you can share your DT, bootwrapper configuration, and model configuration, I can try to debug this with you. Thanks, Mark.
On Wed, Jun 28, 2017 at 12:26:02PM +0100, Mark Rutland wrote: > On Tue, Jun 27, 2017 at 04:07:58PM -0500, Kim Phillips wrote: > > FWIW, there is also this one I saw with mmap-pages set to 5 > > (pages), which gets rounded up to 8 pages: > > Sorry, *what* does the rounding upwards? Userspace, perf core, or the > driver? Where? > > > the driver started memsetting places it shouldn't?: > > > > $ sudo ./perf record -c 512 -C 0 -e arm_spe/branch_filter=0,ts_enable=1,pa_enable=1,event_filter=0,load_filter=0,jitter=1,store_filter=0,min_latency=0/ --mmap-pages=,5 dd if=/dev/urandom of=/dev/null count=10000 > > rounding mmap pages size to 32K (8 pages) > > 10000+0 records in > > 10000+0 records out > > 5120000 bytes (5.1 MB) copied, 1.3391 s, 3.8 MB/s > > [ 1885.042803] Unable to handle kernel paging request at virtual address ffff00000adac000 > > [ 1885.042873] pgd = ffff00000ad48000 > > [ 1885.042899] [ffff00000adac000] *pgd=00000000fdffe003, *pud=00000000fdffd003, *pmd=00000000fdff8003, *pte=0000000000000000 > > [ 1885.043083] Internal error: Oops: 96000047 [#1] PREEMPT SMP > > [ 1885.043131] Modules linked in: > > [ 1885.043200] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G W 4.12.0-rc5-00039-gf1d4a187881e #34 > > [ 1885.043299] Hardware name: FVP_Base_AEMv8A-AEMv8A (DT) > > [ 1885.043364] task: ffff000008c21a80 task.stack: ffff000008c10000 > > [ 1885.043436] PC is at __memset+0x1ac/0x1d0 > > [ 1885.043499] LR is at __arm_spe_pmu_next_off+0x6c/0xd8 > > [ 1885.043600] pc : [<ffff00000837dbac>] lr : [<ffff0000086771e4>] pstate: 204001c9 > > [ 1885.043600] sp : ffff80007df22d10 > > [ 1885.043733] x29: ffff80007df22d10 x28: ffff000008c21a80 > > [ 1885.043819] x27: ffff000008c37768 x26: ffff80007df30280 > > [ 1885.043910] x25: ffff80007a109800 x24: 0000001d507d1906 > > [ 1885.044012] x23: ffff80007a601018 x22: ffff80007a3ebb00 > > [ 1885.044102] x21: ffff80007df36ab0 x20: ffff80007a3ebb00 > > [ 1885.044196] x19: ffff80007df36ab0 x18: 0000000000000000 > > [ 1885.044253] x17: 0000000000000000 x16: 0000000000000000 > > [ 1885.044339] x15: 0000000000000000 x14: ffff000008c21a80 > > [ 1885.044456] x13: 000080007532d000 x12: 000000003445d91d > > [ 1885.044557] x11: 0000000000000000 x10: 0000000000001000 > > [ 1885.044600] x9 : 0000000000000000 x8 : ffff00000adac000 > > [ 1885.044729] x7 : 0000000000000000 x6 : 00000000000003ff > > [ 1885.044800] x5 : 0000000000000400 x4 : 0000000000000000 > > [ 1885.044911] x3 : 0000000000000008 x2 : 0000000000003bff > > [ 1885.045000] x1 : 0000000000000000 x0 : ffff00000ada8000 > > [ 1885.045100] Process swapper/0 (pid: 0, stack limit = 0xffff000008c10000) > > [ 1885.045179] Stack: (0xffff80007df22d10 to 0xffff000008c14000) > > [ 1885.045239] Call trace: > > > [ 1885.046408] [<ffff00000837dbac>] __memset+0x1ac/0x1d0 > > [ 1885.046499] [<ffff00000867729c>] arm_spe_perf_aux_output_begin+0x4c/0x1b8 > > [ 1885.046599] [<ffff000008678424>] arm_spe_pmu_start+0x34/0xf0 > > [ 1885.046695] [<ffff000008678548>] arm_spe_pmu_add+0x68/0x98 > > [ 1885.046733] [<ffff00000814da54>] event_sched_in.isra.61+0xcc/0x218 > > [ 1885.046879] [<ffff00000814dc08>] group_sched_in+0x68/0x1a0 > > [ 1885.046981] [<ffff00000814dfd0>] ctx_sched_in+0x290/0x468 > > [ 1885.047080] [<ffff00000814e23c>] perf_event_sched_in+0x94/0xa8 > > [ 1885.047179] [<ffff00000814e2b4>] ctx_resched+0x64/0xb0 > > [ 1885.047268] [<ffff00000814e500>] __perf_event_enable+0x200/0x238 > > [ 1885.047366] [<ffff000008147118>] event_function+0x90/0xf0 > > [ 1885.047452] [<ffff0000081499e8>] remote_function+0x60/0x70 > > [ 1885.047514] [<ffff0000081194fc>] flush_smp_call_function_queue+0x9c/0x168 > > [ 1885.047637] [<ffff00000811a2a0>] generic_smp_call_function_single_interrupt+0x10/0x18 > > [ 1885.047733] [<ffff00000808e928>] handle_IPI+0xc0/0x1d8 > > [ 1885.047799] [<ffff000008081700>] gic_handle_irq+0x80/0x178 > > > [ 1885.048900] [<ffff0000080827f4>] el1_irq+0xb4/0x128 > > [ 1885.049009] [<ffff00000861018c>] cpuidle_enter_state+0x154/0x200 > > [ 1885.049126] [<ffff000008610270>] cpuidle_enter+0x18/0x20 > > [ 1885.049207] [<ffff0000080ddd08>] call_cpuidle+0x18/0x30 > > [ 1885.049332] [<ffff0000080ddf44>] do_idle+0x19c/0x1d8 > > [ 1885.049400] [<ffff0000080de114>] cpu_startup_entry+0x24/0x28 > > [ 1885.049453] [<ffff0000087a6b30>] rest_init+0x80/0x90 > > [ 1885.049500] [<ffff000008b10b3c>] start_kernel+0x374/0x388 > > [ 1885.049617] [<ffff000008b101e0>] __primary_switched+0x64/0x6c > > [ 1885.049699] Code: 91010108 54ffff4a 8b040108 cb050042 (d50b7428) > > [ 1885.049800] ---[ end trace 31b9a9f27da95f58 ]--- > > [ 1885.049900] Kernel panic - not syncing: Fatal exception in interrupt > > [ 1885.050000] SMP: stopping secondary CPUs > > [ 1885.050204] Kernel Offset: disabled > > [ 1885.050240] Memory Limit: none > > [ 1885.050327] ---[ end Kernel panic - not syncing: Fatal exception in interrupt > > That's worrying. I'll see if I can reproduce this. Actually, this might be down to the IDX2OFF() macro being borked for non power-of-two buffer sizes. Do you have Will's latest fixes? In his tree there's a commit: 4f331cd62531dce2 ("squash! drivers/perf: Add support for ARMv8.2 Statistical Profiling Extension") ... which should fix the IDX2OFF() bug. It's be good to reproduce the issue if we can, regardless. Thanks, Mark.
On Wed, 28 Jun 2017 12:26:02 +0100 Mark Rutland <mark.rutland@arm.com> wrote: > On Tue, Jun 27, 2017 at 04:07:58PM -0500, Kim Phillips wrote: > > I'm close to finishing the bts version of userspace, and have been > > testing a bit more thoroughly, so now I consistently see the excessive > > PADding when recording a CPU that's idle. I.e., when I taskset the perf > > record to the same CPU I specify to record's -C (taskset -c n perf > > record -C n), I get max. twenty-odd number of PAD bytes at the end of > > the AUX buffers in the perf.data file. If, OTOH, I taskset -c n perf > > record -C m, where m != n, I get a couple of valid event records in the > > buffer, and the rest of the buffer is filled with PADding. > > > > It wouldn't be a problem except that it's wastes too much space > > sometimes. Here is a good output buffer sample from a --mmap-pages=,12 > > run, with only 4 PADs tacked onto the end: > > > > 0xd190 [0x30]: PERF_RECORD_AUXTRACE size: 0x48 offset: 0 ref: 0xe914f7e3ce idx: 0 tid: -1 cpu: 2 > > . > > . ... ARM SPE data: size 72 bytes > > . 00000000: 4a 01 B COND > > [...] > > > . 0000003b: 71 a5 39 e1 14 e9 00 00 00 TS 1001077684645 > > . 00000044: 00 PAD > > . 00000045: 00 PAD > > . 00000046: 00 PAD > > . 00000047: 00 PAD > > > > whereas this one - from later on in the same run - is over 99% PADs: > > > > 0xd250 [0x30]: PERF_RECORD_AUXTRACE size: 0x5fc0 offset: 0xfffff4ae0044 ref: 0xe91cead1dd idx: 0 tid: -1 cpu: 2 > > . > > . ... ARM SPE data: size 24512 bytes > > . 00000000: 4a 00 B > > [...] > > > . 000000b0: 71 8f 4e e1 14 e9 00 00 00 TS 1001077689999 > > . 000000b9: 00 PAD > > ...ALL PADs...ALL PADs...ALL PADs...ALL PADs...ALL PADs...ALL PADs... > > . 00005fbf: 00 PAD > > Interesting. > > If you cat /proc/interrupts, do you see many more SPE interrupts on CPU > n than on m? When n == m, I see approx. 1 IRQ per SPE buffer full. When n != m, I see neither CPU n or m incur SPE interrupts; the workload ran but didn't get recorded, or, rather, 'idleness' got recorded instead. > Otherwise, I wonder if this is some odd interaction with idle. Can you > try to forcefully load that other CPU? > > e.g. run something like: > > taskset -c <n> sh -c 'while true; do done' > > ... in parallel with the tracer. If I do a: taskset -c 1 sh -c 'while true; do echo blah > /dev/null' & taskset -c 0 perf record -C 1 ... then non-idleness and non-PADdingness get recorded. > For reference, what was your event sample period (i.e. the value of > perf_event_attr::sample_period)? > > Did you modify that at all with PERF_EVENT_IOC_PERIOD? If that's the same as 'perf record -c <period>', then, yes, I set the period to values such as 512, 1024. > > > > Meanwhile, when using fvp-base.dtb, my model setup stops booting the > > > > kernel after "smp: Bringing up secondary CPUs ...". If I however take > > > > the second SPE node from fvp-base.dts and add it to my working device > > > > tree, I get this during the driver probe: > > > > > > > > [ 1.042063] arm_spe_pmu spe-pmu@0: probed for CPUs 0-7 [max_record_sz 64, align 1, features 0xf] > > > > [ 1.043582] arm_spe_pmu spe-pmu@1: probed for CPUs 0-7 [max_record_sz 64, align 1, features 0xf] > > > > [ 1.043631] genirq: Flags mismatch irq 6. 00004404 (arm_spe_pmu) vs. 00004404 (arm_spe_pmu) > > > > > > Looks like you've screwed up your IRQ partitions, so you are effectively > > > registering the same device twice, which then blows up due to lack of shared > > > irqs. > > > > > > Either remove one of the devices, or use IRQ partitions to restrict them > > > to unique sets of CPUs. > > > > Right, but since I want to get parity with what you're running - > > fvp_base.dtb - I tried to debug the hang after "smp: Bringing up > > secondary CPUs ..." problem, and could only debug it to the PSCI driver > > hitting one of these cases: > > > > case PSCI_RET_INVALID_PARAMS: > > case PSCI_RET_INVALID_ADDRESS: > > Sounds like your DT is describing CPUs that don't exist (or perhaps the > same CPU several times). Certainly, PSCI and the kernel disagree on > which CPUS exist. > > What exact DT are you using? the one this commit to linux-will's perf/spe branch provides: commit 2a73de57eaf61cdfd61be1e20a46e4a2c326775f Author: Marc Zyngier <marc.zyngier@arm.com> Date: Tue Mar 11 18:14:45 2014 +0000 arm64: dts: add model device-tree Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com> > Are you using the bootwrapper, or ATF? I'm guessing you're using the > bootwrapper. I'm using the wrapper to wrap arm-trusted-firmware (ATF?) objects, so, both? I noticed the wrapper I was using was pretty old, so I updated it. arm-trusted-firmware, btw, has just been updated to enable SPE at lower ELs, so I don't have to use a hacked-up version anymore. I also updated my BL33 to the latest upstream u-boot vexpress_aemv8a_dram_defconfig, and at least now the kernel continues to boot, even though it can't bring up 6 of the 7 secondary CPUs. > Which version of the bootwrapepr are you using? If it doesn't have > commit: > > ccdc936924b3682d ("Dynamically determine the set of CPUs") > > ... have you configured it appropriately with --with-cpu-ids? > > How is your model configured? CLUSTER0_NUM_CORES=4 CLUSTER1_NUM_CORES=4 > Which CPU IDs does it think exist? 1,2,3,4,0x100,0x101,0x102,0x103 ...which are different from the above device tree!: 0,0x100,0x200,0x300,0x10000,0x10100,0x10200,0x10300 So I imagine that's the problem, thanks! I don't see how to tell the model to put the CPUs at different addresses, only a lot of GICv3 redistributor switches? btw, where can I get updates to the run-model.sh scripts? Answer off-list? > > Note: it's yet another place I have to manually instrument the error > > path in a kernel driver in lieu of it being more naturally verbose by > > itself; I *implore* you to reconsider adding proper user messaging to > > arm_spe_pmu_event_init(). > > Given this is a FW configuration issue (i.e. a system-level error), I'm > more than happy to make the PSCI driver messages more helpful where > possible. > > That's completely orthogonal to the SPE debug messages for requests made > by the user. I respectfully disagree, given the current state of the interfaces involved. > > I can't tell which part of the fvp-base device tree is not liked by the > > firmware; I tried different combinations of the PSCI node, different CPU > > enumerations (cpu@100 vs cpu@1), removing idle-states properties...any > > hints appreciated. > > The bootwrapper doesn't support idle. So no idle-states should be in the > DT. > > If you can share your DT, bootwrapper configuration, and model > configuration, I can try to debug this with you. I reverted the wrapper's ccdc936924b3682d ("Dynamically determine the set of CPUs") commit you mentioned above, and specified the cpu-ids manually, and am now running with 8 CPUs, although linux enumerates them as 0,1,8,9,10,11,12,13? Thanks for your continued support, Kim
On Wed, 28 Jun 2017 12:32:50 +0100 Mark Rutland <mark.rutland@arm.com> wrote: > On Wed, Jun 28, 2017 at 12:26:02PM +0100, Mark Rutland wrote: > > On Tue, Jun 27, 2017 at 04:07:58PM -0500, Kim Phillips wrote: > > > FWIW, there is also this one I saw with mmap-pages set to 5 > > > (pages), which gets rounded up to 8 pages: > > > > Sorry, *what* does the rounding upwards? Userspace, perf core, or the > > driver? Where? SPE implementations may vary from the minimum buffer alignment of the smallest available page size, so I left the bts userspace tool code's upwards-rounding code intact for now. I'll take this opportunity to submit the SPE perf tool patch in the form of a reply to this email: Look for the rounding code in tools/perf/arch/arm64/util/arm-spe.c:arm_spe_recording_options(). > > That's worrying. I'll see if I can reproduce this. > > Actually, this might be down to the IDX2OFF() macro being borked for non > power-of-two buffer sizes. > > Do you have Will's latest fixes? In his tree there's a commit: > > 4f331cd62531dce2 ("squash! drivers/perf: Add support for ARMv8.2 Statistical Profiling Extension") > > ... which should fix the IDX2OFF() bug. yes, I've been running with that squash! commit since a couple of days after I noticed it over a week ago. > It's be good to reproduce the issue if we can, regardless. FWIW, I couldn't the little I tried today. Kim
On Wed, Jun 28, 2017 at 07:59:53PM -0500, Kim Phillips wrote: > On Wed, 28 Jun 2017 12:26:02 +0100 > Mark Rutland <mark.rutland@arm.com> wrote: > > On Tue, Jun 27, 2017 at 04:07:58PM -0500, Kim Phillips wrote: > > > > > Meanwhile, when using fvp-base.dtb, my model setup stops booting the > > > > > kernel after "smp: Bringing up secondary CPUs ...". If I however take > > > > > the second SPE node from fvp-base.dts and add it to my working device > > > > > tree, I get this during the driver probe: > > > > > > > > > > [ 1.042063] arm_spe_pmu spe-pmu@0: probed for CPUs 0-7 [max_record_sz 64, align 1, features 0xf] > > > > > [ 1.043582] arm_spe_pmu spe-pmu@1: probed for CPUs 0-7 [max_record_sz 64, align 1, features 0xf] > > > > > [ 1.043631] genirq: Flags mismatch irq 6. 00004404 (arm_spe_pmu) vs. 00004404 (arm_spe_pmu) > > > > > > > > Looks like you've screwed up your IRQ partitions, so you are effectively > > > > registering the same device twice, which then blows up due to lack of shared > > > > irqs. > > > > > > > > Either remove one of the devices, or use IRQ partitions to restrict them > > > > to unique sets of CPUs. > > > > > > Right, but since I want to get parity with what you're running - > > > fvp_base.dtb - I tried to debug the hang after "smp: Bringing up > > > secondary CPUs ..." problem, and could only debug it to the PSCI driver > > > hitting one of these cases: > > > > > > case PSCI_RET_INVALID_PARAMS: > > > case PSCI_RET_INVALID_ADDRESS: > > > > Sounds like your DT is describing CPUs that don't exist (or perhaps the > > same CPU several times). Certainly, PSCI and the kernel disagree on > > which CPUS exist. > > > > What exact DT are you using? > > the one this commit to linux-will's perf/spe branch provides: > > commit 2a73de57eaf61cdfd61be1e20a46e4a2c326775f > Author: Marc Zyngier <marc.zyngier@arm.com> > Date: Tue Mar 11 18:14:45 2014 +0000 > > arm64: dts: add model device-tree > > Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> > Signed-off-by: Will Deacon <will.deacon@arm.com> > > > Are you using the bootwrapper, or ATF? I'm guessing you're using the > > bootwrapper. > > I'm using the wrapper to wrap arm-trusted-firmware (ATF?) objects, so, > both? I noticed the wrapper I was using was pretty old, so I updated > it. Ok. So what's likely happening is that ATF and the bootwrapper's FDT disagree as to the set of CPUs. You're using ATF's PSCI implementation, and not the boot-wrapper's. I don't know how ATF enumerates CPUs on the model, so I can't offer much guidance here other than fixing your DT to match whatever ATF believes exists. > arm-trusted-firmware, btw, has just been updated to enable SPE at lower > ELs, so I don't have to use a hacked-up version anymore. > > I also updated my BL33 to the latest upstream u-boot > vexpress_aemv8a_dram_defconfig, and at least now the kernel continues > to boot, even though it can't bring up 6 of the 7 secondary CPUs. Do you mean that you replaced the bootwrapper with u-boot? I'm a little confused here. Regardless, it sounds like whatever DT is passed to the kernel still doesn't match the real model configuration. > > Which version of the bootwrapepr are you using? If it doesn't have > > commit: > > > > ccdc936924b3682d ("Dynamically determine the set of CPUs") > > > > ... have you configured it appropriately with --with-cpu-ids? > > > > How is your model configured? > > CLUSTER0_NUM_CORES=4 > CLUSTER1_NUM_CORES=4 > > > Which CPU IDs does it think exist? > > 1,2,3,4,0x100,0x101,0x102,0x103 > > ...which are different from the above device tree!: > > 0,0x100,0x200,0x300,0x10000,0x10100,0x10200,0x10300 > > So I imagine that's the problem, thanks! Sounds like it, yes. > I don't see how to tell the model to put the CPUs at different > addresses, only a lot of GICv3 redistributor switches? I don't know how to do this, sorry. > btw, where can I get updates to the run-model.sh scripts? Answer > off-list? I don't know which script you're referring to. Contact whoever you got it from initially? [...] > > > I can't tell which part of the fvp-base device tree is not liked by the > > > firmware; I tried different combinations of the PSCI node, different CPU > > > enumerations (cpu@100 vs cpu@1), removing idle-states properties...any > > > hints appreciated. > > > > The bootwrapper doesn't support idle. So no idle-states should be in the > > DT. > > > > If you can share your DT, bootwrapper configuration, and model > > configuration, I can try to debug this with you. > > I reverted the wrapper's ccdc936924b3682d ("Dynamically determine the > set of CPUs") commit you mentioned above, and specified the cpu-ids > manually, and am now running with 8 CPUs, although linux enumerates > them as 0,1,8,9,10,11,12,13? The --with-cpu-ids option *adds* CPU nodes, but leaves the broken ones, and your CPU phandles (and PPI partitions for the SPE node(s)) will all be wrong. Linux is still seeing those erroneous CPU nodes (presumably taking Linux CPU ids 2-7). Generally, --with-cpu-ids doesn't work as you'd expect, which is why it got removed in favour of assuming an initally correct DT. Please fix the DT instead. With a fixed DT, and commit ccdc936924b3682d, the bootwrapper won't further mangle your DT. Thanks, Mark.
Hi Kim, On Wed, Jun 28, 2017 at 08:43:10PM -0500, Kim Phillips wrote: > 'perf record' and 'perf report --dump-raw-trace' supported in this > release. > > Example usage: > > taskset -c 2 ./perf record -C 2 -c 1024 -e arm_spe_0/ts_enable=1,pa_enable=1/ \ > dd if=/dev/zero of=/dev/null count=10000 > > perf report --dump-raw-trace > > Note that the perf.data file is portable, so the report can be run on another > architecture host if necessary. > > Output will contain raw SPE data and its textual representation, such as: > > 0xc7d0 [0x30]: PERF_RECORD_AUXTRACE size: 0x82f70 offset: 0 ref: 0x1e947e88189 idx: 0 tid: -1 cpu: 2 > . > . ... ARM SPE data: size 536432 bytes > . 00000000: 4a 01 B COND > . 00000002: b1 00 00 00 00 00 00 00 80 TGT 0 el0 ns=1 > . 0000000b: 42 42 RETIRED NOT-TAKEN > . 0000000d: b0 20 41 c0 ad ff ff 00 80 PC ffffadc04120 el0 ns=1 > . 00000016: 98 00 00 LAT 0 TOT > . 00000019: 71 80 3e f7 46 e9 01 00 00 TS 2101429616256 > . 00000022: 49 01 ST > . 00000024: b2 50 bd ba 73 00 80 ff ff VA ffff800073babd50 > . 0000002d: b3 50 bd ba f3 00 00 00 80 PA f3babd50 ns=1 > . 00000036: 9a 00 00 LAT 0 XLAT > . 00000039: 42 16 RETIRED L1D-ACCESS TLB-ACCESS > . 0000003b: b0 8c b4 1e 08 00 00 ff ff PC ff0000081eb48c el3 ns=1 > . 00000044: 98 00 00 LAT 0 TOT > . 00000047: 71 cc 44 f7 46 e9 01 00 00 TS 2101429617868 > . 00000050: 48 00 INSN-OTHER > . 00000052: 42 02 RETIRED > . 00000054: b0 58 54 1f 08 00 00 ff ff PC ff0000081f5458 el3 ns=1 > . 0000005d: 98 00 00 LAT 0 TOT > . 00000060: 71 cc 44 f7 46 e9 01 00 00 TS 2101429617868 > ... > > Other release notes: > > - applies to acme's perf/{core,urgent} branches, likely elsewhere > > - Record requires Will's SPE driver, currently undergoing upstream review > > - the intel-bts implementation was used as a starting point; its > min/default/max buffer sizes and power of 2 pages granularity need to be > revisited for ARM SPE > > - not been tested on platforms with multiple SPE clusters/domains > > - snapshot support (record -S), and conversion to native perf events > (e.g., via 'perf inject --itrace'), are still in development > > - technically both cs-etm and spe can be used simultaneously, however > disabled for simplicity in this release > > Signed-off-by: Kim Phillips <kim.phillips@arm.com> > --- > tools/perf/arch/arm/util/auxtrace.c | 20 +- > tools/perf/arch/arm/util/pmu.c | 3 + > tools/perf/arch/arm64/util/Build | 3 +- > tools/perf/arch/arm64/util/arm-spe.c | 210 ++++++++++++++++ > tools/perf/util/Build | 2 + > tools/perf/util/arm-spe-pkt-decoder.c | 448 ++++++++++++++++++++++++++++++++++ > tools/perf/util/arm-spe-pkt-decoder.h | 52 ++++ > tools/perf/util/arm-spe.c | 318 ++++++++++++++++++++++++ > tools/perf/util/arm-spe.h | 39 +++ > tools/perf/util/auxtrace.c | 3 + > tools/perf/util/auxtrace.h | 1 + > 11 files changed, 1095 insertions(+), 4 deletions(-) > create mode 100644 tools/perf/arch/arm64/util/arm-spe.c > create mode 100644 tools/perf/util/arm-spe-pkt-decoder.c > create mode 100644 tools/perf/util/arm-spe-pkt-decoder.h > create mode 100644 tools/perf/util/arm-spe.c > create mode 100644 tools/perf/util/arm-spe.h > > diff --git a/tools/perf/arch/arm/util/auxtrace.c b/tools/perf/arch/arm/util/auxtrace.c > index 8edf2cb71564..ec071609e8ac 100644 > --- a/tools/perf/arch/arm/util/auxtrace.c > +++ b/tools/perf/arch/arm/util/auxtrace.c > @@ -22,29 +22,43 @@ > #include "../../util/evlist.h" > #include "../../util/pmu.h" > #include "cs-etm.h" > +#include "arm-spe.h" > > struct auxtrace_record > *auxtrace_record__init(struct perf_evlist *evlist, int *err) > { > - struct perf_pmu *cs_etm_pmu; > + struct perf_pmu *cs_etm_pmu, *arm_spe_pmu; > struct perf_evsel *evsel; > - bool found_etm = false; > + bool found_etm = false, found_spe = false; > > cs_etm_pmu = perf_pmu__find(CORESIGHT_ETM_PMU_NAME); > + arm_spe_pmu = perf_pmu__find(ARM_SPE_PMU_NAME); > > if (evlist) { > evlist__for_each_entry(evlist, evsel) { > if (cs_etm_pmu && > evsel->attr.type == cs_etm_pmu->type) > found_etm = true; > + if (arm_spe_pmu && > + evsel->attr.type == arm_spe_pmu->type) > + found_spe = true; Given ARM_SPE_PMU_NAME is defined as "arm_spe_0", this won't detect all SPE PMUs in heterogeneous setups (e.g. this'll fail to match "arm_spe_1" and so on). Can we not find all PMUs with a "arm_spe_" prefix? ... or, taking a step back, do we need some sysfs "class" attribute to identify multi-instance PMUs? > } > } > > + if (found_etm && found_spe) { > + pr_err("Concurrent ARM Coresight ETM and SPE operation not currently supported\n"); > + *err = -EOPNOTSUPP; > + return NULL; > + } > + > if (found_etm) > return cs_etm_record_init(err); > > + if (found_spe) > + return arm_spe_recording_init(err); ... so given the above, this will fail. AFAICT, this means that perf record opens the event, but doesn't touch the aux buffer at all. > + > /* > - * Clear 'err' even if we haven't found a cs_etm event - that way perf > + * Clear 'err' even if we haven't found an event - that way perf > * record can still be used even if tracers aren't present. The NULL > * return value will take care of telling the infrastructure HW tracing > * isn't available. > diff --git a/tools/perf/arch/arm/util/pmu.c b/tools/perf/arch/arm/util/pmu.c > index 98d67399a0d6..71fb8f13b40a 100644 > --- a/tools/perf/arch/arm/util/pmu.c > +++ b/tools/perf/arch/arm/util/pmu.c > @@ -20,6 +20,7 @@ > #include <linux/perf_event.h> > > #include "cs-etm.h" > +#include "arm-spe.h" > #include "../../util/pmu.h" > > struct perf_event_attr > @@ -31,6 +32,8 @@ struct perf_event_attr > pmu->selectable = true; > pmu->set_drv_config = cs_etm_set_drv_config; > } > + if (!strcmp(pmu->name, ARM_SPE_PMU_NAME)) > + pmu->selectable = true; ... likewise I here. I guess we need an is_arm_spe_pmu() helper for both cases, iterating over all PMUs. > #endif > return NULL; > } > diff --git a/tools/perf/arch/arm64/util/Build b/tools/perf/arch/arm64/util/Build > index cef6fb38d17e..f9969bb88ccb 100644 > --- a/tools/perf/arch/arm64/util/Build > +++ b/tools/perf/arch/arm64/util/Build > @@ -3,4 +3,5 @@ libperf-$(CONFIG_LOCAL_LIBUNWIND) += unwind-libunwind.o > > libperf-$(CONFIG_AUXTRACE) += ../../arm/util/pmu.o \ > ../../arm/util/auxtrace.o \ > - ../../arm/util/cs-etm.o > + ../../arm/util/cs-etm.o \ > + arm-spe.o > diff --git a/tools/perf/arch/arm64/util/arm-spe.c b/tools/perf/arch/arm64/util/arm-spe.c > new file mode 100644 > index 000000000000..07172764881c > --- /dev/null > +++ b/tools/perf/arch/arm64/util/arm-spe.c > @@ -0,0 +1,210 @@ > +/* > + * ARM Statistical Profiling Extensions (SPE) support > + * Copyright (c) 2017, ARM Ltd. > + * > + * This program is free software; you can redistribute it and/or modify it > + * under the terms and conditions of the GNU General Public License, > + * version 2, as published by the Free Software Foundation. > + * > + * This program is distributed in the hope it will be useful, but WITHOUT > + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or > + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for > + * more details. > + * > + */ > + > +#include <linux/kernel.h> > +#include <linux/types.h> > +#include <linux/bitops.h> > +#include <linux/log2.h> > + > +#include "../../util/cpumap.h" > +#include "../../util/evsel.h" > +#include "../../util/evlist.h" > +#include "../../util/session.h" > +#include "../../util/util.h" > +#include "../../util/pmu.h" > +#include "../../util/debug.h" > +#include "../../util/tsc.h" > +#include "../../util/auxtrace.h" > +#include "../../util/arm-spe.h" > + > +#define KiB(x) ((x) * 1024) > +#define MiB(x) ((x) * 1024 * 1024) > +#define KiB_MASK(x) (KiB(x) - 1) > +#define MiB_MASK(x) (MiB(x) - 1) It's a shame we don't have a UAPI version of <linux/sizes.h>, though I guess it's good to do the same thing as the x86 code. I was a little worried that the *_MASK() helpers might be used with a non power-of-two x, but it seems they're unused even for x86. Can we please delete them? > +struct arm_spe_recording { > + struct auxtrace_record itr; > + struct perf_pmu *arm_spe_pmu; > + struct perf_evlist *evlist; > +}; A user may wich to record trace on separate uarches simultaneously, so having a singleton arm_spe_pmu for the entire evlist doesn't seem right. e.g. I don't see why we should allow a user to do: ./perf record -c 1024 \ -e arm_spe_0/ts_enable=1,pa_enable=0/ \ -e arm_spe_1/ts_enable=1,pa_enable=0/ \ ${WORKLOAD} ... which perf-record seems to accept today, but I don't seem to get any aux trace, regardless of whether I taskset the entire thing to any particular CPU. It also seems that the events aren't task bound, judging by -vv output: ------------------------------------------------------------ perf_event_attr: type 7 size 112 config 0x1 { sample_period, sample_freq } 1024 sample_type IP|TID|TIME|CPU|IDENTIFIER read_format ID disabled 1 inherit 1 enable_on_exec 1 sample_id_all 1 exclude_guest 1 ------------------------------------------------------------ sys_perf_event_open: pid -1 cpu 0 group_fd -1 flags 0x8 = 4 sys_perf_event_open: pid -1 cpu 1 group_fd -1 flags 0x8 = 5 sys_perf_event_open: pid -1 cpu 2 group_fd -1 flags 0x8 = 6 sys_perf_event_open: pid -1 cpu 3 group_fd -1 flags 0x8 = 8 ------------------------------------------------------------ perf_event_attr: type 8 size 112 config 0x1 { sample_period, sample_freq } 1024 sample_type IP|TID|TIME|IDENTIFIER read_format ID disabled 1 inherit 1 enable_on_exec 1 sample_id_all 1 exclude_guest 1 ------------------------------------------------------------ sys_perf_event_open: pid -1 cpu 4 group_fd -1 flags 0x8 = 9 sys_perf_event_open: pid -1 cpu 5 group_fd -1 flags 0x8 = 10 sys_perf_event_open: pid -1 cpu 6 group_fd -1 flags 0x8 = 11 sys_perf_event_open: pid -1 cpu 7 group_fd -1 flags 0x8 = 12 ------------------------------------------------------------ perf_event_attr: type 1 size 112 config 0x9 { sample_period, sample_freq } 1024 sample_type IP|TID|TIME|IDENTIFIER read_format ID disabled 1 inherit 1 exclude_kernel 1 exclude_hv 1 mmap 1 comm 1 enable_on_exec 1 task 1 sample_id_all 1 mmap2 1 comm_exec 1 ------------------------------------------------------------ sys_perf_event_open: pid 2181 cpu 0 group_fd -1 flags 0x8 = 13 sys_perf_event_open: pid 2181 cpu 1 group_fd -1 flags 0x8 = 14 sys_perf_event_open: pid 2181 cpu 2 group_fd -1 flags 0x8 = 15 sys_perf_event_open: pid 2181 cpu 3 group_fd -1 flags 0x8 = 16 sys_perf_event_open: pid 2181 cpu 4 group_fd -1 flags 0x8 = 17 sys_perf_event_open: pid 2181 cpu 5 group_fd -1 flags 0x8 = 18 sys_perf_event_open: pid 2181 cpu 6 group_fd -1 flags 0x8 = 19 sys_perf_event_open: pid 2181 cpu 7 group_fd -1 flags 0x8 = 20 I see something similar (i.e. perf doesn't try to bind the events to the workload PID) when I try to record with only a single PMU. In that case, perf-record blows up because it can't handle events on a subset of CPUs (though it should be able to): nanook@torsk:~$ ./perf record -vv -c 1024 -e arm_spe_0/ts_enable=1,pa_enable=0/ true WARNING: Kernel address maps (/proc/{kallsyms,modules}) are restricted, check /proc/sys/kernel/kptr_restrict. Samples in kernel functions may not be resolved if a suitable vmlinux file is not found in the buildid cache or in the vmlinux path. Samples in kernel modules won't be resolved at all. If some relocation was applied (e.g. kexec) symbols may be misresolved even with a suitable vmlinux or kallsyms file. Problems creating module maps, continuing anyway... ------------------------------------------------------------ perf_event_attr: type 7 size 112 config 0x1 { sample_period, sample_freq } 1024 sample_type IP|TID|TIME|CPU|IDENTIFIER read_format ID disabled 1 inherit 1 enable_on_exec 1 sample_id_all 1 exclude_guest 1 ------------------------------------------------------------ sys_perf_event_open: pid -1 cpu 0 group_fd -1 flags 0x8 = 4 sys_perf_event_open: pid -1 cpu 1 group_fd -1 flags 0x8 = 5 sys_perf_event_open: pid -1 cpu 2 group_fd -1 flags 0x8 = 6 sys_perf_event_open: pid -1 cpu 3 group_fd -1 flags 0x8 = 8 ------------------------------------------------------------ perf_event_attr: type 1 size 112 config 0x9 { sample_period, sample_freq } 1024 sample_type IP|TID|TIME|IDENTIFIER read_format ID disabled 1 inherit 1 exclude_kernel 1 exclude_hv 1 mmap 1 comm 1 enable_on_exec 1 task 1 sample_id_all 1 mmap2 1 comm_exec 1 ------------------------------------------------------------ sys_perf_event_open: pid 2185 cpu 0 group_fd -1 flags 0x8 = 9 sys_perf_event_open: pid 2185 cpu 1 group_fd -1 flags 0x8 = 10 sys_perf_event_open: pid 2185 cpu 2 group_fd -1 flags 0x8 = 11 sys_perf_event_open: pid 2185 cpu 3 group_fd -1 flags 0x8 = 12 sys_perf_event_open: pid 2185 cpu 4 group_fd -1 flags 0x8 = 13 sys_perf_event_open: pid 2185 cpu 5 group_fd -1 flags 0x8 = 14 sys_perf_event_open: pid 2185 cpu 6 group_fd -1 flags 0x8 = 15 sys_perf_event_open: pid 2185 cpu 7 group_fd -1 flags 0x8 = 16 mmap size 266240B AUX area mmap length 131072 perf event ring buffer mmapped per cpu failed to mmap AUX area failed to mmap with 524 (INTERNAL ERROR: strerror_r(524, 0xffffc8596038, 512)=22) ... with a SW event, this works as expected, being bound to the workload PID: nanook@torsk:~$ ./perf record -vvv -e context-switches true WARNING: Kernel address maps (/proc/{kallsyms,modules}) are restricted, check /proc/sys/kernel/kptr_restrict. Samples in kernel functions may not be resolved if a suitable vmlinux file is not found in the buildid cache or in the vmlinux path. Samples in kernel modules won't be resolved at all. If some relocation was applied (e.g. kexec) symbols may be misresolved even with a suitable vmlinux or kallsyms file. Problems creating module maps, continuing anyway... ------------------------------------------------------------ perf_event_attr: type 1 size 112 config 0x3 { sample_period, sample_freq } 4000 sample_type IP|TID|TIME|PERIOD disabled 1 inherit 1 mmap 1 comm 1 freq 1 enable_on_exec 1 task 1 sample_id_all 1 exclude_guest 1 mmap2 1 comm_exec 1 ------------------------------------------------------------ sys_perf_event_open: pid 2220 cpu 0 group_fd -1 flags 0x8 = 4 sys_perf_event_open: pid 2220 cpu 1 group_fd -1 flags 0x8 = 5 sys_perf_event_open: pid 2220 cpu 2 group_fd -1 flags 0x8 = 6 sys_perf_event_open: pid 2220 cpu 3 group_fd -1 flags 0x8 = 8 sys_perf_event_open: pid 2220 cpu 4 group_fd -1 flags 0x8 = 9 sys_perf_event_open: pid 2220 cpu 5 group_fd -1 flags 0x8 = 10 sys_perf_event_open: pid 2220 cpu 6 group_fd -1 flags 0x8 = 11 sys_perf_event_open: pid 2220 cpu 7 group_fd -1 flags 0x8 = 12 mmap size 528384B perf event ring buffer mmapped per cpu Couldn't record kernel reference relocation symbol Symbol resolution may be skewed if relocation was used (e.g. kexec). Check /proc/kallsyms permission or run as root. [ perf record: Woken up 1 times to write data ] overlapping maps in /lib/aarch64-linux-gnu/ld-2.19.so (disable tui for more info) overlapping maps in [vdso] (disable tui for more info) overlapping maps in /tmp/perf-2220.map (disable tui for more info) Looking at the vmlinux_path (8 entries long) No kallsyms or vmlinux with build-id cc083c873190ff1254624d3137142c6841c118c3 was found [kernel.kallsyms] with build id cc083c873190ff1254624d3137142c6841c118c3 not found, continuing without symbols overlapping maps in /etc/ld.so.cache (disable tui for more info) overlapping maps in /lib/aarch64-linux-gnu/libc-2.19.so (disable tui for more info) overlapping maps in /tmp/perf-2220.map (disable tui for more info) overlapping maps in /tmp/perf-2220.map (disable tui for more info) overlapping maps in /tmp/perf-2220.map (disable tui for more info) overlapping maps in /tmp/perf-2220.map (disable tui for more info) overlapping maps in /lib/aarch64-linux-gnu/libc-2.19.so (disable tui for more info) overlapping maps in /bin/true (disable tui for more info) overlapping maps in /lib/aarch64-linux-gnu/ld-2.19.so (disable tui for more info) failed to write feature HEADER_CPUDESC failed to write feature HEADER_CPUID [ perf record: Captured and wrote 0.002 MB perf.data (4 samples) ] ... so I guess this has something to do with the way the tool tries to use the cpumask, maknig the wrong assumption that this implies system-wide collection is mandatory / expected. > + > +static size_t > +arm_spe_info_priv_size(struct auxtrace_record *itr __maybe_unused, > + struct perf_evlist *evlist __maybe_unused) > +{ > + return ARM_SPE_AUXTRACE_PRIV_SIZE; > +} > + > +static int arm_spe_info_fill(struct auxtrace_record *itr, > + struct perf_session *session, > + struct auxtrace_info_event *auxtrace_info, > + size_t priv_size) > +{ > + struct arm_spe_recording *sper = > + container_of(itr, struct arm_spe_recording, itr); > + struct perf_pmu *arm_spe_pmu = sper->arm_spe_pmu; > + > + if (priv_size != ARM_SPE_AUXTRACE_PRIV_SIZE) > + return -EINVAL; > + > + if (!session->evlist->nr_mmaps) > + return -EINVAL; > + > + auxtrace_info->type = PERF_AUXTRACE_ARM_SPE; > + auxtrace_info->priv[ARM_SPE_PMU_TYPE] = arm_spe_pmu->type; > + > + return 0; > +} > + > +static int arm_spe_recording_options(struct auxtrace_record *itr, > + struct perf_evlist *evlist, > + struct record_opts *opts) > +{ > + struct arm_spe_recording *sper = > + container_of(itr, struct arm_spe_recording, itr); > + struct perf_pmu *arm_spe_pmu = sper->arm_spe_pmu; > + struct perf_evsel *evsel, *arm_spe_evsel = NULL; > + const struct cpu_map *cpus = evlist->cpus; > + bool privileged = geteuid() == 0 || perf_event_paranoid() < 0; > + struct perf_evsel *tracking_evsel; > + int err; > + > + sper->evlist = evlist; > + > + evlist__for_each_entry(evlist, evsel) { > + if (evsel->attr.type == arm_spe_pmu->type) { > + if (arm_spe_evsel) { > + pr_err("There may be only one " ARM_SPE_PMU_NAME " event\n"); > + return -EINVAL; > + } > + evsel->attr.freq = 0; > + evsel->attr.sample_period = 1; > + arm_spe_evsel = evsel; > + opts->full_auxtrace = true; > + } > + } Theoretically, we could ask for different events on different CPUs, but otehrwise, this looks sane. > + > + if (!opts->full_auxtrace) > + return 0; > + > + /* We are in full trace mode but '-m,xyz' wasn't specified */ > + if (opts->full_auxtrace && !opts->auxtrace_mmap_pages) { > + if (privileged) { > + opts->auxtrace_mmap_pages = MiB(4) / page_size; > + } else { > + opts->auxtrace_mmap_pages = KiB(128) / page_size; > + if (opts->mmap_pages == UINT_MAX) > + opts->mmap_pages = KiB(256) / page_size; > + } > + } > + > + /* Validate auxtrace_mmap_pages */ > + if (opts->auxtrace_mmap_pages) { > + size_t sz = opts->auxtrace_mmap_pages * (size_t)page_size; > + size_t min_sz = KiB(8); > + > + if (sz < min_sz || !is_power_of_2(sz)) { > + pr_err("Invalid mmap size for ARM SPE: must be at least %zuKiB and a power of 2\n", > + min_sz / 1024); > + return -EINVAL; > + } > + } > + > + /* > + * To obtain the auxtrace buffer file descriptor, the auxtrace event > + * must come first. > + */ > + perf_evlist__to_front(evlist, arm_spe_evsel); Huh? *what* needs the auxtrace buffer fd? This seems really fragile. Can't we store this elsewhere? > + > + /* > + * In the case of per-cpu mmaps, we need the CPU on the > + * AUX event. > + */ > + if (!cpu_map__empty(cpus)) > + perf_evsel__set_sample_bit(arm_spe_evsel, CPU); > + > + /* Add dummy event to keep tracking */ > + err = parse_events(evlist, "dummy:u", NULL); > + if (err) > + return err; > + > + tracking_evsel = perf_evlist__last(evlist); > + perf_evlist__set_tracking_event(evlist, tracking_evsel); > + > + tracking_evsel->attr.freq = 0; > + tracking_evsel->attr.sample_period = 1; > + > + /* In per-cpu case, always need the time of mmap events etc */ > + if (!cpu_map__empty(cpus)) > + perf_evsel__set_sample_bit(tracking_evsel, TIME); > + > + return 0; > +} > + > +static u64 arm_spe_reference(struct auxtrace_record *itr __maybe_unused) > +{ > + u64 ts; > + > + asm volatile ("isb; mrs %0, cntvct_el0" : "=r" (ts)); > + > + return ts; > +} I do not think it's a good idea to read the counter directly like this. What is this "reference" intended to be meaningful relative to? Why do we need to do this in userspace? Can we not ask the kernel to output timestamps instead? > + > +static void arm_spe_recording_free(struct auxtrace_record *itr) > +{ > + struct arm_spe_recording *sper = > + container_of(itr, struct arm_spe_recording, itr); > + > + free(sper); > +} > + > +static int arm_spe_read_finish(struct auxtrace_record *itr, int idx) > +{ > + struct arm_spe_recording *sper = > + container_of(itr, struct arm_spe_recording, itr); > + struct perf_evsel *evsel; > + > + evlist__for_each_entry(sper->evlist, evsel) { > + if (evsel->attr.type == sper->arm_spe_pmu->type) > + return perf_evlist__enable_event_idx(sper->evlist, > + evsel, idx); > + } > + return -EINVAL; > +} > + > +struct auxtrace_record *arm_spe_recording_init(int *err) > +{ > + struct perf_pmu *arm_spe_pmu = perf_pmu__find(ARM_SPE_PMU_NAME); > + struct arm_spe_recording *sper; > + > + if (!arm_spe_pmu) > + return NULL; No need to set *err here? > + > + sper = zalloc(sizeof(struct arm_spe_recording)); > + if (!sper) { > + *err = -ENOMEM; > + return NULL; > + } ... as we do here? [...] > + > + sper->arm_spe_pmu = arm_spe_pmu; > + sper->itr.recording_options = arm_spe_recording_options; > + sper->itr.info_priv_size = arm_spe_info_priv_size; > + sper->itr.info_fill = arm_spe_info_fill; > + sper->itr.free = arm_spe_recording_free; > + sper->itr.reference = arm_spe_reference; > + sper->itr.read_finish = arm_spe_read_finish; > + sper->itr.alignment = 0; > + return &sper->itr; > +} > diff --git a/tools/perf/util/Build b/tools/perf/util/Build > index 79dea95a7f68..4ed31e88b8ee 100644 > --- a/tools/perf/util/Build > +++ b/tools/perf/util/Build > @@ -82,6 +82,8 @@ libperf-$(CONFIG_AUXTRACE) += auxtrace.o > libperf-$(CONFIG_AUXTRACE) += intel-pt-decoder/ > libperf-$(CONFIG_AUXTRACE) += intel-pt.o > libperf-$(CONFIG_AUXTRACE) += intel-bts.o > +libperf-$(CONFIG_AUXTRACE) += arm-spe.o > +libperf-$(CONFIG_AUXTRACE) += arm-spe-pkt-decoder.o > libperf-y += parse-branch-options.o > libperf-y += dump-insn.o > libperf-y += parse-regs-options.o > diff --git a/tools/perf/util/arm-spe-pkt-decoder.c b/tools/perf/util/arm-spe-pkt-decoder.c > new file mode 100644 > index 000000000000..ca3813d5b91a > --- /dev/null > +++ b/tools/perf/util/arm-spe-pkt-decoder.c > @@ -0,0 +1,448 @@ > +/* > + * ARM Statistical Profiling Extensions (SPE) support > + * Copyright (c) 2017, ARM Ltd. > + * > + * This program is free software; you can redistribute it and/or modify it > + * under the terms and conditions of the GNU General Public License, > + * version 2, as published by the Free Software Foundation. > + * > + * This program is distributed in the hope it will be useful, but WITHOUT > + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or > + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for > + * more details. > + * > + */ > + > +#include <stdio.h> > +#include <string.h> > +#include <endian.h> > +#include <byteswap.h> > + > +#include "arm-spe-pkt-decoder.h" > + > +#define BIT(n) (1 << (n)) > + > +#define BIT61 ((uint64_t)1 << 61) > +#define BIT62 ((uint64_t)1 << 62) > +#define BIT63 ((uint64_t)1 << 63) > + > +#define NS_FLAG BIT63 > +#define EL_FLAG (BIT62 | BIT61) > + > +#if __BYTE_ORDER == __BIG_ENDIAN > +#define le16_to_cpu bswap_16 > +#define le32_to_cpu bswap_32 > +#define le64_to_cpu bswap_64 > +#define memcpy_le64(d, s, n) do { \ > + memcpy((d), (s), (n)); \ > + *(d) = le64_to_cpu(*(d)); \ > +} while (0) > +#else > +#define le16_to_cpu > +#define le32_to_cpu > +#define le64_to_cpu > +#define memcpy_le64 memcpy > +#endif > + > +static const char * const arm_spe_packet_name[] = { > + [ARM_SPE_PAD] = "PAD", > + [ARM_SPE_END] = "END", > + [ARM_SPE_TIMESTAMP] = "TS", > + [ARM_SPE_ADDRESS] = "ADDR", > + [ARM_SPE_COUNTER] = "LAT", > + [ARM_SPE_CONTEXT] = "CONTEXT", > + [ARM_SPE_INSN_TYPE] = "INSN-TYPE", > + [ARM_SPE_EVENTS] = "EVENTS", > + [ARM_SPE_DATA_SOURCE] = "DATA-SOURCE", > +}; > + > +const char *arm_spe_pkt_name(enum arm_spe_pkt_type type) > +{ > + return arm_spe_packet_name[type]; > +} > + > +/* return ARM SPE payload size from its encoding: > + * 00 : byte > + * 01 : halfword (2) > + * 10 : word (4) > + * 11 : doubleword (8) > + */ > +static int payloadlen(unsigned char byte) > +{ > + return 1 << ((byte & 0x30) >> 4); > +} > + > +static int arm_spe_get_pad(struct arm_spe_pkt *packet) > +{ > + packet->type = ARM_SPE_PAD; > + return 1; > +} > + > +static int arm_spe_get_alignment(const unsigned char *buf, size_t len, > + struct arm_spe_pkt *packet) > +{ > + unsigned int alignment = 1 << ((buf[0] & 0xf) + 1); > + > + if (len < alignment) > + return ARM_SPE_NEED_MORE_BYTES; > + > + packet->type = ARM_SPE_PAD; > + return alignment - (((uint64_t)buf) & (alignment - 1)); > +} > + > +static int arm_spe_get_end(struct arm_spe_pkt *packet) > +{ > + packet->type = ARM_SPE_END; > + return 1; > +} > + > +static int arm_spe_get_timestamp(const unsigned char *buf, size_t len, > + struct arm_spe_pkt *packet) > +{ > + if (len < 8) > + return ARM_SPE_NEED_MORE_BYTES; > + > + packet->type = ARM_SPE_TIMESTAMP; > + memcpy_le64(&packet->payload, buf + 1, 8); > + > + return 1 + 8; > +} > + > +static int arm_spe_get_events(const unsigned char *buf, size_t len, > + struct arm_spe_pkt *packet) > +{ > + unsigned int events_len = payloadlen(buf[0]); > + > + if (len < events_len) > + return ARM_SPE_NEED_MORE_BYTES; Isn't len the size of the whole buffer? So isn't this failing to account for the header byte? > + > + packet->type = ARM_SPE_EVENTS; > + packet->index = events_len; Huh? The events packet has no "index" field, so why do we need this? > + switch (events_len) { > + case 1: packet->payload = *(uint8_t *)(buf + 1); break; > + case 2: packet->payload = le16_to_cpu(*(uint16_t *)(buf + 1)); break; > + case 4: packet->payload = le32_to_cpu(*(uint32_t *)(buf + 1)); break; > + case 8: packet->payload = le64_to_cpu(*(uint64_t *)(buf + 1)); break; > + default: return ARM_SPE_BAD_PACKET; > + } > + > + return 1 + events_len; > +} > + > +static int arm_spe_get_data_source(const unsigned char *buf, > + struct arm_spe_pkt *packet) > +{ > + int len = payloadlen(buf[0]); > + > + packet->type = ARM_SPE_DATA_SOURCE; > + if (len == 1) > + packet->payload = buf[1]; > + else if (len == 2) > + packet->payload = le16_to_cpu(*(uint16_t *)(buf + 1)); > + > + return 1 + len; > +} For those packets with a payload, the header has a uniform format describing the payload size. Given that, can't we make the payload extraction generic, regardless of the packet type? e.g. something like: static int arm_spe_get_payload(const unsigned char *buf, size_t len, struct arm_spe_pkt *packet) { <determine paylaod size> <length check> <switch> <return nr consumed bytes (inc header), or error> } static int arm_spe_get_events(const unsigned char *buf, size_t len, struct arm_spe_pkt *packet) { packet->type = ARM_SPE_EVENTS; return arm_spe_get_payload(buf, len, packet); } static int arm_spe_get_data_source(const unsigned char *buf, struct arm_spe_pkt *packet) { packet->type = ARM_SPE_DATA_SOURCE; return arm_spe_get_payload(buf, len, packet); } ... and so on for the other packets with a payload. > +static int arm_spe_do_get_packet(const unsigned char *buf, size_t len, > + struct arm_spe_pkt *packet) > +{ > + unsigned int byte; > + > + memset(packet, 0, sizeof(struct arm_spe_pkt)); > + > + if (!len) > + return ARM_SPE_NEED_MORE_BYTES; > + > + byte = buf[0]; > + if (byte == 0) > + return arm_spe_get_pad(packet); > + else if (byte == 1) /* no timestamp at end of record */ > + return arm_spe_get_end(packet); > + else if (byte & 0xc0 /* 0y11000000 */) { > + if (byte & 0x80) { > + /* 0x38 is 0y00111000 */ > + if ((byte & 0x38) == 0x30) /* address packet (short) */ > + return arm_spe_get_addr(buf, len, 0, packet); > + if ((byte & 0x38) == 0x18) /* counter packet (short) */ > + return arm_spe_get_counter(buf, len, 0, packet); > + } else > + if (byte == 0x71) > + return arm_spe_get_timestamp(buf, len, packet); > + else if ((byte & 0xf) == 0x2) > + return arm_spe_get_events(buf, len, packet); > + else if ((byte & 0xf) == 0x3) > + return arm_spe_get_data_source(buf, packet); > + else if ((byte & 0x3c) == 0x24) > + return arm_spe_get_context(buf, len, packet); > + else if ((byte & 0x3c) == 0x8) > + return arm_spe_get_insn_type(buf, packet); Could we have some mnemonics for these? e.g. #define SPE_HEADER0_PAD 0x0 #define SPE_HEADER0_END 0x1 #define SPE_HEADER0_EVENTS 0x42 #define SPE_HEADER0_EVENTS_MASK 0xcf if (byte == SPE_HEADER0_PAD) { ... } else if (byte == SPE_HEADER0_END) { ... } else if ((byte & SPE_HEADER0_EVENTS_MASK) == SPE_HEADER0_EVENTS) { ... } ... which could even be turned into something table-driven. > + } else if ((byte & 0xe0) == 0x20 /* 0y00100000 */) { > + /* 16-bit header */ > + byte = buf[1]; > + if (byte == 0) > + return arm_spe_get_alignment(buf, len, packet); > + else if ((byte & 0xf8) == 0xb0) > + return arm_spe_get_addr(buf, len, 1, packet); > + else if ((byte & 0xf8) == 0x98) > + return arm_spe_get_counter(buf, len, 1, packet); > + } > + > + return ARM_SPE_BAD_PACKET; > +} > + > +int arm_spe_get_packet(const unsigned char *buf, size_t len, > + struct arm_spe_pkt *packet) > +{ > + int ret; > + > + ret = arm_spe_do_get_packet(buf, len, packet); > + if (ret > 0) { > + while (ret < 1 && len > (size_t)ret && !buf[ret]) > + ret += 1; > + } What is this trying to do? > + return ret; > +} > + > +int arm_spe_pkt_desc(const struct arm_spe_pkt *packet, char *buf, > + size_t buf_len) > +{ > + int ret, ns, el, index = packet->index; > + unsigned long long payload = packet->payload; > + const char *name = arm_spe_pkt_name(packet->type); > + > + switch (packet->type) { > + case ARM_SPE_BAD: > + case ARM_SPE_PAD: > + case ARM_SPE_END: > + return snprintf(buf, buf_len, "%s", name); > + case ARM_SPE_EVENTS: { [...] > + case ARM_SPE_DATA_SOURCE: > + case ARM_SPE_TIMESTAMP: > + return snprintf(buf, buf_len, "%s %lld", name, payload); > + case ARM_SPE_ADDRESS: > + switch (index) { > + case 0: > + case 1: ns = !!(packet->payload & NS_FLAG); > + el = (packet->payload & EL_FLAG) >> 61; > + payload &= ~(0xffULL << 56); > + return snprintf(buf, buf_len, "%s %llx el%d ns=%d", > + (index == 1) ? "TGT" : "PC", payload, el, ns); For TTBR1 addresses, this ends up losing the leading 0xff, giving us invalid addresses, which look odd. Can we please sign-extend bit 55 so that this gives us valid addresses regardless of TTBR? Could we please add a '0x' prefix to hex numbers, and use 0x%016llx so that things get padded consistently? Thanks, Mark.
On Thu, 29 Jun 2017 12:11:02 +0100 Mark Rutland <mark.rutland@arm.com> wrote: > On Wed, Jun 28, 2017 at 07:59:53PM -0500, Kim Phillips wrote: > > arm-trusted-firmware, btw, has just been updated to enable SPE at lower > > ELs, so I don't have to use a hacked-up version anymore. > > > > I also updated my BL33 to the latest upstream u-boot > > vexpress_aemv8a_dram_defconfig, and at least now the kernel continues > > to boot, even though it can't bring up 6 of the 7 secondary CPUs. > > Do you mean that you replaced the bootwrapper with u-boot? no, sorry, arm-trusted-firmware wants a BL33 image, which u-boot provides. Sorry but I guess I'm not using the bootwrapper, and we are launching the model in completely different manners. The bootwrapper input is a kernel and a dtb, and it emits a dtb and a linux-system.axf file, the latter of which I don't see how to launch the model with: The model script I'm using uses a kernel, dtb, and an fip.bin and bl1.bin. Can you share how you invoke the model, presumably with the .axf file? > The --with-cpu-ids option *adds* CPU nodes, but leaves the broken ones, > and your CPU phandles (and PPI partitions for the SPE node(s)) will all > be wrong. Linux is still seeing those erroneous CPU nodes (presumably > taking Linux CPU ids 2-7). > > Generally, --with-cpu-ids doesn't work as you'd expect, which is why it > got removed in favour of assuming an initally correct DT. > > Please fix the DT instead. With a fixed DT, and commit ccdc936924b3682d, > the bootwrapper won't further mangle your DT. OK, changing the CPU IDs alone didn't work (kernel didn't even say hi), but taking what commit ccdc936924b3682d does to the cpu_on/off properties makes it work for my arm-trusted-firmware (non-boot-wrapper) invocation, so I have to use the wrapper if I change my DT CPUs for the time being. So I'm OK now for at least the two-partition, four CPUs each setup, but for topologies as described in Marc/Will's fvp-base.dts commit, I don't see how to run without knowing how to make the axf file work with the model, i.e., solely with the boot-wrapper. Thanks, Kim
On Fri, 30 Jun 2017 15:02:41 +0100 Mark Rutland <mark.rutland@arm.com> wrote: > Hi Kim, Hi Mark, > On Wed, Jun 28, 2017 at 08:43:10PM -0500, Kim Phillips wrote: <snip> > > if (evlist) { > > evlist__for_each_entry(evlist, evsel) { > > if (cs_etm_pmu && > > evsel->attr.type == cs_etm_pmu->type) > > found_etm = true; > > + if (arm_spe_pmu && > > + evsel->attr.type == arm_spe_pmu->type) > > + found_spe = true; > > Given ARM_SPE_PMU_NAME is defined as "arm_spe_0", this won't detect all > SPE PMUs in heterogeneous setups (e.g. this'll fail to match "arm_spe_1" > and so on). > > Can we not find all PMUs with a "arm_spe_" prefix? > > ... or, taking a step back, do we need some sysfs "class" attribute to > identify multi-instance PMUs? Since there is one SPE per core, and it looks like the buffer full interrupt line is the only difference between the SPE device node specification in the device tree, I guess I don't understand why the driver doesn't accept a singular "arm_spe" from the tool, and manage interrupt handling accordingly. Also, if a set of CPUs are missing SPE support, and the user doesn't explicitly define a CPU affinity to outside that partition, then decline to run, stating why. This would also obviously help a lot from an ease-of-use perspective. <snip> > > +struct arm_spe_recording { > > + struct auxtrace_record itr; > > + struct perf_pmu *arm_spe_pmu; > > + struct perf_evlist *evlist; > > +}; > > A user may wich to record trace on separate uarches simultaneously, so > having a singleton arm_spe_pmu for the entire evlist doesn't seem right. > > e.g. I don't see why we should allow a user to do: > > ./perf record -c 1024 \ > -e arm_spe_0/ts_enable=1,pa_enable=0/ \ > -e arm_spe_1/ts_enable=1,pa_enable=0/ \ > ${WORKLOAD} > > ... which perf-record seems to accept today, but I don't seem to get any > aux trace, regardless of whether I taskset the entire thing to any > particular CPU. The implementation-defined components of the SPE don't affect a 'record'/capture operation, so a single arm_spe should be fine with separate uarch setups. > It also seems that the events aren't task bound, judging by -vv output: <snip> > I see something similar (i.e. perf doesn't try to bind the events to the > workload PID) when I try to record with only a single PMU. In that case, > perf-record blows up because it can't handle events on a subset of CPUs > (though it should be able to): > > nanook@torsk:~$ ./perf record -vv -c 1024 -e arm_spe_0/ts_enable=1,pa_enable=0/ true <snip> > mmap size 266240B > AUX area mmap length 131072 > perf event ring buffer mmapped per cpu > failed to mmap AUX area > failed to mmap with 524 (INTERNAL ERROR: strerror_r(524, 0xffffc8596038, 512)=22) FWIW, that INTERNAL ERROR is fixed by this commit, btw: commit 8a1898db51a3390241cd5fae267dc8aaa9db0f8b Author: Hendrik Brueckner <brueckner@linux.vnet.ibm.com> Date: Tue Jun 20 12:26:39 2017 +0200 perf/aux: Correct return code of rb_alloc_aux() if !has_aux(ev) So now it should return: failed to mmap with 95 (Operation not supported) > ... with a SW event, this works as expected, being bound to the workload PID: <snip> > ... so I guess this has something to do with the way the tool tries to > use the cpumask, maknig the wrong assumption that this implies > system-wide collection is mandatory / expected. Right, I'll take a look at it. > > + if (!opts->full_auxtrace) > > + return 0; > > + > > + /* We are in full trace mode but '-m,xyz' wasn't specified */ > > + if (opts->full_auxtrace && !opts->auxtrace_mmap_pages) { > > + if (privileged) { > > + opts->auxtrace_mmap_pages = MiB(4) / page_size; > > + } else { > > + opts->auxtrace_mmap_pages = KiB(128) / page_size; > > + if (opts->mmap_pages == UINT_MAX) > > + opts->mmap_pages = KiB(256) / page_size; > > + } > > + } > > + > > + /* Validate auxtrace_mmap_pages */ > > + if (opts->auxtrace_mmap_pages) { > > + size_t sz = opts->auxtrace_mmap_pages * (size_t)page_size; > > + size_t min_sz = KiB(8); > > + > > + if (sz < min_sz || !is_power_of_2(sz)) { > > + pr_err("Invalid mmap size for ARM SPE: must be at least %zuKiB and a power of 2\n", > > + min_sz / 1024); > > + return -EINVAL; > > + } > > + } > > + > > + /* > > + * To obtain the auxtrace buffer file descriptor, the auxtrace event > > + * must come first. > > + */ > > + perf_evlist__to_front(evlist, arm_spe_evsel); > > Huh? *what* needs the auxtrace buffer fd? > > This seems really fragile. Can't we store this elsewhere? It's copied from the bts code, and the other auxtrace record users do the same; it looks like auxtrace record has implicit dependencies on it? > > +static u64 arm_spe_reference(struct auxtrace_record *itr __maybe_unused) > > +{ > > + u64 ts; > > + > > + asm volatile ("isb; mrs %0, cntvct_el0" : "=r" (ts)); > > + > > + return ts; > > +} > > I do not think it's a good idea to read the counter directly like this. > > What is this "reference" intended to be meaningful relative to? AFAICT, it's just a nonce the perf tool uses to track unique events, and I thought this better than the ETM driver's heavier get-random implementation. > Why do we need to do this in userspace? > > Can we not ask the kernel to output timestamps instead? Why? This gets the job done faster. > > +static int arm_spe_get_events(const unsigned char *buf, size_t len, > > + struct arm_spe_pkt *packet) > > +{ > > + unsigned int events_len = payloadlen(buf[0]); > > + > > + if (len < events_len) > > + return ARM_SPE_NEED_MORE_BYTES; > > Isn't len the size of the whole buffer? So isn't this failing to account > for the header byte? well spotted; I changed /events_len/1 + events_len/. > > + packet->type = ARM_SPE_EVENTS; > > + packet->index = events_len; > > Huh? The events packet has no "index" field, so why do we need this? To identify Events with a less number of comparisons in arm_spe_pkt_desc(): E.g., the LLC-ACCESS, LLC-REFILL, and REMOTE-ACCESS events are identified iff index > 1. > > + switch (events_len) { > > + case 1: packet->payload = *(uint8_t *)(buf + 1); break; > > + case 2: packet->payload = le16_to_cpu(*(uint16_t *)(buf + 1)); break; > > + case 4: packet->payload = le32_to_cpu(*(uint32_t *)(buf + 1)); break; > > + case 8: packet->payload = le64_to_cpu(*(uint64_t *)(buf + 1)); break; > > + default: return ARM_SPE_BAD_PACKET; > > + } > > + > > + return 1 + events_len; > > +} > > + > > +static int arm_spe_get_data_source(const unsigned char *buf, > > + struct arm_spe_pkt *packet) > > +{ > > + int len = payloadlen(buf[0]); > > + > > + packet->type = ARM_SPE_DATA_SOURCE; > > + if (len == 1) > > + packet->payload = buf[1]; > > + else if (len == 2) > > + packet->payload = le16_to_cpu(*(uint16_t *)(buf + 1)); > > + > > + return 1 + len; > > +} > > For those packets with a payload, the header has a uniform format > describing the payload size. Given that, can't we make the payload > extraction generic, regardless of the packet type? > > e.g. something like: > > static int arm_spe_get_payload(const unsigned char *buf, size_t len, > struct arm_spe_pkt *packet) > { > <determine paylaod size> > <length check> > <switch> > <return nr consumed bytes (inc header), or error> > } > > static int arm_spe_get_events(const unsigned char *buf, size_t len, > struct arm_spe_pkt *packet) > { > packet->type = ARM_SPE_EVENTS; > return arm_spe_get_payload(buf, len, packet); > } > > static int arm_spe_get_data_source(const unsigned char *buf, > struct arm_spe_pkt *packet) > { > packet->type = ARM_SPE_DATA_SOURCE; > return arm_spe_get_payload(buf, len, packet); > } > > ... and so on for the other packets with a payload. done for TIMESTAMP, EVENTS, DATA_SOURCE, CONTEXT, INSN_TYPE. It wouldn't fit ADDR and COUNTER well since they can occur in an extended-header, and their lengths are encoded differently, and are fixed anyway. > > +static int arm_spe_do_get_packet(const unsigned char *buf, size_t len, > > + struct arm_spe_pkt *packet) > > +{ > > + unsigned int byte; > > + > > + memset(packet, 0, sizeof(struct arm_spe_pkt)); > > + > > + if (!len) > > + return ARM_SPE_NEED_MORE_BYTES; > > + > > + byte = buf[0]; > > + if (byte == 0) > > + return arm_spe_get_pad(packet); > > + else if (byte == 1) /* no timestamp at end of record */ > > + return arm_spe_get_end(packet); > > + else if (byte & 0xc0 /* 0y11000000 */) { > > + if (byte & 0x80) { > > + /* 0x38 is 0y00111000 */ > > + if ((byte & 0x38) == 0x30) /* address packet (short) */ > > + return arm_spe_get_addr(buf, len, 0, packet); > > + if ((byte & 0x38) == 0x18) /* counter packet (short) */ > > + return arm_spe_get_counter(buf, len, 0, packet); > > + } else > > + if (byte == 0x71) > > + return arm_spe_get_timestamp(buf, len, packet); > > + else if ((byte & 0xf) == 0x2) > > + return arm_spe_get_events(buf, len, packet); > > + else if ((byte & 0xf) == 0x3) > > + return arm_spe_get_data_source(buf, packet); > > + else if ((byte & 0x3c) == 0x24) > > + return arm_spe_get_context(buf, len, packet); > > + else if ((byte & 0x3c) == 0x8) > > + return arm_spe_get_insn_type(buf, packet); > > Could we have some mnemonics for these? > > e.g. > > #define SPE_HEADER0_PAD 0x0 > #define SPE_HEADER0_END 0x1 > > #define SPE_HEADER0_EVENTS 0x42 > #define SPE_HEADER0_EVENTS_MASK 0xcf > > if (byte == SPE_HEADER0_PAD) { > ... > } else if (byte == SPE_HEADER0_END) { > ... > } else if ((byte & SPE_HEADER0_EVENTS_MASK) == SPE_HEADER0_EVENTS) { > ... > } > > ... which could even be turned into something table-driven. It'd be a pretty sparse table, so I doubt it'd be faster, but if it is, I'd just as soon leave that type of space tradeoff decision to the compiler, given its optimization directives. I'll take a look at replacing the constants that have named equivalents with their named versions, even though it was pretty clear already what they denoted, given the name of the function each branch was calling, and the comments. > > + } else if ((byte & 0xe0) == 0x20 /* 0y00100000 */) { > > + /* 16-bit header */ > > + byte = buf[1]; > > + if (byte == 0) > > + return arm_spe_get_alignment(buf, len, packet); > > + else if ((byte & 0xf8) == 0xb0) > > + return arm_spe_get_addr(buf, len, 1, packet); > > + else if ((byte & 0xf8) == 0x98) > > + return arm_spe_get_counter(buf, len, 1, packet); > > + } > > + > > + return ARM_SPE_BAD_PACKET; > > +} > > + > > +int arm_spe_get_packet(const unsigned char *buf, size_t len, > > + struct arm_spe_pkt *packet) > > +{ > > + int ret; > > + > > + ret = arm_spe_do_get_packet(buf, len, packet); > > + if (ret > 0) { > > + while (ret < 1 && len > (size_t)ret && !buf[ret]) > > + ret += 1; > > + } > > What is this trying to do? Nothing! I've since fixed it to prevent multiple contiguous PADs from coming out on their own lines, and rather accumulate up to 16 (the width of the raw dump format) on one PAD-labeled line, like so: . 00007ec9: 00 00 00 00 00 00 00 00 00 00 PAD instead of this: . 00007ec9: 00 PAD . 00007eca: 00 PAD . 00007ecb: 00 PAD . 00007ecc: 00 PAD . 00007ecd: 00 PAD . 00007ece: 00 PAD . 00007ecf: 00 PAD . 00007ed0: 00 PAD . 00007ed1: 00 PAD . 00007ed2: 00 PAD thanks for pointing it out. > > +int arm_spe_pkt_desc(const struct arm_spe_pkt *packet, char *buf, > > + size_t buf_len) > > +{ > > + int ret, ns, el, index = packet->index; > > + unsigned long long payload = packet->payload; > > + const char *name = arm_spe_pkt_name(packet->type); > > + > > + switch (packet->type) { > > + case ARM_SPE_BAD: > > + case ARM_SPE_PAD: > > + case ARM_SPE_END: > > + return snprintf(buf, buf_len, "%s", name); > > + case ARM_SPE_EVENTS: { > > [...] > > > + case ARM_SPE_DATA_SOURCE: > > + case ARM_SPE_TIMESTAMP: > > + return snprintf(buf, buf_len, "%s %lld", name, payload); > > + case ARM_SPE_ADDRESS: > > + switch (index) { > > + case 0: > > + case 1: ns = !!(packet->payload & NS_FLAG); > > + el = (packet->payload & EL_FLAG) >> 61; > > + payload &= ~(0xffULL << 56); > > + return snprintf(buf, buf_len, "%s %llx el%d ns=%d", > > + (index == 1) ? "TGT" : "PC", payload, el, ns); > > For TTBR1 addresses, this ends up losing the leading 0xff, giving us > invalid addresses, which look odd. > > Can we please sign-extend bit 55 so that this gives us valid addresses > regardless of TTBR? I'll take a look at doing this once I get consistent output from an implementation. > Could we please add a '0x' prefix to hex numbers, and use 0x%016llx so > that things get padded consistently? I've added the 0x prefix, but prefer to not fix the length to 016: I don't see any direct benefit, rather see benefits to having the length vary, for output size control and less obvious reasons, e.g., sorting address lines by their length to get a sense of address groups caught during the run. FWIW, Intel doesn't do the 016 either. If I've omitted a response to the other comments, it's because they are being addressed. Thanks! Kim
Hi Kim, Sorry for the late reply. I see you've send an updated version. I'm replying here first so as to answer your queries, and I intend to look at the updated patch shortly. On Mon, Jul 17, 2017 at 07:48:22PM -0500, Kim Phillips wrote: > On Fri, 30 Jun 2017 15:02:41 +0100 Mark Rutland <mark.rutland@arm.com> wrote: > > On Wed, Jun 28, 2017 at 08:43:10PM -0500, Kim Phillips wrote: > <snip> > > > if (evlist) { > > > evlist__for_each_entry(evlist, evsel) { > > > if (cs_etm_pmu && > > > evsel->attr.type == cs_etm_pmu->type) > > > found_etm = true; > > > + if (arm_spe_pmu && > > > + evsel->attr.type == arm_spe_pmu->type) > > > + found_spe = true; > > > > Given ARM_SPE_PMU_NAME is defined as "arm_spe_0", this won't detect all > > SPE PMUs in heterogeneous setups (e.g. this'll fail to match "arm_spe_1" > > and so on). > > > > Can we not find all PMUs with a "arm_spe_" prefix? > > > > ... or, taking a step back, do we need some sysfs "class" attribute to > > identify multi-instance PMUs? > > Since there is one SPE per core, and it looks like the buffer full > interrupt line is the only difference between the SPE device node > specification in the device tree, I guess I don't understand why the > driver doesn't accept a singular "arm_spe" from the tool, and manage > interrupt handling accordingly. There are also differences which can be probed from the device, which are not described in the DT (but are described in sysfs). Some of these are exposed under sysfs. There may be further differences in subsequent revisions of the architecture, too. So the safest bet is to expose them separately, as we do for other CPU-affine PMUs in heterogeneous systems. > Also, if a set of CPUs are missing SPE support, and the user doesn't > explicitly define a CPU affinity to outside that partition, then > decline to run, stating why. It's possible for userspace to do this regardless; look for the set of SPE PMUs, and then look at their masks. [...] > > > + /* > > > + * To obtain the auxtrace buffer file descriptor, the auxtrace event > > > + * must come first. > > > + */ > > > + perf_evlist__to_front(evlist, arm_spe_evsel); > > > > Huh? *what* needs the auxtrace buffer fd? > > > > This seems really fragile. Can't we store this elsewhere? > > It's copied from the bts code, and the other auxtrace record users do > the same; it looks like auxtrace record has implicit dependencies on it? Is it definitely required? What happens if this isn't done? > > > +static u64 arm_spe_reference(struct auxtrace_record *itr __maybe_unused) > > > +{ > > > + u64 ts; > > > + > > > + asm volatile ("isb; mrs %0, cntvct_el0" : "=r" (ts)); > > > + > > > + return ts; > > > +} > > > > I do not think it's a good idea to read the counter directly like this. > > > > What is this "reference" intended to be meaningful relative to? > > AFAICT, it's just a nonce the perf tool uses to track unique events, > and I thought this better than the ETM driver's heavier get-random > implementation. > > > Why do we need to do this in userspace? > > > > Can we not ask the kernel to output timestamps instead? > > Why? This gets the job done faster. I had assumed that this needed to be correlated with the timestamps in the event. If this is a nonce, please don't read the counter directly in this way. It may be trapped/emulated by a higher EL, making it very heavyweight. The counter is only exposed so that the VDSO can use it, and that will avoid using it in cases where it is unsafe. [...] > > > +static int arm_spe_get_events(const unsigned char *buf, size_t len, > > > + struct arm_spe_pkt *packet) > > > +{ > > > + unsigned int events_len = payloadlen(buf[0]); > > > + > > > + if (len < events_len) > > > + return ARM_SPE_NEED_MORE_BYTES; > > > > Isn't len the size of the whole buffer? So isn't this failing to account > > for the header byte? > > well spotted; I changed /events_len/1 + events_len/. > > > > + packet->type = ARM_SPE_EVENTS; > > > + packet->index = events_len; > > > > Huh? The events packet has no "index" field, so why do we need this? > > To identify Events with a less number of comparisons in arm_spe_pkt_desc(): > E.g., the LLC-ACCESS, LLC-REFILL, and REMOTE-ACCESS events are > identified iff index > 1. It would be clearer to do the additional comparisons there. Does this make a measureable difference in practice? > > > + switch (events_len) { > > > + case 1: packet->payload = *(uint8_t *)(buf + 1); break; > > > + case 2: packet->payload = le16_to_cpu(*(uint16_t *)(buf + 1)); break; > > > + case 4: packet->payload = le32_to_cpu(*(uint32_t *)(buf + 1)); break; > > > + case 8: packet->payload = le64_to_cpu(*(uint64_t *)(buf + 1)); break; > > > + default: return ARM_SPE_BAD_PACKET; > > > + } > > > + > > > + return 1 + events_len; > > > +} > > > + > > > +static int arm_spe_get_data_source(const unsigned char *buf, > > > + struct arm_spe_pkt *packet) > > > +{ > > > + int len = payloadlen(buf[0]); > > > + > > > + packet->type = ARM_SPE_DATA_SOURCE; > > > + if (len == 1) > > > + packet->payload = buf[1]; > > > + else if (len == 2) > > > + packet->payload = le16_to_cpu(*(uint16_t *)(buf + 1)); > > > + > > > + return 1 + len; > > > +} > > > > For those packets with a payload, the header has a uniform format > > describing the payload size. Given that, can't we make the payload > > extraction generic, regardless of the packet type? > > > > e.g. something like: > > > > static int arm_spe_get_payload(const unsigned char *buf, size_t len, > > struct arm_spe_pkt *packet) > > { > > <determine paylaod size> > > <length check> > > <switch> > > <return nr consumed bytes (inc header), or error> > > } > > > > static int arm_spe_get_events(const unsigned char *buf, size_t len, > > struct arm_spe_pkt *packet) > > { > > packet->type = ARM_SPE_EVENTS; > > return arm_spe_get_payload(buf, len, packet); > > } > > > > static int arm_spe_get_data_source(const unsigned char *buf, > > struct arm_spe_pkt *packet) > > { > > packet->type = ARM_SPE_DATA_SOURCE; > > return arm_spe_get_payload(buf, len, packet); > > } > > > > ... and so on for the other packets with a payload. > > done for TIMESTAMP, EVENTS, DATA_SOURCE, CONTEXT, INSN_TYPE. It > wouldn't fit ADDR and COUNTER well since they can occur in an > extended-header, and their lengths are encoded differently, and are > fixed anyway. Ok. That sounds good to me. [...] > > > +int arm_spe_pkt_desc(const struct arm_spe_pkt *packet, char *buf, > > > + size_t buf_len) > > > +{ > > > + int ret, ns, el, index = packet->index; > > > + unsigned long long payload = packet->payload; > > > + const char *name = arm_spe_pkt_name(packet->type); > > > + > > > + switch (packet->type) { > > > + case ARM_SPE_ADDRESS: > > > + switch (index) { > > > + case 0: > > > + case 1: ns = !!(packet->payload & NS_FLAG); > > > + el = (packet->payload & EL_FLAG) >> 61; > > > + payload &= ~(0xffULL << 56); > > > + return snprintf(buf, buf_len, "%s %llx el%d ns=%d", > > > + (index == 1) ? "TGT" : "PC", payload, el, ns); > > Could we please add a '0x' prefix to hex numbers, and use 0x%016llx so > > that things get padded consistently? > > I've added the 0x prefix, but prefer to not fix the length to 016: I > don't see any direct benefit, rather see benefits to having the length > vary, for output size control and less obvious reasons, e.g., sorting > address lines by their length to get a sense of address groups caught > during the run. FWIW, Intel doesn't do the 016 either. With padding, sorting will also place address groups together, so I'm not sure I follow. Padding makes it *much* easier to scan over the output by eye, as columns of event data will always share the same alignment. > If I've omitted a response to the other comments, it's because they > are being addressed. Cool! Thanks, Mark.
Hi Kim, On Thu, Aug 17, 2017 at 10:11:50PM -0500, Kim Phillips wrote: > Hi Mark, I've tried to proceed as much as possible without your > response, so if you still have comments to my above comments, please > comment in-line above, otherwise review the v2 patch below? Apologies again for the late response, and thanks for the updated patch! [...] > From 464d943dcac15d946863399001174e4dc4e00594 Mon Sep 17 00:00:00 2001 > From: Kim Phillips <kim.phillips@arm.com> > Date: Wed, 8 Feb 2017 17:11:57 -0600 > Subject: [PATCH v2] perf tools: Add ARM Statistical Profiling Extensions > (SPE) support > > 'perf record' and 'perf report --dump-raw-trace' supported in this release > > Example usage: > > taskset -c 2 ./perf record -C 2 -c 1024 -e arm_spe_0/ts_enable=1,pa_enable=1/ \ > dd if=/dev/zero of=/dev/null count=10000 > > perf report --dump-raw-trace > > Note that the perf.data file is portable, so the report can be run on another > architecture host if necessary. > > Output will contain raw SPE data and its textual representation, such as: > > 0xc7d0 [0x30]: PERF_RECORD_AUXTRACE size: 0x82f70 offset: 0 ref: 0x1e947e88189 idx: 0 tid: -1 cpu: 2 > . > . ... ARM SPE data: size 536432 bytes > . 00000000: 4a 01 B COND > . 00000002: b1 00 00 00 00 00 00 00 80 TGT 0 el0 ns=1 > . 0000000b: 42 42 RETIRED NOT-TAKEN > . 0000000d: b0 20 41 c0 ad ff ff 00 80 PC ffffadc04120 el0 ns=1 > . 00000016: 98 00 00 LAT 0 TOT > . 00000019: 71 80 3e f7 46 e9 01 00 00 TS 2101429616256 > . 00000022: 49 01 ST > . 00000024: b2 50 bd ba 73 00 80 ff ff VA ffff800073babd50 > . 0000002d: b3 50 bd ba f3 00 00 00 80 PA f3babd50 ns=1 > . 00000036: 9a 00 00 LAT 0 XLAT > . 00000039: 42 16 RETIRED L1D-ACCESS TLB-ACCESS > . 0000003b: b0 8c b4 1e 08 00 00 ff ff PC ff0000081eb48c el3 ns=1 > . 00000044: 98 00 00 LAT 0 TOT > . 00000047: 71 cc 44 f7 46 e9 01 00 00 TS 2101429617868 > . 00000050: 48 00 INSN-OTHER > . 00000052: 42 02 RETIRED > . 00000054: b0 58 54 1f 08 00 00 ff ff PC ff0000081f5458 el3 ns=1 > . 0000005d: 98 00 00 LAT 0 TOT > . 00000060: 71 cc 44 f7 46 e9 01 00 00 TS 2101429617868 So FWIW, I think this is a good example of why that padding I requested last time round matters. For the first PC packet, I had to count the number of characters to see that it was a TTBR0 address, which is made much clearer with leading padding, as 0000ffffadc04120. With the addresses padded, the EL and NS fields would also be aligned, making it *much* easier to scan by eye. [...] > - multiple SPE clusters/domains support pending potential driver changes? As covered in my other reply, I don't believe that the driver is going to change in this regard. Userspace will need to handle multiple SPE instances. I'll ignore that in the code below for now. > - CPU mask / new record behaviour bisected to commit e3ba76deef23064 "perf > tools: Force uncore events to system wide monitoring". Waiting to hear back > on why driver can't do system wide monitoring, even across PPIs, by e.g., > sharing the SPE interrupts in one handler (SPE's don't differ in this record > regard). Could you elaborate on this? I don't follow the interrupt handler comments. [...] > +static u64 arm_spe_reference(struct auxtrace_record *itr __maybe_unused) > +{ > + u64 ts; > + > + asm volatile ("isb; mrs %0, cntvct_el0" : "=r" (ts)); > + > + return ts; > +} As covered in my other reply, please don't use the counter for this. It sounds like we need a simple/generic function to get a nonce, that we could share with the ETM code. [...] > +#define BIT(n) (1 << (n)) > + > +#define BIT61 ((uint64_t)1 << 61) > +#define BIT62 ((uint64_t)1 << 62) > +#define BIT63 ((uint64_t)1 << 63) > + > +#define NS_FLAG BIT63 > +#define EL_FLAG (BIT62 | BIT61) This would be far simpler as: #define BIT(n) (1UL << (n)) #define NS_FLAG BIT(63) #define EL_FLAG (BIT(62) | BIT(61)) [...] > +/* return ARM SPE payload size from its encoding: > + * 00 : byte > + * 01 : halfword (2) > + * 10 : word (4) > + * 11 : doubleword (8) > + */ > +static int payloadlen(unsigned char byte) > +{ > + return 1 << ((byte & 0x30) >> 4); > +} It might be worth stating in the comment that this is encoded in bits 5:4 of the byte, since otherwise it looks odd. > + > +static int arm_spe_get_payload(const unsigned char *buf, size_t len, > + struct arm_spe_pkt *packet) > +{ > + size_t payload_len = payloadlen(buf[0]); > + > + if (len < 1 + payload_len) > + return ARM_SPE_NEED_MORE_BYTES; If you did `buf++` here, you could avoid the `+ 1` in all the cases below. > + > + switch (payload_len) { > + case 1: packet->payload = *(uint8_t *)(buf + 1); break; > + case 2: packet->payload = le16_to_cpu(*(uint16_t *)(buf + 1)); break; > + case 4: packet->payload = le32_to_cpu(*(uint32_t *)(buf + 1)); break; > + case 8: packet->payload = le64_to_cpu(*(uint64_t *)(buf + 1)); break; > + default: return ARM_SPE_BAD_PACKET; > + } > + > + return 1 + payload_len; > +} [...] > +int arm_spe_get_packet(const unsigned char *buf, size_t len, > + struct arm_spe_pkt *packet) > +{ > + int ret; > + > + ret = arm_spe_do_get_packet(buf, len, packet); > + if (ret > 0 && packet->type == ARM_SPE_PAD) { > + while (ret < 16 && len > (size_t)ret && !buf[ret]) > + ret += 1; > + } > + return ret; > +} What's this doing? Skipping padding? What's the significance of 16? > +int arm_spe_pkt_desc(const struct arm_spe_pkt *packet, char *buf, > + size_t buf_len) > +{ > + int ret, ns, el, index = packet->index; > + unsigned long long payload = packet->payload; > + const char *name = arm_spe_pkt_name(packet->type); > + > + switch (packet->type) { > + case ARM_SPE_BAD: > + case ARM_SPE_PAD: > + case ARM_SPE_END: > + return snprintf(buf, buf_len, "%s", name); > + case ARM_SPE_EVENTS: { > + size_t blen = buf_len; > + > + ret = 0; > + ret = snprintf(buf, buf_len, "EV"); > + buf += ret; > + blen -= ret; > + if (payload & 0x1) { > + ret = snprintf(buf, buf_len, " EXCEPTION-GEN"); > + buf += ret; > + blen -= ret; > + } > + if (payload & 0x2) { > + ret = snprintf(buf, buf_len, " RETIRED"); > + buf += ret; > + blen -= ret; > + } > + if (payload & 0x4) { > + ret = snprintf(buf, buf_len, " L1D-ACCESS"); > + buf += ret; > + blen -= ret; > + } > + if (payload & 0x8) { > + ret = snprintf(buf, buf_len, " L1D-REFILL"); > + buf += ret; > + blen -= ret; > + } > + if (payload & 0x10) { > + ret = snprintf(buf, buf_len, " TLB-ACCESS"); > + buf += ret; > + blen -= ret; > + } > + if (payload & 0x20) { > + ret = snprintf(buf, buf_len, " TLB-REFILL"); > + buf += ret; > + blen -= ret; > + } > + if (payload & 0x40) { > + ret = snprintf(buf, buf_len, " NOT-TAKEN"); > + buf += ret; > + blen -= ret; > + } > + if (payload & 0x80) { > + ret = snprintf(buf, buf_len, " MISPRED"); > + buf += ret; > + blen -= ret; > + } > + if (index > 1) { > + if (payload & 0x100) { > + ret = snprintf(buf, buf_len, " LLC-ACCESS"); > + buf += ret; > + blen -= ret; > + } > + if (payload & 0x200) { > + ret = snprintf(buf, buf_len, " LLC-REFILL"); > + buf += ret; > + blen -= ret; > + } > + if (payload & 0x400) { > + ret = snprintf(buf, buf_len, " REMOTE-ACCESS"); > + buf += ret; > + blen -= ret; > + } > + } > + if (ret < 0) > + return ret; > + blen -= ret; > + return buf_len - blen; > + } This looks like it could be turned into another switch, sharing the repeated logic. Thanks, Mark.
On Fri, 18 Aug 2017 17:59:25 +0100 Mark Rutland <mark.rutland@arm.com> wrote: > On Mon, Jul 17, 2017 at 07:48:22PM -0500, Kim Phillips wrote: > > On Fri, 30 Jun 2017 15:02:41 +0100 Mark Rutland <mark.rutland@arm.com> wrote: > > > On Wed, Jun 28, 2017 at 08:43:10PM -0500, Kim Phillips wrote: > > <snip> > > > > if (evlist) { > > > > evlist__for_each_entry(evlist, evsel) { > > > > if (cs_etm_pmu && > > > > evsel->attr.type == cs_etm_pmu->type) > > > > found_etm = true; > > > > + if (arm_spe_pmu && > > > > + evsel->attr.type == arm_spe_pmu->type) > > > > + found_spe = true; > > > > > > Given ARM_SPE_PMU_NAME is defined as "arm_spe_0", this won't detect all > > > SPE PMUs in heterogeneous setups (e.g. this'll fail to match "arm_spe_1" > > > and so on). > > > > > > Can we not find all PMUs with a "arm_spe_" prefix? > > > > > > ... or, taking a step back, do we need some sysfs "class" attribute to > > > identify multi-instance PMUs? > > > > Since there is one SPE per core, and it looks like the buffer full > > interrupt line is the only difference between the SPE device node > > specification in the device tree, I guess I don't understand why the > > driver doesn't accept a singular "arm_spe" from the tool, and manage > > interrupt handling accordingly. > > There are also differences which can be probed from the device, which The only thing I see is PMSIDR fields describing things like minimum recommended sampling interval. So if CPU A's SPE has that as 256, and CPU B's is 512, then deny the user asking for a 256 interval across the two CPUs. Or, better yet, issue a warning stating the driver has raised the interval to the lowest common denominator of all CPU SPEs involved (512 in the above case). > are not described in the DT (but are described in sysfs). Some of these > are exposed under sysfs. > > There may be further differences in subsequent revisions of the > architecture, too. Future SPE lowest common denominator rules can be amended according to the capabilities of the new system. > So the safest bet is to expose them separately, as we > do for other CPU-affine PMUs in heterogeneous systems. Yes, perf is very hard to use on heterogeneous systems for this reason. Cycles are cycles, it doesn't matter whether they're on an A53 or an A72. Meanwhile, this type of driver behaviour - and the fact that the drivers are mute - hurts usability in heterogeneous environments, and can easily be avoided. > > Also, if a set of CPUs are missing SPE support, and the user doesn't > > explicitly define a CPU affinity to outside that partition, then > > decline to run, stating why. > > It's possible for userspace to do this regardless; look for the set of > SPE PMUs, and then look at their masks. The driver still has to check if what the user is asking for, is doable. They also may not be using the perf tool. > > > > + /* > > > > + * To obtain the auxtrace buffer file descriptor, the auxtrace event > > > > + * must come first. > > > > + */ > > > > + perf_evlist__to_front(evlist, arm_spe_evsel); > > > > > > Huh? *what* needs the auxtrace buffer fd? > > > > > > This seems really fragile. Can't we store this elsewhere? > > > > It's copied from the bts code, and the other auxtrace record users do > > the same; it looks like auxtrace record has implicit dependencies on it? > > Is it definitely required? What happens if this isn't done? It says it wouldn't obtain the auxtrace buffer file descriptor. > > > > +static u64 arm_spe_reference(struct auxtrace_record *itr __maybe_unused) > > > > +{ > > > > + u64 ts; > > > > + > > > > + asm volatile ("isb; mrs %0, cntvct_el0" : "=r" (ts)); > > > > + > > > > + return ts; > > > > +} > > > > > > I do not think it's a good idea to read the counter directly like this. > > > > > > What is this "reference" intended to be meaningful relative to? > > > > AFAICT, it's just a nonce the perf tool uses to track unique events, > > and I thought this better than the ETM driver's heavier get-random > > implementation. > > > > > Why do we need to do this in userspace? > > > > > > Can we not ask the kernel to output timestamps instead? > > > > Why? This gets the job done faster. > > I had assumed that this needed to be correlated with the timestamps in > the event. > > If this is a nonce, please don't read the counter directly in this way. > It may be trapped/emulated by a higher EL, making it very heavyweight. > The counter is only exposed so that the VDSO can use it, and that will > avoid using it in cases where it is unsafe. Got it, thanks. > > > > + packet->type = ARM_SPE_EVENTS; > > > > + packet->index = events_len; > > > > > > Huh? The events packet has no "index" field, so why do we need this? > > > > To identify Events with a less number of comparisons in arm_spe_pkt_desc(): > > E.g., the LLC-ACCESS, LLC-REFILL, and REMOTE-ACCESS events are > > identified iff index > 1. > > It would be clearer to do the additional comparisons there. > > Does this make a measureable difference in practice? It should - I'll add a comment. > > > > +int arm_spe_pkt_desc(const struct arm_spe_pkt *packet, char *buf, > > > > + size_t buf_len) > > > > +{ > > > > + int ret, ns, el, index = packet->index; > > > > + unsigned long long payload = packet->payload; > > > > + const char *name = arm_spe_pkt_name(packet->type); > > > > + > > > > + switch (packet->type) { > > > > > + case ARM_SPE_ADDRESS: > > > > + switch (index) { > > > > + case 0: > > > > + case 1: ns = !!(packet->payload & NS_FLAG); > > > > + el = (packet->payload & EL_FLAG) >> 61; > > > > + payload &= ~(0xffULL << 56); > > > > + return snprintf(buf, buf_len, "%s %llx el%d ns=%d", > > > > + (index == 1) ? "TGT" : "PC", payload, el, ns); > > > > Could we please add a '0x' prefix to hex numbers, and use 0x%016llx so > > > that things get padded consistently? > > > > I've added the 0x prefix, but prefer to not fix the length to 016: I > > don't see any direct benefit, rather see benefits to having the length > > vary, for output size control and less obvious reasons, e.g., sorting > > address lines by their length to get a sense of address groups caught > > during the run. FWIW, Intel doesn't do the 016 either. > > With padding, sorting will also place address groups together, so I'm > not sure I follow. sorting by *line length* can be done to easily assess the address groups in a dump: $ grep -w PC dump | awk '{ print length, $0 }' | sort -nu 77 . 00000080: b0 00 00 00 00 00 00 00 a0 PC 0x0 el1 ns=1 82 . 00000000: b0 94 61 43 00 00 00 00 80 PC 0x436194 el0 ns=1 88 . 00000000: b0 50 20 ac a7 ff ff 00 80 PC 0xffffa7ac2050 el0 ns=1 89 . 00000040: b0 80 2d 08 08 00 00 01 a0 PC 0x1000008082d80 el1 ns=1 > Padding makes it *much* easier to scan over the output by eye, as > columns of event data will always share the same alignment. Addresses are already technically misaligned by virtue of their being prepended with "PC" (2 chars) vs. "TGT" (3 chars): 82 . 00000000: b0 94 61 43 00 00 00 00 80 PC 0x436194 el0 ns=1 83 . 0000001e: b1 68 61 43 00 00 00 00 80 TGT 0x436168 el0 ns=1 89 . 00000040: b0 80 2d 08 08 00 00 01 a0 PC 0x1000008082d80 el1 ns=1 91 . 000005de: b1 ec 9a 92 08 00 00 ff a0 TGT 0xff000008929aec el1 ns=1 If you're talking about the postpended "elX ns=Y", well, that less significant given the variable length is more quickly detected by the eye - giving the astute reader hints of which execution level the address is in - and can be parsed using variable length delimeters. OTOH, we can rename the tokens, e.g., current PC -> {NS,SE}EL{0,1,2,3}PC 0xAAAA current TGT -> {NS,SE}EL{0,1,2,3}BT 0xAAAA Where "BT" -> "Branch Target", which admittedly is less obvious to the uninitiated. So the last sample above would be: 89 . 00000040: b0 80 2d 08 08 00 00 01 a0 NSEL1PC 0x1000008082d80 91 . 000005de: b1 ec 9a 92 08 00 00 ff a0 NSEL1BT 0xff000008929aec Is that better though? Are there others opinionated here? I'll get to the v2 review comments later. Thanks for your feedback! Kim
On Fri, 18 Aug 2017 18:36:09 +0100 Mark Rutland <mark.rutland@arm.com> wrote: > Hi Kim, Hi Mark, > On Thu, Aug 17, 2017 at 10:11:50PM -0500, Kim Phillips wrote: > > Hi Mark, I've tried to proceed as much as possible without your > > response, so if you still have comments to my above comments, please > > comment in-line above, otherwise review the v2 patch below? > > Apologies again for the late response, and thanks for the updated patch! Thanks for your prompt response this time around. > > . > > . ... ARM SPE data: size 536432 bytes > > . 00000000: 4a 01 B COND > > . 00000002: b1 00 00 00 00 00 00 00 80 TGT 0 el0 ns=1 > > . 0000000b: 42 42 RETIRED NOT-TAKEN > > . 0000000d: b0 20 41 c0 ad ff ff 00 80 PC ffffadc04120 el0 ns=1 > > . 00000016: 98 00 00 LAT 0 TOT > > . 00000019: 71 80 3e f7 46 e9 01 00 00 TS 2101429616256 > > . 00000022: 49 01 ST > > . 00000024: b2 50 bd ba 73 00 80 ff ff VA ffff800073babd50 > > . 0000002d: b3 50 bd ba f3 00 00 00 80 PA f3babd50 ns=1 > > . 00000036: 9a 00 00 LAT 0 XLAT > > . 00000039: 42 16 RETIRED L1D-ACCESS TLB-ACCESS > > . 0000003b: b0 8c b4 1e 08 00 00 ff ff PC ff0000081eb48c el3 ns=1 > > . 00000044: 98 00 00 LAT 0 TOT > > . 00000047: 71 cc 44 f7 46 e9 01 00 00 TS 2101429617868 > > . 00000050: 48 00 INSN-OTHER > > . 00000052: 42 02 RETIRED > > . 00000054: b0 58 54 1f 08 00 00 ff ff PC ff0000081f5458 el3 ns=1 > > . 0000005d: 98 00 00 LAT 0 TOT > > . 00000060: 71 cc 44 f7 46 e9 01 00 00 TS 2101429617868 > > So FWIW, I think this is a good example of why that padding I requested > last time round matters. > > For the first PC packet, I had to count the number of characters to see > that it was a TTBR0 address, which is made much clearer with leading > padding, as 0000ffffadc04120. With the addresses padded, the EL and NS > fields would also be aligned, making it *much* easier to scan by eye. See my response in my prior email. > > - multiple SPE clusters/domains support pending potential driver changes? > > As covered in my other reply, I don't believe that the driver is going > to change in this regard. Userspace will need to handle multiple SPE > instances. > > I'll ignore that in the code below for now. Please let's continue the discussion in one place, and again in this case, in the last email. > > - CPU mask / new record behaviour bisected to commit e3ba76deef23064 "perf > > tools: Force uncore events to system wide monitoring". Waiting to hear back > > on why driver can't do system wide monitoring, even across PPIs, by e.g., > > sharing the SPE interrupts in one handler (SPE's don't differ in this record > > regard). > > Could you elaborate on this? I don't follow the interrupt handler > comments. Would it be possible for the driver to request the IRQs with IRQF_SHARED, in order to be able to operate across the multiple PPIs? > > +static u64 arm_spe_reference(struct auxtrace_record *itr __maybe_unused) > > +{ > > + u64 ts; > > + > > + asm volatile ("isb; mrs %0, cntvct_el0" : "=r" (ts)); > > + > > + return ts; > > +} > > As covered in my other reply, please don't use the counter for this. > > It sounds like we need a simple/generic function to get a nonce, that > we could share with the ETM code. I've switched to using clock_gettime(CLOCK_MONOTONIC_RAW, ...). The ETM code uses two rand() calls, which, according to some minor benchmarking on Juno, is almost twice as slow as clock_gettime. It's three lines still, so I'll update the ETM code in-place independently of this patch, and after the gettime implementation is reviewed. > > +int arm_spe_get_packet(const unsigned char *buf, size_t len, > > + struct arm_spe_pkt *packet) > > +{ > > + int ret; > > + > > + ret = arm_spe_do_get_packet(buf, len, packet); > > + if (ret > 0 && packet->type == ARM_SPE_PAD) { > > + while (ret < 16 && len > (size_t)ret && !buf[ret]) > > + ret += 1; > > + } > > + return ret; > > +} > > What's this doing? Skipping padding? What's the significance of 16? I'll repeat the relevant part of the v2 changelog here: - do_get_packet fixed to handle excessive, successive PADding from a new source of raw SPE data, so instead of: . 000011ae: 00 PAD . 000011af: 00 PAD . 000011b0: 00 PAD . 000011b1: 00 PAD . 000011b2: 00 PAD . 000011b3: 00 PAD . 000011b4: 00 PAD . 000011b5: 00 PAD . 000011b6: 00 PAD we now get: . 000011ae: 00 00 00 00 00 00 00 00 00 PAD ...the 16 is the width of the dump format: max. 16 byte being displayed per line: I'll add a comment as such. > > +int arm_spe_pkt_desc(const struct arm_spe_pkt *packet, char *buf, > > + size_t buf_len) > > +{ > > + int ret, ns, el, index = packet->index; > > + unsigned long long payload = packet->payload; > > + const char *name = arm_spe_pkt_name(packet->type); > > + > > + switch (packet->type) { > > + case ARM_SPE_BAD: > > + case ARM_SPE_PAD: > > + case ARM_SPE_END: > > + return snprintf(buf, buf_len, "%s", name); > > + case ARM_SPE_EVENTS: { > > + size_t blen = buf_len; > > + > > + ret = 0; > > + ret = snprintf(buf, buf_len, "EV"); > > + buf += ret; > > + blen -= ret; > > + if (payload & 0x1) { > > + ret = snprintf(buf, buf_len, " EXCEPTION-GEN"); > > + buf += ret; > > + blen -= ret; > > + } > > + if (payload & 0x2) { > > + ret = snprintf(buf, buf_len, " RETIRED"); > > + buf += ret; > > + blen -= ret; > > + } > > + if (payload & 0x4) { > > + ret = snprintf(buf, buf_len, " L1D-ACCESS"); > > + buf += ret; > > + blen -= ret; > > + } > > + if (payload & 0x8) { > > + ret = snprintf(buf, buf_len, " L1D-REFILL"); > > + buf += ret; > > + blen -= ret; > > + } > > + if (payload & 0x10) { > > + ret = snprintf(buf, buf_len, " TLB-ACCESS"); > > + buf += ret; > > + blen -= ret; > > + } > > + if (payload & 0x20) { > > + ret = snprintf(buf, buf_len, " TLB-REFILL"); > > + buf += ret; > > + blen -= ret; > > + } > > + if (payload & 0x40) { > > + ret = snprintf(buf, buf_len, " NOT-TAKEN"); > > + buf += ret; > > + blen -= ret; > > + } > > + if (payload & 0x80) { > > + ret = snprintf(buf, buf_len, " MISPRED"); > > + buf += ret; > > + blen -= ret; > > + } > > + if (index > 1) { > > + if (payload & 0x100) { > > + ret = snprintf(buf, buf_len, " LLC-ACCESS"); > > + buf += ret; > > + blen -= ret; > > + } > > + if (payload & 0x200) { > > + ret = snprintf(buf, buf_len, " LLC-REFILL"); > > + buf += ret; > > + blen -= ret; > > + } > > + if (payload & 0x400) { > > + ret = snprintf(buf, buf_len, " REMOTE-ACCESS"); > > + buf += ret; > > + blen -= ret; > > + } > > + } > > + if (ret < 0) > > + return ret; > > + blen -= ret; > > + return buf_len - blen; > > + } > > This looks like it could be turned into another switch, sharing the > repeated logic. How, if the payload may have multiple bits set? I've addressed the rest of your comments and therefore trimmed them out. I can post a v3, but would rather shake out the pending issues first, so please reply to my comments on this and Friday's email (Date: Fri, 18 Aug 2017 17:22:48 -0500). Thanks, Kim