Message ID | 20210404083354.23060-1-psampat@linux.ibm.com |
---|---|
Series | CPU-Idle latency selftest framework |
Hi Pratik,

I tried V3 on an Intel i5-10600K processor with 6 cores and 12 CPUs.
The core to CPU mappings are:
core 0 has cpus 0 and 6
core 1 has cpus 1 and 7
core 2 has cpus 2 and 8
core 3 has cpus 3 and 9
core 4 has cpus 4 and 10
core 5 has cpus 5 and 11

By default, it will test CPUs 0,2,4,6,8,10 on cores 0,2,4,0,2,4.
Wouldn't it make more sense to test each core once?

With the source CPU always 0, I think the results
from the destination CPUs 0 and 6, on core 0, bias the results, at
least in the deeper idle states. They don't make much difference in
the shallow states. Myself, I wouldn't include them in the results.

Example, where I used the -v option for all CPUs:

--IPI Latency Test---
--Baseline IPI Latency measurement: CPU Busy--
SRC_CPU   DEST_CPU   IPI_Latency(ns)
0         0          101
0         1          790
0         2          609
0         3          595
0         4          737
0         5          759
0         6          780
0         7          741
0         8          574
0         9          681
0         10         527
0         11         552
Baseline Avg IPI latency(ns): 620  <<<< suggest 656 here
---Enabling state: 0---
SRC_CPU   DEST_CPU   IPI_Latency(ns)
0         0          76
0         1          471
0         2          420
0         3          462
0         4          454
0         5          468
0         6          453
0         7          473
0         8          380
0         9          483
0         10         492
0         11         454
Expected IPI latency(ns): 0
Observed Avg IPI latency(ns) - State 0: 423  <<<<< suggest 456 here
---Enabling state: 1---
SRC_CPU   DEST_CPU   IPI_Latency(ns)
0         0          112
0         1          866
0         2          663
0         3          851
0         4          1090
0         5          1314
0         6          1941
0         7          1458
0         8          687
0         9          802
0         10         1041
0         11         1284
Expected IPI latency(ns): 1000
Observed Avg IPI latency(ns) - State 1: 1009  <<<< suggest 1006 here
---Enabling state: 2---
SRC_CPU   DEST_CPU   IPI_Latency(ns)
0         0          75
0         1          16362
0         2          16785
0         3          19650
0         4          17356
0         5          17606
0         6          2217
0         7          17958
0         8          17332
0         9          16615
0         10         17382
0         11         17423
Expected IPI latency(ns): 120000
Observed Avg IPI latency(ns) - State 2: 14730  <<<< suggest 17447 here
---Enabling state: 3---
SRC_CPU   DEST_CPU   IPI_Latency(ns)
0         0          103
0         1          17416
0         2          17961
0         3          16651
0         4          17867
0         5          17726
0         6          2178
0         7          16620
0         8          20951
0         9          16567
0         10         17131
0         11         17563
Expected IPI latency(ns): 1034000
Observed Avg IPI latency(ns) - State 3: 14894  <<<< suggest 17645 here

Hope this helps.

... Doug
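
The suggested averages can be reproduced by dropping the destination CPUs that share a core with the source CPU. The following is a minimal Python sketch of that arithmetic using the baseline numbers reported above. It is not part of the selftest; the names `average_latency` and `source_core_cpus` are hypothetical, and truncating integer division is an assumption about how the reported averages were computed.

```python
# Baseline "CPU Busy" latencies (ns) from the run above, indexed by destination CPU.
baseline_ns = [101, 790, 609, 595, 737, 759, 780, 741, 574, 681, 527, 552]

# From the reported topology: CPUs 0 and 6 are the two threads of core 0,
# the core that the source CPU (CPU 0) runs on.
source_core_cpus = {0, 6}


def average_latency(latencies_ns, excluded_cpus=frozenset()):
    """Integer mean latency over destination CPUs not in excluded_cpus."""
    kept = [ns for cpu, ns in enumerate(latencies_ns) if cpu not in excluded_cpus]
    if not kept:
        raise ValueError("no destination CPUs left to average")
    # Assumption: truncating division, which matches the reported baseline figures.
    return sum(kept) // len(kept)


all_cpus_avg = average_latency(baseline_ns)
other_cores_avg = average_latency(baseline_ns, source_core_cpus)
print(all_cpus_avg, other_cores_avg)  # 620 656
```

The same function applied to the deeper idle states shows a much larger gap, because the two same-core measurements there are an order of magnitude below the rest.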
Hello Doug,

On 09/04/21 10:53 am, Doug Smythies wrote:
> Hi Pratik,
>
> I tried V3 on an Intel i5-10600K processor with 6 cores and 12 CPUs.
> The core to CPU mappings are:
> core 0 has cpus 0 and 6
> core 1 has cpus 1 and 7
> core 2 has cpus 2 and 8
> core 3 has cpus 3 and 9
> core 4 has cpus 4 and 10
> core 5 has cpus 5 and 11
>
> By default, it will test CPUs 0,2,4,6,8,10 on cores 0,2,4,0,2,4.
> Wouldn't it make more sense to test each core once?

Ideally it would be better to run on all the CPUs. However, on the larger
systems that I'm testing on, with hundreds of cores and a high thread count,
the execution time increases while not particularly bringing any additional
information to the table.

That is why it made sense to run on only one of the threads of each core, to
make the experiment faster while preserving accuracy.

To handle various thread topologies, it may be worthwhile to parse
/sys/devices/system/cpu/cpuX/topology/thread_siblings_list for each core and
use this information to run only once per physical core, rather than
assuming the topology.

What are your thoughts on a mechanism like this?

> With the source CPU always 0, I think the results
> from the destination CPUs 0 and 6, on core 0, bias the results, at
> least in the deeper idle states. They don't make much difference in
> the shallow states. Myself, I wouldn't include them in the results.

I agree, the CPU0->CPU0 same-core interaction is causing a bias. I could omit
that observation while computing the average.
In the verbose mode I'll omit all the threads of CPU0's core, and in the
default (quick) mode just CPU0's latency can be omitted while computing the
average.

Thank you,
Pratik
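
The topology-parsing idea proposed in this reply could look roughly like the following Python sketch. It assumes that `thread_siblings_list` uses the usual sysfs CPU-list format (comma-separated ids and ranges such as `0,6` or `0-3`). The helper names `parse_cpu_list`, `one_cpu_per_core`, and `read_siblings` are hypothetical and not taken from the patch series, which may implement this differently (for example, in shell).

```python
import glob
import re


def parse_cpu_list(text):
    """Expand a sysfs CPU list such as "0,6" or "0-3,8-11" into sorted CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        first, _, last = part.partition("-")
        cpus.update(range(int(first), int(last or first) + 1))
    return sorted(cpus)


def one_cpu_per_core(siblings_by_cpu):
    """Pick the lowest-numbered thread of each physical core.

    siblings_by_cpu maps a CPU id to the raw text of its thread_siblings_list.
    """
    return sorted({parse_cpu_list(text)[0] for text in siblings_by_cpu.values()})


def read_siblings(sysfs_root="/sys/devices/system/cpu"):
    """Read thread_siblings_list for every CPU that exposes it."""
    siblings = {}
    pattern = f"{sysfs_root}/cpu[0-9]*/topology/thread_siblings_list"
    for path in glob.glob(pattern):
        cpu = int(re.search(r"cpu(\d+)/topology", path).group(1))
        with open(path) as f:
            siblings[cpu] = f.read()
    return siblings


# Topology reported earlier in the thread for the i5-10600K:
# core N holds CPUs N and N + 6.
example = {cpu: f"{cpu % 6},{cpu % 6 + 6}" for cpu in range(12)}
print(one_cpu_per_core(example))  # [0, 1, 2, 3, 4, 5]
```

On a live system, `one_cpu_per_core(read_siblings())` would give the destination CPUs for a once-per-core run without assuming any particular CPU numbering.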
On Fri, Apr 9, 2021 at 12:43 AM Pratik Sampat <psampat@linux.ibm.com> wrote:
> On 09/04/21 10:53 am, Doug Smythies wrote:
> > I tried V3 on an Intel i5-10600K processor with 6 cores and 12 CPUs.
> > The core to CPU mappings are:
> > core 0 has cpus 0 and 6
> > core 1 has cpus 1 and 7
> > core 2 has cpus 2 and 8
> > core 3 has cpus 3 and 9
> > core 4 has cpus 4 and 10
> > core 5 has cpus 5 and 11
> >
> > By default, it will test CPUs 0,2,4,6,8,10 on cores 0,2,4,0,2,4.
> > Wouldn't it make more sense to test each core once?
>
> Ideally it would be better to run on all the CPUs. However, on the larger
> systems that I'm testing on, with hundreds of cores and a high thread count,
> the execution time increases while not particularly bringing any additional
> information to the table.
>
> That is why it made sense to run on only one of the threads of each core, to
> make the experiment faster while preserving accuracy.
>
> To handle various thread topologies, it may be worthwhile to parse
> /sys/devices/system/cpu/cpuX/topology/thread_siblings_list for each core and
> use this information to run only once per physical core, rather than
> assuming the topology.
>
> What are your thoughts on a mechanism like this?

Yes, seems like a good solution.

... Doug