Linux Applications Debugging Techniques/Aiming for and measuring performance
gprof & -pg
To profile the application with gprof:
- Compile the code with -pg
- Link with -pg
- Run the application. This creates a file gmon.out in the current folder of the application.
- At the prompt, in the folder where gmon.out lives, run: gprof path-to-application (see the example below)
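For example, a minimal, hypothetical session could look like this (file and function names are illustrative, not taken from this chapter):

// gproftest.cpp - a tiny program to exercise the gprof workflow:
//
//   g++ -pg gproftest.cpp -o gproftest    # compile and link with -pg
//   ./gproftest                           # writes gmon.out into the current folder
//   gprof ./gproftest gmon.out            # flat profile + call graph
#include <cstdio>

// A deliberately hot function, so it stands out in the flat profile.
static double burn(int n) {
    double acc = 0.0;
    for (int i = 1; i <= n; ++i)
        acc += 1.0 / i;
    return acc;
}

int main() {
    double total = 0.0;
    for (int i = 0; i < 2000; ++i)
        total += burn(100000);
    std::printf("%f\n", total);
    return 0;
}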
PAPI
The Performance Application Programming Interface (PAPI) offers the programmer access to the performance counter hardware found in most major microprocessors. With a decent C++ wrapper, measuring branch mispredictions and cache misses (and much more) is literally one line of code away.
By default, these are the events that papi::counters<Print_policy> is looking for:
static const events_type events = {
      {PAPI_TOT_INS, "Total instructions"}
    , {PAPI_TOT_CYC, "Total cpu cycles"}
    , {PAPI_L1_DCM, "L1 load misses"}
    // , {PAPI_L1_STM, "L1 store misses"}
    , {PAPI_L2_DCM, "L2 load misses"}
    // , {PAPI_L2_STM, "L2 store misses"}
    , {PAPI_BR_MSP, "Branch mispredictions"}
};
The counters class is parameterized with a Print_policy that tells it what to do with the counter values once the object goes out of scope. As an example, let's look at these lines:
const int nlines = 196608;
const int ncols = 64;
char ctrash[nlines][ncols];
{
    int x;
    papi::counters<papi::stdout_print> pc("by column"); //<== the famous one-line
    for (int c = 0; c < ncols; ++c) {
        for (int l = 0; l < nlines; ++l) {
            x = ctrash[l][c];
        }
    }
}
The code just loops over an array, but in the wrong order: the innermost loop iterates over the outer index. The result is the same whether we vary the first index or the last one fastest but, to preserve cache locality, the innermost loop should in theory iterate over the innermost index: with ncols == 64 and one-byte elements, each row of ctrash occupies exactly one cache line on a typical CPU with 64-byte cache lines, so the column-first traversal above touches a different cache line on every iteration of the inner loop. This should make a big difference in the time it takes to iterate over the array:
{
    int x;
    papi::counters<papi::stdout_print> pc("By line");
    for (int l = 0; l < nlines; ++l) {
        for (int c = 0; c < ncols; ++c) {
            x = ctrash[l][c];
        }
    }
}
papi::counters is a class wrapping the PAPI functionality. It takes a snapshot of some performance counters (in our case, we are interested in cache misses and branch mispredictions) when a counters object is instantiated, and another snapshot when the object is destroyed; it then prints out the differences.
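The wrapper itself is not listed here. For orientation, a minimal sketch of such a RAII counters class, written against the low-level PAPI C API, might look like the following (this is our own illustration, with assumed names and an assumed Print_policy interface, not the chapter's actual implementation; error checking is omitted):

// counters.hpp - a hypothetical sketch of a PAPI RAII wrapper (not the
// chapter's actual code): snapshot on construction, snapshot and print
// on destruction.
#include <papi.h>
#include <cstdio>
#include <cstddef>
#include <vector>

namespace papi {

struct stdout_print {
    static void print(const char* tag, int event,
                      long long delta, long long stop, long long start) {
        char name[PAPI_MAX_STR_LEN];
        PAPI_event_code_to_name(event, name);
        std::printf("Delta %s: %s: %lld (%lld-%lld)\n",
                    tag, name, delta, stop, start);
    }
};

template <typename Print_policy>
class counters {
public:
    explicit counters(const char* tag) : tag_(tag) {
        if (PAPI_is_initialized() == PAPI_NOT_INITED)
            PAPI_library_init(PAPI_VER_CURRENT);
        PAPI_create_eventset(&eventset_);
        for (std::size_t i = 0; i < events_.size(); ++i)
            PAPI_add_event(eventset_, events_[i]);
        start_.resize(events_.size());
        stop_.resize(events_.size());
        PAPI_start(eventset_);
        PAPI_read(eventset_, start_.data());      // first snapshot
    }
    ~counters() {
        PAPI_stop(eventset_, stop_.data());       // second snapshot
        for (std::size_t i = 0; i < events_.size(); ++i)
            Print_policy::print(tag_, events_[i], stop_[i] - start_[i],
                                stop_[i], start_[i]);
        PAPI_cleanup_eventset(eventset_);
        PAPI_destroy_eventset(&eventset_);
    }
private:
    const char* tag_;
    int eventset_ = PAPI_NULL;
    std::vector<long long> start_, stop_;
    static const std::vector<int> events_;
};

template <typename Print_policy>
const std::vector<int> counters<Print_policy>::events_ = {
    PAPI_TOT_INS, PAPI_TOT_CYC, PAPI_L1_DCM, PAPI_L2_DCM, PAPI_BR_MSP
};

} // namespace papi

Used as in the snippets above, the printed deltas cover whatever runs between the construction of pc and the end of its enclosing scope.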
A first measurement, with non-optimized code (-O0), shows the following:
Delta by column:
PAPI_TOT_INS (Total instructions): 188744788 (380506167-191761379)
PAPI_TOT_CYC (Total cpu cycles): 92390347 (187804288-95413941)
PAPI_L1_DCM (L1 load misses): 28427 (30620-2193) <==
PAPI_L2_DCM (L2 load misses): 102 (1269-1167)
PAPI_BR_MSP (Branch mispredictions): 176 (207651-207475) <==
Delta By line:
PAPI_TOT_INS (Total instructions): 190909841 (191734047-824206)
PAPI_TOT_CYC (Total cpu cycles): 94460862 (95387664-926802)
PAPI_L1_DCM (L1 load misses): 403 (2046-1643) <==
PAPI_L2_DCM (L2 load misses): 21 (1081-1060)
PAPI_BR_MSP (Branch mispredictions): 205934 (207350-1416) <==
While the cache misses have indeed improved, the branch mispredictions exploded; not exactly a good tradeoff. Down in the processor's pipeline, a comparison operation translates into a branch operation, so something is odd about the unoptimized code the compiler generated. Typically, branch machine code is generated directly by if/else and ternary operators, and indirectly by virtual calls and by calls through pointers.
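A contrived illustration (our own, with made-up names) of where such branches come from:

// branches.cpp - illustrative only: constructs that typically become
// branch instructions in the generated machine code.
#include <cstdlib>

struct Base {
    virtual ~Base() {}
    virtual int f() const { return 1; }
};
struct Derived : Base {
    int f() const { return 2; }
};

static int g(int x) { return x + 1; }

static int examples(const Base& b, int (*fp)(int), int x) {
    int r = 0;
    if (x > 0)              // direct: if/else
        r += x;
    r += (x & 1) ? 3 : 4;   // direct: ternary operator (may become a cmov instead)
    r += b.f();             // indirect: virtual call through the vtable
    r += fp(x);             // indirect: call through a function pointer
    return r;
}

int main() {
    Derived d;
    return examples(d, g, std::rand()) & 0xff;
}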
Maybe the optimized code (-O2) is behaving better? Or maybe not:
Delta by column:
PAPI_TOT_INS (Total instructions): 329 (229368-229039)
PAPI_TOT_CYC (Total cpu cycles): 513 (186217-185704)
PAPI_L1_DCM (L1 load misses): 2 (1523-1521)
PAPI_L2_DCM (L2 load misses): 0 (993-993)
PAPI_BR_MSP (Branch mispredictions): 7 (1287-1280)
Delta By line:
PAPI_TOT_INS (Total instructions): 330 (209614-209284)
PAPI_TOT_CYC (Total cpu cycles): 499 (173487-172988)
PAPI_L1_DCM (L1 load misses): 2 (1498-1496)
PAPI_L2_DCM (L2 load misses): 0 (992-992)
PAPI_BR_MSP (Branch mispredictions): 7 (1225-1218)
This time the compiler optimized the loops away entirely: it figured out that we never actually use the data in the array, so it got rid of the loops completely.
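As an aside, one way to keep -O2 from discarding such read-only loops without changing the access pattern is to give the loads a visible side effect, for example by storing each value into a volatile sink (this is our own suggestion, not what the chapter's code does; it reuses ctrash, nlines and ncols from above):

volatile char sink;
for (int c = 0; c < ncols; ++c) {
    for (int l = 0; l < nlines; ++l) {
        sink = ctrash[l][c];  // the volatile store, and hence the load, cannot be elided
    }
}

The measurement below takes a different route and simply uses the data by writing it back, which also adds stores to the workload.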
Let's see how this code behaves:
{
    int x;
    papi::counters<papi::stdout_print> pc("by column");
    for (int c = 0; c < ncols; ++c) {
        for (int l = 0; l < nlines; ++l) {
            x = ctrash[l][c];
            ctrash[l][c] = x + 1;
        }
    }
}
Delta by column:
PAPI_TOT_INS (Total instructions): 62918492 (63167552-249060)
PAPI_TOT_CYC (Total cpu cycles): 224705473 (224904307-198834)
PAPI_L1_DCM (L1 load misses): 12415661 (12417203-1542)
PAPI_L2_DCM (L2 load misses): 9654638 (9655632-994)
PAPI_BR_MSP (Branch mispredictions): 14217 (15558-1341)
Delta By line:
PAPI_TOT_INS (Total instructions): 51904854 (115092642-63187788)
PAPI_TOT_CYC (Total cpu cycles): 25914254 (250864272-224950018)
PAPI_L1_DCM (L1 load misses): 197104 (12614449-12417345)
PAPI_L2_DCM (L2 load misses): 6330 (9662090-9655760)
PAPI_BR_MSP (Branch mispredictions): 296 (16066-15770)
Both cache misses and branch mispredictions improved by at least an order of magnitude. A run with unoptimized code will show the same order of improvement.
OProfile
OProfile offers access to the same hardware counters as PAPI but without having to instrument the code:
- It is coarser-grained than PAPI: it profiles at function level.
- Some out-of-the-box kernels (Red Hat) are not OProfile-friendly.
- You need root access.
#!/bin/bash
#
# A script to OProfile a program.
# Must be run as root.
#
if [ $# -ne 1 ]
then
    echo "Usage: $(basename "$0") <binary-image>"
    exit 1
else
    binimg="$1"
fi

opcontrol --stop
opcontrol --shutdown

# Out of the box RedHat kernels are OProfile repellent.
opcontrol --no-vmlinux
opcontrol --reset

# The list of events for the platform can be found in /usr/share/oprofile/<>/events
opcontrol --event=L2_CACHE_MISSES:1000

opcontrol --start
"$binimg"
opcontrol --stop
opcontrol --dump

rm -f "$binimg".opreport.log
opreport > "$binimg".opreport.log

rm -f "$binimg".opreport.sym
opreport -l > "$binimg".opreport.sym

opcontrol --shutdown
opcontrol --deinit

echo "Done"
perf
perf is a kernel-based subsystem that provides a framework for analyzing the performance impact of the programs being run. It covers both hardware features (the CPU's PMU, Performance Monitoring Unit) and software features (software counters, tracepoints).
perf list lists the events available on a particular machine. These events vary depending on the performance monitoring hardware and the software configuration of the system.
$ perf list
List of pre-defined events (to be used in -e):
cpu-cycles OR cycles [Hardware event]
stalled-cycles-frontend OR idle-cycles-frontend [Hardware event]
stalled-cycles-backend OR idle-cycles-backend [Hardware event]
instructions [Hardware event]
cache-references [Hardware event]
cache-misses [Hardware event]
branch-instructions OR branches [Hardware event]
branch-misses [Hardware event]
bus-cycles [Hardware event]
cpu-clock [Software event]
task-clock [Software event]
page-faults OR faults [Software event]
minor-faults [Software event]
major-faults [Software event]
context-switches OR cs [Software event]
cpu-migrations OR migrations [Software event]
alignment-faults [Software event]
emulation-faults [Software event]
L1-dcache-loads [Hardware cache event]
L1-dcache-load-misses [Hardware cache event]
L1-dcache-stores [Hardware cache event]
L1-dcache-store-misses [Hardware cache event]
L1-dcache-prefetches [Hardware cache event]
L1-dcache-prefetch-misses [Hardware cache event]
L1-icache-loads [Hardware cache event]
L1-icache-load-misses [Hardware cache event]
L1-icache-prefetches [Hardware cache event]
L1-icache-prefetch-misses [Hardware cache event]
LLC-loads [Hardware cache event]
LLC-load-misses [Hardware cache event]
LLC-stores [Hardware cache event]
LLC-store-misses [Hardware cache event]
LLC-prefetches [Hardware cache event]
LLC-prefetch-misses [Hardware cache event]
dTLB-loads [Hardware cache event]
dTLB-load-misses [Hardware cache event]
dTLB-stores [Hardware cache event]
dTLB-store-misses [Hardware cache event]
dTLB-prefetches [Hardware cache event]
dTLB-prefetch-misses [Hardware cache event]
iTLB-loads [Hardware cache event]
iTLB-load-misses [Hardware cache event]
branch-loads [Hardware cache event]
branch-load-misses [Hardware cache event]
node-loads [Hardware cache event]
node-load-misses [Hardware cache event]
node-stores [Hardware cache event]
node-store-misses [Hardware cache event]
node-prefetches [Hardware cache event]
node-prefetch-misses [Hardware cache event]
rNNN (see 'perf list --help' on how to encode it) [Raw hardware event descriptor]
mem:<addr>[:access] [Hardware breakpoint]
Note: running this as root will give an extended list of events; some of the events (tracepoints, for instance) require root privileges.
perf stat gathers overall statistics for common performance events, including instructions executed and clock cycles consumed. Option flags allow gathering statistics on events other than the default ones.
$ g++ -std=c++11 -ggdb -fno-omit-frame-pointer perftest.cpp -o perftest
$ perf stat ./perftest
Performance counter stats for './perftest':
379.991103 task-clock # 0.996 CPUs utilized
62 context-switches # 0.000 M/sec
0 CPU-migrations # 0.000 M/sec
6,436 page-faults # 0.017 M/sec
984,969,006 cycles # 2.592 GHz [83.27%]
663,592,329 stalled-cycles-frontend # 67.37% frontend cycles idle [83.17%]
473,904,165 stalled-cycles-backend # 48.11% backend cycles idle [66.42%]
1,010,613,552 instructions # 1.03 insns per cycle
# 0.66 stalled cycles per insn [83.23%]
53,831,403 branches # 141.665 M/sec [84.14%]
401,518 branch-misses # 0.75% of all branches [83.48%]
0.381602838 seconds time elapsed
$ perf stat --event=L1-dcache-load-misses ./perftest
Performance counter stats for './perftest':
12,942,048 L1-dcache-load-misses
0.373719009 seconds time elapsed
perf record records performance data into a file which can later be analyzed using perf report.
$ perf record --event=L1-dcache-load-misses ./perftest
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.025 MB perf.data (~1078 samples) ]
$ ls -al
...
-rw------- 1 amelinte amelinte 27764 Feb 17 17:23 perf.data
perf report reads the performance data from a file and analyzes the recorded data.
$ perf report --stdio
# ========
# captured on: Sun Feb 17 17:23:34 2013
# hostname : bear
# os release : 3.2.0-4-amd64
# perf version : 3.2.17
# arch : x86_64
# nrcpus online : 4
# nrcpus avail : 4
# cpudesc : Intel(R) Core(TM) i3 CPU M 390 @ 2.67GHz
# cpuid : GenuineIntel,6,37,5
# total memory : 3857640 kB
# cmdline : /usr/bin/perf_3.2 record --event=L1-dcache-load-misses ./perftest
# event : name = L1-dcache-load-misses, type = 3, config = 0x10000, config1 = 0x0, config2 = 0x0, excl_usr = 0, excl_kern = 0, id = {
# HEADER_CPU_TOPOLOGY info available, use -I to display
# HEADER_NUMA_TOPOLOGY info available, use -I to display
# ========
#
# Events: 274 L1-dcache-load-misses
#
# Overhead Command Shared Object Symbol
# ........ ........ ................. ..................
#
    95.93%  perftest  perftest           [.] 0xd35
     1.06%  perftest  [kernel.kallsyms]  [k] clear_page_c
     0.82%  perftest  [kernel.kallsyms]  [k] run_timer_softirq
     0.42%  perftest  [kernel.kallsyms]  [k] trylock_page
     0.41%  perftest  [kernel.kallsyms]  [k] __rcu_pending
     0.41%  perftest  [kernel.kallsyms]  [k] update_curr
     0.33%  perftest  [kernel.kallsyms]  [k] do_raw_spin_lock
     0.26%  perftest  [kernel.kallsyms]  [k] __flush_tlb_one
     0.18%  perftest  [kernel.kallsyms]  [k] flush_old_exec
     0.06%  perftest  [kernel.kallsyms]  [k] __free_one_page
     0.05%  perftest  [kernel.kallsyms]  [k] free_swap_cache
     0.05%  perftest  [kernel.kallsyms]  [k] zone_statistics
     0.04%  perftest  [kernel.kallsyms]  [k] alloc_pages_vma
     0.01%  perftest  [kernel.kallsyms]  [k] mm_init
     0.01%  perftest  [kernel.kallsyms]  [k] vfs_read
     0.00%  perftest  [kernel.kallsyms]  [k] __cond_resched
     0.00%  perftest  [kernel.kallsyms]  [k] finish_task_switch
#
# (For a higher level overview, try: perf report --sort comm,dso)
#
$ perf record -g ./perftest
$ perf report -g --stdio
...
# Overhead Command Shared Object Symbol
# ........ ........ ................. ......................
#
97.23% perftest perftest [.] 0xc75
|
--- 0x400d2c
0x400dfb
__libc_start_main
perf annotate reads the input file and displays an annotated version of the code. If the object file has debug symbols, the source code is displayed alongside the assembly; if there is no debug info, only annotated assembly is displayed. Broken, though, at least in this run:
$ perf annotate -i ./perf.data -d ./perftest --stdio -f
Warning:
The ./perf.data file has no samples!
perf top is a look-alike of the top tool: it generates and displays a performance counter profile in real time.
Valgrind: cachegrind
Cachegrind simulates a machine with a two-level cache hierarchy (split first-level I1 and D1 caches plus a unified L2) and branch (mis)prediction. It is useful in that it annotates the code down to source-line level. However, the simulated cache can differ significantly from the machine's real CPU; it will not go very far on an AMD64 CPU (VEX disassembler issues); and it is extremely slow, typically slowing the application down 12-15 times.
DIY: libhitcount
libmemleak can easily be modified to keep track of what calls into a particular point in the code: just insert an mtrace() call at that place.
How things scale up
L1 cache reference ......................... 0.5 ns
Branch mispredict ............................ 5 ns
L2 cache reference ........................... 7 ns
Mutex lock/unlock ........................... 25 ns
Main memory reference ...................... 100 ns
Compress 1K bytes with Zippy ............. 3,000 ns = 3 µs
Send 2K bytes over 1 Gbps network ....... 20,000 ns = 20 µs
SSD random read ........................ 150,000 ns = 150 µs
Read 1 MB sequentially from memory ..... 250,000 ns = 250 µs
Round trip within same datacenter ...... 500,000 ns = 0.5 ms
Read 1 MB sequentially from SSD* ..... 1,000,000 ns = 1 ms
Disk seek ........................... 10,000,000 ns = 10 ms
Read 1 MB sequentially from disk .... 20,000,000 ns = 20 ms
Send packet CA->Netherlands->CA .... 150,000,000 ns = 150 ms
- Source: https://gist.github.com/2843375
Operation              Cost (ns)          Ratio
Clock period                 0.6            1.0
Best-case CAS               37.9           63.2
Best-case lock              65.6          109.3
Single cache miss          139.5          232.5
CAS cache miss             306.0          510.0
Comms Fabric               3,000          5,000
Global Comms         130,000,000    216,000,000
Table 2.1: Performance of Synchronization Mechanisms on 4-CPU 1.8GHz AMD Opteron 844 System
- Source: Paul E. McKenney
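For reference, the CAS rows in the table refer to a single atomic compare-and-swap. A minimal sketch in C++ (our own illustration, not the benchmark behind the table):

#include <atomic>

std::atomic<int> owner(0);

// Try to atomically claim 'owner' (0 -> 1). On x86 this compiles to a
// lock cmpxchg; the "best-case" row assumes the cache line is already
// exclusively owned by the executing CPU, the "cache miss" row that it
// first has to be fetched from another CPU's cache.
static bool try_claim() {
    int expected = 0;
    return owner.compare_exchange_strong(expected, 1);
}

int main() {
    return try_claim() ? 0 : 1;
}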
Notes on Hyper Threading
- Presumed loss of performance for intensive floating-point calculations, as the two hardware threads share the core's execution units (only one FPU and one ALU (two pipelines) per core).
- http://ayumilove.net/hyperthreading-faq/
- http://en.wikipedia.org/wiki/Hyper-threading