Press Cmd+I in Xcode to open Instruments, or select Product > Profile. Choose the Time Profiler template and press the red record button (Cmd+R) to start profiling. The Time Profiler samples the call stack every few milliseconds to show where CPU time is spent. Enable 'Hide System Libraries' to focus on your own code, 'Flatten Recursion' to simplify recursive calls, and 'Top Functions' to see cumulative time including called functions. For CPU optimization on Apple Silicon, prefer the CPU Profiler over Time Profiler as it samples based on CPU clock frequency rather than a fixed timer, providing more accurate results and fairer weighting of CPU resources.
Performance Profiling FAQ & Answers
68 expert Performance Profiling answers researched from official documentation. Every answer cites authoritative sources you can verify.
CPU profiling tools
12 questions
Mark regions of interest: 1) Intel VTune ITT API: __itt_resume() and __itt_pause() around regions, run with 'vtune -start-paused'. 2) perf with markers: use 'perf record -D 1000' to delay start, or signal-based control. 3) Programmatic control: Google Benchmark State.PauseTiming()/ResumeTiming(), JMH @CompilerControl annotations. 4) Time-based filtering: record everything, then filter in analysis to specific time ranges. 5) Intel PIN for binary instrumentation of specific functions. 6) Wrapper functions that enable/disable profiling around calls. 7) Perfetto custom track events with TRACE_EVENT macros. Profiling specific regions reduces data volume and focuses analysis on areas you control.
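A minimal sketch of the ITT pause/resume pattern, assuming the ittnotify header and library that ship with VTune are available; setup() and hot_region() are hypothetical stand-ins, and include/link paths depend on your installation:

#include <ittnotify.h>

void setup();        // hypothetical initialization you want excluded from the profile
void hot_region();   // hypothetical code you want profiled

int main() {
    setup();          // collection is still paused here when launched with -start-paused
    __itt_resume();   // start collecting
    hot_region();
    __itt_pause();    // stop collecting
    return 0;
}

Build against VTune's include and lib64 directories and link -littnotify, then run under 'vtune -collect hotspots -start-paused ./program' so only the resumed region is profiled; when no collector is attached, the ITT calls are designed to be near-zero-cost.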
Launch VTune GUI with 'vtune-gui' or use the command line with 'vtune -collect hotspots ./your_program'. In the GUI, create a new project, specify your executable, and select Hotspots analysis from the Analysis tree. VTune offers two collection modes: User-Mode Sampling (higher overhead, no drivers needed) and Hardware Event-Based Sampling (lower overhead, requires sampling drivers). After collection completes, VTune displays a Summary viewpoint showing Top Hotspots sorted by CPU time. The Elapsed Time shows total runtime including idle time, while CPU Time shows the sum of all threads' CPU usage.
Integrated profiling approaches: 1) NVIDIA Nsight Systems: captures CPU and GPU activity on unified timeline, shows kernel launches, memory transfers, and CPU work together. 2) Intel VTune 2025: GPU Compute/Media Hotspots analysis for Intel GPUs and integrated graphics. 3) AMD ROCm Profiler (rocprof): profiles GPU kernels with timeline and counter data. 4) Perfetto: supports GPU traces alongside CPU traces on Android and some desktop configurations. 5) Chrome tracing: includes GPU activity for graphics workloads. For CUDA: use nvprof or Nsight Compute for kernel-level analysis. Key metrics: GPU occupancy, memory throughput, kernel duration, CPU-GPU synchronization points. Look for: idle GPU waiting for CPU, idle CPU waiting for GPU, excessive memory transfers between CPU and GPU.
Use these compiler flags for profiling: '-g' for debug symbols (essential for source-level annotation), '-fno-omit-frame-pointer' to preserve frame pointers for accurate stack traces (modern compilers omit by default), '-O2' or '-O3' to profile optimized code (profiling unoptimized code gives misleading results). Full command: 'gcc -O2 -g -fno-omit-frame-pointer program.c -o program'. For split debug info (smaller binary): '-g -gsplit-dwarf'. Note: debug symbols don't affect performance, only binary size. The -g flag can be combined with any optimization level. Without frame pointers, tools may show incomplete call stacks or use slower DWARF-based unwinding.
Default 99Hz (or 997Hz, both chosen to avoid lockstep with timer interrupts) is good for most cases. Lower frequency (10-50Hz): less overhead, good for long-running production profiling, but less precision - may miss short-lived hot spots. Higher frequency (1000-10000Hz): more detail on short functions, but higher overhead and risk of perturbation. For flame graphs, 99Hz for 30-60 seconds typically captures 3000-6000 samples - sufficient for statistical accuracy. Rule of thumb: overhead fraction ≈ sampling frequency × time spent servicing each sample. At 99Hz with ~10us per sample interrupt, overhead is about 0.1%. Increase frequency only if hot spots aren't clear in the initial profile.
Use perf record to sample CPU activity and perf report to analyze results. Run 'perf record -g ./your_program' to capture stack traces during execution. The -g flag enables call graph recording. After execution completes, run 'perf report' to view an interactive report showing functions sorted by CPU time. For real-time profiling, use 'perf top' to see live CPU usage across all processes. The default sampling rate is 1000Hz (1000 samples per second), which the kernel dynamically adjusts. By default, perf uses the 'cycles' event, which maps to UNHALTED_CORE_CYCLES on Intel processors.
Use perf record with -p flag: 'perf record -g -p PID' to attach to running process by PID. Press Ctrl+C to stop recording, then 'perf report' to analyze. For system-wide profiling: 'perf record -a -g' captures all processes. VTune can also attach: 'vtune -collect hotspots -target-pid PID'. For Java: async-profiler can attach to running JVMs. Python: py-spy attaches without restarting: 'py-spy record -p PID -o profile.svg'. Note some profilers require debug symbols to be present (can be in separate debug package). Sampling profilers generally support attach; instrumentation-based profilers often require process restart. Check kernel parameter perf_event_paranoid if permission denied.
Multiple options based on needs: 1) cProfile (built-in): 'python -m cProfile -s cumtime script.py' for deterministic profiling with call counts. 2) py-spy: sampling profiler with minimal overhead, works on running processes: 'py-spy record -o profile.svg -- python script.py' generates a flame graph. 3) Scalene: low-overhead sampling profiler distinguishing Python/native/system time. 4) yappi: supports multi-threaded profiling. 5) perf can profile the CPython interpreter: 'perf record python script.py', but it shows C functions rather than Python frames unless you enable Python 3.12+'s perf trampoline support. 6) Python 3.12+ also adds sys.monitoring (PEP 669), a low-overhead monitoring API that profiling tools can build on. For production, py-spy and Scalene have the lowest overhead. For detailed call analysis, use cProfile with snakeviz visualization.
VTune Microarchitecture Exploration (uarch) analysis provides low-level CPU metrics organized by TMAM categories. Key metrics: CPI (Cycles Per Instruction) - inverse of IPC, lower is better. Frontend Bound % - instruction fetch/decode stalls. Backend Bound % - subdivided into Memory Bound (cache misses, DRAM latency) and Core Bound (execution unit contention). Bad Speculation % - mispredicted branches, machine clears. Retiring % - useful work done. Focus optimization on the highest percentage category. Drill down: if Memory Bound is high, check L1/L2/L3 Bound and DRAM Bound sub-metrics. If Core Bound, look at Port Utilization to identify oversubscribed execution units. Compare uarch metrics between code versions to verify optimizations address the right bottleneck.
perf stat counts events and reports aggregate statistics at the end of execution, while perf record samples events over time and stores detailed profiles for later analysis. Use 'perf stat ./program' to get summary counts of cycles, instructions, cache misses, and branch mispredictions. Use 'perf record ./program' followed by 'perf report' when you need to identify which specific functions consume the most time. perf stat has lower overhead since it only maintains counters, whereas perf record captures instruction pointers and call stacks at each sample, creating a perf.data file for detailed analysis.
Performance flags: -O2/-O3 (optimization level), -march=native (CPU-specific instructions), -flto (link-time optimization), -ffast-math (aggressive FP optimization, may change results). Profiling accuracy flags: -g (debug symbols for source annotation), -fno-omit-frame-pointer (accurate stack traces - essential), -fno-inline (optional: prevents inlining for clearer profiles, but changes performance). Flags to avoid during profiling: -fomit-frame-pointer (breaks stack unwinding), -s (strips symbols). Recommended combination: '-O2 -g -fno-omit-frame-pointer -march=native' for production-like profiling. Note: -O3 can inline aggressively, making profiles harder to read. Consider separate build configurations: '-O2' without debug info for release, '-O2 -g -fno-omit-frame-pointer' for profiling.
Hardware performance counters
7 questions
Use 'perf stat ./program' which reports IPC by default, calculated as instructions / cycles. Modern CPUs can execute 4+ instructions per cycle with superscalar execution. Typical IPC values: <1.0 indicates stalls (memory-bound, branch mispredictions, dependency chains), 1.0-2.0 is common for general code, 2.0-4.0 indicates well-optimized compute code, >4.0 possible with SIMD. Low IPC + high cache misses = memory-bound. Low IPC + high branch-misses = branch misprediction bound. Low IPC + neither = likely dependency chains or lack of instruction-level parallelism. IPC alone doesn't tell the full story - use TMAM methodology for detailed bottleneck breakdown. Compare IPC between code versions to assess optimization impact.
Event multiplexing occurs when you request more events than available hardware counters. The PMU time-slices between event groups, measuring each for a portion of total runtime, then scales up the counts. This introduces estimation error. Intel CPUs typically have 4 programmable + 3 fixed counters. Multiplexing matters when: calculating derived metrics (ratios become inaccurate if numerator and denominator weren't measured simultaneously), comparing absolute counts across events (both have estimation error), or when workload behavior varies over time (different events measured during different phases). To minimize impact: group related events (use perf -e '{event1,event2}'), reduce total events requested, or run multiple times measuring different event subsets.
Intel Processor Trace records complete control flow by encoding taken branches into a compressed trace buffer. Unlike sampling which captures point-in-time snapshots, PT provides exact execution history. Use PT when: you need exact function call sequences (debugging race conditions), you want precise timing of specific code paths, branch sampling misses rare events, or you need to understand control flow leading to a bug. PT has higher overhead than sampling and generates large traces. Enable with 'perf record -e intel_pt// ./program'. Requires Broadwell or newer Intel CPU (check 'grep intel_pt /proc/cpuinfo'). PT can generate 'virtual LBRs' of arbitrary size, overcoming the 32-entry hardware LBR limit.
Intel PCM provides real-time access to performance counters without Linux perf. Install from GitHub (opcm/pcm). Run 'sudo pcm' for real-time display of: IPC, cache hit rates, memory bandwidth, QPI traffic, and power consumption across all cores. For specific metrics: 'pcm-memory' for memory bandwidth, 'pcm-pcie' for PCIe traffic, 'pcm-power' for power metrics. PCM works on both Linux and Windows. It accesses uncore PMUs (memory controller, QPI) not available through standard perf interface. Output includes: L2/L3 hit rates, memory read/write bandwidth per channel, core and package power. Useful for understanding system-level behavior that per-process profiling misses. Requires root/admin access for MSR reads.
Hardware Performance Counters (also called Performance Monitoring Counters or PMCs) are special CPU registers that count hardware events like instructions executed, cache misses, and branch mispredictions. The Performance Monitoring Unit (PMU) contains these counters. Most Intel Core processors have 4 fully programmable counters and 3 fixed-function counters per logical core. Fixed counters measure core clocks, reference clocks, and instructions retired. Programmable counters let you choose which events to measure. The Linux perf tool accesses PMU counters through the perf_event_open() system call, providing abstractions over hardware-specific capabilities.
Run 'perf list' to display all available performance events on your system. Events are categorized as: Hardware events (cycles, instructions, cache-references, cache-misses, branches, branch-misses), Software events (context-switches, page-faults, cpu-migrations), Hardware cache events (L1-dcache-loads, L1-dcache-load-misses, LLC-loads, LLC-load-misses), and Tracepoint events (kernel functions, syscalls). The available events depend on your CPU model. For Intel processors, use the pmu-tools 'ocperf' wrapper to access the full list of processor-specific events not exposed by default perf, including detailed microarchitectural events.
Use 'perf stat -e event1,event2,event3 ./program' to measure multiple events. Example: 'perf stat -e cycles,instructions,cache-misses,branch-misses ./program'. Most Intel CPUs have 4 programmable counters plus 3 fixed counters, so measuring more than ~7 events requires multiplexing - perf time-slices between event groups and estimates totals. Use event groups with curly braces to ensure events are measured together: 'perf stat -e '{cycles,instructions}','{cache-references,cache-misses}' ./program'. This ensures cycles and instructions are counted simultaneously (enabling accurate IPC calculation), and cache events are grouped together. Check the percentage printed next to each event in the perf stat output: anything below 100% means that event was multiplexed and its count was scaled up from a partial measurement.
Flame graphs and call stacks
6 questions
Sampling profiler limitations: 1) Misses short, rarely-executed functions - a 1us function at 1000Hz sampling has ~0.1% chance of being caught on any given call. 2) Statistical nature means rare hot paths may not appear in profiles. 3) Cannot accurately count function call frequency - only measures time. 4) 'Skid' on interrupt-based sampling places samples slightly after the actual event. 5) Kernel/interrupt time may be attributed incorrectly. 6) Multi-threaded aliasing if sampling correlates with thread scheduling. 7) Cannot detect contention or blocking (need off-CPU analysis). When they fail: very short benchmarks, rarely-called expensive functions, timing-sensitive debugging. For exact call counts and sequences, use tracing or instrumentation-based profilers despite higher overhead.
Sampling-based profiling captures call stack snapshots at regular intervals (typically timer-based) rather than instrumenting every function call. A sampling profiler reads the call stack periodically (e.g., every 10ms) to record what code is running. Functions consuming more CPU time appear in more samples. This is statistically accurate: with enough samples (1000+ minimum, 5000+ ideal), you get reliable hotspot identification. Advantages over instrumentation: minimal overhead (often <1%), no code modification needed, works on optimized binaries. The overhead calculation: at 100Hz sampling with 10,000 instructions per sample on a 1GHz CPU, theoretical overhead is only 0.1%. Most modern profilers (perf, VTune, Instruments) use sampling.
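To make the mechanism concrete, here is a small illustrative C++ sketch (POSIX-only, with hypothetical work phases) that samples its own process via a SIGPROF timer at roughly 100Hz and counts how often each phase is caught running - the same statistical idea real sampling profilers apply to full call stacks:

#include <signal.h>
#include <sys/time.h>
#include <cstdio>

volatile sig_atomic_t current_phase = 0;          // 0 = idle, 1 = phase A, 2 = phase B
volatile long phase_samples[3] = {0, 0, 0};

void on_sample(int) { ++phase_samples[current_phase]; }   // fires ~100 times per CPU second

void spin(long iters) { for (volatile long i = 0; i < iters; ++i) {} }

int main() {
    struct sigaction sa = {};
    sa.sa_handler = on_sample;
    sigaction(SIGPROF, &sa, nullptr);

    itimerval tv = {{0, 10000}, {0, 10000}};       // every 10 ms of CPU time (~100 Hz)
    setitimer(ITIMER_PROF, &tv, nullptr);

    current_phase = 1; spin(400000000);            // hypothetical hot phase
    current_phase = 2; spin(100000000);            // hypothetical cooler phase

    printf("phase A: %ld samples, phase B: %ld samples\n",
           phase_samples[1], phase_samples[2]);
}

The sample counts come out roughly proportional to CPU time per phase (about 4:1 here), which is exactly how wider flame-graph frames emerge from more samples.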
Off-CPU flame graphs show where threads spend time blocked (not on CPU) - waiting for I/O, locks, sleep, page faults, etc. Regular (on-CPU) flame graphs miss this because CPU profilers only sample running threads. Create with: record scheduler events 'perf record -e sched:sched_switch -a -g' or use BPF-based tools like bcc's offcputime. Use when: application seems slow but CPU utilization is low, you suspect I/O or lock contention, threads frequently block, or on-CPU profile doesn't explain observed latency. Off-CPU analysis complements on-CPU profiling - together they account for all wall-clock time. The flame graph shows blocking call stacks, with width indicating total blocked time.
In a flame graph: each box represents a function (stack frame), the y-axis shows stack depth (bottom is entry point, top is leaf functions), the x-axis shows population of samples (NOT time passage - it's sorted alphabetically). Box width indicates relative time spent. Look for wide boxes as these are hotspots. Functions beneath are callers (parents), functions above are callees (children). Colors typically indicate: green for user code, red/orange for kernel code, yellow for C/C++ runtime. To find optimizations, look for the widest towers - these represent code paths consuming the most CPU. Prior to flame graphs, understanding complex profiles took hours; now the hottest paths are immediately visible.
Differential flame graphs show what changed between two profiles. Workflow: 1) Capture baseline: 'perf record -F 99 -a -g -- sleep 30' during workload, 'perf script > before.perf'. 2) Make changes and capture again: 'perf script > after.perf'. 3) Generate differential: './stackcollapse-perf.pl before.perf > before.folded', './stackcollapse-perf.pl after.perf > after.folded', './difffolded.pl before.folded after.folded | ./flamegraph.pl > diff.svg'. Red indicates functions that got slower (more samples), blue indicates faster (fewer samples). Width shows absolute difference. This immediately highlights what changed - faster to identify regressions than comparing two separate flame graphs manually.
First record with stack traces: 'perf record -F 99 -a -g -- sleep 60' (99 Hz sampling, all CPUs, call graphs). Then convert to text: 'perf script > out.perf'. Clone FlameGraph tools: 'git clone https://github.com/brendangregg/FlameGraph'. Generate the SVG: './stackcollapse-perf.pl out.perf | ./flamegraph.pl > flamegraph.svg'. The resulting interactive SVG shows the call stack hierarchy where: x-axis represents stack profile population (sorted alphabetically, NOT time), y-axis shows stack depth, and box width indicates time spent. The widest boxes at any level are your hottest code paths. Click boxes to zoom into subtrees.
Microbenchmarking best practices
5 questions
Include <benchmark/benchmark.h>. Define benchmark functions taking a benchmark::State& parameter. Time code inside the 'for (auto _ : state)' loop - this is the measured section. Use benchmark::DoNotOptimize(result) to prevent dead code elimination. Register with BENCHMARK(BM_FunctionName). End the file with BENCHMARK_MAIN(). Build with CMake using -DBENCHMARK_DOWNLOAD_DEPENDENCIES=on -DCMAKE_BUILD_TYPE=Release. Run with flags: --benchmark_format=json for machine-readable output, --benchmark_repetitions=N for statistical reliability, --benchmark_enable_random_interleaving=true to reduce order-dependent variance (e.g. from Turbo Boost ramping over the run). Google Benchmark automatically determines the iteration count needed for statistical stability.
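A minimal skeleton along these lines; the summed std::vector is just a hypothetical workload:

#include <benchmark/benchmark.h>
#include <numeric>
#include <vector>

static void BM_VectorSum(benchmark::State& state) {
    std::vector<int> data(state.range(0), 1);        // setup outside the loop is not timed
    for (auto _ : state) {                           // only this loop body is measured
        long sum = std::accumulate(data.begin(), data.end(), 0L);
        benchmark::DoNotOptimize(sum);               // keep the result observable
    }
    state.SetItemsProcessed(state.iterations() * state.range(0));
}
BENCHMARK(BM_VectorSum)->Arg(1 << 10)->Arg(1 << 20);

BENCHMARK_MAIN();

Run it with, for example, './bench --benchmark_repetitions=10 --benchmark_format=json > results.json' to capture repeated measurements for later statistical comparison.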
Key pitfalls: 1) Dead Code Elimination - compiler removes code with unused results. Fix: use JMH's Blackhole.consume() or Google Benchmark's DoNotOptimize(). 2) Constant Folding - compiler pre-computes results at compile time. Fix: use runtime inputs, not compile-time constants. 3) Loop Optimization - compiler may hoist computations out of loops. Fix: use the benchmark framework's iteration mechanism, not manual loops. 4) Inadequate Warmup - JIT hasn't optimized code yet. Fix: run sufficient warmup iterations (JMH default: 5). 5) Measurement variance - Fix: run multiple iterations and forks, report with confidence intervals. 6) Benchmark order effects - Fix: randomize execution order (Google Benchmark: --benchmark_enable_random_interleaving=true) and use separate JVM forks in JMH. The first two pitfalls are sketched below.
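An illustrative contrast in Google Benchmark terms; compute_checksum is a hypothetical stand-in for the code under test:

#include <benchmark/benchmark.h>
#include <cstdint>

// Hypothetical function under test: a small pure computation.
uint64_t compute_checksum(uint64_t seed) {
    uint64_t h = seed;
    for (int i = 0; i < 1000; ++i) h = h * 6364136223846793005ULL + 1442695040888963407ULL;
    return h;
}

static void BM_Broken(benchmark::State& state) {
    for (auto _ : state) {
        compute_checksum(42);   // result unused and input constant: may be folded or removed entirely
    }
}

static void BM_Fixed(benchmark::State& state) {
    const uint64_t seed = state.range(0);     // runtime input defeats constant folding
    for (auto _ : state) {
        uint64_t r = compute_checksum(seed);
        benchmark::DoNotOptimize(r);          // result is observed, so it cannot be eliminated
    }
}

BENCHMARK(BM_Broken);
BENCHMARK(BM_Fixed)->Arg(42);
BENCHMARK_MAIN();

If BM_Broken reports a few nanoseconds per iteration no matter how much work compute_checksum does, the compiler has optimized the work away - the code is not actually that fast.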
CPU frequency scaling introduces variance as Turbo Boost activates/deactivates based on thermal and power conditions. Options: 1) Disable Turbo Boost during benchmarks: 'echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo' (Linux). 2) Fix CPU frequency with cpupower: 'sudo cpupower frequency-set -g performance -d 3.0GHz -u 3.0GHz'. 3) Use cycles instead of time as primary metric - cycles are invariant to frequency. 4) Let frequency vary but run many iterations - Google Benchmark's --benchmark_enable_random_interleaving helps. 5) Warm up the benchmark to reach steady-state Turbo frequency before measuring. Report whether frequency was fixed and what frequency was used.
Add the JMH dependencies to Maven: org.openjdk.jmh:jmh-core and jmh-generator-annprocess. Create a benchmark class with @Benchmark-annotated methods. Use @State(Scope.Thread) for per-thread state. JMH generates the measurement loop itself (its equivalent of Google Benchmark's 'for (auto _ : state)') and chooses iteration counts automatically based on the target measurement time. Configure with annotations: @Warmup(iterations=5) for warmup, @Measurement(iterations=5) for actual measurements, @Fork(2) for JVM forks. Run via the Maven plugin or a JMH runner main class. JMH automatically handles JIT compilation warmup, dead code elimination prevention, and statistical analysis, outputting mean, error, and confidence intervals.
Control sources of variance: 1) CPU: disable Turbo Boost, fix frequency, pin threads to cores with taskset/numactl. 2) Memory: disable ASLR ('echo 0 | sudo tee /proc/sys/kernel/randomize_va_space'), warm up caches with dry runs. 3) OS: use isolcpus to reserve cores, disable irqbalance, use real-time scheduling if needed. 4) Thermal: let system reach steady-state temperature. 5) Statistical: run many iterations (30+), use median instead of mean, report confidence intervals. 6) Environment: close other applications, disable network, use consistent environment variables. 7) Benchmarking: randomize run order to avoid ordering effects. Check coefficient of variation (stddev/mean) - should be <5% for reliable results. If variance remains high, investigate sources with multiple profiler runs.
Memory profiling
5 questions
Multiple approaches: 1) Valgrind Massif: 'valgrind --tool=massif ./program' then 'ms_print massif.out.<pid>' - shows heap usage over time with allocation call stacks. 2) Heaptrack: lower overhead than Massif, tracks every allocation with a full backtrace. Run 'heaptrack ./program' then 'heaptrack_gui heaptrack.<app>.<pid>.gz' for visualization. 3) perf with allocator probes: add a uprobe on malloc, e.g. 'perf probe -x /usr/lib/x86_64-linux-gnu/libc.so.6 malloc' (path varies by distribution), then 'perf record -e probe_libc:malloc -g ./program'. 4) gperftools (tcmalloc): link with -ltcmalloc and set the HEAPPROFILE environment variable. 5) AddressSanitizer (-fsanitize=address) also tracks allocations. For production profiling with minimal overhead, consider sampling-based approaches or eBPF-based tools that don't require recompilation.
Memory leak detection (e.g., Valgrind Memcheck) finds memory that was allocated but never freed - the pointer is lost. Heap profiling (e.g., Massif) tracks all allocations over time to show memory usage patterns and identify allocation hotspots, regardless of whether memory is properly freed. Heap profilers answer: which functions allocate the most memory? How does usage change over time? Where is peak memory? They also detect 'space leaks' - memory that's technically reachable (pointer exists) but not actually used, which leak detectors miss. Use leak detection to find bugs; use heap profiling to optimize memory consumption and identify allocation-heavy code paths.
NUMA (Non-Uniform Memory Access) systems have different latencies to local vs remote memory. Profile with: 'perf stat -e numa_hit,numa_miss,numa_foreign,numa_interleave ./program' to count NUMA-related events. Use 'numactl --hardware' to see topology. Intel VTune has Memory Access analysis showing NUMA traffic. For Linux, check /proc/PID/numa_maps to see memory placement. High numa_miss or remote memory accesses indicate suboptimal placement. Optimize with numactl: 'numactl --membind=0 ./program' binds memory to node 0, 'numactl --cpunodebind=0 --membind=0 ./program' binds both CPU and memory. Consider NUMA-aware data structures that keep data local to accessing threads.
Run 'valgrind --tool=massif ./program' to profile heap allocations over time. Massif measures both useful allocation space and bookkeeping/alignment overhead. Output goes to massif.out.<pid>; run 'ms_print massif.out.<pid>' to see a graph of heap usage over time plus detailed snapshots showing which allocation call stacks account for memory at the peak.
perf mem records and analyzes memory access samples. Run 'perf mem record ./program' then 'perf mem report' to see a memory access breakdown. It shows: data source (L1/L2/L3 cache, local/remote DRAM), access type (load/store), addresses accessed, and latency. Use it for identifying memory-bound hot spots, cache miss sources, and NUMA issues (remote memory accesses). On Intel, perf mem uses PEBS memory sampling, which captures precise load/store addresses. Filter by latency with '--ldlat', e.g. 'perf mem record --ldlat 50 ./program' to sample only loads with 50+ cycle latency (the default threshold is 30). The report shows data addresses - combine with 'perf report --sort mem' for source analysis. Useful for optimizing data layout and access patterns.
Tracing tools
5 questions
Use function tracing with 'perf probe' and 'perf trace'. First add probes: 'perf probe --add function_name' for entry, 'perf probe --add function_name%return' for return. Then record: 'perf record -e probe:function_name,probe:function_name__return ./program'. Use 'perf script' to see timestamped events, calculate latencies by matching entry/return pairs. For system calls: 'perf trace ./program' shows all syscalls with latencies. For specific functions without probes, use dynamic tracing: 'perf record -e 'sched:*' -g' for scheduler events. Combine with -T flag to add timestamps. This is more precise than sampling for measuring specific function execution times but has higher overhead than PMU-based profiling.
Chrome Trace Event Format is a JSON format for profiling data viewable in chrome://tracing or Perfetto UI. Basic structure: array of event objects with fields: name (event name), cat (category), ph (phase: B=begin, E=end, X=complete, i=instant), ts (timestamp in microseconds), pid (process ID), tid (thread ID), args (metadata). Example duration event: {"name":"function","cat":"custom","ph":"X","ts":1000,"dur":500,"pid":1,"tid":1}. Write your profiling system to output this format and open in chrome://tracing for visualization without building custom tools. Supports nested events, counters, async events, and flow events for cross-thread/process relationships.
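A small illustrative C++ sketch of emitting this format by hand - an RAII scope that writes complete ('ph':'X') events; the output file name and the traced function are arbitrary choices for the example:

#include <chrono>
#include <cstdio>
#include <thread>

static FILE* g_trace = nullptr;

static long long now_us() {
    using namespace std::chrono;
    return duration_cast<microseconds>(steady_clock::now().time_since_epoch()).count();
}

struct TraceScope {                       // writes one complete ("ph":"X") event per scope
    const char* name;
    long long start;
    explicit TraceScope(const char* n) : name(n), start(now_us()) {}
    ~TraceScope() {
        fprintf(g_trace, "{\"name\":\"%s\",\"cat\":\"app\",\"ph\":\"X\",\"ts\":%lld,"
                         "\"dur\":%lld,\"pid\":1,\"tid\":1},\n",
                name, start, now_us() - start);
    }
};

void do_work() {
    TraceScope t("do_work");
    std::this_thread::sleep_for(std::chrono::milliseconds(5));
}

int main() {
    g_trace = fopen("trace.json", "w");
    fprintf(g_trace, "[\n");              // Trace Event Format: a JSON array of event objects
    for (int i = 0; i < 3; ++i) do_work();
    fprintf(g_trace, "{}]\n");            // trailing dummy object keeps the JSON valid
    fclose(g_trace);
}

Open the resulting trace.json in chrome://tracing or the Perfetto UI to see the three do_work slices laid out on a timeline.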
Download tracebox: 'curl -LO https://get.perfetto.dev/tracebox && chmod +x tracebox'. Set permissions: 'sudo chown -R $USER /sys/kernel/tracing', 'echo -1 | sudo tee /proc/sys/kernel/perf_event_paranoid', 'echo 0 | sudo tee /proc/sys/kernel/kptr_restrict'. Create a config enabling callstack sampling with the callstack_sampling field in your data source config. Run tracebox with your config to collect traces. Open the resulting .pb file in the Perfetto UI (ui.perfetto.dev). Perfetto shows callstack samples as instant events on the timeline within process track groups, with dynamic flamegraph views when selecting time regions. Convert to pprof format with: 'python3 traceconv profile --perf trace.pb'.
Profiling collects statistical samples to identify where time is spent - you see hotspots and call distribution but not exact execution sequence. Tracing records discrete events with timestamps to show exact execution flow and timing - you see what happened and when, but data volume is large. Use profiling for: finding hotspots, optimizing CPU-bound code, understanding where time goes generally. Use tracing for: debugging timing issues, understanding event sequences, analyzing latency distributions, finding rare slow paths. Profiling has lower overhead (sampling), tracing higher overhead (records all events). Many tools do both: perf can sample or trace, VTune offers Hotspots (profiling) and Platform Profiler (tracing), Perfetto primarily traces but supports sampling.
Navigate to chrome://tracing in Chrome browser. Click 'Record' to start capture, select categories to trace (more categories = more data but potentially noisy), perform the action you want to profile, then 'Stop' recording. The trace shows TRACE_EVENT data from Chrome's instrumented code in a hierarchical timeline view per thread per process. Save traces as JSON with the Save button. Note: chrome://tracing is deprecated in favor of Perfetto (ui.perfetto.dev) which is faster, more stable, and supports custom queries. For web developers, Chrome DevTools Performance panel is often more ergonomic - press F12, go to Performance tab, click record, and it auto-selects appropriate trace categories for the current tab only.
Cache miss analysis
4 questions
Profile TLB (Translation Lookaside Buffer) misses with perf: 'perf stat -e dTLB-loads,dTLB-load-misses,iTLB-loads,iTLB-load-misses ./program'. High TLB miss rates indicate working set exceeds TLB coverage. Modern CPUs have: L1 dTLB ~64 entries, L1 iTLB ~128 entries, L2 STLB ~1536 entries. With 4KB pages, max coverage is ~6MB. Solutions: use huge pages (2MB on x86) to increase TLB coverage 512x, improve memory locality to reduce working set, or use transparent huge pages (THP). Enable huge pages: 'echo always > /sys/kernel/mm/transparent_hugepage/enabled' or use madvise(MADV_HUGEPAGE). TLB miss penalty is ~20-100 cycles for page table walk, significant for memory-intensive workloads.
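A minimal sketch of requesting transparent huge pages for one allocation (Linux-only; the 1 GiB size is an arbitrary example, and madvise is only a hint the kernel may ignore):

#include <sys/mman.h>
#include <cstddef>
#include <cstdio>

int main() {
    const size_t size = 1ULL << 30;   // 1 GiB working set (example)
    void* p = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    if (madvise(p, size, MADV_HUGEPAGE) != 0)     // hint: back this range with 2 MB pages
        perror("madvise(MADV_HUGEPAGE)");
    // ... touch the memory and run the workload, then re-check dTLB-load-misses with perf stat ...
    munmap(p, size);
}

Re-running the dTLB counters before and after the madvise hint shows whether huge pages actually reduced misses for your access pattern.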
The three key metrics are: 1) LX request rate - number of cache level X requests per instruction. Low request rate means data comes from faster cache levels. 2) LX miss rate - number of cache level X misses per instruction. High request rate with low miss rate means data is mostly served from that cache level. High miss rate means data comes from slower memory. 3) LX miss ratio - ratio of misses to requests at level X. This is commonly cited but only meaningful when request rate is high. When analyzing cache performance, focus on miss rate (misses per instruction) rather than miss ratio alone, as a high miss ratio with low request rate may not indicate a real performance problem.
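A worked example with hypothetical counts makes the distinction concrete: suppose a run retires 1,000,000,000 instructions with 20,000,000 L2 requests and 2,000,000 L2 misses. Then the L2 request rate is 20,000,000 / 1,000,000,000 = 0.02 requests per instruction, the L2 miss rate is 2,000,000 / 1,000,000,000 = 0.002 misses per instruction, and the L2 miss ratio is 2,000,000 / 20,000,000 = 10%. A 10% miss ratio sounds alarming, but at only 0.002 misses per instruction it may barely affect runtime - which is why misses per instruction is the better first-order metric.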
Run 'valgrind --tool=cachegrind ./program' to simulate cache behavior. Cachegrind simulates a machine with a split L1 cache (separate instruction and data caches) and a unified last-level (LL) cache. After execution, it prints summary statistics (instruction reads, data reads/writes, and miss counts for each simulated cache) and writes a cachegrind.out.<pid> file; run 'cg_annotate cachegrind.out.<pid>' to see the same counts broken down per function and per source line.
Use 'perf stat -e L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses ./program' to count cache events. Key metrics to calculate: L1 miss rate = L1-dcache-load-misses / L1-dcache-loads, LLC miss rate = LLC-load-misses / LLC-loads. For sampling-based analysis, use 'perf record -e cache-misses ./program' followed by 'perf report' to identify functions causing the most cache misses. Note that cache miss latency costs vary significantly: L1 hit is about 3-4 cycles, L2 hit is 10-14 cycles, L3 hit is 40-50 cycles, and a main memory access is 200-300 cycles. Focus optimization efforts on L3 misses as they have the highest latency impact.
Hot spot analysis
3 questions
Hot spots are code regions consuming disproportionate execution time. To identify them: 1) Run sampling profiler (perf record, VTune Hotspots, Instruments Time Profiler) on representative workload. 2) Sort functions by CPU time in the report. 3) Top functions are hot spots - focus optimization there. 4) Use call graph to understand how hot functions are reached. 5) Drill down with perf annotate to find hot instructions within functions. 6) Generate flame graphs for visual overview - widest boxes are hottest. Remember: some hot spots are fundamental to the algorithm and can't be eliminated. After optimizing, re-profile to verify improvement and identify new hot spots - optimization often shifts bottlenecks.
Follow this cycle: 1) Profile to identify the current biggest bottleneck (don't guess). 2) Understand WHY it's slow - is it algorithmic, memory access patterns, branch mispredictions? 3) Form hypothesis and implement targeted fix. 4) Re-profile to verify improvement. 5) Compare before/after metrics quantitatively. 6) Repeat - the new hottest spot may be different. Key principles: always start with profiling data, optimize the bottleneck (not random code), measure impact of each change, stop when meeting performance targets or hitting diminishing returns. Use version control to track optimization attempts. Document what you tried and results - some 'obvious' optimizations may not help or may hurt.
Startup profiling techniques: 1) perf record from process start: 'perf record -g program' captures everything including initialization. 2) strace for syscall timing: 'strace -tt -T -f program 2>&1 | head -100' shows early syscalls with timestamps and durations. 3) LD_DEBUG for library loading: 'LD_DEBUG=libs program' shows dynamic library loading order and timing. 4) perf with fork following: 'perf record -F 99 -g --call-graph dwarf program' for complete initialization traces. 5) Application-specific: add timestamps at key initialization points. Analyze: dynamic linking time (consider static linking), configuration file parsing, network/database connection establishment, lazy vs eager initialization. Generate flame graph focusing on early samples to visualize startup hot spots.
Bottleneck identification methodology
3 questions
CPU-bound: high CPU utilization (near 100%), low I/O wait, performance scales with faster CPU. Check with 'top' - if CPU bars are full, you're CPU-bound. Memory-bound: CPU utilization moderate but performance limited by memory bandwidth/latency. Profile cache misses with perf - high L3 miss rate indicates memory-bound. Use Roofline Model - if kernels are on the sloped portion, they're memory-bound. I/O-bound: low CPU utilization, high I/O wait (wa% in top). Use 'iotop' to see disk I/O per process. Check with 'vmstat' - high 'wa' column indicates I/O wait. Intel VTune's Top-Down Microarchitecture Analysis Method classifies as Front-End Bound, Back-End Bound (Memory or Core), Bad Speculation, or Retiring.
TMAM is a hierarchical methodology for identifying CPU bottlenecks. It classifies execution into four top-level categories: 1) Retiring - useful work, ideal state. 2) Bad Speculation - wasted work from mispredicted branches. 3) Front-End Bound - instruction fetch/decode bottlenecks (I-cache misses, complex instructions). 4) Back-End Bound - subdivided into Memory Bound (cache misses, memory latency) and Core Bound (execution unit contention, long-latency operations). Start at the top level to identify which category dominates, then drill down. Key insight: only optimize the bottleneck category - improving non-bottleneck areas won't help. Intel VTune and pmu-tools toplev.py implement TMAM automatically.
Several approaches: 1) perf with lock events: 'perf lock record ./program' then 'perf lock report' shows contention statistics. 2) Valgrind DRD/Helgrind: detect lock order issues and contention. 3) Intel VTune Threading analysis: shows wait time per sync object. 4) Off-CPU analysis to see time spent waiting on locks. 5) Instrumented mutex libraries (e.g., pthread with PTHREAD_MUTEX_ERRORCHECK). 6) eBPF/BCC tools like lockstat. Look for: high lock hold times, threads waiting longer than holding, lock ordering issues, unnecessary locking (consider lock-free structures). Metrics: contention rate = lock_waiters / lock_acquisitions, wait time vs hold time ratio. High contention indicates need for finer-grained locking or lock-free algorithms.
Profiling overhead and observer effect
3 questions
Profiling overhead is the performance cost of measurement itself. Sampling profilers typically have <1-5% overhead since they only periodically capture state. Instrumentation-based profilers can have 10-100x slowdown as they intercept every function call. Tracing tools vary: perf has minimal overhead, Valgrind can be 20-50x slower. Overhead affects both absolute timings and relative hotspot rankings. High overhead can cause measurement to dominate the workload, making short, frequently-called functions appear more expensive than they really are. Mitigation: use sampling for minimal overhead, adjust sampling rate (lower = less overhead but less precision), be aware that some tools (Cachegrind, Memcheck) fundamentally require high overhead. Report the profiler used when sharing results.
The observer effect occurs when the act of measurement changes the behavior being measured. In performance analysis, profiling instrumentation can alter cache behavior, branch prediction, memory layout, and timing. Research by Mytkowicz et al. shows this can lead to incorrect conclusions - perturbation is non-monotonic and unpredictable with respect to instrumentation amount. Mitigation strategies: 1) Use hardware performance counters which have minimal overhead. 2) Compare results across multiple profiling tools. 3) Use setup randomization - vary environment variables, link order, stack alignment to detect environment-sensitive results. 4) Perform causal analysis to distinguish real effects from measurement artifacts. 5) Report measurement methodology with results.
Measurement bias occurs when the measurement environment systematically favors some configurations over others. Sources include: link order affecting memory layout, environment variable size changing stack alignment, ASLR randomization, filesystem cache state, CPU frequency scaling state, and other processes on the system. Research found that none of 133 papers in major systems conferences adequately addressed measurement bias. To avoid: 1) Randomize setup - vary link order, environment, etc. across runs. 2) Use consistent test environment - same hardware, OS, background load. 3) Run many trials with different random seeds. 4) Report variance alongside means. 5) Test on multiple machines if possible. 6) Use statistical tests that account for variance when comparing results.
Cycle counting and measurement
3 questions
Include <x86intrin.h> and use __rdtsc() to read the Time Stamp Counter. For accurate measurements, use CPUID to serialize instructions before the first RDTSC, and RDTSCP (which has partial serialization) at the end. A reliable pattern is: call CPUID, call RDTSC (start), run code, call RDTSCP (end), call CPUID. The final CPUID prevents instructions after RDTSCP from being reordered. Modern Intel CPUs since Nehalem (2008) have an 'invariant TSC' that increments at a constant rate regardless of CPU frequency scaling or power states. Note that TSC frequency differs from actual CPU frequency - for example, a CPU ranging 800MHz-4800MHz might have TSC ticking at a fixed 2.3GHz.
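A sketch of that CPUID/RDTSC/RDTSCP pattern using GCC/Clang intrinsics (x86-64 only; the busy loop is a stand-in for the code under test, and converting ticks to time would additionally require the TSC frequency):

#include <x86intrin.h>   // __rdtsc, __rdtscp
#include <cpuid.h>       // __get_cpuid
#include <cstdint>
#include <cstdio>

static inline void serialize() {               // CPUID acts as a serializing barrier
    unsigned a, b, c, d;
    __get_cpuid(0, &a, &b, &c, &d);
}

int main() {
    serialize();
    uint64_t start = __rdtsc();                // start timestamp

    volatile uint64_t sink = 0;                // hypothetical code under test
    for (int i = 0; i < 1000000; ++i) sink += i;

    unsigned aux;
    uint64_t end = __rdtscp(&aux);             // waits for prior instructions to complete
    serialize();                               // keep later instructions from moving up

    printf("elapsed: %llu TSC ticks\n", (unsigned long long)(end - start));
}

For short regions, measure an empty region the same way and subtract that overhead, as described in the next answer.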
RDTSC overhead is approximately 150-200 clock cycles on modern Intel processors. For accurate benchmarking, measure the RDTSC overhead separately and subtract it from your results. Intel and Agner Fog recommend this approach. However, for functions taking 100,000+ cycles, the overhead is negligible and can be ignored. Even RDTSC itself returns varying results, so sample many times - around 3 million samples at 4.2GHz produces stable averages. Also use SetThreadAffinityMask (Windows) or sched_setaffinity (Linux) to pin your thread to a single CPU core, since TSC values are not synchronized across cores on multi-processor systems.
Wall-clock time (elapsed/real time): total time from start to finish, including waiting for I/O, other processes, sleeping. What a stopwatch would measure. CPU time: time CPU spent executing your code, excluding waits. If process sleeps for 1 second, wall time increases by 1s but CPU time doesn't. User time: CPU time spent in user space executing your code. System time: CPU time spent in kernel on behalf of your process (syscalls, I/O operations). User + System = total CPU time. On multi-core: CPU time can exceed wall time if threads run in parallel. The Unix 'time' command reports all three. For benchmarking compute-bound code, use CPU time; for user-facing latency, use wall time.
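A small demonstration of the difference using POSIX clocks (the sleep and busy loop are arbitrary stand-ins): CLOCK_MONOTONIC tracks wall-clock time, while CLOCK_PROCESS_CPUTIME_ID tracks CPU time charged to the process.

#include <ctime>
#include <cstdio>
#include <unistd.h>

static double elapsed(const timespec& a, const timespec& b) {
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) * 1e-9;
}

int main() {
    timespec w0, w1, c0, c1;
    clock_gettime(CLOCK_MONOTONIC, &w0);             // wall clock
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &c0);    // CPU time of this process

    sleep(1);                                        // blocked: wall time advances, CPU time barely moves
    for (volatile long i = 0; i < 300000000; ++i) {} // busy: both advance

    clock_gettime(CLOCK_MONOTONIC, &w1);
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &c1);
    printf("wall %.2fs  cpu %.2fs\n", elapsed(w0, w1), elapsed(c0, c1));
}

Expect wall time to exceed CPU time by roughly the one-second sleep; add worker threads running in parallel and CPU time can exceed wall time.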
Statistical analysis of benchmarks
3 questions
Essential metrics to report: 1) Central tendency: mean AND median (median more robust to outliers). 2) Variance: standard deviation, interquartile range, min/max. 3) Confidence intervals: 95% CI for mean. 4) Sample size: number of iterations/runs. 5) Methodology: warmup iterations, measurement iterations, tools used. Hardware: CPU model (exact SKU), RAM size/speed, storage type. Software: OS version, compiler/runtime version, optimization flags. Environment: frequency scaling settings, other running processes, whether virtualized. For comparison claims, report: statistical test used (t-test, Mann-Whitney), p-value or whether confidence intervals overlap. Include raw data or histogram when possible. Note any known sources of variance.
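A sketch of computing the mean and a 95% confidence interval from repeated run times (the sample values are made up; z = 1.96 assumes a reasonably large sample - use the t-distribution for small n):

#include <cmath>
#include <cstdio>
#include <numeric>
#include <vector>

int main() {
    // Hypothetical wall-clock times (ms) from 10 benchmark repetitions
    std::vector<double> ms = {12.1, 11.9, 12.4, 12.0, 12.2, 11.8, 12.3, 12.1, 12.0, 12.2};
    const double n = static_cast<double>(ms.size());

    const double mean = std::accumulate(ms.begin(), ms.end(), 0.0) / n;
    double ss = 0.0;
    for (double x : ms) ss += (x - mean) * (x - mean);
    const double stddev = std::sqrt(ss / (n - 1));          // sample standard deviation
    const double half_width = 1.96 * stddev / std::sqrt(n); // 95% CI half-width

    printf("mean %.3f ms, stddev %.3f ms, 95%% CI [%.3f, %.3f]\n",
           mean, stddev, mean - half_width, mean + half_width);
}

Report these figures alongside the hardware and methodology details above; two configurations whose intervals do not overlap are likely genuinely different.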
For comparing two configurations: use t-test if data is normally distributed with similar standard deviations. For multiple configurations: use ANOVA (one-factor analysis of variance). For non-normal distributions: use non-parametric tests like Mann-Whitney U. Always report confidence intervals - 95% is standard but 90% or 99% may be appropriate depending on risk tolerance. Use Maritz-Jarrett method for confidence intervals around percentiles/quantiles. For high-variance workloads, consider CUPED (Controlled-experiment Using Pre-Experiment Data) to reduce variance. Avoid comparing just means - overlapping confidence intervals suggest no statistically significant difference. Visualize distributions, not just summary statistics.
Minimum 1000 samples for basic reliability, 5000+ samples for high confidence. Confidence interval width is inversely proportional to square root of sample size - quadrupling samples halves the interval width. For comparing benchmarks, use statistical tests: t-test for two configurations (requires normal distribution), ANOVA for multiple configurations. Report results with confidence intervals (typically 95%) rather than just means. Use median and median absolute deviation for non-normal distributions instead of mean and standard deviation. For LLM and ML benchmarks, recent research shows 10 independent trials per configuration with reported variance and confidence intervals as best practice.
Roofline model analysis
3 questions
In Intel Advisor GUI: create project, specify executable, run Survey analysis first to collect timing data, then run Trip Counts analysis with FLOPS collection enabled. Alternatively, use the 'Collect Roofline' shortcut which runs both. Command line: 'advisor --collect=survey --project-dir=./adv -- ./program' then 'advisor --collect=tripcounts --flop --project-dir=./adv -- ./program'. The Roofline pane shows kernels as dots with size/color indicating execution time. Check the Recommendation tab for optimization guidance. A kernel's vertical position relative to roofs indicates bottlenecks - if above a roof, that's not the primary bottleneck. Focus on kernels far below roofs (room to optimize) and large dots (high time impact).
The Roofline Model visualizes application performance relative to hardware limits. The X-axis is Arithmetic Intensity (operations per byte of data moved), Y-axis is performance (operations per second). The 'roofline' has two parts: a sloped memory-bound region (limited by memory bandwidth) and a flat compute-bound region (limited by peak FLOPS). Each dot represents a kernel/loop. If a dot is below the sloped roof, it's memory-bound - optimize data movement. If below the flat roof, it's compute-bound - optimize computation. Intel Advisor generates roofline charts automatically. The Cache-Aware Roofline Model extends this with separate roofs for each cache level, helping identify which memory level is the bottleneck.
Arithmetic Intensity (AI) = FLOPs performed / Bytes transferred from memory. It determines whether code is compute-bound or memory-bound. Calculate: count floating-point operations in a kernel, count bytes read/written. Example: DAXPY (y = a*x + y) with n double-precision elements: 2n FLOPs (one multiply + one add per element) and 3n * 8 = 24n bytes (read x, read y, write y), so AI = 2n / 24n ≈ 0.083 FLOPs/byte - very memory-bound. Intel Advisor calculates AI automatically in Roofline analysis. Compare AI to machine balance (peak FLOPS / peak memory bandwidth). If AI < machine balance, the kernel is memory-bound; if AI > machine balance, it is compute-bound. Increase AI through cache blocking, data reuse, or algorithm changes to move from memory-bound toward compute-bound.
Comparative benchmarking
2 questions
Use A/B comparison methodology: 1) Establish baseline with multiple runs (minimum 10, ideally 30+) to capture variance. 2) Calculate mean, median, standard deviation, and confidence intervals. 3) Make optimization change. 4) Run same number of tests under identical conditions. 5) Use statistical tests - t-test for normally distributed data, Mann-Whitney for non-normal. 6) Check if confidence intervals overlap - non-overlapping indicates statistically significant difference. Tools: VTune has comparison mode, perf diff compares two perf.data files, differential flame graphs show changes visually. Important: control for system noise - use CPU pinning, disable frequency scaling, close other applications, run multiple times.
Use dedicated benchmark runners with fixed hardware for consistency. Steps: 1) Store benchmark baseline results in version control. 2) Run benchmarks on every commit/PR using tools like Google Benchmark, JMH, or Criterion. 3) Compare against baseline with statistical tests - reject if performance regresses beyond a threshold (e.g., 5% slower with 95% confidence). 4) Tools: Bencher or Conbench for tracking results over time; GitHub Actions for automation. 5) Pin CPU frequency, disable Turbo Boost on benchmark machines. 6) Run benchmarks multiple times (10+) for statistical reliability. 7) Alert/block merges on statistically significant regressions. 8) Store historical results for trend analysis. Consider dedicated benchmark machines to avoid cloud instance variability.
Instruction-level profiling
2 questions
Intel Processor Event-Based Sampling (PEBS) is a hardware mechanism that records precise instruction pointers when performance events occur, unlike regular sampling which has 'skid' due to interrupt latency. When a configured event (cache miss, branch misprediction) occurs, PEBS captures the instruction pointer and register state into a dedicated buffer with minimal overhead. This pinpoints the exact instruction causing the event, not an instruction several cycles later. Enable PEBS in perf with ':pp' or ':ppp' suffix on events, e.g., 'perf record -e cycles:pp ./program'. PEBS is essential for accurate attribution of cache misses and branch mispredictions to specific code locations.
After recording with 'perf record -g ./program', run 'perf annotate function_name' to see per-instruction samples. perf annotate displays assembly with percentage of time spent on each instruction. If compiled with debug info (-g flag), source code appears alongside assembly. For best results, compile with '-fno-omit-frame-pointer -ggdb'. In perf report interactive mode, press 'a' to annotate the selected function. Note that interrupt-based sampling introduces 'skid' - the recorded instruction pointer may be several dozen instructions away from where the counter actually overflowed due to out-of-order execution and pipeline depth. Intel PEBS (Processor Event-Based Sampling) provides more precise instruction attribution.
Branch misprediction profiling
2 questions
Use 'perf stat -e branches,branch-misses ./program' to count total branches and mispredictions. Calculate misprediction rate as branch-misses/branches. A rate above 1-2% may indicate optimization opportunities. For sampling-based analysis, use 'perf record -e branch-misses ./program' then 'perf report' to find functions with the most mispredictions. Intel Last Branch Records (LBR) provide more detailed branch analysis: use 'perf record -b -e cycles ./program' to capture branch stacks with 32 entries showing FROM/TO addresses and predicted/mispredicted flags. Modern CPUs rely heavily on branch prediction to keep pipelines full, so high misprediction rates can severely impact performance.
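A classic illustrative workload for these counters (hypothetical and self-contained): the data-dependent branch below mispredicts roughly half the time on random data, while sorting the input first makes it almost perfectly predicted - compare 'perf stat -e branches,branch-misses' for the two runs.

#include <algorithm>
#include <cstdio>
#include <cstdlib>
#include <vector>

int main(int argc, char**) {
    std::vector<int> v(1 << 24);
    for (int& x : v) x = std::rand() % 256;

    if (argc > 1) std::sort(v.begin(), v.end());   // pass any extra argument for the sorted, predictable run

    long long sum = 0;
    for (int pass = 0; pass < 10; ++pass)
        for (int x : v)
            if (x >= 128) sum += x;                // data-dependent branch
    printf("%lld\n", sum);
}

One caveat: at higher optimization levels some compilers turn the if into a branchless conditional move, which hides the effect - check the generated assembly or perf annotate if the two runs look identical.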
LBR is a CPU feature that records the last 32 branches taken by the processor, including source and destination addresses, whether the branch was predicted correctly, and cycle counts. To capture LBR data with perf: 'perf record -b -e cycles ./program'. The resulting data shows branch stacks with format: FROM -> TO (M/P for mispredicted/predicted, cycles). LBR provides better coverage than direct branch-misses sampling because it captures branch history without requiring additional performance counters. Use 'perf report --branch-history' to analyze branch patterns. LBR is available on Intel processors since Nehalem and is useful for identifying hot branch paths and misprediction patterns.