Low Level Optimization FAQ & Answers
627 expert Low Level Optimization answers researched from official documentation. Every answer cites authoritative sources you can verify.
Process Checklists
Follow this 8-step memory alignment optimization process: 1) Identify alignment requirements - SSE needs 16-byte, AVX needs 32-byte, AVX-512 needs 64-byte alignment; cache lines are 64 bytes. 2) Check current alignment - use (uintptr_t)ptr % alignment or assert(((uintptr_t)ptr & (alignment - 1)) == 0); the inner parentheses matter because == binds tighter than &. 3) For stack variables: use the alignas(N) specifier in C++11+ or __attribute__((aligned(N))). 4) For heap allocations: use aligned_alloc(alignment, size), _mm_malloc(size, alignment), or posix_memalign(). 5) For structure members: use alignas() or pack/align pragmas; be aware of padding impact on size. 6) Verify SIMD load/store instructions: use aligned loads (_mm256_load_ps) when alignment is guaranteed, unaligned (_mm256_loadu_ps) otherwise. 7) Measure the performance difference - on modern CPUs (Haswell+), the unaligned access penalty is small UNLESS the access crosses a cache line boundary; the crossing penalty is significant. 8) Consider cache line alignment for hot data - align frequently accessed data to 64-byte boundaries to avoid split loads and false sharing. Trade-off: over-alignment wastes memory; align only what's performance-critical.
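A minimal sketch of steps 2-4, assuming C++17 and a POSIX-style toolchain (std::aligned_alloc is not available on MSVC, which provides _aligned_malloc instead); the helper name is_aligned is illustrative:

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Step 2: check alignment of an arbitrary pointer.
// Note the parentheses: == binds tighter than &, so the mask must be grouped.
bool is_aligned(const void* p, std::size_t alignment) {
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}

// Step 3: static/stack data aligned for AVX (32 bytes).
alignas(32) static float coeffs[8];

int main() {
    // Step 4: heap allocation aligned to a 64-byte cache line.
    // aligned_alloc requires size to be a multiple of alignment (C11 / C++17).
    float* buf = static_cast<float*>(std::aligned_alloc(64, 1024 * sizeof(float)));
    assert(buf && is_aligned(buf, 64));
    assert(is_aligned(coeffs, 32));
    std::free(buf);
}
```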
Follow this 10-step allocation optimization process: 1) Profile allocations - use heap profiler (Massif, Heaptrack) to count allocations and measure overhead. 2) Identify hot allocators - find code paths with most frequent allocations. 3) Calculate allocation cost - typical malloc: 50-200 cycles; for small allocations, overhead dominates. 4) Consider object pooling - pre-allocate pool of objects, reuse instead of alloc/free. 5) Use arena allocators - allocate from contiguous region, free all at once; excellent for phase-based work. 6) Enable small block optimizers - jemalloc, tcmalloc have optimized small allocation paths. 7) Use stack allocation - alloca or VLAs for temporary, known-bounded data (be careful of stack overflow). 8) Batch allocations - allocate array of N objects instead of N separate allocations. 9) Consider custom allocators - for specific size classes, slab allocators eliminate fragmentation and overhead. 10) Reduce allocation count - reuse buffers, use string_view instead of string copies, pass by reference. Measure: profile allocation rate (per second) and total overhead; target <5% of CPU time on allocations.
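As an illustration of step 5, a minimal bump-pointer arena sketch; the Arena class and its interface are hypothetical, not from any particular library. Everything allocated from it is reclaimed in O(1) by reset() at the end of a phase:

```cpp
#include <cstddef>
#include <memory>

// Minimal bump-pointer arena: allocate from one contiguous block,
// free everything at once by resetting the offset.
class Arena {
public:
    explicit Arena(std::size_t capacity)
        : buffer_(new std::byte[capacity]), capacity_(capacity), offset_(0) {}

    void* allocate(std::size_t size, std::size_t alignment = alignof(std::max_align_t)) {
        std::size_t aligned = (offset_ + alignment - 1) & ~(alignment - 1);
        if (aligned + size > capacity_) return nullptr;  // out of arena space
        offset_ = aligned + size;
        return buffer_.get() + aligned;
    }

    void reset() { offset_ = 0; }  // "free" everything in O(1)

private:
    std::unique_ptr<std::byte[]> buffer_;
    std::size_t capacity_;
    std::size_t offset_;
};

int main() {
    Arena arena(1 << 20);                                            // 1 MiB for one processing phase
    int* ids = static_cast<int*>(arena.allocate(1000 * sizeof(int)));
    (void)ids;
    arena.reset();                                                   // end of phase: reclaim everything
}
```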
Follow this 12-point production monitoring checklist: 1) Define SLOs - establish latency percentiles (p50, p95, p99), throughput targets, error budgets. 2) Instrument code - add timing for critical paths, use low-overhead methods (RDTSC-based). 3) Sample, don't trace everything - 1-10% sampling rate sufficient for trends, avoids overhead. 4) Monitor system metrics - CPU utilization, memory usage, cache hit rates, context switches. 5) Set up alerting - alert when metrics approach SLO boundaries, not when violated. 6) Create dashboards - visualize key metrics over time, correlate with deployments. 7) Establish baselines - record normal performance ranges for comparison. 8) Enable on-demand profiling - mechanism to capture detailed profiles when issues detected. 9) Log slow requests - capture details of requests exceeding latency thresholds. 10) Track performance per version - correlate performance changes with code changes. 11) Monitor resource costs - track CPU-time, memory-hours, cost per request. 12) Regular performance reviews - periodic analysis of trends, capacity planning. Tools: Prometheus+Grafana for metrics, Jaeger/Zipkin for tracing, continuous profiling services (Parca, Pyroscope).
Follow this 12-point cloud optimization checklist: 1) Minimize cold start - for serverless: reduce package size, lazy load dependencies, use provisioned concurrency. 2) Optimize for spot/preemptible instances - checkpoint state, handle termination gracefully. 3) Right-size instances - profile to choose appropriate CPU/memory; oversized wastes money, undersized hurts performance. 4) Use appropriate storage tier - SSD for IOPS, HDD for throughput, object storage for large files. 5) Optimize network architecture - minimize cross-AZ traffic, use private endpoints, consider edge locations. 6) Implement connection pooling - database connections are expensive; reuse across invocations. 7) Cache aggressively - use Redis/Memcached for hot data; CDN for static content. 8) Batch operations - reduce round-trips to managed services which often have high per-request latency. 9) Monitor costs alongside performance - cloud bills can surprise; optimize for cost-performance ratio. 10) Use autoscaling effectively - set appropriate thresholds; don't scale on minor fluctuations. 11) Consider reserved capacity - committed use discounts for predictable workloads. 12) Profile regularly - cloud environment changes; re-profile after updates.
Follow this 10-step linked structure optimization process: 1) Profile access patterns - identify traversal patterns, random vs sequential access frequency. 2) Measure cache misses - linked structures typically have poor locality; quantify the cost. 3) Consider unrolled variants - unrolled linked lists store multiple elements per node, improving locality. 4) Pool allocations - allocate nodes from contiguous memory pool to improve spatial locality. 5) Prefetch next nodes - during traversal, prefetch next pointer's target: prefetch(node->next). 6) Consider cache-oblivious layouts - Van Emde Boas layout for trees improves cache efficiency. 7) Evaluate array-based alternatives - many linked structures can be replaced with arrays + indices. 8) Use flat storage for small structures - for trees with <1000 nodes, array-based often faster than pointer-based. 9) Pack related data - ensure frequently accessed fields are in same cache line as link pointers. 10) Consider converting to array on access - if structure is built once and traversed many times, copy to array for traversal. Key insight: modern CPUs heavily favor arrays due to prefetching; linked structures should be a last resort for performance-critical code.
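A sketch of steps 4-5, assuming GCC or Clang (__builtin_prefetch is a compiler builtin, not standard C++): nodes come from one contiguous pool, and traversal issues a prefetch for the next node before processing the current one.

```cpp
#include <cstddef>
#include <vector>

struct Node {
    int value;
    Node* next;
};

// Step 4: pool nodes in one contiguous vector so neighbours share cache lines.
std::vector<Node> build_pool(std::size_t n) {
    std::vector<Node> pool(n);
    for (std::size_t i = 0; i < n; ++i) {
        pool[i].value = static_cast<int>(i);
        pool[i].next  = (i + 1 < n) ? &pool[i + 1] : nullptr;
    }
    return pool;
}

// Step 5: prefetch the next node while processing the current one.
long sum_list(const Node* head) {
    long sum = 0;
    for (const Node* n = head; n != nullptr; n = n->next) {
        if (n->next) __builtin_prefetch(n->next, /*rw=*/0, /*locality=*/3);
        sum += n->value;
    }
    return sum;
}

int main() {
    auto pool = build_pool(1 << 16);
    return sum_list(pool.data()) > 0 ? 0 : 1;
}
```

With a contiguous pool the hardware prefetcher often does this on its own; the explicit hint matters more when nodes end up scattered across the heap.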
Follow this 10-step emergency performance triage process: 1) Acknowledge and communicate - alert relevant teams, set expectations for resolution time. 2) Assess impact scope - how many users affected, which services, severity? 3) Check for recent changes - deployments, config changes, traffic patterns in last 24 hours. 4) Gather quick metrics - CPU, memory, disk I/O, network, error rates across affected services. 5) Identify proximate cause - is it application code, database, network, external dependency? 6) Apply temporary mitigation - scale up, enable rate limiting, disable expensive features, rollback if recent change. 7) Capture diagnostic data - while issue is occurring: thread dumps, heap dumps, traces, logs. 8) Root cause analysis - after mitigation, investigate captured data to find actual cause. 9) Implement proper fix - address root cause, not just symptoms. 10) Post-mortem - document incident, timeline, root cause, fix, prevention measures. Key principle: mitigate first, debug second. A working system at reduced capacity is better than extended downtime while investigating.
Follow this 11-step distributed systems performance debugging process: 1) Define the symptom - quantify: which requests are slow, how slow, what percentile affected? 2) Collect distributed traces - use Jaeger, Zipkin, or vendor tracing to see request flow. 3) Identify slow spans - find which service/component contributes most latency. 4) Check for queuing - high queue time indicates insufficient capacity or slow downstream. 5) Profile the slow component - use local profiling techniques on identified service. 6) Check network latency - measure RTT between services; unexpected high latency indicates network issues. 7) Look for retry amplification - retries causing load increase, causing more retries. 8) Check for resource contention - CPU, memory, connections, thread pools hitting limits. 9) Analyze dependencies - slow downstream service can cause upstream queuing. 10) Check for data skew - uneven load distribution causing some instances to be hot. 11) Verify with fix - implement fix, monitor distributed traces to confirm improvement. Key insight: in distributed systems, latency is often dominated by network, queuing, and serialization - not CPU computation.
Follow this 9-step recursive algorithm optimization process: 1) Profile recursion depth and call frequency - identify how deep and how many total calls. 2) Check for tail recursion - if tail-recursive, compiler may optimize to iteration; enable with -O2. 3) Consider explicit iteration - convert to loop with manual stack if recursion overhead significant. 4) Implement memoization - cache results of expensive recursive calls; check if subproblems overlap. 5) Use trampolining - for languages without tail call optimization, avoid stack overflow with continuation-passing. 6) Optimize base cases - ensure base cases are fast; consider handling more cases in base to reduce depth. 7) Increase problem size per call - process more work per recursive call to reduce call overhead ratio. 8) Consider iterative deepening - for search problems, avoid deep recursion with bounded iterations. 9) Profile stack usage - deep recursion can cause cache misses on stack; consider explicit data structures. Trade-off: recursive code is often clearer; only optimize if profiling shows recursion overhead is significant. Memoization often gives biggest wins for overlapping subproblems (dynamic programming).
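A small sketch of step 4 using the classic overlapping-subproblem example; the cache is a plain vector indexed by argument, which is the cheapest memo structure when the key space is small and dense:

```cpp
#include <cstdint>
#include <vector>

// Naive recursion: O(2^n) calls because subproblems overlap.
std::uint64_t fib_naive(int n) {
    return n < 2 ? n : fib_naive(n - 1) + fib_naive(n - 2);
}

// Memoized version: each subproblem is computed once, O(n) calls.
std::uint64_t fib_memo(int n, std::vector<std::uint64_t>& cache) {
    if (n < 2) return n;
    if (cache[n] != 0) return cache[n];
    return cache[n] = fib_memo(n - 1, cache) + fib_memo(n - 2, cache);
}

int main() {
    std::vector<std::uint64_t> cache(91, 0);
    return fib_memo(90, cache) == 2880067194370816120ULL ? 0 : 1;
}
```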
Follow this diagnostic process when IPC is below 1.0: 1) Measure IPC precisely - use perf stat -e cycles,instructions to get accurate IPC value. 2) Check memory boundedness first - measure cache misses: perf stat -e L1-dcache-load-misses,LLC-load-misses. If LLC miss rate > 1%, memory latency is likely culprit. 3) If memory-bound: follow cache optimization checklist - improve locality, add prefetching, reduce working set. 4) Check branch mispredictions - perf stat -e branch-misses. If misprediction rate > 2%, branch prediction is culprit. 5) If branch-bound: follow branch optimization checklist - improve predictability, use branchless code. 6) Check for long dependency chains - look for serial arithmetic operations where each depends on previous result. 7) If dependency-bound: break chains with temporary variables, use associative transformations, increase ILP. 8) Check for execution port saturation - use VTune Port Utilization or pmu-tools. 9) If port-bound: use different instruction mix, reduce pressure on saturated ports. 10) Check for front-end issues - I-cache misses, complex instructions causing decode stalls. Low IPC is usually memory (most common), branches, or dependencies (in that order).
Use this 12-point checklist to validate benchmark methodology: 1) Representative workload - does benchmark reflect real-world usage patterns and data sizes? 2) Warm-up period - are caches and JIT compilers in steady state before measurement? 3) Sufficient iterations - do you have 30+ measurements for statistical validity? 4) Controlled environment - is CPU frequency fixed, Turbo Boost disabled, other processes minimized? 5) Thread affinity - are benchmark threads pinned to specific cores? 6) Memory state - is ASLR disabled or results averaged across multiple runs? 7) Statistical reporting - do you report mean, median, standard deviation, AND confidence intervals? 8) Baseline comparison - are you comparing against a known baseline, not just measuring absolute numbers? 9) Single variable - are you changing only one thing between compared benchmarks? 10) Result validation - does the benchmark compute correct results? Dead code elimination? 11) Hardware documentation - did you record exact CPU model, RAM speed, compiler version? 12) Reproducibility - can someone else run your benchmark and get similar results? If any point fails, benchmark results may be misleading or meaningless.
Follow this 10-step code size optimization checklist: 1) Measure code size - check binary size, use size command to see text segment. 2) Profile I-cache misses - use perf stat -e L1-icache-load-misses to measure instruction cache behavior. 3) Reduce function size - split large functions, move cold code to separate functions. 4) Use -Os flag - optimize for size; may be faster than -O3 if I-cache bound. 5) Avoid excessive inlining - inline only truly hot, small functions; large inlined functions bloat code. 6) Reduce template instantiations - templates generate separate code for each type; use type erasure where possible. 7) Use link-time optimization (LTO) - -flto enables cross-file optimization and dead code elimination. 8) Arrange functions by hotness - place hot functions together in memory; use profile-guided layout. 9) Consider compression - code compression techniques for embedded or cold code regions. 10) Review loop unrolling - excessive unrolling bloats code; tune unroll factor considering I-cache. Trade-off: smaller code has better I-cache hit rate but may have more instructions executed; profile both.
Follow this 10-step loop tiling process: 1) Identify candidate loops - nested loops accessing multi-dimensional arrays with poor cache reuse. Classic example: matrix operations. 2) Analyze access pattern - understand which indices access which dimensions, identify stride patterns. 3) Calculate cache parameters - L1 data cache size (typically 32KB), line size (64 bytes), associativity. 4) Choose tile sizes - tile should fit in target cache level. For L1: total data per tile < 32KB. For L2: < 256KB. 5) Tile innermost data reuse dimension - place tile loops to maximize reuse within tile. 6) Handle boundary conditions - when array dimensions aren't divisible by tile size, handle remainder tiles. 7) Implement tiled loop nest - original: for(i) for(j) for(k) becomes: for(ii+=TI) for(jj+=TJ) for(i=ii) for(j=jj) for(k). 8) Verify correctness - tiled loop must compute same result as original. 9) Tune tile sizes - optimal sizes depend on problem size and cache hierarchy; try powers of 2: 16, 32, 64. 10) Measure improvement - profile cache miss rates before/after; L1 misses should decrease significantly for properly tiled loops.
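A sketch of steps 4-8 for matrix multiplication: TILE is chosen so three 32x32 blocks of doubles (about 24 KB) fit in a 32 KB L1, and std::min handles remainder tiles when N is not a multiple of TILE. The exact tile size is an assumption to tune per step 9:

```cpp
#include <algorithm>
#include <vector>

constexpr int TILE = 32;  // 32*32 doubles = 8 KB per block; three blocks fit in a 32 KB L1

// C += A * B for N x N row-major matrices, tiled for cache reuse.
void matmul_tiled(const std::vector<double>& A, const std::vector<double>& B,
                  std::vector<double>& C, int N) {
    for (int ii = 0; ii < N; ii += TILE)
        for (int kk = 0; kk < N; kk += TILE)
            for (int jj = 0; jj < N; jj += TILE)
                // Step 6: remainder tiles handled by clamping with std::min.
                for (int i = ii; i < std::min(ii + TILE, N); ++i)
                    for (int k = kk; k < std::min(kk + TILE, N); ++k) {
                        double a = A[i * N + k];
                        for (int j = jj; j < std::min(jj + TILE, N); ++j)
                            C[i * N + j] += a * B[k * N + j];
                    }
}

int main() {
    int N = 100;
    std::vector<double> A(N * N, 1.0), B(N * N, 1.0), C(N * N, 0.0);
    matmul_tiled(A, B, C, N);
    return C[0] == N ? 0 : 1;   // step 8: each C element should equal N for all-ones inputs
}
```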
Follow this 10-step sort optimization process: 1) Profile current sort - measure comparison count, swap count, cache misses. 2) Check data characteristics - size, distribution (nearly sorted, random, many duplicates), key type. 3) For small arrays (<50 elements) - use insertion sort; best for small N due to low overhead. 4) For medium arrays (50-1000) - quicksort variants with good pivot selection (median-of-three). 5) For large arrays - consider cache-aware algorithms like cache-oblivious mergesort or radix sort for integers. 6) For nearly sorted data - insertion sort or timsort which detects and exploits runs. 7) For many duplicates - use 3-way partitioning quicksort (Dutch National Flag). 8) Optimize comparison function - inline, minimize branches, compare discriminating fields first. 9) Use SIMD for sorting networks - for fixed small sizes, SIMD sorting networks are very fast. 10) Consider radix sort for integers/strings - O(nk) can beat O(n log n) for appropriate data. Benchmark standard library sort (usually introsort) as baseline; often highly optimized already.
Follow this 8-step register optimization process: 1) Identify register pressure issues - look for spill/fill instructions in generated assembly, or use compiler reports. 2) Count live variables - at any point, live variables > available registers causes spills. x86-64 has 16 general-purpose, 16 vector registers. 3) Reduce variable lifetimes - compute values as late as possible, use values as soon as computed. 4) Split hot and cold paths - keep register-intensive code on hot path; cold paths can spill. 5) Use function parameters wisely - first 6 integer args and first 8 FP args passed in registers (x86-64 SysV ABI). 6) Consider manual register hints - use register keyword (hint only), or inline assembly for critical sections. 7) Restructure code - split large functions, move invariants outside loops, reduce nested scopes. 8) Check compiler options - -freg-struct-return, -mprefer-vector-width affect register usage. Key insight: modern compilers have excellent register allocators; manual intervention rarely helps except in extreme cases. Focus on reducing variable count and scope instead.
Follow this 11-step profile-guided optimization process: 1) Build baseline without PGO - measure current performance for comparison. 2) Instrument code for profiling - GCC: -fprofile-generate, Clang: -fprofile-instr-generate. 3) Design representative workload - must cover typical production patterns, not just unit tests. 4) Run instrumented binary - execute with representative workload to collect profile data. 5) Verify coverage - ensure hot code paths were executed; check coverage reports. 6) Build with profile data - GCC: -fprofile-use, Clang: -fprofile-instr-use=profile.profdata. 7) Measure improvement - compare PGO build against baseline on same workload. 8) Verify correctness - run full test suite on PGO build. 9) Establish profile update cadence - re-profile when code or usage patterns change significantly. 10) Automate in CI/CD - integrate profile collection and PGO builds into build pipeline. 11) Consider BOLT - post-link optimization using perf data for additional gains (5-15% typical). Expected gains: 10-30% for branch-heavy code like compilers, interpreters; less for compute-bound numeric code.
Follow this 9-step cache thrashing diagnosis and fix process: 1) Detect thrashing symptoms - high cache miss rate despite working set appearing to fit in cache. 2) Profile cache behavior - use Cachegrind or perf to measure actual miss rates vs expected. 3) Identify conflict misses - thrashing occurs when multiple data items map to same cache set, evicting each other. 4) Analyze access patterns - find which arrays/structures are accessed together and their addresses. 5) Check alignment/stride - power-of-2 strides with power-of-2 array sizes cause conflict misses. 6) Calculate conflict addresses - items conflict if: (address1 / cache_line_size) mod num_sets == (address2 / cache_line_size) mod num_sets. 7) Apply padding - add padding to shift one array's addresses to different sets. Typical: add cache_line_size bytes. 8) Use different allocation strategy - malloc may align large allocations; use memalign with offset. 9) Verify fix - re-profile to confirm miss rate decreased and performance improved. Example: two 4KB arrays both at 4KB-aligned addresses conflict in every L1 set; padding one by 64 bytes eliminates conflicts.
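A sketch of step 7, following the example at the end of this checklist: two 4 KB arrays placed back to back map corresponding elements to the same cache sets, and one cache line of padding shifts the second array onto different sets. Whether the unpadded layout actually thrashes depends on associativity, so treat this as an illustration of the padding technique rather than a guaranteed win:

```cpp
#include <cstddef>

constexpr std::size_t N = 1024;          // 4 KB of floats: a power-of-2 size
constexpr std::size_t LINE = 64;         // cache line size in bytes

// Without padding, a[i] and b[i] sit exactly 4 KB apart, so corresponding
// elements map to the same cache set for every i.
struct Unpadded {
    float a[N];
    float b[N];
};

// Step 7: one cache line of padding shifts b onto different sets than a.
struct Padded {
    float a[N];
    char  pad[LINE];
    float b[N];
};

// The hot loop that touches both arrays element-by-element.
float dot(const float* a, const float* b, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) sum += a[i] * b[i];
    return sum;
}

int main() {
    static Padded p{};                   // zero-initialized; layout is what matters here
    return dot(p.a, p.b, N) == 0.0f ? 0 : 1;
}
```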
Follow this 12-point GPU kernel optimization checklist: 1) Profile with vendor tools - NVIDIA Nsight Compute, AMD ROCm Profiler, or Intel VTune GPU. 2) Check occupancy - percentage of GPU resources utilized; low occupancy wastes parallelism. 3) Identify bottleneck type - compute-bound, memory-bound, or latency-bound (use roofline analysis). 4) Optimize memory access patterns - ensure coalesced memory access; avoid strided or random patterns. 5) Use shared memory - for data reused across threads in a block; reduces global memory traffic. 6) Minimize divergence - avoid branches where threads in warp take different paths. 7) Optimize block/grid dimensions - tune for occupancy and memory access patterns. 8) Use appropriate precision - FP16/BF16 faster than FP32 on many GPUs; use if accuracy allows. 9) Overlap compute and transfer - async memory copies while kernels execute. 10) Reduce CPU-GPU synchronization - batch operations to minimize sync points. 11) Use tensor cores where applicable - for matrix operations, specialized hardware gives huge speedup. 12) Profile power and thermal - throttling due to heat reduces performance; check sustained performance vs peak. Compare against theoretical limits: memory bandwidth, compute throughput for target GPU.
Follow this 12-step A/B performance comparison process: 1) Define hypothesis - state expected improvement and metrics to measure. 2) Control environment - identical hardware, OS config, background load for both runs. 3) Pin CPU frequency - disable Turbo Boost and frequency scaling. 4) Warm up - run code before measurement to stabilize caches and JIT. 5) Measure baseline (A) - run configuration A, record multiple trials (minimum 30). 6) Calculate baseline statistics - mean, median, standard deviation, 95% confidence interval. 7) Measure treatment (B) - identical conditions, same number of trials. 8) Calculate treatment statistics - same metrics as baseline. 9) Perform statistical test - use t-test if normally distributed, Mann-Whitney otherwise. Check p-value < 0.05 for significance. 10) Calculate effect size - percent improvement and confidence interval on the difference. 11) Verify non-overlapping confidence intervals - if intervals overlap, difference may not be significant. 12) Document results - record all details: hardware, software versions, metrics, statistical analysis, raw data. Report inconclusive results honestly; negative results are still results.
Follow this 10-step loop unrolling checklist: 1) Verify loop is hot - don't unroll cold loops; consumes code cache for no benefit. 2) Check compiler auto-unrolling - compile with -funroll-loops and check if already unrolled. 3) Measure baseline - record cycles per iteration, IPC, code size. 4) Assess unroll benefit potential - highest benefit when: loop overhead significant vs body, iterations are independent, body has instruction-level parallelism. 5) Choose unroll factor - start with 2x or 4x; optimal is often power of 2. 6) Handle remainder - for N iterations with unroll factor U: handle N%U iterations separately (epilog loop or Duff's device). 7) Check register pressure - unrolling increases live variables; if registers spill to stack, benefit is lost. 8) Verify vectorization interaction - unrolling may enable or prevent auto-vectorization depending on structure. 9) Measure result - compare cycles/iteration, watch for: code bloat causing I-cache misses, register spills. 10) Document decision - record unroll factor and why, as optimal may change with compiler versions or hardware. Rule of thumb: if unrolled code > 64 instructions, risk of I-cache pressure increases.
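A sketch of steps 5-6: a 4x manual unroll with a scalar epilog covering the n % 4 remainder. In practice, check first (step 2) whether the compiler already does this with -funroll-loops:

```cpp
#include <cstddef>

// y[i] = a * x[i] unrolled by 4, with a scalar epilog for the n % 4 remainder.
void scale4(float* y, const float* x, float a, std::size_t n) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {         // main body: 4 independent operations per iteration
        y[i + 0] = a * x[i + 0];
        y[i + 1] = a * x[i + 1];
        y[i + 2] = a * x[i + 2];
        y[i + 3] = a * x[i + 3];
    }
    for (; i < n; ++i) y[i] = a * x[i];  // epilog handles the remainder
}

int main() {
    float x[7] = {1, 2, 3, 4, 5, 6, 7}, y[7];
    scale4(y, x, 2.0f, 7);
    return y[6] == 14.0f ? 0 : 1;
}
```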
Follow this 7-step TMAM analysis process: 1) Run top-level analysis - use VTune or pmu-tools toplev.py to get percentages for: Retiring (useful work), Bad Speculation (wasted work), Frontend Bound (instruction fetch issues), Backend Bound (execution issues). 2) Identify dominant category - focus on the category with highest percentage (often Backend Bound for unoptimized code). 3) If Frontend Bound dominant: drill into Frontend Latency (I-cache misses, ITLB misses) vs Frontend Bandwidth (decoder throughput). Fix: simplify code, improve instruction locality. 4) If Bad Speculation dominant: drill into Branch Mispredict vs Machine Clears. Fix: improve branch predictability, avoid self-modifying code. 5) If Backend Bound dominant: drill into Memory Bound vs Core Bound. 6) If Memory Bound: check L1/L2/L3 Bound and DRAM Bound to identify which level. Fix: improve cache usage, prefetch, reduce memory traffic. 7) If Core Bound: check Port Utilization to find saturated execution units. Fix: reduce dependency chains, increase ILP, use different instruction mix. Key insight: only optimize the bottleneck category - improving non-bottleneck areas provides no benefit.
Follow this 8-step false sharing prevention checklist: 1) Identify potential false sharing - look for arrays or structures where different threads write to adjacent elements. 2) Understand cache line size - typically 64 bytes on modern x86; false sharing occurs when threads write to same 64-byte region. 3) Profile for false sharing - look for high cache coherency traffic, poor scaling with thread count, or use tools like Intel Inspector. 4) Pad per-thread data - add padding to ensure each thread's data is in its own cache line: struct alignas(64) ThreadData { int data; char padding[60]; }; 5) Use thread-local storage - thread_local in C++, __thread in GCC, or explicit thread-local arrays. 6) Restructure arrays - instead of data[thread_id].field, use field[thread_id] with proper spacing. 7) Avoid sharing where possible - duplicate read-only data per thread rather than sharing single copy if write contention occurs. 8) Verify fix - re-profile to confirm cache coherency traffic reduced and scaling improved. Common pitfall: padding increases memory usage; only pad frequently-written data in hot paths.
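A sketch of step 4, assuming C++17: each per-thread counter is forced onto its own 64-byte cache line with alignas(64), so relaxed increments from different threads never invalidate each other's line:

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Step 4: each thread's counter occupies its own 64-byte cache line,
// so writes from different threads don't ping-pong a shared line.
struct alignas(64) PaddedCounter {
    std::atomic<long> value{0};
};

int main() {
    constexpr int kThreads = 4;
    std::vector<PaddedCounter> counters(kThreads);

    std::vector<std::thread> workers;
    for (int t = 0; t < kThreads; ++t)
        workers.emplace_back([&counters, t] {
            for (int i = 0; i < 1'000'000; ++i)
                counters[t].value.fetch_add(1, std::memory_order_relaxed);
        });
    for (auto& w : workers) w.join();

    long total = 0;
    for (auto& c : counters) total += c.value.load();
    return total == kThreads * 1'000'000 ? 0 : 1;
}
```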
Follow this 11-step scalar-to-vector transformation process: 1) Identify vectorization target - loop or repeated computation with data parallelism. 2) Verify independence - iterations must be independent; no loop-carried dependencies (or handle with reductions). 3) Choose vector width - match hardware: SSE=4 floats, AVX2=8 floats, AVX-512=16 floats. 4) Handle alignment - align arrays to vector width boundaries. 5) Transform data access - convert scalar loads to vector loads: float a = arr[i] becomes __m256 a = _mm256_load_ps(&arr[i]). 6) Transform operations - scalar ops to vector: a+b becomes _mm256_add_ps(a,b). 7) Handle horizontal operations - reductions (sum, max) need special handling with horizontal adds. 8) Handle conditionals - replace if/else with blending: _mm256_blendv_ps. 9) Handle loop remainder - iterations not divisible by vector width need scalar epilog or masked operations. 10) Verify correctness - compare vector results against scalar reference. 11) Benchmark - measure actual speedup; theoretical 8x rarely achieved due to memory bandwidth limits. Tools: Intel SDE for emulating unavailable instructions, compiler auto-vectorization reports for reference.
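A sketch of steps 5-9 for an element-wise multiply-add using AVX2 intrinsics (compile with -mavx2 and run on a CPU that supports it); unaligned loads are used so the arrays need no special allocation, and a scalar epilog covers the remainder:

```cpp
#include <immintrin.h>
#include <cstddef>

// y[i] += a[i] * b[i], vectorized 8 floats at a time with AVX2.
void madd_avx2(float* y, const float* a, const float* b, std::size_t n) {
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);            // step 5: vector loads
        __m256 vb = _mm256_loadu_ps(b + i);
        __m256 vy = _mm256_loadu_ps(y + i);
        vy = _mm256_add_ps(_mm256_mul_ps(va, vb), vy); // step 6: vector arithmetic
        _mm256_storeu_ps(y + i, vy);
    }
    for (; i < n; ++i) y[i] += a[i] * b[i];            // step 9: scalar epilog for n % 8
}

int main() {
    float a[11], b[11], y[11];
    for (int i = 0; i < 11; ++i) { a[i] = 1.0f; b[i] = 2.0f; y[i] = 0.0f; }
    madd_avx2(y, a, b, 11);
    return y[10] == 2.0f ? 0 : 1;   // step 10: check against the scalar result
}
```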
Follow this 13-point multi-threaded profiling checklist: 1) Profile single-threaded first - establish baseline, understand serial hot spots. 2) Use thread-aware profiler - VTune Threading analysis, perf with -t flag, or dedicated tools like Helgrind. 3) Measure scalability - run with 1, 2, 4, 8, N threads; calculate speedup = T1/TN. 4) Identify scalability limiters - is speedup sub-linear? Look for serial sections, contention, imbalanced work. 5) Profile lock contention - use perf lock, VTune Locks and Waits, or mutex debugging libraries. 6) Check for false sharing - if performance degrades with more threads, look for shared cache lines. 7) Verify work balance - check per-thread execution times; imbalance causes idle waiting. 8) Profile context switches - high context switch rate indicates excessive locking or thread thrashing. 9) Check NUMA effects - threads accessing remote memory have higher latency; use numactl to diagnose. 10) Measure synchronization overhead - time spent in barriers, locks, atomic operations. 11) Check memory bandwidth saturation - may limit scaling even if CPUs available. 12) Profile with production thread count - behavior may differ from small-scale testing. 13) Use timeline views - visualize thread activity over time to spot idle periods and synchronization patterns.
Follow this 10-step critical path latency reduction process: 1) Identify critical path - sequence of dependent operations determining minimum execution time. Use dependency analysis or VTune critical path view. 2) Measure path latency - sum instruction latencies along critical path. 3) Look for high-latency instructions - division (15-50 cycles), some shuffles, memory accesses on cache miss. 4) Replace high-latency ops - use approximations (rsqrt instead of 1/sqrt), lookup tables, or different algorithms. 5) Break dependency chains - use associativity to parallelize: (a+b)+(c+d) instead of a+b+c+d. 6) Use instruction-level parallelism - interleave independent operations to fill pipeline while waiting for dependencies. 7) Reduce memory latency impact - prefetch, improve locality, keep critical data in registers. 8) Avoid false dependencies - use different registers/variables for independent computations so write-after-read and write-after-write (WAR/WAW) hazards don't serialize them. 9) Consider out-of-order execution - modern CPUs reorder, but ROB size limits how far ahead they can look (~200 instructions). 10) Measure improvement - verify critical path shortened and latency reduced. Critical insight: shortening the critical path matters most when execution is serialized on that chain; when many independent instances are in flight, out-of-order execution hides latency and throughput is limited by execution resources instead.
Follow this 10-step lock-free optimization process: 1) Profile contention - measure CAS failure rates, retry counts, throughput under contention. 2) Understand memory ordering - verify correct use of acquire/release/seq_cst; incorrect ordering causes bugs. 3) Minimize atomic operations - reduce number of atomics in hot path; batch updates where possible. 4) Use appropriate memory orders - relaxed is cheapest, acquire/release sufficient for most cases, avoid seq_cst unless necessary. 5) Avoid false sharing - pad atomic variables to separate cache lines. 6) Consider DCAS vs CAS - some algorithms need double-word CAS; x86 has CMPXCHG16B. 7) Profile cache line bouncing - high cache coherency traffic indicates contention; consider partitioning. 8) Use backoff strategies - exponential backoff on CAS failure reduces contention. 9) Consider hybrid approaches - lock-free for common case, fallback to locks for rare cases. 10) Benchmark against locked alternatives - lock-free isn't always faster; mutex with short critical section may be better under low contention. Correctness first: lock-free bugs are subtle and hard to reproduce; use formal verification or extensive testing.
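A sketch of steps 4 and 8: an atomic-maximum helper implemented as a CAS retry loop with exponential backoff; the function name atomic_max and the yield-based backoff are illustrative choices, not a library API:

```cpp
#include <atomic>
#include <thread>

// Lock-free "store the maximum" via a CAS retry loop.
// Step 8: exponential backoff after a failed CAS reduces cache-line ping-pong under contention.
void atomic_max(std::atomic<long>& target, long value) {
    long current = target.load(std::memory_order_relaxed);
    int backoff = 1;
    while (current < value &&
           !target.compare_exchange_weak(current, value,
                                         std::memory_order_release,   // step 4: release on success
                                         std::memory_order_relaxed)) {
        // compare_exchange_weak reloads `current` on failure.
        for (int i = 0; i < backoff; ++i) std::this_thread::yield();
        if (backoff < 64) backoff *= 2;
    }
}

int main() {
    std::atomic<long> high_water{0};
    std::thread t1([&] { for (long v = 0; v < 10000; ++v) atomic_max(high_water, v); });
    std::thread t2([&] { for (long v = 10000; v > 0; --v) atomic_max(high_water, v); });
    t1.join(); t2.join();
    return high_water.load() == 10000 ? 0 : 1;
}
```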
Follow this 9-step function call overhead reduction process: 1) Profile call frequency - identify functions called millions of times in hot paths. 2) Measure call overhead - typical overhead: 5-20 cycles for direct call, 10-30 for indirect/virtual, 50-100+ if cache miss on code. 3) Consider inlining - for small functions (<10 instructions), inlining eliminates call overhead entirely. Use inline keyword, -finline-functions, or __attribute__((always_inline)). 4) Check compiler inlining decisions - use -Winline (GCC) to see what wasn't inlined and why. 5) Reduce argument passing overhead - pass large structures by pointer/reference, not value. On x86-64: first 6 integer args in registers, rest on stack. 6) Avoid unnecessary virtual calls - use final keyword for classes not inherited, devirtualization with -fdevirtualize. 7) Consider function pointers vs switch - switch may be faster for small fixed set of functions due to better branch prediction. 8) Batch calls - instead of calling function N times with single items, pass array of N items. 9) Verify benefit - measure actual improvement; modern CPUs predict calls well, so overhead may be less than expected. Focus on calls in innermost loops where overhead multiplies.
Apply loop optimizations in this recommended order: 1) Algorithm optimization - ensure you're using best algorithm; O(n) beats optimized O(n^2). 2) Loop-invariant code motion - move calculations that don't change per iteration outside loop. 3) Strength reduction - replace expensive ops (mul, div, mod) with cheaper ones (add, shift, mask). 4) Dead code elimination - remove calculations whose results are never used. 5) Loop interchange - for nested loops, reorder to improve memory access pattern. 6) Loop tiling/blocking - for cache locality when processing large arrays. 7) Loop fusion - combine adjacent loops with same bounds to improve locality. 8) Loop fission - split loop if it enables vectorization or reduces register pressure. 9) Loop unrolling - reduce loop overhead, expose ILP; try 2x, 4x factors. 10) Vectorization - SIMD for data-parallel operations. 11) Parallelization - multi-thread outer loops if sufficient work per thread. 12) Software pipelining - for VLIW or when other opts insufficient. Rationale: early opts may enable later ones (hoisting enables vectorization), algorithmic changes give biggest wins, parallelization adds complexity so comes last.
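A before/after sketch of steps 2-3 (a compiler at -O2 typically performs both, but the rewrite shows the intent): the loop-invariant 1/scale is hoisted, and the per-iteration multiply by stride is replaced with a running offset. Note that turning a division into a multiply by the reciprocal can change floating-point rounding, which is why compilers only do that part under fast-math:

```cpp
#include <cstddef>

// Before: divides by `scale` every iteration and multiplies i by stride for the index.
void normalize_naive(float* out, const float* in, std::size_t n,
                     std::size_t stride, float scale) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = in[i * stride] / scale;
}

// After: step 2 hoists the loop-invariant 1/scale; step 3 replaces the
// multiply with an additive running index (strength reduction).
void normalize_opt(float* out, const float* in, std::size_t n,
                   std::size_t stride, float scale) {
    const float inv = 1.0f / scale;      // loop-invariant code motion
    std::size_t idx = 0;                 // strength-reduced index
    for (std::size_t i = 0; i < n; ++i, idx += stride)
        out[i] = in[idx] * inv;
}

int main() {
    float in[8] = {2, 0, 4, 0, 6, 0, 8, 0}, a[4], b[4];
    normalize_naive(a, in, 4, 2, 2.0f);
    normalize_opt(b, in, 4, 2, 2.0f);
    for (int i = 0; i < 4; ++i) if (a[i] != b[i]) return 1;
    return 0;
}
```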
Follow this 11-step container optimization process: 1) Minimize image size - use multi-stage builds, alpine bases, remove unnecessary files. Smaller = faster pull. 2) Layer effectively - put rarely-changing layers first (dependencies) to maximize cache reuse. 3) Optimize application startup - same startup optimizations as native, plus container-specific considerations. 4) Set appropriate resource limits - too low causes throttling, too high wastes cluster resources. 5) Use init containers wisely - parallelize where possible, minimize init work. 6) Configure health checks properly - fast to execute, appropriate intervals to avoid unnecessary restarts. 7) Use local image caching - pre-pull images to nodes, use image caching solutions. 8) Optimize overlay filesystem - many small files perform poorly; consider init container to unpack. 9) Profile container overhead - compare performance in container vs bare metal; identify container-specific costs. 10) Use appropriate base images - distroless or scratch for minimal overhead. 11) Consider sidecar overhead - service meshes, log collectors add CPU/memory; profile their impact. Kubernetes-specific: tune kubelet, use node-local DNS caching, appropriate pod disruption budgets.
Follow this 12-step performance regression debugging process: 1) Confirm the regression - verify with multiple runs that performance actually degraded (not measurement noise). 2) Quantify the regression - measure exact percentage slowdown and which metrics changed (latency, throughput, CPU time). 3) Identify the regression window - use git bisect to find the exact commit that introduced the regression. 4) Compare profiles before/after - run same profiler on good and bad versions with identical workload. 5) Generate differential flame graph - visualize what got slower (red) vs faster (blue). 6) Analyze the diff - identify which functions show increased time. 7) Check for obvious causes - new allocations, additional logging, changed algorithms, new dependencies. 8) Profile specific metrics - if not obvious, profile cache misses, branch mispredictions, IPC separately. 9) Inspect code changes - review the diff of the regression commit for performance-impacting changes. 10) Form hypothesis - propose specific cause based on profile differences. 11) Verify hypothesis - create minimal reproducer or targeted benchmark. 12) Fix and verify - implement fix, confirm regression is resolved, ensure no new regressions introduced. Add regression test to CI to prevent recurrence.
Follow this 11-step data structure optimization process: 1) Profile access patterns - identify which fields are accessed together, frequency of operations (read/write/iterate). 2) Measure current performance - baseline cache miss rates, memory bandwidth usage. 3) Calculate structure size and alignment - use sizeof(), check for padding with offsetof(). 4) Identify hot fields - fields accessed in tight loops or critical paths. 5) Group hot fields together - reorder struct members so hot fields are in same cache line (64 bytes). 6) Consider AoS vs SoA - if iterating over one field of many objects, SoA (separate arrays per field) improves spatial locality. 7) Reduce structure size - use smaller types where possible (int16_t vs int32_t), use bitfields for flags, remove unused fields. 8) Align to cache lines - use alignas(64) for frequently accessed structures to avoid split accesses. 9) Consider padding to avoid false sharing - in multi-threaded code, pad to 64 bytes between thread-local data. 10) Evaluate pointer vs index - indices into arrays have better cache behavior than pointers to dispersed memory. 11) Benchmark alternatives - test different layouts with production-like workload, as optimal depends on access pattern.
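A sketch of step 6: summing one field of many objects with both layouts. In the AoS version every 64-byte line holds only 16 useful bytes of x; in the SoA version the x array is streamed with unit stride:

```cpp
#include <vector>

// Array-of-Structures: iterating over x alone still pulls y, z and mass into cache.
struct ParticleAoS { float x, y, z, mass; };

// Structure-of-Arrays: each field is contiguous; summing x touches only x's cache lines.
struct ParticlesSoA {
    std::vector<float> x, y, z, mass;
};

float sum_x_aos(const std::vector<ParticleAoS>& p) {
    float s = 0;
    for (const auto& q : p) s += q.x;     // 16-byte stride: 1/4 of each cache line is useful
    return s;
}

float sum_x_soa(const ParticlesSoA& p) {
    float s = 0;
    for (float v : p.x) s += v;           // unit stride: every loaded byte is useful
    return s;
}

int main() {
    std::vector<ParticleAoS> aos(1000, {1, 0, 0, 1});
    ParticlesSoA soa{std::vector<float>(1000, 1.0f), {}, {}, {}};
    return sum_x_aos(aos) == sum_x_soa(soa) ? 0 : 1;
}
```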
Follow this 8-step loop fission process: 1) Identify fission candidates - single loop with multiple independent statement groups accessing different data. 2) Check dependency safety - statements in resulting separate loops must not have cross-loop dependencies. 3) Analyze benefit potential - fission beneficial when: different statements access different data (improved spatial locality), one statement group can be vectorized but others prevent it, register pressure in combined loop causes spills. 4) Identify split points - group statements that access same data or have dependencies together. 5) Implement fission - split single loop into multiple loops. Before: for(i) { A[i]=..; B[i]=..; } After: for(i) A[i]=..; for(i) B[i]=..; 6) Handle carried dependencies - if dependency exists, fission may require reordering or may not be legal. 7) Verify correctness - split loops must produce identical results to original. 8) Measure improvement - check vectorization status (compiler reports), cache behavior, overall performance. Fission is inverse of fusion; useful when combined loop body is too large for registers or prevents vectorization.
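A sketch of step 5: the fused loop mixes a streaming, vectorizable statement with an indirect histogram update; splitting them lets the first loop vectorize while the second keeps its scalar form:

```cpp
#include <cstddef>

// Before fission: one loop body touches two unrelated data streams.
void update_fused(float* a, int* hist, const float* src, const int* bucket, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        a[i] = src[i] * 2.0f;        // streaming, vectorizable
        hist[bucket[i]] += 1;        // indirect store, blocks vectorization
    }
}

// After fission: the first loop vectorizes cleanly; the second keeps its scalar gather/scatter.
void update_split(float* a, int* hist, const float* src, const int* bucket, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        a[i] = src[i] * 2.0f;
    for (std::size_t i = 0; i < n; ++i)
        hist[bucket[i]] += 1;
}

int main() {
    float src[4] = {1, 2, 3, 4}, a1[4], a2[4];
    int bucket[4] = {0, 1, 0, 1}, h1[2] = {0, 0}, h2[2] = {0, 0};
    update_fused(a1, h1, src, bucket, 4);
    update_split(a2, h2, src, bucket, 4);
    return (a1[3] == a2[3] && h1[0] == h2[0]) ? 0 : 1;   // step 7: identical results
}
```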
Follow this 9-step dependency chain optimization process: 1) Identify long dependency chains - look for serial sequences where each instruction depends on previous result. 2) Measure chain length - count instructions from first to last in chain; compare to loop iteration count. 3) Calculate chain latency - sum latencies of instructions in chain; this is minimum time per iteration. 4) Check IPC - low IPC with low cache misses often indicates dependency-bound code. 5) Break arithmetic chains - use associativity: sum = (a+b) + (c+d) instead of a+b+c+d allows parallel execution. 6) Use multiple accumulators - for reductions, use 4-8 independent accumulators, combine at end. 7) Unroll with independent computations - unroll loop and interleave independent iterations. 8) Reorder instructions - place independent instructions between dependent ones to fill latency. 9) Verify improvement - check IPC increased and execution time decreased. Example: summing array with single accumulator has N-instruction chain; with 4 accumulators, chain is N/4, allowing 4x parallelism until final combination.
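A sketch of steps 5-6 for the reduction example at the end of this checklist: four independent accumulators shorten the dependency chain by 4x and are combined once at the end. Reassociating a floating-point sum can change the result slightly, which is why compilers only do this automatically under -ffast-math:

```cpp
#include <cstddef>

// One accumulator: every add waits for the previous one (chain of n dependent adds).
double sum1(const double* x, std::size_t n) {
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i) s += x[i];
    return s;
}

// Four accumulators: four independent chains of roughly n/4 adds overlap in the pipeline.
double sum4(const double* x, std::size_t n) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i + 0];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; ++i) s0 += x[i];           // remainder
    return (s0 + s1) + (s2 + s3);            // combine at the end
}

int main() {
    double x[9] = {1, 1, 1, 1, 1, 1, 1, 1, 1};
    return (sum1(x, 9) == 9.0 && sum4(x, 9) == 9.0) ? 0 : 1;
}
```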
Follow this 14-step cache optimization checklist: 1) Measure baseline - profile L1/L2/L3 hit rates and miss penalties using perf or Cachegrind. 2) Understand cache parameters - L1: 32-64KB, 4-8 way, 64B lines; L2: 256KB-1MB; L3: shared, multi-MB. 3) Calculate working set size - total unique data accessed in hot regions. 4) If working set > cache: apply blocking/tiling to process data in cache-sized chunks. 5) If spatial locality poor: reorganize data layout for sequential access, convert AoS to SoA. 6) If temporal locality poor: reorder computations to reuse data while still in cache. 7) Align data structures - align to 64-byte cache line boundaries to avoid split accesses. 8) Avoid false sharing in parallel code - pad structures so different threads don't share cache lines. 9) Use prefetching for predictable access - software prefetch 200-400 cycles ahead of use. 10) Minimize pointer chasing - linearize linked structures or use indices into arrays. 11) Consider cache-oblivious algorithms - algorithms that work well regardless of cache size. 12) Profile TLB misses - if high, consider huge pages (2MB instead of 4KB). 13) Check NUMA locality - ensure data allocated on same node as accessing CPU. 14) Verify improvements - confirm cache miss reduction yields proportional speedup.
Use this 12-point performance documentation checklist: 1) Describe the problem - what was slow, how slow, under what conditions? 2) Include baseline measurements - exact numbers with methodology, environment details. 3) Explain root cause analysis - how did you identify the bottleneck? What tools used? 4) Document what you tried - including approaches that didn't work and why. 5) Describe the solution - what change was made, why does it work? 6) Include after measurements - same methodology as baseline for valid comparison. 7) Quantify improvement - percentage, absolute numbers, statistical significance. 8) Note tradeoffs - does optimization increase memory usage, code complexity, or reduce maintainability? 9) Specify applicability - when does this optimization apply? What conditions might make it not apply? 10) Add code comments - explain non-obvious optimization in code near the optimization. 11) Update performance tests - add regression tests to prevent future slowdowns. 12) Record environment details - compiler version, flags, hardware, OS - for reproducibility. Good documentation prevents: re-investigating same issues, accidentally reverting optimizations, applying optimizations where they don't help.
Follow this 9-step Roofline analysis checklist: 1) Generate roofline plot - use Intel Advisor or manual calculation with collected FLOPS and bandwidth data. 2) Identify kernel locations - each dot represents a kernel; position shows performance vs potential. 3) Check which roof limits each kernel - sloped (memory) or flat (compute) portion. 4) For kernels on memory roof (left/sloped): improve arithmetic intensity by: increasing cache reuse (tiling), reducing memory traffic, using smaller data types. 5) For kernels on compute roof (right/flat): improve by: vectorizing (wider SIMD), reducing instruction count, better ILP. 6) For kernels below all roofs: investigate microarchitectural bottlenecks - latency, port contention, cache conflicts. 7) Check distance from roof - large gap indicates significant optimization opportunity. 8) Prioritize by time - optimize kernels that take most time, not those furthest from roof. 9) Re-analyze after optimization - verify kernel moved closer to roof or shifted to different roof. Goal: kernels should approach respective roofs; reaching roof means near-optimal for that resource.
Follow this 10-step Valgrind usage checklist: 1) Build with debug info - compile with -g for source-level annotations. Keep -O2 for representative performance but consider -O0 for clearer traces. 2) Choose appropriate tool - Memcheck (default): memory errors; Cachegrind: cache simulation; Callgrind: call graph profiling; Massif: heap profiling; Helgrind/DRD: threading errors. 3) Expect slowdown - Memcheck: 10-50x slower, Cachegrind: 20-100x, Callgrind: 10-100x. Reduce input size accordingly. 4) Suppress known false positives - use suppression files to ignore expected errors from system libraries. 5) Run representative workload - even small inputs reveal cache behavior patterns. 6) Use annotation tools - cg_annotate for Cachegrind, callgrind_annotate for Callgrind, ms_print for Massif. 7) Focus on significant results - sort by impact (cache misses, time, allocations); ignore noise. 8) Check Cachegrind cache configuration - verify simulated cache matches target hardware: --I1, --D1, --LL options. 9) Use Callgrind with instrumentation control - CALLGRIND_START_INSTRUMENTATION/STOP to focus on hot code. 10) Interpret results carefully - Valgrind simulates, doesn't measure; use as relative comparison, not absolute numbers.
Follow this 10-point optimization correctness checklist: 1) Preserve semantics - optimized code must produce identical output for all valid inputs. 2) Test with original test suite - all existing tests must pass unchanged. 3) Add edge case tests - empty inputs, single elements, maximum sizes, boundary values. 4) Compare against reference - run both versions on diverse inputs, compare outputs exactly. 5) Test numerical stability - for floating-point, verify acceptable precision (may differ due to reassociation). 6) Check corner cases - negative numbers, zeros, NaN, infinity for floating-point. 7) Test with sanitizers - run with AddressSanitizer, UndefinedBehaviorSanitizer to catch subtle bugs. 8) Verify thread safety - if parallelized, test with ThreadSanitizer, various thread counts. 9) Stress test - large inputs, sustained load, verify no degradation or crashes. 10) Document semantic changes - if optimization intentionally changes behavior (fast-math), document clearly. Golden rule: never trust an optimization until independently verified; subtle bugs in optimized code can be worse than slow correct code.
Use this 15-point code review performance checklist: 1) N+1 queries - loops making database calls; should be batched. 2) Unbounded allocations - lists/buffers that can grow without limit. 3) String concatenation in loops - use StringBuilder or pre-sized buffer. 4) Synchronous I/O in hot path - should be async or moved off critical path. 5) Logging in tight loops - logging overhead adds up; use sampling or move outside loop. 6) Exception-based control flow - exceptions are expensive; don't use for normal cases. 7) Unnecessary boxing/unboxing - primitive to object conversion overhead. 8) Missing short-circuit evaluation - expensive checks should come last in conditions. 9) Unnecessary defensive copies - copying large objects 'just in case'. 10) Lock contention - locks held too long or too broadly. 11) Cache-unfriendly data structures - linked lists, hash tables with poor locality. 12) Repeated computation - same calculation done multiple times without caching. 13) Blocking on I/O - synchronously waiting for network/disk. 14) Inefficient algorithms - O(n^2) where O(n log n) exists. 15) Missing resource limits - unbounded queues, connection pools, thread creation. Not all patterns are problems everywhere; focus on hot paths identified by profiling.
Follow this 8-step hot spot identification process: 1) Choose appropriate profiler - use sampling profiler (perf, VTune, Instruments) for low overhead and statistical accuracy. 2) Run representative workload - use production-like data and usage patterns, not toy inputs. 3) Ensure sufficient samples - minimum 1000 samples in hot regions for statistical reliability (run longer if needed). 4) Generate initial report - sort functions by CPU time (self time first, then including callees). 5) Identify hot spots - functions consuming >5% of total time are candidates for optimization. 6) Analyze call context - use call graph to understand HOW hot functions are reached (same function called from multiple paths may have different optimization strategies). 7) Generate flame graph - visual overview shows widest boxes as hottest paths, helps spot unexpected hot code. 8) Drill down with annotation - use perf annotate or VTune source view to find hot instructions within hot functions. Remember: some hot spots are fundamental (main computation) and cannot be eliminated; focus on reducible overhead and inefficiencies.
Follow this systematic 8-step process: 1) Establish baseline metrics - measure current performance with representative workloads, record wall time, CPU time, memory usage, and throughput. 2) Profile to identify hotspots - use sampling profiler (perf, VTune) to find where >10% of time is spent. 3) Categorize the bottleneck - determine if CPU-bound (high CPU utilization), memory-bound (high cache misses), I/O-bound (high wait time), or branch-bound (high misprediction rate). 4) Analyze the specific bottleneck - if memory-bound, profile cache behavior; if CPU-bound, check IPC and instruction mix. 5) Form optimization hypothesis - propose specific change based on bottleneck type. 6) Implement targeted fix - make ONE change at a time for measurable impact. 7) Measure improvement - re-profile with same workload, compare metrics quantitatively. 8) Iterate - if target not met, return to step 2; new hotspots may emerge after optimization. Critical rule: never optimize without profiling data first.
Follow this 11-point ARM optimization checklist: 1) Identify target ARM variant - ARMv7, ARMv8 (AArch64), or specific cores (Cortex-A72, Apple M1). 2) Check NEON availability - ARM's SIMD extension; 128-bit vectors on ARMv7/v8, wider on some v9. 3) Use ARM-appropriate compiler flags - -march=armv8-a+simd, -mcpu=native for target-specific. 4) Consider in-order vs out-of-order - many ARM cores are in-order; instruction scheduling matters more. 5) Optimize for different cache sizes - ARM cores vary widely in cache configuration. 6) Use intrinsics for NEON - <arm_neon.h> for portable SIMD; different from x86 intrinsics. 7) Check alignment requirements - some ARM implementations require aligned SIMD access. 8) Consider big.LITTLE - heterogeneous cores; schedule appropriately for efficiency vs performance cores. 9) Profile on target hardware - ARM simulators don't capture real performance; test on actual devices. 10) Watch for endianness - some ARM can be big or little endian; verify code handles correctly. 11) Use SVE/SVE2 where available - Scalable Vector Extension on ARMv9; variable vector lengths. Apple Silicon: specific optimizations for M1/M2/M3 differ from Android ARM SoCs.
Follow this 11-step real-time media optimization process: 1) Define latency budget - audio typically <10ms, video <33ms for 30fps; allocate per stage. 2) Use lock-free communication - between capture/process/playback threads to avoid priority inversion. 3) Pre-allocate all buffers - no runtime allocation; use circular buffers for streaming. 4) Avoid syscalls in processing - no file I/O, logging, or allocation in real-time thread. 5) Use SIMD extensively - media processing (filters, transforms) benefits greatly from vectorization. 6) Pin to dedicated cores - isolate real-time threads from system interrupts and other processes. 7) Use appropriate thread priority - SCHED_FIFO on Linux, THREAD_PRIORITY_TIME_CRITICAL on Windows. 8) Profile worst-case latency - measure 99.9th percentile, not average; spikes cause glitches. 9) Handle overload gracefully - drop frames rather than accumulating latency. 10) Use specialized DSP libraries - IPP, FFTW, vendor-specific audio libraries are highly optimized. 11) Test under stress - full CPU load, memory pressure, disk activity to find edge cases. Real-time constraint: every processing deadline must be met; missing one causes audible/visible glitch.
Follow this 10-step loop unrolling checklist: 1) Verify loop is hot - don't unroll cold loops; consumes code cache for no benefit. 2) Check compiler auto-unrolling - compile with -funroll-loops and check if already unrolled. 3) Measure baseline - record cycles per iteration, IPC, code size. 4) Assess unroll benefit potential - highest benefit when: loop overhead significant vs body, iterations are independent, body has instruction-level parallelism. 5) Choose unroll factor - start with 2x or 4x; optimal is often power of 2. 6) Handle remainder - for N iterations with unroll factor U: handle N%U iterations separately (epilog loop or Duff's device). 7) Check register pressure - unrolling increases live variables; if registers spill to stack, benefit is lost. 8) Verify vectorization interaction - unrolling may enable or prevent auto-vectorization depending on structure. 9) Measure result - compare cycles/iteration, watch for: code bloat causing I-cache misses, register spills. 10) Document decision - record unroll factor and why, as optimal may change with compiler versions or hardware. Rule of thumb: if unrolled code > 64 instructions, risk of I-cache pressure increases.
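A sketch of steps 5-6, assuming a simple independent-iteration kernel; the 4x factor is a starting point, not a recommendation:

    #include <cstddef>

    // 4x manual unroll with a scalar epilog covering the n % 4 remainder iterations.
    void mul_arrays(const float* a, const float* b, float* out, std::size_t n) {
        std::size_t i = 0;
        for (; i + 4 <= n; i += 4) {                  // main unrolled body
            out[i]     = a[i]     * b[i];
            out[i + 1] = a[i + 1] * b[i + 1];
            out[i + 2] = a[i + 2] * b[i + 2];
            out[i + 3] = a[i + 3] * b[i + 3];
        }
        for (; i < n; ++i) out[i] = a[i] * b[i];      // epilog: remaining 0-3 iterations
    }

Check the generated assembly: with -O3 the compiler may already unroll and vectorize the plain loop, in which case the manual version only adds maintenance cost.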
Follow this 11-step SIMD vectorization process: 1) Identify vectorization candidate - find loops with independent iterations and arithmetic operations. 2) Check auto-vectorization first - compile with -O3 -march=native and check compiler reports (gcc: -fopt-info-vec-all, icc: -qopt-report=5). 3) If not auto-vectorized, identify blockers - look for: non-unit stride, data dependencies, function calls, complex control flow. 4) Simplify loop structure - remove conditionals, ensure trip count is known or add a trip-count hint (e.g., Intel's #pragma loop_count). 5) Ensure proper alignment - align arrays to 16/32/64 bytes matching SIMD width (SSE=16, AVX=32, AVX-512=64). 6) Eliminate pointer aliasing - use restrict keyword or #pragma ivdep to tell compiler pointers don't alias. 7) Handle remainder iterations - decide between scalar epilog, masked operations, or padding. 8) Choose SIMD width - SSE (128-bit), AVX2 (256-bit), or AVX-512 (512-bit) based on target CPUs. 9) If manual vectorization needed - use intrinsics (_mm256_*), start with loads/stores, then operations. 10) Verify correctness - compare results against scalar version. 11) Benchmark - measure actual speedup, as theoretical gains may not materialize due to memory bandwidth limits.
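A sketch of points 5 and 6 for an auto-vectorization-friendly kernel; __restrict is the GCC/Clang spelling of restrict in C++, and std::aligned_alloc is C++17 (MSVC uses _aligned_malloc instead):

    #include <cstdlib>
    #include <cstddef>

    // Unit stride, independent iterations, and no-alias guarantees let the compiler vectorize.
    void saxpy(float* __restrict y, const float* __restrict x, float a, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            y[i] += a * x[i];
    }

    // 32-byte-aligned allocation for AVX2; aligned_alloc requires size to be a multiple of alignment.
    float* alloc_avx2_floats(std::size_t n) {
        std::size_t bytes = ((n * sizeof(float) + 31) / 32) * 32;
        return static_cast<float*>(std::aligned_alloc(32, bytes));
    }

Compile with -O3 -march=native -fopt-info-vec (GCC) and confirm the report shows the loop was vectorized.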
Follow this 8-step parallelization decision process: 1) Profile first - identify if code is CPU-bound; parallelizing I/O-bound code rarely helps. 2) Measure serial fraction - use Amdahl's Law: if 50% is serial, max speedup is 2x regardless of core count. 3) Estimate parallelism - count independent operations; need more parallel work than available cores. 4) Calculate overhead - thread creation/joining costs ~1-10 microseconds, synchronization adds latency. 5) Decision point: if work per thread < 10,000 cycles, overhead likely exceeds benefit - do not parallelize. 6) Check memory bandwidth - if already memory-bound, more threads may contend for bandwidth without speedup. 7) Assess data sharing - heavy sharing requires synchronization; consider data partitioning or thread-local copies. 8) Choose parallelization strategy - task parallel (independent work items) or data parallel (same operation on partitioned data). For loops: parallelize outermost loop if possible to minimize fork/join overhead. Rule of thumb: parallelize when expected speedup > 2x and work per thread > 1 million cycles.
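A data-parallel sketch for step 8, assuming OpenMP (compile with -fopenmp); the threshold constant is illustrative and should come from measurement:

    #include <cstddef>

    // Parallelize only when there is enough work per thread to amortize fork/join overhead.
    void scale(double* data, std::size_t n, double factor) {
        const std::size_t kParallelThreshold = 100000;                 // illustrative cutoff; tune by benchmarking
        if (n < kParallelThreshold) {
            for (std::size_t i = 0; i < n; ++i) data[i] *= factor;     // serial path for small inputs
            return;
        }
        #pragma omp parallel for schedule(static)
        for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(n); ++i)
            data[i] *= factor;
    }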
Follow this 12-point GPU kernel optimization checklist: 1) Profile with vendor tools - NVIDIA Nsight Compute, AMD ROCm Profiler, or Intel VTune GPU. 2) Check occupancy - percentage of GPU resources utilized; low occupancy wastes parallelism. 3) Identify bottleneck type - compute-bound, memory-bound, or latency-bound (use roofline analysis). 4) Optimize memory access patterns - ensure coalesced memory access; avoid strided or random patterns. 5) Use shared memory - for data reused across threads in a block; reduces global memory traffic. 6) Minimize divergence - avoid branches where threads in warp take different paths. 7) Optimize block/grid dimensions - tune for occupancy and memory access patterns. 8) Use appropriate precision - FP16/BF16 faster than FP32 on many GPUs; use if accuracy allows. 9) Overlap compute and transfer - async memory copies while kernels execute. 10) Reduce CPU-GPU synchronization - batch operations to minimize sync points. 11) Use tensor cores where applicable - for matrix operations, specialized hardware gives huge speedup. 12) Profile power and thermal - throttling due to heat reduces performance; check sustained performance vs peak. Compare against theoretical limits: memory bandwidth, compute throughput for target GPU.
Follow this 9-step string optimization process: 1) Profile string operations - identify hot string functions (strcmp, strlen, memcpy, parsing). 2) Use SIMD string functions - modern libraries (glibc, MSVC) have AVX2-optimized implementations. 3) Avoid unnecessary allocations - reuse buffers, use string views/spans instead of copying. 4) Use small string optimization (SSO) - std::string typically stores strings <15-23 chars inline without heap allocation. 5) Process in bulk - don't call strlen in every loop iteration; compute once and store. 6) Use memcmp/memcpy for known lengths - faster than strcmp/strcpy when length is known. 7) Consider specialized algorithms - for pattern matching, use SIMD-optimized search (SWAR, SSE4.2 PCMPESTRI). 8) Minimize format string parsing - printf-style formatting is slow; consider faster alternatives or pre-formatting. 9) Use interned strings for comparisons - compare pointers instead of content for frequently compared strings. Benchmark: simple optimizations (avoiding redundant strlen) often give bigger wins than complex SIMD; measure before implementing complex solutions.
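A small sketch of points 3 and 5; the counting kernel is illustrative:

    #include <cstring>
    #include <string_view>
    #include <cstddef>

    // Anti-pattern: strlen re-scans the whole string every iteration, turning O(n) into O(n^2).
    std::size_t count_char_slow(const char* s, char c) {
        std::size_t hits = 0;
        for (std::size_t i = 0; i < std::strlen(s); ++i)
            if (s[i] == c) ++hits;
        return hits;
    }

    // Length is known once via string_view, and no copy of the string is made.
    std::size_t count_char_fast(std::string_view s, char c) {
        std::size_t hits = 0;
        for (char ch : s) if (ch == c) ++hits;
        return hits;
    }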
Follow these optimization steps for matrix multiplication in order: 1) Baseline - implement naive O(n^3) triple loop, measure GFLOPS. 2) Loop reordering - change loop order to i-k-j for row-major arrays (improves B access pattern). 3) Register blocking - unroll inner loops to keep small blocks in registers, reducing load/store. 4) Cache tiling - tile for L1 cache (typically 32x32 to 64x64 blocks for double precision). 5) Multi-level tiling - add L2 tile level for larger matrices. 6) SIMD vectorization - vectorize innermost loop using AVX2 (4 doubles) or AVX-512 (8 doubles). 7) Software pipelining - prefetch next tiles while computing current. 8) Parallelization - parallelize outer tile loops with OpenMP. 9) NUMA awareness - ensure data locality for thread-tile mapping. 10) Specialized handling - separate paths for small matrices (direct computation) vs large (blocked). Compare against BLAS library (OpenBLAS, MKL) as ultimate baseline; well-tuned BLAS typically achieves 80%+ of theoretical peak FLOPS.
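A sketch of step 2 only, assuming row-major storage and a zero-initialized C; later steps (tiling, SIMD, threading) build on this:

    // i-k-j ordering: the innermost loop walks B and C with unit stride, and A[i*n+k] is
    // loaded once per middle-loop iteration and reused across the whole inner loop.
    void matmul_ikj(const double* A, const double* B, double* C, int n) {
        for (int i = 0; i < n; ++i)
            for (int k = 0; k < n; ++k) {
                double a = A[i * n + k];
                for (int j = 0; j < n; ++j)
                    C[i * n + j] += a * B[k * n + j];
            }
    }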
Follow this 9-step software prefetching process: 1) Identify prefetch candidates - loops with: predictable access pattern, large data that doesn't fit cache, high cache miss rate on profiling. 2) Verify hardware prefetcher isn't sufficient - sequential and simple strided patterns are often handled automatically. 3) Calculate prefetch distance - prefetch D iterations ahead where D = memory_latency / loop_iteration_time. Typical: 100-300 cycles latency / 10-50 cycles per iteration = 3-30 iterations ahead. 4) Choose prefetch type - _mm_prefetch with _MM_HINT_T0 (all cache levels) for data used soon, _MM_HINT_NTA (non-temporal) for data used once. 5) Insert prefetch instructions - at loop start, prefetch data for iteration i+D while processing iteration i. 6) Avoid over-prefetching - too many prefetches can evict useful data and saturate memory bandwidth. 7) Handle loop boundaries - don't prefetch beyond array bounds (use conditional or limit prefetch iterations). 8) Measure impact - compare cache miss rate and runtime; prefetching should reduce misses without increasing total memory traffic significantly. 9) Test on target hardware - optimal distance varies by CPU; tune for deployment platform.
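A sketch of steps 4-5 and 7 for an indirect (gather-like) access pattern the hardware prefetcher cannot predict; the distance of 16 iterations is illustrative (step 9):

    #include <xmmintrin.h>
    #include <cstddef>

    void gather_scale(const int* index, const float* table, float* out, std::size_t n) {
        const std::size_t D = 16;                         // prefetch distance in iterations
        for (std::size_t i = 0; i < n; ++i) {
            if (i + D < n)                                // never prefetch past the end (step 7)
                _mm_prefetch(reinterpret_cast<const char*>(&table[index[i + D]]), _MM_HINT_T0);
            out[i] = table[index[i]] * 2.0f;              // irregular access: table[index[i]]
        }
    }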
Follow this 10-step branch optimization process for code with high branch density: 1) Profile branch behavior - identify which branches mispredicted most: perf record -e branch-misses. 2) Categorize branches - predictable (taken/not-taken >90% of time), moderately predictable (70-90%), unpredictable (<70%). 3) For predictable branches: use likely/unlikely hints (__builtin_expect), ensure hot path is fall-through. 4) For unpredictable data-dependent branches: consider branchless alternatives using bitwise ops, conditional moves, or arithmetic. 5) For loop-exit branches: use loop unswitching to move branch outside loop if condition is loop-invariant. 6) For multiple chained conditions: reorder to evaluate most likely first, or use short-circuit evaluation. 7) For switch statements: ensure compiler generates jump table for dense cases; consider profile-guided ordering. 8) For function dispatch: replace virtual calls with switch if small fixed set of types. 9) Use SIMD masking - process all paths simultaneously, select results with blend instructions. 10) Apply PGO - profile-guided optimization provides measured branch probabilities to compiler. Verify: branchless code may have higher latency per operation but removes unpredictability; measure to confirm improvement.
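A sketch of step 4: converting an unpredictable data-dependent branch into a data dependency. Compilers often emit a conditional move or blend for the ternary form, but this is not guaranteed, and the branchless version does slightly more work per element, so measure both (step 10):

    #include <cstddef>
    #include <cstdint>

    // Branchy version: mispredicts heavily when x[] is random relative to limit.
    std::int64_t sum_below_branchy(const std::int32_t* x, std::size_t n, std::int32_t limit) {
        std::int64_t s = 0;
        for (std::size_t i = 0; i < n; ++i)
            if (x[i] < limit) s += x[i];
        return s;
    }

    // Branchless version: the comparison feeds a select instead of a conditional jump.
    std::int64_t sum_below_branchless(const std::int32_t* x, std::size_t n, std::int32_t limit) {
        std::int64_t s = 0;
        for (std::size_t i = 0; i < n; ++i)
            s += (x[i] < limit) ? x[i] : 0;
        return s;
    }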
Follow this 9-step compiler optimization flag selection process: 1) Start with -O2 - balanced optimization, safe for production, good baseline. 2) Profile with -O2 - identify remaining hotspots and bottleneck types. 3) Try -O3 - adds aggressive optimizations (vectorization, unrolling); measure if it helps or hurts (code bloat can cause instruction cache misses). 4) Add -march=native - enables CPU-specific instructions (AVX2, etc.) for target machine; omit for portable binaries. 5) Consider -flto - Link-Time Optimization enables cross-file inlining and optimization; significant build time increase. 6) For floating-point heavy code: evaluate -ffast-math (may change results) or individual flags like -fno-math-errno. 7) Profile-Guided Optimization: compile with -fprofile-generate, run representative workload, recompile with -fprofile-use for best results. 8) Check for regressions - some optimizations can cause slowdowns; always measure. 9) Document final flags - record which flags used and why, include in build system. Avoid: -O0 in production (no optimization), -Os unless binary size critical (sacrifices speed), untested flag combinations.
Follow this 10-step startup optimization process: 1) Profile startup - use strace -tt -T to see syscalls, perf record from process start. 2) Identify phases - categorize: loading, parsing, initialization, warm-up. 3) Measure dynamic linking time - large number of shared libraries adds loading overhead; consider static linking. 4) Lazy initialize - defer initialization until first use; avoid loading unused features. 5) Reduce I/O - combine config files, use memory-mapped files, parallelize independent reads. 6) Optimize parsing - use faster parsers, binary formats instead of JSON/XML for large configs. 7) Use dlopen for optional features - load plugins on-demand rather than at startup. 8) Cache computed state - serialize initialized state to disk, reload on subsequent starts. 9) Parallelize initialization - independent subsystems can initialize concurrently. 10) Profile and eliminate - use flame graph of startup to identify and remove unnecessary work. Tools: systemd-analyze for service startup, bootchart for system-level, custom timestamps for application-level. Goal: minimize time-to-first-useful-output.
Follow this 15-step micro-benchmark creation checklist: 1) Define what you're measuring - single specific operation, not compound behavior. 2) Isolate the code under test - remove unrelated setup/teardown from timed region. 3) Prevent dead code elimination - use result (Blackhole.consume in JMH, DoNotOptimize in Google Benchmark). 4) Prevent constant folding - use runtime inputs, not compile-time constants. 5) Control iteration count - let framework determine iterations, don't add manual loops. 6) Warm up properly - run enough iterations for JIT compilation (JVM) or cache warming. 7) Pin CPU frequency - disable Turbo Boost and frequency scaling for consistency. 8) Pin threads to cores - use taskset/numactl to avoid migration. 9) Run sufficient trials - minimum 10 iterations, ideally 30+ for statistical validity. 10) Measure variance - report standard deviation and confidence intervals, not just mean. 11) Use appropriate time resolution - std::chrono::high_resolution_clock, QueryPerformanceCounter, or RDTSC. 12) Account for timer overhead - measure and subtract if measuring sub-microsecond operations. 13) Test multiple input sizes - performance may vary non-linearly with size. 14) Compare against baseline - always measure relative improvement, not just absolute numbers. 15) Document environment - record CPU model, compiler version, flags, OS version.
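A minimal Google Benchmark sketch covering points 3-5 and 13; the measured operation (std::accumulate) is just a placeholder:

    #include <benchmark/benchmark.h>
    #include <numeric>
    #include <vector>

    static void BM_Accumulate(benchmark::State& state) {
        std::vector<int> v(state.range(0), 1);            // runtime-sized input, not a compile-time constant
        for (auto _ : state) {                            // framework controls the iteration count
            int sum = std::accumulate(v.begin(), v.end(), 0);
            benchmark::DoNotOptimize(sum);                // keeps the result live, preventing dead code elimination
        }
        state.SetItemsProcessed(state.iterations() * state.range(0));
    }
    BENCHMARK(BM_Accumulate)->Arg(1 << 10)->Arg(1 << 16)->Arg(1 << 20);   // multiple input sizes
    BENCHMARK_MAIN();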
Follow this 9-step function call overhead reduction process: 1) Profile call frequency - identify functions called millions of times in hot paths. 2) Measure call overhead - typical overhead: 5-20 cycles for direct call, 10-30 for indirect/virtual, 50-100+ if cache miss on code. 3) Consider inlining - for small functions (<10 instructions), inlining eliminates call overhead entirely. Use the inline keyword (a hint, not a guarantee), -finline-functions, or __attribute__((always_inline)). 4) Check compiler inlining decisions - use -Winline (GCC) to see what wasn't inlined and why. 5) Reduce argument passing overhead - pass large structures by pointer/reference, not value. On x86-64 (System V ABI): first 6 integer args in registers, rest on stack; Windows x64 passes only the first 4 in registers. 6) Avoid unnecessary virtual calls - use final keyword for classes not inherited, devirtualization with -fdevirtualize. 7) Consider function pointers vs switch - switch may be faster for small fixed set of functions due to better branch prediction. 8) Batch calls - instead of calling function N times with single items, pass array of N items. 9) Verify benefit - measure actual improvement; modern CPUs predict calls well, so overhead may be less than expected. Focus on calls in innermost loops where overhead multiplies.
Follow this 7-step TMAM analysis process: 1) Run top-level analysis - use VTune or pmu-tools toplev.py to get percentages for: Retiring (useful work), Bad Speculation (wasted work), Frontend Bound (instruction fetch issues), Backend Bound (execution issues). 2) Identify dominant category - focus on the category with highest percentage (often Backend Bound for unoptimized code). 3) If Frontend Bound dominant: drill into Frontend Latency (I-cache misses, ITLB misses) vs Frontend Bandwidth (decoder throughput). Fix: simplify code, improve instruction locality. 4) If Bad Speculation dominant: drill into Branch Mispredict vs Machine Clears. Fix: improve branch predictability, avoid self-modifying code. 5) If Backend Bound dominant: drill into Memory Bound vs Core Bound. 6) If Memory Bound: check L1/L2/L3 Bound and DRAM Bound to identify which level. Fix: improve cache usage, prefetch, reduce memory traffic. 7) If Core Bound: check Port Utilization to find saturated execution units. Fix: reduce dependency chains, increase ILP, use different instruction mix. Key insight: only optimize the bottleneck category - improving non-bottleneck areas provides no benefit.
Follow this 10-step sort optimization process: 1) Profile current sort - measure comparison count, swap count, cache misses. 2) Check data characteristics - size, distribution (nearly sorted, random, many duplicates), key type. 3) For small arrays (<50 elements) - use insertion sort; best for small N due to low overhead. 4) For medium arrays (50-1000) - quicksort variants with good pivot selection (median-of-three). 5) For large arrays - consider cache-aware algorithms like cache-oblivious mergesort or radix sort for integers. 6) For nearly sorted data - insertion sort or timsort which detects and exploits runs. 7) For many duplicates - use 3-way partitioning quicksort (Dutch National Flag). 8) Optimize comparison function - inline, minimize branches, compare discriminating fields first. 9) Use SIMD for sorting networks - for fixed small sizes, SIMD sorting networks are very fast. 10) Consider radix sort for integers/strings - O(nk) can beat O(n log n) for appropriate data. Benchmark standard library sort (usually introsort) as baseline; often highly optimized already.
Follow this 10-step network I/O benchmarking process: 1) Define metrics - throughput (bytes/second), latency (time per operation), connections per second. 2) Establish baseline - use standard tools: iperf3 for throughput, ping/hping for latency. 3) Control network conditions - test on isolated network or emulate conditions with tc (traffic control). 4) Warm up connections - TCP slow start affects initial measurements; run warm-up period. 5) Measure at multiple layers - raw socket, TCP, HTTP, application protocol - to isolate overhead sources. 6) Test various payload sizes - performance characteristics differ for small vs large transfers. 7) Test concurrent connections - single connection vs many; contention and scaling behavior. 8) Measure both client and server - bottleneck may be on either end. 9) Profile CPU usage - network I/O can be CPU-bound with high connection counts or encryption. 10) Compare against theoretical limits - calculate maximum based on link speed, latency; measure achieved percentage. Tools: wrk/wrk2 for HTTP, netperf for low-level, custom benchmarks for application protocols. Account for: TCP overhead, encryption (TLS), serialization.
Follow this 11-point microarchitecture optimization checklist: 1) Identify target microarchitecture - determine specific CPU family (Skylake, Zen 3, etc.) in deployment. 2) Study microarchitecture resources - Agner Fog's instruction tables, Intel/AMD optimization manuals. 3) Understand execution resources - number of ports, which instructions use which ports, throughput/latency. 4) Check instruction latencies - critical for dependency chains; prefer lower-latency alternatives. 5) Balance port utilization - avoid saturating single port; spread work across available units. 6) Respect retirement width - 4-6 micro-ops/cycle on modern CPUs; don't exceed sustainable rate. 7) Use appropriate vector width - wider SIMD (AVX-512) may cause frequency reduction on some CPUs. 8) Consider cache hierarchy - L1/L2/L3 sizes and latencies vary by microarchitecture. 9) Check feature availability - not all instructions available on all CPUs; use CPUID for runtime detection. 10) Use -march appropriately - -march=native for single target, specific -march=skylake etc for known deployment. 11) Test on target hardware - simulator/emulator performance doesn't reflect real microarchitecture behavior. Trade-off: highly tuned code may pessimize on other microarchitectures; consider multi-versioning.
Follow this 10-step memory bandwidth optimization process: 1) Measure current bandwidth usage - use Intel PCM, likwid, or perf with memory controller events. 2) Calculate theoretical peak - check CPU specs for max memory bandwidth (e.g., DDR4-3200 dual channel = 51.2 GB/s). 3) Calculate achieved vs peak ratio - if <50% of peak, there's room for improvement; if near peak, must reduce memory traffic. 4) Reduce unnecessary memory traffic - eliminate redundant loads/stores, use registers for temporaries. 5) Improve cache utilization - tiling and blocking to keep data in cache, reducing main memory accesses. 6) Use non-temporal stores - for write-only data that won't be read again, skip cache pollution with the _mm_stream_* intrinsics. 7) Optimize prefetching - ensure prefetches don't exceed available bandwidth, causing contention. 8) Consider compression - if compute-cheap, compress data to reduce memory traffic. 9) Use appropriate data types - float vs double halves bandwidth requirement; int16_t vs int32_t similarly. 10) For multi-threaded code - ensure threads access local NUMA memory, consider bandwidth-aware thread scheduling. Key insight: memory-bound code scales poorly - adding threads doesn't help if bandwidth is saturated.
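A sketch of step 6 using AVX streaming stores; it assumes the destination is 32-byte aligned and n is a multiple of 8:

    #include <immintrin.h>
    #include <cstddef>

    // Streaming copy: stores bypass the cache hierarchy, so a large write-only buffer
    // does not evict the working set. _mm_sfence orders the non-temporal stores.
    void copy_stream(const float* src, float* dst, std::size_t n) {
        for (std::size_t i = 0; i < n; i += 8) {
            __m256 v = _mm256_loadu_ps(src + i);
            _mm256_stream_ps(dst + i, v);      // requires 32-byte-aligned dst
        }
        _mm_sfence();
    }

If the destination is read again soon, regular stores are usually faster; non-temporal stores only pay off for genuinely streaming output.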
Follow this 10-step emergency performance triage process: 1) Acknowledge and communicate - alert relevant teams, set expectations for resolution time. 2) Assess impact scope - how many users affected, which services, severity? 3) Check for recent changes - deployments, config changes, traffic patterns in last 24 hours. 4) Gather quick metrics - CPU, memory, disk I/O, network, error rates across affected services. 5) Identify proximate cause - is it application code, database, network, external dependency? 6) Apply temporary mitigation - scale up, enable rate limiting, disable expensive features, rollback if recent change. 7) Capture diagnostic data - while issue is occurring: thread dumps, heap dumps, traces, logs. 8) Root cause analysis - after mitigation, investigate captured data to find actual cause. 9) Implement proper fix - address root cause, not just symptoms. 10) Post-mortem - document incident, timeline, root cause, fix, prevention measures. Key principle: mitigate first, debug second. A working system at reduced capacity is better than extended downtime while investigating.
Follow this 11-step data structure optimization process: 1) Profile access patterns - identify which fields are accessed together, frequency of operations (read/write/iterate). 2) Measure current performance - baseline cache miss rates, memory bandwidth usage. 3) Calculate structure size and alignment - use sizeof(), check for padding with offsetof(). 4) Identify hot fields - fields accessed in tight loops or critical paths. 5) Group hot fields together - reorder struct members so hot fields are in same cache line (64 bytes). 6) Consider AoS vs SoA - if iterating over one field of many objects, SoA (separate arrays per field) improves spatial locality. 7) Reduce structure size - use smaller types where possible (int16_t vs int32_t), use bitfields for flags, remove unused fields. 8) Align to cache lines - use alignas(64) for frequently accessed structures to avoid split accesses. 9) Consider padding to avoid false sharing - in multi-threaded code, pad to 64 bytes between thread-local data. 10) Evaluate pointer vs index - indices into arrays have better cache behavior than pointers to dispersed memory. 11) Benchmark alternatives - test different layouts with production-like workload, as optimal depends on access pattern.
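A sketch of points 8-9: giving each thread's hot counter its own cache line so concurrent updates do not false-share. The counter type and array size are illustrative:

    #include <atomic>
    #include <cstdint>

    // alignas(64) forces one cache line per element, so sizeof(PaddedCounter) == 64
    // and adjacent array elements never share a line.
    struct alignas(64) PaddedCounter {
        std::atomic<std::uint64_t> value{0};
    };
    static_assert(sizeof(PaddedCounter) == 64, "expect one cache line per counter");

    PaddedCounter per_thread_hits[16];   // thread i increments only per_thread_hits[i].value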
Follow this 11-step database query optimization process: 1) Profile query frequency and latency - identify hot queries (frequent) and slow queries (high latency). 2) Examine query plans - use EXPLAIN ANALYZE to see execution plan and actual times. 3) Check index usage - verify indexes exist and are used; add missing indexes for WHERE/JOIN columns. 4) Reduce round trips - batch queries, use JOINs instead of N+1 queries, fetch multiple results at once. 5) Limit result sets - use LIMIT, pagination; don't fetch more than needed. 6) Optimize data transfer - select only needed columns, not SELECT *; reduce transferred bytes. 7) Use connection pooling - connection creation is expensive; reuse connections. 8) Add application-level caching - cache read-heavy, rarely-changing data; use appropriate TTLs. 9) Use prepared statements - reduce parsing overhead for repeated queries. 10) Optimize transactions - minimize transaction scope and duration; avoid holding locks during app logic. 11) Consider denormalization - for read-heavy workloads, duplicate data to avoid expensive joins. Measure at application level: database latency is only part; include serialization, network, and app processing time.
Follow this diagnostic process when IPC is below 1.0: 1) Measure IPC precisely - use perf stat -e cycles,instructions to get accurate IPC value. 2) Check memory boundedness first - measure cache misses: perf stat -e L1-dcache-load-misses,LLC-load-misses. If LLC miss rate > 1%, memory latency is likely culprit. 3) If memory-bound: follow cache optimization checklist - improve locality, add prefetching, reduce working set. 4) Check branch mispredictions - perf stat -e branch-misses. If misprediction rate > 2%, branch prediction is culprit. 5) If branch-bound: follow branch optimization checklist - improve predictability, use branchless code. 6) Check for long dependency chains - look for serial arithmetic operations where each depends on previous result. 7) If dependency-bound: break chains with temporary variables, use associative transformations, increase ILP. 8) Check for execution port saturation - use VTune Port Utilization or pmu-tools. 9) If port-bound: use different instruction mix, reduce pressure on saturated ports. 10) Check for front-end issues - I-cache misses, complex instructions causing decode stalls. Low IPC is usually memory (most common), branches, or dependencies (in that order).
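A sketch of step 7: splitting a serial floating-point dependency chain into two independent accumulators so additions can overlap. Note that reassociating FP addition can change rounding slightly:

    #include <cstddef>

    double dot_two_chains(const double* a, const double* b, std::size_t n) {
        double s0 = 0.0, s1 = 0.0;
        std::size_t i = 0;
        for (; i + 2 <= n; i += 2) {
            s0 += a[i] * b[i];             // chain 0
            s1 += a[i + 1] * b[i + 1];     // chain 1, independent of chain 0
        }
        if (i < n) s0 += a[i] * b[i];      // remainder
        return s0 + s1;
    }

With a single accumulator every add waits on the previous one (latency-bound); more chains (4-8) can help further until register pressure or the FP unit count becomes the limit.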
Follow this 10-step code size optimization checklist: 1) Measure code size - check binary size, use size command to see text segment. 2) Profile I-cache misses - use perf stat -e L1-icache-load-misses to measure instruction cache behavior. 3) Reduce function size - split large functions, move cold code to separate functions. 4) Use -Os flag - optimize for size; may be faster than -O3 if I-cache bound. 5) Avoid excessive inlining - inline only truly hot, small functions; large inlined functions bloat code. 6) Reduce template instantiations - templates generate separate code for each type; use type erasure where possible. 7) Use link-time optimization (LTO) - -flto enables cross-file optimization and dead code elimination. 8) Arrange functions by hotness - place hot functions together in memory; use profile-guided layout. 9) Consider compression - code compression techniques for embedded or cold code regions. 10) Review loop unrolling - excessive unrolling bloats code; tune unroll factor considering I-cache. Trade-off: smaller code has better I-cache hit rate but may have more instructions executed; profile both.
Follow this 10-step branch optimization process: 1) Profile branch mispredictions - use perf stat -e branches,branch-misses to get overall rate, target <1% for hot paths. 2) Identify mispredicted branches - use perf record -e branch-misses and perf report to find specific locations. 3) Analyze branch pattern - determine if branch is: always taken, always not-taken, alternating, random, or data-dependent. 4) For predictable branches: ensure hot path is fall-through, use likely/unlikely hints if available. 5) For unpredictable branches in loops: try loop unswitching to move branch outside loop. 6) For data-dependent branches: consider branchless alternatives using conditional moves (CMOV), arithmetic, or SIMD masking. 7) For multiple related branches: convert to lookup table or switch statement that compilers optimize better. 8) Reduce branch count: combine conditions, use short-circuit evaluation strategically. 9) Consider profile-guided optimization (PGO): compile with instrumentation, run representative workload, recompile with profile data. 10) Verify improvement: confirm misprediction rate dropped AND overall performance improved (branchless code may have higher latency).
Try optimizations in this order for CPU-bound loops: 1) Algorithm improvement - can you reduce complexity? O(n) vs O(n log n) matters more than micro-optimization. 2) Compiler optimization check - ensure -O2/-O3 enabled, check vectorization reports. 3) Loop-invariant code motion - move calculations that don't change per iteration outside the loop. 4) Strength reduction - replace expensive operations (multiply/divide) with cheaper ones (shift/add). 5) Common subexpression elimination - compute shared expressions once, store in variable. 6) Loop unrolling - try 2x, 4x, 8x unroll factors; let compiler try first with -funroll-loops. 7) Vectorization - if not auto-vectorized, address blockers (aliasing, alignment, dependencies). 8) Reduce function call overhead - inline small functions, avoid virtual calls in hot path. 9) Reduce branch count - combine conditions, use branchless techniques for unpredictable branches. 10) Instruction-level parallelism - reorder independent operations to fill pipeline. Stop when: target performance achieved, profiler shows different bottleneck, or optimization yields <5% improvement (diminishing returns).
Follow this 11-step floating-point optimization process: 1) Profile FP operations - use perf with FP-specific events or VTune to count FLOPS and FP stalls. 2) Check for denormal penalties - denormalized numbers (very small values near zero) can be 100x slower; consider flushing to zero with _MM_SET_FLUSH_ZERO_MODE. 3) Identify precision requirements - float (32-bit) is 2x throughput of double (64-bit) for SIMD; use float if precision allows. 4) Use FMA instructions - fused multiply-add has lower latency than separate mul+add and may be more accurate. Enable with -mfma. 5) Consider -ffast-math - enables aggressive optimizations (reassociation, reciprocal approximations) that may change results. Use -fno-math-errno, -ffinite-math-only, -fno-trapping-math individually for finer control. 6) Reduce division - multiply by reciprocal when denominator is loop-invariant; division is 15-20x slower than multiplication. 7) Use intrinsics for special functions - approximate rsqrt, rcp are much faster than precise versions when acceptable. 8) Vectorize FP loops - SIMD provides 4x (SSE/float) to 16x (AVX-512/float) throughput. 9) Check for unnecessary type conversions - int-to-float conversions have latency. 10) Order operations for accuracy - sum small numbers first to reduce rounding error. 11) Verify numerical results - aggressive optimizations can change results; compare against reference implementation.
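A sketch of points 2 and 4, assuming x86 with SSE3 and FMA support (compile with -mfma); flushing denormals changes numerical behaviour near zero, so confirm the application tolerates it:

    #include <immintrin.h>

    // Enable flush-to-zero and denormals-are-zero for the calling thread.
    void enable_ftz_daz() {
        _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
        _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
    }

    // Explicit fused multiply-add: a*b + c with a single rounding step.
    __m256 fma8(__m256 a, __m256 b, __m256 c) {
        return _mm256_fmadd_ps(a, b, c);
    }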
Follow this 10-step memory leak debugging process: 1) Confirm leak exists - monitor memory usage over time; growing usage without corresponding workload increase indicates leak. 2) Reproduce reliably - identify workload pattern that triggers leak; create minimal reproducer. 3) Quantify leak rate - measure bytes leaked per operation or per time period. 4) Run with leak detector - Valgrind Memcheck, AddressSanitizer (-fsanitize=address), or Dr. Memory. 5) Identify allocation site - detector reports where leaked memory was allocated. 6) Trace ownership - follow code path from allocation to understand intended deallocation point. 7) Identify missing free - find where deallocation should occur but doesn't. 8) Categorize leak type - true leak (lost pointer), logical leak (growing cache/queue), or reference cycle (garbage-collected languages). 9) Implement fix - add missing deallocation, limit cache size, break reference cycles. 10) Verify fix - rerun leak detector, monitor memory usage over extended period. For production: use sampling-based leak detectors (TCMalloc heap profiler) that have lower overhead than Valgrind.
Follow this 15-point benchmarking environment checklist: 1) Dedicated hardware - use machine not running other workloads. 2) Disable Turbo Boost - 'echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo'. 3) Set fixed CPU frequency - 'cpupower frequency-set -g performance -d FREQ -u FREQ'. 4) Disable hyper-threading if testing single-thread - isolate physical cores. 5) Set CPU governor to performance - 'cpupower frequency-set -g performance'. 6) Disable ASLR - 'echo 0 > /proc/sys/kernel/randomize_va_space'. 7) Set CPU affinity - pin benchmark to specific cores with taskset. 8) Disable swap - 'swapoff -a' to ensure consistent memory behavior. 9) Drop caches before test - 'echo 3 > /proc/sys/vm/drop_caches'. 10) Minimize background processes - stop unnecessary services. 11) Use isolcpus - boot parameter to reserve cores from scheduler. 12) Disable network interrupts - on benchmark cores. 13) Wait for thermal equilibrium - let system temperature stabilize. 14) Record environment - CPU model, frequencies, kernel version, compiler version. 15) Validate repeatability - run multiple times, verify low variance (<5% coefficient of variation).
Follow this 9-step VLIW instruction scheduling process: 1) Understand the target architecture - identify number of functional units, instruction latencies, and bundle width (how many ops per VLIW word). 2) Build dependency graph - create DAG of instructions showing true dependencies (RAW), anti-dependencies (WAR), and output dependencies (WAW). 3) Calculate critical path - find longest dependency chain; this is minimum execution time. 4) Identify parallelism - find independent instruction groups that can execute simultaneously. 5) Apply list scheduling - prioritize instructions by: critical path length, number of successors, resource constraints. 6) Pack into bundles - place compatible instructions in same VLIW word respecting functional unit constraints. 7) Handle hazards - insert NOPs where dependencies cannot be satisfied by scheduling. 8) Consider software pipelining - for loops, overlap iterations by starting iteration N+1 before N completes, using modulo scheduling. 9) Minimize code size - balance between NOP insertion (code bloat) and register pressure. Software pipelining particularly effective for VLIW as it exposes more parallelism than single-iteration scheduling.
Follow this 12-step hot loop optimization checklist: 1) Confirm hotness - verify loop consumes >10% of total cycles using profiler. 2) Measure baseline IPC - if IPC<1.0, suspect memory or dependency issues; if IPC>1.0, focus on reducing instruction count. 3) Count operations per iteration - tally loads, stores, FLOPs, branches, function calls. 4) Check loop-carried dependencies - identify if iteration N depends on iteration N-1 results. 5) Verify compiler optimizations applied - check compiler reports for vectorization, unrolling status. 6) Profile cache behavior - measure L1/L2/L3 miss rates for loop accesses. 7) Check memory access pattern - verify sequential/strided access, look for indirect addressing. 8) Assess branch predictability - profile branch misprediction rate within loop. 9) Evaluate vectorization potential - check for data dependencies preventing SIMD. 10) Consider loop transformations - interchange, tiling, fusion, fission as applicable. 11) Implement changes one at a time - measure after each. 12) Verify correctness - ensure loop produces identical results post-optimization.
Follow this 9-step instruction count reduction process: 1) Profile instruction mix - use perf stat to get total instructions, breakdown by type if available. 2) Identify high-instruction-count regions - use perf annotate to see instruction counts per source line. 3) Look for redundant computations - common subexpressions, loop-invariant calculations, repeated checks. 4) Apply strength reduction - replace expensive ops: multiply by power of 2 -> shift, divide -> multiply by reciprocal, modulo -> bitwise AND for power-of-2 divisors. 5) Use lookup tables - for complex calculations with small input domain, precompute results. 6) Leverage SIMD - one vector instruction replaces 4-16 scalar instructions. 7) Use specialized instructions - POPCNT for bit counting, LZCNT for leading zeros, FMA for multiply-add. 8) Simplify control flow - merge similar branches, eliminate dead code, simplify predicates. 9) Measure actual impact - fewer instructions doesn't always mean faster if new instructions are higher latency or cause other bottlenecks. Focus on hot code paths; reducing cold code instructions has minimal impact.
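A small sketch of step 4 for power-of-two divisors; these rewrites are only valid when the divisor is a known power of two and the values are unsigned (or known non-negative):

    #include <cstdint>

    inline std::uint32_t mod_pow2(std::uint32_t x, std::uint32_t pow2) {
        return x & (pow2 - 1);     // x % 2^k  ->  x & (2^k - 1)
    }

    inline std::uint32_t div_pow2(std::uint32_t x, unsigned k) {
        return x >> k;             // x / 2^k  ->  x >> k
    }

Compilers already apply these when the divisor is a compile-time constant; the manual form matters mainly when the divisor is a runtime value you can constrain to powers of two.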
Follow this 10-step power optimization process: 1) Profile power usage - use Intel RAPL (Running Average Power Limit), powertop, or hardware power meters. 2) Understand power-performance trade-off - reducing frequency saves power but increases latency. 3) Optimize for completion time - faster completion allows longer idle periods; Race to Sleep strategy. 4) Use appropriate SIMD width - AVX-512 can cause frequency reduction on some CPUs; AVX2 may be more power-efficient. 5) Reduce memory traffic - DRAM access consumes significant power; cache optimization helps power too. 6) Use sleep states - yield CPU during waits rather than spinning; use condition variables. 7) Batch operations - process data in bursts allowing CPU to enter deeper sleep states between. 8) Consider compute intensity - power ~= constant + activity; more compute per memory access is more efficient. 9) Tune frequency governors - for sustained workloads, consider 'powersave' governor with occasional bursts. 10) Profile energy-delay product - minimize Energy x Time rather than just one metric. For servers: power efficiency (perf/watt) often more important than raw performance; for mobile: battery life critical.
Try optimizations in this order for memory-bound code: 1) Algorithmic improvement - reduce memory access complexity (O(n) accesses vs O(n^2)). 2) Data layout optimization - convert AoS to SoA if processing single field at a time. 3) Loop tiling/blocking - process cache-sized chunks to maximize reuse. 4) Loop fusion - combine loops accessing same data to improve temporal locality. 5) Data alignment - align to cache lines to avoid split accesses. 6) Prefetching - software prefetch for irregular access patterns not handled by hardware. 7) Reduce data size - use smaller types (float vs double, int16 vs int32) if precision allows. 8) Compress data - if compute-cheap, compress to reduce memory traffic. 9) Non-temporal stores - for write-only streaming, bypass cache to avoid pollution. 10) NUMA optimization - bind threads and data to same NUMA node. 11) Huge pages - reduce TLB misses for large working sets. Stop when: profile shows code is no longer memory-bound (check TMAM), or target performance achieved. Common mistake: trying to improve IPC when memory-bound - IPC will remain low until memory pressure reduced.
Follow this 10-step PGO checklist: 1) Understand PGO mechanism - compiler instruments code to collect runtime data, then uses data to optimize. 2) Build instrumented binary - GCC: -fprofile-generate, Clang: -fprofile-instr-generate, MSVC: /GENPROFILE. 3) Run with representative workload - critical step; profile data must reflect production usage patterns. Include common paths, edge cases, various input sizes. 4) Cover code paths adequately - aim for >80% code coverage; uncovered code won't benefit. 5) Run sufficient iterations - ensure statistical significance in collected data. 6) Collect profile data - GCC: .gcda files generated on program exit; Clang: merge with llvm-profdata. 7) Build optimized binary - GCC: -fprofile-use, Clang: -fprofile-use=default.profdata, MSVC: /USEPROFILE. 8) Measure improvement - typical gains: 10-30% for branch-heavy code; less for compute-heavy. 9) Validate correctness - PGO-optimized code must produce same results. 10) Maintain profile data - re-profile when code changes significantly; stale profiles can cause pessimization. PGO benefits: better branch prediction hints, optimal function inlining, improved basic block placement.
Follow this 8-step loop fusion process: 1) Identify fusion candidates - separate loops with same iteration bounds and no conflicting dependencies. 2) Check dependency safety - loop 2 must not read values that loop 1 writes in later iterations. Fusion is safe if: loops are independent, or loop 2 only reads what loop 1 wrote in same iteration. 3) Analyze cache benefit - fusion beneficial when: loops access same data (improved temporal locality), combined body still fits in registers. 4) Check code size impact - fused loop body must not exceed I-cache capacity; may hurt if bodies are large. 5) Implement fusion - combine loop bodies, merge induction variables. Before: for(i) A[i]=..; for(i) B[i]=A[i]+..; After: for(i) { A[i]=..; B[i]=A[i]+..; } 6) Handle different iteration counts - if counts differ, fuse common portion, separate epilog for remainder. 7) Verify correctness - fused loop must produce identical results to original sequence. 8) Measure improvement - check reduced cache misses and overall speedup. Fusion reduces loop overhead and improves locality but increases register pressure and code size.
Follow this 10-point cross-vendor optimization checklist: 1) Identify target CPUs - determine if code will run on Intel-only, AMD-only, or both. 2) Check instruction availability - AVX-512 availability varies; AMD added support in Zen 4. 3) Consider cache differences - AMD typically has larger L3, different associativity; Intel has larger L2 per core. 4) Check SIMD performance - some SIMD instructions have different throughput/latency between vendors. 5) Profile on both architectures - performance characteristics can differ significantly. 6) Use vendor-agnostic intrinsics - prefer generic _mm* over vendor-specific extensions. 7) Check branch predictor differences - some patterns predicted differently; profile branch misses on both. 8) Consider NUMA topology - AMD chiplet design has different NUMA characteristics than Intel monolithic. 9) Test AVX2 vs AVX-512 - on AMD, AVX-512 may not be faster due to frequency scaling; benchmark. 10) Use runtime dispatch - detect CPU features and dispatch to optimized code paths: __builtin_cpu_supports(). For portable code: optimize for common subset (AVX2), add special paths for vendor-specific features with runtime detection.
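A sketch of point 10 using the GCC/Clang builtins; the summation kernel is a placeholder, and the AVX2 variant here is a stand-in for a genuinely vectorized kernel:

    #include <cstddef>

    __attribute__((target("avx2")))
    static float sum_avx2(const float* x, std::size_t n) {
        float s = 0.0f;                              // compiled with AVX2 enabled for this function only
        for (std::size_t i = 0; i < n; ++i) s += x[i];
        return s;
    }

    static float sum_default(const float* x, std::size_t n) {
        float s = 0.0f;                              // portable baseline path
        for (std::size_t i = 0; i < n; ++i) s += x[i];
        return s;
    }

    float sum_dispatch(const float* x, std::size_t n) {
        __builtin_cpu_init();                        // required if called before constructors run
        if (__builtin_cpu_supports("avx2"))
            return sum_avx2(x, n);
        return sum_default(x, n);
    }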
Follow this 14-step cache optimization checklist: 1) Measure baseline - profile L1/L2/L3 hit rates and miss penalties using perf or Cachegrind. 2) Understand cache parameters - L1: 32-64KB, 4-8 way, 64B lines; L2: 256KB-1MB; L3: shared, multi-MB. 3) Calculate working set size - total unique data accessed in hot regions. 4) If working set > cache: apply blocking/tiling to process data in cache-sized chunks. 5) If spatial locality poor: reorganize data layout for sequential access, convert AoS to SoA. 6) If temporal locality poor: reorder computations to reuse data while still in cache. 7) Align data structures - align to 64-byte cache line boundaries to avoid split accesses. 8) Avoid false sharing in parallel code - pad structures so different threads don't share cache lines. 9) Use prefetching for predictable access - software prefetch 200-400 cycles ahead of use. 10) Minimize pointer chasing - linearize linked structures or use indices into arrays. 11) Consider cache-oblivious algorithms - algorithms that work well regardless of cache size. 12) Profile TLB misses - if high, consider huge pages (2MB instead of 4KB). 13) Check NUMA locality - ensure data allocated on same node as accessing CPU. 14) Verify improvements - confirm cache miss reduction yields proportional speedup.
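A sketch of step 4: blocking a matrix transpose so each tile of the source and destination stays cache-resident. B = 32 (two 32x32 tiles of doubles = 16 KB) is an illustrative starting point; tune it against the target's L1 size:

    // Blocked transpose for an n x n row-major matrix.
    void transpose_blocked(const double* src, double* dst, int n) {
        const int B = 32;
        for (int ii = 0; ii < n; ii += B)
            for (int jj = 0; jj < n; jj += B)
                for (int i = ii; i < ii + B && i < n; ++i)
                    for (int j = jj; j < jj + B && j < n; ++j)
                        dst[j * n + i] = src[i * n + j];   // column-stride writes stay within one tile
    }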
Follow this 11-step distributed systems performance debugging process: 1) Define the symptom - quantify: which requests are slow, how slow, what percentile affected? 2) Collect distributed traces - use Jaeger, Zipkin, or vendor tracing to see request flow. 3) Identify slow spans - find which service/component contributes most latency. 4) Check for queuing - high queue time indicates insufficient capacity or slow downstream. 5) Profile the slow component - use local profiling techniques on identified service. 6) Check network latency - measure RTT between services; unexpected high latency indicates network issues. 7) Look for retry amplification - retries causing load increase, causing more retries. 8) Check for resource contention - CPU, memory, connections, thread pools hitting limits. 9) Analyze dependencies - slow downstream service can cause upstream queuing. 10) Check for data skew - uneven load distribution causing some instances to be hot. 11) Verify with fix - implement fix, monitor distributed traces to confirm improvement. Key insight: in distributed systems, latency is often dominated by network, queuing, and serialization - not CPU computation.
Follow this 8-step memory alignment optimization process: 1) Identify alignment requirements - SSE needs 16-byte, AVX needs 32-byte, AVX-512 needs 64-byte alignment; cache lines are 64 bytes. 2) Check current alignment - use (uintptr_t)ptr % alignment or assert((uintptr_t)ptr & (alignment-1) == 0). 3) For stack variables: use alignas(N) specifier in C++11+ or attribute((aligned(N))). 4) For heap allocations: use aligned_alloc(alignment, size), _mm_malloc(size, alignment), or posix_memalign(). 5) For structure members: use alignas() or pack/align pragmas; be aware of padding impact on size. 6) Verify SIMD load/store instructions: use aligned loads (_mm256_load_ps) when alignment guaranteed, unaligned (_mm256_loadu_ps) otherwise. 7) Measure performance difference - on modern CPUs (Haswell+), unaligned access penalty is small UNLESS crossing cache line boundary; crossing penalty is significant. 8) Consider cache line alignment for hot data - align frequently accessed data to 64-byte boundaries to avoid split loads and false sharing. Trade-off: over-alignment wastes memory; align only what's performance-critical.
Follow this 10-step JSON parsing optimization process: 1) Profile current parser - measure parse time, memory allocations, CPU cycles per byte. 2) Choose fast parser library - simdjson (fastest), rapidjson, yyjson outperform standard library parsers. 3) Use lazy parsing - parse on-demand rather than building full DOM if only accessing few fields. 4) Avoid string copies - use string views into original buffer rather than copying strings. 5) Pre-allocate buffers - reuse parser state and output buffers across parses. 6) Use binary formats for internal communication - MessagePack, Protocol Buffers, FlatBuffers avoid parse overhead. 7) Schema-aware parsing - if structure is known, generate specialized parser (code generation). 8) SIMD acceleration - simdjson uses SIMD for structural character scanning; huge speedup. 9) Streaming parse for large documents - avoid loading entire document into memory. 10) Profile memory allocation - JSON parsing often allocation-heavy; arena allocators help. Benchmark: simdjson achieves 2-3 GB/s parsing speed; compare against your current solution. For serialization, similar process: use fast library, avoid intermediate strings, stream output.
Follow this 15-step micro-benchmark creation checklist: 1) Define what you're measuring - single specific operation, not compound behavior. 2) Isolate the code under test - remove unrelated setup/teardown from timed region. 3) Prevent dead code elimination - use result (Blackhole.consume in JMH, DoNotOptimize in Google Benchmark). 4) Prevent constant folding - use runtime inputs, not compile-time constants. 5) Control iteration count - let framework determine iterations, don't add manual loops. 6) Warm up properly - run enough iterations for JIT compilation (JVM) or cache warming. 7) Pin CPU frequency - disable Turbo Boost and frequency scaling for consistency. 8) Pin threads to cores - use taskset/numactl to avoid migration. 9) Run sufficient trials - minimum 10 iterations, ideally 30+ for statistical validity. 10) Measure variance - report standard deviation and confidence intervals, not just mean. 11) Use appropriate time resolution - std::chrono::high_resolution_clock, QueryPerformanceCounter, or RDTSC. 12) Account for timer overhead - measure and subtract if measuring sub-microsecond operations. 13) Test multiple input sizes - performance may vary non-linearly with size. 14) Compare against baseline - always measure relative improvement, not just absolute numbers. 15) Document environment - record CPU model, compiler version, flags, OS version.
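A minimal sketch of points 3-5 using Google Benchmark (the BM_Sum function and input sizes are illustrative; link against the benchmark library):

#include <benchmark/benchmark.h>
#include <numeric>
#include <vector>

// Hypothetical operation under test: summing a runtime-sized vector.
static void BM_Sum(benchmark::State& state) {
    std::vector<int> data(state.range(0));          // runtime input, not a compile-time constant
    std::iota(data.begin(), data.end(), 0);
    for (auto _ : state) {                          // the framework controls iteration count
        long long sum = std::accumulate(data.begin(), data.end(), 0LL);
        benchmark::DoNotOptimize(sum);              // prevent dead code elimination
    }
}
BENCHMARK(BM_Sum)->Arg(1 << 10)->Arg(1 << 20);      // test multiple input sizes
BENCHMARK_MAIN();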
Follow this 11-step hash table optimization checklist: 1) Profile access patterns - measure lookup/insert/delete frequency, hit rate, probe length. 2) Choose appropriate load factor - 0.5-0.7 balances memory and performance; lower for latency-critical. 3) Select hash function - use fast, high-quality hash (xxHash, wyhash, or built-in); avoid cryptographic hashes. 4) Minimize key comparison cost - store hash values to avoid recomputing; use prefix comparison for strings. 5) Optimize probing strategy - linear probing has better cache behavior than separate chaining. 6) Use power-of-2 sizes - enables fast modulo via bitwise AND. 7) Implement SIMD parallel lookup - compare multiple keys simultaneously using vector instructions. 8) Prefetch during probing - prefetch next probe location while checking current. 9) Consider flat hash maps - store entries inline rather than following pointers (better cache). 10) Handle hot keys - very frequently accessed keys can use separate fast path or front-of-bucket. 11) Profile memory usage - hash tables can waste significant memory; Swiss table and similar designs improve density. Benchmark against robin hood, cuckoo, and Swiss table variants for your access pattern.
Follow this 8-step loop fusion process: 1) Identify fusion candidates - separate loops with same iteration bounds and no conflicting dependencies. 2) Check dependency safety - loop 2 must not read values that loop 1 writes in later iterations. Fusion is safe if: loops are independent, or loop 2 only reads what loop 1 wrote in same iteration. 3) Analyze cache benefit - fusion beneficial when: loops access same data (improved temporal locality), combined body still fits in registers. 4) Check code size impact - fused loop body must not exceed I-cache capacity; may hurt if bodies are large. 5) Implement fusion - combine loop bodies, merge induction variables. Before: for(i) A[i]=..; for(i) B[i]=A[i]+..; After: for(i) { A[i]=..; B[i]=A[i]+..; } 6) Handle different iteration counts - if counts differ, fuse common portion, separate epilog for remainder. 7) Verify correctness - fused loop must produce identical results to original sequence. 8) Measure improvement - check reduced cache misses and overall speedup. Fusion reduces loop overhead and improves locality but increases register pressure and code size.
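A minimal C sketch of step 5, assuming the arrays do not alias (function and array names are illustrative):

#include <stddef.h>

// Before fusion: a[] is written by the first loop and re-read by the second,
// by which point it may have been evicted from cache for large n.
void unfused(float* a, const float* b, float* c, size_t n) {
    for (size_t i = 0; i < n; ++i) a[i] = b[i] * 2.0f;
    for (size_t i = 0; i < n; ++i) c[i] = a[i] + b[i];
}

// After fusion: a[i] is still in a register when c[i] needs it, and b[i]
// is read once per iteration instead of once in each loop.
void fused(float* a, const float* b, float* c, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        a[i] = b[i] * 2.0f;
        c[i] = a[i] + b[i];
    }
}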
Follow this 12-step performance regression debugging process: 1) Confirm the regression - verify with multiple runs that performance actually degraded (not measurement noise). 2) Quantify the regression - measure exact percentage slowdown and which metrics changed (latency, throughput, CPU time). 3) Identify the regression window - use git bisect to find the exact commit that introduced the regression. 4) Compare profiles before/after - run same profiler on good and bad versions with identical workload. 5) Generate differential flame graph - visualize what got slower (red) vs faster (blue). 6) Analyze the diff - identify which functions show increased time. 7) Check for obvious causes - new allocations, additional logging, changed algorithms, new dependencies. 8) Profile specific metrics - if not obvious, profile cache misses, branch mispredictions, IPC separately. 9) Inspect code changes - review the diff of the regression commit for performance-impacting changes. 10) Form hypothesis - propose specific cause based on profile differences. 11) Verify hypothesis - create minimal reproducer or targeted benchmark. 12) Fix and verify - implement fix, confirm regression is resolved, ensure no new regressions introduced. Add regression test to CI to prevent recurrence.
Follow this 9-step software prefetching process: 1) Identify prefetch candidates - loops with: predictable access pattern, large data that doesn't fit cache, high cache miss rate on profiling. 2) Verify hardware prefetcher isn't sufficient - sequential and simple strided patterns are often handled automatically. 3) Calculate prefetch distance - prefetch D iterations ahead where D = memory_latency / loop_iteration_time. Typical: 100-300 cycles latency / 10-50 cycles per iteration = 3-30 iterations ahead. 4) Choose prefetch type - _mm_prefetch with _MM_HINT_T0 (all cache levels) for data used soon, _MM_HINT_NTA (non-temporal) for data used once. 5) Insert prefetch instructions - at loop start, prefetch data for iteration i+D while processing iteration i. 6) Avoid over-prefetching - too many prefetches can evict useful data and saturate memory bandwidth. 7) Handle loop boundaries - don't prefetch beyond array bounds (use conditional or limit prefetch iterations). 8) Measure impact - compare cache miss rate and runtime; prefetching should reduce misses without increasing total memory traffic significantly. 9) Test on target hardware - optimal distance varies by CPU; tune for deployment platform.
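A sketch of steps 3-7 with the _mm_prefetch intrinsic; the 16-iteration distance is an assumption to tune on the target CPU, and the loop body is illustrative:

#include <stddef.h>
#include <xmmintrin.h>   // _mm_prefetch, _MM_HINT_T0

void process(const float* src, float* dst, size_t n) {
    const size_t DIST = 16;                        // assumed prefetch distance; tune per platform
    for (size_t i = 0; i < n; ++i) {
        if (i + DIST < n)                          // step 7: never prefetch past the array end
            _mm_prefetch((const char*)&src[i + DIST], _MM_HINT_T0);
        dst[i] = src[i] * src[i];                  // work on iteration i while i+DIST is fetched
    }
}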
Follow this 10-step startup optimization process: 1) Profile startup - use strace -tt -T to see syscalls, perf record from process start. 2) Identify phases - categorize: loading, parsing, initialization, warm-up. 3) Measure dynamic linking time - large number of shared libraries adds loading overhead; consider static linking. 4) Lazy initialize - defer initialization until first use; avoid loading unused features. 5) Reduce I/O - combine config files, use memory-mapped files, parallelize independent reads. 6) Optimize parsing - use faster parsers, binary formats instead of JSON/XML for large configs. 7) Use dlopen for optional features - load plugins on-demand rather than at startup. 8) Cache computed state - serialize initialized state to disk, reload on subsequent starts. 9) Parallelize initialization - independent subsystems can initialize concurrently. 10) Profile and eliminate - use flame graph of startup to identify and remove unnecessary work. Tools: systemd-analyze for service startup, bootchart for system-level, custom timestamps for application-level. Goal: minimize time-to-first-useful-output.
Follow this 10-step memory access optimization process: 1) Profile cache miss rates - use perf stat -e L1-dcache-load-misses,LLC-load-misses to identify miss levels. 2) Identify miss source - if L1 misses high but L2 hits, data not fitting L1; if LLC misses high, data not in cache at all. 3) Analyze access pattern - check if sequential (spatial locality good), strided (may miss cache lines), or random (poor locality). 4) Check data structure layout - Array of Structures (AoS) vs Structure of Arrays (SoA), identify if accessing cold fields. 5) Evaluate working set size - compare data size to cache sizes (L1: 32KB, L2: 256KB, L3: varies). 6) If random access pattern: consider prefetching, restructure algorithm, or use cache-oblivious data structures. 7) If strided access: apply loop interchange to improve stride, consider data layout transformation. 8) If working set too large: apply loop tiling/blocking to fit in cache. 9) Check alignment - ensure data aligned to cache line boundaries (64 bytes). 10) Re-profile after each change - verify cache miss reduction translates to speedup.
Follow this 8-step register optimization process: 1) Identify register pressure issues - look for spill/fill instructions in generated assembly, or use compiler reports. 2) Count live variables - at any point, live variables > available registers causes spills. x86-64 has 16 general-purpose and 16 vector registers (32 vector registers with AVX-512). 3) Reduce variable lifetimes - compute values as late as possible, use values as soon as computed. 4) Split hot and cold paths - keep register-intensive code on hot path; cold paths can spill. 5) Use function parameters wisely - first 6 integer args and first 8 FP args passed in registers (x86-64 SysV ABI). 6) Consider manual register hints - the register keyword is only a hint, is ignored by modern compilers, and was removed in C++17; inline assembly remains an option for truly critical sections. 7) Restructure code - split large functions, move invariants outside loops, reduce nested scopes. 8) Check compiler options - -freg-struct-return, -mprefer-vector-width affect register usage. Key insight: modern compilers have excellent register allocators; manual intervention rarely helps except in extreme cases. Focus on reducing variable count and scope instead.
Follow this 9-step cache thrashing diagnosis and fix process: 1) Detect thrashing symptoms - high cache miss rate despite working set appearing to fit in cache. 2) Profile cache behavior - use Cachegrind or perf to measure actual miss rates vs expected. 3) Identify conflict misses - thrashing occurs when multiple data items map to same cache set, evicting each other. 4) Analyze access patterns - find which arrays/structures are accessed together and their addresses. 5) Check alignment/stride - power-of-2 strides with power-of-2 array sizes cause conflict misses. 6) Calculate conflict addresses - items conflict if: (address1 / cache_line_size) mod num_sets == (address2 / cache_line_size) mod num_sets. 7) Apply padding - add padding to shift one array's addresses to different sets. Typical: add cache_line_size bytes. 8) Use different allocation strategy - malloc may align large allocations; use memalign with offset. 9) Verify fix - re-profile to confirm miss rate decreased and performance improved. Example: two 4KB arrays both at 4KB-aligned addresses conflict in every L1 set; padding one by 64 bytes eliminates conflicts.
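A small sketch of the padding fix from step 7, following the answer's own two-4KB-array example (names and sizes are illustrative; the actual benefit depends on cache associativity and how many hot arrays compete for the same sets):

// Two 4 KB arrays at 4 KB-aligned offsets map to the same L1 sets; one
// 64-byte cache line of padding between them shifts the second array by
// one set index, so a[i] and b[i] no longer collide.
struct Buffers {
    float a[1024];   // 4 KB
    char  pad[64];   // one cache line of padding
    float b[1024];   // 4 KB, now offset so b[i] lands one set away from a[i]
};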
Follow these optimization steps for memory-intensive batch processing: 1) Profile memory usage - identify peak memory, allocation patterns, and hot allocation sites. 2) Stream instead of load all - process data incrementally rather than loading entire dataset into memory. 3) Use memory-mapped files - let OS manage paging for large files instead of explicit reads. 4) Implement chunked processing - process data in cache-friendly chunk sizes (e.g., 1-4MB chunks). 5) Optimize data representation - use smaller types, compression, or packed formats to reduce footprint. 6) Reuse buffers - allocate buffers once and reuse across iterations rather than reallocating. 7) Use arena allocators - for temporary allocations within batch items, reset arena between items. 8) Process in cache-optimal order - access data in patterns that maximize cache hits. 9) Prefetch upcoming chunks - while processing chunk N, prefetch chunk N+1 data. 10) Parallelize with memory awareness - partition data to minimize cross-thread memory contention. 11) Consider out-of-core algorithms - for datasets exceeding RAM, use external memory algorithms. Goal: maintain constant memory usage regardless of input size while maximizing throughput.
Follow this 9-step string optimization process: 1) Profile string operations - identify hot string functions (strcmp, strlen, memcpy, parsing). 2) Use SIMD string functions - modern libraries (glibc, MSVC) have AVX2-optimized implementations. 3) Avoid unnecessary allocations - reuse buffers, use string views/spans instead of copying. 4) Use small string optimization (SSO) - std::string typically stores strings <15-23 chars inline without heap allocation. 5) Process in bulk - don't call strlen in every loop iteration; compute once and store. 6) Use memcmp/memcpy for known lengths - faster than strcmp/strcpy when length is known. 7) Consider specialized algorithms - for pattern matching, use SIMD-optimized search (SWAR, SSE4.2 PCMPESTRI). 8) Minimize format string parsing - printf-style formatting is slow; consider faster alternatives or pre-formatting. 9) Use interned strings for comparisons - compare pointers instead of content for frequently compared strings. Benchmark: simple optimizations (avoiding redundant strlen) often give bigger wins than complex SIMD; measure before implementing complex solutions.
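A small sketch of step 5 - hoisting the length computation out of the loop (the uppercase_ascii function is illustrative):

#include <string.h>

// Before: for (size_t i = 0; i < strlen(s); ++i) ...  re-scans the string
// every iteration, turning an O(n) pass into O(n^2).
// After: compute the length once and reuse it.
void uppercase_ascii(char* s) {
    size_t len = strlen(s);
    for (size_t i = 0; i < len; ++i)
        if (s[i] >= 'a' && s[i] <= 'z')
            s[i] -= 'a' - 'A';
}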
Follow this 10-step network I/O benchmarking process: 1) Define metrics - throughput (bytes/second), latency (time per operation), connections per second. 2) Establish baseline - use standard tools: iperf3 for throughput, ping/hping for latency. 3) Control network conditions - test on isolated network or emulate conditions with tc (traffic control). 4) Warm up connections - TCP slow start affects initial measurements; run warm-up period. 5) Measure at multiple layers - raw socket, TCP, HTTP, application protocol - to isolate overhead sources. 6) Test various payload sizes - performance characteristics differ for small vs large transfers. 7) Test concurrent connections - single connection vs many; contention and scaling behavior. 8) Measure both client and server - bottleneck may be on either end. 9) Profile CPU usage - network I/O can be CPU-bound with high connection counts or encryption. 10) Compare against theoretical limits - calculate maximum based on link speed, latency; measure achieved percentage. Tools: wrk/wrk2 for HTTP, netperf for low-level, custom benchmarks for application protocols. Account for: TCP overhead, encryption (TLS), serialization.
Follow this 11-step real-time media optimization process: 1) Define latency budget - audio typically <10ms, video <33ms for 30fps; allocate per stage. 2) Use lock-free communication - between capture/process/playback threads to avoid priority inversion. 3) Pre-allocate all buffers - no runtime allocation; use circular buffers for streaming. 4) Avoid syscalls in processing - no file I/O, logging, or allocation in real-time thread. 5) Use SIMD extensively - media processing (filters, transforms) benefits greatly from vectorization. 6) Pin to dedicated cores - isolate real-time threads from system interrupts and other processes. 7) Use appropriate thread priority - SCHED_FIFO on Linux, THREAD_PRIORITY_TIME_CRITICAL on Windows. 8) Profile worst-case latency - measure 99.9th percentile, not average; spikes cause glitches. 9) Handle overload gracefully - drop frames rather than accumulating latency. 10) Use specialized DSP libraries - IPP, FFTW, vendor-specific audio libraries are highly optimized. 11) Test under stress - full CPU load, memory pressure, disk activity to find edge cases. Real-time constraint: every processing deadline must be met; missing one causes audible/visible glitch.
Follow this 10-step Valgrind usage checklist: 1) Build with debug info - compile with -g for source-level annotations. Keep -O2 for representative performance but consider -O0 for clearer traces. 2) Choose appropriate tool - Memcheck (default): memory errors; Cachegrind: cache simulation; Callgrind: call graph profiling; Massif: heap profiling; Helgrind/DRD: threading errors. 3) Expect slowdown - Memcheck: 10-50x slower, Cachegrind: 20-100x, Callgrind: 10-100x. Reduce input size accordingly. 4) Suppress known false positives - use suppression files to ignore expected errors from system libraries. 5) Run representative workload - even small inputs reveal cache behavior patterns. 6) Use annotation tools - cg_annotate for Cachegrind, callgrind_annotate for Callgrind, ms_print for Massif. 7) Focus on significant results - sort by impact (cache misses, time, allocations); ignore noise. 8) Check Cachegrind cache configuration - verify simulated cache matches target hardware: --I1, --D1, --LL options. 9) Use Callgrind with instrumentation control - CALLGRIND_START_INSTRUMENTATION/STOP to focus on hot code. 10) Interpret results carefully - Valgrind simulates, doesn't measure; use as relative comparison, not absolute numbers.
Follow this 12-step hot loop optimization checklist: 1) Confirm hotness - verify loop consumes >10% of total cycles using profiler. 2) Measure baseline IPC - if IPC<1.0, suspect memory or dependency issues; if IPC>1.0, focus on reducing instruction count. 3) Count operations per iteration - tally loads, stores, FLOPs, branches, function calls. 4) Check loop-carried dependencies - identify if iteration N depends on iteration N-1 results. 5) Verify compiler optimizations applied - check compiler reports for vectorization, unrolling status. 6) Profile cache behavior - measure L1/L2/L3 miss rates for loop accesses. 7) Check memory access pattern - verify sequential/strided access, look for indirect addressing. 8) Assess branch predictability - profile branch misprediction rate within loop. 9) Evaluate vectorization potential - check for data dependencies preventing SIMD. 10) Consider loop transformations - interchange, tiling, fusion, fission as applicable. 11) Implement changes one at a time - measure after each. 12) Verify correctness - ensure loop produces identical results post-optimization.
Follow this 15-point benchmarking environment checklist: 1) Dedicated hardware - use machine not running other workloads. 2) Disable Turbo Boost - 'echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo'. 3) Set fixed CPU frequency - 'cpupower frequency-set -g performance -d FREQ -u FREQ'. 4) Disable hyper-threading if testing single-thread - isolate physical cores. 5) Set CPU governor to performance - 'cpupower frequency-set -g performance'. 6) Disable ASLR - 'echo 0 > /proc/sys/kernel/randomize_va_space'. 7) Set CPU affinity - pin benchmark to specific cores with taskset. 8) Disable swap - 'swapoff -a' to ensure consistent memory behavior. 9) Drop caches before test - 'echo 3 > /proc/sys/vm/drop_caches'. 10) Minimize background processes - stop unnecessary services. 11) Use isolcpus - boot parameter to reserve cores from scheduler. 12) Disable network interrupts - on benchmark cores. 13) Wait for thermal equilibrium - let system temperature stabilize. 14) Record environment - CPU model, frequencies, kernel version, compiler version. 15) Validate repeatability - run multiple times, verify low variance (<5% coefficient of variation).
Follow this 13-point multi-threaded profiling checklist: 1) Profile single-threaded first - establish baseline, understand serial hot spots. 2) Use thread-aware profiler - VTune Threading analysis, perf with -t flag, or dedicated tools like Helgrind. 3) Measure scalability - run with 1, 2, 4, 8, N threads; calculate speedup = T1/TN. 4) Identify scalability limiters - is speedup sub-linear? Look for serial sections, contention, imbalanced work. 5) Profile lock contention - use perf lock, VTune Locks and Waits, or mutex debugging libraries. 6) Check for false sharing - if performance degrades with more threads, look for shared cache lines. 7) Verify work balance - check per-thread execution times; imbalance causes idle waiting. 8) Profile context switches - high context switch rate indicates excessive locking or thread thrashing. 9) Check NUMA effects - threads accessing remote memory have higher latency; use numactl to diagnose. 10) Measure synchronization overhead - time spent in barriers, locks, atomic operations. 11) Check memory bandwidth saturation - may limit scaling even if CPUs available. 12) Profile with production thread count - behavior may differ from small-scale testing. 13) Use timeline views - visualize thread activity over time to spot idle periods and synchronization patterns.
Try optimizations in this order for CPU-bound loops: 1) Algorithm improvement - can you reduce complexity? O(n) vs O(n log n) matters more than micro-optimization. 2) Compiler optimization check - ensure -O2/-O3 enabled, check vectorization reports. 3) Loop-invariant code motion - move calculations that don't change per iteration outside the loop. 4) Strength reduction - replace expensive operations (multiply/divide) with cheaper ones (shift/add). 5) Common subexpression elimination - compute shared expressions once, store in variable. 6) Loop unrolling - try 2x, 4x, 8x unroll factors; let compiler try first with -funroll-loops. 7) Vectorization - if not auto-vectorized, address blockers (aliasing, alignment, dependencies). 8) Reduce function call overhead - inline small functions, avoid virtual calls in hot path. 9) Reduce branch count - combine conditions, use branchless techniques for unpredictable branches. 10) Instruction-level parallelism - reorder independent operations to fill pipeline. Stop when: target performance achieved, profiler shows different bottleneck, or optimization yields <5% improvement (diminishing returns).
Apply loop optimizations in this recommended order: 1) Algorithm optimization - ensure you're using best algorithm; O(n) beats optimized O(n^2). 2) Loop-invariant code motion - move calculations that don't change per iteration outside loop. 3) Strength reduction - replace expensive ops (mul, div, mod) with cheaper ones (add, shift, mask). 4) Dead code elimination - remove calculations whose results are never used. 5) Loop interchange - for nested loops, reorder to improve memory access pattern. 6) Loop tiling/blocking - for cache locality when processing large arrays. 7) Loop fusion - combine adjacent loops with same bounds to improve locality. 8) Loop fission - split loop if it enables vectorization or reduces register pressure. 9) Loop unrolling - reduce loop overhead, expose ILP; try 2x, 4x factors. 10) Vectorization - SIMD for data-parallel operations. 11) Parallelization - multi-thread outer loops if sufficient work per thread. 12) Software pipelining - for VLIW or when other opts insufficient. Rationale: early opts may enable later ones (hoisting enables vectorization), algorithmic changes give biggest wins, parallelization adds complexity so comes last.
Follow this systematic 8-step process: 1) Establish baseline metrics - measure current performance with representative workloads, record wall time, CPU time, memory usage, and throughput. 2) Profile to identify hotspots - use sampling profiler (perf, VTune) to find where >10% of time is spent. 3) Categorize the bottleneck - determine if CPU-bound (high CPU utilization), memory-bound (high cache misses), I/O-bound (high wait time), or branch-bound (high misprediction rate). 4) Analyze the specific bottleneck - if memory-bound, profile cache behavior; if CPU-bound, check IPC and instruction mix. 5) Form optimization hypothesis - propose specific change based on bottleneck type. 6) Implement targeted fix - make ONE change at a time for measurable impact. 7) Measure improvement - re-profile with same workload, compare metrics quantitatively. 8) Iterate - if target not met, return to step 2; new hotspots may emerge after optimization. Critical rule: never optimize without profiling data first.
Follow this 10-step critical path latency reduction process: 1) Identify critical path - sequence of dependent operations determining minimum execution time. Use dependency analysis or VTune critical path view. 2) Measure path latency - sum instruction latencies along critical path. 3) Look for high-latency instructions - division (15-50 cycles), some shuffles, memory accesses on cache miss. 4) Replace high-latency ops - use approximations (rsqrt instead of 1/sqrt), lookup tables, or different algorithms. 5) Break dependency chains - use associativity to parallelize: (a+b)+(c+d) instead of a+b+c+d. 6) Use instruction-level parallelism - interleave independent operations to fill pipeline while waiting for dependencies. 7) Reduce memory latency impact - prefetch, improve locality, keep critical data in registers. 8) Avoid false dependencies - use different registers for independent computations so write-after-read (WAR) and write-after-write (WAW) hazards don't serialize them; read-after-write (RAW) dependencies are true dependencies and cannot be removed by renaming. 9) Consider out-of-order execution - modern CPUs reorder, but ROB size limits how far ahead they can look (roughly 200-500 instructions depending on the microarchitecture). 10) Measure improvement - verify critical path shortened and latency reduced. Critical insight: reducing latency matters most for throughput when many instances are in-flight; for single execution, may not matter.
Follow this 10-point optimization correctness checklist: 1) Preserve semantics - optimized code must produce identical output for all valid inputs. 2) Test with original test suite - all existing tests must pass unchanged. 3) Add edge case tests - empty inputs, single elements, maximum sizes, boundary values. 4) Compare against reference - run both versions on diverse inputs, compare outputs exactly. 5) Test numerical stability - for floating-point, verify acceptable precision (may differ due to reassociation). 6) Check corner cases - negative numbers, zeros, NaN, infinity for floating-point. 7) Test with sanitizers - run with AddressSanitizer, UndefinedBehaviorSanitizer to catch subtle bugs. 8) Verify thread safety - if parallelized, test with ThreadSanitizer, various thread counts. 9) Stress test - large inputs, sustained load, verify no degradation or crashes. 10) Document semantic changes - if optimization intentionally changes behavior (fast-math), document clearly. Golden rule: never trust an optimization until independently verified; subtle bugs in optimized code can be worse than slow correct code.
Follow this 8-step loop fission process: 1) Identify fission candidates - single loop with multiple independent statement groups accessing different data. 2) Check dependency safety - statements in resulting separate loops must not have cross-loop dependencies. 3) Analyze benefit potential - fission beneficial when: different statements access different data (improved spatial locality), one statement group can be vectorized but others prevent it, register pressure in combined loop causes spills. 4) Identify split points - group statements that access same data or have dependencies together. 5) Implement fission - split single loop into multiple loops. Before: for(i) { A[i]=..; B[i]=..; } After: for(i) A[i]=..; for(i) B[i]=..; 6) Handle carried dependencies - if dependency exists, fission may require reordering or may not be legal. 7) Verify correctness - split loops must produce identical results to original. 8) Measure improvement - check vectorization status (compiler reports), cache behavior, overall performance. Fission is inverse of fusion; useful when combined loop body is too large for registers or prevents vectorization.
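A minimal C sketch of step 5, assuming the arrays do not alias (function and array names are illustrative):

#include <stddef.h>

// Before fission: the indirect update blocks vectorization of the whole body.
void combined(float* a, int* hist, const int* bucket, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        a[i] = a[i] * 2.0f;        // simple, vectorizable
        hist[bucket[i]] += 1;      // indirect store, hard to vectorize
    }
}

// After fission: the first loop vectorizes cleanly; the second keeps the
// irregular accesses to itself.
void split(float* a, int* hist, const int* bucket, size_t n) {
    for (size_t i = 0; i < n; ++i) a[i] = a[i] * 2.0f;
    for (size_t i = 0; i < n; ++i) hist[bucket[i]] += 1;
}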
Follow this 10-step power optimization process: 1) Profile power usage - use Intel RAPL (Running Average Power Limit), powertop, or hardware power meters. 2) Understand power-performance trade-off - reducing frequency saves power but increases latency. 3) Optimize for completion time - faster completion allows longer idle periods; Race to Sleep strategy. 4) Use appropriate SIMD width - AVX-512 can cause frequency reduction on some CPUs; AVX2 may be more power-efficient. 5) Reduce memory traffic - DRAM access consumes significant power; cache optimization helps power too. 6) Use sleep states - yield CPU during waits rather than spinning; use condition variables. 7) Batch operations - process data in bursts allowing CPU to enter deeper sleep states between. 8) Consider compute intensity - power ~= constant + activity; more compute per memory access is more efficient. 9) Tune frequency governors - for sustained workloads, consider 'powersave' governor with occasional bursts. 10) Profile energy-delay product - minimize Energy x Time rather than just one metric. For servers: power efficiency (perf/watt) often more important than raw performance; for mobile: battery life critical.
Follow this 8-step false sharing prevention checklist: 1) Identify potential false sharing - look for arrays or structures where different threads write to adjacent elements. 2) Understand cache line size - typically 64 bytes on modern x86; false sharing occurs when threads write to same 64-byte region. 3) Profile for false sharing - look for high cache coherency traffic, poor scaling with thread count, or use tools like Intel Inspector. 4) Pad per-thread data - add padding to ensure each thread's data is in its own cache line: struct alignas(64) ThreadData { int data; char padding[60]; }; 5) Use thread-local storage - thread_local in C++, __thread in GCC, or explicit thread-local arrays. 6) Restructure arrays - instead of data[thread_id].field, use field[thread_id] with proper spacing. 7) Avoid sharing where possible - duplicate read-only data per thread rather than sharing single copy if write contention occurs. 8) Verify fix - re-profile to confirm cache coherency traffic reduced and scaling improved. Common pitfall: padding increases memory usage; only pad frequently-written data in hot paths.
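Expanding step 4 into a compilable sketch (the counter workload and names are illustrative; over-aligned heap allocation inside std::vector needs C++17):

#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Each worker gets its own 64-byte-aligned slot: alignas(64) also pads the
// struct size up to 64 bytes, so no two slots share a cache line.
struct alignas(64) PerThreadCounter {
    std::uint64_t count = 0;
};

void count_in_parallel(std::size_t nthreads, std::size_t iters) {
    std::vector<PerThreadCounter> counters(nthreads);   // one cache line per thread
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < nthreads; ++t)
        workers.emplace_back([&counters, t, iters] {
            for (std::size_t i = 0; i < iters; ++i)
                ++counters[t].count;                     // never touches another thread's line
        });
    for (auto& w : workers) w.join();
}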
Use this 15-point code review performance checklist: 1) N+1 queries - loops making database calls; should be batched. 2) Unbounded allocations - lists/buffers that can grow without limit. 3) String concatenation in loops - use StringBuilder or pre-sized buffer. 4) Synchronous I/O in hot path - should be async or moved off critical path. 5) Logging in tight loops - logging overhead adds up; use sampling or move outside loop. 6) Exception-based control flow - exceptions are expensive; don't use for normal cases. 7) Unnecessary boxing/unboxing - primitive to object conversion overhead. 8) Missing short-circuit evaluation - expensive checks should come last in conditions. 9) Unnecessary defensive copies - copying large objects 'just in case'. 10) Lock contention - locks held too long or too broadly. 11) Cache-unfriendly data structures - linked lists, hash tables with poor locality. 12) Repeated computation - same calculation done multiple times without caching. 13) Blocking on I/O - synchronously waiting for network/disk. 14) Inefficient algorithms - O(n^2) where O(n log n) exists. 15) Missing resource limits - unbounded queues, connection pools, thread creation. Not all patterns are problems everywhere; focus on hot paths identified by profiling.
Follow this 11-step container optimization process: 1) Minimize image size - use multi-stage builds, alpine bases, remove unnecessary files. Smaller = faster pull. 2) Layer effectively - put rarely-changing layers first (dependencies) to maximize cache reuse. 3) Optimize application startup - same startup optimizations as native, plus container-specific considerations. 4) Set appropriate resource limits - too low causes throttling, too high wastes cluster resources. 5) Use init containers wisely - parallelize where possible, minimize init work. 6) Configure health checks properly - fast to execute, appropriate intervals to avoid unnecessary restarts. 7) Use local image caching - pre-pull images to nodes, use image caching solutions. 8) Optimize overlay filesystem - many small files perform poorly; consider init container to unpack. 9) Profile container overhead - compare performance in container vs bare metal; identify container-specific costs. 10) Use appropriate base images - distroless or scratch for minimal overhead. 11) Consider sidecar overhead - service meshes, log collectors add CPU/memory; profile their impact. Kubernetes-specific: tune kubelet, use node-local DNS caching, appropriate pod disruption budgets.
Follow this 9-step dependency chain optimization process: 1) Identify long dependency chains - look for serial sequences where each instruction depends on previous result. 2) Measure chain length - count instructions from first to last in chain; compare to loop iteration count. 3) Calculate chain latency - sum latencies of instructions in chain; this is minimum time per iteration. 4) Check IPC - low IPC with low cache misses often indicates dependency-bound code. 5) Break arithmetic chains - use associativity: sum = (a+b) + (c+d) instead of a+b+c+d allows parallel execution. 6) Use multiple accumulators - for reductions, use 4-8 independent accumulators, combine at end. 7) Unroll with independent computations - unroll loop and interleave independent iterations. 8) Reorder instructions - place independent instructions between dependent ones to fill latency. 9) Verify improvement - check IPC increased and execution time decreased. Example: summing array with single accumulator has N-instruction chain; with 4 accumulators, chain is N/4, allowing 4x parallelism until final combination.
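A sketch of steps 5-6 with four independent accumulators (note that the reassociation slightly changes floating-point rounding):

#include <stddef.h>

// Single accumulator: every add waits on the previous result, so the loop
// runs at roughly one element per FP-add latency. Four accumulators give
// four independent chains that can execute in parallel.
float sum4(const float* x, size_t n) {
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i + 0];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; ++i) s0 += x[i];        // remainder elements
    return (s0 + s1) + (s2 + s3);          // combine once at the end
}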
Follow this 10-step lock-free optimization process: 1) Profile contention - measure CAS failure rates, retry counts, throughput under contention. 2) Understand memory ordering - verify correct use of acquire/release/seq_cst; incorrect ordering causes bugs. 3) Minimize atomic operations - reduce number of atomics in hot path; batch updates where possible. 4) Use appropriate memory orders - relaxed is cheapest, acquire/release sufficient for most cases, avoid seq_cst unless necessary. 5) Avoid false sharing - pad atomic variables to separate cache lines. 6) Consider DCAS vs CAS - some algorithms need double-word CAS; x86 has CMPXCHG16B. 7) Profile cache line bouncing - high cache coherency traffic indicates contention; consider partitioning. 8) Use backoff strategies - exponential backoff on CAS failure reduces contention. 9) Consider hybrid approaches - lock-free for common case, fallback to locks for rare cases. 10) Benchmark against locked alternatives - lock-free isn't always faster; mutex with short critical section may be better under low contention. Correctness first: lock-free bugs are subtle and hard to reproduce; use formal verification or extensive testing.
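A sketch of step 8 - a CAS retry loop with bounded exponential backoff, shown here for a hypothetical lock-free maximum update:

#include <atomic>
#include <thread>

// Memory orders follow step 4: acquire/release on success, relaxed on failure.
void update_max(std::atomic<long>& max_value, long candidate) {
    long observed = max_value.load(std::memory_order_relaxed);
    int backoff = 1;
    while (observed < candidate &&
           !max_value.compare_exchange_weak(observed, candidate,
                                            std::memory_order_acq_rel,
                                            std::memory_order_relaxed)) {
        // On failure, observed now holds the latest value; pause briefly
        // before retrying so contending threads stop hammering the line.
        for (int i = 0; i < backoff; ++i) std::this_thread::yield();
        if (backoff < 64) backoff *= 2;
    }
}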
Follow this 10-step branch optimization process for code with high branch density: 1) Profile branch behavior - identify which branches mispredicted most: perf record -e branch-misses. 2) Categorize branches - predictable (taken/not-taken >90% of time), moderately predictable (70-90%), unpredictable (<70%). 3) For predictable branches: use likely/unlikely hints (__builtin_expect), ensure hot path is fall-through. 4) For unpredictable data-dependent branches: consider branchless alternatives using bitwise ops, conditional moves, or arithmetic. 5) For loop-exit branches: use loop unswitching to move branch outside loop if condition is loop-invariant. 6) For multiple chained conditions: reorder to evaluate most likely first, or use short-circuit evaluation. 7) For switch statements: ensure compiler generates jump table for dense cases; consider profile-guided ordering. 8) For function dispatch: replace virtual calls with switch if small fixed set of types. 9) Use SIMD masking - process all paths simultaneously, select results with blend instructions. 10) Apply PGO - profile-guided optimization provides measured branch probabilities to compiler. Verify: branchless code may have higher latency per operation but removes unpredictability; measure to confirm improvement.
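A small illustration of step 4, replacing a data-dependent branch with an unconditional arithmetic update (function names are illustrative; modern compilers may already apply this rewrite or vectorize the loop, so measure):

#include <stddef.h>

// Branchy version: one conditional jump per element; mispredicts heavily
// when the comparison outcome is effectively random.
size_t count_ge_branchy(const int* x, size_t n, int threshold) {
    size_t c = 0;
    for (size_t i = 0; i < n; ++i)
        if (x[i] >= threshold) ++c;
    return c;
}

// Branchless version: the comparison result (0 or 1) is added unconditionally,
// so there is no data-dependent jump left to mispredict.
size_t count_ge_branchless(const int* x, size_t n, int threshold) {
    size_t c = 0;
    for (size_t i = 0; i < n; ++i)
        c += (size_t)(x[i] >= threshold);
    return c;
}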
Follow this 10-step loop tiling process: 1) Identify candidate loops - nested loops accessing multi-dimensional arrays with poor cache reuse. Classic example: matrix operations. 2) Analyze access pattern - understand which indices access which dimensions, identify stride patterns. 3) Calculate cache parameters - L1 data cache size (typically 32KB), line size (64 bytes), associativity. 4) Choose tile sizes - tile should fit in target cache level. For L1: total data per tile < 32KB. For L2: < 256KB. 5) Tile innermost data reuse dimension - place tile loops to maximize reuse within tile. 6) Handle boundary conditions - when array dimensions aren't divisible by tile size, handle remainder tiles. 7) Implement tiled loop nest - original: for(i) for(j) for(k) becomes: for(ii+=TI) for(jj+=TJ) for(i=ii) for(j=jj) for(k). 8) Verify correctness - tiled loop must compute same result as original. 9) Tune tile sizes - optimal sizes depend on problem size and cache hierarchy; try powers of 2: 16, 32, 64. 10) Measure improvement - profile cache miss rates before/after; L1 misses should decrease significantly for properly tiled loops.
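A sketch of steps 4-7 for a row-major matrix multiply that accumulates into C; the tile edge of 32 is an assumption to tune per step 9:

#include <stddef.h>

enum { T = 32 };   // assumed tile edge: 32*32 floats = 4 KB per tile

// Tiled version of C += A * B for n x n row-major matrices (step 7).
// Boundary handling (step 6) is done with the "&& < n" conditions.
void matmul_tiled(size_t n, const float* A, const float* B, float* C) {
    for (size_t ii = 0; ii < n; ii += T)
        for (size_t jj = 0; jj < n; jj += T)
            for (size_t kk = 0; kk < n; kk += T)
                for (size_t i = ii; i < ii + T && i < n; ++i)
                    for (size_t j = jj; j < jj + T && j < n; ++j) {
                        float acc = C[i * n + j];
                        for (size_t k = kk; k < kk + T && k < n; ++k)
                            acc += A[i * n + k] * B[k * n + j];
                        C[i * n + j] = acc;
                    }
}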
Follow this 11-step profile-guided optimization process: 1) Build baseline without PGO - measure current performance for comparison. 2) Instrument code for profiling - GCC: -fprofile-generate, Clang: -fprofile-instr-generate. 3) Design representative workload - must cover typical production patterns, not just unit tests. 4) Run instrumented binary - execute with representative workload to collect profile data. 5) Verify coverage - ensure hot code paths were executed; check coverage reports. 6) Build with profile data - GCC: -fprofile-use, Clang: -fprofile-instr-use=profile.profdata. 7) Measure improvement - compare PGO build against baseline on same workload. 8) Verify correctness - run full test suite on PGO build. 9) Establish profile update cadence - re-profile when code or usage patterns change significantly. 10) Automate in CI/CD - integrate profile collection and PGO builds into build pipeline. 11) Consider BOLT - post-link optimization using perf data for additional gains (5-15% typical). Expected gains: 10-30% for branch-heavy code like compilers, interpreters; less for compute-bound numeric code.
Follow this 11-step database query optimization process: 1) Profile query frequency and latency - identify hot queries (frequent) and slow queries (high latency). 2) Examine query plans - use EXPLAIN ANALYZE to see execution plan and actual times. 3) Check index usage - verify indexes exist and are used; add missing indexes for WHERE/JOIN columns. 4) Reduce round trips - batch queries, use JOINs instead of N+1 queries, fetch multiple results at once. 5) Limit result sets - use LIMIT, pagination; don't fetch more than needed. 6) Optimize data transfer - select only needed columns, not SELECT *; reduce transferred bytes. 7) Use connection pooling - connection creation is expensive; reuse connections. 8) Add application-level caching - cache read-heavy, rarely-changing data; use appropriate TTLs. 9) Use prepared statements - reduce parsing overhead for repeated queries. 10) Optimize transactions - minimize transaction scope and duration; avoid holding locks during app logic. 11) Consider denormalization - for read-heavy workloads, duplicate data to avoid expensive joins. Measure at application level: database latency is only part; include serialization, network, and app processing time.
Follow this 9-step VLIW instruction scheduling process: 1) Understand the target architecture - identify number of functional units, instruction latencies, and bundle width (how many ops per VLIW word). 2) Build dependency graph - create DAG of instructions showing true dependencies (RAW), anti-dependencies (WAR), and output dependencies (WAW). 3) Calculate critical path - find longest dependency chain; this is minimum execution time. 4) Identify parallelism - find independent instruction groups that can execute simultaneously. 5) Apply list scheduling - prioritize instructions by: critical path length, number of successors, resource constraints. 6) Pack into bundles - place compatible instructions in same VLIW word respecting functional unit constraints. 7) Handle hazards - insert NOPs where dependencies cannot be satisfied by scheduling. 8) Consider software pipelining - for loops, overlap iterations by starting iteration N+1 before N completes, using modulo scheduling. 9) Minimize code size - balance between NOP insertion (code bloat) and register pressure. Software pipelining particularly effective for VLIW as it exposes more parallelism than single-iteration scheduling.
Use this 12-point performance documentation checklist: 1) Describe the problem - what was slow, how slow, under what conditions? 2) Include baseline measurements - exact numbers with methodology, environment details. 3) Explain root cause analysis - how did you identify the bottleneck? What tools used? 4) Document what you tried - including approaches that didn't work and why. 5) Describe the solution - what change was made, why does it work? 6) Include after measurements - same methodology as baseline for valid comparison. 7) Quantify improvement - percentage, absolute numbers, statistical significance. 8) Note tradeoffs - does optimization increase memory usage, code complexity, or reduce maintainability? 9) Specify applicability - when does this optimization apply? What conditions might make it not apply? 10) Add code comments - explain non-obvious optimization in code near the optimization. 11) Update performance tests - add regression tests to prevent future slowdowns. 12) Record environment details - compiler version, flags, hardware, OS - for reproducibility. Good documentation prevents: re-investigating same issues, accidentally reverting optimizations, applying optimizations where they don't help.
Follow this 12-point cloud optimization checklist: 1) Minimize cold start - for serverless: reduce package size, lazy load dependencies, use provisioned concurrency. 2) Optimize for spot/preemptible instances - checkpoint state, handle termination gracefully. 3) Right-size instances - profile to choose appropriate CPU/memory; oversized wastes money, undersized hurts performance. 4) Use appropriate storage tier - SSD for IOPS, HDD for throughput, object storage for large files. 5) Optimize network architecture - minimize cross-AZ traffic, use private endpoints, consider edge locations. 6) Implement connection pooling - database connections are expensive; reuse across invocations. 7) Cache aggressively - use Redis/Memcached for hot data; CDN for static content. 8) Batch operations - reduce round-trips to managed services which often have high per-request latency. 9) Monitor costs alongside performance - cloud bills can surprise; optimize for cost-performance ratio. 10) Use autoscaling effectively - set appropriate thresholds; don't scale on minor fluctuations. 11) Consider reserved capacity - committed use discounts for predictable workloads. 12) Profile regularly - cloud environment changes; re-profile after updates.
Follow this 11-step floating-point optimization process: 1) Profile FP operations - use perf with FP-specific events or VTune to count FLOPS and FP stalls. 2) Check for denormal penalties - denormalized numbers (very small values near zero) can be 100x slower; consider flushing to zero with _MM_SET_FLUSH_ZERO_MODE. 3) Identify precision requirements - float (32-bit) is 2x throughput of double (64-bit) for SIMD; use float if precision allows. 4) Use FMA instructions - fused multiply-add has lower latency than separate mul+add and may be more accurate. Enable with -mfma. 5) Consider -ffast-math - enables aggressive optimizations (reassociation, reciprocal approximations) that may change results. Use -fno-math-errno, -ffinite-math-only, -fno-trapping-math individually for finer control. 6) Reduce division - multiply by reciprocal when denominator is loop-invariant; division is 15-20x slower than multiplication. 7) Use intrinsics for special functions - approximate rsqrt, rcp are much faster than precise versions when acceptable. 8) Vectorize FP loops - SIMD provides 4x (SSE/float) to 16x (AVX-512/float) throughput. 9) Check for unnecessary type conversions - int-to-float conversions have latency. 10) Order operations for accuracy - sum small numbers first to reduce rounding error. 11) Verify numerical results - aggressive optimizations can change results; compare against reference implementation.
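A sketch of steps 2 and 6, assuming SSE/AVX math on x86 (the MXCSR macros come from <xmmintrin.h>/<pmmintrin.h> and affect only the calling thread):

#include <stddef.h>
#include <xmmintrin.h>   // _MM_SET_FLUSH_ZERO_MODE
#include <pmmintrin.h>   // _MM_SET_DENORMALS_ZERO_MODE

void scale_all(float* x, size_t n, float divisor) {
    // Step 2: flush denormals to zero to avoid the large penalty near zero.
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);

    // Step 6: one division outside the loop, multiplications inside
    // (rounding differs slightly from dividing each element).
    const float inv = 1.0f / divisor;
    for (size_t i = 0; i < n; ++i)
        x[i] *= inv;
}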
Follow this 9-step instruction count reduction process: 1) Profile instruction mix - use perf stat to get total instructions, breakdown by type if available. 2) Identify high-instruction-count regions - use perf annotate to see instruction counts per source line. 3) Look for redundant computations - common subexpressions, loop-invariant calculations, repeated checks. 4) Apply strength reduction - replace expensive ops: multiply by power of 2 -> shift, divide -> multiply by reciprocal, modulo -> bitwise AND for power-of-2 divisors. 5) Use lookup tables - for complex calculations with small input domain, precompute results. 6) Leverage SIMD - one vector instruction replaces 4-16 scalar instructions. 7) Use specialized instructions - POPCNT for bit counting, LZCNT for leading zeros, FMA for multiply-add. 8) Simplify control flow - merge similar branches, eliminate dead code, simplify predicates. 9) Measure actual impact - fewer instructions doesn't always mean faster if new instructions are higher latency or cause other bottlenecks. Focus on hot code paths; reducing cold code instructions has minimal impact.
Follow this 10-point cross-vendor optimization checklist: 1) Identify target CPUs - determine if code will run on Intel-only, AMD-only, or both. 2) Check instruction availability - AVX-512 availability varies; AMD added support in Zen 4. 3) Consider cache differences - AMD typically has larger L3, different associativity; Intel has larger L2 per core. 4) Check SIMD performance - some SIMD instructions have different throughput/latency between vendors. 5) Profile on both architectures - performance characteristics can differ significantly. 6) Use vendor-agnostic intrinsics - prefer generic _mm* over vendor-specific extensions. 7) Check branch predictor differences - some patterns predicted differently; profile branch misses on both. 8) Consider NUMA topology - AMD chiplet design has different NUMA characteristics than Intel monolithic. 9) Test AVX2 vs AVX-512 - on AMD, AVX-512 may not be faster due to frequency scaling; benchmark. 10) Use runtime dispatch - detect CPU features and dispatch to optimized code paths: __builtin_cpu_supports(). For portable code: optimize for common subset (AVX2), add special paths for vendor-specific features with runtime detection.
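A sketch of point 10 using the GCC/Clang __builtin_cpu_supports builtin and a target attribute (the scale functions are illustrative):

#include <stddef.h>

// Two builds of the same kernel; the AVX2 variant carries a target attribute
// (it could equally live in a separate file compiled with -mavx2).
__attribute__((target("avx2")))
static void scale_avx2(float* x, size_t n) {
    for (size_t i = 0; i < n; ++i) x[i] *= 2.0f;   // compiled with AVX2 code generation
}

static void scale_scalar(float* x, size_t n) {
    for (size_t i = 0; i < n; ++i) x[i] *= 2.0f;   // portable baseline path
}

// Dispatch at runtime based on detected CPU features.
void scale(float* x, size_t n) {
    if (__builtin_cpu_supports("avx2"))
        scale_avx2(x, n);
    else
        scale_scalar(x, n);
}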
Follow this 10-step memory leak debugging process: 1) Confirm leak exists - monitor memory usage over time; growing usage without corresponding workload increase indicates leak. 2) Reproduce reliably - identify workload pattern that triggers leak; create minimal reproducer. 3) Quantify leak rate - measure bytes leaked per operation or per time period. 4) Run with leak detector - Valgrind Memcheck, AddressSanitizer (-fsanitize=address), or Dr. Memory. 5) Identify allocation site - detector reports where leaked memory was allocated. 6) Trace ownership - follow code path from allocation to understand intended deallocation point. 7) Identify missing free - find where deallocation should occur but doesn't. 8) Categorize leak type - true leak (lost pointer), logical leak (growing cache/queue), or reference cycle (garbage-collected languages). 9) Implement fix - add missing deallocation, limit cache size, break reference cycles. 10) Verify fix - rerun leak detector, monitor memory usage over extended period. For production: use sampling-based leak detectors (TCMalloc heap profiler) that have lower overhead than Valgrind.
Cost Models
75 questions
Sorting cost model (comparison-based): Comparisons: O(N log N), typically 1.5N log N for quicksort, 1.0N log N for mergesort. Memory accesses: quicksort mostly in-place, mergesort needs N extra space. Cache behavior: quicksort better cache locality, mergesort more predictable. At scale (N > L3/sizeof(element)): Quicksort: ~2N log N * 200 cycles (random access). Mergesort: more sequential, ~1.5N log N * 40 cycles (better prefetching). Radix sort (non-comparison): O(N * key_width), better for integers. Example: sorting 1M 64-bit integers. std::sort: ~20-30 million comparisons (roughly 1.5N log N), ~200ms. Radix sort: 8 passes * 1M = 8M operations, ~50ms. Rule: radix sort wins for large arrays of fixed-size keys.
An integer ADD instruction has a latency of 1 cycle on modern x86 CPUs (Intel and AMD). The throughput is typically 3-4 operations per cycle due to multiple execution units. This means while each individual ADD takes 1 cycle to produce its result, the CPU can execute 3-4 independent ADDs simultaneously. For dependency chains, count 1 cycle per ADD. For independent operations, divide total ADDs by 3-4 for approximate cycle count.
Unaligned access cost on modern x86: Within cache line: 0 extra cycles (handled by hardware). Crossing cache line boundary: +4-10 cycles (two cache accesses needed). Crossing page boundary: very expensive, potentially 100+ cycles (two TLB lookups). SIMD unaligned loads (VMOVDQU vs VMOVDQA): historically 1-2 cycles more, now essentially free on Haswell+. Cost model: ensure arrays are aligned to 16/32/64 bytes for SIMD. For structures with mixed sizes, use alignas() or manual padding. Unaligned atomics may not be atomic (split across cache lines). Rule: align to max(element_size, SIMD_width) for best performance.
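A small sketch of the alignment rule for AVX (32-byte) data, assuming a C11 environment where aligned_alloc is available:

#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

int main(void) {
    // Request 32-byte alignment for AVX loads; for C11 aligned_alloc the
    // size must be a multiple of the alignment (4096 bytes here).
    size_t n = 1024;
    float* v = aligned_alloc(32, n * sizeof(float));
    if (!v) return 1;

    // Cheap runtime check that the pointer really is 32-byte aligned.
    assert(((uintptr_t)v & 31) == 0);

    free(v);
    return 0;
}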
Division by constant is optimized by compilers: Compile-time constant divisor: replaced with multiply + shift, ~4-6 cycles. Power of 2: becomes shift, 1 cycle. Runtime variable divisor: full division, 26-40+ cycles. Savings: 5-10x faster for constant divisors. Cost model for repeated division: If dividing N values by same runtime divisor: precompute reciprocal once, then N multiplications. libdivide library automates this. Single division: full cost unavoidable. Modulo by constant: also optimized, same ~4-6 cycles. Example: x / 7 becomes (x * 0x2492492...) >> 32, about 4 cycles. Use const or constexpr for divisors when possible to enable optimization.
Basic function call overhead is 2-5 cycles for the CALL/RET instructions themselves, but total overhead includes: (1) Stack frame setup: PUSH RBP, MOV RBP,RSP = 2-3 cycles. (2) Register saves if needed: 1 cycle per register. (3) Stack alignment: potentially 1-2 cycles. (4) Return: POP RBP, RET = 2-3 cycles. Typical total: 5-15 cycles for small functions. However, if function is not inlined and causes instruction cache miss, add 10-50+ cycles. For tiny functions (1-3 instructions), call overhead can exceed function body cost by 5-10x.
Memory bandwidth = Memory_Clock * 2 (DDR) * Bus_Width * Channels / 8. Example for DDR4-3200 dual channel: 1600 MHz * 2 * 64 bits * 2 channels / 8 = 51.2 GB/s. DDR5-6400 dual channel: 3200 * 2 * 64 * 2 / 8 = 102.4 GB/s. Real-world efficiency is 70-90% of theoretical maximum due to refresh cycles, command overhead, and access patterns. Single-threaded code typically achieves only 15-30% of peak bandwidth due to limited memory-level parallelism (need ~64 outstanding requests to saturate bandwidth).
Float vs double on modern x86: Scalar operations: same latency and throughput (both use 64-bit FP unit). SIMD operations: float has 2x throughput (8 floats vs 4 doubles in AVX). Memory: float uses half the bandwidth and cache. Specific costs: float ADD/MUL: 4 cycles latency, 2/cycle throughput (same as double). Division: float ~11 cycles, double ~13-14 cycles (float slightly faster). SIMD ADD (AVX): 8 floats or 4 doubles per instruction, same latency. Cost model: memory-bound code benefits 2x from float (half the bytes). Compute-bound SIMD code: float 2x faster. Pure scalar compute: roughly equal. Choose float for performance unless precision requires double.
Modulo has the same cost as division since it uses the same hardware (DIV/IDIV produces both quotient and remainder). For 32-bit modulo: 26-35 cycles latency, throughput of one every 6 cycles on Intel. Key optimizations: (1) Modulo by power of 2 becomes AND operation (1 cycle): x % 8 = x & 7. (2) For constant divisors, compilers use multiply-by-reciprocal: about 4 cycles total (multiply + shift). (3) For runtime divisors, precompute reciprocal if dividing many values by same divisor. Avoid modulo in tight loops.
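A small C sketch of the power-of-2 case, e.g. reducing a hash into a table whose size is 2^k (names are illustrative):

    #include <stdint.h>

    // table_size must be a power of two; the mask is identical to hash % table_size
    // but costs 1 cycle instead of a 26+ cycle division.
    static inline uint32_t bucket(uint32_t hash, uint32_t table_size) {
        return hash & (table_size - 1);
    }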
L1 cache access latency is 4-5 cycles on modern Intel and AMD processors (approximately 1-1.5 nanoseconds at 3-4 GHz). L1 data cache (L1D) is typically 32-48KB per core. L1 instruction cache (L1I) is typically 32KB per core. The L1 cache can sustain 2 loads and 1 store per cycle. For cost estimation: assume 4 cycles for any memory access that hits L1. This is your best-case memory latency and baseline for comparison with other cache levels.
Hash table lookup cost: Hash computation: 5-50 cycles depending on hash function (CRC32: ~5, SHA: ~100+). Table lookup: 4-200 cycles depending on table size vs cache. Collision handling: adds one lookup per collision. Best case (small table in L1, no collision): ~10-15 cycles. Typical case (table in L3, avg 0.5 collisions): ~50-80 cycles. Worst case (table in DRAM, several collisions): ~400+ cycles. Cost model: lookup_cost = hash_cycles + expected_probes * memory_latency_for_table_size. Keep tables sized to fit in cache when possible. For high-performance: use cache-line-aligned buckets, minimize pointer chasing.
Integer to string conversion cost: Naive div/mod by 10 approach: ~26 cycles per digit (division cost). 64-bit integer (up to 20 digits): ~520 cycles worst case. Optimized approaches: lookup table for 00-99: ~10 cycles per 2 digits. Multiply by reciprocal: ~5 cycles per digit. Modern libraries (fmt, to_chars): 50-100 cycles for typical integers. Cost model: digit_count = floor(log10(n)) + 1. Naive: digit_count * 26 cycles. Optimized: digit_count * 5-10 cycles. sprintf overhead: adds 100-500 cycles for format parsing. For high throughput: use specialized integer formatting (fmt library is 5-10x faster than sprintf).
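A C sketch of the 00-99 lookup-table approach (one pair of divisions per two digits, both by the constant 100, which compilers turn into multiply-shift); a simplified illustration, not a drop-in replacement for to_chars or fmt:

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    // Two-character table filled once: kDigits[2*i], kDigits[2*i+1] spell "00".."99".
    static char kDigits[200];
    static void init_digits(void) {
        for (int i = 0; i < 100; i++) {
            kDigits[2 * i]     = (char)('0' + i / 10);
            kDigits[2 * i + 1] = (char)('0' + i % 10);
        }
    }

    // Writes the decimal digits of v into buf (no terminator), returns the length.
    static size_t u32_to_str(uint32_t v, char* buf) {
        char tmp[10];
        char* p = tmp + sizeof tmp;
        while (v >= 100) {                 // peel off two digits at a time
            uint32_t r = (v % 100) * 2;
            v /= 100;
            *--p = kDigits[r + 1];
            *--p = kDigits[r];
        }
        if (v >= 10) {
            *--p = kDigits[v * 2 + 1];
            *--p = kDigits[v * 2];
        } else {
            *--p = (char)('0' + v);
        }
        size_t len = (size_t)(tmp + sizeof tmp - p);
        memcpy(buf, p, len);
        return len;
    }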
Loop unrolling tradeoffs: Benefits: reduces loop overhead (test, branch) by factor of unroll. Unrolling 16x eliminates ~94% of loop branches. Typical speedup: 1.2-2x for unroll factor 4-8. Costs: code size grows linearly with unroll factor. Excessive unrolling causes: instruction cache misses, defeats micro-op cache (typically holds ~1500 micro-ops). Sweet spot: unroll 4-8x usually optimal, beyond 16x rarely helps. Cost model: unrolled_cycles = base_cycles / unroll_factor + overhead. But if code exceeds L1I cache (32KB), add i-cache miss penalty (10-20 cycles per miss). Diminishing returns beyond instruction_cache_size / loop_body_size iterations.
Array of Structs (AoS) vs Struct of Arrays (SoA): AoS: struct Point { float x, y, z; } points[N]; - loads 12 bytes per point, may waste cache if only accessing x. SoA: struct Points { float x[N], y[N], z[N]; }; - loads only needed components, SIMD-friendly. Cache efficiency: if using 1 field: AoS wastes 2/3 of loaded cache lines. SIMD vectorization: SoA naturally aligns for SIMD (8 x values contiguous). AoS requires gather or shuffle. Cost model: AoS processing all fields: ~equal to SoA. AoS processing one field: 3x memory bandwidth wasted (for 3-field struct). SoA with SIMD: potential 8x speedup from vectorization. Rule: use SoA for hot data processed in bulk; AoS for random access of complete records.
Memory barrier/fence costs on x86: MFENCE (full barrier): 30-100 cycles on Intel, varies by microarchitecture. SFENCE (store barrier): 5-30 cycles. LFENCE (load barrier): near zero cycles (serializing, mainly for timing). LOCK prefix (implicit barrier): 12-35 cycles (same as atomic operation). On x86, most barriers are cheap due to strong memory model - stores already have release semantics. Cost model: MFENCE dominates when used; avoid in tight loops. std::atomic with sequential consistency uses MFENCE on stores. Prefer acquire/release semantics when possible (free on x86). ARM/PowerPC: barriers are more expensive (weaker memory model), 50-200 cycles typical.
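A C11 <stdatomic.h> sketch of the acquire/release point: the release store and acquire load compile to plain moves on x86, avoiding the heavier fence or locked instruction that a sequentially consistent store requires:

    #include <stdatomic.h>

    static atomic_int ready;   // 0 until the payload is published
    static int payload;

    void producer(void) {
        payload = 42;
        atomic_store_explicit(&ready, 1, memory_order_release);   // plain store on x86
    }

    int consumer(void) {
        while (!atomic_load_explicit(&ready, memory_order_acquire)) { /* spin */ }
        return payload;   // guaranteed to observe 42
    }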
Page fault costs: Minor fault (page in memory, just needs mapping): 1-10 microseconds (3,000-30,000 cycles). Major fault (page must be read from disk): 1-10 milliseconds for SSD (3-30 million cycles). HDD: 10-100 milliseconds (30-300 million cycles). Kernel overhead per fault: ~1000-5000 cycles for trap handling. Cost model: first touch of allocated memory triggers minor faults. mmap'd file access triggers major faults if not cached. For performance: pre-fault memory with memset, use madvise(MADV_WILLNEED) for file mappings. Huge pages reduce fault frequency by 512x (2MB vs 4KB pages). Detection: perf stat -e page-faults shows fault count.
Indirect branch cost depends on prediction: Predicted correctly: 1-2 cycles (similar to direct branch). Mispredicted: 15-20 cycles (full pipeline flush). Prediction accuracy depends on call site pattern: always same target = near 100% predicted. few targets, stable pattern = well predicted. many targets or random = poorly predicted. Cost model: average_cost = 1 + (1 - prediction_rate) * 15. For virtual function calls: usually well-predicted if object type is consistent. For switch statements compiled to jump tables: depends on case distribution. Profile-guided optimization (PGO) significantly improves indirect branch prediction.
L3 cache access latency is 30-50 cycles on modern CPUs (approximately 10-20 nanoseconds). L3 is shared across all cores, typically 8-96MB total. Accessing a local L3 slice takes about 20 cycles, while accessing a remote slice (owned by another core) can take 40+ cycles. For cost estimation: L3 hit adds 25-45 cycles over L1 baseline. This is 5-10x slower than L1. If working set exceeds L2 but fits in L3, expect average latency around 35-40 cycles.
Floating-point to string is much more expensive than integer: Simple cases (small integers, simple decimals): 200-500 cycles. Complex cases (large exponents, many decimal places): 1000-5000 cycles. Full precision (17 digits for double): often requires arbitrary precision arithmetic. Algorithms: Grisu2/Grisu3: 100-300 cycles for simple cases. Ryu (newer): ~100 cycles average, handles all cases correctly. Dragonbox: similar to Ryu, different tradeoffs. Cost model: ~5-10x slower than integer conversion. printf/sprintf: adds parsing overhead, total 500-2000 cycles. For high throughput: use Ryu or Dragonbox-based libraries (10x faster than naive approaches).
Cache miss costs (cycles added over L1 hit baseline of 4 cycles): L1 miss, L2 hit: +8-10 cycles (total ~12-14). L1+L2 miss, L3 hit: +26-36 cycles (total ~30-40). L1+L2+L3 miss, DRAM: +150-250 cycles (total ~160-260). Cost model for array access: if array_size <= 32KB (L1): 4 cycles/access. if array_size <= 256KB (L2): mix of 4 and 12 cycles. if array_size <= 8MB (L3): mix including ~40 cycle accesses. if array_size > L3: approaching 200+ cycles for random access. Sequential access amortizes: one miss per 64 bytes (8 doubles), so effective cost = miss_latency / 8 per element.
Latency-bound: each operation depends on previous, limited by instruction latency. Example: sum += array[i] - each add waits for previous add (3-4 cycles each). Rate: 1 add per 4 cycles = 0.25 adds/cycle. Throughput-bound: operations are independent, limited by execution units. Example: four parallel sums merged at end. Rate: 2 adds/cycle (using both FP add units). Speedup: 8x by using 4 accumulators. Cost model: dependency_chain_length * latency vs total_ops / throughput. Take maximum. Optimization: unroll loops with multiple accumulators to convert latency-bound to throughput-bound. Typical improvement: 4-8x for reduction operations.
Reduction operation costs: Scalar sum of N elements: latency-bound at ~4 cycles per add (dependency chain). Total: 4N cycles. Multi-accumulator (4 accumulators): throughput-bound at 2 adds/cycle. Total: N/2 cycles (8x faster). SIMD reduction (AVX with 8 floats): vector adds at 8 elements per instruction, then horizontal reduce. Vertical: N/8 vector adds = N/16 cycles. Horizontal reduction: ~10-15 cycles (shuffle + add tree). Total: N/16 + 15 cycles. Cost model: scalar_reduction = N * op_latency. Optimized: N / (accumulators * throughput) + horizontal_reduction. Memory-bound check: if N * element_size > L3, time = N * element_size / bandwidth.
IPC varies dramatically by workload: Compute-bound vectorized code: 2-4 IPC (utilizing multiple execution units). Integer-heavy general code: 1-2 IPC. Memory-bound code: 0.2-0.5 IPC (waiting for cache misses). Branch-heavy code: 0.5-1 IPC (misprediction stalls). Database workloads: often 0.3-0.7 IPC (random memory access). Modern CPUs can retire 4-6 instructions per cycle theoretically. Rule of thumb: IPC < 0.7 indicates optimization opportunity (likely memory-bound). Measure with: perf stat -e instructions,cycles your_program. IPC = instructions / cycles.
Network operation costs: Syscall overhead: 500-1500 cycles per send/recv. Localhost roundtrip: 10-50 microseconds. Same datacenter roundtrip: 0.5-2 milliseconds. Cross-datacenter: 10-100+ milliseconds. TCP overhead: connection setup 3-way handshake adds 1.5x RTT. Kernel network stack: ~10 microseconds processing per packet. Cost model: network_time = syscall + RTT + size/bandwidth. For small messages: RTT dominates. For large transfers: bandwidth dominates. Optimization: batch small messages, use TCP_NODELAY to disable Nagle, use kernel bypass (DPDK) for lowest latency. 10 Gbps network: 1.25 GB/s, or about 0.4 bytes per cycle (roughly 2.4 cycles per byte) at 3 GHz.
Branch misprediction penalty is 15-20 cycles on modern Intel and AMD processors. Intel Skylake: approximately 16-17 cycles. AMD Zen 1/2: approximately 19 cycles. When mispredicted, the pipeline must be flushed and refilled from the correct target. For cost estimation: if branch prediction accuracy is P, average cost = (1-P) * 15-20 cycles per branch. Random branches (50% predictable) cost about 8-10 cycles on average. Well-predicted branches (>95%) cost <1 cycle average. Avoid data-dependent branches in hot loops.
Garbage collection pause costs: Minor GC (young generation): 1-10 milliseconds typically. Major GC (full heap): 100ms to several seconds depending on heap size. Concurrent collectors (G1, ZGC, Shenandoah): target 10ms max pause. Cost model: pause_time proportional to live_objects for copying collectors. For stop-the-world GC: pause = heap_size / scan_rate (~1GB/s typical). Rule of thumb: 1ms per 100MB of live data for minor GC. Mitigation: reduce allocation rate, tune heap size, use low-pause collectors. For latency-critical: budget 10-20% of cycle time for GC. Allocation cost: JIT-optimized bump pointer = 10-20 cycles. Measured: Java allocation can be faster than malloc.
Min/max operation costs: Integer: compare + conditional move = 2 cycles. Or using bit tricks: 3-4 cycles for branchless. Floating point (MINSS/MAXSS): 3-4 cycles latency. SIMD (VPMINSB, etc.): 1 cycle for packed integers, 4 cycles for packed floats. Branching min/max: 1 cycle if predicted, 15-20 cycles if mispredicted. Cost model: for predictable comparisons (e.g., clamping), branches may be faster. For random comparisons, branchless (CMOV or SIMD) wins. Reduction to find min/max of array: N comparisons, but SIMD can do 8-16 per instruction. SIMD min/max of N elements: N/16 cycles for AVX2, plus horizontal reduction (10 cycles).
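A small C sketch of the branchless form; compilers typically emit CMOV or MINSS/MAXSS for these ternaries, giving data-independent latency:

    // Branchless clamp of x into [lo, hi]; no data-dependent branches to mispredict.
    static inline float clampf(float x, float lo, float hi) {
        x = x < lo ? lo : x;
        x = x > hi ? hi : x;
        return x;
    }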
Cache line size is 64 bytes on x86 (Intel/AMD) and 128 bytes on Apple M-series. Memory is transferred in whole cache lines - accessing 1 byte loads 64 bytes. Performance implications: (1) Sequential access: effectively 64 bytes per cache miss. (2) Strided access: if stride >= 64 bytes, every access is a cache miss. (3) False sharing: two threads modifying different variables on same cache line cause ping-pong. Cost model: array traversal with stride S bytes: cache_misses = array_size / max(S, 64). For stride 8 (double): miss every 8 elements. For stride 64+: miss every element.
Bitwise operation costs: AND, OR, XOR, NOT: 1 cycle latency, 3-4 per cycle throughput. Shifts (SHL, SHR, SAR): 1 cycle latency, 2 per cycle throughput. Variable shift: 1-2 cycles latency (depends on whether count is CL register). Rotate (ROL, ROR): 1 cycle latency. Population count (POPCNT): 3 cycles latency, 1 per cycle throughput. Leading zeros (LZCNT): 3 cycles latency. Bit test (BT): 1 cycle for register, more for memory. Cost model: bitwise operations are extremely cheap - essentially free compared to memory access. Use bit manipulation for flags, masks, and compact data structures. SIMD bitwise: processes 256-512 bits in 1 cycle.
Log and exp function costs: Library exp/log (double): 50-100 cycles typically. Fast approximations: 15-30 cycles with reduced precision. SIMD (Intel SVML): ~10-15 cycles per element vectorized. pow(x,y) = exp(y*log(x)): ~200 cycles (two transcendentals). exp2/log2: slightly faster than exp/ln (base matches hardware). Cost model: transcendental = ~80 cycles rule of thumb. For bulk computation: always vectorize with SIMD libraries. Integer power (x^n for integer n): use exponentiation by squaring, log2(n) multiplies = ~10 cycles for typical n. Avoid pow() for integer exponents - use x*x*x or a multiplication loop.
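A C sketch of integer exponentiation by squaring (about log2(n) multiplies instead of a pow() call); assumes a non-negative integer exponent:

    #include <stdint.h>

    // Computes x^n with roughly log2(n) multiplies.
    static double powi(double x, uint32_t n) {
        double result = 1.0;
        while (n) {
            if (n & 1) result *= x;   // fold in the current bit of the exponent
            x *= x;                   // square the base each step
            n >>= 1;
        }
        return result;
    }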
Thread creation/destruction costs: Linux pthread_create: 10-50 microseconds (30,000-150,000 cycles). Windows CreateThread: 20-100 microseconds. Thread destruction: similar cost to creation. Stack allocation (typically 1-8MB): may cause page faults if touched. Cost model: thread_overhead = creation_time + work_time + destruction_time. Break-even: work must exceed 100 microseconds to justify thread creation. For short tasks: use thread pool to amortize creation cost. Pool dispatch overhead: ~1-5 microseconds (mutex + condition variable). Maximum useful threads: typically number_of_cores for CPU-bound work, more for I/O-bound. Oversubscription causes context switch overhead (30 microseconds each).
Gather/scatter is much slower than contiguous access: Contiguous vector load: 4-7 cycles from L1 cache for 256-512 bits. Gather (AVX2 VGATHERPS): ~32 cycles for 8 elements from L1 - approximately 8x slower. Each gather element is a separate cache access. AVX-512 gather: improved to ~1.2-1.8x speedup vs scalar in best cases. Cost model: gather_cycles = elements * L1_latency (approximately). Only use gather when data is truly scattered. For rearrangement, prefer: load contiguous + permute (3 cycles) over gather. Scatter (AVX-512 only): similar cost to gather, avoid when possible.
SIMD operations process multiple elements with similar latency to scalar: SSE (128-bit): 4 floats or 2 doubles per instruction. Same latency as scalar (3-4 cycles for add/mul). AVX (256-bit): 8 floats or 4 doubles per instruction. Same latency, double throughput vs SSE. AVX-512 (512-bit): 16 floats or 8 doubles. May cause frequency throttling on some CPUs. Cost model: SIMD_time = Scalar_time / Vector_width (ideal). Real speedup typically 2-6x due to overhead, alignment, and remainder handling. For N elements: cycles = N / vector_width * latency (throughput bound) or N * latency / vector_width (latency bound).
CMOV vs branch tradeoffs: CMOV: 1-2 cycles latency, always executed (no prediction). Branch: 1 cycle if correctly predicted, 15-20 cycles if mispredicted. Break-even point: if branch prediction accuracy < 85-90%, CMOV wins. If accuracy > 95%, branch is typically faster. Cost model: branch_cost = 1 + (1 - prediction_rate) * misprediction_penalty. CMOV_cost = 2 cycles (always). Example: 75% predictable branch: 1 + 0.25 * 17 = 5.25 cycles average (CMOV wins). 99% predictable branch: 1 + 0.01 * 17 = 1.17 cycles (branch wins). Compiler heuristics: often conservative with CMOV. Use profile-guided optimization to help compiler choose correctly.
Main memory (DRAM) access latency is 150-300+ cycles (50-100 nanoseconds) on modern systems. This is approximately 50-100x slower than L1 cache. At 3 GHz, 100ns equals 300 cycles. Latency components: L3 miss detection (10-20 cycles) + memory controller (10-20 cycles) + DRAM access (40-60ns). For cost estimation: every cache miss that goes to DRAM costs roughly 200 cycles. A random access pattern in array larger than L3 will average close to DRAM latency.
Little's Law: Concurrency = Throughput * Latency. For memory systems: Outstanding_requests = Bandwidth * Latency. Example: to achieve 50 GB/s with 100ns latency: Requests = (50e9 bytes/s) * (100e-9 s) / 64 bytes = 78 cache lines in flight. A single core typically supports 10-12 outstanding L1D misses. To saturate memory bandwidth, need multiple cores or threads. Cost model: max_single_thread_bandwidth = max_outstanding_misses * cache_line_size / memory_latency. With 10 misses, 64B lines, 100ns: 10 * 64 / 100ns = 6.4 GB/s per thread (far below 50 GB/s peak). This explains why single-threaded code rarely achieves peak memory bandwidth.
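The same calculation as a tiny C helper (64-byte cache lines assumed):

    // Little's Law for memory: cache lines that must be in flight to sustain
    // a given bandwidth at a given latency. lines_in_flight(50e9, 100e-9) is ~78,
    // matching the worked example above.
    static double lines_in_flight(double bytes_per_second, double latency_seconds) {
        return bytes_per_second * latency_seconds / 64.0;
    }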
Memory hierarchy cost ratios (normalized to L1=1): L1 hit: 1x (4 cycles baseline). L2 hit: 3x (12 cycles). L3 hit: 10x (40 cycles). DRAM: 50-75x (200-300 cycles). NUMA remote: 75-150x (300-600 cycles). Bandwidth ratios: L1: 64-128 bytes/cycle. L2: 32-64 bytes/cycle (0.5x L1). L3: 16-32 bytes/cycle (0.25x L1). DRAM: 10-20 bytes/cycle shared (0.1-0.15x L1). Cost model rule of thumb: each cache level is 3-4x slower than previous. Missing all caches is 50-100x slower than L1 hit. Design data structures to maximize L1/L2 hits - the latency difference is dramatic.
Arithmetic intensity (AI) = FLOPs / Bytes moved from memory. Count operations and data movement: FLOPs = adds + multiplies + divides + sqrts (count FMA as 2 FLOPs). Bytes = unique cache lines touched * 64 (for cold cache) or actual bytes for hot cache. Example - dot product of N floats: FLOPs = 2N (one mul + one add per element). Bytes = 2 * 4N = 8N (read two vectors). AI = 2N / 8N = 0.25 FLOPs/byte. Example - matrix multiply (NxN, naive): FLOPs = 2N^3. Bytes = 3 * 4N^2 (read two matrices, write one). AI = 2N^3 / 12N^2 = N/6 FLOPs/byte. For N=1000: AI = 166 FLOPs/byte (compute-bound).
System call overhead: 150-1500 cycles depending on the syscall and CPU mitigations. Minimal syscall (e.g., getpid): ~760 cycles, ~250 nanoseconds. Typical syscall (e.g., read): 1000-1500 cycles. With Spectre/Meltdown mitigations: can add 100-500 cycles. vDSO optimization (clock_gettime, gettimeofday): ~150 cycles, ~50 nanoseconds - about 10x faster than true syscall. For cost estimation: budget 500-1000 cycles per syscall in hot paths. Batch operations when possible to amortize overhead. Prefer vDSO functions for high-frequency calls.
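A C sketch of the vDSO path on Linux: clock_gettime(CLOCK_MONOTONIC, ...) is usually served from user space, so it is cheap enough for hot-path timing:

    #include <time.h>

    // Returns seconds as a double; on Linux this typically avoids a real syscall (vDSO).
    static inline double now_seconds(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (double)ts.tv_sec + ts.tv_nsec * 1e-9;
    }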
Cache thrashing occurs when working set exceeds cache, causing repeated evictions: Thrashing signatures: L1 miss rate > 10%, L3 miss rate > 5%. Cost model: if working_set > cache_size, every access potentially misses. Effective latency approaches next level: L1 thrash -> L2 latency (12 cycles). L2 thrash -> L3 latency (40 cycles). L3 thrash -> DRAM latency (200 cycles). Example: matrix multiply without blocking, N=1000. Working set per row: 8KB (1000 doubles). L1 (32KB) holds 4 rows - heavy L1 thrashing. Performance: 10-50x slower than cache-blocked version. Detection: perf stat -e cache-misses,cache-references. Fix: restructure algorithm for cache-oblivious or explicit blocking to fit working set in cache.
Memory bandwidth formula: Required_BW = (bytes_read + bytes_written) * iterations / time. For array sum of N floats: reads 4N bytes, no writes. At 50 GB/s bandwidth: minimum time = 4N / 50e9 seconds. Compare to compute: N additions at 8 adds/cycle on 3GHz = N/24e9 seconds. Memory-bound when 4N/50e9 > N/24e9, simplifying: 4/50 > 1/24, or 0.08 > 0.04 - always true for simple operations. Rule of thumb: operation is memory-bound if arithmetic intensity < 10-15 FLOPS/byte (the 'ridge point').
CPU performance notation explained: Latency: cycles from instruction start to result availability. Used for dependency chains. Throughput (reciprocal): cycles between starting consecutive independent instructions. Example: 4L1T means 4 cycles latency, 1 cycle throughput (start one per cycle). 4L0.5T means 4 cycles latency, 0.5 cycle throughput (start two per cycle). Throughput 0.33 means 3 instructions per cycle. Reading Agner Fog tables: columns show latency and reciprocal throughput separately. uops.info format: similar, also shows micro-op breakdown. Cost model application: dependent chain: multiply latencies. Independent operations: divide by throughput. Real code: somewhere between, use profiling to measure actual performance.
Virtual function call overhead is 10-20 cycles in the best case (warm cache, predicted branch). Components: (1) Load vptr from object: 4 cycles if L1 hit. (2) Load function pointer from vtable: 4 cycles if L1 hit. (3) Indirect branch: 1-3 cycles if predicted, 15-20 cycles if mispredicted. Cold cache worst case: 100+ cycles (two cache misses + misprediction). Rule of thumb: virtual calls are 3-10x slower than direct calls when cache is warm. In tight loops, devirtualization or template polymorphism can eliminate overhead entirely.
L2 cache access latency is 12-14 cycles on modern CPUs (approximately 3-5 nanoseconds). L2 cache is typically 256KB-1MB per core. This is roughly 3x slower than L1 cache. For cost estimation: L2 miss but L1 hit adds about 8-10 cycles over L1 baseline. Memory access pattern analysis: if your working set fits in L2 (256KB-1MB per core), expect average latency around 12-14 cycles. L2 bandwidth is typically 32-64 bytes per cycle.
Context switch direct cost: 1-3 microseconds (3,000-10,000 cycles at 3 GHz). Components: (1) Save registers and state: 100-200 cycles. (2) Scheduler decision: 200-500 cycles. (3) Load new page tables: 100-200 cycles. (4) Restore registers: 100-200 cycles. (5) Pipeline refill: 10-50 cycles. Indirect costs dominate: TLB flush (100+ cycles per subsequent miss), cache pollution (potentially 1000s of cycles to rewarm). With virtualization: 2.5-3x more expensive. Rule of thumb: budget 30 microseconds total cost including indirect effects.
File I/O cost model: Syscall overhead (read/write): 500-1500 cycles per call. Disk latency (SSD): 50-200 microseconds (150,000-600,000 cycles). Disk latency (HDD): 5-10 milliseconds (15,000,000-30,000,000 cycles). SSD throughput: 500MB/s - 7GB/s depending on interface (SATA vs NVMe). Cost model: small_read_time = syscall + disk_latency + size/bandwidth. For many small reads: syscall overhead dominates. For large sequential reads: bandwidth dominates. Optimization: batch small reads, use mmap for random access, use async I/O. Buffered I/O (fread): fewer syscalls but memory copy overhead. For maximum throughput: use direct I/O with aligned buffers, multiple in-flight requests.
Floating-point division is expensive and not fully pipelined: Single precision (DIVSS): Latency 11 cycles, throughput 0.33/cycle (one every 3 cycles). Double precision (DIVSD): Latency 13-14 cycles, throughput 0.25/cycle (one every 4 cycles). This is 3-4x slower than multiplication. Optimization: multiply by reciprocal when dividing by constant. For approximate division, use RCPSS (4 cycles) then Newton-Raphson refinement. For cost estimation in loops: N divisions = N * 3-4 cycles minimum (throughput bound), or N * 11-14 cycles if dependent (latency bound).
Integer absolute value costs: Branching (if x < 0) x = -x: 1 cycle if predicted, 15-20 if mispredicted. Branchless using CMOV: 2-3 cycles. Branchless arithmetic: (x ^ (x >> 31)) - (x >> 31) = 3 cycles for 32-bit. SIMD (PABSD): 1 cycle for 8 int32s in AVX2. Cost model: for random signs (50/50), branchless wins. For mostly positive or mostly negative, branch may win due to prediction. Unsigned has no absolute value - already positive. Floating point: ANDPS with sign mask = 1 cycle (just clears sign bit). For bulk processing: always use SIMD PABS instructions (1 cycle per 8-16 elements).
Software prefetch (PREFETCH) cost: Instruction overhead: 1-4 cycles to issue. No immediate benefit - just initiates memory fetch. Prefetch must arrive before data is needed: prefetch_distance = cache_miss_latency / loop_iteration_cycles. Example: 200 cycle miss latency, 7 cycles per iteration = prefetch 28 iterations ahead. Too early: data evicted before use. Too late: no hiding of latency. Cost model: effective if hiding latency exceeds instruction overhead. For irregular access patterns, prefetch is often counterproductive. Modern CPUs have good hardware prefetchers for regular patterns - software prefetch mainly helps irregular but predictable patterns.
Floating-point square root cost: Single precision (SQRTSS): Latency 12-15 cycles, throughput ~0.33/cycle. Double precision (SQRTSD): Latency 15-20 cycles, throughput ~0.25/cycle. Not pipelined - one sqrt must complete before next starts in same unit. Approximate alternatives: RSQRTSS (reciprocal sqrt): 4 cycles, ~12-bit precision. RSQRTSS + Newton-Raphson: 8-10 cycles, full precision. For cost estimation: treat sqrt like division - expensive, avoid in tight loops. If precision allows, use rsqrt approximation (3-4x faster).
malloc/free cost varies by size and allocator: Small allocation (<256 bytes): 20-100 cycles with modern allocators (tcmalloc, jemalloc). Medium allocation (<1MB): 100-500 cycles. Large allocation (>1MB, uses mmap): 1000+ cycles, involves syscall. Multithreaded overhead: glibc ptmalloc has lock contention; tcmalloc/jemalloc use per-thread caches. Google data: malloc consumes ~7% of datacenter CPU cycles. Cost model: minimize allocations in hot paths. Use object pools, stack allocation, or arena allocators. Reusing buffers eliminates allocation overhead entirely.
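A minimal bump-arena sketch in C illustrating the arena-allocator alternative mentioned above (names are illustrative; one upfront malloc, a pointer bump per allocation, no per-object free):

    #include <stddef.h>
    #include <stdlib.h>

    typedef struct { char* base; size_t used, cap; } arena_t;

    int arena_init(arena_t* a, size_t cap) {           // one malloc for the whole phase
        a->base = malloc(cap);
        a->used = 0;
        a->cap = cap;
        return a->base != NULL;
    }

    void* arena_alloc(arena_t* a, size_t n) {          // a few instructions vs 50-200 cycles for malloc
        n = (n + 15) & ~(size_t)15;                    // keep 16-byte alignment
        if (a->used + n > a->cap) return NULL;         // out of arena space
        void* p = a->base + a->used;
        a->used += n;
        return p;
    }

    void arena_reset(arena_t* a) { a->used = 0; }      // "free" everything at once

Typical use is one arena per phase or per request: arena_init up front, arena_alloc in the hot path, arena_reset between phases.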
Mutex lock/unlock cost depends on contention: Uncontended (fast path): 15-25 nanoseconds (45-75 cycles). Uses single atomic instruction, no syscall. Contended (must wait): 1-10 microseconds (3,000-30,000 cycles). Includes context switch if blocking. Heavily contended: can serialize to 100x slower than uncontended. Linux futex fast path: ~25ns. Windows CRITICAL_SECTION: ~23ns. For cost estimation: uncontended mutex adds ~50-100 cycles per lock/unlock pair. Design to minimize contention; use lock-free structures or fine-grained locking for hot paths.
Atomic CAS (LOCK CMPXCHG) cost varies dramatically by contention: Uncontended (exclusive cache line): 12-40 cycles. Intel: ~35 cycles on Core 2, ~20 cycles on Skylake. AMD: ~12-15 cycles on Zen. Contended (cache line bouncing): 70-200+ cycles per operation due to cache coherency traffic. Highly contended: effectively serializes, can be 100x slower than uncontended. For cost estimation: budget 20-40 cycles for uncontended atomics, but design to minimize contention. A contended atomic is worse than a mutex in most cases.
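A C11 sketch of a CAS retry loop; under contention each failed attempt re-reads the cache line and retries, which is where the extra coherency traffic comes from:

    #include <stdatomic.h>

    // Lock-free add: retry until the compare-exchange installs our update.
    static void atomic_add_via_cas(atomic_long* counter, long delta) {
        long old = atomic_load_explicit(counter, memory_order_relaxed);
        while (!atomic_compare_exchange_weak_explicit(
                   counter, &old, old + delta,
                   memory_order_relaxed, memory_order_relaxed)) {
            // on failure, 'old' now holds the current value; loop and try again
        }
    }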
C++ exception cost (zero-cost exception model used by modern compilers): No exception thrown: 0 cycles overhead in hot path (tables consulted only on throw). Exception thrown: extremely expensive, 10,000-100,000+ cycles. Components of throw cost: stack unwinding, destructor calls, table lookups, dynamic type matching. Exception construction: typical exception object allocation + string formatting = 1000+ cycles. Cost model: never use exceptions for control flow. Throw rate should be < 0.1% for negligible impact. Code size: exception tables add 10-20% to binary size. For performance-critical code: prefer error codes or std::expected, reserve exceptions for truly exceptional cases.
Instruction mix by workload: Scientific/HPC: 30-50% FP operations, 20-30% loads/stores, 10-20% branches. Integer-heavy (databases): 40-50% integer ALU, 30-40% loads/stores, 10-15% branches. Web servers: heavy load/store (40-50%), many branches (15-20%), string operations. Games: 30-40% FP/SIMD, 30% loads/stores, 15-20% branches. Compilers: 50%+ branches (control flow heavy), moderate loads/stores. Cost model impact: high branch count = misprediction dominates. High load/store = cache behavior dominates. High FP/SIMD = execution unit throughput matters. Profile your workload to understand which category applies, then optimize accordingly.
Integer multiplication (IMUL) has a latency of 3-4 cycles on modern Intel and AMD CPUs. On Intel Skylake and later, 64-bit IMUL has 3-cycle latency with throughput of 1 per cycle. The 128-bit result (RDX:RAX) from MUL r64 has 3-cycle latency for the low 64 bits (RAX) and 4-cycle latency for the high 64 bits (RDX). AMD Zen processors show similar characteristics. For cost estimation: count 3 cycles per dependent multiply, or total_multiplies / throughput for independent operations.
AVX-512 instructions can cause CPU frequency reduction: Light AVX-512 (most instructions): 0-100 MHz reduction. Heavy AVX-512 (FMA, multiply): 100-200 MHz reduction. On Skylake-X: ~0.85x base frequency for heavy AVX-512. Throttling transition: 10-20 microseconds on early Xeon, near-zero on 3rd gen Xeon. Cost model: if AVX-512 section is short, transition overhead may exceed benefit. Break-even: typically need 100+ microseconds of AVX-512 work. Non-AVX code after AVX-512 runs at reduced frequency until transition back. Consider AVX2 for short bursts.
SIMD permute costs vary by type: In-lane permute (within 128-bit lane): 1 cycle latency, 1/cycle throughput. Examples: PSHUFB, PSHUFD. Cross-lane permute (across 128-bit boundaries): 3 cycles latency, 1/cycle throughput. Examples: VPERMPS, VPERMD. Full permute (AVX-512 VPERMB): 3-6 cycles. Cost model for data rearrangement: prefer in-lane shuffles (1 cycle) over cross-lane (3 cycles). Two in-lane shuffles often faster than one cross-lane if possible. VPERMPS for float permutation: 3-cycle latency, handles arbitrary 8-element permutation in AVX2.
Type conversion costs on x86: int to float (CVTSI2SS): 4-6 cycles latency. float to int (CVTTSS2SI): 4-6 cycles latency. int32 to int64 (sign extend): 1 cycle. float to double: 1 cycle (just register renaming on modern CPUs). double to float: 4 cycles (rounding). int64 to float: 4-6 cycles. Narrowing conversions may have additional checking cost in safe languages. Cost model: conversions are cheap relative to memory access but expensive relative to basic ALU. Avoid conversions in inner loops when possible. SIMD conversion: 8 int32 to 8 float in 3 cycles (VCVTDQ2PS). Bulk data: convert once, operate in target format.
Memory operation costs: memset (REP STOSB optimized): 32-64 bytes per cycle for large sizes. memcpy (REP MOVSB optimized): 16-32 bytes per cycle for large sizes. Small sizes (<64 bytes): 10-30 cycles overhead. SIMD explicit copy: can achieve 32-64 bytes per cycle from L1. Cost model: Large memcpy (N bytes) from L1: N / 32 cycles. From L3: N / 16 cycles (limited by L3 bandwidth). From DRAM: N / (DRAM_bandwidth_bytes_per_cycle). Example: memcpy of 1MB from DRAM at 50 GB/s: 1MB / 17 bytes-per-cycle = 62,000 cycles = 20 microseconds at 3 GHz. For small copies: inline code often faster than function call overhead. memmove (handles overlap): similar cost to memcpy.
Binary search cost model: Comparisons: log2(N). Memory accesses: log2(N), all potentially cache misses (random access pattern). If array fits in L1 (N < 4K for 4-byte elements): ~4 cycles per comparison. If array fits in L3 (N < 2M elements): ~40 cycles per comparison. If array exceeds L3: ~200 cycles per comparison (DRAM). Example: binary search in 1 million integers exceeding L3. Comparisons = 20. Cost = 20 * 200 = 4000 cycles = 1.3 microseconds. Optimization: for small arrays, linear search may win due to prefetching. For huge sorted arrays, consider B-tree (multiple elements per cache line) or interpolation search.
SIMD speedup estimation: Theoretical maximum: vector_width / scalar_width. SSE: 4x for float, 2x for double. AVX: 8x for float, 4x for double. AVX-512: 16x for float, 8x for double. Real speedup factors: Vectorization overhead (setup, remainder): reduces by 10-20%. Memory-bound code: limited by bandwidth, not compute. Complex control flow: may not vectorize. Measured typical speedups: compute-bound: 60-80% of theoretical. Memory-bound: 20-50% of theoretical (bandwidth limited). Mixed: 40-70% of theoretical. Cost model: actual_speedup = min(theoretical_speedup, memory_bandwidth_ratio). For bandwidth-bound: speedup = bytes_loaded_per_scalar / bytes_loaded_per_SIMD (often ~1-2x, not 8x).
DRAM refresh overhead: Each row refreshed every 64ms (DDR4 standard). Refresh takes 350ns, blocks access to that bank. Average impact: 5-10% of bandwidth lost to refresh. Worst case: request hits refreshing bank, adds 350ns (1000 cycles) to latency. Temperature effects: above 85C, refresh rate doubles (per DDR4 spec). Cost model: for latency-sensitive code, 99th percentile latency can be 2-3x median due to refresh collisions. Mitigation: modern memory controllers schedule around refresh when possible. For real-time systems: account for refresh-induced jitter. Server-grade memory: higher refresh rates for reliability can increase performance impact.
TLB miss penalty ranges from 20-150 cycles depending on page table depth and caching. Best case (page table in L2): 20-40 cycles. Typical case: 40-80 cycles. Worst case (page table in DRAM): 100-150+ cycles. x86-64 uses 4-level page tables, requiring up to 4 memory accesses. With virtualization (nested page tables): 6x more lookups, penalty can exceed 500 cycles. Mitigation: use huge pages (2MB/1GB) to reduce TLB pressure. TLB typically holds 1500-2000 4KB page entries or 32-64 huge page entries.
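A Linux-specific C sketch of the huge-page mitigation: allocate a large anonymous region and hint the kernel toward transparent huge pages with madvise(MADV_HUGEPAGE) (advisory only; behavior depends on kernel configuration):

    #define _GNU_SOURCE
    #include <stddef.h>
    #include <sys/mman.h>

    // Allocate 'bytes' and ask for 2MB page backing to cut TLB pressure.
    static void* alloc_hugepage_hint(size_t bytes) {
        void* p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) return NULL;
        madvise(p, bytes, MADV_HUGEPAGE);   // hint, not a guarantee
        return p;
    }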
Use arithmetic intensity (AI) = FLOPs / Bytes transferred. Ridge point = Peak_FLOPS / Peak_Bandwidth. If AI < ridge_point: memory-bound. If AI > ridge_point: compute-bound. Modern x86 example: Peak = 100 GFLOPS, Bandwidth = 50 GB/s, Ridge = 2 FLOPS/byte. For dot product (2 FLOPs per 8 bytes loaded): AI = 0.25 < 2, memory-bound. For matrix multiply (2N FLOPs per element, amortized): AI can exceed ridge with blocking. Cost model: Memory_bound_time = Bytes / Bandwidth. Compute_bound_time = FLOPs / Peak_FLOPS. Actual_time = max(memory_time, compute_time).
String operation costs: Comparison (strcmp): best case 1 cycle (first char differs), worst case = length cycles + cache misses. SIMD comparison: 16-32 chars per cycle with SSE/AVX. Copy (memcpy): for small strings (<64 bytes): 10-30 cycles. For large strings: limited by memory bandwidth (10-20 GB/s). SIMD copy: 32-64 bytes per cycle from L1. Search (strstr): naive O(nm), optimized (SIMD + hashing) approaches O(n). Cost model: memcpy of N bytes = max(20, N / effective_bandwidth_bytes_per_cycle) cycles. For N=1KB from L1: ~30-50 cycles. For N=1MB from DRAM: ~50,000 cycles at 20 GB/s. Small string optimization (SSO) in std::string: avoids allocation for strings <15-23 chars.
Trigonometric function costs on x86: Hardware x87 (FSIN/FCOS): 50-120 cycles, variable based on input range. Library implementations (libm): typically 50-100 cycles for double precision. Fast approximations (polynomial): 10-30 cycles for single precision, reduced accuracy. SIMD vectorized (Intel SVML): 10 cycles per element for 8 floats. Cost model: use fast approximations when accuracy allows (games, graphics). Library sin/cos: assume 80 cycles per call. For bulk computation: use SIMD versions (vectorize the loop). sincos() computes both for price of one (100 cycles). Avoid sin/cos in tight loops if possible; precompute lookup tables.
False sharing occurs when threads modify different variables on the same cache line. Cost: 40-200 cycles per access instead of 4 cycles (10-50x slowdown). The cache line ping-pongs between cores via coherency protocol. Each modification invalidates other cores' copies. Measured impact: 25-50% performance degradation typical, can be 10x in extreme cases. Detection: look for high cache coherency traffic in perf counters (HITM events). Prevention: pad shared structures to cache line boundaries (64 bytes on x86, 128 bytes on ARM). In C++: alignas(64) or compiler-specific attributes.
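A C11 sketch of the prevention step: give each per-thread counter its own 64-byte cache line so concurrent updates do not invalidate each other (each thread writes only its own slot):

    #include <stdalign.h>
    #include <stdint.h>

    // alignas(64) forces 64-byte alignment and rounds the struct size up to 64,
    // so adjacent counters never share a cache line.
    struct padded_counter {
        alignas(64) uint64_t value;
    };
    static struct padded_counter counters[8];   // e.g. one slot per worker thread

    static inline void bump(int thread_id) { counters[thread_id].value++; }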
Prefetch distance formula: distance = (cache_miss_latency * IPC) / instructions_per_element. Or simpler: distance = miss_latency_cycles / cycles_per_iteration. Example: L3 miss = 200 cycles, loop body = 25 cycles, distance = 200/25 = 8 iterations ahead. For arrays: prefetch &array[i + distance] while processing array[i]. Tuning: measure actual cache miss rate. Start with distance = 8-16, adjust based on profiling. Too short: still waiting for data. Too long: prefetched data evicted before use. Bandwidth consideration: don't prefetch faster than memory bandwidth allows.
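A C sketch applying the distance formula to an irregular-but-predictable pattern (index array known ahead of time); PF_DIST is an assumed starting value to tune by profiling, and __builtin_prefetch is GCC/Clang-specific:

    #include <stddef.h>

    #define PF_DIST 16   // ~ miss_latency_cycles / cycles_per_iteration, tuned empirically

    // The index array tells us future addresses, so we can prefetch PF_DIST
    // iterations ahead of the element currently being processed.
    static double indexed_sum(const double* data, const int* idx, size_t n) {
        double s = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (i + PF_DIST < n)
                __builtin_prefetch(&data[idx[i + PF_DIST]], 0, 1);   // read, low temporal locality
            s += data[idx[i]];
        }
        return s;
    }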
Execution time estimation: time = max(compute_time, memory_time, branch_time). Compute_time = sum(instruction_count[i] * cycles[i]) / IPC. Memory_time = cache_misses * miss_penalty / memory_level_parallelism. Branch_time = mispredictions * misprediction_penalty. Simplified model: total_cycles = instructions / IPC + L3_misses * 200 + mispredictions * 17. Example: 1M instructions at IPC=2, 1000 L3 misses, 5000 mispredictions. Cycles = 1M/2 + 1000*200 + 5000*17 = 500K + 200K + 85K = 785K cycles. Time at 3GHz = 262 microseconds. Validate with: perf stat -e cycles,instructions,cache-misses,branch-misses. Adjust model based on measured vs predicted discrepancies.
Pointer chasing is latency-bound because each load depends on previous: If list fits in L1: ~4 cycles per node. If list fits in L2: ~12 cycles per node. If list fits in L3: ~40 cycles per node. If list exceeds L3: ~200 cycles per node (DRAM latency). Total cost = nodes * memory_latency_at_that_cache_level. Example: 1000-node list in DRAM = 200,000 cycles = 67 microseconds at 3 GHz. Optimization: convert to array when possible (sequential access is 10-50x faster). Software prefetching rarely helps (can't prefetch next without completing current load). B-trees beat linked lists precisely because of this cost model.
Memory throughput at each level: L1 cache: 64-128 bytes/cycle (2 loads + 1 store of 32-64 bytes each). L2 cache: 32-64 bytes/cycle. L3 cache: 16-32 bytes/cycle per core. DRAM: ~10-20 bytes/cycle per core (but shared across all cores). Example: DDR4-3200 dual channel = 51.2 GB/s total. At 3 GHz = 17 bytes/cycle shared. Single thread typically achieves 5-8 bytes/cycle from DRAM due to limited memory-level parallelism. Cost model: for bandwidth-bound code, time = bytes / effective_bandwidth_bytes_per_cycle cycles.
Integer division is expensive: 32-bit DIV takes 26 cycles on Intel Skylake, and 8-bit DIV takes about 25 cycles on Coffee Lake. 64-bit division can take 35-90 cycles depending on operand values. AMD processors are generally faster: Zen 2/3 significantly improved division performance. Throughput is very low (one division every 6+ cycles). For cost estimation: assume 26-40 cycles per 32-bit division in dependency chains. Avoid division in hot loops; use multiplication by reciprocal when divisor is constant.
Roofline model: Attainable_FLOPS = min(Peak_FLOPS, Arithmetic_Intensity * Memory_Bandwidth). Where Arithmetic_Intensity (AI) = Work (FLOPs) / Data (Bytes). Example: Peak = 200 GFLOPS, BW = 50 GB/s. If AI = 1 FLOP/byte: Attainable = min(200, 1*50) = 50 GFLOPS (memory-bound). If AI = 10 FLOPS/byte: Attainable = min(200, 10*50) = 200 GFLOPS (compute-bound). Ridge point (crossover): 200/50 = 4 FLOPS/byte. Hierarchical roofline uses different bandwidths for L1/L2/L3/DRAM to identify which cache level is the bottleneck.
NUMA remote memory access costs 1.5-2x more than local: Local node latency: 90-100 nanoseconds (300 cycles at 3 GHz). Remote node latency: 150-250 nanoseconds (450-750 cycles). Bandwidth reduction: 1.5-2x lower for remote access. Measured examples: local 118ns vs remote 242ns (2.0x). Under contention: remote can reach 1200 cycles vs 300 local (4x). Cost model: for NUMA-aware allocation, bind memory to same node as compute thread. Random access across NUMA domains: average_latency = local_latency * local_fraction + remote_latency * remote_fraction.
Floating-point ADD/MUL on modern CPUs: Single precision (float): Latency 3-4 cycles, throughput 2 per cycle. Double precision (double): Latency 3-4 cycles, throughput 2 per cycle. FMA (fused multiply-add): Latency 4-5 cycles, throughput 2 per cycle. Intel Skylake: ADD latency 4 cycles, MUL latency 4 cycles, FMA latency 4 cycles - all symmetric. For cost estimation: floating-point math is nearly as fast as integer math on modern CPUs. Dependency chains are the limiting factor, not throughput. Use FMA when possible (a*b+c in one instruction).
Pattern Transformations
75 questions
BEFORE: for(i=0; i<n; i++) { if(a[i]>0) sum += a[i]; }. AFTER: for(i=0; i<n; i++) { sum += a[i] & -(a[i]>0); } or using arithmetic selection: sum += (a[i]>0) ? a[i] : 0; which compilers convert to CMOV. The bitwise version works because -(a[i]>0) produces all 1s (0xFFFFFFFF) when true, all 0s when false. AND with the value keeps or zeros it. Speedup: 2-4x when branch misprediction rate exceeds 20%. On modern CPUs, a mispredicted branch costs 15-20 cycles. Branchless code has constant latency of 2-3 cycles regardless of data patterns. Profile with perf stat to check branch-misses; if above 5%, consider branchless. Best for random/unpredictable data; predictable patterns may be faster with branches due to speculative execution.
BEFORE: for(i=0; i<n; i++) arr[i] = 0;. AFTER: memset(arr, 0, n * sizeof(*arr)); or SIMD: __m256i zero = _mm256_setzero_si256(); for(i=0; i<n; i+=8) _mm256_storeu_si256((__m256i*)&arr[i], zero);. For large zeroing (>1MB), use non-temporal stores: _mm256_stream_si256. Special case: calloc() may get zero pages from OS without memset (lazy allocation). For non-zero patterns: __m256i pattern = _mm256_set1_epi32(value);. Speedup: Similar to memcpy, 3-5x over naive loop. The compiler may optimize arr = {} or std::fill to memset internally. For partial zeroing of structs, use = {} initialization which compilers optimize well.
BEFORE: uint32_t swap = ((x >> 24) & 0xFF) | ((x >> 8) & 0xFF00) | ((x << 8) & 0xFF0000) | ((x << 24) & 0xFF000000);. AFTER: uint32_t swap = __builtin_bswap32(x); compiles to single BSWAP instruction. For 16-bit: __builtin_bswap16(x) or use ROL by 8 bits. For 64-bit: __builtin_bswap64(x). SIMD byte shuffle: _mm_shuffle_epi8(vec, shuffle_mask) with mask reversing byte order within each element. Speedup: Shift-and-OR is 8+ operations, BSWAP is 1 instruction (1-2 cycles). Critical for network protocols (ntohl/htonl), file format parsing, cross-platform data exchange. Use htobe32/be32toh (POSIX) or std::byteswap (C++23) for portability.
BEFORE: result = x * 8;. AFTER: result = x << 3;. General pattern: x * (2^n) = x << n. Compilers do this automatically, but understanding helps when reading assembly or writing SIMD. For SIMD: _mm256_slli_epi32(vec, 3) shifts all 8 integers left by 3. Combined patterns: x * 10 = (x << 3) + (x << 1) = x*8 + x*2. x * 7 = (x << 3) - x = x*8 - x. Speedup: Shift is 1 cycle, multiply is 3-4 cycles on modern x86. However, modern CPUs have fast multipliers, so only matters in extremely hot paths. For division by power-of-2: unsigned x/8 = x >> 3; signed requires adjustment for negative numbers.
BEFORE: for(i=0;i<n;i++) arr[i] = value;. AFTER (AVX): __m256 val_vec = _mm256_set1_ps(value); for(i=0;i<n;i+=8) _mm256_storeu_ps(&arr[i], val_vec);. For large arrays (>L2 cache), use non-temporal stores: _mm256_stream_ps(&arr[i], val_vec); to avoid cache pollution. For sequential values (0,1,2,...): __m256i indices = _mm256_setr_epi32(0,1,2,3,4,5,6,7); __m256i increment = _mm256_set1_epi32(8); for(i=0;i<n;i+=8) { _mm256_storeu_si256((__m256i*)&arr[i], indices); indices = _mm256_add_epi32(indices, increment); }. Speedup: 4-8x for fill, near memory bandwidth for streaming stores. memset uses this internally for byte patterns.
BEFORE: for(i=0; i<n; i++) dst[i] = src[i];. AFTER: memcpy(dst, src, n * sizeof(*dst)); or SIMD: for(i=0; i<n; i+=8) { _mm256_storeu_ps(&dst[i], _mm256_loadu_ps(&src[i])); }. For large copies (>1MB), use non-temporal stores: _mm256_stream_ps(dst, _mm256_loadu_ps(src)); bypasses cache to avoid polluting it. For tiny copies (<64 bytes), rep movsb may be optimal on modern Intel (ERMSB). Speedup: Naive loop achieves ~20% bandwidth, optimized memcpy achieves >90%. glibc memcpy uses SIMD with runtime CPU detection. For moves (overlapping): memmove handles overlap correctly; memcpy may not. Use __builtin_memcpy for compiler optimization opportunities.
BEFORE: sum = a[0]; for(i=1; i<n; i++) sum += a[i]; (serial dependency chain, 3-4 cycles per add). AFTER: sum0=sum1=sum2=sum3=0; for(i=0; i<n; i+=4) { sum0+=a[i]; sum1+=a[i+1]; sum2+=a[i+2]; sum3+=a[i+3]; } sum=sum0+sum1+sum2+sum3;. This creates 4 independent dependency chains that execute in parallel via out-of-order execution. Speedup: 2-4x on modern CPUs with 4+ execution ports. The critical insight: floating-point addition is associative mathematically but not in IEEE 754 (slight precision differences). GCC -ffast-math or -fassociative-math enables automatic reassociation. For exact results, use Kahan summation instead.
BEFORE: uint64_t product = (uint64_t)a * b; uint32_t high = product >> 32;. AFTER: Use compiler intrinsics or inline asm: on MSVC, uint32_t high = (uint32_t)(__emulu(a, b) >> 32); for 32-bit operands, or __umulh/_umul128 for the high 64 bits of a 64x64 multiply; or use asm for MULX. For 64-bit on GCC/Clang: unsigned __int128 prod = (unsigned __int128)a * b; uint64_t high = prod >> 64;. SIMD: _mm256_mulhi_epu16 for 16-bit, _mm256_mul_epu32 returns 64-bit products. For modular arithmetic and Montgomery multiplication, mulhi is essential. Speedup: Avoiding 128-bit types can be 1.5-2x faster on 32-bit systems. On 64-bit, compilers handle it well, but direct mulhi intrinsics guarantee optimal code generation.
BEFORE: for(i=0;i<n;i++) gray[i] = 0.299f*r[i] + 0.587f*g[i] + 0.114f*b[i];. AFTER (SoA with AVX): __m256 coef_r = _mm256_set1_ps(0.299f); __m256 coef_g = _mm256_set1_ps(0.587f); __m256 coef_b = _mm256_set1_ps(0.114f); for(i=0;i<n;i+=8) { __m256 rv = _mm256_loadu_ps(&r[i]); __m256 gv = _mm256_loadu_ps(&g[i]); __m256 bv = _mm256_loadu_ps(&b[i]); __m256 gray_v = _mm256_fmadd_ps(rv, coef_r, _mm256_fmadd_ps(gv, coef_g, _mm256_mul_ps(bv, coef_b))); _mm256_storeu_ps(&gray[i], gray_v); }. For packed RGB bytes: deinterleave first, convert to float, compute, convert back. Speedup: 4-8x. Use FMA for 3-multiply-add pattern.
BEFORE: int16_t result = a + b; if(result > 32767) result = 32767; if(result < -32768) result = -32768;. AFTER (SIMD): __m256i result = _mm256_adds_epi16(a, b); for signed, _mm256_adds_epu16 for unsigned. These automatically saturate instead of wrapping. For scalar: int32_t sum = (int32_t)a + b; result = (sum > 32767) ? 32767 : (sum < -32768) ? -32768 : sum; with branchless: int32_t sum = a + b; sum = sum < -32768 ? -32768 : sum; sum = sum > 32767 ? 32767 : sum;. ARM NEON: vqaddq_s16 (saturating add). Speedup: 2-4x with SIMD saturation instructions. Essential for audio (preventing clipping), image processing (pixel clamping), DSP applications.
BEFORE: int factorial(int n) { if(n<=1) return 1; return n * factorial(n-1); }. AFTER: int factorial(int n) { int result=1; while(n>1) { result*=n; n--; } return result; }. For tree traversal: BEFORE: void dfs(Node* n) { if(!n) return; process(n); dfs(n->left); dfs(n->right); }. AFTER: stack<Node*> s; s.push(root); while(!s.empty()) { Node* n=s.top(); s.pop(); if(!n) continue; process(n); s.push(n->right); s.push(n->left); }. Speedup: 1.5-3x from avoiding function call overhead and potential stack overflow. Use explicit stack sized to max expected depth. Tail recursion can be optimized by compiler (-O2), but complex recursion requires manual transformation.
BEFORE: float rsqrt = 1.0f / sqrtf(x);. AFTER (fast inverse square root): float rsqrt(float x) { int i = *(int*)&x; i = 0x5f3759df - (i >> 1); float y = *(float*)&i; y = y * (1.5f - 0.5f * x * y * y); return y; }. One Newton-Raphson iteration. The magic constant gives a first approximation of 1/sqrt(x) by exploiting the IEEE 754 float format. Modern CPUs: _mm_rsqrt_ps provides hardware approximation (~12 bits accuracy), follow with Newton-Raphson for more precision. Speedup: 4x over sqrtf+division. Used in graphics normalization, physics engines. Note: Modern SSE/AVX rsqrt instructions are preferred over the integer trick, as they're faster and more accurate.
BEFORE: for(i=0; i<N; i++) for(j=0; j<N; j++) for(k=0; k<N; k++) C[i][j] += A[i][k] * B[k][j]; (cache-thrashing for large N). AFTER: for(ii=0; ii<N; ii+=BLOCK) for(jj=0; jj<N; jj+=BLOCK) for(kk=0; kk<N; kk+=BLOCK) for(i=ii; i<min(ii+BLOCK,N); i++) for(j=jj; j<min(jj+BLOCK,N); j++) for(k=kk; k<min(kk+BLOCK,N); k++) C[i][j] += A[i][k] * B[k][j];. Choose BLOCK so 3*BLOCK*BLOCK*sizeof(element) fits in L1 cache (~32KB). For doubles: BLOCK=32-64. Speedup: 2-10x for matrices larger than cache. Reduces cache misses from O(N^3) to O(N^3/BLOCK). This is the foundation of high-performance BLAS implementations.
BEFORE: if(x<a) f0(); else if(x<b) f1(); else if(x<c) f2(); else if(x<d) f3(); else f4();. AFTER (binary decision tree for uniform distribution): if(x<c) { if(x<a) f0(); else if(x<b) f1(); else f2(); } else { if(x<d) f3(); else f4(); }. This ensures average 2-3 comparisons instead of worst-case 4. For sorted thresholds: use binary search then dispatch. For very many cases: int idx = binary_search(thresholds, x); handlers[idx](x);. Speedup: O(n) to O(log n) comparisons for n cases. The balanced tree minimizes expected comparisons when all branches are equally likely. Profile branch frequencies to optimize tree shape for skewed distributions.
BEFORE: for(j=0; j<N; j++) for(i=0; i<M; i++) sum += matrix[i][j]; (stride of N elements between accesses, cache thrashing). AFTER: Either transpose first, then access row-major: transpose(matrix, transposed); for(j=0; j<N; j++) for(i=0; i<M; i++) sum += transposed[j][i];. Or interchange loops: for(i=0; i<M; i++) for(j=0; j<N; j++) sum += matrix[i][j];. In-place transpose for square matrices: for(i=0; i<N; i++) for(j=i+1; j<N; j++) swap(matrix[i][j], matrix[j][i]);. Speedup: 3-10x depending on stride and cache size. Strided access with stride >= cache line wastes entire cache line per access. Blocking/tiling helps when full transpose isn't feasible.
BEFORE: result = a * b + c; (2 operations: MUL then ADD, potential intermediate rounding). AFTER: result = fma(a, b, c); or use compiler flag -ffp-contract=fast. FMA computes a*b+c in a single instruction with a single rounding. In intrinsics: __m256 r = _mm256_fmadd_ps(a, b, c);. Benefits: (1) Single cycle throughput vs 2 cycles for separate MUL+ADD on Haswell+, (2) Higher precision - no intermediate rounding, (3) 2x FLOPS potential. Variants: fmadd (a*b+c), fmsub (a*b-c), fnmadd (-a*b+c), fnmsub (-a*b-c). Speedup: 1.5-2x for FMA-bound code. Available on x86 since Haswell (2013), ARM since Cortex-A15. Check with: __builtin_cpu_supports("fma").
BEFORE: for(i=0; i<n; i++) if(arr[i]==target) return i;. AFTER (AVX2): __m256i target_vec = _mm256_set1_epi32(target); for(i=0; i<n; i+=8) { __m256i data = _mm256_loadu_si256((__m256i*)&arr[i]); __m256i cmp = _mm256_cmpeq_epi32(data, target_vec); int mask = _mm256_movemask_ps(_mm256_castsi256_ps(cmp)); if(mask) return i + __builtin_ctz(mask); }. Compares 8 integers simultaneously. Speedup: 4-8x for large arrays. For strings, use _mm256_cmpeq_epi8 and process 32 bytes at once. SIMD search beats binary search for n<1000 due to sequential memory access. Combine approaches: binary search to narrow range, then SIMD scan the final segment.
BEFORE: for(i=0;i<n;i++) for(j=0;j<n;j++) if(matrix[i][j]) process(i, j, matrix[i][j]); O(n^2) even for sparse. AFTER (CSR format): int row_ptr[n+1], col_idx[nnz]; float values[nnz]; for(i=0;i<n;i++) for(k=row_ptr[i]; k<row_ptr[i+1]; k++) process(i, col_idx[k], values[k]); O(nnz). SpMV (sparse matrix-vector multiply): for(i=0;i<n;i++) { y[i] = 0; for(k=row_ptr[i];k<row_ptr[i+1];k++) y[i] += values[k] * x[col_idx[k]]; }. Speedup: For 99% sparse 1000x1000 matrix, from 1M iterations to 10K (100x faster). CSR is the standard format for scientific computing, graph algorithms, and sparse linear algebra.
BEFORE (direct): for(i=0;i<n;i++) for(j=0;j<k;j++) out[i] += in[i+j] * kernel[j]; O(n*k) complexity. AFTER (FFT): FFT(in), FFT(kernel), pointwise multiply, IFFT(result). O(n log n) complexity. Use when kernel size k > ~64. Implementation: pad both to next power of 2 >= n+k-1, use FFT library (FFTW, Intel MKL). Speedup: For n=1M, k=1K: direct is 10^9 ops, FFT is ~60M ops (15x faster). For small kernels (k<16), direct convolution with SIMD is faster. Libraries like cuDNN use FFT internally for large convolutions. The crossover point depends on FFT implementation efficiency.
BEFORE: uint32_t hash = 0; for(i=0;i<len;i++) hash = hash*31 + data[i];. AFTER (process 8-32 bytes at a time): Use CRC32 instruction: for(i=0;i<len;i+=8) hash = _mm_crc32_u64(hash, *(uint64_t*)&data[i]);. Or xxHash/MurmurHash3 SIMD: process 32-byte blocks with AVX2, vectorized multiplication and mixing. Example (simplified xxHash-like): __m256i acc = seed_vec; for(i=0;i<len;i+=32) { __m256i data = _mm256_loadu_si256((const __m256i*)&input[i]); acc = _mm256_add_epi64(acc, _mm256_mul_epu32(data, prime_vec)); acc = _mm256_xor_si256(acc, _mm256_srli_epi64(acc, 17)); }. Speedup: 5-10x. Modern hash functions (xxHash3, wyhash) achieve >10GB/s using SIMD.
BEFORE: if(fabs(a - b) < epsilon) (expensive fabs, floating-point subtract). AFTER for IEEE 754 positive floats: Reinterpret as integers and compare: int32_t ia = *(int32_t*)&a; int32_t ib = *(int32_t*)&b; if(abs(ia - ib) < ulps). This uses ULPs (Units in Last Place) for comparison. Works because IEEE 754 floats are ordered like integers when positive. For signed floats, adjust: if(ia < 0) ia = 0x80000000 - ia;. SIMD: Cast to integer, compare with _mm256_cmpgt_epi32. Speedup: 1.5-2x for comparison-heavy code. This technique is used in physics engines and numerical software. Caveat: Fails for NaN and infinity; add special handling if needed.
BEFORE: int result = x / 3;. AFTER (unsigned): result = ((uint64_t)x * 0xAAAAAAABULL) >> 33;. AFTER (signed): More complex due to rounding toward zero. The magic constant 0xAAAAAAAB = ceil(2^33 / 3). For x/7 the multiplier doesn't fit as cleanly, so compilers emit a short fix-up sequence: t = mulhi(x, 0x24924925); result = (t + ((x - t) >> 1)) >> 2. For x/10: multiply by 0xCCCCCCCD and shift right 35. Compilers generate this automatically for constant divisors (inspect assembly!). The technique from Granlund/Montgomery 1994 handles any constant. Speedup: Division is 20-90 cycles, multiply-shift is 3-4 cycles (5-20x faster). For repeated division by same dynamic value, compute reciprocal once: inv = ((1ULL << 32) + d - 1) / d; then result = (x * inv) >> 32;.
BEFORE: for(i=0;i<n;i++) if(arr[i] == target) return i;. AFTER (AVX2): __m256i target_vec = _mm256_set1_epi32(target); for(i=0; i<n; i+=8) { __m256i data = _mm256_loadu_si256((__m256i*)&arr[i]); __m256i cmp = _mm256_cmpeq_epi32(data, target_vec); int mask = _mm256_movemask_ps(_mm256_castsi256_ps(cmp)); if(mask) return i + __builtin_ctz(mask); } return -1;. Checks 8 elements per iteration. For bytes: check 32 per iteration with _mm256_cmpeq_epi8 and _mm256_movemask_epi8. Speedup: 4-8x for large arrays. Critical insight: early-exit preserves first-match semantics. For find-all, remove early exit and collect all positions.
BEFORE: for(i=0; i<8; i++) result[i] = data[indices[i]];. AFTER (AVX2): __m256i idx = _mm256_loadu_si256((__m256i*)indices); __m256 result = _mm256_i32gather_ps(data, idx, sizeof(float));. Scale parameter (4 for float) handles element size. AVX-512 adds mask support: _mm512_mask_i32gather_ps(src, mask, idx, base, scale). Speedup: Varies widely. Gather is NOT parallel memory access - it serializes internally. Effective when: indices fit in cache, or when combined with other SIMD operations. For truly random access, explicit loads may be faster. Benchmark your specific case. Gather is 12-20 cycles on Intel, faster on AMD Zen4+.
BEFORE: size_t len = 0; while(str[len]) len++;. AFTER (SSE2): __m128i zero = _mm_setzero_si128(); size_t i = 0; while(1) { __m128i chunk = _mm_loadu_si128((const __m128i*)(str + i)); __m128i cmp = _mm_cmpeq_epi8(chunk, zero); int mask = _mm_movemask_epi8(cmp); if(mask) return i + __builtin_ctz(mask); i += 16; }. This checks 16 bytes per iteration. PCMPISTRI (SSE4.2) detects the terminator implicitly, one 16-byte block per call: idx = _mm_cmpistri(_mm_loadu_si128((const __m128i*)(str + i)), zero, _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH); returns the null's offset within the block, or 16 if none. Speedup: 8-16x for long strings. glibc strlen uses this approach with alignment handling. Watch for reading past the string end across a page boundary - align the start to 16 bytes so each 16-byte load stays within one page.
BEFORE: for each 3 bytes, split into 4 6-bit values, look up in table. AFTER (SSE/AVX): Load 12 bytes (4 groups of 3). Reshuffle to align 6-bit fields: __m128i shuffled = _mm_shuffle_epi8(input, shuffle_mask); Shift and mask to extract: __m128i indices = ...; Use _mm_shuffle_epi8 as 16-entry lookup table for encoding. Or use comparison and add-if-greater for the base64 alphabet ranges (A-Z, a-z, 0-9, +/). Speedup: 5-10x. The key insight: base64 is a deterministic character-by-character transformation, perfect for SIMD. Modern implementations (like Turbo-Base64) achieve 4-8 GB/s encode speed. See: https://github.com/lemire/fastbase64 for production implementations.
BEFORE (AoS): struct Particle { float x, y, z, vx, vy, vz; }; Particle particles[N]; for(i=0; i<N; i++) particles[i].x += dt * particles[i].vx;. AFTER (SoA): struct Particles { float x[N], y[N], z[N], vx[N], vy[N], vz[N]; }; Particles p; for(i=0; i<N; i++) p.x[i] += dt * p.vx[i];. Speedup: 2-4x for SIMD operations, 1.5-2x for scalar due to cache efficiency. AoS loads the entire struct (24 bytes here) when the loop touches only two fields (8 bytes), wasting most of the fetched bandwidth. SoA enables: (1) SIMD processing of contiguous x values, (2) Better cache utilization when accessing single field across many objects, (3) Streaming stores. Use AoS when all fields accessed together; SoA when iterating over single field.
BEFORE: for(i=0; i<n; i++) { sum += data[indices[i]]; } (random access pattern). AFTER: Step 1: Sort indices with payload, Step 2: Access sequentially, Step 3: Unsort if needed. Or use prefetching: for(i=0; i<n; i++) { __builtin_prefetch(&data[indices[i+8]], 0, 1); sum += data[indices[i]]; }. For GPU: restructure to ensure threads in a warp access consecutive addresses. BEFORE (GPU): val = data[threadIdx.x * stride]; AFTER: val = data[blockIdx.x * blockDim.x + threadIdx.x];. Speedup: 5-50x depending on access pattern. Random access achieves ~1% of sequential bandwidth due to cache line waste (load 64 bytes, use 4). Sorting indices can provide 3-10x speedup even with sort overhead for large datasets.
BEFORE: j=0; for(i=0;i<n;i++) if(pred(arr[i])) out[j++] = arr[i];. AFTER (AVX2, shuffle-table compaction): __m256i data = _mm256_loadu_si256((const __m256i*)src); __m256i mask = predicate_simd(data); int m = _mm256_movemask_ps(_mm256_castsi256_ps(mask)); __m256i indices = _mm256_loadu_si256((const __m256i*)&shuffle_table[m]); __m256i compacted = _mm256_permutevar8x32_epi32(data, indices); _mm256_storeu_si256((__m256i*)dst, compacted); dst += __builtin_popcount(m);. Requires a precomputed 256-entry shuffle table indexed by the 8-bit mask (alternatively, the shuffle control can be expanded from the mask with BMI2 pdep). Speedup: 2-5x. Used in filtering, removing whitespace, extracting valid elements. AVX-512 has VPCOMPRESSD which does this in one instruction: _mm512_mask_compress_epi32.
BEFORE: for(i=0;i<N;i++) for(j=0;j<N;j++) for(k=0;k<N;k++) C[i][j] += A[i][k] * B[k][j];. AFTER: #define BLOCK 64 for(ii=0;ii<N;ii+=BLOCK) for(jj=0;jj<N;jj+=BLOCK) for(kk=0;kk<N;kk+=BLOCK) for(i=ii;i<ii+BLOCK;i++) for(j=jj;j<jj+BLOCK;j++) for(k=kk;k<kk+BLOCK;k++) C[i][j] += A[i][k] * B[k][j];. Block size chosen so 3 blocks fit in L1 cache: 3*64*64*8 bytes = 96KB for doubles (too large for a 32KB L1), so use BLOCK=32: 3*32*32*8 = 24KB. Speedup: 2-10x for large matrices. The reordered loops ensure A[i][k:k+BLOCK] and B[k:k+BLOCK][j] remain in cache. Further optimize with SIMD, unrolling inner loops, and prefetching. This is the basis of BLAS Level 3 operations.
BEFORE: if(a && b) return 3; else if(a && !b) return 2; else if(!a && b) return 1; else return 0;. AFTER: int table[2][2] = {{0, 1}, {2, 3}}; return table[a != 0][b != 0];. For multi-variable conditions: pack bits into index: int idx = (a?4:0) | (b?2:0) | (c?1:0); return table[idx];. This eliminates all branches. For character classification: bool is_alpha[256]; return is_alpha[(unsigned char)c];. Speedup: Eliminates O(n) branch mispredictions for n conditions. Best when: conditions are data-dependent (unpredictable), table fits in cache (< 64KB), and access pattern is random. Tables trade memory for speed.
BEFORE: int floor_val = (int)floor(x);. AFTER: int floor_val = (int)x - (x < (int)x); handles negative correctly. Or: floor_val = x >= 0 ? (int)x : (int)x - 1;. For SIMD: _mm256_floor_ps then _mm256_cvttps_epi32 (requires SSE4.1/AVX). Faster when already in integer math: floor(a/b) for positive a,b is simply a/b. For ceil: ceil_val = (int)x + (x > (int)x);. Or: ceil(a/b) = (a + b - 1) / b for positive integers. Speedup: floor() function call is 10-20 cycles, cast with adjustment is 2-3 cycles. The SIMD round functions (_mm256_round_ps with _MM_FROUND_FLOOR) are single instructions. Use -ffast-math to allow compiler floor optimization.
BEFORE: for(i=0;i<n;i++) hist[data[i]]++;. AFTER (parallel with private histograms): #pragma omp parallel { int local_hist[256] = {0}; #pragma omp for for(i=0;i<n;i++) local_hist[data[i]]++; #pragma omp critical for(j=0;j<256;j++) hist[j] += local_hist[j]; }. SIMD approach: Use conflict detection (_mm512_conflict_epi32) and masked accumulation. Alternative: Sort data first, then count runs (better for SIMD). Speedup: Near-linear with threads for large n. The key insight is avoiding atomic operations on shared histogram by using thread-local copies and merging. For small bin counts (256), merge overhead is negligible.
BEFORE: if(x >= low && x <= high) in_range();. AFTER: if((unsigned)(x - low) <= (unsigned)(high - low)) in_range();. This uses unsigned arithmetic to combine two comparisons into one. Works because if x < low, then x - low wraps to large unsigned value > (high - low). If x > high, then x - low > high - low directly. Speedup: 1 comparison + 1 subtraction vs 2 comparisons + AND. Most significant when checking array bounds: if((unsigned)index < array_size). Compilers often generate this optimization for signed range checks with -O2. For SIMD: AVX2 has no unsigned compare, so either XOR both sides with 0x80000000 and use _mm256_cmpgt_epi32, or emulate unsigned >= via _mm256_max_epu32 plus _mm256_cmpeq_epi32; AVX-512VL adds _mm256_cmpgt_epu32_mask directly.
BEFORE: unsigned extract_bits(unsigned x, int start, int len) { return (x >> start) & ((1 << len) - 1); }. AFTER: Use BFE (Bit Field Extract) instruction via intrinsic: _bextr_u32(x, start, len) (BMI1). For constant start/len, compilers optimize the shift-and-mask. Without BMI1: precompute mask: unsigned masks[33] = {0, 1, 3, 7, 15, ...}; return (x >> start) & masks[len];. SIMD: No direct support, use shift and AND. Speedup: BFE is 1 instruction vs 3 for shift-AND-mask. Essential for parsing packed binary formats, compression algorithms, bit manipulation. Check BMI1 support: __builtin_cpu_supports("bmi"). AMD has supported BFE since Piledriver (2012), Intel since Haswell (2013).
BEFORE: int abs_val = (x < 0) ? -x : x; (branch). AFTER: int mask = x >> 31; int abs_val = (x + mask) ^ mask;. Explanation: For positive x, mask=0, result=(x+0)^0=x. For negative x, mask=-1 (all 1s), result=(x-1)^(-1). XOR with -1 flips all bits, and (x-1) with flipped bits equals -x (two's complement). Alternative: abs_val = (x ^ mask) - mask;. For floating-point: Clear the sign bit directly: uint32_t bits; memcpy(&bits, &f, 4); bits &= 0x7FFFFFFF; memcpy(&f, &bits, 4);. SIMD: _mm256_andnot_ps(sign_mask, vec) where sign_mask = _mm256_set1_ps(-0.0f). Speedup: 1.5-2x when branches mispredict. Many compilers optimize abs() to branchless form automatically.
BEFORE: switch(opcode) { case 0: fn0(); break; case 1: fn1(); break; ... case N: fnN(); break; }. AFTER: typedef void (*Handler)(void); Handler table[N+1] = {fn0, fn1, ..., fnN}; table[opcode]();. For small dense ranges, jump tables are most efficient. For sparse ranges: use perfect hashing or binary search. SIMD lookup: _mm256_permutevar8x32_epi32 for 8-entry tables, vpshufb for 16-entry byte tables. Speedup: Eliminates branch misprediction chain (N/2 mispredictions on average for N cases). Jump table is O(1), switch can be O(N) worst case. Compilers often generate jump tables automatically for dense switch statements (check -O2 assembly).
BEFORE: void inorder(Node* n) { if(!n) return; inorder(n->left); process(n); inorder(n->right); } (O(h) stack space). AFTER (Morris traversal): Node* curr = root; while(curr) { if(!curr->left) { process(curr); curr = curr->right; } else { Node* pred = curr->left; while(pred->right && pred->right != curr) pred = pred->right; if(!pred->right) { pred->right = curr; curr = curr->left; } else { pred->right = NULL; process(curr); curr = curr->right; } } }. O(1) space by temporarily modifying tree structure. Speedup: Not faster (2x more pointer operations), but eliminates stack overflow risk for deep trees. Used when memory is extremely constrained or tree depth is unbounded.
BEFORE (linear): sum = 0; for(i=0; i<n; i++) sum += a[i]; (serial dependency chain, n iterations). AFTER (tree reduction): Step 1: Parallel pairwise sum: b[i] = a[2i] + a[2i+1] for i in [0,n/2). Step 2: Repeat on b until single element. In SIMD: __m256 acc = _mm256_loadu_ps(arr); for(i=8; i<n; i+=8) acc = _mm256_add_ps(acc, _mm256_loadu_ps(&arr[i])); then reduce 8->4->2->1. Tree depth is log2(n) vs n for linear. Speedup: For 1M elements, log2(1M)=20 steps with max parallelism vs 1M serial adds. Practical speedup: 4-8x with SIMD, even more on GPU. All parallel reduction algorithms (MPI_Reduce, CUDA reduction kernels) use tree structure.
BEFORE: uint32_t next_pow2 = 1; while(next_pow2 < x) next_pow2 *= 2;. AFTER: x--; x |= x >> 1; x |= x >> 2; x |= x >> 4; x |= x >> 8; x |= x >> 16; x++;. This fills all bits below the highest set bit with 1s, then adds 1 to get next power of 2. For round down to power of 2: y = 1 << (31 - __builtin_clz(x)); using leading zero count. Alternative: next_pow2 = 1 << (32 - __builtin_clz(x - 1)); for x > 1. Speedup: Loop is O(log n), bit manipulation is O(1). Use cases: hash table sizing, memory allocator bucket sizes, FFT length requirements. For 64-bit, extend the pattern with x |= x >> 32;.
BEFORE: while(*p) { if(*p == '"') handle_quote(); else if(*p == '\\') handle_escape(); p++; }. AFTER (SIMD character search): __m256i quote = _mm256_set1_epi8('"'); __m256i backslash = _mm256_set1_epi8('\\'); while(p < end) { __m256i chunk = _mm256_loadu_si256((const __m256i*)p); __m256i q = _mm256_cmpeq_epi8(chunk, quote); __m256i b = _mm256_cmpeq_epi8(chunk, backslash); int qm = _mm256_movemask_epi8(q); int bm = _mm256_movemask_epi8(b); if(qm | bm) { handle_special(p, qm, bm); } p += 32; }. This is how simdjson achieves 2-4GB/s JSON parsing. Speedup: 4-10x for parsing-heavy workloads. The key insight: scan for special characters in bulk, then handle them individually.
BEFORE: float dot = 0; for(i=0; i<n; i++) dot += a[i] * b[i];. AFTER (AVX): __m256 sum = _mm256_setzero_ps(); for(i=0; i<n; i+=8) { sum = _mm256_fmadd_ps(_mm256_loadu_ps(&a[i]), _mm256_loadu_ps(&b[i]), sum); }. Horizontal reduction: __m128 lo = _mm256_castps256_ps128(sum); __m128 hi = _mm256_extractf128_ps(sum, 1); __m128 r = _mm_add_ps(lo, hi); r = _mm_hadd_ps(r, r); r = _mm_hadd_ps(r, r); float dot = _mm_cvtss_f32(r);. For SSE4.1, single vector: _mm_dp_ps(a, b, 0xF1) computes dot product directly but only for 4 elements. Speedup: 4-8x. Use FMA (_mm256_fmadd_ps) instead of separate multiply-add for 2x throughput.
BEFORE: uint32_t morton = 0; for(i=0; i<16; i++) morton |= ((x & (1<<i)) << i) | ((y & (1<<i)) << (i+1));. AFTER: Use parallel bit deposit (PDEP) instruction: uint32_t morton = _pdep_u32(x, 0x55555555) | _pdep_u32(y, 0xAAAAAAAA);. Without BMI2: x = (x | (x << 8)) & 0x00FF00FF; x = (x | (x << 4)) & 0x0F0F0F0F; x = (x | (x << 2)) & 0x33333333; x = (x | (x << 1)) & 0x55555555; (same for y, then OR in y shifted left by 1). Speedup: Loop is 64+ operations, PDEP is 1 instruction. Morton codes enable Z-order curves for spatial locality in 2D/3D data, improving cache performance for spatial queries. Check BMI2 support: __builtin_cpu_supports("bmi2").
BEFORE: if(condition) count++;. AFTER: count += condition; or count += (int)(condition != 0);. The boolean expression evaluates to 0 or 1 in C/C++, which directly adds to count. For counting set bits in array: BEFORE: for(i=0;i<n;i++) if(arr[i]) count++;. AFTER: for(i=0;i<n;i++) count += (arr[i] != 0);. SIMD: __m256i mask = _mm256_cmpgt_epi32(arr, zero); count_vec = _mm256_sub_epi32(count_vec, mask); (subtract -1 to add 1 where true). Speedup: 1.5-3x when condition is unpredictable. Compilers often generate this transformation automatically, but explicit form guarantees it. Profile to confirm branch misprediction before optimizing.
BEFORE: result = x / 7; (integer division: 20-90 cycles). AFTER: multiply-high by the magic constant, then a short fixup: uint32_t t = ((uint64_t)x * 0x24924925ULL) >> 32; result = (((x - t) >> 1) + t) >> 2; (a 3-4 cycle multiply plus a few 1-cycle ops instead of a 20-90 cycle divide). For floating-point: result = x * 0.142857142857f; (1/7). Compilers do this automatically for constant divisors using the technique from 'Division by Invariant Integers using Multiplication' (Granlund/Montgomery 1994); the magic constant, shift, and any fixup are precomputed. For power-of-2 divisors: x/8 becomes x>>3 for unsigned, (x + ((x>>31)&7)) >> 3 for signed (handles negative rounding). Speedup: 5-20x for integer division. Always prefer multiplication by reciprocal for floating-point hot paths. Use Compiler Explorer to verify the transformation occurs.
BEFORE: bool is_pow2 = false; for(int p=1; p>0; p<<=1) if(x==p) { is_pow2=true; break; }. AFTER: bool is_pow2 = x && !(x & (x - 1));. Explanation: x-1 flips all bits from the lowest set bit down. AND with x is zero only if there was exactly one set bit. The x && handles the x=0 case. Alternative: is_pow2 = __builtin_popcount(x) == 1;. For finding which power: int log2 = __builtin_ctz(x); when x is known to be power of 2. Speedup: O(1) vs O(log n). Essential for hash table operations, memory alignment checks, bit manipulation algorithms. The pattern x & (x-1) also clears the lowest set bit, useful for iteration: while(x) { process(__builtin_ctz(x)); x &= x-1; }.
BEFORE: if(condition) x = a; else x = b;. AFTER: mask = -(int)(condition); x = (a & mask) | (b & ~mask);. The expression -(int)(condition) converts boolean to all-1s or all-0s mask. When condition is true: mask=0xFFFFFFFF, ~mask=0, so x = (a & 0xFF...F) | (b & 0) = a. When false: mask=0, ~mask=0xFF...F, so x = (a & 0) | (b & 0xFF...F) = b. Alternative using XOR: x = b ^ ((a ^ b) & mask);. Speedup: 2-3x for random conditions. This pattern is essential for cryptographic code (constant-time operations) and SIMD where all lanes must execute the same path. Compilers often generate this automatically from ternary operator when optimizing.
BEFORE: qsort(arr, n, sizeof(int), compare); (pointer chasing, cache-unfriendly for large n). AFTER: Use radix sort for integers: void radix_sort(uint32_t* arr, int n) { uint32_t* aux = malloc(n * sizeof(uint32_t)); for(int shift=0; shift<32; shift+=8) { int count[256]={0}; for(int i=0;i<n;i++) count[(arr[i]>>shift)&0xFF]++; for(int i=1;i<256;i++) count[i]+=count[i-1]; for(int i=n-1;i>=0;i--) aux[--count[(arr[i]>>shift)&0xFF]]=arr[i]; swap(arr,aux); } free(aux); }. Complexity: O(n log n) comparisons for quicksort vs O(n*w/r) passes for radix (w=key bits, r=radix bits; here 4 passes). For n=1M 32-bit integers, radix sort is 2-5x faster than quicksort due to sequential memory access and predictable branches.
BEFORE: int ctz = 0; while((x & 1) == 0 && ctz < 32) { x >>= 1; ctz++; } (up to 32 iterations). AFTER: int ctz = __builtin_ctz(x); compiles to BSF (Bit Scan Forward) or TZCNT instruction. TZCNT (BMI1) is preferred: defined for x=0 (returns operand size), constant latency. BSF has undefined result for x=0. For 64-bit: __builtin_ctzll(x) uses BSF/TZCNT on 64-bit operand. The de Bruijn method without hardware: static const int table[32] = {...}; return table[((x & -x) * 0x077CB531U) >> 27];. Speedup: Loop is 32 cycles worst case, hardware instruction is 1-3 cycles. Use for finding lowest set bit position, extracting rightmost 1 bit.
BEFORE: for(i=0; i<n; i++) if(arr[i]==target) return i; return -1; (O(n) linear scan). AFTER: int lo=0, hi=n-1; while(lo<=hi) { int mid=(lo+hi)/2; if(arr[mid]==target) return mid; if(arr[mid]<target) lo=mid+1; else hi=mid-1; } return -1;. Requires sorted array. Speedup: O(log n) vs O(n). For n=1M: ~20 comparisons vs ~500K average. Branchless binary search is even faster: while(len>1) { half=len/2; len-=half; lo+= (arr[lo+half-1]<target)*half; }. For small n (<64), linear search with SIMD may be faster due to cache friendliness. Use std::lower_bound in C++ which is highly optimized.
BEFORE: for(i=0; i<16; i++) data[indices[i]] = values[i];. AFTER (AVX-512): __m512i idx = _mm512_loadu_si512(indices); __m512 vals = _mm512_loadu_ps(values); _mm512_i32scatter_ps(data, idx, vals, sizeof(float));. Mask variant: _mm512_mask_i32scatter_ps(data, mask, idx, vals, scale). Important: with duplicate indices, the scattered writes are performed in element order (lowest lane to highest), so the highest lane's value is what lands in memory; for read-modify-write patterns (e.g., accumulation), detect duplicates with _mm512_conflict_epi32 and handle them. Speedup: Limited. Scatter is primarily for code simplification, not performance. It serializes stores internally. Only AVX-512 has scatter; AVX2 does not. Consider keeping data in SIMD registers and scattering only at boundaries.
BEFORE: int fib(int n) { if(n<=1) return n; return fib(n-1)+fib(n-2); } O(2^n). AFTER (matrix exponentiation): [[F(n+1), F(n)], [F(n), F(n-1)]] = [[1,1],[1,0]]^n. Use square-and-multiply for matrix power: O(log n). Matrix multiply is 8 multiplications + 4 additions. For n=1000000, naive recursion is impossible, matrix method computes in ~60 matrix multiplications. Speedup: O(2^n) to O(log n), exponentially faster. This pattern applies to any linear recurrence: a(n) = c1a(n-1) + c2a(n-2) + ... can be expressed as matrix power. Used in competitive programming and computing large Fibonacci numbers modulo prime.
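A compact sketch of the square-and-multiply matrix power for F(n) mod m (names are illustrative; assumes m < 2^32 so the 64-bit products cannot overflow):

#include <stdint.h>
typedef struct { uint64_t a, b, c, d; } Mat2;                 /* [[a,b],[c,d]] */
static Mat2 mat_mul(Mat2 x, Mat2 y, uint64_t m) {
    Mat2 r = { (x.a*y.a + x.b*y.c) % m, (x.a*y.b + x.b*y.d) % m,
               (x.c*y.a + x.d*y.c) % m, (x.c*y.b + x.d*y.d) % m };
    return r;
}
uint64_t fib_mod(uint64_t n, uint64_t m) {
    Mat2 result = {1, 0, 0, 1};                               /* identity */
    Mat2 base   = {1, 1, 1, 0};                               /* [[1,1],[1,0]] */
    while (n > 0) {                                           /* square-and-multiply: O(log n) */
        if (n & 1) result = mat_mul(result, base, m);
        base = mat_mul(base, base, m);
        n >>= 1;
    }
    return result.b;                                          /* = F(n) mod m */
}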
BEFORE: int count = 0; while(x) { count += x & 1; x >>= 1; } (32 iterations worst case). AFTER: Use hardware instruction via __builtin_popcount(x) or POPCNT instruction directly. Without hardware: int count = x - ((x >> 1) & 0x55555555); count = (count & 0x33333333) + ((count >> 2) & 0x33333333); count = (count + (count >> 4)) & 0x0f0f0f0f; count = (count * 0x01010101) >> 24;. SIMD: _mm_popcnt_u64 or vpshufb lookup table for bytes then sum. Speedup: Loop is 100+ cycles, POPCNT is 1 cycle (3 cycle latency). The bit manipulation version is ~12 cycles. Enable POPCNT with -mpopcnt or -march=native. Check support: __builtin_cpu_supports("popcnt").
BEFORE: for(i=1; i<n; i++) prefix[i] = prefix[i-1] + arr[i];. AFTER (SIMD parallel prefix within each 128-bit lane): add the vector to a copy of itself shifted left by one element, then by two elements, then add the carried total from the previous block. Note that _mm256_slli_si256 is an integer intrinsic that shifts within each 128-bit lane, so the float version needs casts between __m256 and __m256i plus an explicit cross-lane carry (a broadcast of the low lane's last element) - see the SSE sketch below. For larger arrays, compute local prefix sums in blocks, then adjust each block by adding the sum of all previous blocks. The Blelloch scan algorithm achieves O(n/p) work per processor with p processors. Speedup: 2-4x for SIMD within a block, more with parallelization. Used in stream compaction, sorting, histogram computation.
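A minimal SSE sketch of the block scan with a carried total, assuming n is a multiple of 4 (the AVX2 version is the same idea per 128-bit lane plus a cross-lane broadcast):

#include <immintrin.h>
void prefix_sum_sse(const float *in, float *out, int n) {
    __m128 carry = _mm_setzero_ps();                           /* running total of previous blocks */
    for (int i = 0; i < n; i += 4) {
        __m128 x = _mm_loadu_ps(&in[i]);
        /* in-register inclusive scan of 4 floats; slli_si128 is an integer op, hence the casts */
        x = _mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 4)));
        x = _mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 8)));
        x = _mm_add_ps(x, carry);                              /* add carry from earlier blocks */
        _mm_storeu_ps(&out[i], x);
        carry = _mm_shuffle_ps(x, x, _MM_SHUFFLE(3, 3, 3, 3)); /* broadcast last element */
    }
}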
BEFORE: for(i=0;i<n;i++) if(cond[i]) out[i] = val[i];. AFTER (AVX-512): __mmask16 mask = _mm512_cmpneq_epi32_mask(cond_vec, zero); _mm512_mask_storeu_ps(out, mask, val);. For AVX/AVX2, _mm256_maskstore_ps(out, mask_i, val) is a true masked store (VMASKMOVPS, takes an integer mask); the blend alternative is: __m256 mask = _mm256_castsi256_ps(_mm256_cmpgt_epi32(cond, zero)); __m256 result = _mm256_blendv_ps(_mm256_loadu_ps(out), val, mask); _mm256_storeu_ps(out, result);. The blendv approach reads existing values and selectively replaces them. Speedup: 2-4x with proper masking. AVX-512 masking is more efficient as it doesn't require loading existing values. Masked stores also suppress faults on inactive lanes, enabling safe boundary handling.
BEFORE: crc = 0xFFFFFFFF; for each bit: crc = (crc >> 1) ^ (polynomial & -(crc & 1));. AFTER (table lookup, 1 byte at a time): static uint32_t table[256]; // precomputed for(i=0;i<len;i++) crc = (crc >> 8) ^ table[(crc ^ data[i]) & 0xFF];. Table generation: for(i=0;i<256;i++) { crc=i; for(j=0;j<8;j++) crc = (crc>>1) ^ (poly & -(crc&1)); table[i]=crc; }. For more speed: 4-way table (slicing-by-4) processes 4 bytes per iteration. Modern CPUs: use CRC32 instruction _mm_crc32_u64 for hardware CRC32C. Speedup: 8x with table lookup, 50x+ with hardware instruction. CRC32C achieves >10GB/s with hardware support.
BEFORE: struct Node { int val; Node* next; }; Node* p = head; while(p) { process(p->val); p = p->next; } (pointer chasing, cache-hostile). AFTER: Store data in contiguous array: int arr[n]; for(i=0; i<n; i++) process(arr[i]);. If order matters, use array of indices for logical next pointers: int next[n]; for(i=start; i!=-1; i=next[i]) process(arr[i]);. Or flatten: copy list to array, process array, rebuild list if needed. Speedup: 3-10x. Linked list traversal achieves ~5% of memory bandwidth due to pointer chasing latency (one cache miss per node). Arrays enable prefetching and SIMD. Only use linked lists when O(1) insertion/deletion is critical and cache locality isn't.
BEFORE (AoS): struct Pixel { uint8_t r, g, b, a; } pixels[n]; Processing interleaved RGBA requires gather/scatter. AFTER (SoA): uint8_t r[n], g[n], b[n], a[n];. Deinterleave: __m256i rgbargba = _mm256_loadu_si256(src); // 8 pixels __m256i shuffled = _mm256_shuffle_epi8(rgbargba, deinterleave_mask); // group channels. Or process AoS using AVX2 shuffle to extract channels in-place. For conversion: for(i=0;i<n;i+=4) { uint32_t* p = (uint32_t*)&pixels[i]; for(j=0;j<4;j++) { r[i+j]=p[j]&0xFF; g[i+j]=(p[j]>>8)&0xFF; ... } }. Speedup: 2-4x for channel-independent operations. Keep SoA internally, convert at boundaries.
BEFORE: if(x < min) x = min; else if(x > max) x = max;. AFTER (scalar branchless): x = x < min ? min : (x > max ? max : x); compilers generate CMOV. AFTER (SIMD): __m256 clamped = _mm256_min_ps(_mm256_max_ps(values, min_vec), max_vec);. Composing max then min is the standard pattern. For integers: _mm256_min_epi32/_mm256_max_epi32 (SSE4.1 for the 128-bit forms, AVX2 for 256-bit). Saturating arithmetic for specific ranges: _mm256_adds_epi16 clamps to [-32768, 32767] automatically. Speedup: 4-8x with SIMD. Clamp is ubiquitous in graphics (color clamping), audio (sample limiting), and physics (bounds checking). The nested min(max()) pattern works for any ordered type with min/max operations.
BEFORE: int cmp = strcmp(a, b); (byte-by-byte comparison). AFTER (SSE4.2): int cmp = 0; for(i=0; ; i+=16) { __m128i va = _mm_loadu_si128((__m128i*)&a[i]); __m128i vb = _mm_loadu_si128((__m128i*)&b[i]); int idx = _mm_cmpistri(va, vb, _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH | _SIDD_NEGATIVE_POLARITY); if(idx < 16) { cmp = (unsigned char)a[i+idx] - (unsigned char)b[i+idx]; break; } if(_mm_cmpistrz(va, vb, _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH | _SIDD_NEGATIVE_POLARITY)) break; }. PCMPISTRI compares 16 bytes per call and handles the null terminator implicitly. Speedup: 2-4x for long strings. For known-length (memcmp style): use _mm_cmpeq_epi8 and _mm_movemask_epi8. glibc uses this approach for optimized string functions.
BEFORE: remainder = x % 16; (division instruction, 20+ cycles). AFTER: remainder = x & 15; (AND instruction, 1 cycle). General pattern: x % (2^n) = x & ((1 << n) - 1) for unsigned integers. For signed integers, the pattern is more complex due to negative number representation: remainder = ((x % n) + n) % n or use: int mask = n - 1; remainder = x & mask; if (x < 0 && remainder) remainder |= ~mask; Speedup: 10-20x. This is why hash tables use power-of-2 sizes. Compilers optimize x % CONST automatically when CONST is power of 2. For non-power-of-2, combine with Barrett reduction for repeated modulo by same divisor.
BEFORE: if(a > b) max = a; else max = b;. AFTER using subtraction and sign bit: int diff = a - b; int mask = diff >> 31; max = a - (diff & mask);. Explanation: If a>b, diff>0, mask=0, max=a-0=a. If a<=b, diff<=0, mask=-1 (all 1s), max=a-diff=a-(a-b)=b. Alternative using XOR: max = a ^ ((a ^ b) & mask);. For min: min = b + (diff & mask);. These compile to pure arithmetic without branches. Speedup: 2-3x when branches mispredict. Compilers generate CMOV for simple ternary operators, but complex conditions may need manual transformation. Profile to verify branch misprediction is the bottleneck before optimizing.
BEFORE: result = a[0] + a[1]*x + a[2]*x*x + a[3]*x*x*x + ...;. AFTER (Horner's method): result = a[n]; for(i=n-1;i>=0;i--) result = result*x + a[i];. Or unrolled: result = a[n]; result = result*x + a[n-1]; result = result*x + a[n-2]; .... Horner's method uses n multiplications and n additions instead of n(n+1)/2 multiplications. With FMA: for(i=n-1;i>=0;i--) result = fma(result, x, a[i]);. Speedup: O(n^2) multiplies to O(n). For degree-7 polynomial: 28 muls -> 7 muls (4x faster). This is the standard method for polynomial evaluation in numerical computing. Estrin's method offers more parallelism for SIMD but requires more operations.
BEFORE: uint64_t result = 1; for(i=0; i<exp; i++) result = (result * base) % mod; (O(exp) multiplications). AFTER (square-and-multiply): uint64_t result = 1; while(exp > 0) { if(exp & 1) result = (result * base) % mod; base = (base * base) % mod; exp >>= 1; }. O(log exp) multiplications. Further optimize with Montgomery multiplication to avoid modulo. SIMD: Limited applicability, but multiple independent exponentiations can be parallelized. Speedup: O(exp) to O(log exp). For exp=1000000, from 1M mults to ~20 mults (50000x faster). This is the standard algorithm for RSA, Diffie-Hellman, and other cryptographic operations.
BEFORE: Byte-by-byte state machine checking continuation bytes, overlong encodings, surrogate pairs. AFTER: Use SIMD lookup tables for byte classification. Core algorithm (simdjson approach): 1) Classify each byte (ASCII, 2-byte start, 3-byte start, 4-byte start, continuation). 2) Use _mm256_shuffle_epi8 as 16-entry lookup for byte->class. 3) Compute expected continuation count, compare with actual. 4) Check for overlong encodings and invalid ranges using comparisons. Implementation validates 32-64 bytes per iteration. Speedup: 10-20x. Validating 1GB UTF-8 takes ~50ms with SIMD vs ~800ms scalar. See simdjson and simdutf libraries for production implementations.
BEFORE: for(i=0;i<n;i++) result[i] = data[i] / divisor;. AFTER: float recip = 1.0f / divisor; for(i=0;i<n;i++) result[i] = data[i] * recip;. For integer: uint32_t recip = ((1ULL << 32) + divisor - 1) / divisor; for(i=0;i<n;i++) result[i] = ((uint64_t)data[i] * recip) >> 32;. SIMD: __m256 recip_vec = _mm256_set1_ps(1.0f / divisor); result = _mm256_mul_ps(data, recip_vec);. Speedup: Division is 10-20 cycles, multiply is 4-5 cycles (2-4x faster). Essential when dividing many values by the same divisor. Precision consideration: floating-point reciprocal has rounding error; for exact integer division, use the magic number technique.
BEFORE (Euclidean): while(b) { int t = b; b = a % b; a = t; } return a;. AFTER (Binary GCD): int shift = __builtin_ctz(a | b); a >>= __builtin_ctz(a); while(b) { b >>= __builtin_ctz(b); if(a > b) { int t = a; a = b; b = t; } b -= a; } return a << shift;. Binary GCD replaces expensive division/modulo with cheap shifts and subtraction. Speedup: 2-4x on modern CPUs. The ctz (count trailing zeros) efficiently finds factors of 2. While Euclidean is simpler and compilers optimize division well, binary GCD has more predictable performance and is preferred in cryptographic implementations to avoid timing attacks.
BEFORE: int clz = 0; while((x & 0x80000000) == 0 && clz < 32) { x <<= 1; clz++; }. AFTER: int clz = __builtin_clz(x); compiles to BSR (Bit Scan Reverse) + subtraction or LZCNT instruction. LZCNT (ABM/BMI) directly returns leading zero count, defined for x=0 (returns 32/64). BSR finds highest set bit position, then 31-BSR gives leading zeros. For floor(log2(x)): Use 31 - __builtin_clz(x) when x > 0. For ceiling(log2(x)): 32 - __builtin_clz(x - 1) when x > 1. Speedup: Loop 32 cycles vs hardware 1-3 cycles. Applications: finding number magnitude, normalizing floating-point, fast log2 approximation.
BEFORE: Extract bits at positions defined by mask: uint32_t result = 0, j = 0; for(i=0; i<32; i++) if(mask & (1<<i)) result |= ((src >> i) & 1) << j++;. AFTER: uint32_t result = _pext_u32(src, mask); extracts bits where mask has 1s and packs them contiguously. Example: _pext_u32(0xABCD, 0x0F0F) extracts nibbles B and D, producing 0x00BD. The inverse, PDEP, deposits bits: _pdep_u32(0x00BD, 0x0F0F) produces 0x0B0D. Speedup: Loop is 32+ iterations, PEXT is 1 instruction (3 cycles on Intel). Applications: extracting bit fields, implementing chess move generators, parsing packed formats. Note: PEXT/PDEP are slow on AMD Zen1/2 (~18 cycles), fast on Zen3+ and all Intel.
BEFORE: result = a + t * (b - a); (3 ops: sub, mul, add). AFTER: result = fma(t, b, fma(-t, a, a)); or result = fma(t, b - a, a);. Best form: result = a + t * (b - a); let compiler use FMA. With SIMD: __m256 result = _mm256_fmadd_ps(t, _mm256_sub_ps(b, a), a);. Alternative formulation: result = (1-t)a + tb; becomes result = fma(t, b, fma(-t, a, a)); for better numerical stability near t=1. The a + t*(b-a) form has better stability near t=0. Speedup: FMA reduces 3 operations to 2, with better precision. For animation, color blending, and physics interpolation. Ensure -ffp-contract=fast or use explicit fma() to guarantee fusion.
BEFORE: bool any = (a[0] || a[1] || a[2] || ... || a[n-1]);. AFTER (SIMD): __m256i zero = _mm256_setzero_si256(); __m256i acc = zero; for(i=0;i<n;i+=8) acc = _mm256_or_si256(acc, _mm256_loadu_si256(&a[i])); bool any = !_mm256_testz_si256(acc, acc);. The VPTEST instruction sets ZF if all bits are zero. For short-circuit evaluation (early exit when found): for(i=0;i<n;i+=8) { __m256i v = _mm256_loadu_si256(&a[i]); if(!_mm256_testz_si256(v, v)) return true; }. Speedup: 8x for non-short-circuit check. For all-of: use AND instead of OR, check all bits are 1. Similar pattern works for finding if any element meets a condition via comparison masks.
BEFORE: for(i=0; i<n; i++) max_val = (arr[i] > max_val) ? arr[i] : max_val;. AFTER (SSE): __m128 max_vec = _mm_set1_ps(-FLT_MAX); for(i=0; i<n; i+=4) { max_vec = _mm_max_ps(max_vec, _mm_loadu_ps(&arr[i])); }. Then horizontal reduction of max_vec. For integer: _mm_max_epi32 (SSE4.1), _mm_max_epu8 (unsigned bytes). For min: _mm_min_ps, _mm_min_epi32. Scalar branchless: max = a - ((a-b) & ((a-b) >> 31));. The SIMD versions are inherently branchless and process 4-16 elements per instruction. Speedup: 4-8x with SSE/AVX. Use _mm256_max_ps for AVX (8 floats) or _mm512_max_ps for AVX-512 (16 floats).
BEFORE: for(i=0; i<n; i++) { double scale = sin(theta) * cos(phi); result[i] = data[i] * scale; }. AFTER: double scale = sin(theta) * cos(phi); for(i=0; i<n; i++) { result[i] = data[i] * scale; }. For array-based invariants: BEFORE: for(i=0; i<n; i++) { len = strlen(str); if(i < len) process(str[i]); }. AFTER: len = strlen(str); for(i=0; i<n && i<len; i++) process(str[i]);. Speedup: Depends on invariant cost. For sin/cos: 100+ cycles saved per iteration. For strlen on 1KB string: 1000 cycles saved per iteration. Compilers perform basic LICM (Loop Invariant Code Motion) at -O2+, but may miss function calls without __attribute__((const)) or complex expressions.
BEFORE (horizontal): __m256 sum = _mm256_hadd_ps(a, b); (adds adjacent pairs within vector, crosses lanes). AFTER (vertical): Accumulate using vertical adds throughout loop, single horizontal reduction at end. Loop: __m256 acc = _mm256_setzero_ps(); for(i=0; i<n; i+=8) acc = _mm256_add_ps(acc, _mm256_loadu_ps(&arr[i]));. Final reduction: __m128 lo = _mm256_extractf128_ps(acc, 0); __m128 hi = _mm256_extractf128_ps(acc, 1); __m128 sum128 = _mm_add_ps(lo, hi); sum128 = _mm_hadd_ps(sum128, sum128); sum128 = _mm_hadd_ps(sum128, sum128); float result = _mm_cvtss_f32(sum128);. Speedup: 3-5x. Horizontal ops have 3-7 cycle latency vs 1 cycle for vertical. Minimize horizontal operations; do them once at the end.
BEFORE: int sign = (x > 0) - (x < 0); or if(x>0) return 1; else if(x<0) return -1; else return 0;. AFTER: int sign = (x > 0) - (x < 0);. This actually compiles well but here's the bit manipulation version: int sign = (x >> 31) | ((unsigned)-x >> 31);. Explanation: (x >> 31) is -1 for negative, 0 otherwise. ((unsigned)-x >> 31) is 1 for positive (since -x is negative), 0 otherwise. OR combines them. For floating-point: copysign(1.0, x) returns +1.0 or -1.0 (doesn't return 0 for x=0). SIMD: Compare against zero, mask to -1/0/+1. Speedup: 1.5-2x when branches mispredict. Most useful in physics simulations, smoothstep functions.
Heuristics and Rules of Thumb
72 questions
Align data to SIMD register width: 16 bytes for SSE, 32 bytes for AVX/AVX2, 64 bytes for AVX-512. Unaligned SIMD loads/stores work on modern CPUs but may be slower when crossing cache line (64-byte) boundaries. Use compiler attributes: C11 '_Alignas(32)', GCC '__attribute__((aligned(32)))', MSVC '__declspec(align(32))', or C++11 'alignas(32)'. For dynamic allocation, use aligned_alloc() or _mm_malloc(). Padding arrays to SIMD-width multiples eliminates the need for scalar remainder loops and enables cleaner vectorization.
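Illustrative sketch (helper name is hypothetical; C11 aligned_alloc wants the size to be a multiple of the alignment, which the padding guarantees here):

#include <stdlib.h>
_Alignas(32) static float coeffs[8];                   /* static/stack variant via C11 _Alignas */
float *alloc_avx_buffer(size_t n) {
    size_t padded = (n + 7) & ~(size_t)7;              /* round up to 8 floats = 32 bytes */
    return aligned_alloc(32, padded * sizeof(float));  /* 32-byte aligned, no remainder loop */
}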
Use huge pages (2MB on x86) when: working set exceeds 4MB (1024 4KB pages), TLB miss rate is high in profiling, or memory access is scattered across large address ranges. Huge pages reduce TLB entries needed by 512x: 20MB requires only 10 huge pages vs 5120 standard pages. Best candidates: large arrays, memory-mapped files, databases, HPC applications. Enable with: Linux mmap() with MAP_HUGETLB, or transparent huge pages (THP). Benchmark first - huge pages can hurt performance for sparse access patterns due to internal fragmentation and longer page fault times.
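A minimal Linux sketch (assumes huge pages have been reserved, e.g. via /proc/sys/vm/nr_hugepages; falls back to normal pages otherwise):

#define _GNU_SOURCE
#include <sys/mman.h>
#include <stddef.h>
void *alloc_huge(size_t size) {                       /* size should be a multiple of 2MB */
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED)                              /* no hugetlb pages available */
        p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); /* THP may still promote this region */
    return p == MAP_FAILED ? NULL : p;
}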
For L2 cache blocking, use approximately sqrt(L2_size/3) elements. For a typical 256KB L2 cache with 4-byte floats, this gives sqrt(262144/3/4) = approximately 148 elements, or roughly 128-256 elements per dimension for 2D blocking. For 1MB L2 cache, target around 300 elements. L2 blocking is typically used as an outer loop around L1 blocking to create a two-level tiled algorithm that maximizes data reuse at both cache levels.
L3 cache access latency is 30-50 cycles on modern CPUs, approximately 12-20 nanoseconds. Specifically: Intel Kaby Lake: 42 cycles (16.8 ns at 2.5 GHz); Intel Haswell: 34 cycles (13 ns at 2.6 GHz); AMD Zen: ~35-40 cycles. L3 is shared across all cores and typically ranges from 8MB to 64MB on desktop/server CPUs. L3 latency varies with core count and NUMA topology. For multi-threaded applications, L3 hit rate determines cross-core data sharing efficiency. L3 misses go to DRAM with 100+ cycle penalty.
OpenMP fork/join overhead is typically 1-10 microseconds per parallel region entry, depending on implementation and number of threads. For loops, this means each iteration should do at least 10-100 microseconds of work to amortize parallelization overhead. With smaller tasks, the parallel version may be slower than sequential. Rule of thumb: parallelize when total loop work exceeds 100 microseconds and individual iterations take at least 1 microsecond. For finer-grained parallelism, use static scheduling to minimize runtime overhead compared to dynamic scheduling.
As a starting heuristic, use OpenMP parallel loops when iteration count exceeds 1000 iterations with simple bodies, or 100 iterations with moderately complex bodies (10-100 microseconds per iteration). For array operations, parallelize when array size exceeds 100,000 elements for simple operations or 10,000 elements for complex operations. Below these thresholds, the overhead of thread management often exceeds parallel speedup. Move parallelization to outer loops when possible to reduce fork/join frequency - one study showed 'code was spending nearly half the time doing OpenMP overhead work' with inner loop parallelization.
Theoretical peak bandwidth: DDR4-3200: 25.6 GB/s per channel, ~50 GB/s dual-channel; DDR5-5600: 44.8 GB/s per channel, ~90 GB/s dual-channel. Achievable bandwidth is 75-85% of peak: DDR4 dual-channel: 40-45 GB/s achievable; DDR5 dual-channel: 70-80 GB/s achievable. DDR5 doubles channels per DIMM (2 32-bit vs 1 64-bit) improving bank-level parallelism. For optimization planning, assume 40 GB/s for DDR4 systems, 70 GB/s for DDR5. Memory-bound code scales linearly with bandwidth, so DDR5 provides ~1.7x speedup for pure streaming workloads.
Use static scheduling when: iterations have uniform work (e.g., array operations), and you want minimum overhead. Use dynamic scheduling when: iteration work varies significantly (e.g., sparse matrix, adaptive algorithms), at the cost of higher overhead from runtime distribution. Use guided scheduling for: load balancing with lower overhead than dynamic - starts with large chunks, shrinks toward end. Specific guidance: static has lowest overhead (0.5 microseconds), dynamic has highest (2-5 microseconds), guided is intermediate. Default chunk size for static: iterations/num_threads; for dynamic: 1 (balance) or 64-256 (reduce overhead).
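Illustrative clause usage under the guidance above (function and variable names are made up):

#include <omp.h>
void saxpy_static(int n, float a, const float *x, float *y) {
    /* uniform per-iteration work: static scheduling, lowest overhead */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++)
        y[i] += a * x[i];
}
void rows_dynamic(int n_rows, const int *row_cost, long *out) {
    /* uneven per-iteration work: dynamic scheduling, chunk 64 to amortize dispatch */
    #pragma omp parallel for schedule(dynamic, 64)
    for (int i = 0; i < n_rows; i++) {
        long acc = 0;
        for (int k = 0; k < row_cost[i]; k++) acc += k;   /* stand-in for variable work */
        out[i] = acc;
    }
}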
The approximate latency ratio is L1:L2:L3:DRAM = 1:3:10:60 (in terms of L1 as baseline). Concrete numbers at 3GHz: L1 = 4 cycles (1.3 ns), L2 = 12 cycles (4 ns), L3 = 40 cycles (13 ns), DRAM = 240 cycles (80 ns). This ~60x difference between L1 and DRAM is the 'memory wall'. Bandwidth ratio is similar: L1 can deliver ~1-2 TB/s, L2 ~500 GB/s, L3 ~200 GB/s, DRAM ~50-100 GB/s. Understanding this hierarchy is crucial for cache optimization - each level miss costs roughly 3-10x more than the previous level hit.
Sequential access achieves 10-100x higher throughput than random access due to prefetching and cache line utilization. Typical measurements: sequential read: 30-50 GB/s (DDR4), 60-80 GB/s (DDR5); random read (8-byte): 0.5-2 GB/s (limited by latency, not bandwidth). The gap comes from: prefetchers work for sequential patterns (hiding 200+ cycle DRAM latency), each cache line (64 bytes) fully utilized in sequential vs partially in random, and memory controller optimizations for streaming. Design data structures for sequential access in hot paths wherever possible.
A function call costs approximately 15-25 cycles on modern CPUs, equivalent to 3-4 simple assignments: call instruction (~1-2 cycles), stack frame setup (push rbp, mov rbp,rsp: ~2 cycles), parameter passing (varies), return (pop, ret: ~2-3 cycles), plus potential pipeline disruption. Indirect function calls (through pointers/vtables) cost 3-4x more due to branch prediction miss potential. For small functions called millions of times, this overhead can dominate. Inline functions or link-time optimization (LTO) eliminates this overhead. Profile before optimizing - overhead only matters for very small, frequently-called functions.
Cache line size is 64 bytes on all modern x86/x86-64 processors (Intel and AMD since ~2005). This means memory is fetched and cached in 64-byte aligned chunks. Key implications: data structures should be sized/aligned to 64-byte boundaries for optimal access; arrays of 8-byte elements have 8 elements per cache line; false sharing occurs when different threads access different data within the same 64-byte line. Apple M-series uses 128-byte cache lines. Always pad data to avoid false sharing and align hot data to cache line boundaries.
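For example, per-thread counters padded so each occupies its own 64-byte line (struct name is illustrative):

#include <stdalign.h>
#define MAX_THREADS 16
struct padded_counter {
    alignas(64) long value;            /* one counter per cache line */
    char pad[64 - sizeof(long)];       /* explicit padding to exactly 64 bytes */
};
static struct padded_counter counters[MAX_THREADS];  /* no false sharing between threads */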
Modern CPUs support 10-20 outstanding memory requests per core via Line Fill Buffers (LFBs) and Miss Status Handling Registers (MSHRs). Intel Skylake: 12 L1D LFBs, 16 L2 superqueue entries; AMD Zen: 22 concurrent L1D misses. This limits single-core bandwidth to: concurrent_requests * cache_line_size / memory_latency. Example: 12 requests * 64 bytes / 80ns = 9.6 GB/s max single-core bandwidth. To achieve higher bandwidth, use multiple threads or software prefetching to keep memory requests in flight. Memory bandwidth scaling often requires 4-8 cores to saturate memory controller.
Batch size should make overhead <10% of useful work. Examples: system calls with 500-cycle overhead: batch 5000+ cycles of work (10+ small operations); network packets with 10 microsecond latency: batch 100+ microseconds of data; database commits with 1ms overhead: batch 10+ ms of transactions. For parallel work distribution: minimum chunk size = parallel_overhead / (num_threads - 1). If OpenMP fork/join costs 10 microseconds, each thread needs >100 microseconds of work for 10% overhead with 2 threads. Measure both latency and throughput - batching trades latency for throughput.
Prefer power-of-two array sizes for: fast modulo via bitwise AND (x & (size-1)), efficient cache blocking, SIMD alignment without remainders. Avoid power-of-two sizes when: accessing with power-of-two stride (causes cache set conflicts), or multiple power-of-two arrays compete for same cache sets. Mitigation: pad arrays to 'size + cache_line_size' to break alignment. Example: walking a float matrix with 1024-element (4096-byte) rows column-by-column maps every access in the walk onto the same few cache sets, leaving most of the cache unused; padding each row by 16 elements (one cache line) spreads the accesses across all sets.
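A hypothetical padding sketch: the row stride becomes 4160 bytes instead of 4096, so a column walk steps through different cache sets on every row:

#define ROWS 1024
#define COLS 1024
#define PAD  16                                /* one 64-byte cache line of floats */
static float matrix[ROWS][COLS + PAD];
float column_sum(int j) {
    float s = 0.0f;
    for (int i = 0; i < ROWS; i++)
        s += matrix[i][j];                     /* consecutive rows now map to adjacent sets */
    return s;
}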
L1 cache access latency is 4-5 cycles on modern Intel/AMD CPUs, which translates to approximately 1-2 nanoseconds at typical clock speeds. Specifically: Intel Kaby Lake: 5 cycles / 2.5 GHz = 2 ns; Intel Haswell: 5 cycles / 2.6 GHz = 1.9 ns; AMD Zen: 4 cycles. L1 cache is the fastest memory level after registers. L1 data cache is typically 32KB per core (8-way associative), and L1 instruction cache is also typically 32KB per core. Optimizing for L1 hit rate provides the largest performance gains.
Use spinlocks when: critical section is less than 1000 cycles (~0.3-0.5 microseconds), threads are unlikely to be preempted, and running on multicore system. Use mutexes when: critical section exceeds 1000 cycles, high contention is expected, or running in userspace where preemption is unpredictable. Threshold-based hybrids (like adaptive mutexes) spin for 1000-10000 CPU cycles before blocking. Key insight: spinlocks waste CPU when waiting, but avoid ~1000+ cycle context switch overhead. In userspace, pure spinlocks are usually wrong - use adaptive mutexes that spin briefly then sleep.
Auto-vectorization typically yields 5-10x speedup for embarrassingly parallel computations where you apply elementwise functions to arrays. The theoretical maximum is the SIMD width (4x for SSE floats, 8x for AVX floats, 16x for AVX-512 floats), but practical gains are limited by memory bandwidth, alignment overhead, and remainder loop handling. Memory-bound operations may see only 2-3x improvement regardless of SIMD width because the bottleneck shifts to memory bandwidth rather than compute throughput.
Modern out-of-order CPUs can hide latency for approximately 100-200 instructions in the reorder buffer (ROB), which translates to roughly 50-100 cycles of work. Intel Skylake has 224-entry ROB; AMD Zen3 has 256 entries. This means out-of-order execution can hide L1 misses that hit in L2 (~12 cycles) effectively but struggles with DRAM latency (200+ cycles). To help the CPU hide memory latency: ensure there are enough independent instructions between loads and their uses, use software prefetching for predictable access patterns, and unroll loops to expose more instruction-level parallelism.
Hardware prefetchers typically detect strides up to 2KB-4KB and handle 8-16 concurrent streams. Intel stride prefetcher detects forward/backward strides up to 2KB; stream prefetcher handles up to 32 streams within 4KB page. For optimal prefetcher effectiveness: use strides <2KB, access no more than 8-16 distinct arrays simultaneously in hot loops, and maintain consistent access patterns (prefetchers take time to learn). When strides exceed hardware limits or patterns are irregular, use software prefetching with explicit _mm_prefetch() instructions at appropriate distances.
malloc() overhead ranges from 50-100 cycles for small allocations to 1000+ cycles for large allocations requiring system calls. Each allocation involves: acquiring a global lock (in traditional allocators), searching free lists, potential memory fragmentation handling, and bookkeeping. Allocations over 64KB (varies by allocator) may trigger mmap() system calls costing thousands of cycles. Mitigation strategies: use object pools/arenas for same-size allocations, pre-allocate during initialization, use thread-local allocators (tcmalloc, jemalloc) to avoid lock contention, or use stack allocation for short-lived data.
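A minimal bump-pointer arena sketch (no per-object free; reset releases everything at once):

#include <stdlib.h>
#include <stddef.h>
typedef struct { char *base; size_t used, cap; } Arena;
int arena_init(Arena *a, size_t cap) {
    a->base = malloc(cap); a->used = 0; a->cap = cap;
    return a->base != NULL;
}
void *arena_alloc(Arena *a, size_t size) {
    size = (size + 15) & ~(size_t)15;            /* keep 16-byte alignment */
    if (a->used + size > a->cap) return NULL;    /* arena exhausted */
    void *p = a->base + a->used;
    a->used += size;                             /* a handful of cycles vs 50-100+ for malloc */
    return p;
}
void arena_reset(Arena *a) { a->used = 0; }      /* frees every allocation in O(1) */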
Atomic operations cost 10-100+ cycles depending on contention and cache state: uncontended atomic on local L1 cache: 10-20 cycles; contended atomic requiring cache line bounce between cores: 50-200 cycles; atomic across NUMA nodes: 100-300+ cycles. Compare to regular load/store: 4-5 cycles from L1. Lock-free algorithms using CAS loops can waste unpredictable cycles under high contention. Rule of thumb: minimize atomic operations in hot paths, batch updates when possible, use thread-local accumulation with periodic synchronization, and consider cache line padding to prevent false sharing on atomic variables.
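For instance, accumulate in a thread-local variable and publish once per thread instead of issuing one atomic per element (sketch; OpenMP used for brevity):

#include <stdatomic.h>
void count_matches(const int *data, int n, int target, atomic_long *total) {
    #pragma omp parallel
    {
        long local = 0;                          /* thread-private: no coherence traffic */
        #pragma omp for
        for (int i = 0; i < n; i++)
            if (data[i] == target) local++;
        atomic_fetch_add(total, local);          /* one contended RMW per thread */
    }
}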
Prefetch distance = ceiling(memory_latency_cycles / loop_iteration_cycles). For example, if memory latency is 200 cycles and one loop iteration takes 25 cycles, prefetch 200/25 = 8 iterations ahead. For L1 prefetch from L2, use shorter distances (e.g., 8 iterations); for L2 prefetch from memory, use longer distances (e.g., 64 iterations). Intel compilers with -O2 or higher automatically set prefetch level 3. Tuning can yield 35% or more bandwidth improvement - one test showed performance increase from 129 GB/s to 175 GB/s.
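Putting the formula into code for an indirect access pattern the hardware prefetcher cannot predict (distance of 8 per the worked example; tune empirically):

#include <xmmintrin.h>
float gather_sum(const float *data, const int *idx, int n) {
    const int DIST = 8;                           /* ~ memory latency / cycles per iteration */
    float s = 0.0f;
    for (int i = 0; i < n; i++) {
        if (i + DIST < n)
            _mm_prefetch((const char *)&data[idx[i + DIST]], _MM_HINT_T0);
        s += data[idx[i]];
    }
    return s;
}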
Use Array of Structures (AoS) when: accessing all/most fields of each element together, iterating through elements with good spatial locality, or element-wise operations are common. Use Structure of Arrays (SoA) when: accessing only 1-2 fields across many elements, SIMD vectorization is important (SoA enables efficient vector loads), or cache utilization of accessed fields matters more than element locality. Performance difference can be 2-10x depending on access pattern. Consider hybrid AoSoA (Array of Structures of Arrays) for balanced access patterns with SIMD requirements.
System call overhead is 100-1000 cycles on modern Linux (1000-5000 cycles on Windows). Breakdown: mode switch (user to kernel): 50-150 cycles; syscall dispatch and validation: 100-300 cycles; actual work varies by call; return (kernel to user): 50-150 cycles. Mitigation: batch operations (one write of 1MB vs 1000 writes of 1KB), use memory-mapped I/O to avoid read/write syscalls, use vDSO for time queries (gettimeofday), buffer I/O in userspace. KPTI (Spectre mitigation) increased syscall cost by 100-300 cycles due to page table switching.
Misaligned access crossing a cache line boundary costs 16 cycles on Intel Atom (vs 4 cycles for aligned) - a 4x penalty. On modern Core i7 (Sandy Bridge and newer), there is no measurable penalty for misaligned access that doesn't cross cache lines. However, access spanning two cache lines always incurs double the memory traffic and potential 2x latency. Rule of thumb: always align data to its natural size (4-byte ints to 4-byte boundaries, 8-byte doubles to 8-byte boundaries), and align hot data structures to 64-byte cache line boundaries to ensure single-line access.
SIMD string operations (strlen, memcmp, memcpy, strchr) become beneficial for strings longer than 16-32 bytes when using SSE, or 32-64 bytes for AVX. Below these lengths, scalar loops with branch prediction for early termination often win. Modern glibc/MSVC runtime libraries automatically dispatch to SIMD versions for larger strings. For custom implementations: SSE can process 16 bytes per iteration, AVX 32 bytes, with ~2 cycle per vector comparison. For memcpy specifically, SIMD helps above 64 bytes; for <64 bytes, use rep movsb (enhanced on recent CPUs) or unrolled scalar moves.
x86-64 provides 16 general-purpose 64-bit registers (RAX-RDX, RSI, RDI, RBP, RSP, R8-R15), but practically only 14-15 are available for computation (RSP is stack pointer, RBP often frame pointer). This is doubled from x86-32's 8 registers. Additionally, there are 16 XMM/YMM/ZMM vector registers for SIMD. When your algorithm needs more than 12-14 variables live simultaneously, expect register spills to stack. Loop unrolling increases register pressure - balance unroll factor against available registers to avoid costly spills inside hot loops.
Use conditional move (cmov) or SIMD min/max instructions when branches would be unpredictable. Branch-free min: 'min = y ^ ((x ^ y) & -(x < y))' or compiler intrinsics '_mm_min_ps'. Cost: cmov is 1-2 cycles vs potential 15+ cycles for mispredicted branch. However, cmov creates data dependency while branch allows speculative execution. Rule: use branchless when prediction accuracy <75%, or always for SIMD code (no branching within vector). Modern compilers often generate cmov for simple ternary operators at -O2; use '-fno-if-conversion' to force branches if needed.
Denormal (subnormal) floating-point operations can be 10-100x slower than normal operations on x86 CPUs. When results become denormal (very small numbers near zero), the CPU falls back to microcode, taking 50-200 cycles instead of 4-5 cycles. Detection: unexpected performance cliffs when values approach zero. Solutions: enable Flush-To-Zero (FTZ) and Denormals-Are-Zero (DAZ) modes via MXCSR register (_MM_SET_FLUSH_ZERO_MODE, _MM_SET_DENORMALS_ZERO_MODE), add small epsilon to prevent denormals, or redesign algorithm to avoid near-zero intermediate values.
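Enabling both modes at thread startup (affects all subsequent SSE/AVX math on that thread):

#include <xmmintrin.h>   /* _MM_SET_FLUSH_ZERO_MODE */
#include <pmmintrin.h>   /* _MM_SET_DENORMALS_ZERO_MODE */
void enable_ftz_daz(void) {
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);         /* denormal results flush to 0 */
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON); /* denormal inputs treated as 0 */
}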
The ridge point is calculated as: Peak_Performance(FLOP/s) / Peak_Bandwidth(bytes/s). This gives the minimum operational intensity (FLOP/byte) needed to achieve peak compute performance. For example: NVIDIA A100 with 19,500 GFLOPS and 1,555 GB/s bandwidth has ridge point of 19500/1555 = 12.5 FLOP/byte. Code with operational intensity below the ridge point is memory-bound; above it is compute-bound. Typical ridge points: CPU ~1-4 FLOP/byte, GPU ~10-50 FLOP/byte. Optimize memory access for memory-bound kernels; optimize compute for compute-bound.
Modern reorder buffer (ROB) sizes: Intel Skylake/Ice Lake: 224-352 entries; AMD Zen 3/4: 256 entries; Apple M1/M2: 600+ entries. The ROB limits how far ahead the CPU can execute speculatively. For hiding latency, ensure there are enough independent instructions to fill the ROB before hitting a long-latency operation. Example: with 200-entry ROB and 4-wide issue, ~50 cycles of independent work can be found. If your loop has only 20 instructions and one memory access per iteration, you need the loop running 10+ iterations ahead to fill the window.
Use lookup tables when: computation takes >20 cycles and table fits in L1 cache (<=32KB), access pattern is unpredictable (no benefit from branch prediction), or function is called millions of times. Use computation when: table would exceed L2 cache (causing cache pollution), access pattern allows branch prediction to work well, or computation is simple (<10 cycles). Typical breakeven: 256-entry byte table (256 bytes) is almost always beneficial; 64K-entry table (64KB+) requires careful analysis. Memory latency (4 cycles L1) vs compute (1-20 cycles) determines winner.
TLB (Translation Lookaside Buffer) miss penalty varies by level: L1 ITLB miss: 7-10 cycles (usually hidden by out-of-order execution); STLB (second-level TLB) miss triggering page walk: 20-100+ cycles depending on page table depth and cache residency of page table entries. A full 4-level page walk hitting DRAM at each level could cost 400+ cycles. Reduce TLB misses by: minimizing working set to fit in TLB coverage, using huge pages (2MB instead of 4KB - requires 512x fewer TLB entries), and improving memory access locality.
Throughput (instructions per cycle) on modern x86: simple ALU (add, sub, logical): 4-6 per cycle; complex ALU (multiply): 1-2 per cycle; integer divide: 0.03-0.1 per cycle (10-30 cycles latency); FP add/multiply: 2 per cycle; FP divide: 0.2-0.5 per cycle; loads: 2-3 per cycle (L1 hit); stores: 1-2 per cycle. These are throughput limits - actual IPC depends on dependencies. Key insight: division is 10-100x more expensive than multiplication; replace 'x/const' with 'x * (1/const)' where possible. Measure instruction mix to understand bottlenecks.
Main memory (DRAM) access latency is 150-300 cycles, approximately 60-100 nanoseconds on modern systems. This is 100x slower than L1 cache. The latency includes: L3 miss detection (~40 cycles), memory controller processing, DRAM row activation (CAS latency), and data transfer. DDR4 typical latency: 60-80 ns; DDR5: 70-90 ns (higher frequency but also higher CAS latency). Memory-bound code can see processors stalling for hundreds of cycles per access. This 'memory wall' makes cache optimization crucial for performance.
L2 cache access latency is 10-14 cycles on modern CPUs, approximately 4-5 nanoseconds. Specifically: Intel Kaby Lake: 12 cycles (4.8 ns at 2.5 GHz); Intel Haswell: 11 cycles (4.2 ns at 2.6 GHz); AMD Zen: ~12 cycles. L2 is about 3-4x slower than L1 but holds 8-16x more data (typically 256KB-1MB per core). L2 cache is typically unified (both instructions and data) and 4-8 way set associative. For algorithms with working sets between 32KB and 256KB, L2 hit rate is the critical performance metric.
Order struct fields by: 1) Access frequency (hot fields first), 2) Access pattern (fields accessed together should be adjacent), 3) Size descending (reduces padding). Keep hot fields within first 64 bytes (one cache line). Group read-only fields separately from read-write to prevent false sharing. For arrays of structs vs struct of arrays (AoS vs SoA): use AoS when accessing all fields per element, SoA when accessing one field across all elements. Typical optimization: place most-accessed 2-3 fields at struct start, ensuring they fit in first cache line load.
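Illustrative reordering (field names are made up; sizes assume a 64-bit target):

struct OrderBad {        /* 72 bytes: hot fields straddle padding and spill past one line */
    char   flags;        /* hot */
    double price;        /* hot - 7 bytes of padding before it */
    int    id;           /* hot */
    char   name[40];     /* cold */
    void  *owner;        /* cold - 4 more bytes of padding before it */
};
struct OrderGood {       /* 64 bytes: exactly one cache line */
    double price;        /* hot */
    int    id;           /* hot */
    char   flags;        /* hot - all hot fields in the first 16 bytes */
    void  *owner;        /* cold */
    char   name[40];     /* cold */
};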
L1 instruction cache is typically 32KB on modern x86 CPUs (Intel and AMD). Hot code paths should fit within this to avoid instruction cache misses. Key implications: aggressive loop unrolling may hurt if it expands hot loop beyond 32KB; inline functions judiciously to avoid code bloat; keep related functions together for better I-cache locality. Measure instruction cache miss rate if performance is unexpectedly poor after optimization. Unrolling from 4x to 16x might improve data-path efficiency but hurt overall performance if code no longer fits in I-cache.
Typical TLB coverage with 4KB pages: L1 DTLB: 64-128 entries = 256-512KB; L2 STLB: 1024-2048 entries = 4-8MB. Working sets exceeding TLB coverage suffer page walk penalties. When TLB miss rate >1%, consider huge pages. With 2MB huge pages: same 1024 STLB entries cover 2GB. Signs of TLB pressure: high DTLB miss rate in profiler, performance cliff at specific working set sizes, random access patterns over large memory regions. Solutions: huge pages, improve memory locality, reduce working set, or use cache blocking to reuse TLB entries.
A good starting prefetch distance for L1 (prefetching from L2 to L1) is 8 iterations ahead. This accounts for L2 access latency of approximately 12 cycles divided by typical loop iteration time. Fine-tune based on your specific loop: if each iteration takes 7 cycles and L2 latency is 56 cycles, use 56/7 = 8 iterations. Prefetching too early wastes cache space; too late fails to hide latency. Use compiler pragmas like '#pragma prefetch var:hint:distance' for manual tuning.
Plan for 1-3MB of LLC per core for working set sizing. Typical configurations: Intel desktop: 2MB per core (16MB shared / 8 cores); AMD Zen 3: 4MB per core (8 cores share a 32MB L3 per CCX); Server CPUs: 1.25-2.5MB per core. Note L3 is shared, so under load, effective per-core share decreases. For multi-threaded optimization: total_working_set should fit in total_L3 * 0.7 (leave room for OS and other threads). For single-threaded: working set up to full L3 is reasonable but benefits from L2 blocking for hot data.
When branch prediction accuracy falls below 75%, branchless code (using conditional moves, SIMD masks, or arithmetic) is typically faster than branching code. At 75% prediction accuracy, the cost of mispredictions roughly equals the cost of conditional move data dependencies. Above 75% accuracy, keep the branch. Below 75%, convert to branchless. This 75% threshold is used by compilers as a heuristic for deciding whether to emit cmov instructions. Note: if data comes from slow memory (L3 or DRAM), branches may still win because early speculative loads hide latency.
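A small sketch of the branchy vs branchless trade-off for a clamp-to-zero operation (the compiler may already emit cmov for the branchy form at -O2):

    // Branchy: fast when the predictor is right, 10-30 cycle penalty when wrong.
    int clamp_branchy(int x) { return (x < 0) ? 0 : x; }

    // Branchless: constant cost, no misprediction risk, but adds a data dependency.
    int clamp_branchless(int x) {
        int keep = -(int)(x >= 0);   // all-ones mask if x >= 0, else zero
        return x & keep;
    }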
Vectorized loops should process at least 4x the vector width iterations to amortize setup and cleanup overhead. For AVX2 processing 8 floats per iteration: minimum 32 iterations; for AVX-512 processing 16 floats: minimum 64 iterations. Setup costs include: loading constants into vector registers, handling alignment, setting up masks. Cleanup handles remainder elements. For loops below threshold, consider: scalar fallback, using narrower vectors (SSE instead of AVX), or accumulating small arrays before vectorized processing. Compile-time-known small counts may benefit from full unrolling instead.
Indirect function calls (through pointers or vtables) are typically 2-4x slower than direct calls. One benchmark showed indirect calls running 3.4x slower. The performance hit comes from: inability to inline, branch prediction miss on first call to new target, and additional memory load to fetch function address. Virtual function calls in C++ fall into this category. Mitigation: devirtualization through final/sealed classes, link-time optimization (LTO), profile-guided optimization (PGO), or redesigning hot paths to avoid polymorphism. Consider templates or CRTP for static polymorphism in performance-critical code.
Start with initial backoff of 1-4 iterations, double after each failed attempt, cap maximum at 1000-10000 iterations before falling back to blocking. Common implementation: initial=1, multiply by 2 each iteration, max_backoff=1000 cycles, then call yield() or switch to mutex. Exponential backoff reduces cache line bouncing and improves throughput under contention. Without backoff, test-and-set spinlocks cause severe cache coherence traffic. TTAS (test-and-test-and-set) with exponential backoff performs well even with many processors competing for the same lock.
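A minimal TTAS spinlock sketch with exponential backoff, using the starting parameters above; _mm_pause and the 1024-spin cap are illustrative choices:

    #include <atomic>
    #include <thread>
    #include <immintrin.h>

    class SpinLock {
        std::atomic<bool> locked{false};
    public:
        void lock() {
            int backoff = 1;
            for (;;) {
                // Test (plain read) before test-and-set to limit cache line bouncing.
                if (!locked.load(std::memory_order_relaxed) &&
                    !locked.exchange(true, std::memory_order_acquire))
                    return;
                for (int i = 0; i < backoff; ++i) _mm_pause();
                if (backoff < 1024) backoff *= 2;        // exponential backoff
                else std::this_thread::yield();          // capped: fall back to yielding
            }
        }
        void unlock() { locked.store(false, std::memory_order_release); }
    };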
Integer division is 10-30x slower than multiplication: integer multiply latency 3-4 cycles, throughput 1 per cycle; integer divide latency 20-80 cycles, throughput 0.03-0.1 per cycle (roughly one division every 10-30 cycles). Optimization: replace 'x/const' with multiplication by magic number (compiler does this automatically for constants); replace 'x%power_of_2' with 'x&(power_of_2-1)'; for runtime divisors, consider libdivide or caching the magic multiplier. Integer modulo has same cost as division. Impact: a tight loop with division can be 10x slower than equivalent multiplication.
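A sketch of both strength reductions; the magic constant for /10 is the standard unsigned 32-bit reciprocal, shown only to illustrate what the compiler emits for constant divisors:

    #include <cstdint>

    // x % 8 for unsigned x: one AND instead of a divide.
    uint32_t mod8(uint32_t x) { return x & 7u; }

    // x / 10 via multiply-by-reciprocal (what -O2 generates for a constant divisor).
    uint32_t div10(uint32_t x) {
        return (uint32_t)(((uint64_t)x * 0xCCCCCCCDull) >> 35);
    }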
Expected vectorization speedup = min(SIMD_width, arithmetic_intensity * memory_bandwidth / scalar_compute_rate). For compute-bound code: theoretical max is SIMD width (4x for SSE float, 8x for AVX float). For memory-bound code: speedup is limited by bandwidth, typically 1.5-3x regardless of SIMD width. Practical rule: expect 50-70% of theoretical SIMD width speedup for well-vectorized compute-bound code, and 1.5-2x for memory-bound code. Factors reducing speedup: unaligned access, gather/scatter operations, horizontal operations, and remainder handling. Measure actual speedup - it varies significantly by workload.
Float (32-bit) vs double (64-bit) performance: same latency and throughput per instruction on modern x86 for scalar operations; 2x SIMD throughput for float (8 floats vs 4 doubles in 256-bit AVX register); 2x memory bandwidth efficiency for float (half the bytes). Use float when: precision is sufficient (7 significant digits), memory bandwidth is bottleneck, or SIMD width matters. Use double when: numerical precision needed (15 digits), accumulating many values (less rounding error), or mixing with double-precision libraries. Memory-bound code sees ~2x speedup from float; compute-bound sees less difference.
Parallel merge sort becomes beneficial when array size exceeds 10,000-100,000 elements, depending on hardware and element size. Below this threshold, spawn/join overhead exceeds parallel speedup. Rule of thumb: switch to sequential sort when subarray falls below 1000-5000 elements. This hybrid approach (parallel at top levels, sequential at leaves) provides best performance. Additional considerations: for 2 cores, threshold ~50,000; for 8 cores, threshold ~20,000; for 32+ cores, threshold can be as low as 5,000-10,000 elements. Always benchmark on target hardware.
Keep stack allocations under 64KB per function to avoid stack overflow risk. Default stack size is 1MB on Windows, ~8MB on Linux, but deep recursion or nested calls reduce available space. Use heap (malloc/new) for: allocations over 64KB, runtime-determined sizes, data that must outlive the function, or dynamic data structures. Stack allocation is essentially free (1 cycle stack pointer adjustment), while malloc overhead is 50-100+ cycles plus potential system calls for large allocations. For performance-critical code with known sizes, prefer stack or pre-allocated pools.
Code is memory-bound when operational intensity is below the ridge point (typically <1-4 FLOP/byte on CPUs, <10-15 FLOP/byte on GPUs). Examples: DAXPY (y=ax+y) has intensity of 2n FLOP / 24n bytes = 0.083 FLOP/byte - heavily memory-bound. SpMV (sparse matrix-vector) typically has 0.17-0.25 FLOP/byte - memory-bound. Dense matrix multiplication can achieve 2n^3 FLOP / 3n^2*8 bytes for large n, approaching 100+ FLOP/byte - compute-bound. Low-intensity kernels benefit from memory optimizations; high-intensity from compute optimizations.
For L1 cache blocking, use approximately sqrt(L1_size/3) elements. For a typical 32KB L1 data cache with 4-byte floats, this gives sqrt(32768/3/4) = approximately 52 elements, or roughly 50-100 elements per dimension for 2D blocking. The factor of 3 accounts for multiple arrays (input, output, temporary) that need to fit simultaneously. Always ensure the total working set of your blocked computation fits within L1 with room for other data the processor needs.
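A sketch of 2D blocking with a block size near sqrt(L1_size/3) elements; the transpose and the block size of 48 are illustrative and should be tuned by benchmarking:

    constexpr int B = 48;   // ~sqrt(32KB / 3 / 4 bytes), rounded down

    void transpose_blocked(const float* src, float* dst, int n) {
        for (int ii = 0; ii < n; ii += B)
            for (int jj = 0; jj < n; jj += B)
                for (int i = ii; i < ii + B && i < n; ++i)
                    for (int j = jj; j < jj + B && j < n; ++j)
                        dst[j * n + i] = src[i * n + j];   // both tiles stay cache-resident
    }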
GCC's default inline limit is 600 pseudo-instructions for functions explicitly marked inline (controlled by the -finline-limit flag). For auto-inlining at -O2/-O3, functions up to about 40-50 instructions may be inlined based on various heuristics. The 'pseudo-instruction' count is an abstract measure that may change between GCC versions and does not directly map to assembly instructions. Functions called only once are more aggressively inlined regardless of size. Use -Winline to get warnings when inline requests are denied due to size or other factors.
Use pool allocators when: allocating >1000 objects of the same size per second, object lifetime is predictable (bulk allocate/free), or allocation overhead shows up in profiling. Pool allocators reduce malloc overhead from 50-100 cycles to 10-20 cycles by eliminating search and fragmentation handling. Implementation: pre-allocate chunks of N objects, maintain free list with O(1) alloc/free. Common thresholds: objects <256 bytes benefit most; allocation frequency >10,000/second sees significant gains. Memory pools also improve cache locality since objects are contiguous.
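A minimal fixed-size pool sketch with an intrusive free list and O(1) alloc/free; it is single-threaded and skips error handling beyond returning nullptr when exhausted:

    #include <cstddef>
    #include <vector>

    template <typename T>
    class Pool {
        union Slot { Slot* next; alignas(T) unsigned char storage[sizeof(T)]; };
        std::vector<Slot> slots;
        Slot* free_head = nullptr;
    public:
        explicit Pool(std::size_t n) : slots(n) {
            for (std::size_t i = 0; i + 1 < n; ++i) slots[i].next = &slots[i + 1];
            if (n) { slots[n - 1].next = nullptr; free_head = &slots[0]; }
        }
        void* alloc() {                    // O(1): pop from the free list
            if (!free_head) return nullptr;
            Slot* s = free_head;
            free_head = s->next;
            return s->storage;
        }
        void free(void* p) {               // O(1): push back onto the free list
            Slot* s = static_cast<Slot*>(p);
            s->next = free_head;
            free_head = s;
        }
    };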
Target L1 data cache hit rate of 95% or higher for well-optimized code. Hit rates above 80% are acceptable for general code. Below 60% indicates serious access pattern problems requiring investigation. With L1 hit latency of 1-4 cycles, a miss penalty of 10-12 cycles to L2, and 100+ cycles when misses fall through to DRAM, the impact compounds quickly: assuming a 1-cycle hit and a ~100-cycle average miss cost, a 97% hit rate gives an average of ~4 cycles per access while a 99% hit rate gives ~2 cycles - a 2x improvement from just a 2% hit rate increase. Improve L1 hit rate through better spatial locality, cache blocking, and prefetching.
Target IPC of 2-4 for general-purpose code on modern superscalar CPUs. Modern wide-issue processors can achieve IPC of 4-6 in ideal conditions with deep pipelines and superscalar execution. Apple M-series chips can exceed IPC of 3 in floating-point intensive tasks. An IPC below 0.7 indicates significant optimization opportunity - the code is likely memory-bound or suffering from pipeline stalls. Memory-bound code typically shows IPC of 0.5-1.0, while compute-bound well-optimized code should achieve IPC of 2.0 or higher.
The 2:1 cache rule states: miss rate of a direct-mapped cache of size N equals the miss rate of a 2-way set-associative cache of size N/2. This means doubling associativity is roughly equivalent to doubling cache size for reducing conflict misses. Practical implications: 8-way set associativity is nearly as effective as fully associative for most workloads; beyond 8-way, diminishing returns set in. When analyzing cache performance, increasing associativity helps with conflict misses but not capacity misses. For software optimization, focus on reducing working set size rather than worrying about associativity.
Keep no more than 10-12 live variables within a hot loop to avoid register spills on x86-64. Techniques to reduce register pressure: keep live ranges short by using variables close to their definitions, avoid excessive loop unrolling (which multiplies live variables), use restrict pointers to enable better register allocation, break complex expressions into simpler ones the compiler can optimize. Register spills inside hot loops cause significant performance degradation due to added memory traffic. When comparing unroll factors, measure performance to find the sweet spot between instruction-level parallelism and register pressure.
Use the widest SIMD available that doesn't cause frequency throttling or portability issues: AVX-512 (512-bit): use when sustained compute-heavy, accept ~10-15% frequency reduction on some Intel CPUs; AVX2 (256-bit): best default choice, supported since Haswell 2013, no frequency penalty; SSE (128-bit): use for maximum compatibility or when code has many scalar operations mixed in. Process data in multiples of SIMD width to avoid remainder loops. For portable code, compile with multiple paths and runtime dispatch based on CPUID.
Start with an unroll factor of 4x for most loops. This provides a good balance between reducing loop overhead and avoiding instruction cache pressure. For SIMD-optimized code, unroll by the SIMD width or multiples of it (e.g., 4x for SSE with floats, 8x for AVX with floats, 16x for AVX-512). Factors of 2x or 4x typically see speed improvements, while going beyond 8x often shows diminishing returns and can hurt performance due to increased code size and instruction cache misses.
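A sketch of 4x unrolling with independent accumulators (which also breaks the floating-point add dependency chain; note that this reassociates the sum, so results can differ slightly):

    float sum_unrolled4(const float* a, int n) {
        float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        int i = 0;
        for (; i + 4 <= n; i += 4) {       // main loop: 4 elements per iteration
            s0 += a[i];
            s1 += a[i + 1];
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        for (; i < n; ++i) s0 += a[i];     // remainder loop
        return (s0 + s1) + (s2 + s3);
    }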
Prevent false sharing by padding thread-local data to cache line boundaries (64 bytes on x86, 128 bytes on Apple M-series). Add 64 bytes of padding between variables accessed by different threads. In C: use '__attribute__((aligned(64)))' or manually insert padding arrays. In Java 8+: use '@Contended' annotation which adds 128 bytes of padding. In Go: use 'cpu.CacheLinePad' between fields. The LMAX Disruptor uses 7 long fields (56 bytes) as padding before and after the cursor. While padding wastes memory, it can provide order-of-magnitude performance improvements in contended scenarios.
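A C++ sketch of per-thread counters kept on separate cache lines via alignas(64):

    #include <atomic>

    // alignas(64) rounds sizeof up to 64, so adjacent array elements never
    // share a cache line and per-thread updates do not invalidate each other.
    struct alignas(64) PaddedCounter {
        std::atomic<long> value{0};
    };

    PaddedCounter counters[8];   // e.g., one counter per worker thread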
Target L2 cache hit rate of 90% or higher. Hit rates below 70% suggest the working set is too large or access patterns cause thrashing. L2 measures how well your working set fits: low rates indicate too many unique data accesses or poor temporal locality. With L2 miss penalty of 20-40 cycles to L3 (or 100-300 cycles to DRAM for L3 misses), even small hit rate improvements matter significantly. Design data structures to fit working sets within L2 size (typically 256KB-1MB per core) and consider cache blocking for larger datasets.
Software pipelining (overlapping iterations) provides 15-30% speedup on in-order cores, especially for arrays that fit in cache. Tests show: for arrays fitting L2 cache, software pipelining gives 18.8-28.8% speedup; unroll-and-interleave (UAI) gives 14.2-21.8% speedup on in-order cores. On out-of-order cores, these techniques provide minimal benefit because the hardware already performs dynamic instruction scheduling. Software pipelining works by splitting loop work into phases (load, compute, store) and overlapping phases from different iterations to hide latencies and enable dual-issue on simple processors.
A good starting prefetch distance for L2 (prefetching from main memory to L2) is 64 iterations ahead. This accounts for DRAM latency of 200-400 cycles divided by typical loop iteration time. For a loop taking 5 cycles per iteration with 300-cycle memory latency, use 300/5 = 60, rounded to 64. Memory prefetch distances must be longer than L1 distances because DRAM latency is 10-20x higher than L2 latency. Benchmark with values from 32 to 128 to find optimal for your workload.
Context switch cost is 1000-10000 cycles (0.5-5 microseconds) depending on working set size and cache pollution. Direct costs: ~1000-2000 cycles for register save/restore and TLB flush. Indirect costs: 5000-50000+ cycles to reload caches with new process working set. For threads sharing address space (no TLB flush needed): 1000-3000 cycles. This is why spinlocks can win for very short critical sections (<1000 cycles) - the context switch from blocking costs more than spinning. Minimize context switches in latency-sensitive code by using thread pinning and avoiding blocking operations.
The .NET JIT compiler has a default inline threshold of 32 bytes of IL (Intermediate Language) code. Methods larger than 32 bytes IL are generally not inlined. The rationale is that for larger methods, the function call overhead becomes negligible compared to method execution time. This is a heuristic that can fail for hot methods just over the threshold. Workarounds include: using [MethodImpl(MethodImplOptions.AggressiveInlining)] attribute to hint for inlining, or manually breaking large methods into smaller ones.
SIMD vectorization typically becomes worthwhile when processing at least 4x the SIMD width elements, so: SSE (128-bit): minimum 16 floats or 16 integers; AVX (256-bit): minimum 32 floats or 32 integers; AVX-512 (512-bit): minimum 64 floats or 64 integers. Below these thresholds, the overhead of setup, remainder handling, and potential alignment adjustments may exceed the parallel processing gains. For very small arrays with unknown size at compile time, the scalar version may actually be faster due to branch overhead for remainder loops.
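An AVX2 sketch (assumes a CPU and build with AVX2, e.g. -mavx2) showing the small-count fallback and the scalar remainder loop:

    #include <immintrin.h>

    void add_arrays(const float* a, const float* b, float* out, int n) {
        int i = 0;
        if (n >= 32) {                               // ~4x the 8-float vector width
            for (; i + 8 <= n; i += 8) {
                __m256 va = _mm256_loadu_ps(a + i);
                __m256 vb = _mm256_loadu_ps(b + i);
                _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));
            }
        }
        for (; i < n; ++i) out[i] = a[i] + b[i];     // remainder / small-n scalar path
    }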
Branch misprediction costs 10-30 cycles on modern x86-64 processors, depending on pipeline depth. AMD Zen 2 has a 19-cycle pipeline, so misprediction costs approximately 19 cycles. Intel processors with deeper pipelines may cost up to 20-25 cycles. This penalty equals the number of pipeline stages from fetch to execute that must be flushed and refilled. For loops with unpredictable branches, this can multiply running time significantly - converting to branchless code can reduce per-element time from 14 cycles to 7 cycles in some cases.
Well-optimized multi-threaded code should achieve 75-85% of peak theoretical memory bandwidth, with 80% being a practical target. Single-threaded code typically achieves 40-60% of peak due to memory-level parallelism limitations. Measured throughput is always below theoretical maximum due to memory controller inefficiencies, DRAM refresh cycles, rank-to-rank stalls, and read-to-write turnaround penalties. If achieving less than 60% of peak bandwidth on memory-bound code, investigate poor spatial locality, cache associativity conflicts, or insufficient prefetching.
An IPC below 0.7 indicates significant room for optimization and limited use of processor capabilities. This typically signals memory-bound execution with frequent cache misses, pipeline stalls from data dependencies, or poor instruction-level parallelism. A CPI (cycles per instruction, the inverse) greater than 1 suggests stall-bound execution. To improve: reduce memory access latency through better cache utilization, eliminate data dependencies through loop unrolling or software pipelining, and ensure sufficient independent instructions for out-of-order execution to exploit.
The maximum effective unroll factor is typically 8x for general code, beyond which diminishing returns set in. For vectorized operations on Intel Xeon Phi, unrolling 16x may be beneficial to fill 512-bit vectors. Key limiting factors include: instruction cache capacity (unrolled code must fit in L1 instruction cache, typically 32KB), register pressure (more unrolling needs more registers), and code bloat affecting branch prediction. Always measure - the only way to determine the optimal factor is through benchmarking.
Bottleneck Diagnosis
70 questions
Signs of branch misprediction bottleneck: (1) Bad Speculation metric above 10-15% in top-down analysis. (2) Branch misprediction rate above 2-5% (measured as mispredictions/total branches). (3) IPC significantly below expected despite low cache miss rates. (4) High branch-misses count in perf stat (thousands per million instructions is problematic). (5) Code with unpredictable conditionals: data-dependent branches, virtual function calls, switch statements with many cases. Performance impact: Each misprediction costs 10-30 cycles on modern x86-64 (pipeline depth). At 5% misprediction rate with branches every 5 instructions, overhead is 10-30% of execution time. Measure with: perf stat -e branches,branch-misses. Solutions: Profile-guided optimization (PGO), branch-free code using CMOV/arithmetic, sorting data to improve prediction, reducing indirect calls.
Pipeline flush (also called pipeline squash) discards all in-flight instructions and restarts from a known good state. Causes: (1) Branch misprediction: Most common cause. Costs 10-30 cycles depending on pipeline depth. (2) Exception/interrupt: Hardware or software exceptions clear the pipeline. (3) Self-modifying code: Modifying instruction memory invalidates fetched/decoded instructions. (4) Memory ordering violations: Speculative loads found to violate ordering must be discarded. (5) Machine clears: Various microarchitectural events (assists, memory disambig failures). Cost calculation: Pipeline depth D means D cycles minimum to refill. With superscalar width W, D*W instructions worth of work discarded. On a 20-stage pipeline, 4-wide machine: up to 80 instructions discarded per flush. With a 5% branch misprediction rate and a branch every 5 instructions on that 20-stage pipeline, expect 0.05 * 20 cycles * 1/5 branches = 0.2 cycles of overhead per instruction - roughly 20% impact at a baseline of ~1 cycle per instruction. Diagnosis: Check Bad Speculation in top-down analysis, branch-miss-rate, machine_clears events.
High load-to-use latency indicates long delays between a load instruction issuing and its data being available for dependent operations. Typical latencies: L1 hit: 4-5 cycles, L2 hit: 12 cycles, LLC hit: 40-50 cycles, DRAM: 200+ cycles. Causes of elevated load-to-use: (1) Cache misses: Each cache level miss adds latency. (2) Bank conflicts: Multiple loads to same cache bank in same cycle. (3) Store-to-load forwarding failures: Load cannot forward from prior store due to address overlap mismatch or timing. Costs 10+ extra cycles. (4) TLB misses: Add page table walk latency. (5) Memory ordering constraints: Loads waiting for prior stores to complete. (6) Snoop latency: Data in modified state on another core. Diagnosis: mem_load_retired.l1_miss, l2_miss, l3_miss give miss distribution. Store forwarding failures have dedicated counters. High load latency with low miss rate suggests forwarding issues. Solutions: Improve locality, avoid aliasing patterns, ensure store-load pairs have matching addresses and sizes, software prefetching.
High LLC Bound with low LLC miss rate indicates the problem is LLC hit latency or bandwidth, not capacity misses. Data is in LLC but accessing it is still slow. Causes: (1) LLC bandwidth saturation: Many cores accessing LLC simultaneously exhaust shared bandwidth. LLC can sustain ~50-100 GB/s but this is shared among all cores. (2) LLC bank conflicts: Accesses to same LLC bank from multiple cores serialize. (3) Snoop traffic: Even for LLC hits, cache coherency checks add latency when other cores may have the data. (4) Cross-core LLC access: Data in LLC slice owned by different core adds interconnect latency. (5) Ring/mesh interconnect congestion: High traffic between cores increases access latency. Detection: LLC Bound metric high in top-down analysis, but LLC miss rate < 5%. LLC hit latency higher than expected (>40-50 cycles). Memory bandwidth < DRAM bandwidth indicates LLC is bottleneck. Solutions: Improve L2 locality to reduce LLC pressure, distribute data access across cores, cache blocking to fit in L2, consider core affinity to keep data in local LLC slice.
High cross-thread synchronization latency indicates threads spend excessive time waiting for each other, limiting parallelism benefits. Manifests as poor scaling: 2x threads does not give 2x throughput. Types and costs: (1) Mutex lock: 20-50 cycles uncontended, unbounded when contended. (2) Atomic operations: 10-100 cycles depending on contention and NUMA distance. (3) Condition variable: 1000+ cycles for signal/wait involving kernel. (4) Memory barriers: 10-50 cycles for ordering. (5) Cache line transfer: 40-300 cycles for shared data movement. Detection: (1) Thread utilization imbalanced - some threads idle waiting. (2) Profile shows time in pthread_* functions, futex, or atomics. (3) Scalability test shows diminishing returns. (4) perf c2c shows high contention on synchronization variables. Critical section analysis: Time in critical section * frequency of entry = serialization bottleneck. Amdahl's law applies: 10% serial code limits speedup to 10x regardless of cores. Solutions: Reduce critical section length, use reader-writer locks, lock-free algorithms, partition data to reduce sharing, batch work to reduce synchronization frequency, work stealing instead of central queues.
Loop vectorization failure occurs when compiler cannot or chooses not to use SIMD instructions for a loop. Results in scalar execution at 1/4 to 1/16 potential throughput. Indicators: (1) Compiler reports: -fopt-info-vec-missed (GCC), -Rpass-missed=loop-vectorize (Clang), -qopt-report (Intel) show specific reasons. (2) No SIMD instructions in disassembly of hot loops. (3) Performance far below theoretical (e.g., single-precision 8x slower than expected with AVX2). Common failure reasons: (1) Pointer aliasing: Compiler cannot prove arrays do not overlap. Fix: restrict keyword. (2) Non-contiguous access: Indirect indexing, structure fields. (3) Function calls in loop: Unknown side effects. Fix: inline or use SIMD-enabled functions. (4) Loop-carried dependencies: Each iteration depends on previous. (5) Conditionals: if-statements prevent vectorization or require masking. (6) Reduction not recognized: Sum patterns not identified. (7) Trip count too small: Vectorization overhead exceeds benefit. Diagnosis: Always check compiler vectorization reports for hot loops. Compare expected vs achieved throughput. Solutions: Address specific blocker from compiler report, pragma hints (#pragma omp simd), manual intrinsics as last resort.
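A sketch of the aliasing fix: __restrict (a common compiler extension in C++; plain restrict is standard C99) promises the compiler the arrays do not overlap, which is often enough for -O3 to vectorize the loop:

    void saxpy(float* __restrict out, const float* __restrict a,
               const float* __restrict b, float alpha, int n) {
        // With the no-alias promise, the compiler can load/compute/store 4-16
        // elements per iteration instead of falling back to scalar code.
        for (int i = 0; i < n; ++i)
            out[i] = alpha * a[i] + b[i];
    }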
Poor prefetcher effectiveness occurs when hardware prefetching fails to bring data to cache before needed, or brings unnecessary data (pollution). Signs of prefetcher miss: (1) High cache miss rate despite regular access patterns. (2) Memory latency dominates despite sequential or strided access. (3) Memory bandwidth underutilized (prefetcher would increase bandwidth usage). Signs of prefetcher pollution: (1) Increased LLC evictions without corresponding hits. (2) Working set appears to exceed cache when it should fit. (3) Performance improves when prefetchers disabled (unusual). When prefetchers fail: (1) Irregular access patterns: Pointer chasing, hash table lookups. (2) Stride too large or irregular: Beyond prefetcher detection range (~1KB stride typical limit). (3) Multiple streams: Exceeds prefetcher stream detection limit (typically 4-8). (4) Data-dependent access: Address depends on loaded value. Detection: Compare LLC bandwidth to DRAM bandwidth - gap indicates prefetch activity. prefetch.* counters show prefetch rates. Disable prefetchers (BIOS/MSR) and compare - similar performance means prefetchers not helping. Solutions: Software prefetching (_mm_prefetch), data layout optimization, algorithm changes for predictable access, increase prefetch distance for high-latency access.
Top-down analysis showing mostly Retiring (>70-80%) indicates the code is executing efficiently from a microarchitectural perspective - the CPU is spending most cycles doing useful work rather than stalling or speculating incorrectly. However, this does not mean the code is optimal. High Retiring scenarios: (1) Compute-bound, well-optimized: Ideal case - algorithm is efficient and hardware well-utilized. Further optimization requires algorithmic changes. (2) Inefficient algorithm executed efficiently: CPU runs fast, but running unnecessary operations. O(n^2) algorithm with 90% Retiring is still worse than O(n) with 60% Retiring. (3) Scalar code that could be vectorized: Retiring individual operations when SIMD could do 4-8x more. Still high Retiring but low throughput. (4) Micro-coded operations: Complex instructions retiring but many uops per instruction. What to check with high Retiring: (1) Is IPC near theoretical maximum? If Retiring 80% but IPC only 1.0 on 4-wide machine, investigate. (2) Compare instruction count to theoretical minimum for algorithm. (3) Check if vectorization is enabled and effective. (4) Look at algorithmic efficiency, not just hardware utilization.
High Frontend Bound percentage (above 15-20%) indicates the instruction fetch and decode pipeline cannot supply enough micro-ops to keep the backend busy. Causes: (1) Instruction cache (I-cache) misses: Code working set exceeds L1I (typically 32KB). Common in large codebases with poor locality. Measure with perf stat -e L1-icache-load-misses. (2) Instruction TLB (ITLB) misses: Code spans many pages. Each ITLB miss triggers page table walk (100-1000 cycles). (3) Branch misprediction recovery: Frontend stalls while fetching from correct path. (4) Decoder bottlenecks: Complex instructions (microcoded) or instruction alignment issues. (5) DSB (micro-op cache) misses: Code too large or poorly aligned for micro-op cache. Diagnosis: Check Frontend Bound breakdown - Fetch Latency (misses) vs Fetch Bandwidth (decoder limitations). Solutions: Improve code locality, use PGO for hot/cold separation, reduce code size, improve branch prediction.
High instruction cache (I-cache) miss rate causes: (1) Large code working set: Application code exceeds L1I capacity (typically 32KB). Common in large applications, heavy use of templates/generics, or excessive inlining. (2) Poor code locality: Hot code paths scattered across memory. Functions called together should be placed together. (3) Code bloat: Duplicate instantiations (C++ templates), excessive inlining, large switch statements. (4) Indirect calls and virtual functions: Unpredictable targets fragment instruction access patterns. (5) Just-in-time compilation: Dynamically generated code may have poor layout. Impact: I-cache misses stall the entire pipeline as no instructions can be decoded. Frontend Bound increases significantly. Miss rate above 1-2% per instruction is concerning. Diagnosis: perf stat -e L1-icache-load-misses,instructions. Solutions: Profile-guided optimization (PGO), link-time optimization (LTO), hot/cold code splitting, reducing code size, improving branch prediction to avoid speculative fetches from wrong paths.
High retirement rate (Retiring metric > 50-70%) with poor performance indicates the code is executing efficiently but doing unnecessary work. The CPU is busy with useful operations, but those operations are not optimal for the task. Causes: (1) Algorithmic inefficiency: O(n^2) when O(n log n) exists. Code runs fast but does too much. (2) Suboptimal instruction selection: Using scalar operations when SIMD available, software emulation of hardware operations. (3) Excessive microcode: Some complex instructions decode to many micro-ops (e.g., REP MOVS for small copies). (4) Unnecessary operations: Redundant calculations, unused results, excessive memory copying. (5) Compiler missed optimizations: Unrolling too much, not using available instructions. Diagnosis: Compare instruction count to theoretical minimum. If retiring 70% but executing 10x necessary instructions, algorithmic change needed. Solutions: Algorithm optimization (biggest wins), enable better compiler optimizations (-O3, -march=native, PGO), use optimized libraries (BLAS, memcpy), profile-guided code review.
Low effective CPU utilization indicates CPUs appear busy (high %user or %sys) but actual useful throughput is low. This differs from idle CPUs. Causes: (1) Spinlocks/busy-waiting: CPUs spin consuming cycles but making no progress. Shows as high CPU, low throughput, high context switch or no context switch (pure spin). (2) Cache coherency traffic: CPUs executing but waiting for cache line transfers. (3) Memory-bound with contention: All cores competing for limited memory bandwidth. (4) Lock convoys: Threads repeatedly wake, find lock held, sleep. (5) False sharing: CPUs execute but data moves between caches constantly. Diagnosis: Compare CPU utilization vs throughput metrics (transactions/sec, IOPS, requests/sec). If doubling cores does not improve throughput, look for serialization. Top-down analysis may show high Backend Bound despite apparent CPU activity. perf c2c reveals cache contention. Solutions: Depends on cause - reduce locking, fix false sharing, partition data, improve memory access patterns, use lock-free algorithms.
Instruction micro-fusion failures occur when instruction combinations that could execute as single micro-op are split into multiple, reducing throughput. Micro-fusion combines load+operation or address calculation into single uop. Fusion fails when: (1) Unsupported addressing mode: RIP-relative with index, or complex addressing with >2 registers. (2) Instruction too long: Combined instruction exceeds decoder length limits. (3) Specific instruction combinations: Some ops cannot fuse (e.g., immediate operands in certain positions). (4) Memory operand not first source: Some fusions require specific operand ordering. (5) Segment override: Non-default segment registers prevent fusion. Impact: Un-fused operations use extra decoder/rename bandwidth, potentially limiting frontend throughput. In tight loops, this can reduce IPC by 10-20%. Detection: Compare retired uops to instruction count. Ratio significantly above 1.0 suggests fusion opportunities lost. Intel IACA tool (now deprecated) or llvm-mca analyze this. VTune uop count metrics. Solutions: Use simple addressing modes (base+displacement), reorder operands per Intel optimization manual recommendations, avoid RIP-relative addressing in hot loops where possible, let compiler optimize with -O3 (it knows fusion rules).
High machine clear rate indicates the CPU is frequently flushing the pipeline for reasons other than branch misprediction, discarding speculative work. Machine clears are expensive: 50-100+ cycles per clear. Causes: (1) Self-modifying code: Writing to instruction memory invalidates cached decoded instructions. (2) Memory ordering violations: Out-of-order load executed before older store to same address must re-execute. (3) FP exceptions requiring precise handling. (4) Page fault during speculative execution. (5) Assists: Denormals, precision exceptions triggering microcode assist. Detection: machine_clears counters in perf: machine_clears.count, machine_clears.memory_ordering, machine_clears.smc. Rate above 0.1% of cycles is concerning. In top-down analysis, this shows as Bad Speculation > Machine Clears rather than Branch Mispredicts. Solutions: By type - SMC: avoid modifying code, use proper synchronization. Memory ordering: fix data races, use proper atomics. FP assists: enable FTZ/DAZ, initialize data properly. Avoid patterns that cause speculative execution across potential exception points.
High SMT contention indicates two logical cores sharing a physical core are competing for resources, limiting benefit of hyperthreading. Shared resources: Execution units, caches (L1/L2), TLBs, branch predictor tables, store buffer entries. Contention patterns: (1) Both threads compute-bound: Fighting for execution units. SMT benefit may be negative (-10%). (2) One memory-bound, one compute-bound: Good SMT fit - ~20-30% throughput gain. (3) Both memory-bound: Shared cache pressure, limited benefit. (4) Resource-specific conflicts: Both threads need same execution port, TLB pressure from both. Detection: Compare single-thread vs SMT performance. If SMT < 1.15x single-thread throughput, contention is significant. Measure per-thread IPC - if each thread gets <50% of single-thread IPC, high contention. Intel metrics: Tma_info_thread.bottleneck shows SMT impact. Solutions: Thread affinity to separate physical cores for independent work, pair complementary workloads on same physical core, disable SMT for latency-critical single-threaded work, partition threads by resource needs (memory-bound with compute-bound), consider SMT-aware scheduling.
Memory latency histogram shows distribution of memory access latencies, revealing where data comes from and identifying outliers. Typical latency bands: (1) 0-10 cycles: L1 cache hits (ideal). (2) 10-20 cycles: L2 cache hits. (3) 20-50 cycles: L3/LLC cache hits. (4) 50-100 cycles: Local DRAM access. (5) 100-200 cycles: Remote DRAM (NUMA). (6) 200+ cycles: Contested cache lines, TLB misses, page faults. Interpretation: (a) Bimodal distribution (peaks at L1 and DRAM): Working set exceeds cache, some data hot (hits L1), rest cold (misses to DRAM). Optimize for better locality. (b) Long tail (few very high latency): Occasional pathological cases - page faults, false sharing, NUMA issues. Investigate outliers. (c) Peak at LLC, few DRAM: Working set fits in LLC, good. (d) Mostly DRAM latency: Memory-bound, need algorithmic changes. Actionable insights: If 50% of accesses hit DRAM, halving those would roughly halve memory time. Focus optimization on access patterns causing the expensive latencies. Use histogram to set goals - e.g., 'move 80% of accesses to L2 or better.'
Backend Bound with high Core Bound (vs Memory Bound) indicates execution unit saturation or port contention - the CPU has data but cannot process it fast enough. Causes: (1) Execution port contention: Too many instructions competing for same execution units. Modern Intel CPUs have 6 ports with specific capabilities (ports 0,1,5 for ALU; ports 2,3 for loads). Measure with port utilization metrics. (2) Divider/special unit bottleneck: Division and some transcendental operations have low throughput (single unit). Sequences of divisions serialize. (3) Long latency operations: Floating-point division (10-20 cycles), some SIMD operations. Even with multiple units, latency limits throughput if dependent. (4) Register pressure: Excessive spilling to memory. Solutions: Instruction scheduling to balance port usage, strength reduction (replace division with multiplication), loop unrolling to expose ILP, vectorization to use different ports.
High structural hazard rate indicates multiple instructions competing for the same hardware resource simultaneously, causing serialization. Modern CPUs have multiple execution units to minimize this, but bottlenecks occur with: (1) Division unit: Most CPUs have single integer and single FP divider. Sequences of divisions serialize completely. Division throughput is 1 per 10-20 cycles. (2) Specific execution ports: Intel CPUs have specialized ports (e.g., shuffle operations only on port 5). Heavy use of specific instructions saturates that port. (3) Load/store units: Typically 2 load + 1 store per cycle maximum. Memory-heavy code can saturate. (4) Branch units: Usually one branch per cycle. Tight loops with multiple branches may bottleneck. Diagnosis: VTune Microarchitecture Exploration shows port utilization. Look for ports at 70-100% utilization while others are idle. Solutions: Instruction scheduling to spread across ports (compilers do this), strength reduction (avoid divisions), use SIMD to process more data with fewer instructions.
Frequent page faults indicate the application is accessing memory not currently mapped in physical RAM, requiring kernel intervention. Types: (1) Minor faults (soft): Page in memory but not in page table. Cost: ~1-10 microseconds. (2) Major faults (hard): Page on disk, must be read. Cost: 1-10+ milliseconds for HDD, 0.1-1ms for SSD. Causes: (1) Insufficient RAM: Working set exceeds physical memory causing swap. (2) Memory-mapped files: Lazy loading triggers faults on access. (3) Demand paging: New allocations faulted on first access. (4) COW (Copy-on-Write): Fork or mmap pages faulted on write. Diagnosis: perf stat -e page-faults,minor-faults,major-faults. Major fault rate above 1-10/sec indicates swapping - serious performance problem. Minor faults in thousands/sec may be acceptable for initialization but not steady-state. vmstat shows si/so (swap in/out). Solutions: Increase RAM, reduce working set, use huge pages (reduces TLB misses and minor faults), mlock for latency-critical data, prefault memory pools.
Register spilling occurs when the compiler runs out of architectural registers and must temporarily store values in memory (stack), adding load/store overhead. Modern x86-64 has 16 GPRs + 16/32 vector registers, but complex code can exhaust them. Causes: (1) Complex expressions: Many live values simultaneously. (2) Function calls: Caller-save registers must be preserved. (3) Loop-carried variables: Many values carried across iterations. (4) Large unrolled loops: Unrolling increases register pressure. (5) Inline assembly: May reserve registers unavailable to compiler. Detection: (1) Compiler optimization reports (-fopt-info-vec-missed for GCC, -qopt-report for Intel) mention register pressure. (2) Disassembly shows frequent stack loads/stores in hot code: mov to/from [rsp+offset]. (3) High memory traffic in compute-only code. (4) Performance cliff when adding local variables. Solutions: Reduce live variable count, split functions, allow compiler to optimize (-O3), use restrict qualifier, loop splitting, reduce unroll factor, reconsider algorithm to need fewer temporaries.
High integer division frequency indicates performance bottleneck from expensive divide operations. Integer division throughput: approximately 1 per 10-25 cycles (varies by CPU and operand width). Much slower than multiplication (1 cycle throughput). Impact: In division-heavy code, throughput limited to ~0.04-0.1 divisions per cycle. Divider is typically not pipelined and a single resource. Common sources: (1) Modulo operations (% operator) which compile to division. (2) Hash table index calculations. (3) Fixed-point arithmetic. (4) Range/bounds checking. Detection: High instruction latency with low IPC, instruction mix analysis showing frequent DIV/IDIV. VTune shows Divider unit as bottleneck. Disassembly shows div/idiv instructions in hot paths. Solutions: (1) Multiply by reciprocal for constant divisors (compiler does this with -O2+). (2) Use power-of-2 divisors (becomes shift). (3) Replace modulo with branch or bitwise AND for power-of-2. (4) Batch divisions, use lookup tables, strength reduce. Example: x % 8 -> x & 7 (for unsigned). x / 10 -> x * 0xCCCCCCCD >> 35 (compiler-generated).
High memory bandwidth with low CPU utilization indicates a streaming workload where CPUs spend most time waiting for memory transfers, not computing. This is memory-bandwidth-bound execution. Characteristics: (1) Memory bandwidth near system maximum (check with STREAM benchmark for reference). (2) CPU utilization <50% but cannot go higher. (3) Adding more cores does not improve throughput (already bandwidth-limited). (4) IPC very low (0.1-0.5) due to long memory stalls. (5) LLC miss rate high with sustained pattern. Typical workloads: Large array processing, BLAS level 1 operations, memory copying, simple reductions over large datasets. Diagnosis: Compare achieved bandwidth to theoretical maximum. Memory-bound if achieving >50% of peak. Check MLP (Memory-Level Parallelism) - need 40-100 concurrent misses to saturate bandwidth. Solutions: Algorithmic changes to increase computation per byte fetched (cache blocking, data compression), prefetching to overlap compute and memory, SIMD to process more data per instruction, consider HBM or faster memory for bandwidth-critical workloads, reduce memory traffic with better algorithms.
Asymmetric core utilization indicates uneven work distribution in parallel code, leaving some cores idle while others are overloaded. Results in suboptimal scaling. Causes: (1) Static partitioning with uneven work: Dividing N items among K cores assumes equal work per item. (2) Data-dependent execution time: Some inputs take longer (e.g., sparse regions of matrix). (3) Serial sections: Amdahl's law - any serial code limits scaling. (4) Lock contention: Threads serialize on shared locks, some wait while others proceed. (5) NUMA effects: Cores with remote memory access slower than local. (6) Dynamic load imbalance: Work generated during execution clusters on some cores. (7) Cache sharing effects: Cores sharing cache may interfere. Detection: Per-core CPU utilization shows imbalance. Thread timeline in profiler shows gaps where threads wait. Load imbalance metric = (max_thread_time - average_thread_time) / max_thread_time. Solutions: Work stealing schedulers (TBB, OpenMP dynamic scheduling), finer-grained tasks, overdecomposition (more tasks than cores), guided scheduling for predictable but uneven loops, NUMA-aware work distribution, reduce critical section impact.
Execution unit saturation with low IPC indicates a specific resource bottleneck combined with lack of alternative work - the CPU cannot make progress despite idle resources. Scenarios: (1) Single-port saturation: All work requires specific execution port (e.g., shuffle-heavy code saturating port 5) while other ports idle. IPC limited by that port's throughput. (2) Divider saturation: Dense division code - divider is single resource with 10-20 cycle throughput. IPC = 0.05-0.1 from division alone. (3) Long-latency compute chains: Execution unit busy with long-latency ops (FP div, sqrt) but result not ready, nothing else to schedule. (4) Memory unit saturation with compute dependency: Loads saturate ports 2,3 but data not ready, compute ports idle waiting. Detection: Per-port utilization metrics in VTune show imbalance. One port at 90%+, others at 20-30%. Correlate with instruction mix analysis. Solutions: Interleave different operation types to use different ports, strength reduction for expensive operations, overlap independent work during long-latency operations, use SIMD (different ports than scalar), consider algorithm changes to reduce specific operation density.
High memory bandwidth utilization with low IPC indicates memory-latency-bound execution - the CPU is fetching data at high bandwidth but still waiting. This occurs when: (1) Memory access pattern has poor parallelism: Sequential dependent loads cannot overlap. A single core typically needs 10+ concurrent cache misses to saturate memory bandwidth due to latency (64 cache lines in flight for full bandwidth at typical latencies). (2) Pointer chasing: Each load depends on previous load's result (linked lists, tree traversal). Bandwidth available but latency dominates. (3) Insufficient memory-level parallelism (MLP): Hardware prefetchers ineffective, out-of-order window too small to find independent loads. Diagnosis: Compare achieved bandwidth to theoretical peak. If bandwidth is 50%+ of peak but IPC is low, latency is the issue. Solutions: Restructure data for prefetch-friendly access, software prefetching, data structure transformation (arrays vs pointers), loop tiling to improve locality.
Low SIMD lane utilization indicates vectorized code is not using all available vector width, wasting potential throughput. Causes: (1) Scalar epilogue loops: When trip count is not a multiple of vector width, scalar cleanup code processes remainders. (2) Masked operations: Conditionals in vectorized loops create partially-filled vectors (e.g., only 2 of 8 lanes active). (3) Horizontal operations: Reductions, scans, and cross-lane operations have lower throughput than vertical operations. (4) Data alignment issues: Misaligned loads may use half vector width or require extra shuffles. (5) Gather/scatter overhead: Non-contiguous memory access can be slower than scalar for AVX2, only efficient in AVX-512 with high arithmetic intensity. Diagnosis: Compare theoretical SIMD throughput (vector_width * operations/cycle) vs achieved throughput. Look for high instruction counts relative to data processed. Solutions: Ensure data is aligned, use full-width operations, pad arrays to vector width multiples, convert AoS to SoA layout for contiguous access.
High reorder buffer (ROB) utilization indicates many instructions are in-flight awaiting completion, typically due to long-latency operations blocking retirement. Modern Intel CPUs have 200-300+ entry ROBs. High utilization context: (1) Memory-latency bound: Many loads waiting for cache misses. ROB fills while waiting for data. When ROB is full, no new instructions can be dispatched (allocation stall). (2) Long dependency chains: Instructions cannot retire until predecessors complete. (3) Resource stalls: Lack of physical registers or other resources. Diagnosis: VTune Resource Stalls or Allocation Stalls metrics. If ROB stalls correlate with memory-bound metrics, memory optimization needed. If ROB stalls occur with high IPC regions, likely register pressure or specific execution unit bottlenecks. Impact: ROB full = complete pipeline stall. Effective IPC drops to zero until operations complete. Solutions: Increase memory-level parallelism, reduce long-latency operation chains, software prefetching, algorithmic changes for better locality.
Latency-bound execution: Performance limited by the time to complete dependent operation chains. Indicators: (1) Low IPC despite available execution units. (2) Instruction latency matters more than throughput in critical path. (3) Increasing pipeline depth or adding execution units does not help. (4) Memory accesses are serialized (pointer chasing). Example: Loop with A[i] = f(A[i-1]) - each iteration depends on previous. Throughput-bound execution: Performance limited by rate of executing operations. Indicators: (1) High execution port utilization. (2) Adding more execution units would help. (3) Instruction throughput (reciprocal throughput) is the limiting factor. (4) Many independent operations available. Example: A[i] = B[i] * C[i] - all iterations independent. Diagnosis: Profile with perf and check if critical path length * latency approximates runtime (latency-bound) or if instruction count / throughput approximates runtime (throughput-bound).
High gather/scatter overhead indicates non-contiguous memory access patterns are defeating SIMD benefits. Gather loads data from scattered locations into a vector register; scatter stores vector elements to scattered locations. Performance characteristics: (1) AVX2 gather: Often slower than equivalent scalar loads (0.95x-1.2x speedup typical). Only beneficial with 5-10+ FP operations per gather. (2) AVX-512 gather: Improved but still 50-70% of contiguous load throughput. (3) When data exceeds cache: Vectorized gather becomes memory-bound regardless of compute. (4) Scatter typically worse than gather due to read-modify-write requirements. Diagnosis: Profile shows high instruction count relative to useful computation. Memory bandwidth underutilized despite many memory operations. High cycles per vector operation. Solutions: Restructure data layout (AoS to SoA or AoSoA), pre-sort/pre-gather into contiguous buffers if reused, accept scalar code for truly scattered access patterns.
Micro-op cache (DSB - Decoded Stream Buffer) misses indicate decoded instructions must be re-fetched and re-decoded from I-cache, wasting decoder bandwidth. DSB caches decoded micro-ops, bypassing the complex decoders. Causes: (1) Large code footprint: DSB typically holds ~1500-2000 micro-ops. Hot code exceeding this causes misses. (2) Poor code alignment: Instructions crossing cache line or 32-byte boundaries may not cache in DSB. (3) Branch target aliasing: Different branches mapping to same DSB entry cause conflicts. (4) Complex instructions: Some instructions (microcoded) cannot be cached in DSB. (5) SMT interference: Two threads share DSB capacity. Impact: Decoder throughput is 4-5 uops/cycle, DSB throughput is 6 uops/cycle. DSB misses reduce frontend bandwidth by 15-30%. Diagnosis: DSB misses metric in VTune or perf counters for idq_dsb_miss. High miss rate correlates with Frontend Bound > Fetch Bandwidth. Solutions: Profile-guided optimization (PGO), function ordering for hot/cold separation, reducing code size, strategic inlining decisions.
Unequal memory channel utilization indicates memory traffic is not balanced across available DRAM channels, reducing effective bandwidth. Modern systems have 2-8 memory channels; balanced utilization is essential for peak bandwidth. Causes: (1) Single-DIMM population: Physically only one channel populated. (2) Address interleaving configuration: BIOS settings may not optimally interleave. (3) Non-power-of-2 allocations: May cause non-uniform address distribution. (4) NUMA imbalance: Some nodes more heavily used than others. (5) Large contiguous allocations: May land primarily in one channel. Detection: Memory controller counters (Intel uncore PMU) show per-channel bandwidth. Significant imbalance (>20% difference between channels) limits total bandwidth. Platform-specific tools: Intel Memory Latency Checker, pcm-memory. Impact: With N channels but traffic on K channels (K<N), effective bandwidth is K/N of theoretical maximum. Solutions: Populate all memory channels (symmetrically), ensure proper BIOS interleaving settings, use NUMA-aware allocation, large page allocation for better interleaving, balance working set across address ranges.
The roofline model plots achievable performance (FLOPS or ops/sec) against operational intensity (ops per byte from memory). It shows whether code is compute-bound or memory-bound. Determining position on roofline: (1) Calculate operational intensity: OI = floating_point_ops / bytes_transferred. Measure bytes via memory bandwidth counters or analytical calculation. (2) Calculate achieved throughput: GFLOPS = FP_ops / runtime. (3) Compare to roofline: Plot point and see which roof limits it. Key rooflines: (a) Memory bandwidth roof: Performance = bandwidth * OI. (b) Compute roof: Peak GFLOPS (depends on instruction type - scalar, AVX2, AVX-512). Ridge point: OI where roofs intersect. Below ridge = memory-bound, above ridge = compute-bound. Modern Intel Xeon example: ~300 GFLOPS peak, ~150 GB/s bandwidth, ridge point ~2 FLOPS/byte. Diagnosis: If achieved performance << compute roof but traces memory roof, optimize for bandwidth. If near compute roof, optimize compute (better instructions, less overhead). Tools: Intel Advisor provides automated roofline analysis. Manual calculation possible with perf counters.
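A back-of-the-envelope sketch of the roofline check, plugging in the illustrative Xeon numbers above (300 GFLOP/s, 150 GB/s) and the DAXPY-like intensity of 0.083 FLOP/byte:

    #include <algorithm>
    #include <cstdio>

    int main() {
        const double peak_gflops = 300.0, bw_gbs = 150.0;   // illustrative machine
        const double oi = 0.083;                             // FLOP per byte (e.g., DAXPY)
        const double ridge = peak_gflops / bw_gbs;           // 2 FLOP/byte
        const double attainable = std::min(peak_gflops, bw_gbs * oi);
        std::printf("ridge %.2f FLOP/byte, attainable %.1f GFLOP/s (%s-bound)\n",
                    ridge, attainable, oi < ridge ? "memory" : "compute");
    }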
High micro-op sequencer (MS) usage indicates frequent execution of complex instructions that require microcode, reducing frontend throughput. Most x86 instructions decode to 1-4 micro-ops via hardware decoders. Complex instructions use microcode ROM: string operations (REP MOVS), CPUID, some FP operations, assists. Impact: Microcode sequencer delivers 4 uops/cycle, same as decoders, but monopolizes frontend. While microcode runs, no other instructions decode. This serializes instruction flow. Detection: idq.ms_uops counter shows microcode-generated uops. High percentage (>5-10%) indicates potential issue. Retiring > MS (Micro-Sequencer) in top-down analysis. Common triggers: (1) REP string operations for small copies. (2) CPUID in hot paths. (3) Transcendental math (sin, cos, exp) in software. (4) Divide operations (partially microcoded). (5) Floating-point assists for denormals. Solutions: Replace microcoded operations with simple instruction sequences (memcpy for small copies, lookup tables for math), use SIMD equivalents where available, avoid operations requiring assists (denormals), use hardware-supported versions where available (SIMD sqrt, reciprocals).
High data cache line split rate indicates memory accesses frequently cross cache line boundaries (64 bytes), requiring two cache accesses instead of one. Doubles memory bandwidth and potentially doubles latency. Causes: (1) Unaligned data structures: Structure starts not at 64-byte boundary, fields span lines. (2) Unaligned dynamic allocations: Malloc without alignment guarantees. (3) Packed structures: Removing padding saves space but causes splits. (4) Variable-length records: Sequential processing crosses boundaries unpredictably. (5) String processing: Character-by-character access through unaligned strings. Detection: ld_blocks.no_sr (loads blocked due to split) or mem_inst_retired.split_loads counters. Split rate above 1-2% of loads warrants attention. Performance impact: Each split roughly doubles access cost. At 10% split rate, ~10% performance loss. More significant if splits cause L1 misses where single access would hit. Solutions: Align allocations (posix_memalign, aligned_alloc, alignas(64) in C++), padding in structures, process data in aligned chunks, SIMD with aligned load requirements force alignment. For intrinsics, use aligned load variants when possible. Compiler alignment hints (__attribute__((aligned(64)))).
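A sketch of split-free allocation and layout, assuming C++17's std::aligned_alloc (on Windows, _aligned_malloc plays the same role); aligned_alloc requires the size to be a multiple of the alignment:

    #include <cstdlib>

    // 64-byte records at a 64-byte boundary never straddle a cache line.
    struct alignas(64) Record { float data[16]; };   // sizeof(Record) == 64

    Record* make_records(std::size_t count) {
        void* p = std::aligned_alloc(64, count * sizeof(Record));  // size is a multiple of 64
        return static_cast<Record*>(p);                            // nullptr on failure
    }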
High NOP (No Operation) rate in VLIW code indicates poor instruction packing - the compiler cannot find enough independent operations to fill all instruction slots. Typical waste: SPEC CPU 2000 shows 28-32% under-utilization due to NOPs across different benchmark suites. Causes: (1) Insufficient instruction-level parallelism (ILP) in source code: Dependencies prevent parallel scheduling. (2) Resource constraints: Not enough functional units of required type. (3) Register pressure: Insufficient registers force serialization. (4) Control flow: Branches limit scheduling across basic blocks. Impact: VLIW processors (Itanium, many DSPs) issue fixed-width instruction bundles. NOPs waste fetch bandwidth, instruction cache, and decode resources. Diagnosis: Compiler reports (if available), or count NOP instructions in disassembly. More than 20-25% NOPs indicates optimization opportunity. Solutions: Loop unrolling, software pipelining, function inlining, predication instead of branches, restructure algorithms for more ILP.
High context switch rate indicates frequent CPU transitions between processes/threads, each switch incurring direct and indirect costs. Direct costs: 1-10 microseconds for register saves, TLB flush, page table switch. Indirect costs: Cache pollution, TLB misses, pipeline flush, branch predictor thrashing - often 10-100x the direct cost. Thresholds vary by workload: Interactive systems: 1000-10000/sec may be normal. Throughput systems: > 5000/sec indicates potential problem. Per-core: > 1000/sec/core is concerning for compute workloads. Causes: (1) Too many threads: Thread count >> core count causes excessive scheduling. (2) Lock contention: Threads blocking on locks yield CPU. (3) I/O blocking: Synchronous I/O causes context switches. (4) Timer interrupts: Frequent timers/sleeps. (5) Small time slices: Scheduler configuration. Diagnosis: vmstat shows cs (context switches), pidstat -w shows per-process. sar -w for historical data. Solutions: Reduce thread count to match cores, async I/O, user-space scheduling, batch processing, increase scheduler time slice.
L2 cache bottleneck occurs when working set fits in LLC but exceeds L2, causing frequent L2 misses at ~12-20 cycle latency (vs 4-5 for L1). Modern L2: typically 256KB-1MB per core. Specific causes: (1) Working set between 32KB (L1) and 256KB-1MB (L2): Data frequently evicted from L2, fetched from LLC. (2) High bandwidth demand: L2 fill bandwidth may be lower than LLC access rate. (3) Conflict misses: Power-of-2 strides causing set conflicts in L2. (4) Instruction and data competition: Unified L2 shared between I-cache and D-cache misses. Detection: L2 miss rate high (>5%) while LLC miss rate low. Backend Bound > Memory Bound > L2 Bound in top-down analysis. perf stat -e l2_rqsts.miss,l2_rqsts.all shows L2 miss rate. Optimization focus: L2 misses are cheaper than LLC misses but add up. Solutions: Cache blocking to fit in L2, prefetching to hide L2 latency, improve spatial locality, consider access pattern to avoid conflict misses (add padding to break power-of-2 strides).
Excessive function call overhead manifests as significant time spent in call/return sequences rather than actual work. Impact: Each call/return involves: instruction cache access for target, stack manipulation, register saves (caller/callee-saved), potential branch misprediction for indirect calls. Symptoms: (1) Hot functions have few instructions but high inclusive time. (2) High call-ret instruction count relative to work done. (3) Profile shows significant time in function prologue/epilogue. (4) Indirect call rate high with poor prediction (virtual functions, callbacks). (5) I-cache misses concentrated at function boundaries. Diagnosis: Function-level profiling shows call counts - if called millions of times with microseconds total, suspect overhead. Check instructions-per-call ratio - very low (< 20-50 instructions per call) suggests inline candidates. Indirect call prediction: branch-misses during call instructions. Thresholds: Function call overhead typically 5-50 cycles. At 1M calls/sec, overhead is 5-50ms/sec. If function body is <100 instructions, inlining likely helps. Solutions: Inline small functions (compiler -finline-functions, link-time optimization), devirtualization, batch operations to amortize call cost, profile-guided inlining decisions.
Memory ordering stalls occur when the CPU must ensure memory operations complete in program order, preventing out-of-order optimization. Causes: (1) Atomic operations: Sequentially consistent atomics (default C++ memory_order_seq_cst) require full ordering, stalling pipeline until prior operations complete. (2) Memory barriers: MFENCE costs ~30-100 cycles, SFENCE/LFENCE less but still significant. (3) Locked instructions: LOCK prefix ensures atomicity but serializes memory access. (4) Store-to-load forwarding failures: Load must wait for prior store to same address. (5) Memory disambiguation speculation failures: Load speculatively executed before older store to same address must re-execute. Detection: exe_activity.bound_on_stores or memory_ordering_stalls counters. machine_clears.memory_ordering for speculation failures. High Backend Bound > Memory Bound > Store Bound pathway. Solutions: Use weakest sufficient memory ordering (memory_order_relaxed, memory_order_acquire/release), batch atomic operations, avoid false sharing that triggers coherency traffic, use lock-free algorithms where possible, reorder code to reduce memory ordering constraints, consider thread-local data to eliminate synchronization.
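A minimal C++ sketch of the "weakest sufficient ordering" advice: a relaxed counter for statistics that other threads only read approximately, and an acquire/release pair for a producer/consumer handoff, avoiding the full fences a default seq_cst increment would imply.

```cpp
#include <atomic>
#include <cstdint>

// Statistics counter: no ordering with surrounding memory operations needed.
std::atomic<std::uint64_t> g_events{0};

void record_event() {
    g_events.fetch_add(1, std::memory_order_relaxed);   // no full fence
}

// Producer/consumer handoff: acquire/release is sufficient.
std::atomic<bool> g_ready{false};
int g_payload = 0;

void producer() {
    g_payload = 42;                                   // plain store
    g_ready.store(true, std::memory_order_release);   // publishes g_payload
}

bool consumer(int& out) {
    if (g_ready.load(std::memory_order_acquire)) {    // pairs with the release
        out = g_payload;                              // guaranteed to observe 42
        return true;
    }
    return false;
}
```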
High store buffer stall indicates the CPU's store buffer (typically 42-56 entries on modern Intel) is full, blocking new store instructions. Causes: (1) Store-heavy code: Many consecutive stores without intervening loads or computation. (2) Memory-bound stores: Store buffer drains slowly when memory subsystem is saturated or when stores miss cache. (3) Store forwarding failures: When loads try to read from recent stores but cannot forward (partial overlap, different sizes), causing additional stalls. (4) Non-temporal stores misuse: NT stores bypass cache but still use store buffer. Symptoms: Backend Bound with Memory Bound > Store Bound in top-down analysis. Low store throughput despite low load latency. Measure with perf stat -e mem_inst_retired.all_stores or specific store buffer events. Solutions: Reduce store frequency, interleave stores with other operations, ensure cache-friendly access patterns, use non-temporal stores only for true streaming writes.
High DTLB miss rate indicates the application accesses data across many virtual memory pages, exhausting TLB capacity. Causes: (1) Large working set: Data spans thousands of pages (4KB each). Typical L1 DTLB holds 64-128 entries; L2 TLB holds 1024-2048. (2) Random access patterns: Scattered accesses touch many pages without reuse. (3) Large data structures with poor locality: Hash tables, graphs, sparse matrices. (4) Memory fragmentation: Logically contiguous data physically scattered. Performance impact: TLB miss triggers page table walk - 10-100 cycles for cached page tables, 100-1000+ cycles if page tables miss cache. At high miss rates, this dominates execution time. Diagnosis: perf stat -e dTLB-loads,dTLB-load-misses. Miss rate above 0.1-1% is concerning. Solutions: Use huge pages (2MB or 1GB), improve data locality, reduce working set, structure-of-arrays instead of array-of-structures.
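A minimal Linux-specific sketch of the huge-page suggestion: align the buffer to 2MB and hint the kernel with madvise(MADV_HUGEPAGE). Whether transparent huge pages are actually used depends on system configuration; this is only a hint.

```cpp
#include <cstdlib>
#include <sys/mman.h>

// Ask the kernel to back a large, 2MB-aligned buffer with transparent huge
// pages, reducing the number of DTLB entries needed by up to 512x.
void* alloc_huge_hint(std::size_t bytes) {
    void* p = nullptr;
    if (posix_memalign(&p, 2 * 1024 * 1024, bytes) != 0)   // 2MB alignment
        return nullptr;
    madvise(p, bytes, MADV_HUGEPAGE);                      // THP hint only
    return p;
}
```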
High dependency chain length indicates serial operations where each result is needed by the next operation, preventing parallel execution. A dependency chain of length N with average latency L takes at least N*L cycles regardless of available execution units. Impact: (1) Limits IPC to 1/L for chain-dominated code. (2) Out-of-order execution cannot help - no independent work to schedule. (3) Loop-carried dependencies multiply effect across iterations. Examples: Accumulation (sum += a[i]), linked list traversal, recursive calculations, pointer chasing. Diagnosis: Critical path analysis - identify longest dependency chain. If chain_length * average_latency approximates measured cycles, code is latency-bound. Solutions: Break dependency chains via: multiple accumulators (reduce tree depth), loop unrolling with independent iterations, algebraic transformations (associativity for FP with care), data structure changes (arrays vs linked structures). For 4-cycle latency operations, use 4+ independent accumulators to achieve full throughput.
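A minimal sketch of the multiple-accumulator technique applied to a summation. Note that it reassociates floating-point addition, which is acceptable only when the exact rounding order does not matter.

```cpp
#include <cstddef>

// Four independent dependency chains: with ~4-cycle FP add latency, the adder
// can start a new addition nearly every cycle instead of every 4th cycle.
double sum_multi_acc(const double* a, std::size_t n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i + 0];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; ++i) s0 += a[i];     // scalar tail
    return (s0 + s1) + (s2 + s3);      // combine the chains once at the end
}
```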
Cache line ping-pong indicates a cache line is repeatedly transferred between cores as multiple cores read and write it, causing severe coherency traffic. Each transfer requires: (1) Local HITM: ~40-60 cycles for same-socket transfer. (2) Remote HITM: ~100-300 cycles for cross-socket transfer. When writes alternate between cores, the line bounces continuously. Performance impact: For frequently-accessed shared data, ping-pong can consume 90%+ of cycles in memory subsystem overhead. Symptoms: High HITM counts in perf c2c, LLC misses despite good locality, poor scaling with thread count, cache coherency traffic visible in memory controller metrics. Common causes: (1) Shared counters/flags updated by multiple threads. (2) Lock implementations (spinlocks spinning on shared variable). (3) False sharing of adjacent variables. Diagnosis: perf c2c report shows specific cache lines and access patterns. Look for lines with both high store and load counts from different cores. Solutions: Use per-thread counters with periodic aggregation, padding between thread-private data, lock-free structures, reader-writer patterns, reduce sharing frequency.
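A minimal sketch of the per-thread-counter solution: each counter sits on its own 64-byte line, so increments from different cores never bounce a shared line, and a reader aggregates occasionally. kMaxThreads is an illustrative limit.

```cpp
#include <atomic>
#include <cstdint>

constexpr int kMaxThreads = 64;   // illustrative upper bound

struct alignas(64) PaddedCounter {
    std::atomic<std::uint64_t> value{0};
    // alignas(64) makes sizeof(PaddedCounter) a multiple of 64, so adjacent
    // array entries never share a cache line.
};

PaddedCounter g_counters[kMaxThreads];

void increment(int thread_id) {
    g_counters[thread_id].value.fetch_add(1, std::memory_order_relaxed);
}

std::uint64_t total() {           // occasional aggregation by one reader
    std::uint64_t sum = 0;
    for (const auto& c : g_counters)
        sum += c.value.load(std::memory_order_relaxed);
    return sum;
}
```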
Thermal throttling indicates CPU frequency is reduced to prevent overheating, directly impacting performance. Detection methods: (1) Frequency monitoring: CPU frequency drops below base clock during load. Use turbostat, cpupower frequency-info, or /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq. (2) Temperature monitoring: Core temps approaching TjMax (typically 100C). Use sensors, Intel Power Gadget, or /sys/class/thermal/. (3) Throttling counters: perf stat -a -e 'msr/tsc/' compared against actual core cycles - a falling cycles-to-TSC ratio shows the effective frequency dropping under load. (4) Inconsistent benchmark results: Performance degrades over time as temps rise. (5) Package power limiting: RAPL constraints reduce power envelope. Common causes: Inadequate cooling, blocked vents, failed thermal paste, high ambient temperature, excessive multi-core turbo, sustained AVX-512 workloads. Solutions: Improve cooling (better heatsink, more airflow), reduce ambient temperature, limit turbo boost, spread workload across cores, consider undervolting (with caution), workload scheduling to allow cool-down periods.
Inconsistent performance across benchmark runs (high variance) indicates external factors affecting measurement rather than stable code behavior. Common causes: (1) Frequency scaling: Turbo boost varies with temperature, power limits. Fix: Disable turbo or set fixed frequency. (2) NUMA effects: Memory placement varies between runs. Fix: numactl --cpubind/membind. (3) Address space randomization (ASLR): Affects cache alignment, branch prediction. Fix: Disable ASLR for benchmarking or average many runs. (4) Background processes: Competes for CPU, cache, memory. Fix: Isolate cores with isolcpus, use cset shield. (5) Timer interrupts: Periodic scheduling interruptions. Fix: Use tickless kernel (nohz_full), process isolation. (6) Power management: C-states add wake latency. Fix: Disable deep C-states or warm up CPU. (7) Thermal throttling: Temperature-dependent frequency reduction. Fix: Allow thermal equilibrium before measurement. Diagnosis: Coefficient of variation (stddev/mean) > 1-2% suggests noise. Run 10+ iterations, check min/max spread. Use perf stat's --repeat flag. Proper benchmarking setup: Fix frequency, pin processes, isolate cores, control temperature, multiple iterations with warmup.
Performance drop under high system load indicates resource contention that is not visible under isolated testing. Reveals shared resource bottlenecks. Contention sources: (1) LLC contention: Multiple processes evict each other's cache lines. LLC is shared among cores. (2) Memory bandwidth: Combined demand exceeds memory controller capacity. (3) Memory controller queue depth: Requests queue behind others. (4) Core interconnect: Ring bus or mesh traffic increases latency. (5) I/O bandwidth: Disk, network shared with other processes. (6) Kernel resources: Scheduler overhead, lock contention in kernel. Detection: Compare isolated vs loaded performance. Profile under realistic concurrent load. Monitor: (a) LLC miss rate increase under load. (b) Memory latency increase (from ~80ns to 150ns+ under contention). (c) Context switch rate increase. (d) Per-core vs aggregate metrics. Diagnosis: Intel Resource Director Technology (RDT) shows per-process LLC and memory bandwidth. perf stat under various loads reveals scaling issues. Solutions: LLC partitioning (Intel CAT), memory bandwidth allocation (Intel MBA), process isolation (cgroups, containers), NUMA pinning, reduce shared resource demand, design for graceful degradation under contention.
Signs of false sharing: (1) Performance scales poorly with thread count despite independent work. (2) High cache-to-cache transfer rate: perf c2c shows high 'Rmt HITM' (remote hit-modified) counts. HITM above 5-10% of cache accesses indicates contention. (3) High LLC miss rate in multithreaded vs single-threaded execution. (4) Variables accessed by different threads are within 64 bytes of each other. (5) Specific cache lines show both high read and write activity from different cores. Diagnosis: Run 'perf c2c record' then 'perf c2c report' to identify hot cache lines and the exact offsets being accessed. Look for patterns where different threads access adjacent (not same) memory locations. Performance impact: False sharing can cause 10-100x slowdown. Each write invalidates the cache line on all other cores, forcing expensive coherency traffic. Solutions: Pad structures to cache line boundaries (64 bytes), use thread-local storage, separate read-mostly from write-mostly data.
Memory controller saturation occurs when DRAM bandwidth demand exceeds available capacity, causing queueing delays and increased effective latency. Causes: (1) High sustained memory bandwidth from compute-intensive workloads: Dense linear algebra, video processing, ML inference. (2) Many concurrent LLC misses: Memory-level parallelism from multiple cores overwhelms controller. (3) Inefficient access patterns: Random accesses have lower effective bandwidth than sequential due to DRAM row buffer misses. (4) Memory interference from co-located workloads: Multiple VMs or containers competing for memory bandwidth. Diagnosis: Compare achieved bandwidth to theoretical maximum (memtest or STREAM benchmark gives practical peak). Intel Memory Bandwidth Monitoring (MBM) shows per-core and total bandwidth. If at 60-70% of peak with increasing latency, saturation is occurring. Memory latency stack analysis shows queueing delays. Solutions: Reduce memory traffic (blocking, tiling, compression), improve locality, balance load across memory channels, consider HBM or faster memory for bandwidth-limited workloads.
Speculative execution waste occurs when the CPU executes instructions on predicted paths that are ultimately not needed, consuming resources for discarded work. Main sources: (1) Branch misprediction (95%+ of speculation waste): Wrong branch taken, all speculatively executed instructions discarded. 10-30 cycles per misprediction. (2) Memory ordering speculation: Load speculatively executed before older store to different (predicted) address. If same address, must re-execute. (3) Value prediction misses: Some CPUs predict load values. Wrong prediction causes re-execution. (4) Prefetch waste: Speculatively prefetched data that is not used. Measured via Bad Speculation metric in top-down analysis. Above 10-15% is problematic. Diagnosis: Break down Bad Speculation into Branch Mispredicts and Machine Clears. Most waste is branches. Check branch misprediction rate (target: < 2%). Solutions: Profile-guided optimization (PGO) for better branch prediction, branchless code (CMOV, arithmetic), sorted/predictable data, fewer indirect calls, memory ordering hints.
High store forwarding stall indicates loads cannot obtain data directly from the store buffer when they should, adding 10-15+ cycles per stall. Store forwarding normally allows a load to get data from a recent store without waiting for cache write. Forwarding fails when: (1) Size mismatch: Store is larger than load (e.g., store 64 bits, load 32 bits of it may fail on some CPUs). (2) Address overlap but not contained: Load partially overlaps store. (3) Misaligned cross-boundary: Store and load cross cache line boundary differently. (4) Store not yet executed: Address not known when load executes. Detection: ld_blocks.store_forward performance counter, or store forwarding blocked metrics in VTune. Rate above 1% of loads indicates issue. Impact: Each blocked forward adds ~10-15 cycles, turning 4-cycle L1 hit into 15-20 cycle operation. Code pattern to avoid: Writing as one type, reading as another (union type punning), unaligned buffer operations, mixed-size pointer casting. Solutions: Ensure loads and stores match in size and alignment, avoid type punning, use memcpy for safe type conversion (compiler optimizes), align data structures.
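A minimal sketch of the memcpy advice for type conversion: the 4-byte store and 4-byte reload match in size and alignment, so store-to-load forwarding succeeds, the behavior is well defined, and compilers typically lower the copy to a single register move.

```cpp
#include <cstdint>
#include <cstring>

// Reinterpret a float's bit pattern without union or pointer-cast punning.
std::uint32_t float_bits(float f) {
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof(u));   // sizes match: forwarding-friendly
    return u;
}

float bits_to_float(std::uint32_t u) {
    float f;
    std::memcpy(&f, &u, sizeof(f));
    return f;
}
```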
High AVX-SSE transition penalty indicates frequent switching between legacy SSE and 256-bit AVX instructions, causing partial register state saves that cost ~70 cycles each on affected CPUs (Haswell and earlier). Mechanism: AVX extends SSE registers from 128 to 256 bits. Mixing AVX and SSE code requires saving/restoring upper 128 bits. The CPU enters 'upper state dirty' mode and must transition when executing different ISA. Detection: other_assists.avx_to_sse or avx512_transitions counters. Significant counts (> 0.1% of instructions) indicate problem. VTune may flag this in analysis. Common causes: (1) Calling library functions compiled without AVX between AVX code. (2) Mixing intrinsics from different generations. (3) Math libraries using SSE called from AVX code. (4) System calls (kernel typically SSE-only). Solutions: (1) Use VZEROUPPER instruction between AVX and SSE (compilers emit automatically with proper flags). (2) Compile everything with AVX (-mavx) including libraries. (3) Use -march=native for consistent ISA. (4) On Skylake+, transition penalty eliminated but VZEROUPPER still helps power. Verify with compiler flags: GCC -Wpsabi warns about ABI issues.
High instruction fetch bandwidth limitation indicates the frontend cannot supply instructions fast enough to keep execution units busy, even with no cache misses. Frontend Bound > Fetch Bandwidth (vs Fetch Latency which indicates misses). Causes: (1) Decoder limitations: Legacy decode pipeline limited to 4-5 instructions/cycle, complex instructions limited to 1/cycle on complex decoder. (2) Instruction length variability: x86 variable-length encoding (1-15 bytes) complicates fetch. (3) Branch density: Taken branches redirect fetch, limiting effective bandwidth. (4) Code alignment: Instructions crossing cache lines or decode boundaries reduce throughput. (5) Micro-op cache (DSB) limitations: High DSB miss rate forces legacy decode. Detection: idq.dsb_uops vs idq.mite_uops ratio - high MITE (legacy decoder) usage indicates DSB misses. Frontend Bound > Fetch Bandwidth in top-down. Low IPC with low memory-bound and low bad-speculation. Solutions: Profile-guided optimization for hot code placement, improve DSB utilization (code alignment, avoiding DSB-unfriendly patterns), reduce branch density in hot loops, use SIMD to do more work per instruction, consider micro-architecture-specific code alignment strategies.
High HITM count indicates cache line contention between cores - one core is reading data that another core has modified. The reading core must fetch the modified line from the writing core's cache (cache-to-cache transfer), which is slower than LLC hit. Types: (1) Local HITM: Transfer between cores on same socket (40-60 cycles). (2) Remote HITM: Transfer between cores on different sockets in NUMA systems (100-300 cycles). High remote HITM is particularly expensive. Causes: (1) True sharing: Multiple threads legitimately accessing same data. Requires synchronization redesign. (2) False sharing: Different variables on same cache line. Fix with padding/alignment. Diagnosis: perf c2c identifies specific cache lines and access patterns. HITM rate above 1-5% of cache accesses is problematic. Look at 'Snoop' column in perf c2c output. Solutions: For true sharing - reduce sharing, use read-mostly patterns, thread-local copies. For false sharing - align to cache line boundaries, restructure data layout.
High NUMA remote access rate indicates memory requests are being served by memory attached to a different CPU socket, incurring higher latency (100-300 cycles vs 60-80 cycles for local). Causes: (1) Poor memory allocation policy: Memory allocated on wrong node. (2) Thread migration: Threads moving between sockets while data stays on original node. (3) Data sharing between threads on different sockets: Legitimate but expensive. (4) First-touch allocation with unfortunate initialization patterns. Diagnosis: VTune Memory Access analysis shows Remote DRAM vs Local DRAM. perf stat -e node-load-misses,node-loads measures remote access ratio. Remote access ratio above 10-20% typically indicates optimization opportunity. Performance impact: 1.5-3x latency penalty for remote access. Aggregate bandwidth to remote memory is also lower. Solutions: Use numactl for memory binding, first-touch with correct thread affinity, data placement optimization, replicate read-only data on each node, NUMA-aware memory allocators.
Unbalanced execution port utilization indicates some execution units are saturated while others are idle, limiting throughput despite available execution resources. Modern Intel CPUs have 6-8 ports with different capabilities: (1) Ports 0, 1: General ALU, some FP/vector operations. (2) Port 5: Shuffles, some ALU, vector permutes. (3) Ports 2, 3: Load address generation. (4) Port 4: Store data. (5) Port 6: Branch, some ALU. Imbalance patterns: (a) Heavy shuffle code: Port 5 saturated at 100%, others idle. (b) Dense FP: Ports 0,1 busy, integer ports idle. (c) Memory-heavy: Ports 2,3,4 saturated. Diagnosis: VTune port utilization metrics show per-port cycles used. Ports above 70-80% utilization with others below 30% indicates imbalance. Solutions: Use alternative instructions that execute on different ports (compiler flag -qopt-report for Intel), rewrite to balance operation mix, use SIMD to reduce instruction count, interleave different operation types in instruction stream.
High 4K aliasing rate indicates memory disambiguation failures when loads and stores have addresses that differ only in bits 12 and above (same lower 12 bits). The CPU's memory disambiguator uses partial address comparison and may incorrectly predict these as aliased, causing unnecessary stalls. How it occurs: Load address = 0x1000, Store address = 0x2000 - both have lower 12 bits = 0x000. CPU may stall the load waiting for the store even though addresses differ. Performance impact: Each false alias stall costs ~5-10 cycles as the CPU waits unnecessarily or re-executes. Code patterns that trigger: (1) Multiple arrays with power-of-2 sizes accessed in parallel. (2) Stack variables and heap data with unfortunate address alignment. (3) Structure padding causing regular 4K-offset patterns. Detection: ld_blocks_partial.address_alias counter in perf. Rate above 1% of loads is worth investigating. Solutions: Add padding to offset array bases, use different allocation strategies, reorder operations to separate aliasing accesses in time, compiler may reorder with proper aliasing hints (restrict keyword).
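A minimal sketch of the base-padding fix, assuming three power-of-2-sized arrays processed in lockstep: carving them from one block with a one-cache-line skew keeps their low 12 address bits distinct, so the disambiguator no longer sees false aliases.

```cpp
#include <cstddef>
#include <cstdlib>

constexpr std::size_t kN   = 4096;                 // power-of-2 element count
constexpr std::size_t kPad = 64 / sizeof(float);   // one cache line of floats

struct Buffers {
    float* block;
    float *a, *b, *c;
    Buffers() {
        block = static_cast<float*>(
            std::malloc((3 * kN + 3 * kPad) * sizeof(float)));
        a = block;              // offset 0
        b = a + kN + kPad;      // skewed by 64 bytes relative to a (mod 4KB)
        c = b + kN + kPad;      // skewed by 128 bytes relative to a (mod 4KB)
    }
    ~Buffers() { std::free(block); }
};

void add(const Buffers& buf) {
    for (std::size_t i = 0; i < kN; ++i)
        buf.c[i] = buf.a[i] + buf.b[i];   // a[i], b[i], c[i] differ in low 12 bits
}
```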
Excessive memory allocation overhead occurs when frequent malloc/free calls consume significant CPU time and fragment memory. Typical malloc: 50-200 cycles for small allocations, potentially 1000s for large or fragmented. Signs: (1) Profile shows significant time in malloc, free, mmap, brk. (2) Memory usage grows over time (fragmentation). (3) Performance degrades as program runs longer. (4) Page faults during allocation. (5) Lock contention in multi-threaded allocation (standard malloc uses global locks). Diagnosis: perf record with call stacks shows allocation in hot paths. ltrace counts malloc calls. Valgrind massif shows allocation patterns. Thresholds: > 1M allocations/sec or > 5% of CPU time in allocator is concerning. Solutions: (1) Object pooling/arena allocation for frequently allocated types. (2) Thread-local allocators (tcmalloc, jemalloc, mimalloc) reduce lock contention. (3) Stack allocation for short-lived objects. (4) Pre-allocation and reuse patterns. (5) Reduce allocation frequency (batch operations, larger buffers). (6) Huge pages reduce page table overhead. For C++: small object allocator, custom allocators, reserve() for containers.
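A minimal sketch of an object pool (solution 1), assuming single-threaded use and a default-constructible T; a production pool would add growth and per-thread caching.

```cpp
#include <cstddef>
#include <vector>

// One up-front allocation, then O(1) reuse via a free list, replacing
// per-object malloc/free on a hot path.
template <typename T>
class Pool {
public:
    explicit Pool(std::size_t capacity) : storage_(capacity) {
        free_.reserve(capacity);
        for (auto& slot : storage_) free_.push_back(&slot);
    }
    T* acquire() {                          // a few cycles vs 50-200 for malloc
        if (free_.empty()) return nullptr;  // or fall back to the heap
        T* p = free_.back();
        free_.pop_back();
        return p;
    }
    void release(T* p) { free_.push_back(p); }

private:
    std::vector<T> storage_;   // contiguous backing store
    std::vector<T*> free_;     // currently available slots
};
```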
High address generation interlock (AGI) stalls occur when a memory access instruction uses a register that was modified by the immediately preceding instruction, causing a 1-cycle delay. Modern out-of-order CPUs largely hide this, but it can still impact tight loops. Pattern that causes AGI: add rax, 8 followed by mov rbx, [rax] - the load cannot begin address calculation until add completes. When AGI matters: (1) Very tight loops where every cycle counts. (2) When combined with other stalls, AGI adds up. (3) In-order cores (embedded, older Atom) have severe AGI penalties. (4) Address calculation involving just-computed values. Detection: Modern Intel has limited direct AGI counters. Manifests as slightly lower IPC than expected in address-intensive code. Comparison with expected throughput reveals overhead. More visible on in-order cores. Solutions: Schedule non-dependent instructions between address computation and use, unroll loops to separate address calc from use, use addressing modes that avoid dependency (base+index*scale+offset often computed in parallel), let compiler schedule with -O3 (AGI-aware scheduling), on in-order cores, manually reorder assembly.
High iowait percentage indicates CPUs are idle waiting for I/O operations (typically disk) to complete. Interpretation depends on context: (1) Database servers: Any persistent iowait suggests disk bottleneck requiring immediate attention - queries blocked on storage. (2) Web servers: 2-3% iowait may be acceptable for occasional file serving. (3) Storage backends: Up to 20% iowait may be normal for bulk data processing. Critical caveat: High iowait only appears when no other CPU work is available. Adding CPU-bound load makes iowait disappear but does not fix the I/O problem. Check 'b' column in vmstat for blocked processes - this shows I/O-blocked processes even under CPU load. Diagnosis steps: (1) iostat -x to identify saturated disks (util% > 70-80%). (2) iotop to identify processes causing I/O. (3) Check await (average wait time) - above 10-20ms indicates slow storage. Solutions: Faster storage (SSD/NVMe), I/O scheduling, caching, async I/O.
High lock contention indicates multiple threads frequently competing for the same lock, causing serialization and wasted CPU cycles spinning or context switching. Symptoms: (1) CPU utilization high but throughput low. (2) Thread profilers show significant time in lock acquisition. (3) Performance degrades as thread count increases despite available cores. (4) High context switch rate when using blocking locks. Diagnosis: Mutex profiling tools (Valgrind Helgrind, Intel Inspector), perf record with lock events, or application-level metrics. Look for locks held for long periods or acquired frequently. Impact quantification: With N threads competing for lock held T time, maximum throughput is 1/T regardless of N. At 1000 acquisitions/sec with 1ms hold time, only one thread can make progress at a time. Solutions: Lock-free algorithms where possible, fine-grained locking (more locks, less contention each), read-write locks for read-heavy workloads, lock elision/transactional memory, reduce critical section length, data partitioning.
High floating-point assist rate indicates frequent microcode assists for exceptional FP conditions, severely impacting performance. Each assist costs 50-150+ cycles. Causes: (1) Denormal numbers: Values near zero (< 2^-126 for single precision) require microcode handling. Common in iterative algorithms that converge toward zero. (2) FP exceptions: Invalid operations, overflow, underflow triggering assists. (3) Mixed precision without proper handling: Converting between float/double in ways that trigger denormals. (4) Uninitialized FP registers: Garbage values may trigger exceptional conditions. Diagnosis: fp_assist events in perf, or FP_ASSISTS_* counters. Rate above 1% of FP operations is concerning. Performance impact: Code heavy in assists can run 10-100x slower than expected. Solutions: Enable FTZ (Flush To Zero) and DAZ (Denormals Are Zero) via MXCSR register: _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON) and _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON). Initialize data properly, use appropriate precision, add epsilon to avoid near-zero values.
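A minimal sketch of enabling FTZ/DAZ before a denormal-prone kernel. The MXCSR mode is per thread, so each worker thread must set it.

```cpp
#include <immintrin.h>   // brings in the MXCSR mode macros

// Trade strict IEEE-754 subnormal handling for the removal of microcode
// assists: results that would be denormal are flushed to zero, and denormal
// inputs are treated as zero.
void enable_ftz_daz() {
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);          // outputs -> 0
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);  // inputs  -> 0
}
```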
Performance cliff when crossing cache size boundary occurs because miss rate jumps discontinuously when working set exceeds cache capacity. Typical boundaries: L1D: 32-48KB, L2: 256KB-1MB, LLC: 2-30MB per core (shared). Behavior: Working set at 90% of cache level shows low miss rate. At 110% of cache level, miss rate can jump to 50%+ as capacity evictions begin. Performance drops proportionally to miss rate times miss penalty ratio. Why cliffs, not gradual: LRU-like replacement means once set is full, every new line evicts a potentially-useful line. With even small overflow, thrashing begins. Detection: Sweep working set size while measuring performance. Plot performance vs size - cliffs visible at cache boundaries. Or measure miss rate at each level vs working set size. Example impact: L2 (256KB) to LLC cliff: Going from 250KB to 300KB working set may increase CPI from 1.0 to 2.0 as every access that previously hit L2 (12 cycles) now hits LLC (40 cycles). Solutions: Cache blocking/tiling to keep hot data in target cache level, data structure compression, prioritize hot data for cache residency, consider explicit cache management (prefetch, non-temporal hints).
High ITLB (Instruction TLB) miss rate indicates code spans many virtual memory pages, exhausting instruction address translation cache. Typical L1 ITLB: 64-128 entries for 4KB pages, fewer for huge pages. L2 TLB is unified and larger. Causes: (1) Large code footprint: Application code exceeds ITLB coverage (e.g., 128 entries * 4KB = 512KB). (2) Code fragmentation: Functions scattered across many pages. (3) JIT compilation: Dynamically generated code at various addresses. (4) Template instantiation bloat: C++ generates many specialized functions. (5) Excessive inlining: Increases code size beyond TLB capacity. Impact: ITLB miss triggers page table walk, costing 10-100+ cycles. High miss rates directly stall instruction fetch. Detection: perf stat -e iTLB-loads,iTLB-load-misses. Miss rate > 0.1-1% is significant. Frontend Bound > Fetch Latency > ITLB Overhead in top-down. Solutions: Use huge pages for code (requires OS support), improve code locality (PGO, function reordering), reduce code size, link-time optimization to colocate related functions, consider code size vs speed tradeoffs in inlining decisions.
Degraded performance when using larger data types (e.g., double vs float, int64 vs int32) can indicate memory bandwidth limitation or reduced SIMD parallelism, not just doubled computation. Causes: (1) Memory bandwidth: 2x data size = 2x memory traffic for same algorithm. Memory-bound code sees linear slowdown. (2) SIMD lane reduction: 256-bit register holds 8 floats but only 4 doubles. Half the operations per instruction. (3) Cache pressure: Working set doubles, may exceed cache level that previously contained it. (4) Instruction latency differences: Some operations (division) have different latencies for different precision. (5) Register pressure: Larger types may cause spilling earlier. Diagnosis: Compare ratio of performance drop to data size increase. If performance drops > 2x for 2x data size, memory hierarchy effects are amplifying. Check cache miss rates - if they increase significantly with larger types, cache capacity is the issue. Solutions: Evaluate if higher precision is necessary, use mixed precision (higher precision only where needed), block algorithms to fit in cache, consider single precision with compensated summation for accuracy, data compression/quantization where applicable.
IPC (Instructions Per Cycle) below 0.5 indicates severe execution stalls. On modern superscalar CPUs capable of retiring 4-6 instructions per cycle, IPC below 0.5 means the processor is stalled more than 87% of available execution slots. Primary causes: (1) Memory-bound: Check L1 miss rate - if above 5% with high LLC misses, the application is memory-bound waiting for DRAM (60-100ns latency vs 1-2ns L1). (2) Long dependency chains: If cache hit rates are good but IPC remains low, arithmetic operations form serial dependency chains preventing parallel execution. (3) Branch mispredictions: Check if bad speculation exceeds 10% - each misprediction costs 10-30 cycles of pipeline flush. Diagnosis: Run perf stat --topdown to classify into Frontend Bound, Backend Bound, Bad Speculation, or Retiring categories.
IPC above 2.0 on a 4-wide superscalar CPU indicates efficient execution utilizing more than half the available pipeline width. This suggests: (1) Good instruction-level parallelism (ILP) - independent instructions executing in parallel. (2) Low cache miss rates - data available when needed. (3) Accurate branch prediction - minimal pipeline flushes. (4) Well-scheduled code - compiler or manual optimization reducing stalls. However, even high IPC may have optimization potential: Check the Retiring metric in top-down analysis - if below 50%, there is still significant waste. IPC of 3.0-3.5 is achievable on well-optimized compute-bound code. Maximum theoretical IPC equals pipeline width (4.0) but is rarely achieved due to dependencies and resource constraints.
CPI (Cycles Per Instruction) greater than 4 indicates severe performance bottlenecks where each instruction takes 4+ cycles on average. This is critically poor on modern CPUs designed for CPI below 1. Root causes by priority: (1) Memory-bound with LLC misses: DRAM access costs 60-100ns (200+ cycles at 3GHz). If LLC miss rate exceeds 10 per 1000 instructions, memory latency dominates. (2) TLB misses: Page table walks cost 100-1000 cycles. Check DTLB and ITLB miss rates. (3) Resource starvation: Load/store buffer full, reorder buffer exhausted. (4) Severe branch misprediction: Misprediction rate above 5% with deep pipelines. Diagnosis approach: Use perf stat -e cycles,instructions,cache-misses,LLC-load-misses to isolate. If cache-misses/instructions ratio exceeds 5%, focus on memory optimization.
High L1 data cache miss rate (above 5-10% of loads) is caused by: (1) Poor spatial locality: Accessing non-contiguous memory (linked lists, pointer chasing, random access patterns). L1 operates on 64-byte cache lines - scattered accesses waste prefetched data. (2) Poor temporal locality: Working set exceeds L1 size (typically 32-48KB). Data is evicted before reuse. (3) Conflict misses: Multiple hot addresses mapping to same cache set due to power-of-2 strides matching cache associativity. (4) False sharing in multithreaded code: Different threads accessing different variables on same cache line causing invalidations. (5) Streaming access without prefetching: Sequential access faster than random, but still misses without hardware/software prefetch. Measure with: perf stat -e L1-dcache-loads,L1-dcache-load-misses. Miss rate = misses/loads. Above 5% warrants investigation.
High LLC miss rate indicates the application's working set exceeds CPU cache hierarchy, forcing frequent DRAM accesses at 60-100ns latency (vs 10-20ns for LLC hit). Implications: (1) Memory-bandwidth bound: If LLC misses exceed memory controller capacity, bandwidth saturation occurs. (2) Memory-latency bound: If misses are scattered (low bandwidth utilization), latency dominates. (3) Severe performance impact: Each LLC miss costs 200+ CPU cycles at 3GHz. Thresholds: LLC miss rate above 2-5 misses per 1000 instructions is concerning. For memory-intensive workloads, LLC miss rate above 10% of LLC references signals need for algorithmic changes (blocking, tiling) or larger cache/memory. Diagnosis: Use perf stat -e LLC-loads,LLC-load-misses,LLC-stores,LLC-store-misses. Correlate with memory bandwidth using memory controller counters or Intel Memory Bandwidth Monitoring.
Use Intel's Top-Down Microarchitecture Analysis Method: Run perf stat --topdown or VTune. Results classify pipeline slots into: (1) Retiring: Useful work (higher is better). (2) Bad Speculation: Wasted on mispredicted branches. (3) Frontend Bound: Instruction fetch/decode starved backend. (4) Backend Bound: Subdivides into Core Bound (execution units) and Memory Bound (waiting for data). Memory-bound indicators: Backend Bound > 50% with Memory Bound dominating Core Bound. Additionally: CPI > 1.5-2.0, high LLC misses, memory bandwidth approaching system limits. Compute-bound indicators: High Retiring percentage (>50%), Core Bound significant, low cache miss rates, IPC approaching pipeline width. Quick check: If adding more memory bandwidth (faster RAM, more channels) would help, it is memory-bound. If faster CPU clock would help proportionally, it is compute-bound.
SIMD Vector Operations
15 questions
Break-even depends on setup overhead vs per-element savings: Typical overhead: 5-20 cycles for vector setup (alignment checks, loading constants, mask setup). Per-iteration savings: 4-8x for compute-bound code, 2-4x for memory-bound (limited by bandwidth). Break-even calculation: vector_time = setup + N/vector_width * vector_op_time. scalar_time = N * scalar_op_time. Solve for N where vector_time < scalar_time. Example: setup=20 cycles, vector_op=10 cycles per 8 elements, scalar=10 cycles per element. 20 + (N/8)*10 < N*10. Solving: N > 20/(10 - 10/8) ≈ 2.3 elements. In practice, vectorization often wins for N > 16-32 elements. For small N (< 8 elements), scalar is often faster. For N=1-3, definitely use scalar. Measure both and pick threshold empirically for your specific workload.
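A minimal sketch of acting on that break-even: dispatch on element count. The 16-element threshold is an assumption to be replaced by your own measurement.

```cpp
#include <cstddef>
#include <immintrin.h>

constexpr std::size_t kVectorThreshold = 16;   // assumed; tune empirically

float sum(const float* a, std::size_t n) {
    if (n < kVectorThreshold) {                // small n: setup cost dominates
        float s = 0.0f;
        for (std::size_t i = 0; i < n; ++i) s += a[i];
        return s;
    }
    __m256 vs = _mm256_setzero_ps();           // AVX2 path for larger n
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8)
        vs = _mm256_add_ps(vs, _mm256_loadu_ps(a + i));
    float tmp[8];
    _mm256_storeu_ps(tmp, vs);
    float s = tmp[0] + tmp[1] + tmp[2] + tmp[3]
            + tmp[4] + tmp[5] + tmp[6] + tmp[7];
    for (; i < n; ++i) s += a[i];              // scalar tail
    return s;
}
```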
Vector permute/shuffle costs depend on the operation type: In-lane shuffle (within 128-bit): 1 cycle latency, 1/cycle throughput. Examples: _mm256_shuffle_ps (imm8), _mm_shuffle_epi32. Cross-lane permute: 3-5 cycle latency, 1/cycle throughput. Examples: _mm256_permutevar8x32_ps, _mm256_permute2f128_ps. Variable permute (indices from register): 3-7 cycles depending on CPU. _mm256_permutevar_ps, _mm512_permutexvar_ps. Blend operations: 1 cycle latency. _mm256_blend_ps (immediate mask). Broadcast: 3-5 cycles for memory source, 1 cycle for register source. The high cost of cross-lane operations makes algorithm design important. Prefer structures where data naturally falls into independent lanes. When cross-lane is unavoidable, amortize cost over multiple computations.
Gather loads from multiple non-contiguous memory locations into a vector. Scatter writes from vector to multiple locations. Usage: __m256i indices = _mm256_set_epi32(7,6,5,4,3,2,1,0); // element indices; each address is base + index*scale. __m256 result = _mm256_i32gather_ps(base_ptr, indices, 4); // scale=4 bytes for float. With these indices it loads base[0]..base[7]; the same instruction handles non-contiguous indices such as (0,5,10,15,20,25,30,35). Scatter (AVX-512): _mm256_i32scatter_ps(base_ptr, indices, values, 4). Performance: gather is 5-20 cycles depending on cache hits (vs 4-5 for contiguous load). Each unique cache line accessed adds latency. Best when: index pattern is computed (like hash table lookup), accessing sparse data, avoiding scalar fallback. Avoid when: indices cause many cache misses, pattern could be restructured for contiguous access.
Use comparison and masking/blending instead of branches: (1) Compare: __m256 mask = _mm256_cmp_ps(a, b, _CMP_LT_OS); // creates -1 where true, 0 where false. (2) Blend: result = _mm256_blendv_ps(false_val, true_val, mask); // selects true_val where mask is -1. For integer: __m256i mask = _mm256_cmpgt_epi32(a, b); result = _mm256_blendv_epi8(false_val, true_val, mask); Alternative using AND/ANDNOT: result = _mm256_or_ps(_mm256_and_ps(mask, true_val), _mm256_andnot_ps(mask, false_val)); Both paths compute - blend just selects which result to use. This is branchless - no prediction, no divergence. Cost is ~2-4 cycles for compare+blend vs potential 10-20 cycles for branch mispredict. Even if one path is rarely taken, predicated execution is often faster.
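A minimal sketch of the compare-plus-blend pattern over a whole array: both candidate values exist up front and the mask selects per lane, so there is no branch to mispredict.

```cpp
#include <cstddef>
#include <immintrin.h>

// Branchless "out[i] = (a[i] < limit) ? small_v : large_v", 8 floats at a time.
void select_by_limit(const float* a, float* out, std::size_t n,
                     float limit, float small_v, float large_v) {
    const __m256 vlimit = _mm256_set1_ps(limit);
    const __m256 vsmall = _mm256_set1_ps(small_v);
    const __m256 vlarge = _mm256_set1_ps(large_v);
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va   = _mm256_loadu_ps(a + i);
        __m256 mask = _mm256_cmp_ps(va, vlimit, _CMP_LT_OS);   // all-ones where a<limit
        __m256 res  = _mm256_blendv_ps(vlarge, vsmall, mask);  // per-lane select
        _mm256_storeu_ps(out + i, res);
    }
    for (; i < n; ++i)                                          // scalar tail
        out[i] = (a[i] < limit) ? small_v : large_v;
}
```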
Loop-carried dependencies (each iteration depends on the previous) limit parallelism but can still benefit from SIMD: (1) Unroll and interleave - process multiple independent streams in parallel. If you have 4 independent chains, vectorize across them. (2) Recurrence reformulation - some dependencies can be transformed. For prefix sum: instead of s[i]=s[i-1]+a[i], compute partial sums within vectors, then propagate carry: sums_in_vec = parallel_sum(a[0:7]); carry = broadcast(sums_in_vec[7]); add carry to next vector. (3) Domain-specific: for IIR filters, transform to parallel form. For matrix operations, block to expose parallelism. (4) Multiple accumulators for reductions - unroll loop to have 4 independent accumulators, combine at end. (5) Sometimes dependencies are false - analyze to see if there's actually independence.
Optimal vector length depends on memory hierarchy level: For L1 cache (bandwidth 128-256 GB/s): wider vectors better, AVX-512 maximizes throughput. For L2 cache (50-100 GB/s): AVX2 or AVX-512, depends on available ports. For L3 cache (30-50 GB/s): AVX2 sufficient, often limited by cache bandwidth. For main memory (20-50 GB/s): often bandwidth-saturated with AVX2. Diminishing returns from wider vectors. Formula: bytes_per_cycle = min(load_ports * vector_width, cache_bandwidth). Modern CPUs have 2 load ports. At 3GHz with AVX2 (32 bytes) that's 192 GB/s theoretical - exceeds DRAM bandwidth. For memory-bound code, AVX2 (256-bit) is usually sufficient. AVX-512 adds power consumption and potential frequency throttling. Profile to find where bandwidth saturates.
To vectorize scalar computation: (1) Identify independent iterations - if loop iterations don't depend on each other, they parallelize. (2) Replace scalar types with vector types: float -> __m256 (8 floats). (3) Replace operations with SIMD equivalents: a+b -> _mm256_add_ps(a,b). (4) Load data with vector loads: _mm256_load_ps(ptr) for 8 consecutive floats. (5) Apply same operation across all lanes. (6) Store results: _mm256_store_ps(ptr, result). Example: for(i=0;i<N;i++) c[i]=a[i]+b[i]; becomes: for(i=0;i<N;i+=8) { __m256 va=_mm256_load_ps(&a[i]); __m256 vb=_mm256_load_ps(&b[i]); _mm256_store_ps(&c[i], _mm256_add_ps(va,vb)); }. Compilers often auto-vectorize simple loops with -O3 -march=native, but explicit intrinsics give more control.
Lane-crossing operations move data between SIMD lanes (shuffle, permute): (1) _mm256_permute_ps - reorder within 128-bit lanes using immediate. (2) _mm256_shuffle_ps - interleave elements from two sources within lanes. (3) _mm256_permutevar8x32_ps - arbitrary permutation across all 8 lanes using index vector. (4) _mm256_blend_ps - select elements from two sources based on mask. These operations are expensive: 3-7 cycle latency vs 1 cycle for non-crossing operations. Strategies to minimize: (1) Design algorithms to avoid cross-lane dependencies. (2) Batch multiple cross-lane operations together. (3) Load data in the arrangement you'll need. (4) Use horizontal operations (_mm256_hadd_ps) which handle common patterns efficiently. (5) For reductions, tree-reduce within lanes first, then cross lanes only at the end.
Unaligned access penalty varies by CPU generation: Pre-Nehalem (before 2008): severe penalty, 100+ cycles, required alignment. Nehalem onwards: ~3-4 cycle penalty if crossing cache line boundary, near-zero penalty within cache line. Haswell/Skylake (2013+): typically 0-1 cycle penalty for unaligned loads within cache line, ~5-10 cycle penalty when crossing cache lines. AVX-512: cache line splits more likely with wider vectors, 10-15 cycle penalty for splits. The penalty occurs when a single load spans two cache lines (64-byte boundary). With 32-byte AVX2 loads, roughly 50% of unaligned loads cross boundaries in random access. Best practice: always align data when possible. If alignment is unpredictable, use unaligned load intrinsics (_mm256_loadu_ps) - they work for any alignment with the CPU handling penalties automatically.
Align data to vector register width for best performance: 16-byte alignment for SSE (128-bit). 32-byte alignment for AVX2 (256-bit). 64-byte alignment for AVX-512 (512-bit). Methods: (1) Static allocation: use __attribute__((aligned(32))) for GCC/Clang, __declspec(align(32)) for MSVC. (2) Dynamic allocation: use aligned_alloc(alignment, size) or posix_memalign(&ptr, alignment, size). (3) Struct padding: add padding to make array start aligned. (4) Standard specifier: alignas(32) in C++11 and later. (5) For stack variables, the compiler handles alignment when given the proper attributes. (6) Verify alignment at runtime: ((uintptr_t)ptr % alignment) == 0. Unaligned access penalty: historically severe (~100 cycles), modern CPUs handle unaligned much better (0-10 cycle penalty), but aligned is still optimal.
Most vector ALU operations have the same latency as their scalar equivalents: Integer add/sub/bitwise: 1 cycle (both scalar and vector). Integer multiply: 4-5 cycles (both). Floating-point add/sub: 3-4 cycles. Floating-point multiply: 4-5 cycles. Floating-point FMA: 4-5 cycles. Floating-point divide: 10-14 cycles (scalar), 10-23 cycles (vector, varies by width). Throughput differs: scalar and vector forms issue at similar per-cycle rates, but each vector operation processes 4-16 elements, so net element throughput is roughly 4-16x higher. AVX-512 on some CPUs reduces clock frequency or contends for ports, lowering realized throughput. The latency similarity makes porting scalar code to SIMD straightforward for latency estimates - divide the operation count by the SIMD width to get the effective number of operations.
Vectorized table lookup uses gather operations or SIMD permutes: For small tables (16 entries or less): use pshufb (_mm_shuffle_epi8). Load table into vector register, indices into another. Result = _mm_shuffle_epi8(table, indices); Each index byte selects corresponding table byte. Fast: 1-2 cycles. For larger tables: use gather. __m256i indices = ...; __m256 result = _mm256_i32gather_ps(table, indices, 4); This loads table[indices[0]], table[indices[1]], etc. Cost: 5-20 cycles depending on cache hits. For very large tables: consider reorganizing to improve cache locality of lookups, or precompute/interpolate instead of lookup. Alternative: if indices have limited range, use multiple small table lookups with masking to combine results.
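A minimal sketch of the small-table pshufb case: a 16-entry nibble table applied to every byte, here the classic per-byte popcount building block.

```cpp
#include <immintrin.h>

// _mm_shuffle_epi8 uses each index byte to select a byte from the table.
// The table maps a nibble (0-15) to its popcount; low and high nibbles are
// looked up separately and summed.
__m128i popcount_bytes(__m128i v) {
    const __m128i lut      = _mm_setr_epi8(0,1,1,2, 1,2,2,3, 1,2,2,3, 2,3,3,4);
    const __m128i low_mask = _mm_set1_epi8(0x0f);
    __m128i lo     = _mm_and_si128(v, low_mask);                     // low nibbles
    __m128i hi     = _mm_and_si128(_mm_srli_epi16(v, 4), low_mask);  // high nibbles
    __m128i cnt_lo = _mm_shuffle_epi8(lut, lo);    // table[lo[i]] per byte
    __m128i cnt_hi = _mm_shuffle_epi8(lut, hi);    // table[hi[i]] per byte
    return _mm_add_epi8(cnt_lo, cnt_hi);           // popcount of each byte
}
```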
When N is not divisible by vector width, handle the remainder: (1) Scalar cleanup: vectorize main loop for (N/8)*8 elements, scalar loop for remaining N%8. Simple but adds code. (2) Masked operations: use a mask to process only valid lanes in the final iteration. AVX-512: _mm512_mask_load_ps(zeros, mask, ptr). AVX2: load the full vector and mask invalid results, or use _mm256_maskload_ps/_mm256_maskstore_ps for 32/64-bit elements. (3) Padding: pad arrays to a multiple of vector width. Simplest and fastest if padding is acceptable. (4) Overlap last iteration: last vector includes some already-processed elements. Works for idempotent reductions (max, min) and element-wise transforms where rewriting the same outputs is harmless; not for sums, which would double-count. (5) Unrolled cleanup: specialized code paths for each possible remainder (0-7 for AVX2). More code, but no per-element loop overhead. The best approach depends on loop count distribution - if N is usually large, simple scalar cleanup has negligible overhead.
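A minimal AVX2 sketch of option (2): handle the final n%8 elements with _mm256_maskload_ps/_mm256_maskstore_ps so inactive lanes are neither read nor written.

```cpp
#include <cstddef>
#include <cstdint>
#include <immintrin.h>

void scale_with_masked_tail(const float* a, float* out, std::size_t n, float k) {
    const __m256 vk = _mm256_set1_ps(k);
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8)                      // full-width main loop
        _mm256_storeu_ps(out + i, _mm256_mul_ps(vk, _mm256_loadu_ps(a + i)));

    std::size_t rem = n - i;                        // 0..7 leftover elements
    if (rem) {
        alignas(32) std::int32_t m[8];
        for (int j = 0; j < 8; ++j)
            m[j] = (static_cast<std::size_t>(j) < rem) ? -1 : 0;  // -1 = active lane
        __m256i mask = _mm256_load_si256(reinterpret_cast<const __m256i*>(m));
        __m256 va = _mm256_maskload_ps(a + i, mask);      // inactive lanes not read
        _mm256_maskstore_ps(out + i, mask, _mm256_mul_ps(vk, va));
    }
}
```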
Vectorized reduction has two phases - parallel reduction within vectors, then horizontal reduction to scalar: (1) Parallel phase: maintain vector accumulator, operate on 8 elements at once. __m256 vsum = _mm256_setzero_ps(); for(i=0;i<N;i+=8) vsum = _mm256_add_ps(vsum, _mm256_load_ps(&arr[i])); (2) Horizontal reduction: reduce 8-element vector to single value. For AVX2 sum: __m128 hi = _mm256_extractf128_ps(vsum, 1); __m128 lo = _mm256_castps256_ps128(vsum); __m128 sum4 = _mm_add_ps(hi, lo); sum4 = _mm_add_ps(sum4, _mm_movehl_ps(sum4, sum4)); sum4 = _mm_add_ss(sum4, _mm_shuffle_ps(sum4, sum4, 1)); float result = _mm_cvtss_f32(sum4); This is log2(SIMD_width) operations. For max/min, replace _add_ps with _max_ps or _min_ps.
Vector loads and scalar loads have similar latency but different throughput: Both have 4-5 cycle latency to L1 cache. Scalar load moves 4 or 8 bytes per operation. Vector load moves 16 bytes (SSE), 32 bytes (AVX2), or 64 bytes (AVX-512) per operation. Throughput: modern CPUs support 2 loads per cycle regardless of width - so vector loads achieve 2-16x the bandwidth. For L1 hits: scalar = 4 bytes * 2/cycle = 8 bytes/cycle. AVX2 vector = 32 bytes * 2/cycle = 64 bytes/cycle. The key advantage is bandwidth efficiency - one vector load fills a vector register for multiple operations. For L2/L3/memory, the advantage compounds as each cache line transfer (64 bytes) fills more of a vector register with fewer operations.
VLIW Instruction Scheduling
15 questions
A WAR hazard (anti-dependency) occurs when an instruction writes a location that a previous instruction reads. Example: MUL R4, R1, R5 followed by ADD R1, R2, R3 - if ADD completes before MUL reads R1, we get wrong results. In VLIW with in-order execution, WAR hazards are typically not a problem within a single bundle since reads happen before writes in the same cycle. Across bundles, the compiler ensures reads complete before writes to the same location. To eliminate WAR dependencies and enable more reordering: (1) Use register renaming - assign ADD's result to a different register R6 instead of R1. (2) Use SSA (Static Single Assignment) form during compilation where each variable is assigned exactly once, eliminating all anti-dependencies.
To balance execution unit utilization: (1) Profile the code to identify which units are bottlenecks. If loads saturate but ALU is idle, the code is memory-bound. (2) Apply strength reduction - replace expensive operations with cheaper ones (multiply by constant -> shifts and adds). (3) Convert memory accesses to computation where possible - compute values instead of loading from tables. (4) For memory-bound code: improve spatial locality, prefetch, use wider vector loads. (5) For compute-bound code: ensure you're using vector units for parallel computation. (6) Loop unrolling helps balance by providing more operations to schedule. (7) Software pipelining interleaves iterations to fill all units. The goal is to have all units busy every cycle - any idle unit represents wasted throughput.
No, each execution unit can only execute one operation per cycle, so it can appear at most once per VLIW bundle. If a processor has one integer ALU and one multiplier, a single bundle can contain at most one integer ALU operation and one multiply. This is a fundamental hardware constraint - each unit has one set of input registers, one execution pipeline, and one output. Some processors have multiple instances of the same unit type (e.g., two ALUs, two load units) to increase parallelism. In that case, you can have two ALU operations per bundle, but each uses a different physical ALU. The architecture specification defines how many of each unit type exist and which instruction slots can access which units.
The number of cycles depends on the operation latency. Typical latencies: integer ALU operations (ADD, SUB, AND, OR, XOR, shift) are 1 cycle - the result is available next cycle. Integer multiply is 3-4 cycles. Floating-point add/subtract is 3-4 cycles. Floating-point multiply is 4-5 cycles. Floating-point divide is 10-20+ cycles. Load from L1 cache is 3-4 cycles (but could be 10-20 for L2, 100+ for memory). Store-to-load forwarding may add cycles. In VLIW, you must explicitly schedule the consumer N cycles after the producer where N is the latency. If ADD has 1-cycle latency, the dependent operation goes in cycle+1. The compiler must know exact latencies for the target processor to generate correct code.
To identify independent operations for parallel VLIW execution, perform data dependency analysis on the instruction stream. Two operations are independent if there is no Read-After-Write (RAW), Write-After-Read (WAR), or Write-After-Write (WAW) dependency between them. Build a Data Dependency Graph (DDG) where nodes are operations and edges represent dependencies. Operations with no edges between them can execute in parallel. For register dependencies, check if operation B reads a register that operation A writes (RAW), if B writes a register A reads (WAR), or if both write the same register (WAW). For memory dependencies, perform alias analysis - if two memory operations might access the same address, assume a dependency unless proven otherwise. Use liveness analysis to identify dead values that create false dependencies removable via register renaming.
In most VLIW architectures, you cannot leave a slot empty - the instruction word has a fixed width and every slot must contain an opcode. A NOP explicitly encodes 'do nothing' for that functional unit. The costs are: (1) Code size - NOPs consume bits in the instruction word, increasing binary size by 20-40% in typical code. (2) Fetch bandwidth - every NOP is fetched from memory/cache, wasting bandwidth. (3) Decode energy - the NOP must still be decoded. (4) I-cache pollution - larger code means more cache misses. Some architectures (like Itanium with its template bits) can compress multiple NOPs, and instruction compression schemes exist, but the fundamental cost remains. Studies show 28-32% of VLIW slots contain NOPs in typical benchmarks.
Several techniques fill empty VLIW slots: (1) Loop unrolling - replicate loop body N times to create N independent iterations worth of operations, dramatically increasing ILP. (2) Software pipelining - overlap iterations so operations from different iterations fill slots in the same bundle. (3) Speculative execution - move operations from after branches into empty slots, adding compensation code if needed. (4) Trace scheduling - schedule across basic block boundaries along predicted execution paths. (5) If-conversion - convert branches to predicated operations that always execute. (6) Superblock formation - create larger scheduling regions by tail duplicating merge points. When all else fails, insert NOPs, but minimize this as it wastes fetch bandwidth and code cache space.
Each operation type maps to specific functional units. ALU handles scalar integer operations (add, sub, and, or, xor, shift, compare). VALU (Vector ALU) handles SIMD operations on vector registers. Load unit handles memory reads. Store unit handles memory writes. Flow/branch unit handles jumps, calls, and predicate manipulation. The scheduler must: (1) Identify which unit each operation requires. (2) Track unit availability - each unit can execute at most one operation per cycle. (3) When scheduling, check if the required unit is free in the target cycle. (4) Balance the load - if code is heavy on loads but light on ALU, restructure to convert memory operations to computation where possible. (5) For vector code, ensure VALU operations are balanced with memory bandwidth for the load/store units.
A WAW hazard (output dependency) occurs when two instructions write to the same location and the order matters. Example: ADD R1, R2, R3 followed by SUB R1, R4, R5 - if these execute out of order, R1 ends up with the wrong final value. In VLIW, WAW hazards within a single bundle are typically forbidden by the architecture - you cannot have two operations in the same bundle writing to the same register or memory location. Across bundles, the compiler maintains program order. To eliminate WAW dependencies: (1) Use register renaming so each write targets a different physical register. (2) In SSA form, each definition gets a unique name, eliminating WAW entirely. This frees the scheduler to reorder operations that previously had false output dependencies.
This is processor-specific, but typical mappings are: Scalar ALU - integer arithmetic (add, sub), logical (and, or, xor, not), shifts, comparisons, conversions. Multiply unit - integer multiply, multiply-accumulate (often separate from ALU due to longer latency). Vector ALU (VALU) - SIMD versions of arithmetic, logical, and comparison operations on vector registers. Load unit - load byte/halfword/word/doubleword/vector from memory. Store unit - store byte/halfword/word/doubleword/vector to memory. Branch/Flow unit - conditional and unconditional branches, calls, returns, predicate register manipulation. Some architectures have specialized units for floating-point or divide operations. Check the processor reference manual for exact mappings, as some operations may be restricted to specific slots.
Data dependency detection involves analyzing register and memory operands: (1) For registers: track which registers each instruction reads (source operands) and writes (destination operands). A RAW dependency exists from instruction A to B if B reads a register A writes. A WAR dependency exists if B writes a register A reads. A WAW exists if both write the same register. (2) For memory: this is harder due to aliasing. Use alias analysis to determine if two memory references might access the same location. Pointers through different arrays are independent. Same base with different constant offsets may be independent if offsets differ by element size. Unknown offsets must be assumed dependent. (3) Build the dependency graph with edges for each dependency, labeled with the minimum cycle distance (operation latency for RAW).
Instruction reordering for VLIW follows these steps: (1) Build the Data Dependency Graph (DDG) capturing all true dependencies (RAW) and output dependencies (WAW). (2) Compute operation priorities based on critical path length - operations on longer paths get higher priority. (3) Use list scheduling to place operations: each cycle, fill available slots with highest-priority ready operations. (4) Apply transformations to increase ILP: rename registers to eliminate false WAR/WAW dependencies, unroll loops to expose iteration-level parallelism, apply strength reduction and common subexpression elimination. (5) For loops, use modulo scheduling to overlap iterations. (6) Use profile data to prioritize hot paths. The key insight: you can reorder any operations without true data dependencies, and many apparent dependencies can be removed through renaming.
A RAW hazard (also called a true dependency or flow dependency) occurs when an instruction reads a value that a previous instruction writes. For example: ADD R1, R2, R3 followed by MUL R4, R1, R5 - the MUL must wait for ADD to complete. In VLIW, unlike superscalar, there is no hardware interlock - the compiler must ensure correct timing. To avoid RAW hazards: (1) Schedule the consumer operation enough cycles after the producer to account for the producer's latency. If ADD has 1-cycle latency, MUL can be in the next bundle. If a load has 3-cycle latency, the consumer must be at least 3 bundles later. (2) Fill the gap with independent operations to hide latency. (3) Use software pipelining to overlap iterations so operations from different iterations fill latency slots.
The standard algorithm is list scheduling with resource constraints. First, build a Data Dependency Graph where edges encode latencies. Compute the priority of each operation (typically critical path length to exit). Maintain a ready list of operations whose predecessors have completed. On each cycle: (1) For each available functional unit, select the highest-priority ready operation that can use that unit. (2) Place selected operations into the current bundle. (3) Advance the cycle, update ready list based on operation latencies. (4) Repeat until all operations scheduled. For better results, use backtracking if a dead-end is reached. Software pipelining algorithms like modulo scheduling are used for loops, computing an Initiation Interval (II) and scheduling operations modulo II cycles.
Typical execution unit latencies (processor-specific, check your target): Integer ALU (add, sub, logic, shift): 1 cycle. Integer multiply: 3-4 cycles. Integer divide: 10-40 cycles (often not pipelined). Floating-point add/sub: 3-4 cycles. Floating-point multiply: 4-5 cycles. Floating-point divide: 10-20+ cycles. Floating-point sqrt: 15-30 cycles. Vector ALU operations: same as scalar equivalents but on vector registers. Load from L1 cache: 3-4 cycles. Load from L2 cache: 10-15 cycles. Load from main memory: 100-300 cycles. Store: typically 1 cycle to issue (writes to store buffer), but store-to-load forwarding may add latency. Branch: often 0-1 cycles if predicted correctly, full pipeline flush if mispredicted.
Memory Hierarchy
12 questions
AoS (Array of Structs): struct Point {float x,y,z;}; Point points[N]; Memory: x0,y0,z0,x1,y1,z1,... SoA (Struct of Arrays): struct Points {float x[N],y[N],z[N];}; Memory: x0,x1,x2,...,y0,y1,y2,...,z0,z1,z2... For SIMD, SoA is generally better: (1) Contiguous access - loading all x values uses sequential loads. (2) Vector operations operate on same field of multiple elements. (3) No shuffling needed after load. AoS requires gather operations or load-transpose to separate fields. AoS advantages: better cache locality when accessing all fields of one element, natural object representation. Hybrid: AoSoA - Array of Structs of Arrays. struct Block {float x[8],y[8],z[8];}; Block blocks[N/8]; Combines SIMD-friendly layout within blocks with cache locality across blocks.
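As a concrete sketch (hypothetical names, assuming AVX2 and float fields), the three layouts and an SoA-friendly loop might look like:

#include <immintrin.h>

#define N 1024

/* AoS: the fields of one element are adjacent in memory */
struct PointAoS { float x, y, z; };
struct PointAoS points_aos[N];

/* SoA: each field is its own contiguous array */
struct PointsSoA { float x[N], y[N], z[N]; };
struct PointsSoA points_soa;

/* AoSoA: SIMD-width blocks of SoA, keeping fields of nearby elements close */
struct Block { float x[8], y[8], z[8]; };
struct Block blocks[N / 8];

/* With SoA, one vector load fetches 8 consecutive x values - no gather or shuffle */
void scale_x_soa(struct PointsSoA *p, float s) {
    __m256 vs = _mm256_set1_ps(s);
    for (int i = 0; i < N; i += 8) {
        __m256 vx = _mm256_loadu_ps(&p->x[i]);   /* contiguous load of 8 x values */
        _mm256_storeu_ps(&p->x[i], _mm256_mul_ps(vx, vs));
    }
}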
Optimal batch layout balances SIMD efficiency with cache locality: (1) SoA within batch: separate arrays for each field, enables contiguous SIMD loads. For N elements: float x[N], y[N], z[N]; (2) Batch size = SIMD width multiple: process 8/16/32 elements per batch to maximize vector utilization. (3) Batch fits in cache: keep total batch data (all fields) under L1 size (32KB) for repeated access patterns. (4) Align batch starts: each array aligned to vector width (32 bytes for AVX2). (5) For streaming: batch size tuned to prefetch distance. Too small = overhead, too large = prefetch misses. (6) AoSoA for multi-pass: struct { float x[8],y[8],z[8]; } block[N/8]; Each block fits cache line, fields within block are contiguous. Profile different batch sizes for your access patterns.
Place in scratch memory: (1) Frequently accessed data - anything accessed multiple times per computation. (2) Working set data that fits - arrays actively being processed. (3) Data with poor cache behavior - random access patterns that would cause many cache misses. (4) Small lookup tables - constant data accessed repeatedly. (5) Temporary buffers - intermediate results reused within a kernel. Keep in main memory: (1) Data too large for scratch - must be tiled and processed in chunks. (2) Infrequently accessed data - setup overhead not worth it. (3) Data needed by multiple cores - unless explicit sharing mechanism exists. (4) Read-once data - streaming input that won't be reused. Decision process: profile to find hot data, calculate working set size, tile if necessary, measure speedup from scratch placement.
Cache-optimal tree layouts minimize cache misses during traversal: (1) Eytzinger/BFS layout: store nodes level-by-level. Node at index i has children at 2i+1 and 2i+2. Consecutive levels are contiguous in memory. (2) Van Emde Boas layout: recursively place subtrees contiguously. Achieves O(log_B N) cache misses for any cache line size B. Theoretically optimal. (3) B-tree style: store many keys per node sized to cache line (64 bytes). Binary search within node is cache-efficient. (4) Split keys from children: store all keys contiguously, children pointers separately. Search only touches keys until leaf reached. (5) Align nodes to cache line boundaries (64 bytes). (6) For static trees, offline compute optimal layout. For dynamic trees, use B-trees or cache-oblivious B-trees. Measure actual cache miss rates to validate layout choice.
Pad structures to achieve alignment and avoid false sharing: (1) Add explicit padding: struct S { float data[7]; float pad; }; // 32 bytes total, so array elements stay 32-byte aligned once the type itself is aligned. (2) Use alignment attributes: struct alignas(32) S { float data[7]; }; Compiler adds padding automatically. (3) Pad to cache line for false sharing avoidance: struct alignas(64) S { ... }; (4) For arrays, ensure sizeof(element) is multiple of alignment: struct alignas(32) S { float v[8]; }; // 32 bytes each, array naturally aligned. (5) Pad between arrays if they might alias: float a[N] __attribute__((aligned(64))); float padding[16]; float b[N] __attribute__((aligned(64))); (6) Verify: assert((uintptr_t)&s % 32 == 0). Over-padding wastes memory but under-padding causes performance penalties or crashes with aligned load instructions.
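A minimal C++ sketch of points (1)-(4) and (6), assuming a 32-byte vector width and a 64-byte cache line:

#include <cassert>
#include <cstdint>

// Element size is a multiple of 32 and the type is 32-byte aligned,
// so every array element starts on a 32-byte boundary.
struct alignas(32) Vec7 {
    float data[7];   // 28 bytes of payload
    float pad;       // explicit padding to 32 bytes (alignas would also pad)
};
static_assert(sizeof(Vec7) == 32, "element size must stay a multiple of the alignment");

// Cache-line padding to keep per-thread counters from sharing a line (false sharing).
struct alignas(64) Counter {
    std::uint64_t value;  // remaining 56 bytes are implicit padding
};

int main() {
    Vec7 v[4];
    assert(reinterpret_cast<std::uintptr_t>(&v[1]) % 32 == 0);  // step (6): verify alignment
    Counter c[2];
    assert(reinterpret_cast<std::uintptr_t>(&c[1]) -
           reinterpret_cast<std::uintptr_t>(&c[0]) == 64);      // one cache line apart
    return 0;
}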
Manual scratch memory management involves explicit placement and lifetime control: (1) Define scratch region: use linker script or pragma to place arrays in scratch address space. Example: #pragma DATA_SECTION(buffer, ".scratch"). (2) Static allocation: assign fixed offsets for each buffer. Total usage must fit scratch size. (3) Dynamic allocation: implement simple allocator (bump pointer or stack) within scratch region. (4) Double buffering: allocate two buffers, DMA loads one while processing other. Alternate each iteration. (5) Tiling: if data exceeds scratch, divide into tiles. Load tile to scratch, process, store results, load next tile. (6) Lifetime management: reuse scratch space - once buffer is no longer needed, its space can be reused. (7) For multi-core: partition scratch between cores or implement locking for shared sections.
Scratch memory (also called scratchpad or local memory) is a fast, software-managed memory separate from the cache hierarchy. Differences from main memory: (1) Access latency: scratch typically 1-10 cycles vs 100-300 cycles for DRAM. (2) Management: software explicitly loads/stores data to scratch vs hardware-managed caching for main memory. (3) Size: typically 16KB-256KB per core vs gigabytes for main memory. (4) Addressing: often uses separate address space or explicit DMA transfers. (5) Coherence: scratch is usually not coherent across cores vs hardware cache coherence for main memory. (6) Predictability: scratch access time is deterministic vs cache misses causing variable latency. Common in DSPs, GPUs, and embedded processors. Programmer copies working data to scratch, computes, copies results back.
Scratch memory sizes vary by processor: Texas Instruments C66x DSP: 32KB L1D (scratchpad mode), 32KB L1P, configurable L2 up to 1MB. Qualcomm Hexagon DSP: 32-64KB L1 per thread, 256KB-1MB L2. NVIDIA GPUs (for comparison): 48KB shared memory per SM (configurable vs L1). Intel Itanium: no dedicated scratch, but 32KB L1D. Cell SPE: 256KB local store (pure scratchpad, no cache). Typical embedded VLIW: 16KB-64KB. Common strategy: L1 operates as scratchpad (software-managed), L2 operates as cache. When data exceeds scratch, tile the computation - process chunks that fit, load next chunk, repeat. Maximum efficient working set = scratch size - algorithm temporary space. Plan data layout around these constraints.
Register spills occur when live values exceed available registers, forcing stores/reloads: (1) Reduce live ranges: restructure code so values are produced close to where they're consumed. (2) Reduce unrolling: excessive unrolling increases live values. Find balance between ILP and register pressure. (3) Recompute vs spill: if recomputing a value is cheaper than spill+reload (typically 6-8 cycles), recompute. (4) Loop tiling: process smaller tiles to reduce working set. (5) Use compiler hints: register keyword (often ignored), inline carefully. (6) Profile register usage: compiler can report register pressure. (7) Restructure algorithms: some algorithms are inherently register-heavy. Consider alternatives. (8) For VLIW: scalar and vector registers are separate pools - balance across both. VLIW processors often have 32-128 registers, but complex kernels can still exhaust them.
Register spill costs depend on where data goes: Spill to L1 cache: store 1 cycle to issue, reload 4-5 cycles latency = ~5-6 cycles total impact. But if many spills, they can overlap. Spill to stack (likely L1): same as above, ~5-6 cycles. If L1 is full, spill to L2: store 1 cycle, reload 12-15 cycles = ~13-16 cycles impact. In extreme cases (L2 full): spill to L3 or memory, 40-300+ cycles. The hidden cost: spilled value unavailable during reload latency, creating pipeline stalls if immediately needed. Additional cost: spills/reloads consume load/store unit bandwidth that could be used for real data access. Memory bandwidth overhead: 4-8 bytes per spill depending on value size. For VLIW vector registers: 32-64 bytes per spill for 256-512 bit registers. Minimize spills - they're expensive!
Prefetching loads data into cache before it's needed: (1) Hardware prefetch: CPUs automatically detect sequential/strided patterns. Help by using predictable access patterns. (2) Software prefetch: _mm_prefetch(addr, _MM_HINT_T0); loads into L1. _MM_HINT_T1/T2 for L2/L3. (3) Prefetch distance: issue prefetch far enough ahead to hide latency. Distance = memory_latency / cycles_per_iteration. For 200-cycle latency and 10-cycle iteration: prefetch 20 iterations ahead. (4) Avoid over-prefetching: too many prefetches evict useful data, saturate memory bandwidth. (5) Prefetch loop pattern: for(i=0;i<N;i++) { _mm_prefetch(&a[i+DIST], ...); process(a[i]); }. (6) Non-temporal prefetch: _MM_HINT_NTA for streaming data that won't be reused - avoids polluting cache. (7) Validate with performance counters - check cache miss rates and memory bandwidth.
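A minimal sketch of the prefetch loop pattern in (5), assuming DIST is tuned to your loop's per-iteration cost and memory latency:

#include <immintrin.h>

enum { DIST = 16 };   // prefetch distance in elements; tune empirically

float sum_with_prefetch(const float *a, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        if (i + DIST < n)
            _mm_prefetch((const char *)&a[i + DIST], _MM_HINT_T0);  // pull a future line into L1
        sum += a[i];
    }
    return sum;
}

At DIST=16 with a roughly 10-cycle loop body this covers on the order of 160 cycles of latency; increase DIST (or switch to _MM_HINT_NTA for streaming data) if loads still miss.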
Typical latencies: Scratch/scratchpad memory: 1-10 cycles (similar to L1 cache but software-managed). L1 cache: 4-5 cycles. L2 cache: 12-15 cycles. L3 cache: 40-50 cycles. Main memory (DRAM): 200-300 cycles (~100ns at 3GHz). Scratch memory achieves L1-like latency because it's SRAM close to the processor, but unlike cache it doesn't have tag lookup overhead. The key difference is determinism - scratch always takes the same time, while cache access varies based on hit/miss. For DSPs with scratch: Texas Instruments C6x has 32KB L1D with 1-cycle access. For GPUs: shared memory is 20-40 cycles vs 400-800 cycles for global memory. The 20-100x speed difference makes data placement in scratch critical for performance.
Cycle Optimization
12 questions
To estimate VLIW cycle count: (1) Build the Data Dependency Graph (DDG) with operation latencies as edge weights. (2) Find the critical path - the longest weighted path through the DDG. This is the minimum cycles assuming unlimited resources. (3) Account for resource constraints: if code has more operations of a type than available units, divide operations by units and round up. (4) The cycle count is MAX(critical_path_length, operations/throughput for each resource). (5) Add stalls for memory latency on cache misses. (6) For loops: cycles = (iterations * cycles_per_iteration) + prologue + epilogue. For software-pipelined loops: cycles ≈ (iterations + num_stages - 1) * II, where II is the initiation interval. Validate estimates against actual profiling.
Branch misprediction cost in VLIW is typically the pipeline depth in cycles - all instructions fetched after the branch must be flushed. For a 5-stage pipeline: ~5 cycles. For deeper pipelines (10-20 stages): 10-20 cycles. Additionally, on some VLIW architectures, the long instruction word means more instructions are flushed per cycle than in scalar processors. Unlike superscalar, VLIW typically has simpler branch prediction (or none - relying on compiler scheduling). Mitigation strategies: (1) If-conversion using predicated execution - both paths execute but only correct path's results commit. (2) Delay slots - operations after branch execute regardless, compiler fills with useful work. (3) Profile-guided optimization to place likely path in fall-through position. (4) Minimize branches through loop transformations.
Latency-bound means execution time is determined by the critical path - the chain of dependent operations. No matter how many execution units, you must wait for each operation to complete before starting the next dependent one. Adding more parallelism doesn't help. Throughput-bound means execution time is determined by how many operations can be processed per cycle. The critical path is short relative to total operations, so performance scales with execution units. To determine which: if adding independent operations (unrolling) significantly reduces cycles per iteration, code was latency-bound. If cycles stay constant, code was throughput-bound. Latency-bound code needs latency reduction (faster operations, algorithm changes). Throughput-bound code needs more execution units or SIMD.
VLIW pipeline stalls occur due to: (1) Data hazards - when code is incorrectly scheduled and a consumer executes before its producer completes. In VLIW this is a compiler bug, not handled by hardware. (2) Memory latency - loads that miss cache cause stalls until data arrives. Non-blocking loads help but eventually you need the data. (3) Control hazards - branches may cause stalls if the target isn't ready. Predication eliminates some branches. (4) Structural hazards - if the schedule requires more resources than available (shouldn't happen with correct scheduling). (5) Exception handling - interrupts and traps can stall execution. (6) Functional unit latency variation - if an operation takes longer than expected (floating-point denormal handling). In VLIW, the compiler is responsible for avoiding most stalls through correct scheduling.
Key metrics for VLIW slot utilization: (1) NOP rate - percentage of slots containing NOPs. Target: below 20%. (2) Slot fill rate - percentage of slots with actual operations. (operations_issued / (cycles * slots_per_bundle)) * 100. (3) Per-unit utilization - for each execution unit type, track cycles_used / total_cycles. (4) IPC - instructions (operations) per cycle. Maximum is slots_per_bundle. Achieving 70-80% of max is good. (5) Critical path utilization - percentage of cycles with critical-path operations executing. (6) Memory unit utilization - if load/store units are underutilized, code may have unnecessary memory operations. (7) Vector unit utilization - for SIMD slots, track how often SIMD operations vs scalar operations fill the slot. Profile tools report these; use them to identify under-utilized units.
The critical path is the longest latency chain through the computation: (1) Build the Data Dependency Graph with nodes for each operation and edges for dependencies. (2) Weight edges with the producer operation's latency. (3) Find the longest path from any input (or graph entry) to any output (or graph exit) using topological sort and dynamic programming. (4) For each node, compute: earliest_start = max(earliest_finish of all predecessors). earliest_finish = earliest_start + latency. (5) Work backward to compute: latest_finish, latest_start. (6) Operations where earliest_start == latest_start are on the critical path. (7) The critical path length in cycles is the sum of latencies along this path. Optimizations should target critical path operations - reducing their latency directly reduces total time.
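A small C++ sketch of steps (3)-(4), assuming the operations are already in topological order with explicit predecessor lists:

#include <algorithm>
#include <vector>

struct Op {
    int latency;                 // cycles this operation takes
    std::vector<int> preds;      // indices of producer operations (RAW edges)
};

// ops must be in topological order (producers appear before consumers).
int critical_path_length(const std::vector<Op>& ops) {
    std::vector<int> finish(ops.size(), 0);
    int longest = 0;
    for (std::size_t i = 0; i < ops.size(); ++i) {
        int start = 0;
        for (int p : ops[i].preds)
            start = std::max(start, finish[p]);  // earliest_start = max finish over predecessors
        finish[i] = start + ops[i].latency;      // earliest_finish
        longest = std::max(longest, finish[i]);
    }
    return longest;   // critical path length in cycles
}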
Software pipelining is a loop optimization that overlaps execution of multiple iterations so operations from different iterations execute simultaneously. Instead of completing iteration 1 before starting iteration 2, you start iteration 2's early stages while iteration 1's later stages are still executing. Use it when: (1) Loop has multiple operations with varying latencies. (2) Loop body has dependencies preventing full parallelism within one iteration. (3) Loop has enough iterations to amortize prologue/epilogue overhead. The key metric is Initiation Interval (II) - the cycles between starting consecutive iterations. Minimum II is constrained by resources (ops/cycle) and recurrence cycles (loop-carried dependencies). A perfectly pipelined loop with II=1 achieves one iteration's worth of results per cycle.
Theoretical minimum cycles = MAX(compute_minimum, memory_minimum). Compute minimum: count operations of each type, divide by throughput (ops/cycle) for each unit, take the maximum. If kernel has 100 multiplies and 50 adds, with 2 multiply units and 4 ALUs: compute_min = max(100/2, 50/4) = 50 cycles. Memory minimum: total_bytes / bandwidth_bytes_per_cycle. If loading 1000 bytes with 32 bytes/cycle bandwidth: memory_min = 32 cycles. The actual minimum is the greater of these. For loops: also consider the loop-carried dependency chain - if each iteration depends on the previous, minimum = iterations * dependency_latency. This gives a lower bound; actual cycles will be higher due to scheduling constraints and overhead.
To hide latency, find operations with no dependencies on the long-latency operation and schedule them during its execution: (1) After a load instruction, schedule computations that don't need the loaded value. (2) Unroll loops to create multiple independent iteration instances - while waiting for iteration N's load, compute iteration N-1's results. (3) For multiply chains with 4-cycle latency, interleave 4 independent computations. (4) Prefetch data well before it's needed so loads complete in time. (5) Reorder code to maximize distance between producer and consumer. Example: instead of LOAD R1; USE R1; LOAD R2; USE R2, do: LOAD R1; LOAD R2; USE R1; USE R2 - now both loads can be in flight together. (6) Software pipelining automatically achieves this for loops.
Profile using performance counters to measure: (1) IPC (Instructions Per Cycle) - low IPC (below 1.0) with memory stalls indicates memory-bound. High IPC (near execution unit count) indicates compute-bound. (2) Cache miss rates - high L1/L2/L3 miss rates indicate memory-bound. (3) Memory bandwidth utilization - compare achieved bandwidth to theoretical maximum. If close, memory-bound. (4) Execution unit utilization - if units are frequently idle waiting for data, memory-bound. (5) Arithmetic intensity: operations / bytes transferred. Less than ~10 ops/byte is typically memory-bound. (6) Try increasing computation (add redundant work) - if no slowdown, memory-bound. Try reducing computation - if no speedup, memory-bound. The roofline model plots this: performance = min(peak_compute, memory_bandwidth * arithmetic_intensity).
Total cycles = compute_cycles + memory_stall_cycles. Memory stall cycles = num_misses * miss_latency. For hierarchical cache: stall_cycles = L1_misses * L1_miss_penalty + L2_misses * L2_miss_penalty + L3_misses * L3_miss_penalty. Miss penalty = latency_to_next_level - hit_time. Typical values: L1 hit 4 cycles, L2 hit 12 cycles, L3 hit 40 cycles, DRAM 200+ cycles. If code can overlap compute with memory: effective_cycles = MAX(compute_cycles, memory_cycles). Little's Law for memory bandwidth: outstanding_requests = bandwidth * latency. If the processor supports 8 outstanding loads and each has 200-cycle latency, you can hide latency with enough parallel loads. Non-blocking loads and prefetching help overlap compute and memory.
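The hierarchical stall formula as a tiny helper; the penalty constants below are the typical values quoted above and should be replaced with your target's numbers:

struct CacheStats {
    long long l1_misses, l2_misses, l3_misses;
};

// Rough estimate: each miss pays the extra latency of the next level it falls to.
long long estimate_stall_cycles(const CacheStats& s) {
    const int L1_MISS_PENALTY = 12 - 4;    // L2 hit latency minus L1 hit latency
    const int L2_MISS_PENALTY = 40 - 12;   // L3 hit latency minus L2 hit latency
    const int L3_MISS_PENALTY = 200 - 40;  // DRAM latency minus L3 hit latency
    return s.l1_misses * L1_MISS_PENALTY
         + s.l2_misses * L2_MISS_PENALTY
         + s.l3_misses * L3_MISS_PENALTY;
}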
Overlapping computation with memory requires non-blocking memory operations and enough independent work: (1) Issue loads early, as far before the use as possible. (2) Use prefetch instructions to start cache line fetches ahead of time. (3) Double-buffering: while processing buffer A, load buffer B. When A is done, process B while loading A. (4) Unroll loops to have multiple loads in flight - modern CPUs support 8-16 outstanding loads. (5) Structure code as: LOAD, COMPUTE_ON_PREVIOUS_LOAD, USE_CURRENT_LOAD. (6) For VLIW, schedule loads in early slots with computation in parallel slots. (7) Ensure memory bandwidth isn't the bottleneck - if you're issuing loads faster than memory can deliver, overlapping won't help. Profile to verify compute and memory overlap in practice.
CPU profiling tools
12 questions
Press Cmd+I in Xcode to open Instruments, or select Product > Profile. Choose the Time Profiler template and press the red record button (Cmd+R) to start profiling. The Time Profiler samples the call stack every few milliseconds to show where CPU time is spent. Enable 'Hide System Libraries' to focus on your own code, 'Flatten Recursion' to simplify recursive calls, and 'Top Functions' to see cumulative time including called functions. For CPU optimization on Apple Silicon, prefer the CPU Profiler over Time Profiler as it samples based on CPU clock frequency rather than a fixed timer, providing more accurate results and fairer weighting of CPU resources.
Use perf record with -p flag: 'perf record -g -p PID' to attach to running process by PID. Press Ctrl+C to stop recording, then 'perf report' to analyze. For system-wide profiling: 'perf record -a -g' captures all processes. VTune can also attach: 'vtune -collect hotspots -target-pid PID'. For Java: async-profiler can attach to running JVMs. Python: py-spy attaches without restarting: 'py-spy record -p PID -o profile.svg'. Note some profilers require debug symbols to be present (can be in separate debug package). Sampling profilers generally support attach; instrumentation-based profilers often require process restart. Check kernel parameter perf_event_paranoid if permission denied.
Launch VTune GUI with 'vtune-gui' or use the command line with 'vtune -collect hotspots ./your_program'. In the GUI, create a new project, specify your executable, and select Hotspots analysis from the Analysis tree. VTune offers two collection modes: User-Mode Sampling (higher overhead, no drivers needed) and Hardware Event-Based Sampling (lower overhead, requires sampling drivers). After collection completes, VTune displays a Summary viewpoint showing Top Hotspots sorted by CPU time. The Elapsed Time shows total runtime including idle time, while CPU Time shows the sum of all threads' CPU usage.
perf stat counts events and reports aggregate statistics at the end of execution, while perf record samples events over time and stores detailed profiles for later analysis. Use 'perf stat ./program' to get summary counts of cycles, instructions, cache misses, and branch mispredictions. Use 'perf record ./program' followed by 'perf report' when you need to identify which specific functions consume the most time. perf stat has lower overhead since it only maintains counters, whereas perf record captures instruction pointers and call stacks at each sample, creating a perf.data file for detailed analysis.
Use these compiler flags for profiling: '-g' for debug symbols (essential for source-level annotation), '-fno-omit-frame-pointer' to preserve frame pointers for accurate stack traces (modern compilers omit by default), '-O2' or '-O3' to profile optimized code (profiling unoptimized code gives misleading results). Full command: 'gcc -O2 -g -fno-omit-frame-pointer program.c -o program'. For split debug info (smaller binary): '-g -gsplit-dwarf'. Note: debug symbols don't affect performance, only binary size. The -g flag can be combined with any optimization level. Without frame pointers, tools may show incomplete call stacks or use slower DWARF-based unwinding.
Integrated profiling approaches: 1) NVIDIA Nsight Systems: captures CPU and GPU activity on unified timeline, shows kernel launches, memory transfers, and CPU work together. 2) Intel VTune 2025: GPU Compute/Media Hotspots analysis for Intel GPUs and integrated graphics. 3) AMD ROCm Profiler (rocprof): profiles GPU kernels with timeline and counter data. 4) Perfetto: supports GPU traces alongside CPU traces on Android and some desktop configurations. 5) Chrome tracing: includes GPU activity for graphics workloads. For CUDA: use nvprof or Nsight Compute for kernel-level analysis. Key metrics: GPU occupancy, memory throughput, kernel duration, CPU-GPU synchronization points. Look for: idle GPU waiting for CPU, idle CPU waiting for GPU, excessive memory transfers between CPU and GPU.
Mark regions of interest: 1) Intel VTune ITT API: __itt_resume() and __itt_pause() around regions, run with 'vtune -start-paused'. 2) perf with markers: use 'perf record -D 1000' to delay start, or signal-based control. 3) Programmatic control: Google Benchmark State.PauseTiming()/ResumeTiming(), JMH @CompilerControl annotations. 4) Time-based filtering: record everything, then filter in analysis to specific time ranges. 5) Intel PIN for binary instrumentation of specific functions. 6) Wrapper functions that enable/disable profiling around calls. 7) Perfetto custom track events with TRACE_EVENT macros. Profiling specific regions reduces data volume and focuses analysis on areas you control.
Multiple options based on needs: 1) cProfile (built-in): 'python -m cProfile -s cumtime script.py' for deterministic profiling with call counts. 2) py-spy: sampling profiler with minimal overhead, works on running processes: 'py-spy record -o profile.svg -- python script.py' generates flame graph. 3) Scalene: low-overhead sampling profiler distinguishing Python/native/system time. 4) yappi: supports multi-threaded profiling. 5) perf can profile CPython itself: 'perf record python script.py' but shows C functions, not Python. 6) Python 3.12+ adds the low-impact monitoring API (PEP 669, sys.monitoring), which profilers can hook into with far less overhead than sys.settrace. For production, py-spy and Scalene have lowest overhead. For detailed call analysis, use cProfile with snakeviz visualization.
Performance flags: -O2/-O3 (optimization level), -march=native (CPU-specific instructions), -flto (link-time optimization), -ffast-math (aggressive FP optimization, may change results). Profiling accuracy flags: -g (debug symbols for source annotation), -fno-omit-frame-pointer (accurate stack traces - essential), -fno-inline (optional: prevents inlining for clearer profiles, but changes performance). Flags to avoid during profiling: -fomit-frame-pointer (breaks stack unwinding), -s (strips symbols). Recommended combination: '-O2 -g -fno-omit-frame-pointer -march=native' for production-like profiling. Note: -O3 can inline aggressively making profiles harder to read. Consider building with debug info separately: '-O2 -g0' for production, '-O2 -g -fno-omit-frame-pointer' for profiling.
Use perf record to sample CPU activity and perf report to analyze results. Run 'perf record -g ./your_program' to capture stack traces during execution. The -g flag enables call graph recording. After execution completes, run 'perf report' to view an interactive report showing functions sorted by CPU time. For real-time profiling, use 'perf top' to see live CPU usage across all processes. The default sampling frequency is 4000Hz (4000 samples per second), which the kernel throttles if profiling overhead grows too high. By default, perf uses the 'cycles' event, which maps to UNHALTED_CORE_CYCLES on Intel processors.
A frequency of 99Hz (or 997Hz, chosen to avoid lockstep with timer interrupts) is a good choice for most cases. Lower frequency (10-50Hz): less overhead, good for long-running production profiling, but less precision - may miss short-lived hot spots. Higher frequency (1000-10000Hz): more detail on short functions, but higher overhead and risk of perturbation. For flame graphs, 99Hz for 30-60 seconds typically captures 3000-6000 samples - sufficient for statistical accuracy. Rule of thumb: overhead = (samples/second) * (time per sample) / total time. At 99Hz with ~10us per sample interrupt on 1GHz CPU, overhead is about 0.1%. Increase frequency only if hot spots aren't clear in initial profile.
VTune Microarchitecture Exploration (uarch) analysis provides low-level CPU metrics organized by TMAM categories. Key metrics: CPI (Cycles Per Instruction) - inverse of IPC, lower is better. Frontend Bound % - instruction fetch/decode stalls. Backend Bound % - subdivided into Memory Bound (cache misses, DRAM latency) and Core Bound (execution unit contention). Bad Speculation % - mispredicted branches, machine clears. Retiring % - useful work done. Focus optimization on the highest percentage category. Drill down: if Memory Bound is high, check L1/L2/L3 Bound and DRAM Bound sub-metrics. If Core Bound, look at Port Utilization to identify oversubscribed execution units. Compare uarch metrics between code versions to verify optimizations address the right bottleneck.
Tree Traversal Optimization
10 questions
Optimal batch size balances SIMD utilization against divergence and memory pressure: (1) Minimum useful batch = SIMD width (8 for AVX2, 16 for AVX-512). (2) For coherent workloads (sorted keys), larger batches (256-1024) improve throughput by amortizing loop overhead. (3) For random keys, smaller batches (16-64) reduce divergence waste. (4) Memory constraint: batch * sizeof(index_array_entry) should fit in L1 cache. (5) For tree depth D, expect 2^(D/2) average divergence - larger batches compensate. (6) Empirically for B+ trees: 32-128 keys per batch is often optimal. (7) Profile different sizes for your workload. If hit rate drops significantly at large batch sizes, you're exceeding cache. If small batches have high overhead, increase size.
Vectorize tree traversal by processing N keys simultaneously: (1) Load N search keys into a SIMD register. (2) At each tree level, broadcast the node's split value to all lanes. (3) Compare all keys against split value using SIMD compare, producing a mask. (4) Based on mask bits, determine which keys go left vs right. (5) Two approaches: A) Gather-based: use gather to load N child pointers based on comparison results, continue with N independent paths. B) Sort-based: partition keys into left-going and right-going groups, process each group separately. For balanced trees, approach B is more efficient as it maintains SIMD efficiency. Keys on same path can share loaded nodes. The challenge is divergence - when keys take different paths, SIMD efficiency drops.
For tree traversal where branching depends on hash values (like hash tries or Bloom filter trees): (1) Compute hashes for all query keys in parallel using SIMD hash. (2) At each node, extract relevant hash bits (e.g., bits for current level) using SIMD AND and shift. (3) Use extracted bits as indices into node's child array. (4) Gather child pointers/values using the computed indices. (5) For hash tries: each level uses different hash bits - level 0 uses bits 0-7, level 1 uses bits 8-15, etc. (6) For Bloom filter tree: compute k hash functions in parallel, AND results to determine membership before traversing. (7) The hash computation can be pipelined with the traversal - compute next level's hash bits while fetching current level's children. This pattern is efficient because hash computation is parallel and deterministic.
When SIMD lanes diverge (different keys take different tree paths), several strategies exist: (1) Masking - continue processing all lanes but mask off results for inactive lanes. Wastes some computation but maintains SIMD. (2) Compaction - use packing instructions to group active keys together and restart with smaller batch. Has overhead but reduces waste. (3) Two-phase: use SIMD while keys are coherent (upper levels), switch to scalar when divergence exceeds threshold. (4) Sort keys before search to maximize coherence. (5) TRIE structures - use multiple comparison levels per node to reduce divergence. (6) Accept some inefficiency - even with 50% utilization, SIMD is 4x faster than scalar for AVX2. The optimal strategy depends on tree shape and key distribution.
Level-order (BFS) processes all nodes at one depth before moving to next depth. Better for SIMD because: all nodes at a level can be processed in parallel, memory access pattern is more regular, naturally maps to batch processing. Depth-first (DFS) processes each path to leaf before backtracking. Better for: early termination when target found, reduced memory for tracking state, cache locality along a single path. For SIMD tree search, level-order enables processing multiple queries at the same tree level simultaneously. Load all nodes at level L, compare all queries against them, update indices for level L+1. This maintains SIMD width utilization. DFS loses SIMD efficiency when queries diverge to different subtrees.
Pointer chasing cost depends on memory hierarchy hits: L1 cache hit: 4-5 cycles per node access. L2 cache hit: 12-15 cycles. L3 cache hit: 40-50 cycles. DRAM access: 200-300 cycles. For a tree of depth D, random access pattern causes D cache misses. Total cost: D * miss_latency. For a tree with 1M nodes (depth ~20), worst case is 20 * 300 = 6000 cycles per lookup. Mitigation: (1) Cache-optimized layouts reduce misses. (2) Prefetching child nodes while processing current node. (3) B+ trees increase fanout, reducing depth. (4) Batch lookups to enable software pipelining - while waiting for one lookup's memory, process others. Cache-line-sized nodes allow fetching multiple children in one miss.
Cache-efficient tree layouts: (1) Eytzinger/BFS layout - store nodes in breadth-first order. Node i has children at 2i+1 and 2i+2. Sequential access pattern for each level. (2) Van Emde Boas layout - recursively lay out subtrees contiguously. Optimal for any cache line size. O(log_B N) cache misses where B is block size. (3) Level-blocking - group multiple levels into cache-line-sized blocks. (4) Cache-Oblivious B-tree - pack multiple nodes per cache line (B=8 for 64-byte lines with 8-byte keys). (5) Align nodes to cache line boundaries to prevent false sharing. (6) For SIMD, store multiple split values together for parallel comparison. (7) Separate keys from child pointers to improve cache density during search. Profile actual cache miss rates to validate layout choice.
Convert recursion to iteration for VLIW efficiency: (1) Replace recursive calls with a loop using explicit stack or current-pointer. (2) For binary search (no backtracking needed): use a simple while loop: idx = 0; while (!is_leaf(idx)) { idx = key < node[idx].split ? left_child(idx) : right_child(idx); }. (3) For full traversal (pre/in/post-order), use explicit stack: push root; while (stack not empty) { pop; process; push children }. (4) Eliminate function call overhead which disrupts VLIW bundling. (5) Inline the loop body to expose more ILP to compiler. (6) The iterative form enables loop transformations like unrolling and software pipelining that recursion prevents. (7) For tail recursion, the compiler may convert automatically, but explicit iteration is more reliable.
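A hedged C++ sketch of point (2) for an implicit Eytzinger/BFS layout, where a complete BST is stored level by level and leaves are detected by running past the end of the array:

#include <cstddef>
#include <vector>

// Returns true if key is present in a complete BST stored in BFS (Eytzinger-style) order:
// children of tree[i] are tree[2i+1] and tree[2i+2]; falling past the end means "not found".
bool bst_contains(const std::vector<int>& tree, int key) {
    std::size_t idx = 0;
    while (idx < tree.size()) {
        if (key == tree[idx]) return true;
        idx = (key < tree[idx]) ? 2 * idx + 1   // go left
                                : 2 * idx + 2;  // go right
    }
    return false;
}

The loop body has no call overhead, so the compiler can unroll it or software-pipeline several lookups at once.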
When tree nodes exceed vector register width: (1) Split processing across multiple registers - if node has 32 values and register holds 8, use 4 registers. (2) Load only the needed portion - for binary search, only load the split value/key, not entire node. (3) Use gather operations with sparse indices to load only relevant fields. (4) For B-tree nodes (many keys), use SIMD to compare query against all keys in the node, finding the correct child pointer. (5) Cache the frequently accessed small fields separately from large payloads (separate keys array from values array). (6) If storing child pointers is the problem, use implicit indexing (Eytzinger layout) where child positions are calculated, not stored. (7) Consider node size as architecture design parameter - size nodes to fit cache lines (64 bytes) for efficiency.
Batch tree traversal pattern: (1) Sort incoming keys to improve coherence - nearby keys likely take similar paths. (2) Process keys in batches of SIMD width (8 for AVX2, 16 for AVX-512). (3) At each node, broadcast node value to all lanes, compare batch against it. (4) Track current position for each key using a parallel index array. (5) After comparison, update indices: left_child = 2*idx+1, right_child = 2*idx+2 based on comparison mask. (6) Use predication/masking for keys that have reached leaves (don't update their indices). (7) Continue until all keys reach leaves. (8) Gather results from leaf positions. This achieves good SIMD utilization when keys cluster in the tree. For random keys, sorting before search significantly improves batch efficiency.
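A hedged AVX2 sketch of steps (3)-(5): 8 keys descend an implicit Eytzinger-layout tree in lockstep, gathering each key's current split value and updating indices from the comparison mask (assumes every key descends the same number of levels and tree[] has an entry for every index reached):

#include <immintrin.h>

// tree: split values in BFS/Eytzinger order; depth: number of levels to descend.
// keys/out_idx: 8 queries and their final node indices.
void batch_descend8(const int *tree, int depth, const int *keys, int *out_idx) {
    __m256i vkeys = _mm256_loadu_si256((const __m256i *)keys);
    __m256i vidx  = _mm256_setzero_si256();            // all keys start at the root
    const __m256i two = _mm256_set1_epi32(2);
    for (int level = 0; level < depth; level++) {
        // Gather each lane's split value: tree[vidx[lane]]
        __m256i vsplit = _mm256_i32gather_epi32(tree, vidx, 4);
        // go_left lane = -1 where key < split, 0 where key >= split
        __m256i go_left = _mm256_cmpgt_epi32(vsplit, vkeys);
        // idx = 2*idx + 2 (right child), minus 1 where going left (mask is -1)
        vidx = _mm256_add_epi32(_mm256_slli_epi32(vidx, 1), two);
        vidx = _mm256_add_epi32(vidx, go_left);
    }
    _mm256_storeu_si256((__m256i *)out_idx, vidx);
}

In a real tree the lanes reach leaves at different depths, so a per-lane "done" mask (point (6)) would sit on top of this sketch.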
Loop Transformations
10 questions
Loop fission (distribution) splits one loop into multiple loops: Before: for(i=0;i<N;i++) { a[i]=b[i]+c[i]; d[i]=a[i]*e[i]; }. After: for(i=0;i<N;i++) a[i]=b[i]+c[i]; for(i=0;i<N;i++) d[i]=a[i]*e[i]; When to split: (1) Register pressure: combined loop exceeds registers, splitting reduces live values per loop. (2) Vectorization: one part vectorizes well, other doesn't. Split and vectorize the easy part. (3) Different optimization strategies: one part is memory-bound, other is compute-bound. Optimize separately. (4) Cache behavior: splitting may improve streaming access patterns. (5) Parallelism: after fission, loops may be parallelizable independently. When to avoid: locality loss is too expensive (data must be reloaded), or splitting increases loop overhead disproportionately for small N.
Loops with early exits (break, return) require special unrolling: (1) Standard approach: check exit condition in each unrolled iteration: for(i+=4) { if(cond(i)) break; do(i); if(cond(i+1)) break; do(i+1); if(cond(i+2)) break; do(i+2); if(cond(i+3)) break; do(i+3); } Branches may limit ILP benefit. (2) Speculative execution: compute all iterations, check conditions at end: do(i); do(i+1); do(i+2); do(i+3); if(cond(i)||cond(i+1)||cond(i+2)||cond(i+3)) {handle_exit();} Works if do() has no side effects and exit is rare. (3) Predicated execution: use SIMD masking - continue with mask tracking active elements. (4) Chunked: fully unroll within chunks, check between chunks. (5) Often limited unrolling (2x) is better than aggressive unrolling for early-exit loops due to branch prediction efficiency.
Optimal unroll factor balances ILP with register pressure and code size: (1) Minimum unroll = latency of longest operation / operations per iteration. For 4-cycle multiply with 2 multiplies per iteration: unroll >= 2 to hide latency. (2) Match SIMD width: for AVX2, unroll by multiples of 8 for float ops. (3) Fill all VLIW slots: if processor has 8 slots and loop has 4 ops, unroll 2x minimum. (4) Register limit: each unrolled iteration adds live values. Unroll until register pressure causes spills. Typical limit: 4-16x depending on loop complexity. (5) I-cache pressure: unrolled code shouldn't exceed L1 I-cache (32KB typical). Very large unrolling hurts I-cache hit rate. (6) Empirical: start with 4x, measure cycles, try 8x and 16x. Often 4-8x is optimal. Profile to find the sweet spot.
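To make the latency argument in (1) concrete, here is a sketch of a dot product unrolled 4x with four independent accumulators (assumes n is a multiple of 4 for brevity):

// Four independent accumulator chains let multiply-adds overlap a ~4-cycle latency.
float dot_unrolled4(const float *a, const float *b, int n) {
    float s0 = 0.f, s1 = 0.f, s2 = 0.f, s3 = 0.f;
    for (int i = 0; i < n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    return (s0 + s1) + (s2 + s3);   // combine the chains once at the end
}

A single accumulator would serialize every addition; with four, each chain only advances every fourth iteration, so the dependency latency is hidden without much extra register pressure.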
Loop interchange swaps nested loop order to improve access patterns: Before (column-major access): for(j=0;j<M;j++) for(i=0;i<N;i++) a[i][j]+=1; After (row-major access): for(i=0;i<N;i++) for(j=0;j<M;j++) a[i][j]+=1; The second form accesses memory sequentially (a[i][0], a[i][1], a[i][2]...) vs strided access in the first (a[0][j], a[1][j]...). Sequential access: uses cache lines fully (16 floats per 64-byte line). Strided access: one useful element per line, wasting 15/16 of bandwidth. When legal: no loop-carried dependency that would be violated. Check: if dependency is from a[i][j] to a[i'][j'] where i'<i, interchanging to i-outer maintains correctness. Apply automatically: many compilers interchange when legal and beneficial. Manually: profile cache misses, restructure if strided access dominates.
For nested loops, unroll inner loop first, then outer if needed: (1) Inner loop unrolling: for(i) for(j+=4) { body(i,j); body(i,j+1); body(i,j+2); body(i,j+3); } Exposes ILP within each inner iteration group. (2) Outer loop unrolling (loop interchange variant): for(i+=2) for(j) { body(i,j); body(i+1,j); } Exposes parallelism across outer iterations. (3) Combined (2D tiling with unroll): for(i+=2) for(j+=4) { body(i,j..j+3); body(i+1,j..j+3); } Creates a 2x4 block with 8 parallel operations. (4) Unroll to match VLIW width: if 8 slots available, create 8 independent operations per bundle through 2x4, 4x2, or 1x8 unrolling. (5) Respect register limits - 2D unrolling multiplies register usage. (6) Profile different unroll combinations for your specific loop structure.
Calculate unroll factor from register budget: (1) Count live values per iteration: input registers, output registers, loop-carried values, constants, temporary computations. (2) Check available registers: typical VLIW has 32-64 GP registers, 32 vector registers. (3) Formula: max_unroll = (available_registers - loop_overhead) / registers_per_iteration. Example: 64 registers, 4 for loop control, loop uses 10 per iteration: max_unroll = (64-4)/10 = 6. (4) Account for compiler's needs - reserve 20-30% for temporaries and spill space. (5) Vector and scalar have separate register files - calculate for each. (6) Reduce unroll if you see spill/reload in generated assembly. (7) Consider register renaming: some "uses" can share registers if live ranges don't overlap. (8) Profile spill rate at different unroll factors - find the knee where spills increase sharply.
Modulo scheduling is a software pipelining technique that overlaps loop iterations: (1) Compute Initiation Interval (II) - minimum cycles between starting iterations. II >= max(resource_constraints, recurrence_constraints). (2) Create a schedule where each operation executes at a specific (cycle mod II, slot) position. (3) Prologue fills the pipeline - first II-1 cycles start new iterations without completing any. (4) Kernel is the steady state - each cycle starts a new iteration AND completes an old one. (5) Epilogue drains - final II-1 cycles complete remaining iterations without starting new ones. Example with II=2: Cycle 0: load(iter0). Cycle 1: load(iter1), compute(iter0). Cycle 2: load(iter2), compute(iter1), store(iter0). [Kernel] Cycle 3: load(iter3), compute(iter2), store(iter1)... Achieves II cycles per iteration vs potentially 3 cycles sequential.
Loop fusion combines two adjacent loops with the same bounds into one: Before: for(i=0;i<N;i++) a[i]=b[i]+c[i]; for(i=0;i<N;i++) d[i]=a[i]*e[i]; After: for(i=0;i<N;i++) { a[i]=b[i]+c[i]; d[i]=a[i]*e[i]; }. Benefits for VLIW: (1) More operations per iteration = more ILP to fill VLIW slots. (2) Improved locality: a[i] produced and consumed in same iteration, stays in register. (3) Reduced loop overhead: one increment, one compare instead of two. (4) Better cache usage: data touched once stays hot. When it helps: loops access same or related data, limited register pressure, loops have compatible bounds. When to avoid: fusion would exceed register capacity, second loop depends on ALL of first loop's results, different iteration counts, or loops have vastly different performance characteristics.
Loop-carried dependencies require special handling when unrolling: (1) Identify the dependency: a[i] = a[i-1] + b[i] - each iteration depends on previous. (2) Unroll maintaining the dependency chain: a[i] = a[i-1]+b[i]; a[i+1] = a[i]+b[i+1]; a[i+2] = a[i+1]+b[i+2]; a[i+3] = a[i+2]+b[i+3]; Still sequential - no parallel speedup. (3) Transform if possible - some dependencies can be reformulated. For prefix sum, use parallel scan algorithm. (4) Interleave independent streams: if you have 4 independent arrays to scan, process them in parallel: for(i) { a1[i]=a1[i-1]+b1[i]; a2[i]=a2[i-1]+b2[i]; a3[i]=a3[i-1]+b3[i]; a4[i]=a4[i-1]+b4[i]; }. (5) Use SIMD for independent streams - each lane processes one stream. (6) Accept limited parallelism - unrolling still helps with other operations in the loop body.
Loop tiling (blocking) restructures loops to operate on cache-sized chunks: Before (matrix multiply): for(i) for(j) for(k) C[i][j]+=A[i][k]*B[k][j]; After (tiled): for(ii=0;ii<N;ii+=TILE) for(jj=0;jj<N;jj+=TILE) for(kk=0;kk<N;kk+=TILE) for(i=ii;i<min(ii+TILE,N);i++) for(j=jj;j<min(jj+TILE,N);j++) for(k=kk;k<min(kk+TILE,N);k++) C[i][j]+=A[i][k]*B[k][j]; TILE chosen so that tiles of A, B, C fit in cache: TILE*TILE*sizeof(float)*3 < cache_size. For 32KB L1 and float: TILE <= sqrt(32KB/12) = ~52, typically round to 32 or 64. Benefits: data reuse within tile before eviction, reduces memory bandwidth by factor of ~TILE for matrix operations. Reduces cache misses from O(N^3/B) to O(N^3/(B*sqrt(M))) where B=block size, M=cache size.
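The tiled version written out in C++, assuming row-major N x N float matrices and TILE chosen as above:

#include <algorithm>

enum { TILE = 32 };   // sized so three TILE x TILE float tiles fit in L1

// C += A * B, all matrices N x N, row-major.
void matmul_tiled(const float *A, const float *B, float *C, int N) {
    for (int ii = 0; ii < N; ii += TILE)
      for (int jj = 0; jj < N; jj += TILE)
        for (int kk = 0; kk < N; kk += TILE)
          for (int i = ii; i < std::min(ii + TILE, N); i++)
            for (int j = jj; j < std::min(jj + TILE, N); j++) {
              float cij = C[i * N + j];             // accumulate this tile's partial sum
              for (int k = kk; k < std::min(kk + TILE, N); k++)
                cij += A[i * N + k] * B[k * N + j];
              C[i * N + j] = cij;
            }
}

The loop order inside the tile can be further tuned (hoisting A values, vectorizing over j), but the blocking structure above is what bounds the working set to three tiles.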
Hash Function Optimization
10 questions
For independent hash computations, use multiple accumulators to exploit instruction-level parallelism. Instead of hash1 = f(hash1, data[0]); hash1 = f(hash1, data[1])... which has serial dependencies, use: hash1 = f(hash1, data[0]); hash2 = f(hash2, data[1]); hash3 = f(hash3, data[2]); hash4 = f(hash4, data[3]); then combine at the end. With 4-cycle multiply latency and 4 accumulators, you achieve full throughput. For SIMD, process N independent keys in parallel using vector registers - each lane computes an independent hash. This achieves N-way parallelism (4x for 128-bit, 8x for 256-bit). The key insight: each hash computation should be independent so there's no cross-lane dependency until the final reduction.
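A sketch of the multi-accumulator pattern with a hypothetical multiply-rotate round; the constants and combine step are illustrative, not a published hash:

#include <cstddef>
#include <cstdint>

// One illustrative mixing round; real hashes (xxHash, Murmur, wyhash) use their own constants.
static inline std::uint64_t mix(std::uint64_t acc, std::uint64_t data) {
    acc ^= data * 0x9E3779B97F4A7C15ull;   // multiply by a 64-bit odd constant
    return (acc << 31) | (acc >> 33);      // rotate to spread bits
}

std::uint64_t hash_4way(const std::uint64_t *data, std::size_t n) {  // n assumed a multiple of 4
    std::uint64_t h1 = 1, h2 = 2, h3 = 3, h4 = 4;                    // independent accumulator chains
    for (std::size_t i = 0; i < n; i += 4) {
        h1 = mix(h1, data[i]);
        h2 = mix(h2, data[i + 1]);
        h3 = mix(h3, data[i + 2]);
        h4 = mix(h4, data[i + 3]);
    }
    // Final combine: the only place the four chains depend on each other.
    std::uint64_t h = mix(h1, h2);
    h = mix(h, h3);
    return mix(h, h4);
}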
SIMD hash table lookup pattern: (1) Hash N keys in parallel using SIMD, producing N hash values. (2) Compute bucket indices: AND hash values with (table_size-1) if power-of-2. (3) Use gather instruction (vpgatherdd/vpgatherdq) to load N entries from non-contiguous table locations in parallel. (4) Compare loaded keys against search keys using SIMD compare (vpcmpeqd). (5) The comparison produces a mask indicating matches. (6) Use the mask to blend/select the values for matching keys. For open addressing with linear probing, if some lanes don't match, increment those indices and gather again. The gather instruction is key - it loads from base + index*scale for each vector lane, enabling parallel random access.
Vectorize by processing N inputs in parallel using SIMD registers: (1) Load N input keys into a vector register (e.g., 8 32-bit keys into YMM). (2) Perform all hash operations as SIMD: multiply uses vpmulld, XOR uses vpxor, shift uses vpsrld. (3) Each SIMD lane computes an independent hash. For xxHash-style: load 8 keys, vpmulld with PRIME constant, vpxor with vpsrld result. (4) After all rounds, you have 8 hash results in one register. (5) Store results or use them for parallel lookups. Throughput improvement is proportional to SIMD width: 4x for 128-bit SSE, 8x for 256-bit AVX2, 16x for 512-bit AVX-512 (for 32-bit hashes). Memory bandwidth often becomes the bottleneck before compute.
To eliminate dependencies between hash stages: (1) Process multiple independent inputs simultaneously - each input's hash stages are independent of other inputs. (2) Use multiple accumulator variables instead of one: acc1 ^= data[0]*c1; acc2 ^= data[1]*c2; acc3 ^= data[2]*c3; acc4 ^= data[3]*c4; then combine at end. (3) Rearrange the hash algorithm if possible - some mixing functions can be partially parallelized. (4) For the final combining/finalization, there will be dependencies - minimize this stage. (5) For tree-structured hashing, process subtrees in parallel and combine results. The key is that within a single hash computation there WILL be dependencies between stages (that's what makes it a hash), but between different hash computations there are none.
Typical cycle costs: XOR is 1 cycle latency with 1-per-cycle throughput (often multiple XOR units). Shift is 1 cycle latency with 1-per-cycle throughput. Multiply (32-bit) is 3-4 cycles latency but can often achieve 1-per-cycle throughput when pipelined (new multiply can start each cycle). Multiply (64-bit) may be 4-5 cycles. Add is 1 cycle. For hash functions, XOR-shift sequences are essentially free (1 cycle each), while multiplies dominate latency. A multiply-XOR-shift round: 4 cycles if dependencies force serialization, but with pipelining and multiple accumulators you can achieve 1 hash-round per cycle throughput. SIMD multiply (vpmulld) has similar latency but 4-8x throughput.
Mixing functions with best ILP have operations that can execute in parallel: (1) XOR folds that read the same input: t = h; h ^= (t >> 16) ^ (t >> 8); - both shifts read the same value, so they execute in parallel and the XORs reduce in a short tree (the common h ^= h >> 16; h ^= h >> 8; form is a serial chain because the second shift reads the updated h). (2) Multiply-accumulate with multiple accumulators: h1 *= c; h2 *= c; h3 *= c; h4 *= c; - four independent chains. (3) Parallel mixing like in xxHash: round(acc, data) can process 4 accumulators independently before final merge. (4) Avoid deep sequential chains like: h = f(h); h = g(h); h = f(h); which has ILP=1. (5) The Murmur3 finalizer (h ^= h>>16; h *= c; h ^= h>>13; h *= c; h ^= h>>16) has ILP=1 due to strict dependencies, but latency is acceptable for finalization.
The optimal number depends on the quality requirements: For simple hash tables (collision tolerance OK): 1-2 rounds of multiply-xor-shift gives good distribution. xxHash uses 2 rounds of multiply-xor in finalization. For cryptographic-quality avalanche (every input bit affects every output bit with 50% probability): typically 3-4 rounds minimum. MurmurHash3 uses 2 rounds of multiply-xor-shift-xor for good avalanche. For maximum performance with acceptable quality: use the minimum rounds that pass SMHasher's avalanche tests. Each round costs ~4 cycles (multiply latency). Going from 2 to 3 rounds costs 4 cycles but may not improve quality for non-crypto uses. Profile your specific collision rate vs throughput tradeoff.
Pipeline hash iterations by starting new iterations before previous ones complete: If one hash iteration has 4-cycle latency, process 4 independent streams: Cycle 1: start iter 1 of stream A. Cycle 2: start iter 1 of stream B, stream A in flight. Cycle 3: start iter 1 of stream C, A and B in flight. Cycle 4: start iter 1 of stream D, A, B, C in flight. Cycle 5: A completes, start iter 2 of A, and continue D. Now you have continuous throughput - one result every cycle despite 4-cycle latency. This is software pipelining applied to hashing. The number of streams needed equals the latency to fully hide it. Four independent streams can hide a 4-cycle multiply latency, achieving one multiply's worth of results per cycle.
To implement a hash function with SIMD: (1) Load multiple input elements into vector registers using packed loads (e.g., 8 32-bit values into a 256-bit register). (2) Apply hash operations in parallel across all lanes - XOR, multiply, shift, rotate all have SIMD equivalents. (3) For mixing steps like multiply-XOR, use SIMD multiply (vpmulld for 32-bit) and SIMD XOR (vpxor). (4) Shifts use SIMD shift instructions (vpsrld for right shift). (5) For final combining, either keep hashes separate (8 independent hashes) or reduce using horizontal operations. Example for multiply-xor-shift: load 8 keys, vpmulld with constant, vpxor with shifted version, repeat. MurmurHash3, xxHash, and wyhash have been successfully vectorized achieving 8-16x throughput over scalar implementations.
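A hedged AVX2 rendering of the multiply-xor-shift example: 8 independent 32-bit keys hashed in parallel (the constant and round count are illustrative):

#include <cstdint>
#include <immintrin.h>

// Hash 8 keys at once; out[i] is the hash of keys[i]. Round structure is illustrative.
void hash8_avx2(const std::uint32_t *keys, std::uint32_t *out) {
    const __m256i prime = _mm256_set1_epi32((int)0x9E3779B1u);
    __m256i h = _mm256_loadu_si256((const __m256i *)keys);
    for (int round = 0; round < 2; round++) {
        h = _mm256_mullo_epi32(h, prime);                   // vpmulld: lane-wise 32-bit multiply
        h = _mm256_xor_si256(h, _mm256_srli_epi32(h, 15));  // vpxor + vpsrld: xor-shift mix
    }
    _mm256_storeu_si256((__m256i *)out, h);
}

Each lane is an independent hash, so the loop's dependencies are per-lane only; throughput scales with vector width until memory bandwidth takes over.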
Unroll by processing multiple input elements per iteration and interleaving their stages: For a 3-stage hash (mix1, mix2, mix3) on 4 inputs: Cycle 1: mix1(a), load(b), load(c), load(d). Cycle 2: mix2(a), mix1(b), load(...). Cycle 3: mix3(a), mix2(b), mix1(c). This fills VLIW slots with independent operations from different inputs. The unroll factor should match or exceed the latency of the longest operation to fully hide it. If mix stages have 4-cycle latency total, unroll by at least 4. Each iteration processes 4 inputs with operations interleaved so dependencies don't stall. Combine results at loop end. This transforms a serial chain into parallel waves of computation.
Code Transformation Patterns
8 questions
Instruction combining merges multiple operations into fewer, more powerful ones: (1) FMA (Fused Multiply-Add): a*b+c -> FMA(a,b,c). One instruction vs two, often better precision. (2) Multiply-accumulate: x += a*b -> MAC instruction on DSPs. (3) Load-compute: some architectures combine load with simple operation. (4) Compare-branch: fused compare and conditional branch. (5) Address calculation: base+offset folded into load/store addressing mode. For VLIW benefits: (1) Fewer instructions = potentially fewer VLIW bundles. (2) Combined operations may have lower latency than separate. (3) Frees VLIW slots for other work. How to enable: (1) Use appropriate compiler flags (-mfma, -ffast-math). (2) Write code in combinable patterns: write a*b+c, not t=a*b; t+c. (3) Use intrinsics for specific combined operations when compiler misses them.
Loop-invariant code motion (LICM) moves computations that don't change across iterations outside the loop: Before: for(i=0;i<N;i++) { y = a*b+c; x[i] = y*d[i]; } After: y = a*b+c; for(i=0;i<N;i++) { x[i] = y*d[i]; } Benefits: (1) Computation happens once instead of N times. (2) More VLIW slots available for loop-varying operations. How to identify: (1) Expression uses only loop-invariant values (constants, variables not modified in loop). (2) No side effects that must happen each iteration. (3) Result is the same every iteration. What compilers miss: (1) Expressions involving pointers (may alias). (2) Expressions across function calls. (3) Complex expressions the optimizer doesn't recognize as invariant. Manual hoisting: explicitly compute outside loop and pass as parameter. Verify with assembly - look for repeated constant computations.
Common Subexpression Elimination (CSE) computes repeated expressions once: Before: a = (x+y)*z; b = (x+y)*w; c = (x+y)/2; After: temp = x+y; a = temp*z; b = temp*w; c = temp/2; For VLIW benefits: (1) Fewer operations = potentially fewer cycles. (2) Reduced register pressure if temp can be reused and original operands become dead. (3) May expose more ILP - operations using temp become independent of original computation. How to apply: (1) Identify repeated subexpressions. (2) Compute once, store in temp variable. (3) Ensure reuse doesn't increase register pressure excessively. (4) For memory expressions: ensure no aliasing between subexpression and intervening stores. Compilers perform CSE automatically, but may miss opportunities across function boundaries or through pointers. Profile assembly to find repeated computations.
Convert conditional logic to arithmetic operations for branchless SIMD: (1) Min/max: if(a<b) c=a; else c=b; -> c = min(a,b). SIMD: _mm256_min_ps(va,vb). (2) Clamp: if(x<0) x=0; if(x>1) x=1; -> x = max(0, min(1, x)). (3) Absolute value: if(x<0) x=-x; -> x = abs(x). SIMD: _mm256_andnot_ps(sign_mask, vx). (4) Conditional assignment: if(cond) x=a; else x=b; -> x = (cond & a) | (~cond & b). SIMD: _mm256_blendv_ps(vb, va, cond). (5) Conditional increment: if(cond) count++; -> count += cond when cond is 0/1; with SIMD masks (0 or -1), subtract the mask instead: count -= mask. (6) Sign function: sign = (x>0) - (x<0). General pattern: express result as function of both outcomes weighted by condition bits. Both paths compute, no branch prediction needed.
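A small sketch of patterns (2) and (4) with AVX2 intrinsics, assuming float data:

#include <immintrin.h>

// (2) Clamp 8 floats to [0, 1] without branches.
__m256 clamp01(__m256 x) {
    return _mm256_max_ps(_mm256_setzero_ps(),
                         _mm256_min_ps(_mm256_set1_ps(1.0f), x));
}

// (4) Per-lane select: where a > 0 take c, otherwise take d.
__m256 select_pos(__m256 a, __m256 c, __m256 d) {
    __m256 mask = _mm256_cmp_ps(a, _mm256_setzero_ps(), _CMP_GT_OS);
    return _mm256_blendv_ps(d, c, mask);   // lanes with the mask set pick the second source (c)
}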
Convert branches to data-parallel select operations: Before: if(a[i] > 0) b[i] = c[i]; else b[i] = d[i]; After (scalar): b[i] = (a[i] > 0) ? c[i] : d[i]; After (SIMD): __m256 mask = _mm256_cmp_ps(va, zero, _CMP_GT_OS); vb = _mm256_blendv_ps(vd, vc, mask); Both paths are computed, blend selects result based on condition. Steps: (1) Load all inputs (a, c, d). (2) Compute comparison mask. (3) Compute both paths (vc and vd here are already loaded). (4) Use blendv to select per-lane result. Cost: compute both paths + blend vs branch prediction. Typically faster unless one path is very expensive or condition is highly predictable (>95% one way). For expensive paths, consider hybrid: SIMD for common case, scalar cleanup for rare case.
Strength reduction replaces expensive operations with cheaper equivalents: (1) Multiply to shift: x*2 -> x<<1, x*4 -> x<<2 (for powers of 2). (2) Multiply to add sequence: x*5 -> x+(x<<2). (3) Divide to multiply: x/3 -> (x*0xAAAAAAAB)>>33 (magic number division). (4) Modulo to AND: x%8 -> x&7 (for powers of 2). (5) Loop index multiply to add: for(i) a[i*4] -> ptr=a; for(i) {*ptr; ptr+=4;}. Apply when: (1) Hot loops - profile shows significant time in expensive operations. (2) Division/modulo in inner loops - extremely expensive (10-40 cycles). (3) Constant multipliers - compiler may not optimize runtime-constant multiplies. Modern compilers apply many strength reductions automatically with -O2/-O3. Manual strength reduction still helps for complex expressions or when compiler lacks context (e.g., runtime constants).
Linearizing control flow removes branches to enable longer instruction sequences: (1) If-conversion: convert if-then-else to predicated execution. Both paths execute, predicate selects results. Before: if(p) a=b; else a=c; After (predicated): a = p ? b : c; or with predicates: (p) a=b; (!p) a=c; Both issue in same bundle. (2) Loop unswitching: if condition is loop-invariant, create two versions of loop. Before: for(i) { if(flag) do_a(); else do_b(); } After: if(flag) for(i) do_a(); else for(i) do_b(); Eliminates branch from hot loop. (3) Tail duplication: merge paths by duplicating code. (4) Speculative execution: execute likely path, recover if wrong. Benefits: longer basic blocks for scheduling, eliminates branch misprediction, enables SIMD. Cost: more instructions execute. Use when branches are unpredictable or paths are short.
Division by constant converts to multiply by reciprocal plus correction: For unsigned: x/d = (x * m) >> (32 + s) where m and s are magic constants computed from d. For d=3: x/3 = (x * 0xAAAAAAAB) >> 33. For d=7 the magic multiplier does not fit in 32 bits, so compilers multiply by 0x24924925 and apply a short add/shift correction sequence. For signed: add sign handling before and after. Implementation: (1) Precompute magic multiplier m and shift s for your divisor. (2) Replace divide with multiply + shift: uint32_t div3(uint32_t x) { return ((uint64_t)x * 0xAAAAAAABull) >> 33; }. Cost: multiply (3-4 cycles) + shift (1 cycle) vs divide (10-40 cycles). Works for any constant divisor. Many compilers do this automatically for compile-time constants. For runtime constants, precompute m and s, use for all divisions by that value.
Performance Analysis
8 questionsReading cycle traces to identify bottlenecks: (1) Execution timeline: look for gaps (stalls) between instruction completions. Long gaps indicate waiting for memory or dependencies. (2) Unit utilization: check which units are busy each cycle. Underutilized units indicate either not enough ILP or wrong bottleneck. (3) Memory events: identify cache misses - they appear as long latency loads. Track miss rate and miss latency. (4) Dependency chains: follow producer->consumer relationships. Long chains indicate critical path. (5) Branch mispredicts: look for pipeline flushes after conditional branches. (6) Resource conflicts: multiple operations wanting same unit create serialization. (7) Stall reasons: most traces categorize stalls (data hazard, structural hazard, cache miss). Total these to find dominant cause. (8) Instructions per cycle (IPC): calculate as instructions / total_cycles. Compare to theoretical max (VLIW width).
Iterative performance optimization follows a systematic cycle: (1) Measure baseline: profile original code, establish metrics (cycles, IPC, cache misses). (2) Identify bottleneck: use profiling data to find the dominant limiter (memory, compute, branches). (3) Hypothesize improvement: based on bottleneck, propose specific change (unroll, tile, vectorize, etc.). (4) Implement change: make ONE change at a time. Multiple changes obscure causation. (5) Measure result: profile modified code with same methodology. (6) Evaluate: did it improve? By how much? Any regressions elsewhere? (7) Iterate or backtrack: if improved, keep change and return to step 2 with new baseline. If not improved, revert and try different approach. (8) Document: record what worked, what didn't, and why. Stop when: returns diminish (next improvement is < 5%), code complexity becomes unmaintainable, or you've reached theoretical limits (roofline). The 80/20 rule applies - first optimizations often give 5-10x, later ones give 10-20%.
A/B testing optimization changes requires controlled comparison: (1) Baseline measurement: run original code multiple times (10-30), record mean and standard deviation of cycles/time. (2) Apply single change: modify ONE aspect (unroll factor, layout, etc.). (3) Optimized measurement: run modified code same number of times, same conditions. (4) Statistical comparison: use t-test to determine if difference is significant. Require p < 0.05 for significance. (5) Control variables: same input data, same machine state (warm cache), same CPU frequency (disable turbo for consistency). (6) Measure what matters: cycles (perf stat), time, cache misses, IPC - depending on what you're optimizing. (7) Iterate: if improvement, keep it. If not, revert and try different change. (8) Log everything: input size, code version, measurements. Enables reproducing results and understanding trends.
IPC (Instructions Per Cycle) measures how many instructions complete per clock cycle on average: IPC = instructions_executed / cycles_elapsed. Maximum possible IPC equals the processor's issue width. For a 4-wide superscalar: max IPC = 4. For a VLIW with 8 slots: max IPC = 8. Good IPC targets: Superscalar general code: 1-2 IPC is typical, 2.5-3.5 IPC is excellent. VLIW optimized code: 50-70% of maximum is good (4-5.6 for 8-wide). Memory-bound code: often limited to 0.5-1.0 IPC regardless of width. Compute-bound code: should approach maximum IPC. Interpreting IPC: Low IPC + high cache misses = memory-bound. Low IPC + high branch mispredicts = control-bound. Low IPC + few hazards = not enough ILP exposed. High IPC = well-optimized. Always compare IPC to the theoretical maximum for your specific processor.
Measure memory bandwidth to determine if code is memory-bound: (1) Using performance counters: measure LLC (Last Level Cache) misses. Bandwidth = LLC_misses * 64 bytes / time. (2) Peak bandwidth test: run STREAM benchmark to establish maximum achievable bandwidth (typically 70-90% of theoretical). (3) Application bandwidth: for your code, measure bytes transferred vs time. Compare to peak. (4) Formula: achieved_bandwidth = (bytes_read + bytes_written) / execution_time. For arrays: bytes = N * sizeof(element) * (reads + writes). (5) Bandwidth utilization = achieved / peak * 100%. If > 70%, likely memory-bound. (6) On Linux: perf stat -e LLC-load-misses,LLC-store-misses. (7) Intel VTune / AMD uProf provide memory bandwidth metrics directly. (8) If close to bandwidth limit, optimizations should focus on reducing memory traffic (better locality, compression, computation vs lookup).
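A rough C++ sketch of step (4), estimating achieved bandwidth for a simple streaming kernel (the byte count is an approximation that ignores write-allocate traffic; the array size and kernel are arbitrary examples):

    #include <chrono>
    #include <cstdio>
    #include <vector>

    int main() {
        const std::size_t n = std::size_t(1) << 26;        // 64M floats per array
        std::vector<float> x(n, 1.0f), y(n, 2.0f);
        auto t0 = std::chrono::steady_clock::now();
        for (std::size_t i = 0; i < n; ++i)
            y[i] += 3.0f * x[i];                           // read x, read y, write y
        auto t1 = std::chrono::steady_clock::now();
        double secs  = std::chrono::duration<double>(t1 - t0).count();
        double bytes = 3.0 * n * sizeof(float);            // bytes_read + bytes_written
        std::printf("%.2f GB/s (checksum %f)\n", bytes / secs / 1e9, y[n / 2]);
        return 0;
    }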
Identify wasted VLIW slots by analyzing bundle utilization: (1) NOP count: explicitly placed NOPs consume slots with no work. Count NOPs / total slots. (2) Slot fill rate: useful_operations / (cycles * slots_per_bundle). For 8-slot VLIW running 1000 cycles with 5000 operations: fill rate = 5000/(1000*8) = 62.5%. (3) Per-unit utilization: track which functional units are used. If load unit is 90% utilized but ALU is 30%, ALU slots are wasted. (4) Dependency-caused gaps: count cycles where an operation COULD run on an available unit but doesn't due to data not being ready. (5) Code structure analysis: look for narrow basic blocks (few instructions between branches) - limited scheduling freedom. (6) From trace: for each cycle, count empty/NOP slots. Aggregate: histogram of bundles with 0,1,2...N slots filled. Average fill rate is key metric.
Profile execution unit utilization to identify bottlenecks: (1) Hardware counters: modern CPUs expose per-port/unit utilization. Intel: UOPS_DISPATCHED_PORT.PORT_*. AMD: similar PMCs. (2) VTune/uProf analysis: these tools provide execution port utilization breakdown. Look for ports near 100% = bottleneck. (3) Static analysis: count operations by type in your inner loop. Unit with highest operation_count/available_units is likely bottleneck. (4) Experimental: add NOPs for specific unit type. If total time increases, that unit wasn't the bottleneck. If time stays same, it was. (5) Throughput analysis: if theoretical peak is 2 multiplies/cycle and you're achieving 1.8, multiply unit is saturated. (6) Balance check: compare load:compute:store ratio to machine capabilities. If code has 4 loads, 2 computes per iteration but machine has 2 load ports and 4 compute ports, loads are bottleneck. Rebalance by reducing memory traffic or increasing compute.
Latency optimization minimizes time for a single operation. Throughput optimization maximizes operations per second: Latency (response time): time from request to completion. Critical for: user-facing operations, real-time systems, operations on critical path. Optimize by: reducing dependencies, using faster operations, caching results. Throughput (bandwidth): operations completed per unit time. Critical for: batch processing, streaming, data-parallel workloads. Optimize by: pipelining, parallelism, reducing overhead per operation. Trade-offs: batching improves throughput but increases latency. Parallelism improves throughput but may not help single-operation latency. For VLIW/SIMD: software pipelining improves throughput (overlapped iterations) but not single-iteration latency. Multiple accumulators improve throughput of reduction. Both matter: optimize for latency when you have ONE item, throughput when you have MANY items. Often different code paths optimal for each.
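For example, a throughput-oriented reduction in C++ might use several independent accumulators to break the single add-latency chain (a sketch; assumes n is a multiple of 4):

    double sum_throughput(const double* a, int n) {
        double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        for (int i = 0; i < n; i += 4) {       // four independent dependency chains
            s0 += a[i];
            s1 += a[i + 1];
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        return (s0 + s1) + (s2 + s3);          // combine once at the end
    }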
Hardware performance counters
7 questionsHardware Performance Counters (also called Performance Monitoring Counters or PMCs) are special CPU registers that count hardware events like instructions executed, cache misses, and branch mispredictions. The Performance Monitoring Unit (PMU) contains these counters. Most Intel Core processors have 4 fully programmable counters and 3 fixed-function counters per logical core. Fixed counters measure core clocks, reference clocks, and instructions retired. Programmable counters let you choose which events to measure. The Linux perf tool accesses PMU counters through the perf_event_open() system call, providing abstractions over hardware-specific capabilities.
Event multiplexing occurs when you request more events than available hardware counters. The PMU time-slices between event groups, measuring each for a portion of total runtime, then scales up the counts. This introduces estimation error. Intel CPUs typically have 4 programmable + 3 fixed counters. Multiplexing matters when: calculating derived metrics (ratios become inaccurate if numerator and denominator weren't measured simultaneously), comparing absolute counts across events (both have estimation error), or when workload behavior varies over time (different events measured during different phases). To minimize impact: group related events (use perf -e '{event1,event2}'), reduce total events requested, or run multiple times measuring different event subsets.
Intel PCM provides real-time access to performance counters without Linux perf. Install from GitHub (opcm/pcm). Run 'sudo pcm' for real-time display of: IPC, cache hit rates, memory bandwidth, QPI traffic, and power consumption across all cores. For specific metrics: 'pcm-memory' for memory bandwidth, 'pcm-pcie' for PCIe traffic, 'pcm-power' for power metrics. PCM works on both Linux and Windows. It accesses uncore PMUs (memory controller, QPI) not available through standard perf interface. Output includes: L2/L3 hit rates, memory read/write bandwidth per channel, core and package power. Useful for understanding system-level behavior that per-process profiling misses. Requires root/admin access for MSR reads.
Run 'perf list' to display all available performance events on your system. Events are categorized as: Hardware events (cycles, instructions, cache-references, cache-misses, branches, branch-misses), Software events (context-switches, page-faults, cpu-migrations), Hardware cache events (L1-dcache-loads, L1-dcache-load-misses, LLC-loads, LLC-load-misses), and Tracepoint events (kernel functions, syscalls). The available events depend on your CPU model. For Intel processors, use the pmu-tools 'ocperf' wrapper to access the full list of processor-specific events not exposed by default perf, including detailed microarchitectural events.
Use 'perf stat -e event1,event2,event3 ./program' to measure multiple events. Example: 'perf stat -e cycles,instructions,cache-misses,branch-misses ./program'. Most Intel CPUs have 4 programmable counters plus 3 fixed counters, so measuring more than ~7 events requires multiplexing - perf time-slices between event groups and estimates totals. Use event groups with curly braces to ensure events are measured together: 'perf stat -e '{cycles,instructions}','{cache-references,cache-misses}' ./program'. This ensures cycles and instructions are counted simultaneously (enabling accurate IPC calculation), and cache events are grouped together. Check the perf stat output for a percentage in parentheses next to a counter or for '<not counted>' entries - both indicate that multiplexing occurred and the reported values are scaled estimates.
Intel Processor Trace records complete control flow by encoding taken branches into a compressed trace buffer. Unlike sampling which captures point-in-time snapshots, PT provides exact execution history. Use PT when: you need exact function call sequences (debugging race conditions), you want precise timing of specific code paths, branch sampling misses rare events, or you need to understand control flow leading to a bug. PT has higher overhead than sampling and generates large traces. Enable with 'perf record -e intel_pt// ./program'. Requires Broadwell or newer Intel CPU (check 'grep intel_pt /proc/cpuinfo'). PT can generate 'virtual LBRs' of arbitrary size, overcoming the 32-entry hardware LBR limit.
Use 'perf stat ./program' which reports IPC by default, calculated as instructions / cycles. Modern CPUs can execute 4+ instructions per cycle with superscalar execution. Typical IPC values: <1.0 indicates stalls (memory-bound, branch mispredictions, dependency chains), 1.0-2.0 is common for general code, 2.0-4.0 indicates well-optimized compute code, >4.0 possible with SIMD. Low IPC + high cache misses = memory-bound. Low IPC + high branch-misses = branch misprediction bound. Low IPC + neither = likely dependency chains or lack of instruction-level parallelism. IPC alone doesn't tell the full story - use TMAM methodology for detailed bottleneck breakdown. Compare IPC between code versions to assess optimization impact.
Flame graphs and call stacks
6 questionsDifferential flame graphs show what changed between two profiles. Workflow: 1) Capture baseline: 'perf record -F 99 -a -g -- sleep 30' during workload, 'perf script > before.perf'. 2) Make changes and capture again: 'perf script > after.perf'. 3) Generate differential: './stackcollapse-perf.pl before.perf > before.folded', './stackcollapse-perf.pl after.perf > after.folded', './difffolded.pl before.folded after.folded | ./flamegraph.pl > diff.svg'. Red indicates functions that got slower (more samples), blue indicates faster (fewer samples). Width shows absolute difference. This immediately highlights what changed - faster to identify regressions than comparing two separate flame graphs manually.
In a flame graph: each box represents a function (stack frame), the y-axis shows stack depth (bottom is entry point, top is leaf functions), the x-axis shows population of samples (NOT time passage - it's sorted alphabetically). Box width indicates relative time spent. Look for wide boxes as these are hotspots. Functions beneath are callers (parents), functions above are callees (children). Colors typically indicate: green for user code, red/orange for kernel code, yellow for C/C++ runtime. To find optimizations, look for the widest towers - these represent code paths consuming the most CPU. Prior to flame graphs, understanding complex profiles took hours; now the hottest paths are immediately visible.
Sampling profiler limitations: 1) Misses short functions called less often than sampling interval - a 1us function at 1000Hz sampling has ~0.1% chance of being caught per call. 2) Statistical nature means rare hot paths may not appear in profiles. 3) Cannot accurately count function call frequency - only measures time. 4) 'Skid' on interrupt-based sampling places samples slightly after actual event. 5) Kernel/interrupt time may be attributed incorrectly. 6) Multi-threaded aliasing if sampling correlates with thread scheduling. 7) Cannot detect contention or blocking (need off-CPU analysis). When they fail: very short benchmarks, rarely-called expensive functions, timing-sensitive debugging. For exact call counts and sequences, use tracing or instrumentation-based profilers despite higher overhead.
Off-CPU flame graphs show where threads spend time blocked (not on CPU) - waiting for I/O, locks, sleep, page faults, etc. Regular (on-CPU) flame graphs miss this because CPU profilers only sample running threads. Create with: record scheduler events 'perf record -e sched:sched_switch -a -g' or use BPF-based tools like bcc's offcputime. Use when: application seems slow but CPU utilization is low, you suspect I/O or lock contention, threads frequently block, or on-CPU profile doesn't explain observed latency. Off-CPU analysis complements on-CPU profiling - together they account for all wall-clock time. The flame graph shows blocking call stacks, with width indicating total blocked time.
Sampling-based profiling captures call stack snapshots at regular intervals (typically timer-based) rather than instrumenting every function call. A sampling profiler reads the call stack periodically (e.g., every 10ms) to record what code is running. Functions consuming more CPU time appear in more samples. This is statistically accurate: with enough samples (1000+ minimum, 5000+ ideal), you get reliable hotspot identification. Advantages over instrumentation: minimal overhead (often <1%), no code modification needed, works on optimized binaries. The overhead calculation: at 100Hz sampling with 10,000 instructions per sample on a 1GHz CPU, theoretical overhead is only 0.1%. Most modern profilers (perf, VTune, Instruments) use sampling.
First record with stack traces: 'perf record -F 99 -a -g -- sleep 60' (99 Hz sampling, all CPUs, call graphs). Then convert to text: 'perf script > out.perf'. Clone FlameGraph tools: 'git clone https://github.com/brendangregg/FlameGraph'. Generate the SVG: './stackcollapse-perf.pl out.perf | ./flamegraph.pl > flamegraph.svg'. The resulting interactive SVG shows the call stack hierarchy where: x-axis represents stack profile population (sorted alphabetically, NOT time), y-axis shows stack depth, and box width indicates time spent. The widest boxes at any level are your hottest code paths. Click boxes to zoom into subtrees.
Tracing tools
5 questionsDownload tracebox: 'curl -LO https://get.perfetto.dev/tracebox && chmod +x tracebox'. Set permissions: 'sudo chown -R $USER /sys/kernel/tracing', 'echo -1 | sudo tee /proc/sys/kernel/perf_event_paranoid', 'echo 0 | sudo tee /proc/sys/kernel/kptr_restrict'. Create a config enabling callstack sampling with the callstack_sampling field in your data source config. Run tracebox with your config to collect traces. Open the resulting .pb file in the Perfetto UI (ui.perfetto.dev). Perfetto shows callstack samples as instant events on the timeline within process track groups, with dynamic flamegraph views when selecting time regions. Convert to pprof format with: 'python3 traceconv profile --perf trace.pb'.
Navigate to chrome://tracing in Chrome browser. Click 'Record' to start capture, select categories to trace (more categories = more data but potentially noisy), perform the action you want to profile, then 'Stop' recording. The trace shows TRACE_EVENT data from Chrome's instrumented code in a hierarchical timeline view per thread per process. Save traces as JSON with the Save button. Note: chrome://tracing is deprecated in favor of Perfetto (ui.perfetto.dev) which is faster, more stable, and supports custom queries. For web developers, Chrome DevTools Performance panel is often more ergonomic - press F12, go to Performance tab, click record, and it auto-selects appropriate trace categories for the current tab only.
Chrome Trace Event Format is a JSON format for profiling data viewable in chrome://tracing or Perfetto UI. Basic structure: array of event objects with fields: name (event name), cat (category), ph (phase: B=begin, E=end, X=complete, i=instant), ts (timestamp in microseconds), pid (process ID), tid (thread ID), args (metadata). Example duration event: {"name":"function","cat":"custom","ph":"X","ts":1000,"dur":500,"pid":1,"tid":1}. Write your profiling system to output this format and open in chrome://tracing for visualization without building custom tools. Supports nested events, counters, async events, and flow events for cross-thread/process relationships.
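A tiny C++ sketch that emits one complete ('X') event in this format; the file name and event fields are arbitrary examples:

    #include <cstdio>

    int main() {
        std::FILE* f = std::fopen("trace.json", "w");
        if (!f) return 1;
        std::fprintf(f, "[\n");
        std::fprintf(f,
            "{\"name\":\"parse\",\"cat\":\"custom\",\"ph\":\"X\","
            "\"ts\":1000,\"dur\":500,\"pid\":1,\"tid\":1}\n");
        std::fprintf(f, "]\n");                 // load trace.json in ui.perfetto.dev
        std::fclose(f);
        return 0;
    }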
Profiling collects statistical samples to identify where time is spent - you see hotspots and call distribution but not exact execution sequence. Tracing records discrete events with timestamps to show exact execution flow and timing - you see what happened and when, but data volume is large. Use profiling for: finding hotspots, optimizing CPU-bound code, understanding where time goes generally. Use tracing for: debugging timing issues, understanding event sequences, analyzing latency distributions, finding rare slow paths. Profiling has lower overhead (sampling), tracing higher overhead (records all events). Many tools do both: perf can sample or trace, VTune offers Hotspots (profiling) and Platform Profiler (tracing), Perfetto primarily traces but supports sampling.
Use function tracing with 'perf probe' and 'perf trace'. First add probes: 'perf probe --add function_name' for entry, 'perf probe --add function_name%return' for return. Then record: 'perf record -e probe:function_name,probe:function_name__return ./program'. Use 'perf script' to see timestamped events, calculate latencies by matching entry/return pairs. For system calls: 'perf trace ./program' shows all syscalls with latencies. For specific functions without probes, use dynamic tracing: 'perf record -e 'sched:*' -g' for scheduler events. Combine with -T flag to add timestamps. This is more precise than sampling for measuring specific function execution times but has higher overhead than PMU-based profiling.
Microbenchmarking best practices
5 questionsInclude <benchmark/benchmark.h>. Define benchmark functions taking benchmark::State& state parameter. Time code inside 'for (auto _ : state)' loop - this is the measured section. Use benchmark::DoNotOptimize(result) to prevent dead code elimination. Register with BENCHMARK(BM_FunctionName). End file with BENCHMARK_MAIN(). Build with CMake using -DBENCHMARK_DOWNLOAD_DEPENDENCIES=on -DCMAKE_BUILD_TYPE=Release. Run with flags: --benchmark_format=json for machine-readable output, --benchmark_repetitions=N for statistical reliability, --benchmark_enable_random_interleaving=true to reduce run-to-run variance from ordering and frequency-scaling effects. Google Benchmark automatically determines iteration count for statistical stability.
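A minimal example file following these steps (the benchmarked function is a placeholder):

    #include <benchmark/benchmark.h>
    #include <numeric>
    #include <vector>

    static void BM_Accumulate(benchmark::State& state) {
        std::vector<int> v(state.range(0), 1);            // setup outside the timed loop
        for (auto _ : state) {                            // measured region
            int sum = std::accumulate(v.begin(), v.end(), 0);
            benchmark::DoNotOptimize(sum);                // keep the result "used"
        }
    }
    BENCHMARK(BM_Accumulate)->Arg(1 << 10)->Arg(1 << 20);
    BENCHMARK_MAIN();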
Control sources of variance: 1) CPU: disable Turbo Boost, fix frequency, pin threads to cores with taskset/numactl. 2) Memory: disable ASLR ('echo 0 | sudo tee /proc/sys/kernel/randomize_va_space'), warm up caches with dry runs. 3) OS: use isolcpus to reserve cores, disable irqbalance, use real-time scheduling if needed. 4) Thermal: let system reach steady-state temperature. 5) Statistical: run many iterations (30+), use median instead of mean, report confidence intervals. 6) Environment: close other applications, disable network, use consistent environment variables. 7) Benchmarking: randomize run order to avoid ordering effects. Check coefficient of variation (stddev/mean) - should be <5% for reliable results. If variance remains high, investigate sources with multiple profiler runs.
Key pitfalls: 1) Dead Code Elimination - compiler removes code with unused results. Fix: use JMH's Blackhole.consume() or Google Benchmark's DoNotOptimize(). 2) Constant Folding - compiler pre-computes results at compile time. Fix: use runtime inputs, not compile-time constants. 3) Loop Optimization - compiler may hoist computations out of loops. Fix: use benchmark framework's iteration mechanism, not manual loops. 4) Inadequate Warmup - JIT hasn't optimized code yet. Fix: run sufficient warmup iterations (JMH default: 5). 5) Measurement variance - Fix: run multiple iterations and forks, report with confidence intervals. 6) Benchmark order effects - Fix: randomize execution order (Google Benchmark: --benchmark_enable_random_interleaving).
Add JMH dependency to Maven: org.openjdk.jmh:jmh-core and jmh-generator-annprocess. Create benchmark class with @Benchmark annotated methods. Use @State(Scope.Thread) for per-thread state. JMH supplies the measurement loop itself (the analogue of Google Benchmark's 'for (auto _ : state)') and picks iteration counts automatically based on the target measurement time. Configure with annotations: @Warmup(iterations=5) for warmup, @Measurement(iterations=5) for actual measurements, @Fork(2) for JVM forks. Run via Maven plugin or JMH runner main class. JMH automatically handles JIT compilation warmup, dead code elimination prevention, and statistical analysis, outputting mean, error, and confidence intervals.
CPU frequency scaling introduces variance as Turbo Boost activates/deactivates based on thermal and power conditions. Options: 1) Disable Turbo Boost during benchmarks: 'echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo' (Linux). 2) Fix CPU frequency with cpupower: 'sudo cpupower frequency-set -g performance -d 3.0GHz -u 3.0GHz'. 3) Use cycles instead of time as primary metric - cycles are invariant to frequency. 4) Let frequency vary but run many iterations - Google Benchmark's --benchmark_enable_random_interleaving helps. 5) Warm up the benchmark to reach steady-state Turbo frequency before measuring. Report whether frequency was fixed and what frequency was used.
Memory profiling
5 questionsMemory leak detection (e.g., Valgrind Memcheck) finds memory that was allocated but never freed - the pointer is lost. Heap profiling (e.g., Massif) tracks all allocations over time to show memory usage patterns and identify allocation hotspots, regardless of whether memory is properly freed. Heap profilers answer: which functions allocate the most memory? How does usage change over time? Where is peak memory? They also detect 'space leaks' - memory that's technically reachable (pointer exists) but not actually used, which leak detectors miss. Use leak detection to find bugs; use heap profiling to optimize memory consumption and identify allocation-heavy code paths.
perf mem records and analyzes memory access samples. Run: 'perf mem record ./program' then 'perf mem report' to see memory access breakdown. It shows: data source (L1/L2/L3 cache, local/remote DRAM), access type (load/store), addresses accessed, and latency. Use for identifying: memory-bound hot spots, cache miss sources, NUMA issues (remote memory access). On Intel, perf mem uses PEBS memory sampling which captures precise load/store addresses. Filter by latency: 'perf mem record -t 30 ./program' to sample only accesses with 30+ cycle latency. The report shows data addresses - combine with 'perf report --sort mem' for source analysis. Useful for optimizing data layout and access patterns.
NUMA (Non-Uniform Memory Access) systems have different latencies to local vs remote memory. Profile with: 'perf stat -e numa_hit,numa_miss,numa_foreign,numa_interleave ./program' to count NUMA-related events. Use 'numactl --hardware' to see topology. Intel VTune has Memory Access analysis showing NUMA traffic. For Linux, check /proc/PID/numa_maps to see memory placement. High numa_miss or remote memory accesses indicate suboptimal placement. Optimize with numactl: 'numactl --membind=0 ./program' binds memory to node 0, 'numactl --cpunodebind=0 --membind=0 ./program' binds both CPU and memory. Consider NUMA-aware data structures that keep data local to accessing threads.
Multiple approaches: 1) Valgrind Massif: 'valgrind --tool=massif ./program' then 'ms_print massif.out.<pid>' - shows heap usage over time with allocation call stacks. 2) Heaptrack: lower overhead than Massif, tracks every allocation with full backtrace. Run 'heaptrack ./program' then 'heaptrack_gui heaptrack.<program>.<pid>.gz' for visualization. 3) perf with dynamic probes: add a probe on the allocator ('perf probe -x /path/to/libc.so.6 malloc') and record with 'perf record -e probe_libc:malloc -g ./program'. 4) gperftools (tcmalloc): link with -ltcmalloc and set HEAPPROFILE environment variable. 5) Address Sanitizer with -fsanitize=address also tracks allocations. For production profiling with minimal overhead, consider sampling-based approaches or eBPF-based tools that don't require recompilation.
Run 'valgrind --tool=massif ./program' to profile heap allocations over time. Massif measures both useful allocation space and bookkeeping/alignment overhead. Output goes to massif.out.<pid>; view it with 'ms_print massif.out.<pid>', which prints a graph of heap usage over time plus the allocation call trees at each snapshot, showing which code paths account for peak memory.
Memory Access Optimization
4 questionsUse __builtin_prefetch(addr, rw, locality) to hint that addr will be accessed soon. Parameters: rw is 0 for read, 1 for write. locality is 0-3 (3=high temporal locality, keep in all cache levels; 0=no locality, can evict immediately). Example: when streaming through an array, issue __builtin_prefetch(&arr[i+16], 0, 0) some iterations ahead of the current element. Choose the prefetch distance so that (iterations ahead) x (time per iteration) covers DRAM latency (~100ns); for short iterations that typically means tens of iterations ahead, so tune the distance experimentally. Prefetching benefits irregular access patterns that defeat hardware prefetchers. Measure carefully - incorrect prefetching can hurt performance by polluting caches.
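A sketch of software prefetching over an index array (the kind of irregular access hardware prefetchers handle poorly); PREFETCH_DIST is a tunable guess, not a recommended value:

    #include <cstddef>

    constexpr std::size_t PREFETCH_DIST = 16;   // tune experimentally per machine

    long sum_indirect(const long* data, const int* idx, std::size_t n) {
        long sum = 0;
        for (std::size_t i = 0; i < n; ++i) {
            if (i + PREFETCH_DIST < n)
                __builtin_prefetch(&data[idx[i + PREFETCH_DIST]], 0, 0);  // read, no reuse expected
            sum += data[idx[i]];
        }
        return sum;
    }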
The restrict keyword (C99) tells the compiler that a pointer is the only way to access the pointed-to memory during its scope. This enables optimizations that would be unsafe with potential aliasing. Example: void add(int* restrict a, int* restrict b, int* restrict c, int n) allows the compiler to vectorize without checking if a, b, c overlap. Without restrict, the compiler must assume writes through one pointer might affect reads through another, preventing reordering and vectorization. Use restrict when you can guarantee no aliasing. Incorrect use causes undefined behavior. C++ uses __restrict as a non-standard extension.
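A C++ sketch of the same idea using the __restrict extension mentioned above (GCC, Clang, and MSVC all accept this spelling):

    // With the no-alias promise the compiler may vectorize without emitting
    // runtime overlap checks; violating the promise is undefined behavior.
    void add(int* __restrict a, const int* __restrict b, const int* __restrict c, int n) {
        for (int i = 0; i < n; ++i)
            a[i] = b[i] + c[i];
    }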
False sharing occurs when threads on different cores modify variables that share a cache line, causing the cache line to bounce between cores even though threads access different data. Each modification invalidates other cores' cached copies, causing expensive cache coherency traffic. To avoid: (1) Pad structures to cache line size: struct __attribute__((aligned(64))) { int counter; char pad[60]; }. (2) Use thread-local storage for per-thread counters. (3) Group related data accessed by the same thread. (4) Use compiler-provided padding: alignas(64) in C++11. False sharing can cause 10-100x slowdown in multithreaded code. Profile with perf to detect it.
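A minimal C++11 sketch of option (1), giving each thread's counter its own cache line (64-byte lines assumed):

    #include <cstdint>

    struct alignas(64) PaddedCounter {          // sizeof(PaddedCounter) == 64
        std::uint64_t value;                    // one writer thread per element
    };

    PaddedCounter counters[8];                  // adjacent counters no longer share a line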
Cache line size is typically 64 bytes on modern CPUs. Data crossing cache line boundaries requires two memory fetches instead of one. Align structures to cache line boundaries using: struct __attribute__((aligned(64))) Data { ... }; or posix_memalign(&ptr, 64, size). Benefits: single cache line access for data that fits, prevents false sharing in multithreaded code where different cores modify adjacent data. Tradeoffs: over-aligning wastes memory (aligning 4-byte int to 64 bytes wastes 60 bytes). Best practice: align frequently accessed hot data and shared data in concurrent code; don't over-align small structures.
Cache miss analysis
4 questionsUse 'perf stat -e L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses ./program' to count cache events. Key metrics to calculate: L1 miss rate = L1-dcache-load-misses / L1-dcache-loads, LLC miss rate = LLC-load-misses / LLC-loads. For sampling-based analysis, use 'perf record -e cache-misses ./program' followed by 'perf report' to identify functions causing the most cache misses. Note that cache miss latency costs vary significantly: L1 hit is about 3-4 cycles, L2 hit is 10-14 cycles, L3 hit is 40-50 cycles, and a main memory access is 200-300 cycles. Focus optimization efforts on L3 misses as they have the highest latency impact.
Profile TLB (Translation Lookaside Buffer) misses with perf: 'perf stat -e dTLB-loads,dTLB-load-misses,iTLB-loads,iTLB-load-misses ./program'. High TLB miss rates indicate working set exceeds TLB coverage. Modern CPUs have: L1 dTLB ~64 entries, L1 iTLB ~128 entries, L2 STLB ~1536 entries. With 4KB pages, max coverage is ~6MB. Solutions: use huge pages (2MB on x86) to increase TLB coverage 512x, improve memory locality to reduce working set, or use transparent huge pages (THP). Enable huge pages: 'echo always > /sys/kernel/mm/transparent_hugepage/enabled' or use madvise(MADV_HUGEPAGE). TLB miss penalty is ~20-100 cycles for page table walk, significant for memory-intensive workloads.
Run 'valgrind --tool=cachegrind ./program' to simulate cache behavior. Cachegrind models a machine with split L1 caches (instruction and data) and a unified last-level (LL) cache. After execution, it outputs summary statistics and creates a cachegrind.out.<pid> file; run 'cg_annotate cachegrind.out.<pid>' to see per-function and per-line instruction, D1, and LL miss counts. Because it is a full simulation, runs are slow (roughly 20-100x) but deterministic and unaffected by other system activity.
The three key metrics are: 1) LX request rate - number of cache level X requests per instruction. Low request rate means data comes from faster cache levels. 2) LX miss rate - number of cache level X misses per instruction. High request rate with low miss rate means data is mostly served from that cache level. High miss rate means data comes from slower memory. 3) LX miss ratio - ratio of misses to requests at level X. This is commonly cited but only meaningful when request rate is high. When analyzing cache performance, focus on miss rate (misses per instruction) rather than miss ratio alone, as a high miss ratio with low request rate may not indicate a real performance problem.
Compiler Optimization Flags
4 questionsGCC auto-vectorization: Enabled at -O2 with basic cost model, full at -O3. Key flags: -ftree-vectorize enables vectorization explicitly. -fopt-info-vec shows what was vectorized. -fopt-info-vec-missed shows failures and reasons. -march=native uses all available SIMD instructions (SSE, AVX, etc.). -ffast-math may enable more vectorization by allowing reordering. To help vectorization: use restrict on pointers, align data to SIMD width (32 bytes for AVX), avoid loop-carried dependencies, use simple loop bounds. Check vectorization report - common failures: aliasing, non-contiguous access, complex control flow, unaligned access.
GCC -O2 enables most optimizations without aggressive space-time tradeoffs. Key optimizations include: -finline-small-functions, -findirect-inlining, -fpartial-inlining, -fthread-jumps, -fcrossjumping, -foptimize-sibling-calls, -fcse-follow-jumps, -fgcse, -fexpensive-optimizations, -frerun-cse-after-loop, -fcaller-saves, -fpeephole2, -fschedule-insns2, -fstrict-aliasing, -fstrict-overflow, -freorder-blocks, -freorder-functions, -ftree-vrp, -ftree-pre, -ftree-switch-conversion. As of GCC 12, vectorization is enabled at -O2 with -fvect-cost-model=very-cheap. -O2 is the recommended level for production code balancing speed and compilation time.
GCC -O3 enables all -O2 optimizations plus: -fgcse-after-reload (global CSE after register allocation), -fipa-cp-clone (interprocedural constant propagation with function cloning), -floop-interchange, -floop-unroll-and-jam, -fpeel-loops, -fpredictive-commoning (reuses computations across loop iterations), -fsplit-loops, -fsplit-paths, -ftree-loop-distribution, -ftree-partial-pre, -funswitch-loops (moves loop-invariant conditionals out), -fvect-cost-model=dynamic (full vectorization cost modeling). These may increase code size significantly. -O3 can sometimes be slower than -O2 due to instruction cache pressure - benchmark your specific code.
-Ofast enables all -O3 optimizations plus non-standard-compliant optimizations: -ffast-math (allows reordering floating-point operations, assuming no NaN/Inf, enables reciprocal approximations), -fallow-store-data-races (allows introducing data races for performance). WARNING: -ffast-math breaks IEEE 754 compliance and can produce incorrect results for code depending on precise floating-point semantics. Never use -Ofast system-wide. Only use it for specific numerical code after verifying correctness is maintained. For most applications, stick with -O2 or -O3. If you need -ffast-math, enable it selectively per file.
Roofline model analysis
3 questionsArithmetic Intensity (AI) = FLOPs performed / Bytes transferred from memory. It determines whether code is compute-bound or memory-bound. Calculate: count floating-point operations in a kernel, count bytes read/written. Example: DAXPY (y = a*x + y) with n elements: 2n FLOPs (multiply + add), 3n*8 bytes (read x, read y, write y), AI = 2n / 24n = 0.083 FLOPs/byte - very memory-bound. Intel Advisor calculates AI automatically in Roofline analysis. Compare AI to machine balance (peak FLOPS / peak memory bandwidth). If AI < machine balance, kernel is memory-bound; if AI > machine balance, kernel is compute-bound. Increase AI through cache blocking, data reuse, or algorithm changes to move from memory-bound to compute-bound.
The Roofline Model visualizes application performance relative to hardware limits. The X-axis is Arithmetic Intensity (operations per byte of data moved), Y-axis is performance (operations per second). The 'roofline' has two parts: a sloped memory-bound region (limited by memory bandwidth) and a flat compute-bound region (limited by peak FLOPS). Each dot represents a kernel/loop. If a dot is below the sloped roof, it's memory-bound - optimize data movement. If below the flat roof, it's compute-bound - optimize computation. Intel Advisor generates roofline charts automatically. The Cache-Aware Roofline Model extends this with separate roofs for each cache level, helping identify which memory level is the bottleneck.
In Intel Advisor GUI: create project, specify executable, run Survey analysis first to collect timing data, then run Trip Counts analysis with FLOPS collection enabled. Alternatively, use the 'Collect Roofline' shortcut which runs both. Command line: 'advisor --collect=survey --project-dir=./adv -- ./program' then 'advisor --collect=tripcounts --flop --project-dir=./adv -- ./program'. The Roofline pane shows kernels as dots with size/color indicating execution time. Check the Recommendation tab for optimization guidance. A kernel's vertical position relative to roofs indicates bottlenecks - if above a roof, that's not the primary bottleneck. Focus on kernels far below roofs (room to optimize) and large dots (high time impact).
Cycle counting and measurement
3 questionsInclude <x86intrin.h> and use __rdtsc() to read the Time Stamp Counter. For accurate measurements, use CPUID to serialize instructions before the first RDTSC, and RDTSCP (which has partial serialization) at the end. A reliable pattern is: call CPUID, call RDTSC (start), run code, call RDTSCP (end), call CPUID. The final CPUID prevents instructions after RDTSCP from being reordered. Modern Intel CPUs since Nehalem (2008) have an 'invariant TSC' that increments at a constant rate regardless of CPU frequency scaling or power states. Note that TSC frequency differs from actual CPU frequency - for example, a CPU ranging 800MHz-4800MHz might have TSC ticking at a fixed 2.3GHz.
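A hedged GCC/Clang sketch of that CPUID / RDTSC ... RDTSCP / CPUID pattern; the workload is a placeholder:

    #include <x86intrin.h>   // __rdtsc, __rdtscp
    #include <cpuid.h>       // __get_cpuid (serializing CPUID)
    #include <cstdint>
    #include <cstdio>

    static void work() {                         // placeholder code under test
        volatile std::uint64_t x = 0;
        for (int i = 0; i < 100000; ++i) x += i;
    }

    int main() {
        unsigned a, b, c, d, aux;
        __get_cpuid(0, &a, &b, &c, &d);          // serialize before the first read
        std::uint64_t start = __rdtsc();
        work();
        std::uint64_t stop = __rdtscp(&aux);     // waits for prior instructions to finish
        __get_cpuid(0, &a, &b, &c, &d);          // keep later instructions from moving up
        std::printf("TSC ticks: %llu\n", (unsigned long long)(stop - start));
        return 0;
    }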
RDTSC overhead is approximately 150-200 clock cycles on modern Intel processors. For accurate benchmarking, measure the RDTSC overhead separately and subtract it from your results. Intel and Agner Fog recommend this approach. However, for functions taking 100,000+ cycles, the overhead is negligible and can be ignored. Even RDTSC itself returns varying results, so sample many times - around 3 million samples at 4.2GHz produces stable averages. Also use SetThreadAffinityMask (Windows) or sched_setaffinity (Linux) to pin your thread to a single CPU core, since TSC values are not synchronized across cores on multi-processor systems.
Wall-clock time (elapsed/real time): total time from start to finish, including waiting for I/O, other processes, sleeping. What a stopwatch would measure. CPU time: time CPU spent executing your code, excluding waits. If process sleeps for 1 second, wall time increases by 1s but CPU time doesn't. User time: CPU time spent in user space executing your code. System time: CPU time spent in kernel on behalf of your process (syscalls, I/O operations). User + System = total CPU time. On multi-core: CPU time can exceed wall time if threads run in parallel. The Unix 'time' command reports all three. For benchmarking compute-bound code, use CPU time; for user-facing latency, use wall time.
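A small Linux/C++ sketch contrasting the two clocks (sleeping adds wall time but almost no CPU time):

    #include <cstdio>
    #include <time.h>
    #include <unistd.h>

    static double secs(const timespec& t) { return t.tv_sec + t.tv_nsec * 1e-9; }

    int main() {
        timespec w0, w1, c0, c1;
        clock_gettime(CLOCK_MONOTONIC, &w0);             // wall clock
        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &c0);    // CPU time of this process
        usleep(100 * 1000);                              // 100 ms of sleeping
        clock_gettime(CLOCK_MONOTONIC, &w1);
        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &c1);
        std::printf("wall %.3f s, cpu %.3f s\n", secs(w1) - secs(w0), secs(c1) - secs(c0));
        return 0;
    }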
Statistical analysis of benchmarks
3 questionsMinimum 1000 samples for basic reliability, 5000+ samples for high confidence. Confidence interval width is inversely proportional to square root of sample size - quadrupling samples halves the interval width. For comparing benchmarks, use statistical tests: t-test for two configurations (requires normal distribution), ANOVA for multiple configurations. Report results with confidence intervals (typically 95%) rather than just means. Use median and median absolute deviation for non-normal distributions instead of mean and standard deviation. For LLM and ML benchmarks, recent research shows 10 independent trials per configuration with reported variance and confidence intervals as best practice.
Essential metrics to report: 1) Central tendency: mean AND median (median more robust to outliers). 2) Variance: standard deviation, interquartile range, min/max. 3) Confidence intervals: 95% CI for mean. 4) Sample size: number of iterations/runs. 5) Methodology: warmup iterations, measurement iterations, tools used. Hardware: CPU model (exact SKU), RAM size/speed, storage type. Software: OS version, compiler/runtime version, optimization flags. Environment: frequency scaling settings, other running processes, whether virtualized. For comparison claims, report: statistical test used (t-test, Mann-Whitney), p-value or whether confidence intervals overlap. Include raw data or histogram when possible. Note any known sources of variance.
For comparing two configurations: use t-test if data is normally distributed with similar standard deviations. For multiple configurations: use ANOVA (one-factor analysis of variance). For non-normal distributions: use non-parametric tests like Mann-Whitney U. Always report confidence intervals - 95% is standard but 90% or 99% may be appropriate depending on risk tolerance. Use Maritz-Jarrett method for confidence intervals around percentiles/quantiles. For high-variance workloads, consider CUPED (Controlled-experiment Using Pre-Experiment Data) to reduce variance. Avoid comparing just means - overlapping confidence intervals suggest no statistically significant difference. Visualize distributions, not just summary statistics.
Bottleneck identification methodology
3 questionsTMAM is a hierarchical methodology for identifying CPU bottlenecks. It classifies execution into four top-level categories: 1) Retiring - useful work, ideal state. 2) Bad Speculation - wasted work from mispredicted branches. 3) Front-End Bound - instruction fetch/decode bottlenecks (I-cache misses, complex instructions). 4) Back-End Bound - subdivided into Memory Bound (cache misses, memory latency) and Core Bound (execution unit contention, long-latency operations). Start at the top level to identify which category dominates, then drill down. Key insight: only optimize the bottleneck category - improving non-bottleneck areas won't help. Intel VTune and pmu-tools toplev.py implement TMAM automatically.
CPU-bound: high CPU utilization (near 100%), low I/O wait, performance scales with faster CPU. Check with 'top' - if CPU bars are full, you're CPU-bound. Memory-bound: CPU utilization moderate but performance limited by memory bandwidth/latency. Profile cache misses with perf - high L3 miss rate indicates memory-bound. Use Roofline Model - if kernels are on the sloped portion, they're memory-bound. I/O-bound: low CPU utilization, high I/O wait (wa% in top). Use 'iotop' to see disk I/O per process. Check with 'vmstat' - high 'wa' column indicates I/O wait. Intel VTune's Top-Down Microarchitecture Analysis Method classifies as Front-End Bound, Back-End Bound (Memory or Core), Bad Speculation, or Retiring.
Several approaches: 1) perf with lock events: 'perf lock record ./program' then 'perf lock report' shows contention statistics. 2) Valgrind DRD/Helgrind: detect lock order issues and contention. 3) Intel VTune Threading analysis: shows wait time per sync object. 4) Off-CPU analysis to see time spent waiting on locks. 5) Instrumented mutex libraries (e.g., pthread with PTHREAD_MUTEX_ERRORCHECK). 6) eBPF/BCC tools like lockstat. Look for: high lock hold times, threads waiting longer than holding, lock ordering issues, unnecessary locking (consider lock-free structures). Metrics: contention rate = lock_waiters / lock_acquisitions, wait time vs hold time ratio. High contention indicates need for finer-grained locking or lock-free algorithms.
Arithmetic Optimizations
3 questionsModern compilers produce identical code for all three. GCC/Clang at -O2 compile x * 2, x + x, and x << 1 to the same instruction: typically add %eax, %eax or lea (%rax,%rax), %eax (which computes address but is used for arithmetic). The LEA form is often preferred because it can write to a different register without modifying the source, enabling better instruction scheduling. Don't manually optimize to shifts - it reduces readability without improving performance. The compiler knows your target architecture's instruction latencies and will choose optimally. Focus on algorithmic improvements instead.
Compilers replace integer division by constants with multiplication by a magic number plus a shift. For x/3 on 32-bit: x * 0x55555556 >> 32 approximates division by 3. The magic number 0x55555556 (1431655766) / 2^32 = 0.33333... GCC transforms x/255 into (x * 0x80808081) >> 39. For division by powers of 2, unsigned division uses right shift: x/8 becomes x >> 3. This works because multiplication is much faster than division on modern CPUs (3-4 cycles vs 20-80 cycles). The technique is documented in Hacker's Delight and implemented in GCC since the 1990s.
No - let the compiler do it. Modern compilers automatically replace multiplication/division by powers of 2 with shifts. Manual shift optimizations can backfire: (1) On AMD Athlon, two shift units vs one multiply unit meant multiply was faster for complex expressions. (2) A benchmark showed naive multiply took 3.7s while manual shift-and-add took 21.3s - 7x slower. (3) Manual shifts reduce readability and portability. (4) For signed integers, right shift behavior is implementation-defined (arithmetic vs logical). (5) Modern CPUs have fast multipliers. Write clear code with multiplication; the compiler knows your target architecture better than you do.
Hot spot analysis
3 questionsStartup profiling techniques: 1) perf record from process start: 'perf record -g program' captures everything including initialization. 2) strace for syscall timing: 'strace -tt -T -f program 2>&1 | head -100' shows early syscalls with timestamps and durations. 3) LD_DEBUG for library loading: 'LD_DEBUG=libs program' shows dynamic library loading order and timing. 4) perf with fork following: 'perf record -F 99 -g --call-graph dwarf program' for complete initialization traces. 5) Application-specific: add timestamps at key initialization points. Analyze: dynamic linking time (consider static linking), configuration file parsing, network/database connection establishment, lazy vs eager initialization. Generate flame graph focusing on early samples to visualize startup hot spots.
Follow this cycle: 1) Profile to identify the current biggest bottleneck (don't guess). 2) Understand WHY it's slow - is it algorithmic, memory access patterns, branch mispredictions? 3) Form hypothesis and implement targeted fix. 4) Re-profile to verify improvement. 5) Compare before/after metrics quantitatively. 6) Repeat - the new hottest spot may be different. Key principles: always start with profiling data, optimize the bottleneck (not random code), measure impact of each change, stop when meeting performance targets or hitting diminishing returns. Use version control to track optimization attempts. Document what you tried and results - some 'obvious' optimizations may not help or may hurt.
Hot spots are code regions consuming disproportionate execution time. To identify them: 1) Run sampling profiler (perf record, VTune Hotspots, Instruments Time Profiler) on representative workload. 2) Sort functions by CPU time in the report. 3) Top functions are hot spots - focus optimization there. 4) Use call graph to understand how hot functions are reached. 5) Drill down with perf annotate to find hot instructions within functions. 6) Generate flame graphs for visual overview - widest boxes are hottest. Remember: some hot spots are fundamental to the algorithm and can't be eliminated. After optimizing, re-profile to verify improvement and identify new hot spots - optimization often shifts bottlenecks.
Profiling overhead and observer effect
3 questionsProfiling overhead is the performance cost of measurement itself. Sampling profilers typically have <1-5% overhead since they only periodically capture state. Instrumentation-based profilers can have 10-100x slowdown as they intercept every function call. Tracing tools vary: perf has minimal overhead, Valgrind can be 20-50x slower. Overhead affects both absolute timings and relative hotspot rankings. High overhead can cause measurement to dominate workload, making short functions appear faster than they are. Mitigation: use sampling for minimal overhead, adjust sampling rate (lower = less overhead but less precision), be aware that some tools (Cachegrind, Memcheck) fundamentally require high overhead. Report profiler used when sharing results.
Measurement bias occurs when the measurement environment systematically favors some configurations over others. Sources include: link order affecting memory layout, environment variable size changing stack alignment, ASLR randomization, filesystem cache state, CPU frequency scaling state, and other processes on the system. Research found that none of 133 papers in major systems conferences adequately addressed measurement bias. To avoid: 1) Randomize setup - vary link order, environment, etc. across runs. 2) Use consistent test environment - same hardware, OS, background load. 3) Run many trials with different random seeds. 4) Report variance alongside means. 5) Test on multiple machines if possible. 6) Use statistical tests that account for variance when comparing results.
The observer effect occurs when the act of measurement changes the behavior being measured. In performance analysis, profiling instrumentation can alter cache behavior, branch prediction, memory layout, and timing. Research by Mytkowicz et al. shows this can lead to incorrect conclusions - perturbation is non-monotonic and unpredictable with respect to instrumentation amount. Mitigation strategies: 1) Use hardware performance counters which have minimal overhead. 2) Compare results across multiple profiling tools. 3) Use setup randomization - vary environment variables, link order, stack alignment to detect environment-sensitive results. 4) Perform causal analysis to distinguish real effects from measurement artifacts. 5) Report measurement methodology with results.
Bit Manipulation Tricks
3 questions__builtin_popcount(x) returns the population count (number of 1 bits) in integer x. Use __builtin_popcountll for long long. On CPUs with the POPCNT instruction (Intel Nehalem+, AMD Barcelona+), GCC compiles this to a single instruction taking 1-3 cycles. Enable with -mpopcnt or -march=native. Without hardware support, GCC uses a software implementation with multiple operations. Check support: __builtin_cpu_supports("popcnt"). Common uses: counting set bits in bitmasks, computing Hamming distance (popcount(a^b)), and efficient set intersection size (popcount(a&b)).
__builtin_clz(x) counts leading zeros - the number of zero bits before the most significant 1 bit. __builtin_ctz(x) counts trailing zeros - zeros after the least significant 1 bit. Use __builtin_clzll/__builtin_ctzll for long long. WARNING: Results are undefined when x is 0. These map to BSR/BSF (x86), CLZ/CTZ (ARM), or LZCNT/TZCNT instructions. Applications: finding the highest/lowest set bit position, computing floor(log2(x)) as 31-clz(x) for 32-bit, efficient division by powers of 2, and implementing priority queues. C++20 provides portable std::countl_zero and std::countr_zero.
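Two small helpers built on these intrinsics (C++; callers must guarantee x != 0, as noted above):

    #include <cstdint>

    inline unsigned ilog2(std::uint32_t x) {               // floor(log2(x)), x != 0
        return 31u - __builtin_clz(x);
    }

    inline std::uint32_t lowest_set_bit(std::uint32_t x) { // isolates the least significant 1, x != 0
        return std::uint32_t(1) << __builtin_ctz(x);
    }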
Modern x86 provides several bit manipulation instruction sets: POPCNT (population count), LZCNT (leading zero count, part of ABM), BMI1 includes TZCNT, ANDN, BEXTR, BLSI (isolate lowest set bit), BLSMSK, BLSR. BMI2 adds BZHI, PDEP (parallel bit deposit), PEXT (parallel bit extract), SARX/SHLX/SHRX. Enable in GCC with -mbmi, -mbmi2, -mpopcnt, or -march=native. Note: BMI1 is not a subset of BMI2 - they are separate. Intel Haswell+ and AMD Excavator+ support both. PDEP/PEXT are useful for bit permutation but are microcoded and slow on AMD Zen 1 and Zen 2 (roughly 18+ cycles vs 3 on Intel); Zen 3 and later execute them at full speed.
Register Allocation
2 questionsRegister allocation assigns program variables to CPU registers. It is typically the most valuable compiler optimization because accessing registers is orders of magnitude faster than memory (1 cycle vs 100+ cycles for cache miss). When more variables are live than available registers, some must be spilled to memory and reloaded later - this spilling overhead can dominate execution time. Modern compilers use graph coloring algorithms: build an interference graph where edges connect simultaneously-live variables, then color the graph with R colors (registers). Variables assigned the same color share a register; uncolorable variables are spilled.
To reduce register pressure: (1) Keep variable live ranges short - declare variables close to use, let them go out of scope early. (2) Avoid excessive loop unrolling - unrolled loops create many temporaries. (3) Break complex expressions into separate statements to give the compiler flexibility. (4) Use local variables instead of globals (easier for register allocation). (5) Avoid pointer aliasing - use restrict keyword in C. (6) Reduce function parameter count - more parameters means more register/stack pressure at call sites. (7) Use smaller data types when appropriate (int vs long). (8) Confirm by inspecting the generated assembly: spilling shows up as extra stores and reloads to the stack inside hot loops.
Peephole Optimization
2 questionsPeephole optimization examines small windows (peepholes) of generated code, typically 2-3 instructions, looking for patterns that can be replaced with more efficient sequences. It runs as a late compiler pass on assembly or machine code. Common peephole optimizations: (1) Redundant load/store elimination - store followed by load from same location becomes a copy. (2) Strength reduction - multiply by 2 to left shift. (3) Algebraic simplifications - x * 1 to x, x + 0 to x. (4) Instruction combining - separate operations merged into one. (5) Redundant instruction elimination - consecutive pushes and pops of same register. The technique is simple but effective for cleaning up inefficiencies in generated code.
Common peephole patterns: (1) mov $0, %eax to xor %eax, %eax (smaller, faster on some CPUs). (2) Sequential mov through temp register eliminated. (3) add $1, %eax to inc %eax. (4) Push/pop pairs of same register removed. (5) Jump to next instruction removed. (6) Conditional jump over unconditional jump inverted. (7) Multiple consecutive stores to same location - keep only last. (8) Load after store to same address - use stored value directly. (9) Compare against zero after arithmetic - use flags already set. (10) Multiply by power of 2 to shift. Modern compilers like GCC and LLVM apply hundreds of such patterns automatically.
Branch misprediction profiling
2 questionsLBR is a CPU feature that records the last 32 branches taken by the processor, including source and destination addresses, whether the branch was predicted correctly, and cycle counts. To capture LBR data with perf: 'perf record -b -e cycles ./program'. The resulting data shows branch stacks with format: FROM -> TO (M/P for mispredicted/predicted, cycles). LBR provides better coverage than direct branch-misses sampling because it captures branch history without requiring additional performance counters. Use 'perf report --branch-history' to analyze branch patterns. LBR is available on Intel processors since Nehalem and is useful for identifying hot branch paths and misprediction patterns.
Use 'perf stat -e branches,branch-misses ./program' to count total branches and mispredictions. Calculate misprediction rate as branch-misses/branches. A rate above 1-2% may indicate optimization opportunities. For sampling-based analysis, use 'perf record -e branch-misses ./program' then 'perf report' to find functions with the most mispredictions. Intel Last Branch Records (LBR) provide more detailed branch analysis: use 'perf record -b -e cycles ./program' to capture branch stacks with 32 entries showing FROM/TO addresses and predicted/mispredicted flags. Modern CPUs rely heavily on branch prediction to keep pipelines full, so high misprediction rates can severely impact performance.
Comparative benchmarking
2 questionsUse A/B comparison methodology: 1) Establish baseline with multiple runs (minimum 10, ideally 30+) to capture variance. 2) Calculate mean, median, standard deviation, and confidence intervals. 3) Make optimization change. 4) Run same number of tests under identical conditions. 5) Use statistical tests - t-test for normally distributed data, Mann-Whitney for non-normal. 6) Check if confidence intervals overlap - non-overlapping indicates statistically significant difference. Tools: VTune has comparison mode, perf diff compares two perf.data files, differential flame graphs show changes visually. Important: control for system noise - use CPU pinning, disable frequency scaling, close other applications, run multiple times.
Use dedicated benchmark runners with fixed hardware for consistency. Steps: 1) Store benchmark baseline results in version control. 2) Run benchmarks on every commit/PR using tools like Google Benchmark, JMH, or Criterion. 3) Compare against baseline with statistical tests - reject if performance regresses beyond a threshold (e.g., 5% slower with 95% confidence). 4) Tools: Bencher or Conbench for tracking results over time; GitHub Actions for automation. 5) Pin CPU frequency, disable Turbo Boost on benchmark machines. 6) Run benchmarks multiple times (10+) for statistical reliability. 7) Alert/block merges on statistically significant regressions. 8) Store historical results for trend analysis. Consider dedicated benchmark machines to avoid cloud instance variability.
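A minimal Google Benchmark sketch (one of the tools named above); BM_Accumulate and the sizes are illustrative. Passing --benchmark_repetitions=10 (or more) at run time produces the repeated measurements needed for statistical comparison.

// Build (one common way): g++ -O2 bench.cpp -lbenchmark -lpthread -o bench
#include <benchmark/benchmark.h>
#include <numeric>
#include <vector>

static void BM_Accumulate(benchmark::State& state) {
    std::vector<int> v(state.range(0), 1);
    for (auto _ : state) {
        int sum = std::accumulate(v.begin(), v.end(), 0);
        benchmark::DoNotOptimize(sum);   // keep the result from being optimized away
    }
    state.SetItemsProcessed(state.iterations() * state.range(0));
}
BENCHMARK(BM_Accumulate)->Arg(1 << 10)->Arg(1 << 20);
BENCHMARK_MAIN();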
Instruction-level profiling
2 questionsIntel Processor Event-Based Sampling (PEBS) is a hardware mechanism that records precise instruction pointers when performance events occur, unlike regular sampling which has 'skid' due to interrupt latency. When a configured event (cache miss, branch misprediction) occurs, PEBS captures the instruction pointer and register state into a dedicated buffer with minimal overhead. This pinpoints the exact instruction causing the event, not an instruction several cycles later. Enable PEBS in perf with ':pp' or ':ppp' suffix on events, e.g., 'perf record -e cycles:pp ./program'. PEBS is essential for accurate attribution of cache misses and branch mispredictions to specific code locations.
After recording with 'perf record -g ./program', run 'perf annotate function_name' to see per-instruction samples. perf annotate displays assembly with percentage of time spent on each instruction. If compiled with debug info (-g flag), source code appears alongside assembly. For best results, compile with '-fno-omit-frame-pointer -ggdb'. In perf report interactive mode, press 'a' to annotate the selected function. Note that interrupt-based sampling introduces 'skid' - the recorded instruction pointer may be several dozen instructions away from where the counter actually overflowed due to out-of-order execution and pipeline depth. Intel PEBS (Processor Event-Based Sampling) provides more precise instruction attribution.
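A sketch of that workflow on a hypothetical hot loop; the build flags and perf commands follow the answers above, and the file, binary, and function names are illustrative:

// Build:   g++ -O2 -ggdb -fno-omit-frame-pointer hot.cpp -o hot
// Profile: perf record -e cycles:pp ./hot   # :pp requests precise (PEBS) samples
//          perf report                      # press 'a' on the hot symbol to annotate
#include <cstdio>
#include <cstddef>
#include <vector>

// A strided read whose cost typically shows up concentrated on the load
// instruction in the annotated assembly.
long hot_loop(const std::vector<long>& data, std::size_t stride) {
    long sum = 0;
    for (std::size_t i = 0; i < data.size(); i += stride)
        sum += data[i];
    return sum;
}

int main() {
    std::vector<long> data(1 << 22, 1);
    long total = 0;
    for (int r = 0; r < 500; ++r)
        total += hot_loop(data, 16);   // stride of 16 longs = 128 bytes
    std::printf("%ld\n", total);
}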
Loop Optimizations
2 questionsLoop interchange swaps the order of nested loops to improve memory access patterns. For row-major arrays in C, accessing a[i][j] with j in the inner loop is cache-friendly. Interchanging for(i)for(j) to for(j)for(i) when accessing a[j][i] converts strided access to sequential access. Benefits: improved spatial locality, better prefetching, potential for vectorization. Constraints: interchange is only legal when loop-carried dependencies are preserved. GCC enables loop interchange with -floop-interchange at -O3. Profile memory-bound code with cache miss counters to identify interchange opportunities.
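A before/after sketch of the interchange described above on a flat row-major array (the array size and scale factor are illustrative); the interchange is legal here because iterations are independent:

#include <cstddef>
#include <vector>

constexpr std::size_t N = 2048;

// Before: the inner loop varies the row index, so consecutive accesses are
// N*sizeof(double) bytes apart - a new cache line almost every iteration.
void scale_strided(std::vector<double>& a, double s) {
    for (std::size_t j = 0; j < N; ++j)        // column (outer)
        for (std::size_t i = 0; i < N; ++i)    // row (inner) -> strided access
            a[i * N + j] *= s;
}

// After interchange: the inner loop varies the column index, so accesses are
// sequential within each row - cache-friendly and easy to vectorize.
void scale_sequential(std::vector<double>& a, double s) {
    for (std::size_t i = 0; i < N; ++i)        // row (outer)
        for (std::size_t j = 0; j < N; ++j)    // column (inner) -> sequential access
            a[i * N + j] *= s;
}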
Loop tiling partitions the loop iteration space into smaller blocks (tiles) that fit in cache. For matrix multiplication: instead of streaming over entire rows/columns, process BxB blocks where B is chosen so three BxB blocks fit in the target cache level. This maximizes data reuse before eviction. Typical tile sizes for double precision: around 32x32 when targeting a 32 KB L1 (three 32x32 double blocks are about 24 KB); larger tiles such as 64x64 target L2. Benefits: reduces cache misses from O(N^3/L) to O(N^3/(L*sqrt(C))), where L is the cache line size and C the cache capacity (both in elements). Automatic tiling exists in GCC's Graphite framework (e.g., -floop-nest-optimize) and in LLVM's Polly, though support varies by version; for critical code, manual tiling often outperforms auto-tiling. Profile to find the optimal tile size for your hardware.
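A manually tiled matrix-multiply sketch (the tile size B = 32 is an assumed starting point to tune per machine); C is assumed to be zero-initialized by the caller:

#include <algorithm>
#include <cstddef>
#include <vector>

// Row-major N x N matrices stored flat; computes C += A * Bm tile by tile.
void matmul_tiled(const std::vector<double>& A, const std::vector<double>& Bm,
                  std::vector<double>& C, std::size_t N, std::size_t B = 32) {
    for (std::size_t ii = 0; ii < N; ii += B)
        for (std::size_t kk = 0; kk < N; kk += B)
            for (std::size_t jj = 0; jj < N; jj += B)
                // Work on one B x B block of each matrix so all three blocks
                // stay cache-resident while they are reused.
                for (std::size_t i = ii; i < std::min(ii + B, N); ++i)
                    for (std::size_t k = kk; k < std::min(kk + B, N); ++k) {
                        double aik = A[i * N + k];
                        for (std::size_t j = jj; j < std::min(jj + B, N); ++j)
                            C[i * N + j] += aik * Bm[k * N + j];
                    }
}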
Function Call Overhead Reduction
2 questionsFunction call overhead includes: (1) Saving caller-saved registers to stack. (2) Pushing arguments onto stack or loading into argument registers per ABI. (3) Executing call instruction (pushes return address, jumps). (4) Function prologue: saving callee-saved registers, allocating stack frame, setting up frame pointer. (5) Potential pipeline stall from jump. (6) Function epilogue: restoring registers, deallocating frame. (7) Executing ret instruction (pops return address, jumps back). (8) Restoring caller state. This overhead is significant for small, frequently-called functions - each call may cost 10-20 cycles even with optimized calling conventions.
Strategies to reduce function call overhead: (1) Inline small, frequently-called functions - use the inline keyword as a hint, or let the compiler auto-inline at -O2+. (2) Use tail call optimization for recursive functions - structure the recursion so the call is in tail position. (3) Prefer iteration over recursion for simple loops. (4) Use function pointers sparingly - indirect calls block inlining and are harder for the branch predictor. (5) Batch operations - one call processing N items beats N calls. (6) Use the restrict keyword to enable more aggressive optimization around calls. (7) Consider macros for truly trivial operations (but prefer inline functions for type safety). (8) Profile to identify hot call sites worth optimizing. A sketch of points (1), (4), and (5) follows.
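A sketch contrasting per-element indirect calls with one batched call to an inlinable helper (the function names and the transform are illustrative):

#include <cstddef>

using Transform = float (*)(float);

// N indirect calls: no inlining, call overhead paid on every element.
void apply_each(float* data, std::size_t n, Transform f) {
    for (std::size_t i = 0; i < n; ++i)
        data[i] = f(data[i]);
}

// One call processing N items; the small helper is visible to the compiler,
// so it can be folded into the loop (and often vectorized).
inline float scale_and_clamp(float x) {
    float y = x * 0.5f;
    return y > 1.0f ? 1.0f : y;
}

void apply_batched(float* data, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        data[i] = scale_and_clamp(data[i]);
}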
Branch Optimization
1 questionAvoid branch hints when: (1) The hint is wrong - compiler hints like __builtin_expect mainly affect code layout, so a wrong hint moves the common path out of line and hurts fetch and instruction-cache behavior. (2) Branch probabilities are close to 50/50 - let the hardware predictor learn. (3) Modern CPUs have sophisticated dynamic branch predictors (Intel Core series and later) that often do better than static hints. (4) The code path varies by input - runtime branch prediction adapts, static hints do not. (5) Hints are overused throughout the codebase - this dilutes their impact and adds maintenance burden. Profile your code first; only add hints to verified hot paths where the branch direction is consistently predictable (>90% one way).
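For the cases where a hint is justified, a minimal sketch (assuming C++20 for the [[unlikely]] attribute; the GCC/Clang __builtin_expect equivalent is noted in a comment) where the error path is rare by design:

#include <cstdio>
#include <cstdlib>

int parse_or_die(const char* s) {
    char* end = nullptr;
    long v = std::strtol(s, &end, 10);
    if (end == s) [[unlikely]] {           // error path: hinted cold, kept out of line
        std::fprintf(stderr, "bad input\n");
        std::exit(1);
    }
    return static_cast<int>(v);
}

// Pre-C++20 equivalent with the GCC/Clang builtin:
//   if (__builtin_expect(end == s, 0)) { ... }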