
Performance Optimization FAQ & Answers

133 expert Performance Optimization answers researched from official documentation. Every answer cites authoritative sources you can verify.

Heuristics and Rules of Thumb

72 questions
A

L1 instruction cache is typically 32KB on modern x86 CPUs (Intel and AMD). Hot code paths should fit within this to avoid instruction cache misses. Key implications: aggressive loop unrolling may hurt if it expands hot loop beyond 32KB; inline functions judiciously to avoid code bloat; keep related functions together for better I-cache locality. Measure instruction cache miss rate if performance is unexpectedly poor after optimization. Unrolling from 4x to 16x might improve data-path efficiency but hurt overall performance if code no longer fits in I-cache.

95% confidence
A

L3 cache access latency is 30-50 cycles on modern CPUs, approximately 12-20 nanoseconds. Specifically: Intel Kaby Lake: 42 cycles (16.8 ns at 2.5 GHz); Intel Haswell: 34 cycles (13 ns at 2.6 GHz); AMD Zen: ~35-40 cycles. L3 is shared across all cores and typically ranges from 8MB to 64MB on desktop/server CPUs. L3 latency varies with core count and NUMA topology. For multi-threaded applications, L3 hit rate determines cross-core data sharing efficiency. L3 misses go to DRAM with 100+ cycle penalty.

95% confidence
A

Float (32-bit) vs double (64-bit) performance: same latency and throughput per instruction on modern x86 for scalar operations; 2x SIMD throughput for float (8 floats vs 4 doubles in 256-bit AVX register); 2x memory bandwidth efficiency for float (half the bytes). Use float when: precision is sufficient (7 significant digits), memory bandwidth is bottleneck, or SIMD width matters. Use double when: numerical precision needed (15 digits), accumulating many values (less rounding error), or mixing with double-precision libraries. Memory-bound code sees ~2x speedup from float; compute-bound sees less difference.

95% confidence
A

Atomic operations cost 10-100+ cycles depending on contention and cache state: uncontended atomic on local L1 cache: 10-20 cycles; contended atomic requiring cache line bounce between cores: 50-200 cycles; atomic across NUMA nodes: 100-300+ cycles. Compare to regular load/store: 4-5 cycles from L1. Lock-free algorithms using CAS loops can waste unpredictable cycles under high contention. Rule of thumb: minimize atomic operations in hot paths, batch updates when possible, use thread-local accumulation with periodic synchronization, and consider cache line padding to prevent false sharing on atomic variables.

95% confidence
A

Typical TLB coverage with 4KB pages: L1 DTLB: 64-128 entries = 256-512KB; L2 STLB: 1024-2048 entries = 4-8MB. Working sets exceeding TLB coverage suffer page walk penalties. When TLB miss rate >1%, consider huge pages. With 2MB huge pages: same 1024 STLB entries cover 2GB. Signs of TLB pressure: high DTLB miss rate in profiler, performance cliff at specific working set sizes, random access patterns over large memory regions. Solutions: huge pages, improve memory locality, reduce working set, or use cache blocking to reuse TLB entries.

95% confidence
A

Keep no more than 10-12 live variables within a hot loop to avoid register spills on x86-64. Techniques to reduce register pressure: keep live ranges short by using variables close to their definitions, avoid excessive loop unrolling (which multiplies live variables), use restrict pointers to enable better register allocation, break complex expressions into simpler ones the compiler can optimize. Register spills inside hot loops cause significant performance degradation due to added memory traffic. When comparing unroll factors, measure performance to find the sweet spot between instruction-level parallelism and register pressure.

95% confidence
A

An IPC below 0.7 indicates significant room for optimization and limited use of processor capabilities. This typically signals memory-bound execution with frequent cache misses, pipeline stalls from data dependencies, or poor instruction-level parallelism. A CPI (cycles per instruction, the inverse) greater than 1 suggests stall-bound execution. To improve: reduce memory access latency through better cache utilization, eliminate data dependencies through loop unrolling or software pipelining, and ensure sufficient independent instructions for out-of-order execution to exploit.

95% confidence
A

Expected vectorization speedup = min(SIMD_width, arithmetic_intensity * memory_bandwidth / compute_rate). For compute-bound code: theoretical max is SIMD width (4x for SSE float, 8x for AVX float). For memory-bound code: speedup is limited by bandwidth, typically 1.5-3x regardless of SIMD width. Practical rule: expect 50-70% of theoretical SIMD width speedup for well-vectorized compute-bound code, and 1.5-2x for memory-bound code. Factors reducing speedup: unaligned access, gather/scatter operations, horizontal operations, and remainder handling. Measure actual speedup - it varies significantly by workload.

95% confidence
A

Use lookup tables when: computation takes >20 cycles and table fits in L1 cache (<=32KB), access pattern is unpredictable (no benefit from branch prediction), or function is called millions of times. Use computation when: table would exceed L2 cache (causing cache pollution), access pattern allows branch prediction to work well, or computation is simple (<10 cycles). Typical breakeven: 256-entry byte table (256 bytes) is almost always beneficial; 64K-entry table (64KB+) requires careful analysis. Memory latency (4 cycles L1) vs compute (1-20 cycles) determines winner.

95% confidence
A

Prevent false sharing by padding thread-local data to cache line boundaries (64 bytes on x86, 128 bytes on Apple M-series). Add 64 bytes of padding between variables accessed by different threads. In C: use '__attribute__((aligned(64)))' (or C11 'alignas(64)') or manually insert padding arrays. In Java 8+: use the '@Contended' annotation, which adds 128 bytes of padding. In Go: use 'cpu.CacheLinePad' between fields. The LMAX Disruptor uses 7 long fields (56 bytes) as padding before and after the cursor. While padding wastes memory, it can provide order-of-magnitude performance improvements in contended scenarios.
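
A minimal C11 sketch of the padding idea, assuming a hypothetical array of per-thread counters:

    #include <stdalign.h>
    #include <stdint.h>

    #define CACHE_LINE  64    /* 128 on Apple M-series */
    #define NUM_THREADS 8     /* illustrative */

    /* Give each thread's counter its own cache line so increments by
       different threads never invalidate each other's line. The alignas
       forces sizeof(struct padded_counter) up to a full 64 bytes. */
    struct padded_counter {
        alignas(CACHE_LINE) uint64_t value;
    };

    static struct padded_counter counters[NUM_THREADS];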

95% confidence
A

Plan for 1-3MB of LLC per core for working set sizing. Typical configurations: Intel desktop: 2MB per core (16MB shared / 8 cores); AMD Zen 3: 4MB per core (an 8-core CCX shares 32MB L3); server CPUs: 1.25-2.5MB per core. Note L3 is shared, so under load the effective per-core share decreases. For multi-threaded optimization: total_working_set should fit in total_L3 * 0.7 (leave room for the OS and other threads). For single-threaded code: a working set up to the full L3 is reasonable, but hot data still benefits from L2 blocking.

95% confidence
A

Main memory (DRAM) access latency is 150-300 cycles, approximately 60-100 nanoseconds on modern systems - roughly 40-60x slower than L1 cache. The latency includes: L3 miss detection (~40 cycles), memory controller processing, DRAM row activation and column access (CAS latency), and data transfer. DDR4 typical latency: 60-80 ns; DDR5: 70-90 ns (higher frequency but also higher CAS latency). Memory-bound code can see the processor stalling for hundreds of cycles per access. This 'memory wall' makes cache optimization crucial for performance.

95% confidence
A

Sequential access achieves 10-100x higher throughput than random access due to prefetching and cache line utilization. Typical measurements: sequential read: 30-50 GB/s (DDR4), 60-80 GB/s (DDR5); random read (8-byte): 0.5-2 GB/s (limited by latency, not bandwidth). The gap comes from: prefetchers work for sequential patterns (hiding 200+ cycle DRAM latency), each cache line (64 bytes) fully utilized in sequential vs partially in random, and memory controller optimizations for streaming. Design data structures for sequential access in hot paths wherever possible.

95% confidence
A

Prefer power-of-two array sizes for: fast modulo via bitwise AND (x & (size-1)), efficient cache blocking, SIMD alignment without remainders. Avoid power-of-two sizes when: accessing with a power-of-two stride (causes cache set conflicts), or multiple power-of-two arrays compete for the same cache sets. Mitigation: pad arrays to 'size + cache_line_size' to break the alignment. Example: walking a float array with a 1024-byte (256-element) stride maps every access to only 4 of a 32KB L1 cache's 64 sets, wasting 15/16 of the cache; adding a 16-element (64-byte) pad spreads accesses across all sets.
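
A small C illustration of the power-of-two modulo trick, using a hypothetical ring-buffer index:

    #include <stddef.h>

    #define RING_SIZE 1024u   /* must be a power of two */

    /* Wraps in a single AND; a runtime, non-power-of-two size would
       need a 20-80 cycle integer divide instead. */
    static inline size_t ring_wrap(size_t i) {
        return i & (RING_SIZE - 1u);
    }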

95% confidence
A

TLB (Translation Lookaside Buffer) miss penalty varies by level: L1 ITLB miss: 7-10 cycles (usually hidden by out-of-order execution); STLB (second-level TLB) miss triggering page walk: 20-100+ cycles depending on page table depth and cache residency of page table entries. A full 4-level page walk hitting DRAM at each level could cost 400+ cycles. Reduce TLB misses by: minimizing working set to fit in TLB coverage, using huge pages (2MB instead of 4KB - requires 512x fewer TLB entries), and improving memory access locality.

95% confidence
A

Use pool allocators when: allocating >1000 objects of the same size per second, object lifetime is predictable (bulk allocate/free), or allocation overhead shows up in profiling. Pool allocators reduce malloc overhead from 50-100 cycles to 10-20 cycles by eliminating search and fragmentation handling. Implementation: pre-allocate chunks of N objects, maintain free list with O(1) alloc/free. Common thresholds: objects <256 bytes benefit most; allocation frequency >10,000/second sees significant gains. Memory pools also improve cache locality since objects are contiguous.
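
A minimal sketch of such a free-list pool in C (fixed capacity, single-threaded, names hypothetical):

    #include <stdlib.h>

    typedef struct pool_node { struct pool_node *next; } pool_node;

    typedef struct {
        void      *chunk;   /* one pre-allocated block of n objects   */
        pool_node *free;    /* singly linked free list threaded in it */
    } pool_t;

    /* Pre-allocate n objects of obj_size bytes (obj_size >= sizeof(void*)). */
    static int pool_init(pool_t *p, size_t obj_size, size_t n) {
        char *mem = malloc(obj_size * n);
        if (!mem) return -1;
        p->chunk = mem;
        p->free  = NULL;
        for (size_t i = 0; i < n; i++) {          /* thread objects onto the free list */
            pool_node *node = (pool_node *)(mem + i * obj_size);
            node->next = p->free;
            p->free    = node;
        }
        return 0;
    }

    static void *pool_alloc(pool_t *p) {          /* O(1): pop the free-list head */
        pool_node *node = p->free;
        if (node) p->free = node->next;
        return node;
    }

    static void pool_free(pool_t *p, void *obj) { /* O(1): push back onto the list */
        pool_node *node = obj;
        node->next = p->free;
        p->free    = node;
    }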

95% confidence
A

Context switch cost is 1000-10000 cycles (0.5-5 microseconds) depending on working set size and cache pollution. Direct costs: ~1000-2000 cycles for register save/restore and TLB flush. Indirect costs: 5000-50000+ cycles to reload caches with new process working set. For threads sharing address space (no TLB flush needed): 1000-3000 cycles. This is why spinlocks can win for very short critical sections (<1000 cycles) - the context switch from blocking costs more than spinning. Minimize context switches in latency-sensitive code by using thread pinning and avoiding blocking operations.

95% confidence
A

Target L2 cache hit rate of 90% or higher. Hit rates below 70% suggest the working set is too large or access patterns cause thrashing. L2 measures how well your working set fits: low rates indicate too many unique data accesses or poor temporal locality. With L2 miss penalty of 20-40 cycles to L3 (or 100-300 cycles to DRAM for L3 misses), even small hit rate improvements matter significantly. Design data structures to fit working sets within L2 size (typically 256KB-1MB per core) and consider cache blocking for larger datasets.

95% confidence
A

Batch size should make overhead <10% of useful work. Examples: system calls with 500-cycle overhead: batch 5000+ cycles of work (10+ small operations); network packets with 10 microsecond latency: batch 100+ microseconds of data; database commits with 1ms overhead: batch 10+ ms of transactions. For parallel work distribution: minimum work per thread = parallel_overhead / target_overhead_fraction. If OpenMP fork/join costs 10 microseconds, each thread needs >100 microseconds of work to keep overhead under 10%. Measure both latency and throughput - batching trades latency for throughput.

95% confidence
A

Start with initial backoff of 1-4 iterations, double after each failed attempt, cap maximum at 1000-10000 iterations before falling back to blocking. Common implementation: initial=1, multiply by 2 each iteration, max_backoff=1000 cycles, then call yield() or switch to mutex. Exponential backoff reduces cache line bouncing and improves throughput under contention. Without backoff, test-and-set spinlocks cause severe cache coherence traffic. TTAS (test-and-test-and-set) with exponential backoff performs well even with many processors competing for the same lock.
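
A C11 sketch of a test-and-test-and-set lock with exponential backoff along these lines (the cap of 1024 and the yield fallback are illustrative):

    #include <stdatomic.h>
    #include <sched.h>        /* sched_yield */
    #include <immintrin.h>    /* _mm_pause   */

    static atomic_int lock = 0;

    static void spin_lock(void) {
        unsigned backoff = 1;                              /* initial backoff */
        for (;;) {
            /* test: spin on a plain load so the cache line is not bounced */
            while (atomic_load_explicit(&lock, memory_order_relaxed)) {
                for (unsigned i = 0; i < backoff; i++)
                    _mm_pause();
                if (backoff < 1024) backoff *= 2;          /* double, capped */
                else sched_yield();                        /* past the cap, back off to the scheduler */
            }
            /* test-and-set: only now pay for the atomic exchange */
            if (!atomic_exchange_explicit(&lock, 1, memory_order_acquire))
                return;
        }
    }

    static void spin_unlock(void) {
        atomic_store_explicit(&lock, 0, memory_order_release);
    }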

95% confidence
A

Auto-vectorization typically yields 5-10x speedup for embarrassingly parallel computations where you apply elementwise functions to arrays. The theoretical maximum is the SIMD width (4x for SSE floats, 8x for AVX floats, 16x for AVX-512 floats), but practical gains are limited by memory bandwidth, alignment overhead, and remainder loop handling. Memory-bound operations may see only 2-3x improvement regardless of SIMD width because the bottleneck shifts to memory bandwidth rather than compute throughput.

95% confidence
A

The ridge point is calculated as: Peak_Performance(FLOP/s) / Peak_Bandwidth(bytes/s). This gives the minimum operational intensity (FLOP/byte) needed to achieve peak compute performance. For example: NVIDIA A100 with 19,500 GFLOPS and 1,555 GB/s bandwidth has ridge point of 19500/1555 = 12.5 FLOP/byte. Code with operational intensity below the ridge point is memory-bound; above it is compute-bound. Typical ridge points: CPU ~1-4 FLOP/byte, GPU ~10-50 FLOP/byte. Optimize memory access for memory-bound kernels; optimize compute for compute-bound.

95% confidence
A

Modern out-of-order CPUs can hide latency for approximately 100-200 instructions in the reorder buffer (ROB), which translates to roughly 50-100 cycles of work. Intel Skylake has a 224-entry ROB; AMD Zen 3 has 256 entries. This means out-of-order execution can effectively hide L1 misses that hit in L2 (~12 cycles) but struggles with DRAM latency (200+ cycles). To help the CPU hide memory latency: ensure there are enough independent instructions between loads and their uses, use software prefetching for predictable access patterns, and unroll loops to expose more instruction-level parallelism.

95% confidence
A

System call overhead is 100-1000 cycles on modern Linux (1000-5000 cycles on Windows). Breakdown: mode switch (user to kernel): 50-150 cycles; syscall dispatch and validation: 100-300 cycles; actual work varies by call; return (kernel to user): 50-150 cycles. Mitigation: batch operations (one write of 1MB vs 1000 writes of 1KB), use memory-mapped I/O to avoid read/write syscalls, use the vDSO for time queries (gettimeofday), buffer I/O in userspace. KPTI (the Meltdown page-table-isolation mitigation) increased syscall cost by 100-300 cycles due to page table switching.

95% confidence
A

Start with an unroll factor of 4x for most loops. This provides a good balance between reducing loop overhead and avoiding instruction cache pressure. For SIMD-optimized code, unroll by the SIMD width or multiples of it (e.g., 4x for SSE with floats, 8x for AVX with floats, 16x for AVX-512). Factors of 2x or 4x typically see speed improvements, while going beyond 8x often shows diminishing returns and can hurt performance due to increased code size and instruction cache misses.

95% confidence
A

The approximate latency ratio is L1:L2:L3:DRAM = 1:3:10:60 (in terms of L1 as baseline). Concrete numbers at 3GHz: L1 = 4 cycles (1.3 ns), L2 = 12 cycles (4 ns), L3 = 40 cycles (13 ns), DRAM = 240 cycles (80 ns). This ~60x difference between L1 and DRAM is the 'memory wall'. Bandwidth ratio is similar: L1 can deliver ~1-2 TB/s, L2 ~500 GB/s, L3 ~200 GB/s, DRAM ~50-100 GB/s. Understanding this hierarchy is crucial for cache optimization - each level miss costs roughly 3-10x more than the previous level hit.

95% confidence
A

A function call costs approximately 15-25 cycles on modern CPUs, equivalent to 3-4 simple assignments: call instruction (~1-2 cycles), stack frame setup (push rbp, mov rbp,rsp: ~2 cycles), parameter passing (varies), return (pop, ret: ~2-3 cycles), plus potential pipeline disruption. Indirect function calls (through pointers/vtables) cost 3-4x more due to branch prediction miss potential. For small functions called millions of times, this overhead can dominate. Inline functions or link-time optimization (LTO) eliminates this overhead. Profile before optimizing - overhead only matters for very small, frequently-called functions.

95% confidence
A

Software pipelining (overlapping iterations) provides 15-30% speedup on in-order cores and smaller arrays. Tests show: for arrays fitting L2 cache, software pipelining gives 18.8-28.8% speedup; unroll-and-interleave (UAI) gives 14.2-21.8% speedup on in-order cores. On out-of-order cores, these techniques provide minimal benefit because the hardware already performs dynamic instruction scheduling. Software pipelining works by splitting loop work into phases (load, compute, store) and overlapping phases from different iterations to hide latencies and enable dual-issue on simple processors.

95% confidence
A

SIMD vectorization typically becomes worthwhile when processing at least 4x the SIMD width elements, so: SSE (128-bit): minimum 16 floats or 16 integers; AVX (256-bit): minimum 32 floats or 32 integers; AVX-512 (512-bit): minimum 64 floats or 64 integers. Below these thresholds, the overhead of setup, remainder handling, and potential alignment adjustments may exceed the parallel processing gains. For very small arrays with unknown size at compile time, the scalar version may actually be faster due to branch overhead for remainder loops.

95% confidence
A

Order struct fields by: 1) Access frequency (hot fields first), 2) Access pattern (fields accessed together should be adjacent), 3) Size descending (reduces padding). Keep hot fields within first 64 bytes (one cache line). Group read-only fields separately from read-write to prevent false sharing. For arrays of structs vs struct of arrays (AoS vs SoA): use AoS when accessing all fields per element, SoA when accessing one field across all elements. Typical optimization: place most-accessed 2-3 fields at struct start, ensuring they fit in first cache line load.
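
An illustrative C layout following these rules (field names hypothetical):

    #include <stdint.h>

    /* Hot, co-accessed fields first so they share the first 64-byte line;
       cold bookkeeping afterwards; sizes roughly descending to limit padding. */
    struct order {
        /* hot: touched on every lookup/update */
        uint64_t id;
        double   price;
        uint32_t quantity;
        uint32_t flags;
        /* cold: touched only for auditing/debugging */
        uint64_t created_ns;
        char     note[40];
    };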

95% confidence
A

Indirect function calls (through pointers or vtables) are typically 2-4x slower than direct calls. One benchmark showed indirect calls running 3.4x slower. The performance hit comes from: inability to inline, branch prediction miss on first call to new target, and additional memory load to fetch function address. Virtual function calls in C++ fall into this category. Mitigation: devirtualization through final/sealed classes, link-time optimization (LTO), profile-guided optimization (PGO), or redesigning hot paths to avoid polymorphism. Consider templates or CRTP for static polymorphism in performance-critical code.

95% confidence
A

Use static scheduling when: iterations have uniform work (e.g., array operations), and you want minimum overhead. Use dynamic scheduling when: iteration work varies significantly (e.g., sparse matrix, adaptive algorithms), at the cost of higher overhead from runtime distribution. Use guided scheduling for: load balancing with lower overhead than dynamic - starts with large chunks, shrinks toward end. Specific guidance: static has lowest overhead (0.5 microseconds), dynamic has highest (2-5 microseconds), guided is intermediate. Default chunk size for static: iterations/num_threads; for dynamic: 1 (balance) or 64-256 (reduce overhead).
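
A hedged OpenMP/C sketch of the three clauses; the chunk sizes are the starting points above, and the per-element functions are hypothetical placeholders:

    #include <omp.h>

    double expensive_irregular(double x);   /* hypothetical: cost varies widely per element */
    double moderately_variable(double x);   /* hypothetical: mildly varying cost            */

    void process(double *a, int n) {
        /* uniform work: static scheduling has the lowest overhead */
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < n; i++) a[i] *= 2.0;

        /* highly variable work: dynamic, with a chunk size to cut overhead */
        #pragma omp parallel for schedule(dynamic, 64)
        for (int i = 0; i < n; i++) a[i] = expensive_irregular(a[i]);

        /* in between: guided starts with large chunks and shrinks them */
        #pragma omp parallel for schedule(guided)
        for (int i = 0; i < n; i++) a[i] = moderately_variable(a[i]);
    }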

95% confidence
A

Hardware prefetchers typically detect strides up to 2KB-4KB and handle 8-16 concurrent streams. Intel stride prefetcher detects forward/backward strides up to 2KB; stream prefetcher handles up to 32 streams within 4KB page. For optimal prefetcher effectiveness: use strides <2KB, access no more than 8-16 distinct arrays simultaneously in hot loops, and maintain consistent access patterns (prefetchers take time to learn). When strides exceed hardware limits or patterns are irregular, use software prefetching with explicit _mm_prefetch() instructions at appropriate distances.

95% confidence
A

When branch prediction accuracy falls below 75%, branchless code (using conditional moves, SIMD masks, or arithmetic) is typically faster than branching code. At 75% prediction accuracy, the cost of mispredictions roughly equals the cost of conditional move data dependencies. Above 75% accuracy, keep the branch. Below 75%, convert to branchless. This 75% threshold is used by compilers as a heuristic for deciding whether to emit cmov instructions. Note: if data comes from slow memory (L3 or DRAM), branches may still win because early speculative loads hide latency.

95% confidence
A

x86-64 provides 16 general-purpose 64-bit registers (RAX-RDX, RSI, RDI, RBP, RSP, R8-R15), but practically only 14-15 are available for computation (RSP is the stack pointer, RBP often the frame pointer). This is double x86-32's 8 registers. Additionally, there are 16 XMM/YMM vector registers for SIMD (32 ZMM registers when AVX-512 is available). When your algorithm needs more than 12-14 variables live simultaneously, expect register spills to the stack. Loop unrolling increases register pressure - balance unroll factor against available registers to avoid costly spills inside hot loops.

95% confidence
A

Prefetch distance = ceiling(memory_latency_cycles / loop_iteration_cycles). For example, if memory latency is 200 cycles and one loop iteration takes 25 cycles, prefetch 200/25 = 8 iterations ahead. For L1 prefetch from L2, use shorter distances (e.g., 8 iterations); for L2 prefetch from memory, use longer distances (e.g., 64 iterations). Intel compilers with -O2 or higher automatically set prefetch level 3. Tuning can yield 35% or more bandwidth improvement - one test showed performance increase from 129 GB/s to 175 GB/s.
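
A C sketch using the worked example's distance of 8 iterations (treat PF_DIST as a tunable; in practice you would unroll so only one prefetch is issued per 64-byte line):

    #include <xmmintrin.h>   /* _mm_prefetch */

    #define PF_DIST 8        /* ~ memory_latency_cycles / loop_iteration_cycles */

    void scale(float *dst, const float *src, int n, float k) {
        for (int i = 0; i < n; i++) {
            /* prefetch is a hint and never faults, so running a few
               elements past the end of src is harmless */
            _mm_prefetch((const char *)&src[i + PF_DIST], _MM_HINT_T0);
            dst[i] = src[i] * k;
        }
    }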

95% confidence
A

Target L1 data cache hit rate of 95% or higher for well-optimized code. Hit rates above 80% are acceptable for general code. Below 60% indicates serious access pattern problems requiring investigation. L1 hit latency is 1-4 cycles and the miss penalty to L2 is 10-12 cycles (far more when the miss falls through to DRAM), so small hit rate changes matter: with a 1-cycle hit and a 100-cycle effective miss penalty, a 97% hit rate gives a 4-cycle average access while 99% gives 2 cycles - a 2x improvement from a 2-point increase in hit rate. Improve L1 hit rate through better spatial locality, cache blocking, and prefetching.

95% confidence
A

L1 cache access latency is 4-5 cycles on modern Intel/AMD CPUs, which translates to approximately 1-2 nanoseconds at typical clock speeds. Specifically: Intel Kaby Lake: 5 cycles / 2.5 GHz = 2 ns; Intel Haswell: 5 cycles / 2.6 GHz = 1.9 ns; AMD Zen: 4 cycles. L1 cache is the fastest memory level after registers. L1 data cache is typically 32KB per core (8-way associative), and L1 instruction cache is also typically 32KB per core. Optimizing for L1 hit rate provides the largest performance gains.

95% confidence
A

Use huge pages (2MB on x86) when: working set exceeds 4MB (1024 4KB pages), TLB miss rate is high in profiling, or memory access is scattered across large address ranges. Huge pages reduce TLB entries needed by 512x: 20MB requires only 10 huge pages vs 5120 standard pages. Best candidates: large arrays, memory-mapped files, databases, HPC applications. Enable with: Linux mmap() with MAP_HUGETLB, or transparent huge pages (THP). Benchmark first - huge pages can hurt performance for sparse access patterns due to internal fragmentation and longer page fault times.
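
A hedged Linux/C sketch of the mmap path (assumes 2MB pages have been reserved via vm.nr_hugepages; falls back to normal pages when they have not):

    #define _GNU_SOURCE
    #include <stddef.h>
    #include <sys/mman.h>

    /* Try to back a large buffer with 2MB huge pages; fall back to 4KB pages. */
    static void *alloc_huge(size_t bytes) {
        void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED)                   /* no huge pages available */
            p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        return p == MAP_FAILED ? NULL : p;
    }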

95% confidence
A

Parallel merge sort becomes beneficial when array size exceeds 10,000-100,000 elements, depending on hardware and element size. Below this threshold, spawn/join overhead exceeds parallel speedup. Rule of thumb: switch to sequential sort when subarray falls below 1000-5000 elements. This hybrid approach (parallel at top levels, sequential at leaves) provides best performance. Additional considerations: for 2 cores, threshold ~50,000; for 8 cores, threshold ~20,000; for 32+ cores, threshold can be as low as 5,000-10,000 elements. Always benchmark on target hardware.

95% confidence
A

Cache line size is 64 bytes on all modern x86/x86-64 processors (Intel and AMD since ~2005). This means memory is fetched and cached in 64-byte aligned chunks. Key implications: data structures should be sized/aligned to 64-byte boundaries for optimal access; arrays of 8-byte elements have 8 elements per cache line; false sharing occurs when different threads access different data within the same 64-byte line. Apple M-series uses 128-byte cache lines. Always pad data to avoid false sharing and align hot data to cache line boundaries.

95% confidence
A

malloc() overhead ranges from 50-100 cycles for small allocations to 1000+ cycles for large allocations requiring system calls. Each allocation involves: acquiring a global lock (in traditional allocators), searching free lists, potential memory fragmentation handling, and bookkeeping. Allocations over 64KB (varies by allocator) may trigger mmap() system calls costing thousands of cycles. Mitigation strategies: use object pools/arenas for same-size allocations, pre-allocate during initialization, use thread-local allocators (tcmalloc, jemalloc) to avoid lock contention, or use stack allocation for short-lived data.

95% confidence
A

As a starting heuristic, use OpenMP parallel loops when iteration count exceeds 1000 iterations with simple bodies, or 100 iterations with moderately complex bodies (10-100 microseconds per iteration). For array operations, parallelize when array size exceeds 100,000 elements for simple operations or 10,000 elements for complex operations. Below these thresholds, the overhead of thread management often exceeds parallel speedup. Move parallelization to outer loops when possible to reduce fork/join frequency - one study showed 'code was spending nearly half the time doing OpenMP overhead work' with inner loop parallelization.

95% confidence
A

Modern CPUs support 10-20 outstanding memory requests per core via Line Fill Buffers (LFBs) and Miss Status Handling Registers (MSHRs). Intel Skylake: 12 L1D LFBs, 16 L2 superqueue entries; AMD Zen: 22 concurrent L1D misses. This limits single-core bandwidth to: concurrent_requests * cache_line_size / memory_latency. Example: 12 requests * 64 bytes / 80ns = 9.6 GB/s max single-core bandwidth. To achieve higher bandwidth, use multiple threads or software prefetching to keep memory requests in flight. Memory bandwidth scaling often requires 4-8 cores to saturate memory controller.

95% confidence
A

A good starting prefetch distance for L1 (prefetching from L2 into L1) is 8 iterations ahead: distance = load_latency_to_cover / cycles_per_iteration. For example, if each iteration takes 7 cycles and the latency to hide is 56 cycles, use 56/7 = 8 iterations ahead. Prefetching too early wastes cache space; too late fails to hide the latency. Use compiler pragmas like '#pragma prefetch var:hint:distance' for manual tuning.

95% confidence
A

Vectorized loops should process at least 4x the vector width iterations to amortize setup and cleanup overhead. For AVX2 processing 8 floats per iteration: minimum 32 iterations; for AVX-512 processing 16 floats: minimum 64 iterations. Setup costs include: loading constants into vector registers, handling alignment, setting up masks. Cleanup handles remainder elements. For loops below threshold, consider: scalar fallback, using narrower vectors (SSE instead of AVX), or accumulating small arrays before vectorized processing. Compile-time-known small counts may benefit from full unrolling instead.

95% confidence
A

Code is memory-bound when operational intensity is below the ridge point (typically <1-4 FLOP/byte on CPUs, <10-15 FLOP/byte on GPUs). Examples: DAXPY (y=ax+y) has intensity of 2n FLOP / 24n bytes = 0.083 FLOP/byte - heavily memory-bound. SpMV (sparse matrix-vector) typically has 0.17-0.25 FLOP/byte - memory-bound. Dense matrix multiplication can achieve 2n^3 FLOP / 3n^2*8 bytes for large n, approaching 100+ FLOP/byte - compute-bound. Low-intensity kernels benefit from memory optimizations; high-intensity from compute optimizations.

95% confidence
A

Integer division is 10-30x slower than multiplication: integer multiply latency is 3-4 cycles with throughput of 1 per cycle; integer divide latency is 20-80 cycles with throughput around 0.01-0.04 per cycle (26-90 cycles between independent divisions). Optimization: replace 'x/const' with multiplication by a magic number (compilers do this automatically for constants); replace 'x % power_of_2' with 'x & (power_of_2 - 1)'; for runtime divisors, consider libdivide or caching the magic multiplier. Integer modulo has the same cost as division. Impact: a tight loop with division can be 10x slower than the equivalent multiplication.

95% confidence
A

Well-optimized multi-threaded code should achieve 75-85% of peak theoretical memory bandwidth, with 80% being a practical target. Single-threaded code typically achieves 40-60% of peak due to memory-level parallelism limitations. Measured throughput is always below theoretical maximum due to memory controller inefficiencies, DRAM refresh cycles, rank-to-rank stalls, and read-to-write turnaround penalties. If achieving less than 60% of peak bandwidth on memory-bound code, investigate poor spatial locality, cache associativity conflicts, or insufficient prefetching.

95% confidence
A

The .NET JIT compiler has a default inline threshold of 32 bytes of IL (Intermediate Language) code. Methods larger than 32 bytes IL are generally not inlined. The rationale is that for larger methods, the function call overhead becomes negligible compared to method execution time. This is a heuristic that can fail for hot methods just over the threshold. Workarounds include: using [MethodImpl(MethodImplOptions.AggressiveInlining)] attribute to hint for inlining, or manually breaking large methods into smaller ones.

95% confidence
A

OpenMP fork/join overhead is typically 1-10 microseconds per parallel region entry, depending on implementation and number of threads. For loops, this means each iteration should do at least 10-100 microseconds of work to amortize parallelization overhead. With smaller tasks, the parallel version may be slower than sequential. Rule of thumb: parallelize when total loop work exceeds 100 microseconds and individual iterations take at least 1 microsecond. For finer-grained parallelism, use static scheduling to minimize runtime overhead compared to dynamic scheduling.

95% confidence
A

Denormal (subnormal) floating-point operations can be 10-100x slower than normal operations on x86 CPUs. When results become denormal (very small numbers near zero), the CPU falls back to microcode, taking 50-200 cycles instead of 4-5 cycles. Detection: unexpected performance cliffs when values approach zero. Solutions: enable Flush-To-Zero (FTZ) and Denormals-Are-Zero (DAZ) modes via MXCSR register (_MM_SET_FLUSH_ZERO_MODE, _MM_SET_DENORMALS_ZERO_MODE), add small epsilon to prevent denormals, or redesign algorithm to avoid near-zero intermediate values.
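
A C sketch of enabling the two modes with the intrinsics mentioned above (per thread; this deliberately gives up IEEE-754 behavior for tiny values):

    #include <xmmintrin.h>   /* _MM_SET_FLUSH_ZERO_MODE     */
    #include <pmmintrin.h>   /* _MM_SET_DENORMALS_ZERO_MODE */

    static void disable_denormals(void) {
        _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);          /* denormal results -> 0 */
        _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);  /* denormal inputs  -> 0 */
    }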

95% confidence
A

Use spinlocks when: critical section is less than 1000 cycles (~0.3-0.5 microseconds), threads are unlikely to be preempted, and running on multicore system. Use mutexes when: critical section exceeds 1000 cycles, high contention is expected, or running in userspace where preemption is unpredictable. Threshold-based hybrids (like adaptive mutexes) spin for 1000-10000 CPU cycles before blocking. Key insight: spinlocks waste CPU when waiting, but avoid ~1000+ cycle context switch overhead. In userspace, pure spinlocks are usually wrong - use adaptive mutexes that spin briefly then sleep.

95% confidence
A

A good starting prefetch distance for L2 (prefetching from main memory to L2) is 64 iterations ahead. This accounts for DRAM latency of 200-400 cycles divided by typical loop iteration time. For a loop taking 5 cycles per iteration with 300-cycle memory latency, use 300/5 = 60, rounded to 64. Memory prefetch distances must be longer than L1 distances because DRAM latency is 10-20x higher than L2 latency. Benchmark with values from 32 to 128 to find optimal for your workload.

95% confidence
A

Theoretical peak bandwidth: DDR4-3200: 25.6 GB/s per channel, ~50 GB/s dual-channel; DDR5-5600: 44.8 GB/s per channel, ~90 GB/s dual-channel. Achievable bandwidth is 75-85% of peak: DDR4 dual-channel: 40-45 GB/s achievable; DDR5 dual-channel: 70-80 GB/s achievable. DDR5 doubles channels per DIMM (2 32-bit vs 1 64-bit) improving bank-level parallelism. For optimization planning, assume 40 GB/s for DDR4 systems, 70 GB/s for DDR5. Memory-bound code scales linearly with bandwidth, so DDR5 provides ~1.7x speedup for pure streaming workloads.

95% confidence
A

The 2:1 cache rule states: miss rate of a direct-mapped cache of size N equals the miss rate of a 2-way set-associative cache of size N/2. This means doubling associativity is roughly equivalent to doubling cache size for reducing conflict misses. Practical implications: 8-way set associativity is nearly as effective as fully associative for most workloads; beyond 8-way, diminishing returns set in. When analyzing cache performance, increasing associativity helps with conflict misses but not capacity misses. For software optimization, focus on reducing working set size rather than worrying about associativity.

95% confidence
A

Use conditional move (cmov) or SIMD min/max instructions when branches would be unpredictable. Branch-free min: 'min = y ^ ((x ^ y) & -(x < y))' or compiler intrinsics '_mm_min_ps'. Cost: cmov is 1-2 cycles vs potential 15+ cycles for mispredicted branch. However, cmov creates data dependency while branch allows speculative execution. Rule: use branchless when prediction accuracy <75%, or always for SIMD code (no branching within vector). Modern compilers often generate cmov for simple ternary operators at -O2; use '-fno-if-conversion' to force branches if needed.
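
The three forms in C for integer and float minimum (the bit trick is the expression quoted above):

    #include <xmmintrin.h>   /* _mm_min_ps */

    /* Mask trick: when x < y the mask is all ones and the XORs select x. */
    static inline int min_bits(int x, int y) {
        return y ^ ((x ^ y) & -(x < y));
    }

    /* A plain ternary usually becomes cmov at -O2; check the assembly. */
    static inline int min_ternary(int x, int y) { return x < y ? x : y; }

    /* SIMD: four branch-free minimums at once. */
    static inline __m128 min4(__m128 a, __m128 b) { return _mm_min_ps(a, b); }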

95% confidence
A

Branch misprediction costs 10-30 cycles on modern x86-64 processors, depending on pipeline depth. AMD Zen 2 has a 19-cycle pipeline, so misprediction costs approximately 19 cycles. Intel processors with deeper pipelines may cost up to 20-25 cycles. This penalty equals the number of pipeline stages from fetch to execute that must be flushed and refilled. For loops with unpredictable branches, this can multiply running time significantly - converting to branchless code can reduce per-element time from 14 cycles to 7 cycles in some cases.

95% confidence
A

L2 cache access latency is 10-14 cycles on modern CPUs, approximately 4-5 nanoseconds. Specifically: Intel Kaby Lake: 12 cycles (4.8 ns at 2.5 GHz); Intel Haswell: 11 cycles (4.2 ns at 2.6 GHz); AMD Zen: ~12 cycles. L2 is about 3-4x slower than L1 but holds 8-32x more data (typically 256KB-1MB per core). L2 cache is typically unified (both instructions and data) and 4-8 way set associative. For algorithms with working sets between 32KB and 256KB, L2 hit rate is the critical performance metric.

95% confidence
A

Use Array of Structures (AoS) when: accessing all/most fields of each element together, iterating through elements with good spatial locality, or element-wise operations are common. Use Structure of Arrays (SoA) when: accessing only 1-2 fields across many elements, SIMD vectorization is important (SoA enables efficient vector loads), or cache utilization of accessed fields matters more than element locality. Performance difference can be 2-10x depending on access pattern. Consider hybrid AoSoA (Array of Structures of Arrays) for balanced access patterns with SIMD requirements.
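
The two layouts in C for a hypothetical particle record:

    #define N 100000

    /* AoS: each update touches x, y, z and mass of one particle together. */
    struct particle { float x, y, z, mass; };
    struct particle particles_aos[N];

    /* SoA: a loop over just x (or just mass) reads contiguous memory and
       maps directly onto SIMD loads. */
    struct particles_soa {
        float x[N];
        float y[N];
        float z[N];
        float mass[N];
    } particles_soa;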

95% confidence
A

Modern reorder buffer (ROB) sizes: Intel Skylake/Ice Lake: 224-352 entries; AMD Zen 3/4: 256 entries; Apple M1/M2: 600+ entries. The ROB limits how far ahead the CPU can execute speculatively. For hiding latency, ensure there are enough independent instructions to fill the ROB before hitting a long-latency operation. Example: with 200-entry ROB and 4-wide issue, ~50 cycles of independent work can be found. If your loop has only 20 instructions and one memory access per iteration, you need the loop running 10+ iterations ahead to fill the window.

95% confidence
A

GCC's default inline limit is 600 pseudo-instructions for functions explicitly marked inline (controlled by the -finline-limit option and the underlying max-inline-insns-* parameters). For auto-inlining at -O2/-O3, functions up to about 40-50 instructions may be inlined based on various heuristics. The 'pseudo-instruction' count is an abstract measure that may change between GCC versions and does not directly map to assembly instructions. Functions called only once are more aggressively inlined regardless of size. Use -Winline to get warnings when inline requests are denied due to size or other factors.

95% confidence
A

Target IPC of 2-4 for general-purpose code on modern superscalar CPUs. Modern wide-issue processors can achieve IPC of 4-6 in ideal conditions with deep pipelines and superscalar execution. Apple M-series chips can exceed IPC of 3 in floating-point intensive tasks. An IPC below 0.7 indicates significant optimization opportunity - the code is likely memory-bound or suffering from pipeline stalls. Memory-bound code typically shows IPC of 0.5-1.0, while compute-bound well-optimized code should achieve IPC of 2.0 or higher.

95% confidence
A

For L1 cache blocking, use approximately sqrt(L1_size/3) elements. For a typical 32KB L1 data cache with 4-byte floats, this gives sqrt(32768/3/4) = approximately 52 elements, or roughly 50-100 elements per dimension for 2D blocking. The factor of 3 accounts for multiple arrays (input, output, temporary) that need to fit simultaneously. Always ensure the total working set of your blocked computation fits within L1 with room for other data the processor needs.
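
A minimal C tiling sketch; BLK = 48 is near the sqrt(L1/3) estimate above (three 48x48 float tiles are about 27KB, inside a 32KB L1), and the remainder case is omitted for brevity:

    #define BLK 48   /* ~sqrt(32768 / 3 / sizeof(float)), rounded down */

    /* C += A * B for n x n row-major matrices, assuming n % BLK == 0. */
    void matmul_blocked(int n, const float *A, const float *B, float *C) {
        for (int ii = 0; ii < n; ii += BLK)
            for (int kk = 0; kk < n; kk += BLK)
                for (int jj = 0; jj < n; jj += BLK)
                    /* one BLK x BLK tile of each matrix is live here */
                    for (int i = ii; i < ii + BLK; i++)
                        for (int k = kk; k < kk + BLK; k++) {
                            float a = A[i * n + k];
                            for (int j = jj; j < jj + BLK; j++)
                                C[i * n + j] += a * B[k * n + j];
                        }
    }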

95% confidence
A

Throughput (instructions per cycle) on modern x86: simple ALU (add, sub, logical): 4-6 per cycle; complex ALU (multiply): 1-2 per cycle; integer divide: 0.03-0.1 per cycle (10-30 cycles latency); FP add/multiply: 2 per cycle; FP divide: 0.2-0.5 per cycle; loads: 2-3 per cycle (L1 hit); stores: 1-2 per cycle. These are throughput limits - actual IPC depends on dependencies. Key insight: division is 10-100x more expensive than multiplication; replace 'x/const' with 'x * (1/const)' where possible. Measure instruction mix to understand bottlenecks.

95% confidence
A

Use the widest SIMD available that doesn't cause frequency throttling or portability issues: AVX-512 (512-bit): use when sustained compute-heavy, accept ~10-15% frequency reduction on some Intel CPUs; AVX2 (256-bit): best default choice, supported since Haswell 2013, no frequency penalty; SSE (128-bit): use for maximum compatibility or when code has many scalar operations mixed in. Process data in multiples of SIMD width to avoid remainder loops. For portable code, compile with multiple paths and runtime dispatch based on CPUID.
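
A sketch of runtime dispatch using the GCC/Clang __builtin_cpu_supports builtin; the kernel variants are hypothetical and assumed to be compiled elsewhere with the matching -m flags:

    void kernel_avx2(float *dst, const float *src, int n);   /* built with -mavx2 */
    void kernel_sse2(float *dst, const float *src, int n);   /* baseline x86-64   */

    void kernel(float *dst, const float *src, int n) {
        if (__builtin_cpu_supports("avx2"))
            kernel_avx2(dst, src, n);
        else
            kernel_sse2(dst, src, n);
    }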

95% confidence
A

For L2 cache blocking, use approximately sqrt(L2_size/3) elements. For a typical 256KB L2 cache with 4-byte floats, this gives sqrt(262144/3/4) = approximately 148 elements, or roughly 128-256 elements per dimension for 2D blocking. For 1MB L2 cache, target around 300 elements. L2 blocking is typically used as an outer loop around L1 blocking to create a two-level tiled algorithm that maximizes data reuse at both cache levels.

95% confidence
A

SIMD string operations (strlen, memcmp, memcpy, strchr) become beneficial for strings longer than 16-32 bytes when using SSE, or 32-64 bytes for AVX. Below these lengths, scalar loops with branch prediction for early termination often win. Modern glibc/MSVC runtime libraries automatically dispatch to SIMD versions for larger strings. For custom implementations: SSE can process 16 bytes per iteration, AVX 32 bytes, with ~2 cycle per vector comparison. For memcpy specifically, SIMD helps above 64 bytes; for <64 bytes, use rep movsb (enhanced on recent CPUs) or unrolled scalar moves.

95% confidence
A

Misaligned access crossing a cache line boundary costs 16 cycles on Intel Atom (vs 4 cycles for aligned) - a 4x penalty. On modern Core i7 (Sandy Bridge and newer), there is no measurable penalty for misaligned access that doesn't cross cache lines. However, access spanning two cache lines always incurs double the memory traffic and potential 2x latency. Rule of thumb: always align data to its natural size (4-byte ints to 4-byte boundaries, 8-byte doubles to 8-byte boundaries), and align hot data structures to 64-byte cache line boundaries to ensure single-line access.

95% confidence

Optimization Decision Trees

61 questions
A

Use loop unrolling over vectorization when: 1) Loop body has complex control flow with data-dependent branches that prevent vectorization, 2) Operations are not amenable to SIMD (irregular memory access, non-contiguous data), 3) Loop iteration count is small and fixed (4-16 iterations) making SIMD setup overhead dominate, 4) You need to reduce loop overhead but data dependencies prevent parallel execution. Keep vectorization when: loop body is simple arithmetic on contiguous arrays, iteration count is large (>64), and operations map directly to SIMD instructions.

95% confidence
A

Use fixed-point when: 1) Target has no FPU or weak FPU (embedded, older ARM), 2) Deterministic results required across platforms, 3) Values have known bounded range, 4) Converting to/from FP would be in hot path anyway, 5) SIMD integer path is faster than FP on your hardware. Use floating-point when: 1) Dynamic range needed (values span orders of magnitude), 2) Modern CPU with fast FP (desktop/server), 3) Precision requirements beyond what fixed-point can offer, 4) Algorithms assume IEEE semantics, 5) Using libraries that expect FP. Fixed-point overhead: shift operations, range checking, more complex code.

95% confidence
A

Use virtual functions when: 1) Types are determined at runtime (plugins, user input), 2) Open extension is needed (new derived classes added without recompiling), 3) Collection of mixed types processed uniformly, 4) Overhead is acceptable (~15-25 cycles indirect call). Use static polymorphism (templates, CRTP) when: 1) Types are known at compile time, 2) Hot path where virtual call overhead matters, 3) Want inlining and further optimization, 4) Binary size is less concern than performance. Hybrid: use virtual dispatch at high level, template for inner loops. Virtual call overhead: indirect branch + possible icache miss for vtable.

95% confidence
A

Use software pipelining when: 1) Loop body has long-latency operations (loads, multiplies, divides), 2) Operations in different iterations are independent, 3) You can overlap load/compute/store from different iterations, 4) Loop runs many iterations (pipeline fill/drain overhead amortized), 5) Register file is large enough to hold multiple iterations in flight. Use simple unrolling when: 1) Operations are short-latency (simple ALU), 2) Goal is primarily reducing loop overhead (branch, counter update), 3) Few registers available (can't keep multiple iterations live), 4) Loop has loop-carried dependencies that prevent overlapping.

95% confidence
A

Use intrinsics when: 1) Compiler fails to vectorize (check assembly), 2) Need specific instruction sequences compiler won't generate, 3) Algorithm requires precise control over SIMD operations, 4) Performance is critical and you can invest in manual optimization, 5) Using advanced features (shuffles, gathers) that compilers handle poorly. Rely on auto-vectorization when: 1) Code is straightforward loops over arrays, 2) Portability across ISAs matters (auto-vec adapts), 3) Compiler does good job (verify with -fopt-info-vec or assembly), 4) Maintenance cost of intrinsics is prohibitive, 5) Code changes frequently (intrinsics require rework).

95% confidence
A

Use software prefetch when: 1) Access pattern is predictable to you but not to hardware (pointer chasing, indirect indexing), 2) Stride is larger than hardware can detect (often >2KB), 3) Access pattern changes rapidly (hardware needs training time), 4) Working on linked structures (trees, graphs) with known traversal order. Rely on hardware when: 1) Sequential or small-stride access (hardware handles this well), 2) Pattern is simple enough for prefetcher to learn, 3) Code needs to be portable (software prefetch effectiveness varies by CPU), 4) Don't want prefetch overhead in non-hot paths.

95% confidence
A

Use eager evaluation when: 1) Result will definitely be used, 2) Computation is cheap relative to tracking laziness, 3) Memory for intermediate results is acceptable, 4) Want predictable timing (no surprise delays), 5) Debugging is easier with immediate execution. Use lazy evaluation when: 1) Result may not be needed (conditional use), 2) Computation is expensive and avoidable, 3) Working with infinite or large sequences, 4) Building composable pipelines (filter, map, reduce), 5) Memory-constrained environment. Overhead: lazy evaluation adds thunk/closure creation cost. Don't use lazy for simple operations that will always execute.

95% confidence
A

Use branchless (predication, conditional moves) when: 1) Branch is unpredictable (misprediction rate >15-20%), 2) Both paths are cheap (< 5-10 cycles combined), 3) No side effects occur from speculatively computing wrong path, 4) Code is in a hot loop executed millions of times. Keep branching when: 1) Branch is highly predictable (>90% one direction), 2) One path is significantly more expensive than the other, 3) Skipped path has side effects (memory writes, I/O, exceptions), 4) Branchless version requires many more instructions, negating the benefit.

95% confidence
A

Use SIMD compress/expand (AVX-512 VPCOMPRESSD/VPEXPANDD) when: 1) Filtering arrays based on condition (sparse to dense or vice versa), 2) AVX-512 is available with good performance, 3) Processing large arrays where SIMD overhead is amortized, 4) Selectivity is moderate (10-90% kept). Use scalar when: 1) No AVX-512 or using AVX2 (emulation is complex and slow), 2) Selectivity is extreme (nearly all kept or nearly all filtered), 3) Small arrays where SIMD setup dominates, 4) Need portable code. AVX2 workaround: use pext/pdep for compression but slower than native AVX-512.

95% confidence
A

Cache-oblivious wins when: 1) Multiple cache levels exist and tuning for one hurts others, 2) Actual cache available varies (shared with other processes), 3) Data size varies across calls (one tuned block size doesn't fit all), 4) Virtual memory paging matters (cache-oblivious often optimizes for disk too), 5) TLB pressure is significant (cache-oblivious recursive structure often has better locality). Cache-aware typically wins when: single dominant cache level, dedicated cores, known data sizes, ability to tune extensively. Hybrid approach: cache-aware at top level, cache-oblivious for base cases.

95% confidence
A

Use reader-writer locks when: 1) Reads significantly outnumber writes (>10:1 ratio), 2) Read critical section is long enough that contention matters, 3) Multiple concurrent readers provide measurable benefit, 4) Write operations are infrequent. Use plain mutex when: 1) Read/write ratio is low or operations are short, 2) RW lock overhead exceeds benefit (RW locks are more complex), 3) Writes are frequent (writers starve with many readers), 4) Single-threaded read performance is adequate. Warning: RW locks can have writer starvation; use fair variants if writes must make progress. Uncontended RW lock is slower than uncontended mutex (~2-3x).

95% confidence
A

Prefer vertical (lane-parallel) operations when: 1) Processing independent data streams, 2) Same operation applies to all elements, 3) Data is naturally packed by operation type (SoA layout). Use horizontal (cross-lane) operations when: 1) Computing reductions (sum, min, max across vector), 2) Data arrives in AoS format requiring field extraction, 3) Shuffling/permuting data between lanes, 4) Dot products of small vectors. Minimize horizontal ops because they're typically 3-10x slower than vertical. Restructure algorithms to batch horizontal ops or convert to vertical form.

95% confidence
A

Apply false sharing avoidance when: 1) Different threads frequently write to adjacent memory locations, 2) Performance counters show high L1D cache misses or coherence traffic, 3) Scaling is poor despite independent work, 4) Data structures are arrays of small objects accessed by thread index. Techniques: 1) Pad structures to cache line (64 bytes typical), 2) Use alignas(64) on per-thread data, 3) Separate hot and cold data, 4) Use thread-local storage instead of shared array. Cost: increased memory usage. Don't pad everything; profile first to identify actual false sharing hotspots.

95% confidence
A

Combine unrolling with vectorization when: 1) SIMD width doesn't fully utilize execution units (unroll 2-4x SIMD operations to hide latency), 2) Loop has multiple independent SIMD operations that can execute in parallel, 3) Memory bandwidth is not the bottleneck and CPU has multiple vector execution units, 4) Unrolling enables better instruction scheduling between vector operations. Avoid combining when: memory bandwidth is saturated, register pressure is already high (causes spills), or loop body is already complex enough that compiler cannot schedule efficiently.

95% confidence
A

Use branch-free (conditional move, SIMD min/max) when: 1) Comparisons are unpredictable (random data), 2) Comparing many independent pairs (SIMD opportunity), 3) Code is in hot loop with high iteration count, 4) Both values are already in registers. Use branching when: 1) One outcome is much more likely (>90%), 2) Computing unused value is expensive, 3) Single comparison (setup overhead of branchless not amortized), 4) Comparison involves memory that doesn't need to be loaded if branch skipped. Note: compilers often generate cmov automatically; check assembly before manual optimization.

95% confidence
A

Use strength reduction when: 1) Computation can be converted to cheaper operation (multiply->shift, divide->multiply by reciprocal), 2) Lookup table would be large (>L1 cache, causing misses), 3) Memory bandwidth is the bottleneck, 4) Input values are not bounded to small range. Use lookup tables when: 1) Computation is expensive (trig functions, complex formulas), 2) Input domain is small enough for table to fit in cache (<4KB for L1), 3) Table has good temporal locality (values reused), 4) Computation cannot be simplified algebraically, 5) Precision requirements allow table interpolation.

95% confidence
A

Use compiler vector extensions (GCC vector types, Clang ext_vector_type) when: 1) Want portable SIMD across x86/ARM/etc, 2) Operations are standard arithmetic (+, -, *, /), 3) Compiler can optimize vector operations well, 4) Don't need specific instruction control. Use explicit intrinsics when: 1) Need instructions without vector extension equivalent (shuffles, special math), 2) Targeting specific microarchitecture optimizations, 3) Compiler generates suboptimal code from vector types, 4) Need precise control over instruction selection. Hybrid works: use vector types for common ops, intrinsics for specialized operations.
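
A short GCC/Clang vector-extension sketch (vector_size is a compiler extension, not ISO C); the same source can build to AVX on x86 or NEON on ARM:

    typedef float v8f __attribute__((vector_size(32)));   /* 8 floats per vector */

    /* y += a * x, eight lanes per iteration; nvec counts whole vectors. */
    void saxpy_vec(v8f *y, const v8f *x, int nvec, float a) {
        v8f av = {a, a, a, a, a, a, a, a};   /* broadcast the scalar */
        for (int i = 0; i < nvec; i++)
            y[i] += av * x[i];
    }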

95% confidence
A

Prefetch distance calculation: cycles_ahead = memory_latency / cycles_per_iteration. Typical values: L2 prefetch (50-100 cycles ahead), main memory (200-400 cycles ahead). Practical guidance: 1) Start with 16-64 cache lines ahead for main memory, 2) For L2, 4-16 lines ahead, 3) Adjust based on iteration time (faster iterations need more lookahead), 4) Too close: data not ready in time, 5) Too far: prefetched data evicted before use. Optimal distance depends on: memory latency, cache sizes, iteration cost, contention. Always profile: wrong distance can hurt due to cache pollution and prefetch instruction overhead.

95% confidence
A

Use speculative execution when: 1) Speculation is cheap and usually correct (>70% hit rate), 2) Recovery from wrong speculation is fast, 3) Latency is more important than throughput, 4) Parallel resources are available for speculation, 5) Verification can happen in parallel with dependent work. Wait for conditions when: 1) Speculation is often wrong (<50% success), 2) Wrong speculation has side effects that are hard to undo, 3) Resources are scarce (speculation wastes them), 4) Speculative work is expensive relative to wait time, 5) Correctness is paramount. Examples: branch prediction (CPU), request speculation (databases), prefetching (memory subsystem).

95% confidence
A

A branch is predictable enough when: 1) Pattern repeats regularly (TTTTTTTT or TFTFTFTF), 2) Same direction taken >90% of the time, 3) Pattern fits in branch history table (typically 2-4K entries), 4) Loop-carried pattern with fixed iteration count. Measure with CPU performance counters (branch-misses event). Rule of thumb: >5% misprediction rate on a hot branch warrants considering branchless. Modern predictors handle: nested loops, simple alternating patterns, correlated branches. They struggle with: random patterns, data-dependent branches with high entropy, very long patterns.

95% confidence
A

Use row-major when: 1) Language convention is row-major (C, C++, Python/NumPy default), 2) Algorithms traverse rows (image processing row by row), 3) Interfacing with row-major libraries (most C libraries). Use column-major when: 1) Language convention is column-major (Fortran, MATLAB, Julia), 2) Matrix operations are column-oriented (solving linear systems), 3) Using BLAS/LAPACK (optimized for column-major). Key insight: match storage to access pattern. If you iterate over columns but store row-major, you get cache misses on every access. Profile memory access patterns, not just language defaults.
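
The access-pattern point in C (row-major): the first loop nest walks memory sequentially, the second jumps a full row per step and misses the cache:

    #define ROWS 1024
    #define COLS 1024
    static float m[ROWS][COLS];   /* C is row-major: m[i][j+1] is adjacent in memory */

    float sum_rows_fast(void) {            /* sequential, prefetcher-friendly */
        float s = 0.0f;
        for (int i = 0; i < ROWS; i++)
            for (int j = 0; j < COLS; j++)
                s += m[i][j];
        return s;
    }

    float sum_cols_slow(void) {            /* 4KB stride per access */
        float s = 0.0f;
        for (int j = 0; j < COLS; j++)
            for (int i = 0; i < ROWS; i++)
                s += m[i][j];
        return s;
    }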

95% confidence
A

Horizontal reduction is acceptable when: 1) Performed once after processing many elements (amortized), 2) Reduction is final result, not intermediate in hot loop, 3) Alternative scalar code would require loading each element individually, 4) Using efficient reduction patterns (pairwise for accuracy, tree for speed). Avoid horizontal reduction when: 1) Inside inner loop (restructure to accumulate vertically, reduce once), 2) SIMD width is very large (AVX-512 reduction is expensive), 3) Reduced value feeds back into next SIMD iteration (creates dependency). Cost: ~3-5 cycles for 128-bit, ~5-8 for 256-bit, ~10-15 for 512-bit reductions.

95% confidence
A

Use NUMA-aware allocation when: 1) Running on multi-socket system (check with numactl --hardware), 2) Data is accessed primarily by specific threads that can be pinned to nodes, 3) Memory bandwidth is the bottleneck, 4) Dataset is large enough that cross-node traffic matters (>L3 cache). Use default allocation when: 1) Single-socket system, 2) Data is accessed by all threads equally, 3) Application is not memory-bandwidth bound, 4) Thread-to-core mapping is dynamic. NUMA overhead: 50-100% latency penalty for remote access, 50% bandwidth reduction. First-touch policy: allocate in thread that will primarily use the data.

95% confidence
A

Choose cache tiling when: 1) Data access pattern is predictable but reuses data multiple times (matrix multiply, stencil codes), 2) Working set exceeds cache size but can be partitioned into cache-fitting blocks, 3) Algorithm structure allows blocking without significant code complexity, 4) Temporal locality is more important than spatial locality. Choose prefetching when: 1) Access pattern is streaming (one-pass, no reuse), 2) Memory access is predictable but spread across large address range, 3) Hardware prefetcher cannot detect the pattern (indirect access, large strides), 4) You need to hide memory latency but data doesn't fit in cache anyway.

95% confidence
A

Use switch/jump table when: 1) Cases are dense integers (0, 1, 2...), 2) Branch predictor can learn pattern (repeated same cases), 3) Need compiler to inline case bodies, 4) Dispatch is in moderately hot path. Use function pointers when: 1) Cases are sparse or non-integer keys, 2) Functions are in different compilation units (can't inline anyway), 3) Need runtime configurability (plugins, callbacks), 4) Polymorphic behavior with clear interface. Performance: dense switch compiles to jump table (~same as function pointer array), but switch allows inlining. Indirect call has ~15-25 cycle penalty if mispredicted.

95% confidence
A

Size thresholds: 1) <4KB: Likely fits in L1D cache, good for frequently accessed tables, 2) 4KB-256KB: L2 cache territory, acceptable if access has locality, 3) 256KB-8MB: L3 cache, only if access pattern has strong locality or table is shared across cores, 4) >8MB: Will cause cache misses, often slower than computation. Key factors: access pattern (random vs sequential), reuse frequency, cache contention from other data. Test empirically: if table causes >5% L1 miss rate increase, consider computation instead. Modern CPUs can often compute faster than random memory access.

95% confidence
A

Use mmap when: 1) Random access pattern (no sequential read-ahead needed), 2) Multiple processes share same file (shared mapping), 3) File fits in address space and accesses have locality, 4) Want to leverage OS page cache automatically, 5) Treating file as array simplifies code. Use read/write when: 1) Sequential processing (read-ahead optimizations), 2) Need control over buffering and read size, 3) File is huge relative to address space, 4) Processing without page fault overhead is critical, 5) File is on network filesystem (mmap semantics problematic). Note: mmap has page fault overhead (~1000+ cycles) per new page accessed.

95% confidence
A

Use multiplication by reciprocal when: 1) Dividing by same constant multiple times (compute reciprocal once), 2) Floating-point precision loss is acceptable (1-2 ULP typically), 3) Division is in hot loop (division ~15-25 cycles, multiply ~4-5), 4) Compiler doesn't auto-optimize (check assembly). Keep division when: 1) Divisor changes each iteration (reciprocal computation overhead), 2) Need exact results (financial, deterministic simulation), 3) Division by variable with potential for divide-by-zero (reciprocal of 0 = inf, different behavior), 4) Integer division (requires different approach: magic numbers, not simple reciprocal).
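
A sketch of hoisting the reciprocal out of a hot loop; note that without flags such as -ffast-math / -freciprocal-math, compilers will not make this transformation on their own because it slightly changes results:

```cpp
#include <cstddef>

// Divisor is loop-invariant: pay for one division outside the loop,
// then multiply inside it. Results may differ by a couple of ULPs.
void scale_by(float* x, std::size_t n, float divisor) {
    const float inv = 1.0f / divisor;    // one divide, outside the hot loop
    for (std::size_t i = 0; i < n; ++i)
        x[i] *= inv;                     // ~4-5 cycle multiply instead of ~15-25 cycle divide
}
```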

95% confidence
A

Use PGO when: 1) Application has stable hot paths that training can capture, 2) Representative workload is available for profiling, 3) Willing to add profiling step to build process, 4) Performance gains justify build complexity (typically 10-30% for complex code). Use generic optimization when: 1) Workload varies significantly between runs, 2) Cannot create representative training workload, 3) Build simplicity is prioritized, 4) Code is already vectorized/optimized and PGO gains are marginal. PGO helps most with: branch prediction hints, function layout, inlining decisions, register allocation. Modern equivalent: AutoFDO uses production profiles.

95% confidence
A

Inline when: 1) Function is small (<10-20 instructions), 2) Function is called in hot path, 3) Inlining enables further optimizations (constant propagation, dead code elimination), 4) Function has parameters that are often constants, 5) Call overhead (stack frame, parameter passing) is significant relative to work done. Keep as call when: 1) Function is large (inlining causes code bloat), 2) Function is called from many sites (instruction cache pressure), 3) Function is rarely called (cold path), 4) Recursion is involved, 5) Function address is taken (function pointers, callbacks).

95% confidence
A

Use SIMD masking when: 1) Edge cases are scattered throughout data (predication per lane), 2) Both paths have similar cost, 3) Using AVX-512 (first-class mask support) or AVX2 blend, 4) Branching would be unpredictable. Use branching when: 1) Edge cases cluster (process main batch then handle edges), 2) Edge path is much more expensive (division, function call), 3) Edge cases are rare (<1-5%), 4) Edge handling has side effects. Hybrid: branch on vector-level condition (any/all edge cases), then use masking within the SIMD path. This catches common all-normal case efficiently.

95% confidence
A

Inlining is hurting when: 1) Instruction cache miss rate increases significantly, 2) Binary size grows substantially (>20-30% for hot code), 3) Compile times become excessive, 4) Profile shows icache stalls in previously fast code, 5) Same function inlined at many call sites causes code duplication. Diagnose with: instruction cache miss counters (L1-icache-load-misses), comparing binary sizes before/after, profiling showing unexpected icache bottlenecks. Fix by: marking large functions with noinline attribute, using link-time optimization (LTO) which can make better decisions, reducing aggressive inline thresholds in compiler flags.

95% confidence
A

Use loop fusion when: 1) Loops iterate over same range, 2) Combined loop body fits in instruction cache, 3) Data from first loop is immediately used by second (improves locality), 4) Register pressure allows holding intermediate values. Keep loops separate when: 1) Individual loops vectorize better separately, 2) Combined loop exceeds register capacity (causes spills), 3) Loops have different optimal tiling factors, 4) First loop produces data consumed by many subsequent operations, 5) Parallelization strategy differs between loops. Profile both: fusion reduces memory traffic but can hurt vectorization.
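
A before/after sketch of fusion on two elementwise loops; the intermediate array in the separate version is exactly what the fused version keeps in a register:

```cpp
#include <cstddef>

// Separate loops: 'tmp' is written out by the first loop and re-read by the
// second, doubling the memory traffic over the array.
void two_passes(const float* a, float* tmp, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) tmp[i] = a[i] * 2.0f;
    for (std::size_t i = 0; i < n; ++i) out[i] = tmp[i] + 1.0f;
}

// Fused loop: the intermediate stays in a register, so the data is streamed
// through once. Worth it only if the fused body still vectorizes and register
// pressure stays reasonable.
void fused(const float* a, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        float t = a[i] * 2.0f;
        out[i] = t + 1.0f;
    }
}
```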

95% confidence
A

Manually unroll when: 1) Compiler doesn't unroll despite #pragma hints, 2) You need specific unroll factor for SIMD alignment, 3) Unrolling enables manual prefetching at specific offsets, 4) Profiling shows the loop is hot and compiler under-optimized, 5) You need control over instruction scheduling between unrolled iterations. Trust compiler when: 1) Loop is straightforward (simple bounds, no complex control flow), 2) You use appropriate optimization flags (-O3, -funroll-loops), 3) Profile-guided optimization (PGO) is available, 4) Code needs to be portable across compilers, 5) Loop bounds vary (compiler can handle epilogue).

95% confidence
A

Use huge pages (2MB or 1GB) when: 1) Working set is large (>100MB), 2) Access pattern is random across large range (TLB misses are bottleneck), 3) Can preallocate memory (huge pages harder to allocate dynamically), 4) Application is long-running (amortize setup cost). Use regular pages when: 1) Working set is small or has good locality, 2) Memory usage is dynamic and unpredictable, 3) Memory is shared with other processes (huge pages can cause fragmentation), 4) Using memory-mapped files (file size alignment constraints). Check TLB miss rate (perf stat -e dTLB-load-misses); if >1% and random access, try huge pages.
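
A Linux sketch using transparent huge pages via madvise(MADV_HUGEPAGE) on a large anonymous mapping; this is a hint the kernel may ignore, and explicit MAP_HUGETLB (with preallocated huge pages) is the stricter alternative:

```cpp
#include <sys/mman.h>
#include <cstddef>
#include <cstdio>

// Ask for transparent huge pages on a large, long-lived, randomly accessed
// allocation; on failure or refusal the region simply stays on 4KB pages.
void* alloc_huge(std::size_t bytes) {
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return nullptr;
    if (madvise(p, bytes, MADV_HUGEPAGE) != 0)
        std::perror("madvise(MADV_HUGEPAGE)");   // non-fatal hint
    return p;
}

int main() {
    const std::size_t bytes = std::size_t(1) << 30;   // 1GB working set
    void* p = alloc_huge(bytes);
    if (!p) return 1;
    // ... touch and use p ...
    munmap(p, bytes);
}
```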

95% confidence
A

Use streaming (non-temporal) stores when: 1) Writing large amounts of data that won't be read again soon, 2) You want to avoid polluting cache with write-only data, 3) Write bandwidth is the bottleneck and you can saturate memory bus, 4) Data size significantly exceeds LLC (typically >10x). Use regular stores with prefetching when: 1) Written data will be read back shortly, 2) Writes are scattered or small (streaming stores require aligned, sequential writes), 3) You're updating existing cached data, 4) Write combining buffers are limited and you can't fill full cache lines.
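
A sketch of non-temporal stores with AVX intrinsics, assuming a 32-byte-aligned destination that will not be read back soon; the trailing sfence orders the weakly ordered streaming stores before later operations:

```cpp
#include <immintrin.h>
#include <cstddef>

// Fill a large, write-only buffer without pulling its cache lines into the
// cache hierarchy. Streaming stores want aligned, sequential, full-line writes.
void fill_stream(float* dst, std::size_t n, float value) {
    const __m256 v = _mm256_set1_ps(value);
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8)
        _mm256_stream_ps(dst + i, v);   // bypasses the cache (write-combining)
    _mm_sfence();                       // make streamed data globally visible in order
    for (; i < n; ++i)
        dst[i] = value;                 // scalar tail uses regular stores
}
```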

95% confidence
A

Use computed goto (GCC extension) when: 1) Building high-performance interpreter (20-30% faster than switch), 2) Dispatch is the dominant cost, 3) GCC/Clang are your target compilers, 4) Willing to sacrifice portability, 5) Many opcodes with variable execution times. Use switch when: 1) Portability required (MSVC doesn't support computed goto), 2) Compiler optimizes switch well for your case, 3) Opcode count is small (<50), 4) Maintenance and readability matter more than last 20% performance. Alternative: tail-call dispatch (each handler calls next) can approach computed goto performance with better portability.
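
A toy computed-goto dispatcher, assuming GCC or Clang (label-as-value `&&label` and `goto *` are extensions); the opcode set and program are illustrative:

```cpp
#include <cstdio>

// Each handler jumps directly to the next opcode's handler, giving the branch
// predictor one indirect branch per handler instead of one shared switch branch.
enum Op { OP_INC, OP_DEC, OP_HALT };

int run(const unsigned char* code) {
    static void* dispatch[] = { &&do_inc, &&do_dec, &&do_halt };
    int acc = 0;
    const unsigned char* pc = code;

    #define NEXT() goto *dispatch[*pc++]
    NEXT();
do_inc:  ++acc; NEXT();
do_dec:  --acc; NEXT();
do_halt: return acc;
    #undef NEXT
}

int main() {
    const unsigned char prog[] = { OP_INC, OP_INC, OP_DEC, OP_HALT };
    std::printf("%d\n", run(prog));   // prints 1
}
```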

95% confidence
A

Use scalar when: 1) Processing fewer than 2-4 elements (SIMD setup/extraction overhead dominates), 2) Data requires gather/scatter that's slower than scalar loads, 3) Operations involve many branches/conditionals that mask most lanes, 4) Data alignment cannot be guaranteed and unaligned SIMD is slow, 5) SIMD version requires expensive horizontal operations (reductions across lanes). SIMD break-even points: typically 4+ floats for SSE, 8+ for AVX, 16+ for AVX-512. Exception: if scalar code is followed by SIMD, may be worth vectorizing small sizes to avoid transition penalties.

95% confidence
A

Use buffered I/O (standard) when: 1) Access pattern has temporal locality (rereading data), 2) Small random reads/writes (buffer coalescing helps), 3) Want OS to manage caching, 4) Don't need precise I/O timing control. Use direct I/O (O_DIRECT) when: 1) Implementing your own caching layer (databases), 2) Streaming large files once (avoid polluting page cache), 3) Need predictable I/O latency (no page cache eviction delays), 4) Memory is limited and cache pressure is high, 5) Benchmarking raw device performance. Direct I/O requires aligned buffers and has higher per-request overhead.

95% confidence
A

Use gather/scatter when: 1) Data layout cannot be changed (external APIs, legacy code), 2) Access pattern is truly irregular (sparse matrices, indirect indexing), 3) Gathered elements are processed enough to amortize gather cost, 4) Alternative is scalar loop with same random access pattern. Restructure data when: 1) You control the data layout, 2) Gathers would be frequent in hot path (AVX2 gather ~15-25 cycles vs 3 for packed load), 3) Same data accessed multiple times (restructure once, benefit many times), 4) Data naturally fits SoA or AoSoA without major code changes.

95% confidence
A

Use loop fission when: 1) Loop body is too large for vectorization, 2) Different parts have different optimization opportunities, 3) Register pressure causes spills in unified loop, 4) Want to parallelize parts independently, 5) Some iterations need different cache behavior (streaming vs reuse). Keep unified when: 1) Loop body has good locality that would be lost, 2) Fission would require multiple passes over data (bandwidth limited), 3) Loop carries dependencies between would-be-split parts, 4) Iteration overhead would multiply. Fission increases memory traffic but enables targeted optimization.

95% confidence
A

Use static scheduling when: 1) Work per iteration is uniform and predictable, 2) Iteration count is known at compile time, 3) You want minimal scheduling overhead, 4) Load balancing is not a concern, 5) Cache affinity matters (each thread processes same memory region). Use dynamic scheduling when: 1) Work varies significantly between iterations, 2) Iteration times are unpredictable (data-dependent), 3) Some iterations may block on I/O or locks, 4) Hardware has heterogeneous performance (power throttling, NUMA effects), 5) You're processing a work queue with varying task sizes.
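
A sketch contrasting the two OpenMP schedule clauses; the kernels and the chunk size of 16 are illustrative:

```cpp
#include <omp.h>
#include <cmath>
#include <cstddef>

// Uniform work -> static: zero scheduling overhead, stable cache affinity.
void uniform_work(double* x, std::size_t n) {
    #pragma omp parallel for schedule(static)
    for (long long i = 0; i < (long long)n; ++i)
        x[i] = std::sqrt(x[i]);                     // same cost every iteration
}

// Data-dependent work -> dynamic with a modest chunk: balances load without
// paying per-iteration scheduling cost.
void irregular_work(double* x, const int* iters, std::size_t n) {
    #pragma omp parallel for schedule(dynamic, 16)
    for (long long i = 0; i < (long long)n; ++i)
        for (int k = 0; k < iters[i]; ++k)          // cost varies per element
            x[i] = std::sqrt(x[i] + 1.0);
}
```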

95% confidence
A

Use thread-local storage when: 1) Each thread maintains independent state (counters, buffers, caches), 2) Want to eliminate synchronization overhead entirely, 3) Combining thread-local results is infrequent (aggregate at end), 4) Memory overhead of per-thread copies is acceptable. Use global with locking when: 1) Threads must see each other's updates immediately, 2) State is inherently shared (work queue, shared cache), 3) Memory is constrained (can't duplicate per thread), 4) Operations are infrequent (locking overhead acceptable). TLS is ~3 cycles on modern systems; uncontended lock is ~15-25 cycles.
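
A sketch of thread-local accumulation with a single atomic merge per thread; the counter workload is illustrative:

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Each thread bumps its own counter with no synchronization; totals are
// combined once per thread instead of paying an atomic RMW per increment.
std::atomic<long long> g_total{0};
thread_local long long t_count = 0;

void worker(int items) {
    for (int i = 0; i < items; ++i)
        ++t_count;                                               // plain increment, no contention
    g_total.fetch_add(t_count, std::memory_order_relaxed);       // one atomic per thread
}

int main() {
    std::vector<std::thread> pool;
    for (int t = 0; t < 4; ++t) pool.emplace_back(worker, 1'000'000);
    for (auto& th : pool) th.join();
    return g_total.load() == 4'000'000 ? 0 : 1;
}
```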

95% confidence
A

Use stack allocation when: 1) Size is known at compile time and small (<1MB typically), 2) Object lifetime matches function scope, 3) Performance is critical (stack allocation is essentially free, a single stack-pointer adjustment, vs ~100-1000+ cycles for malloc), 4) You want guaranteed deallocation (no memory leaks), 5) Recursive depth is bounded and known. Use malloc/heap when: 1) Size is determined at runtime or is large, 2) Object must outlive the creating function, 3) Size exceeds safe stack limits (risk of stack overflow), 4) Need to resize (realloc), 5) Shared between threads with different lifetimes. Consider alloca for dynamic stack allocation, with caution.


95% confidence
A

Use SIMD shuffles when: 1) Permutation pattern is fixed at compile time, 2) Using AVX2+ with powerful shuffle instructions (VPERM, VSHUF), 3) Need to permute within or across lanes efficiently, 4) Data is already in SIMD registers. Use table-based when: 1) Permutation varies at runtime (shuffle control loaded from a LUT), 2) Pattern doesn't map to available shuffle instructions, 3) Implementing arbitrary byte-level permutation, 4) Preprocessing time is available to build an optimal sequence. Hybrid: pshufb with a table-loaded control vector is flexible and fast when the LUT fits in cache.

95% confidence
A

Use AoS when: 1) You frequently access all fields of a single entity together, 2) Entities are accessed randomly by index, 3) Cache line utilization is good because you use most fields per access, 4) Code readability and maintainability are priorities, 5) Object-oriented design with methods operating on complete entities. Use SoA when: 1) Operations process one or few fields across many entities, 2) SIMD operations are common (vectorizing same operation across entities), 3) Different fields have different access frequencies, 4) Memory bandwidth is critical and you want to minimize cache line waste.
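
A layout sketch contrasting the two; the particle fields are illustrative:

```cpp
#include <cstddef>
#include <vector>

// AoS: natural for per-entity logic, but a loop touching only x/y drags the
// whole 32-byte struct through the cache.
struct ParticleAoS { float x, y, z, mass, vx, vy, vz, pad; };

// SoA: each field is a contiguous array, so a kernel that updates only x/y
// reads exactly the bytes it needs and vectorizes trivially.
struct ParticlesSoA {
    std::vector<float> x, y, z, mass, vx, vy, vz;
};

void advance(ParticlesSoA& p, float dt, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {   // contiguous, SIMD-friendly
        p.x[i] += p.vx[i] * dt;
        p.y[i] += p.vy[i] * dt;
    }
}
```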

95% confidence
A

Use CAS/atomics when: 1) Operation is simple (counter, flag, pointer swap), 2) Contention is low to moderate, 3) Critical section would be very short (< 100 cycles), 4) Want to avoid kernel transitions (mutex may sleep), 5) Building lock-free data structures. Use mutexes when: 1) Critical section is complex or long, 2) Need to protect multiple related operations atomically, 3) Contention is high (spinning wastes CPU), 4) Operations include blocking calls (I/O, allocation), 5) Simpler correctness reasoning needed. CAS loops can cause livelock under high contention; mutexes guarantee progress.
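
A small CAS-loop sketch (lock-free "store the maximum"), illustrating the retry pattern and why high contention makes such loops expensive:

```cpp
#include <atomic>

// Retry only when another thread raced in between the load and the CAS;
// under heavy contention each retry re-bounces the cache line, which is
// when a mutex becomes the better choice.
void atomic_max(std::atomic<int>& target, int value) {
    int current = target.load(std::memory_order_relaxed);
    while (current < value &&
           !target.compare_exchange_weak(current, value,
                                         std::memory_order_relaxed)) {
        // compare_exchange_weak reloads 'current' on failure; the loop re-checks it.
    }
}
```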

95% confidence
A

Thresholds depend on work per item: 1) Trivial work (few cycles): need millions of items, often better sequential, 2) Light work (100-1000 cycles): 10K-100K items to amortize thread overhead, 3) Medium work (1K-10K cycles): 1K-10K items sufficient, 4) Heavy work (>10K cycles): even 100 items may benefit. Key factors: thread creation/synchronization cost (~10K-100K cycles), cache effects (parallel may thrash shared cache), memory bandwidth (may saturate with few threads). Practical rule: if total work < 1 million cycles, stay sequential unless profiling shows benefit.

95% confidence
A

Use bounds checking when: 1) Input comes from untrusted sources, 2) Index computation is complex or error-prone, 3) Debugging or development builds, 4) Security is critical (buffer overflows), 5) Performance cost is negligible (cold paths, complex processing per element). Remove bounds checking when: 1) Proven safe by construction (loop from 0 to len-1), 2) Inner loop where check dominates computation time, 3) Already validated at higher level, 4) Using memory-safe language with compiler optimization. Technique: check once before loop, use unchecked access inside. Profile to confirm checking is actually the bottleneck.

95% confidence
A

Prefer AVX-512 when: 1) Algorithm is compute-bound (utilize double width), 2) Can fill all 512 bits with useful work, 3) Using mask operations (AVX-512 masks are more efficient), 4) Running on server-class CPUs with good AVX-512 support. Prefer AVX2 when: 1) Running on consumer CPUs (many throttle frequency for AVX-512), 2) Cannot fill 512 bits (wasting execution resources), 3) Memory-bound anyway (wider SIMD doesn't help), 4) Need consistent performance across CPU generations, 5) Power/thermal constraints matter. Test both: on some CPUs, AVX2 at higher frequency beats throttled AVX-512.

95% confidence
A

Use integer when: 1) Values are naturally discrete (counts, indices, flags), 2) Exact computation required (no rounding), 3) Targeting older hardware or embedded systems with weak FPU, 4) Division is common (integer division can use magic multiply, while FP divide is slow). Use floating-point when: 1) Values represent continuous quantities, 2) Range varies significantly (float handles 10^38 range), 3) Multiplication/addition dominant (modern FPUs match or exceed integer), 4) Using SIMD (FP SIMD often better supported), 5) Converting to/from int frequently (conversion has cost).

95% confidence
A

Use exceptions when: 1) Errors are exceptional (rare), 2) Multiple call levels would need to pass error codes up, 3) Constructor/destructor errors need handling (can't return codes), 4) Cleaner code logic flow is prioritized, 5) Using RAII for resource management. Use error codes when: 1) Errors are common (expected conditions), 2) Zero-cost error path is required (exceptions have throw cost), 3) Interfacing with C code, 4) Real-time constraints (exception unwinding is unpredictable), 5) Embedded systems with limited runtime. Note: modern C++ exceptions have near-zero cost when not thrown; thrown exceptions cost ~1000+ cycles for unwinding.

95% confidence
A

Choose dynamic(1) when: iterations vary wildly and any iteration could be the slowest; minimum overhead matters less than perfect load balancing. Choose dynamic(chunk) when: variation is moderate and you want to reduce scheduling overhead versus chunk=1. Choose guided when: load is somewhat imbalanced, especially when the expensive iterations fall toward the end of the range, and you want lower overhead than dynamic(1): guided starts with large chunks (roughly iterations/threads) and shrinks them geometrically, so the fine-grained tail evens out the load at the end. Be wary of guided when the first iterations are the heaviest, since the first (largest) chunk can dominate. Rule of thumb: start with guided, profile, switch to dynamic if load imbalance persists at the end of the parallel region.

95% confidence
A

Use OpenMP when: 1) Parallelism is loop-based or fork-join, 2) Want portable parallel code (supported by major compilers), 3) Incremental parallelization of existing serial code, 4) Task-based parallelism with clear dependencies, 5) Team has mixed parallel programming experience. Use manual threading when: 1) Need fine control over thread affinity/priority, 2) Parallelism pattern doesn't fit OpenMP model, 3) Building long-running thread pools with custom scheduling, 4) Integrating with existing threading framework, 5) Need deterministic thread behavior for debugging. OpenMP has ~1-5 microsecond overhead per parallel region.

95% confidence
A

Use memory pools when: 1) Allocating many objects of same size, 2) Allocation/deallocation is in hot path (pools can be <10 cycles), 3) Want to avoid fragmentation, 4) Need bulk deallocation (release entire pool at once), 5) Objects have similar lifetimes. Stick with malloc when: 1) Object sizes vary significantly, 2) Allocation is rare (cold path), 3) Memory usage is unpredictable, 4) Pool management complexity isn't worth the gain, 5) Using language with good allocator (jemalloc, tcmalloc). Pool overhead: initial setup, pool size decisions, potential memory waste from granularity.
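
A minimal fixed-size pool sketch with an intrusive free list; production pools add alignment control, growth, and thread safety, none of which is shown here:

```cpp
#include <cstddef>
#include <vector>

// Carve same-sized, pointer-aligned blocks out of one slab and keep freed
// blocks on an intrusive free list: acquire/release are a handful of
// instructions, and the whole pool is dropped at once on destruction.
class FixedPool {
public:
    FixedPool(std::size_t block_size, std::size_t count)
        : block_(round_up(block_size < sizeof(void*) ? sizeof(void*) : block_size)),
          slab_(block_ * count) {
        for (std::size_t i = 0; i < count; ++i)
            release(slab_.data() + i * block_);      // seed the free list
    }
    void* acquire() {                                // pop head of free list
        if (!head_) return nullptr;                  // pool exhausted
        void* p = head_;
        head_ = *static_cast<void**>(head_);
        return p;
    }
    void release(void* p) {                          // push onto free list
        *static_cast<void**>(p) = head_;
        head_ = p;
    }
private:
    static std::size_t round_up(std::size_t s) {     // keep blocks pointer-aligned
        return (s + alignof(void*) - 1) & ~(alignof(void*) - 1);
    }
    std::size_t block_;
    std::vector<unsigned char> slab_;
    void* head_ = nullptr;
};
```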

95% confidence
A

Use batch processing when: 1) Data arrives in discrete chunks, 2) Processing has setup overhead amortized over batch, 3) Can tolerate latency (not real-time), 4) Memory is sufficient for batch accumulation, 5) Operations benefit from sorting/grouping (reduce random access). Use streaming when: 1) Data is continuous or unbounded, 2) Latency matters (real-time requirements), 3) Memory is limited (can't buffer), 4) Operations are independent per item (no benefit from batching), 5) Want simpler programming model. Hybrid: micro-batching (small batches with low latency) combines benefits when tuned properly.

95% confidence
A

Use AoSoA when: 1) You need SIMD efficiency but also access multiple fields per entity, 2) SIMD width matches your natural processing batch size (e.g., 8 floats for AVX), 3) You want cache-friendly access while enabling vectorization, 4) Processing entities in small batches that fit SIMD registers, 5) Typical case: particle systems, ECS game engines, physics simulations. Structure: group entities into small arrays (size = SIMD width), then array of these groups. Avoid AoSoA when: random single-entity access is common, or data access patterns don't align with SIMD batching.
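
A layout sketch of AoSoA with packets sized to an 8-wide AVX register; the field names and the scalar inner loop (which a compiler can map directly onto SIMD lanes) are illustrative:

```cpp
#include <cstddef>

// Each packet holds 8 entities field-by-field: one aligned vector load grabs
// x (or vx) for a whole SIMD batch, while all fields of those 8 entities stay
// on nearby cache lines.
constexpr std::size_t W = 8;                 // SIMD width in floats (AVX)

struct ParticlePacket {
    alignas(32) float x[W];
    alignas(32) float y[W];
    alignas(32) float vx[W];
    alignas(32) float vy[W];
};

void advance(ParticlePacket* packets, std::size_t num_packets, float dt) {
    for (std::size_t p = 0; p < num_packets; ++p)
        for (std::size_t i = 0; i < W; ++i) {   // inner loop maps 1:1 to SIMD lanes
            packets[p].x[i] += packets[p].vx[i] * dt;
            packets[p].y[i] += packets[p].vy[i] * dt;
        }
}
```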

95% confidence
A

Use hardware popcount (POPCNT instruction) when: available (most x86-64 since 2008, ARM since ARMv8). Use lookup table when: 1) No hardware support, 2) Processing bytes (256-entry table fits in cache), 3) Counting many values with good locality. Use bit manipulation (SWAR) when: 1) No hardware support and memory is constrained, 2) Processing wide values where table would be too large, 3) SIMD version needed (parallel popcount across lanes). Performance: hardware ~1 cycle, table ~3-4 cycles (with hit), bit manipulation ~10-15 cycles for 64-bit. SIMD popcount via pshufb+paddb is excellent for bulk data.
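
Sketches of the hardware and SWAR paths, assuming a GCC/Clang-style __builtin_popcountll for the hardware case:

```cpp
#include <cstdint>
#include <cstddef>

// Hardware path: __builtin_popcountll compiles to a single POPCNT/CNT
// instruction when the target supports it (e.g. -mpopcnt or -march=x86-64-v2).
inline unsigned popcount_hw(std::uint64_t x) {
    return static_cast<unsigned>(__builtin_popcountll(x));
}

// SWAR fallback: pure bit manipulation, roughly a dozen ops, no table and no
// special hardware; useful on targets without a popcount instruction.
inline unsigned popcount_swar(std::uint64_t x) {
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    return static_cast<unsigned>((x * 0x0101010101010101ULL) >> 56);
}

std::size_t count_bits(const std::uint64_t* data, std::size_t n) {
    std::size_t total = 0;
    for (std::size_t i = 0; i < n; ++i)
        total += popcount_hw(data[i]);
    return total;
}
```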

95% confidence
A

Use cache-aware when: 1) Cache sizes are known at compile/deploy time, 2) You can tune parameters for specific hardware, 3) Maximum performance is critical and you can afford tuning, 4) Algorithm has natural blocking parameters (matrix operations), 5) Memory hierarchy is simple (embedded systems). Use cache-oblivious when: 1) Code must run well on diverse hardware, 2) Tuning is impractical (library code, many deployment targets), 3) Memory hierarchy is complex (multiple cache levels, NUMA), 4) Algorithm naturally decomposes recursively, 5) Portability matters more than last 10-20% performance.

95% confidence
A

Use spinlocks when: 1) Critical section is very short (<1 microsecond), 2) Running on dedicated cores (spinning doesn't steal from other work), 3) Lock holder won't be preempted (kernel or real-time context), 4) Contention is rare (uncontended fast path is few cycles). Use sleeping locks (mutex) when: 1) Critical section is longer, 2) Lock holder might sleep or be preempted, 3) Oversubscribed system (spinning wastes shared CPU), 4) Fairness matters (spinlocks often unfair), 5) Power efficiency matters (spinning prevents CPU sleep). Hybrid: spin briefly then sleep (adaptive mutex) for moderate sections.
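
A test-and-test-and-set spinlock sketch in standard C++ atomics; the yield in the wait loop stands in for a pause/cpu_relax hint, and real code would usually prefer std::mutex unless profiling says otherwise:

```cpp
#include <atomic>
#include <thread>

// Only appropriate for very short critical sections on cores that will not be
// preempted while holding the lock.
class Spinlock {
public:
    void lock() {
        while (flag_.exchange(true, std::memory_order_acquire)) {
            // Spin on a plain load first so waiting cores don't ping-pong the
            // cache line with read-modify-write traffic.
            while (flag_.load(std::memory_order_relaxed))
                std::this_thread::yield();   // or a pause/cpu_relax intrinsic
        }
    }
    void unlock() { flag_.store(false, std::memory_order_release); }
private:
    std::atomic<bool> flag_{false};
};
```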

95% confidence