
Performance Optimization FAQ & Answers

133 expert Performance Optimization answers researched from official documentation. Every answer cites authoritative sources you can verify.

Heuristics and Rules of Thumb

72 questions
A

L1 instruction cache is typically 32KB on modern x86 CPUs (Intel and AMD). Hot code paths should fit within this to avoid instruction cache misses. Key implications: aggressive loop unrolling may hurt if it expands hot loop beyond 32KB; inline functions judiciously to avoid code bloat; keep related functions together for better I-cache locality. Measure instruction cache miss rate if performance is unexpectedly poor after optimization. Unrolling from 4x to 16x might improve data-path efficiency but hurt overall performance if code no longer fits in I-cache.

95% confidence
A

L3 cache access latency is 30-50 cycles on modern CPUs, approximately 12-20 nanoseconds. Specifically: Intel Kaby Lake: 42 cycles (16.8 ns at 2.5 GHz); Intel Haswell: 34 cycles (13 ns at 2.6 GHz); AMD Zen: ~35-40 cycles. L3 is shared across all cores and typically ranges from 8MB to 64MB on desktop/server CPUs. L3 latency varies with core count and NUMA topology. For multi-threaded applications, L3 hit rate determines cross-core data sharing efficiency. L3 misses go to DRAM with 100+ cycle penalty.

95% confidence
A

Float (32-bit) vs double (64-bit) performance: same latency and throughput per instruction on modern x86 for scalar operations; 2x SIMD throughput for float (8 floats vs 4 doubles in 256-bit AVX register); 2x memory bandwidth efficiency for float (half the bytes). Use float when: precision is sufficient (7 significant digits), memory bandwidth is bottleneck, or SIMD width matters. Use double when: numerical precision needed (15 digits), accumulating many values (less rounding error), or mixing with double-precision libraries. Memory-bound code sees ~2x speedup from float; compute-bound sees less difference.

95% confidence
A

Atomic operations cost 10-100+ cycles depending on contention and cache state: uncontended atomic on local L1 cache: 10-20 cycles; contended atomic requiring cache line bounce between cores: 50-200 cycles; atomic across NUMA nodes: 100-300+ cycles. Compare to regular load/store: 4-5 cycles from L1. Lock-free algorithms using CAS loops can waste unpredictable cycles under high contention. Rule of thumb: minimize atomic operations in hot paths, batch updates when possible, use thread-local accumulation with periodic synchronization, and consider cache line padding to prevent false sharing on atomic variables.

95% confidence
A

Typical TLB coverage with 4KB pages: L1 DTLB: 64-128 entries = 256-512KB; L2 STLB: 1024-2048 entries = 4-8MB. Working sets exceeding TLB coverage suffer page walk penalties. When TLB miss rate >1%, consider huge pages. With 2MB huge pages: same 1024 STLB entries cover 2GB. Signs of TLB pressure: high DTLB miss rate in profiler, performance cliff at specific working set sizes, random access patterns over large memory regions. Solutions: huge pages, improve memory locality, reduce working set, or use cache blocking to reuse TLB entries.

95% confidence
A

Keep no more than 10-12 live variables within a hot loop to avoid register spills on x86-64. Techniques to reduce register pressure: keep live ranges short by using variables close to their definitions, avoid excessive loop unrolling (which multiplies live variables), use restrict pointers to enable better register allocation, break complex expressions into simpler ones the compiler can optimize. Register spills inside hot loops cause significant performance degradation due to added memory traffic. When comparing unroll factors, measure performance to find the sweet spot between instruction-level parallelism and register pressure.

95% confidence
A

An IPC below 0.7 indicates significant room for optimization and limited use of processor capabilities. This typically signals memory-bound execution with frequent cache misses, pipeline stalls from data dependencies, or poor instruction-level parallelism. A CPI (cycles per instruction, the inverse) greater than 1 suggests stall-bound execution. To improve: reduce memory access latency through better cache utilization, eliminate data dependencies through loop unrolling or software pipelining, and ensure sufficient independent instructions for out-of-order execution to exploit.

95% confidence
A

Expected vectorization speedup = min(SIMD_width, arithmetic_intensity * memory_bandwidth / compute_rate). For compute-bound code: theoretical max is SIMD width (4x for SSE float, 8x for AVX float). For memory-bound code: speedup is limited by bandwidth, typically 1.5-3x regardless of SIMD width. Practical rule: expect 50-70% of theoretical SIMD width speedup for well-vectorized compute-bound code, and 1.5-2x for memory-bound code. Factors reducing speedup: unaligned access, gather/scatter operations, horizontal operations, and remainder handling. Measure actual speedup - it varies significantly by workload.

95% confidence
A

Use lookup tables when: computation takes >20 cycles and table fits in L1 cache (<=32KB), access pattern is unpredictable (no benefit from branch prediction), or function is called millions of times. Use computation when: table would exceed L2 cache (causing cache pollution), access pattern allows branch prediction to work well, or computation is simple (<10 cycles). Typical breakeven: 256-entry byte table (256 bytes) is almost always beneficial; 64K-entry table (64KB+) requires careful analysis. Memory latency (4 cycles L1) vs compute (1-20 cycles) determines winner.

95% confidence
A

Prevent false sharing by padding thread-local data to cache line boundaries (64 bytes on x86, 128 bytes on Apple M-series). Add 64 bytes of padding between variables accessed by different threads. In C: use '__attribute__((aligned(64)))' (or C11 'alignas(64)') or manually insert padding arrays. In Java 8+: use the '@Contended' annotation, which adds 128 bytes of padding. In Go: use 'cpu.CacheLinePad' between fields. The LMAX Disruptor uses 7 long fields (56 bytes) as padding before and after the cursor. While padding wastes memory, it can provide order-of-magnitude performance improvements in contended scenarios.
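
A minimal C11 sketch of the padding idea, assuming a hypothetical array of per-thread counters:

    #include <stdalign.h>
    #include <stdint.h>

    #define CACHE_LINE  64    /* 128 on Apple M-series */
    #define NUM_THREADS 8     /* illustrative */

    /* Give each thread's counter its own cache line so increments by
       different threads never invalidate each other's line. The alignas
       forces sizeof(struct padded_counter) up to a full 64 bytes. */
    struct padded_counter {
        alignas(CACHE_LINE) uint64_t value;
    };

    static struct padded_counter counters[NUM_THREADS];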

95% confidence
A

Plan for 1-3MB of LLC per core for working set sizing. Typical configurations: Intel desktop: 2MB per core (16MB shared / 8 cores); AMD Zen 3: 4MB per core (an 8-core CCX shares 32MB L3); server CPUs: 1.25-2.5MB per core. Note L3 is shared, so under load the effective per-core share decreases. For multi-threaded optimization: total_working_set should fit in total_L3 * 0.7 (leave room for the OS and other threads). For single-threaded code: a working set up to the full L3 is reasonable, but hot data still benefits from L2 blocking.

95% confidence
A

Main memory (DRAM) access latency is 150-300 cycles, approximately 60-100 nanoseconds on modern systems - roughly 40-60x slower than L1 cache. The latency includes: L3 miss detection (~40 cycles), memory controller processing, DRAM row activation and column access (CAS latency), and data transfer. DDR4 typical latency: 60-80 ns; DDR5: 70-90 ns (higher frequency but also higher CAS latency). Memory-bound code can see the processor stalling for hundreds of cycles per access. This 'memory wall' makes cache optimization crucial for performance.

95% confidence
A

Sequential access achieves 10-100x higher throughput than random access due to prefetching and cache line utilization. Typical measurements: sequential read: 30-50 GB/s (DDR4), 60-80 GB/s (DDR5); random read (8-byte): 0.5-2 GB/s (limited by latency, not bandwidth). The gap comes from: prefetchers work for sequential patterns (hiding 200+ cycle DRAM latency), each cache line (64 bytes) fully utilized in sequential vs partially in random, and memory controller optimizations for streaming. Design data structures for sequential access in hot paths wherever possible.

95% confidence
A

Prefer power-of-two array sizes for: fast modulo via bitwise AND (x & (size-1)), efficient cache blocking, SIMD alignment without remainders. Avoid power-of-two sizes when: accessing with a power-of-two stride (causes cache set conflicts), or multiple power-of-two arrays compete for the same cache sets. Mitigation: pad arrays to 'size + cache_line_size' to break the alignment. Example: walking a float array with a 1024-byte (256-element) stride maps every access to only 4 of a 32KB L1 cache's 64 sets, wasting 15/16 of the cache; adding a 16-element (64-byte) pad spreads accesses across all sets.
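
A small C illustration of the power-of-two modulo trick, using a hypothetical ring-buffer index:

    #include <stddef.h>

    #define RING_SIZE 1024u   /* must be a power of two */

    /* Wraps in a single AND; a runtime, non-power-of-two size would
       need a 20-80 cycle integer divide instead. */
    static inline size_t ring_wrap(size_t i) {
        return i & (RING_SIZE - 1u);
    }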

95% confidence
A

TLB (Translation Lookaside Buffer) miss penalty varies by level: L1 ITLB miss: 7-10 cycles (usually hidden by out-of-order execution); STLB (second-level TLB) miss triggering page walk: 20-100+ cycles depending on page table depth and cache residency of page table entries. A full 4-level page walk hitting DRAM at each level could cost 400+ cycles. Reduce TLB misses by: minimizing working set to fit in TLB coverage, using huge pages (2MB instead of 4KB - requires 512x fewer TLB entries), and improving memory access locality.

95% confidence
A

Use pool allocators when: allocating >1000 objects of the same size per second, object lifetime is predictable (bulk allocate/free), or allocation overhead shows up in profiling. Pool allocators reduce malloc overhead from 50-100 cycles to 10-20 cycles by eliminating search and fragmentation handling. Implementation: pre-allocate chunks of N objects, maintain free list with O(1) alloc/free. Common thresholds: objects <256 bytes benefit most; allocation frequency >10,000/second sees significant gains. Memory pools also improve cache locality since objects are contiguous.
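
A minimal sketch of such a free-list pool in C (fixed capacity, single-threaded, names hypothetical):

    #include <stdlib.h>

    typedef struct pool_node { struct pool_node *next; } pool_node;

    typedef struct {
        void      *chunk;   /* one pre-allocated block of n objects   */
        pool_node *free;    /* singly linked free list threaded in it */
    } pool_t;

    /* Pre-allocate n objects of obj_size bytes (obj_size >= sizeof(void*)). */
    static int pool_init(pool_t *p, size_t obj_size, size_t n) {
        char *mem = malloc(obj_size * n);
        if (!mem) return -1;
        p->chunk = mem;
        p->free  = NULL;
        for (size_t i = 0; i < n; i++) {          /* thread objects onto the free list */
            pool_node *node = (pool_node *)(mem + i * obj_size);
            node->next = p->free;
            p->free    = node;
        }
        return 0;
    }

    static void *pool_alloc(pool_t *p) {          /* O(1): pop the free-list head */
        pool_node *node = p->free;
        if (node) p->free = node->next;
        return node;
    }

    static void pool_free(pool_t *p, void *obj) { /* O(1): push back onto the list */
        pool_node *node = obj;
        node->next = p->free;
        p->free    = node;
    }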

95% confidence
A

Context switch cost is 1000-10000 cycles (0.5-5 microseconds) depending on working set size and cache pollution. Direct costs: ~1000-2000 cycles for register save/restore and TLB flush. Indirect costs: 5000-50000+ cycles to reload caches with new process working set. For threads sharing address space (no TLB flush needed): 1000-3000 cycles. This is why spinlocks can win for very short critical sections (<1000 cycles) - the context switch from blocking costs more than spinning. Minimize context switches in latency-sensitive code by using thread pinning and avoiding blocking operations.

95% confidence
A

Target L2 cache hit rate of 90% or higher. Hit rates below 70% suggest the working set is too large or access patterns cause thrashing. L2 measures how well your working set fits: low rates indicate too many unique data accesses or poor temporal locality. With L2 miss penalty of 20-40 cycles to L3 (or 100-300 cycles to DRAM for L3 misses), even small hit rate improvements matter significantly. Design data structures to fit working sets within L2 size (typically 256KB-1MB per core) and consider cache blocking for larger datasets.

95% confidence
A

Batch size should make overhead <10% of useful work. Examples: system calls with 500-cycle overhead: batch 5000+ cycles of work (10+ small operations); network packets with 10 microsecond latency: batch 100+ microseconds of data; database commits with 1ms overhead: batch 10+ ms of transactions. For parallel work distribution: minimum work per thread = parallel_overhead / target_overhead_fraction. If OpenMP fork/join costs 10 microseconds, each thread needs >100 microseconds of work to keep overhead under 10%. Measure both latency and throughput - batching trades latency for throughput.

95% confidence
A

Start with initial backoff of 1-4 iterations, double after each failed attempt, cap maximum at 1000-10000 iterations before falling back to blocking. Common implementation: initial=1, multiply by 2 each iteration, max_backoff=1000 cycles, then call yield() or switch to mutex. Exponential backoff reduces cache line bouncing and improves throughput under contention. Without backoff, test-and-set spinlocks cause severe cache coherence traffic. TTAS (test-and-test-and-set) with exponential backoff performs well even with many processors competing for the same lock.
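
A C11 sketch of a test-and-test-and-set lock with exponential backoff along these lines (the cap of 1024 and the yield fallback are illustrative):

    #include <stdatomic.h>
    #include <sched.h>        /* sched_yield */
    #include <immintrin.h>    /* _mm_pause   */

    static atomic_int lock = 0;

    static void spin_lock(void) {
        unsigned backoff = 1;                              /* initial backoff */
        for (;;) {
            /* test: spin on a plain load so the cache line is not bounced */
            while (atomic_load_explicit(&lock, memory_order_relaxed)) {
                for (unsigned i = 0; i < backoff; i++)
                    _mm_pause();
                if (backoff < 1024) backoff *= 2;          /* double, capped */
                else sched_yield();                        /* past the cap, back off to the scheduler */
            }
            /* test-and-set: only now pay for the atomic exchange */
            if (!atomic_exchange_explicit(&lock, 1, memory_order_acquire))
                return;
        }
    }

    static void spin_unlock(void) {
        atomic_store_explicit(&lock, 0, memory_order_release);
    }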

95% confidence
A

Auto-vectorization typically yields 5-10x speedup for embarrassingly parallel computations where you apply elementwise functions to arrays. The theoretical maximum is the SIMD width (4x for SSE floats, 8x for AVX floats, 16x for AVX-512 floats), but practical gains are limited by memory bandwidth, alignment overhead, and remainder loop handling. Memory-bound operations may see only 2-3x improvement regardless of SIMD width because the bottleneck shifts to memory bandwidth rather than compute throughput.

95% confidence
A

The ridge point is calculated as: Peak_Performance(FLOP/s) / Peak_Bandwidth(bytes/s). This gives the minimum operational intensity (FLOP/byte) needed to achieve peak compute performance. For example: NVIDIA A100 with 19,500 GFLOPS and 1,555 GB/s bandwidth has ridge point of 19500/1555 = 12.5 FLOP/byte. Code with operational intensity below the ridge point is memory-bound; above it is compute-bound. Typical ridge points: CPU ~1-4 FLOP/byte, GPU ~10-50 FLOP/byte. Optimize memory access for memory-bound kernels; optimize compute for compute-bound.

95% confidence
A

Modern out-of-order CPUs can hide latency for approximately 100-200 instructions in the reorder buffer (ROB), which translates to roughly 50-100 cycles of work. Intel Skylake has a 224-entry ROB; AMD Zen 3 has 256 entries. This means out-of-order execution can effectively hide L1 misses that hit in L2 (~12 cycles) but struggles with DRAM latency (200+ cycles). To help the CPU hide memory latency: ensure there are enough independent instructions between loads and their uses, use software prefetching for predictable access patterns, and unroll loops to expose more instruction-level parallelism.

95% confidence
A

System call overhead is 100-1000 cycles on modern Linux (1000-5000 cycles on Windows). Breakdown: mode switch (user to kernel): 50-150 cycles; syscall dispatch and validation: 100-300 cycles; actual work varies by call; return (kernel to user): 50-150 cycles. Mitigation: batch operations (one write of 1MB vs 1000 writes of 1KB), use memory-mapped I/O to avoid read/write syscalls, use the vDSO for time queries (gettimeofday), buffer I/O in userspace. KPTI (the Meltdown page-table-isolation mitigation) increased syscall cost by 100-300 cycles due to page table switching.

95% confidence
A

Start with an unroll factor of 4x for most loops. This provides a good balance between reducing loop overhead and avoiding instruction cache pressure. For SIMD-optimized code, unroll by the SIMD width or multiples of it (e.g., 4x for SSE with floats, 8x for AVX with floats, 16x for AVX-512). Factors of 2x or 4x typically see speed improvements, while going beyond 8x often shows diminishing returns and can hurt performance due to increased code size and instruction cache misses.

95% confidence
A

The approximate latency ratio is L1:L2:L3:DRAM = 1:3:10:60 (in terms of L1 as baseline). Concrete numbers at 3GHz: L1 = 4 cycles (1.3 ns), L2 = 12 cycles (4 ns), L3 = 40 cycles (13 ns), DRAM = 240 cycles (80 ns). This ~60x difference between L1 and DRAM is the 'memory wall'. Bandwidth ratio is similar: L1 can deliver ~1-2 TB/s, L2 ~500 GB/s, L3 ~200 GB/s, DRAM ~50-100 GB/s. Understanding this hierarchy is crucial for cache optimization - each level miss costs roughly 3-10x more than the previous level hit.

95% confidence
A

A function call costs approximately 15-25 cycles on modern CPUs, equivalent to 3-4 simple assignments: call instruction (~1-2 cycles), stack frame setup (push rbp, mov rbp,rsp: ~2 cycles), parameter passing (varies), return (pop, ret: ~2-3 cycles), plus potential pipeline disruption. Indirect function calls (through pointers/vtables) cost 3-4x more due to branch prediction miss potential. For small functions called millions of times, this overhead can dominate. Inline functions or link-time optimization (LTO) eliminates this overhead. Profile before optimizing - overhead only matters for very small, frequently-called functions.

95% confidence
A

Software pipelining (overlapping iterations) provides 15-30% speedup on in-order cores and smaller arrays. Tests show: for arrays fitting L2 cache, software pipelining gives 18.8-28.8% speedup; unroll-and-interleave (UAI) gives 14.2-21.8% speedup on in-order cores. On out-of-order cores, these techniques provide minimal benefit because the hardware already performs dynamic instruction scheduling. Software pipelining works by splitting loop work into phases (load, compute, store) and overlapping phases from different iterations to hide latencies and enable dual-issue on simple processors.

95% confidence
A

SIMD vectorization typically becomes worthwhile when processing at least 4x the SIMD width elements, so: SSE (128-bit): minimum 16 floats or 16 integers; AVX (256-bit): minimum 32 floats or 32 integers; AVX-512 (512-bit): minimum 64 floats or 64 integers. Below these thresholds, the overhead of setup, remainder handling, and potential alignment adjustments may exceed the parallel processing gains. For very small arrays with unknown size at compile time, the scalar version may actually be faster due to branch overhead for remainder loops.

95% confidence
A

Order struct fields by: 1) Access frequency (hot fields first), 2) Access pattern (fields accessed together should be adjacent), 3) Size descending (reduces padding). Keep hot fields within first 64 bytes (one cache line). Group read-only fields separately from read-write to prevent false sharing. For arrays of structs vs struct of arrays (AoS vs SoA): use AoS when accessing all fields per element, SoA when accessing one field across all elements. Typical optimization: place most-accessed 2-3 fields at struct start, ensuring they fit in first cache line load.
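
An illustrative C layout following these rules (field names hypothetical):

    #include <stdint.h>

    /* Hot, co-accessed fields first so they share the first 64-byte line;
       cold bookkeeping afterwards; sizes roughly descending to limit padding. */
    struct order {
        /* hot: touched on every lookup/update */
        uint64_t id;
        double   price;
        uint32_t quantity;
        uint32_t flags;
        /* cold: touched only for auditing/debugging */
        uint64_t created_ns;
        char     note[40];
    };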

95% confidence
A

Indirect function calls (through pointers or vtables) are typically 2-4x slower than direct calls. One benchmark showed indirect calls running 3.4x slower. The performance hit comes from: inability to inline, branch prediction miss on first call to new target, and additional memory load to fetch function address. Virtual function calls in C++ fall into this category. Mitigation: devirtualization through final/sealed classes, link-time optimization (LTO), profile-guided optimization (PGO), or redesigning hot paths to avoid polymorphism. Consider templates or CRTP for static polymorphism in performance-critical code.

95% confidence
A

Use static scheduling when: iterations have uniform work (e.g., array operations), and you want minimum overhead. Use dynamic scheduling when: iteration work varies significantly (e.g., sparse matrix, adaptive algorithms), at the cost of higher overhead from runtime distribution. Use guided scheduling for: load balancing with lower overhead than dynamic - starts with large chunks, shrinks toward end. Specific guidance: static has lowest overhead (0.5 microseconds), dynamic has highest (2-5 microseconds), guided is intermediate. Default chunk size for static: iterations/num_threads; for dynamic: 1 (balance) or 64-256 (reduce overhead).
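
A hedged OpenMP/C sketch of the three clauses; the chunk sizes are the starting points above, and the per-element functions are hypothetical placeholders:

    #include <omp.h>

    double expensive_irregular(double x);   /* hypothetical: cost varies widely per element */
    double moderately_variable(double x);   /* hypothetical: mildly varying cost            */

    void process(double *a, int n) {
        /* uniform work: static scheduling has the lowest overhead */
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < n; i++) a[i] *= 2.0;

        /* highly variable work: dynamic, with a chunk size to cut overhead */
        #pragma omp parallel for schedule(dynamic, 64)
        for (int i = 0; i < n; i++) a[i] = expensive_irregular(a[i]);

        /* in between: guided starts with large chunks and shrinks them */
        #pragma omp parallel for schedule(guided)
        for (int i = 0; i < n; i++) a[i] = moderately_variable(a[i]);
    }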

95% confidence
A

Hardware prefetchers typically detect strides up to 2KB-4KB and handle 8-16 concurrent streams. Intel stride prefetcher detects forward/backward strides up to 2KB; stream prefetcher handles up to 32 streams within 4KB page. For optimal prefetcher effectiveness: use strides <2KB, access no more than 8-16 distinct arrays simultaneously in hot loops, and maintain consistent access patterns (prefetchers take time to learn). When strides exceed hardware limits or patterns are irregular, use software prefetching with explicit _mm_prefetch() instructions at appropriate distances.

95% confidence
A

When branch prediction accuracy falls below 75%, branchless code (using conditional moves, SIMD masks, or arithmetic) is typically faster than branching code. At 75% prediction accuracy, the cost of mispredictions roughly equals the cost of conditional move data dependencies. Above 75% accuracy, keep the branch. Below 75%, convert to branchless. This 75% threshold is used by compilers as a heuristic for deciding whether to emit cmov instructions. Note: if data comes from slow memory (L3 or DRAM), branches may still win because early speculative loads hide latency.

95% confidence
A

x86-64 provides 16 general-purpose 64-bit registers (RAX-RDX, RSI, RDI, RBP, RSP, R8-R15), but practically only 14-15 are available for computation (RSP is the stack pointer, RBP often the frame pointer). This is double x86-32's 8 registers. Additionally, there are 16 XMM/YMM vector registers for SIMD (32 ZMM registers when AVX-512 is available). When your algorithm needs more than 12-14 variables live simultaneously, expect register spills to the stack. Loop unrolling increases register pressure - balance unroll factor against available registers to avoid costly spills inside hot loops.

95% confidence
A

Prefetch distance = ceiling(memory_latency_cycles / loop_iteration_cycles). For example, if memory latency is 200 cycles and one loop iteration takes 25 cycles, prefetch 200/25 = 8 iterations ahead. For L1 prefetch from L2, use shorter distances (e.g., 8 iterations); for L2 prefetch from memory, use longer distances (e.g., 64 iterations). Intel compilers with -O2 or higher automatically set prefetch level 3. Tuning can yield 35% or more bandwidth improvement - one test showed performance increase from 129 GB/s to 175 GB/s.
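
A C sketch using the worked example's distance of 8 iterations (treat PF_DIST as a tunable; in practice you would unroll so only one prefetch is issued per 64-byte line):

    #include <xmmintrin.h>   /* _mm_prefetch */

    #define PF_DIST 8        /* ~ memory_latency_cycles / loop_iteration_cycles */

    void scale(float *dst, const float *src, int n, float k) {
        for (int i = 0; i < n; i++) {
            /* prefetch is a hint and never faults, so running a few
               elements past the end of src is harmless */
            _mm_prefetch((const char *)&src[i + PF_DIST], _MM_HINT_T0);
            dst[i] = src[i] * k;
        }
    }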

95% confidence
A

Target L1 data cache hit rate of 95% or higher for well-optimized code. Hit rates above 80% are acceptable for general code. Below 60% indicates serious access pattern problems requiring investigation. L1 hit latency is 1-4 cycles and the miss penalty to L2 is 10-12 cycles (far more when the miss falls through to DRAM), so small hit rate changes matter: with a 1-cycle hit and a 100-cycle effective miss penalty, a 97% hit rate gives a 4-cycle average access while 99% gives 2 cycles - a 2x improvement from a 2-point increase in hit rate. Improve L1 hit rate through better spatial locality, cache blocking, and prefetching.

95% confidence
A

L1 cache access latency is 4-5 cycles on modern Intel/AMD CPUs, which translates to approximately 1-2 nanoseconds at typical clock speeds. Specifically: Intel Kaby Lake: 5 cycles / 2.5 GHz = 2 ns; Intel Haswell: 5 cycles / 2.6 GHz = 1.9 ns; AMD Zen: 4 cycles. L1 cache is the fastest memory level after registers. L1 data cache is typically 32KB per core (8-way associative), and L1 instruction cache is also typically 32KB per core. Optimizing for L1 hit rate provides the largest performance gains.

95% confidence
A

Use huge pages (2MB on x86) when: working set exceeds 4MB (1024 4KB pages), TLB miss rate is high in profiling, or memory access is scattered across large address ranges. Huge pages reduce TLB entries needed by 512x: 20MB requires only 10 huge pages vs 5120 standard pages. Best candidates: large arrays, memory-mapped files, databases, HPC applications. Enable with: Linux mmap() with MAP_HUGETLB, or transparent huge pages (THP). Benchmark first - huge pages can hurt performance for sparse access patterns due to internal fragmentation and longer page fault times.
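
A hedged Linux/C sketch of the mmap path (assumes 2MB pages have been reserved via vm.nr_hugepages; falls back to normal pages when they have not):

    #define _GNU_SOURCE
    #include <stddef.h>
    #include <sys/mman.h>

    /* Try to back a large buffer with 2MB huge pages; fall back to 4KB pages. */
    static void *alloc_huge(size_t bytes) {
        void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED)                   /* no huge pages available */
            p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        return p == MAP_FAILED ? NULL : p;
    }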

95% confidence
A

Parallel merge sort becomes beneficial when array size exceeds 10,000-100,000 elements, depending on hardware and element size. Below this threshold, spawn/join overhead exceeds parallel speedup. Rule of thumb: switch to sequential sort when subarray falls below 1000-5000 elements. This hybrid approach (parallel at top levels, sequential at leaves) provides best performance. Additional considerations: for 2 cores, threshold ~50,000; for 8 cores, threshold ~20,000; for 32+ cores, threshold can be as low as 5,000-10,000 elements. Always benchmark on target hardware.

95% confidence
A

Cache line size is 64 bytes on all modern x86/x86-64 processors (Intel and AMD since ~2005). This means memory is fetched and cached in 64-byte aligned chunks. Key implications: data structures should be sized/aligned to 64-byte boundaries for optimal access; arrays of 8-byte elements have 8 elements per cache line; false sharing occurs when different threads access different data within the same 64-byte line. Apple M-series uses 128-byte cache lines. Always pad data to avoid false sharing and align hot data to cache line boundaries.

95% confidence
A

malloc() overhead ranges from 50-100 cycles for small allocations to 1000+ cycles for large allocations requiring system calls. Each allocation involves: acquiring a global lock (in traditional allocators), searching free lists, potential memory fragmentation handling, and bookkeeping. Allocations over 64KB (varies by allocator) may trigger mmap() system calls costing thousands of cycles. Mitigation strategies: use object pools/arenas for same-size allocations, pre-allocate during initialization, use thread-local allocators (tcmalloc, jemalloc) to avoid lock contention, or use stack allocation for short-lived data.

95% confidence
A

As a starting heuristic, use OpenMP parallel loops when iteration count exceeds 1000 iterations with simple bodies, or 100 iterations with moderately complex bodies (10-100 microseconds per iteration). For array operations, parallelize when array size exceeds 100,000 elements for simple operations or 10,000 elements for complex operations. Below these thresholds, the overhead of thread management often exceeds parallel speedup. Move parallelization to outer loops when possible to reduce fork/join frequency - one study showed 'code was spending nearly half the time doing OpenMP overhead work' with inner loop parallelization.

95% confidence
A

Modern CPUs support 10-20 outstanding memory requests per core via Line Fill Buffers (LFBs) and Miss Status Handling Registers (MSHRs). Intel Skylake: 12 L1D LFBs, 16 L2 superqueue entries; AMD Zen: 22 concurrent L1D misses. This limits single-core bandwidth to: concurrent_requests * cache_line_size / memory_latency. Example: 12 requests * 64 bytes / 80ns = 9.6 GB/s max single-core bandwidth. To achieve higher bandwidth, use multiple threads or software prefetching to keep memory requests in flight. Memory bandwidth scaling often requires 4-8 cores to saturate memory controller.

95% confidence
A

A good starting prefetch distance for L1 (prefetching from L2 into L1) is 8 iterations ahead: distance = load_latency_to_cover / cycles_per_iteration. For example, if each iteration takes 7 cycles and the latency to hide is 56 cycles, use 56/7 = 8 iterations ahead. Prefetching too early wastes cache space; too late fails to hide the latency. Use compiler pragmas like '#pragma prefetch var:hint:distance' for manual tuning.

95% confidence
A

Vectorized loops should process at least 4x the vector width iterations to amortize setup and cleanup overhead. For AVX2 processing 8 floats per iteration: minimum 32 iterations; for AVX-512 processing 16 floats: minimum 64 iterations. Setup costs include: loading constants into vector registers, handling alignment, setting up masks. Cleanup handles remainder elements. For loops below threshold, consider: scalar fallback, using narrower vectors (SSE instead of AVX), or accumulating small arrays before vectorized processing. Compile-time-known small counts may benefit from full unrolling instead.

95% confidence
A

Code is memory-bound when operational intensity is below the ridge point (typically <1-4 FLOP/byte on CPUs, <10-15 FLOP/byte on GPUs). Examples: DAXPY (y=ax+y) has intensity of 2n FLOP / 24n bytes = 0.083 FLOP/byte - heavily memory-bound. SpMV (sparse matrix-vector) typically has 0.17-0.25 FLOP/byte - memory-bound. Dense matrix multiplication can achieve 2n^3 FLOP / 3n^2*8 bytes for large n, approaching 100+ FLOP/byte - compute-bound. Low-intensity kernels benefit from memory optimizations; high-intensity from compute optimizations.

95% confidence
A

Integer division is 10-30x slower than multiplication: integer multiply latency is 3-4 cycles with throughput of 1 per cycle; integer divide latency is 20-80 cycles with throughput around 0.01-0.04 per cycle (26-90 cycles between independent divisions). Optimization: replace 'x/const' with multiplication by a magic number (compilers do this automatically for constants); replace 'x % power_of_2' with 'x & (power_of_2 - 1)'; for runtime divisors, consider libdivide or caching the magic multiplier. Integer modulo has the same cost as division. Impact: a tight loop with division can be 10x slower than the equivalent multiplication.

95% confidence
A

Well-optimized multi-threaded code should achieve 75-85% of peak theoretical memory bandwidth, with 80% being a practical target. Single-threaded code typically achieves 40-60% of peak due to memory-level parallelism limitations. Measured throughput is always below theoretical maximum due to memory controller inefficiencies, DRAM refresh cycles, rank-to-rank stalls, and read-to-write turnaround penalties. If achieving less than 60% of peak bandwidth on memory-bound code, investigate poor spatial locality, cache associativity conflicts, or insufficient prefetching.

95% confidence
A

The .NET JIT compiler has a default inline threshold of 32 bytes of IL (Intermediate Language) code. Methods larger than 32 bytes IL are generally not inlined. The rationale is that for larger methods, the function call overhead becomes negligible compared to method execution time. This is a heuristic that can fail for hot methods just over the threshold. Workarounds include: using [MethodImpl(MethodImplOptions.AggressiveInlining)] attribute to hint for inlining, or manually breaking large methods into smaller ones.

95% confidence
A

OpenMP fork/join overhead is typically 1-10 microseconds per parallel region entry, depending on implementation and number of threads. For loops, this means each iteration should do at least 10-100 microseconds of work to amortize parallelization overhead. With smaller tasks, the parallel version may be slower than sequential. Rule of thumb: parallelize when total loop work exceeds 100 microseconds and individual iterations take at least 1 microsecond. For finer-grained parallelism, use static scheduling to minimize runtime overhead compared to dynamic scheduling.

95% confidence
A

Denormal (subnormal) floating-point operations can be 10-100x slower than normal operations on x86 CPUs. When results become denormal (very small numbers near zero), the CPU falls back to microcode, taking 50-200 cycles instead of 4-5 cycles. Detection: unexpected performance cliffs when values approach zero. Solutions: enable Flush-To-Zero (FTZ) and Denormals-Are-Zero (DAZ) modes via MXCSR register (_MM_SET_FLUSH_ZERO_MODE, _MM_SET_DENORMALS_ZERO_MODE), add small epsilon to prevent denormals, or redesign algorithm to avoid near-zero intermediate values.
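
A C sketch of enabling the two modes with the intrinsics mentioned above (per thread; this deliberately gives up IEEE-754 behavior for tiny values):

    #include <xmmintrin.h>   /* _MM_SET_FLUSH_ZERO_MODE     */
    #include <pmmintrin.h>   /* _MM_SET_DENORMALS_ZERO_MODE */

    static void disable_denormals(void) {
        _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);          /* denormal results -> 0 */
        _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);  /* denormal inputs  -> 0 */
    }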

95% confidence
A

Use spinlocks when: critical section is less than 1000 cycles (~0.3-0.5 microseconds), threads are unlikely to be preempted, and running on multicore system. Use mutexes when: critical section exceeds 1000 cycles, high contention is expected, or running in userspace where preemption is unpredictable. Threshold-based hybrids (like adaptive mutexes) spin for 1000-10000 CPU cycles before blocking. Key insight: spinlocks waste CPU when waiting, but avoid ~1000+ cycle context switch overhead. In userspace, pure spinlocks are usually wrong - use adaptive mutexes that spin briefly then sleep.

95% confidence
A

A good starting prefetch distance for L2 (prefetching from main memory to L2) is 64 iterations ahead. This accounts for DRAM latency of 200-400 cycles divided by typical loop iteration time. For a loop taking 5 cycles per iteration with 300-cycle memory latency, use 300/5 = 60, rounded to 64. Memory prefetch distances must be longer than L1 distances because DRAM latency is 10-20x higher than L2 latency. Benchmark with values from 32 to 128 to find optimal for your workload.

95% confidence
A

Theoretical peak bandwidth: DDR4-3200: 25.6 GB/s per channel, ~50 GB/s dual-channel; DDR5-5600: 44.8 GB/s per channel, ~90 GB/s dual-channel. Achievable bandwidth is 75-85% of peak: DDR4 dual-channel: 40-45 GB/s achievable; DDR5 dual-channel: 70-80 GB/s achievable. DDR5 doubles channels per DIMM (2 32-bit vs 1 64-bit) improving bank-level parallelism. For optimization planning, assume 40 GB/s for DDR4 systems, 70 GB/s for DDR5. Memory-bound code scales linearly with bandwidth, so DDR5 provides ~1.7x speedup for pure streaming workloads.

95% confidence
A

The 2:1 cache rule states: miss rate of a direct-mapped cache of size N equals the miss rate of a 2-way set-associative cache of size N/2. This means doubling associativity is roughly equivalent to doubling cache size for reducing conflict misses. Practical implications: 8-way set associativity is nearly as effective as fully associative for most workloads; beyond 8-way, diminishing returns set in. When analyzing cache performance, increasing associativity helps with conflict misses but not capacity misses. For software optimization, focus on reducing working set size rather than worrying about associativity.

95% confidence
A

Use conditional move (cmov) or SIMD min/max instructions when branches would be unpredictable. Branch-free min: 'min = y ^ ((x ^ y) & -(x < y))' or compiler intrinsics '_mm_min_ps'. Cost: cmov is 1-2 cycles vs potential 15+ cycles for mispredicted branch. However, cmov creates data dependency while branch allows speculative execution. Rule: use branchless when prediction accuracy <75%, or always for SIMD code (no branching within vector). Modern compilers often generate cmov for simple ternary operators at -O2; use '-fno-if-conversion' to force branches if needed.
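
The three forms in C for integer and float minimum (the bit trick is the expression quoted above):

    #include <xmmintrin.h>   /* _mm_min_ps */

    /* Mask trick: when x < y the mask is all ones and the XORs select x. */
    static inline int min_bits(int x, int y) {
        return y ^ ((x ^ y) & -(x < y));
    }

    /* A plain ternary usually becomes cmov at -O2; check the assembly. */
    static inline int min_ternary(int x, int y) { return x < y ? x : y; }

    /* SIMD: four branch-free minimums at once. */
    static inline __m128 min4(__m128 a, __m128 b) { return _mm_min_ps(a, b); }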

95% confidence
A

Branch misprediction costs 10-30 cycles on modern x86-64 processors, depending on pipeline depth. AMD Zen 2 has a 19-cycle pipeline, so misprediction costs approximately 19 cycles. Intel processors with deeper pipelines may cost up to 20-25 cycles. This penalty equals the number of pipeline stages from fetch to execute that must be flushed and refilled. For loops with unpredictable branches, this can multiply running time significantly - converting to branchless code can reduce per-element time from 14 cycles to 7 cycles in some cases.

95% confidence
A

L2 cache access latency is 10-14 cycles on modern CPUs, approximately 4-5 nanoseconds. Specifically: Intel Kaby Lake: 12 cycles (4.8 ns at 2.5 GHz); Intel Haswell: 11 cycles (4.2 ns at 2.6 GHz); AMD Zen: ~12 cycles. L2 is about 3-4x slower than L1 but holds 8-32x more data (typically 256KB-1MB per core). L2 cache is typically unified (both instructions and data) and 4-8 way set associative. For algorithms with working sets between 32KB and 256KB, L2 hit rate is the critical performance metric.

95% confidence
A

Use Array of Structures (AoS) when: accessing all/most fields of each element together, iterating through elements with good spatial locality, or element-wise operations are common. Use Structure of Arrays (SoA) when: accessing only 1-2 fields across many elements, SIMD vectorization is important (SoA enables efficient vector loads), or cache utilization of accessed fields matters more than element locality. Performance difference can be 2-10x depending on access pattern. Consider hybrid AoSoA (Array of Structures of Arrays) for balanced access patterns with SIMD requirements.
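
The two layouts in C for a hypothetical particle record:

    #define N 100000

    /* AoS: each update touches x, y, z and mass of one particle together. */
    struct particle { float x, y, z, mass; };
    struct particle particles_aos[N];

    /* SoA: a loop over just x (or just mass) reads contiguous memory and
       maps directly onto SIMD loads. */
    struct particles_soa {
        float x[N];
        float y[N];
        float z[N];
        float mass[N];
    } particles_soa;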

95% confidence
A

Modern reorder buffer (ROB) sizes: Intel Skylake/Ice Lake: 224-352 entries; AMD Zen 3/4: 256 entries; Apple M1/M2: 600+ entries. The ROB limits how far ahead the CPU can execute speculatively. For hiding latency, ensure there are enough independent instructions to fill the ROB before hitting a long-latency operation. Example: with 200-entry ROB and 4-wide issue, ~50 cycles of independent work can be found. If your loop has only 20 instructions and one memory access per iteration, you need the loop running 10+ iterations ahead to fill the window.

95% confidence
A

GCC's default inline limit is 600 pseudo-instructions for functions explicitly marked inline (controlled by the -finline-limit option and the underlying max-inline-insns-* parameters). For auto-inlining at -O2/-O3, functions up to about 40-50 instructions may be inlined based on various heuristics. The 'pseudo-instruction' count is an abstract measure that may change between GCC versions and does not directly map to assembly instructions. Functions called only once are more aggressively inlined regardless of size. Use -Winline to get warnings when inline requests are denied due to size or other factors.

95% confidence
A

Target IPC of 2-4 for general-purpose code on modern superscalar CPUs. Modern wide-issue processors can achieve IPC of 4-6 in ideal conditions with deep pipelines and superscalar execution. Apple M-series chips can exceed IPC of 3 in floating-point intensive tasks. An IPC below 0.7 indicates significant optimization opportunity - the code is likely memory-bound or suffering from pipeline stalls. Memory-bound code typically shows IPC of 0.5-1.0, while compute-bound well-optimized code should achieve IPC of 2.0 or higher.

95% confidence
A

For L1 cache blocking, use approximately sqrt(L1_size/3) elements. For a typical 32KB L1 data cache with 4-byte floats, this gives sqrt(32768/3/4) = approximately 52 elements, or roughly 50-100 elements per dimension for 2D blocking. The factor of 3 accounts for multiple arrays (input, output, temporary) that need to fit simultaneously. Always ensure the total working set of your blocked computation fits within L1 with room for other data the processor needs.
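
A minimal C tiling sketch; BLK = 48 is near the sqrt(L1/3) estimate above (three 48x48 float tiles are about 27KB, inside a 32KB L1), and the remainder case is omitted for brevity:

    #define BLK 48   /* ~sqrt(32768 / 3 / sizeof(float)), rounded down */

    /* C += A * B for n x n row-major matrices, assuming n % BLK == 0. */
    void matmul_blocked(int n, const float *A, const float *B, float *C) {
        for (int ii = 0; ii < n; ii += BLK)
            for (int kk = 0; kk < n; kk += BLK)
                for (int jj = 0; jj < n; jj += BLK)
                    /* one BLK x BLK tile of each matrix is live here */
                    for (int i = ii; i < ii + BLK; i++)
                        for (int k = kk; k < kk + BLK; k++) {
                            float a = A[i * n + k];
                            for (int j = jj; j < jj + BLK; j++)
                                C[i * n + j] += a * B[k * n + j];
                        }
    }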

95% confidence
A

Throughput (instructions per cycle) on modern x86: simple ALU (add, sub, logical): 4-6 per cycle; complex ALU (multiply): 1-2 per cycle; integer divide: 0.03-0.1 per cycle (10-30 cycles latency); FP add/multiply: 2 per cycle; FP divide: 0.2-0.5 per cycle; loads: 2-3 per cycle (L1 hit); stores: 1-2 per cycle. These are throughput limits - actual IPC depends on dependencies. Key insight: division is 10-100x more expensive than multiplication; replace 'x/const' with 'x * (1/const)' where possible. Measure instruction mix to understand bottlenecks.

95% confidence
A

Use the widest SIMD available that doesn't cause frequency throttling or portability issues: AVX-512 (512-bit): use when sustained compute-heavy, accept ~10-15% frequency reduction on some Intel CPUs; AVX2 (256-bit): best default choice, supported since Haswell 2013, no frequency penalty; SSE (128-bit): use for maximum compatibility or when code has many scalar operations mixed in. Process data in multiples of SIMD width to avoid remainder loops. For portable code, compile with multiple paths and runtime dispatch based on CPUID.
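
A sketch of runtime dispatch using the GCC/Clang __builtin_cpu_supports builtin; the kernel variants are hypothetical and assumed to be compiled elsewhere with the matching -m flags:

    void kernel_avx2(float *dst, const float *src, int n);   /* built with -mavx2 */
    void kernel_sse2(float *dst, const float *src, int n);   /* baseline x86-64   */

    void kernel(float *dst, const float *src, int n) {
        if (__builtin_cpu_supports("avx2"))
            kernel_avx2(dst, src, n);
        else
            kernel_sse2(dst, src, n);
    }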

95% confidence
A

For L2 cache blocking, use approximately sqrt(L2_size/3) elements. For a typical 256KB L2 cache with 4-byte floats, this gives sqrt(262144/3/4) = approximately 148 elements, or roughly 128-256 elements per dimension for 2D blocking. For 1MB L2 cache, target around 300 elements. L2 blocking is typically used as an outer loop around L1 blocking to create a two-level tiled algorithm that maximizes data reuse at both cache levels.

95% confidence
A

SIMD string operations (strlen, memcmp, memcpy, strchr) become beneficial for strings longer than 16-32 bytes when using SSE, or 32-64 bytes for AVX. Below these lengths, scalar loops with branch prediction for early termination often win. Modern glibc/MSVC runtime libraries automatically dispatch to SIMD versions for larger strings. For custom implementations: SSE can process 16 bytes per iteration, AVX 32 bytes, with ~2 cycle per vector comparison. For memcpy specifically, SIMD helps above 64 bytes; for <64 bytes, use rep movsb (enhanced on recent CPUs) or unrolled scalar moves.

95% confidence
A

Misaligned access crossing a cache line boundary costs 16 cycles on Intel Atom (vs 4 cycles for aligned) - a 4x penalty. On modern Core i7 (Sandy Bridge and newer), there is no measurable penalty for misaligned access that doesn't cross cache lines. However, access spanning two cache lines always incurs double the memory traffic and potential 2x latency. Rule of thumb: always align data to its natural size (4-byte ints to 4-byte boundaries, 8-byte doubles to 8-byte boundaries), and align hot data structures to 64-byte cache line boundaries to ensure single-line access.

95% confidence

Optimization Decision Trees

61 questions
A

Use loop unrolling over vectorization when: 1) Loop body has complex control flow with data-dependent branches that prevent vectorization, 2) Operations are not amenable to SIMD (irregular memory access, non-contiguous data), 3) Loop iteration count is small and fixed (4-16 iterations) making SIMD setup overhead dominate, 4) You need to reduce loop overhead but data dependencies prevent parallel execution. Keep vectorization when: loop body is simple arithmetic on contiguous arrays, iteration count is large (>64), and operations map directly to SIMD instructions.

95% confidence
A

Use fixed-point when: 1) Target has no FPU or weak FPU (embedded, older ARM), 2) Deterministic results required across platforms, 3) Values have known bounded range, 4) Converting to/from FP would be in hot path anyway, 5) SIMD integer path is faster than FP on your hardware. Use floating-point when: 1) Dynamic range needed (values span orders of magnitude), 2) Modern CPU with fast FP (desktop/server), 3) Precision requirements beyond what fixed-point can offer, 4) Algorithms assume IEEE semantics, 5) Using libraries that expect FP. Fixed-point overhead: shift operations, range checking, more complex code.

95% confidence
A

Use virtual functions when: 1) Types are determined at runtime (plugins, user input), 2) Open extension is needed (new derived classes added without recompiling), 3) Collection of mixed types processed uniformly, 4) Overhead is acceptable (~15-25 cycles indirect call). Use static polymorphism (templates, CRTP) when: 1) Types are known at compile time, 2) Hot path where virtual call overhead matters, 3) Want inlining and further optimization, 4) Binary size is less concern than performance. Hybrid: use virtual dispatch at high level, template for inner loops. Virtual call overhead: indirect branch + possible icache miss for vtable.

95% confidence
A

Use software pipelining when: 1) Loop body has long-latency operations (loads, multiplies, divides), 2) Operations in different iterations are independent, 3) You can overlap load/compute/store from different iterations, 4) Loop runs many iterations (pipeline fill/drain overhead amortized), 5) Register file is large enough to hold multiple iterations in flight. Use simple unrolling when: 1) Operations are short-latency (simple ALU), 2) Goal is primarily reducing loop overhead (branch, counter update), 3) Few registers available (can't keep multiple iterations live), 4) Loop has loop-carried dependencies that prevent overlapping.

95% confidence
A

Use intrinsics when: 1) Compiler fails to vectorize (check assembly), 2) Need specific instruction sequences compiler won't generate, 3) Algorithm requires precise control over SIMD operations, 4) Performance is critical and you can invest in manual optimization, 5) Using advanced features (shuffles, gathers) that compilers handle poorly. Rely on auto-vectorization when: 1) Code is straightforward loops over arrays, 2) Portability across ISAs matters (auto-vec adapts), 3) Compiler does good job (verify with -fopt-info-vec or assembly), 4) Maintenance cost of intrinsics is prohibitive, 5) Code changes frequently (intrinsics require rework).

95% confidence
A

Use software prefetch when: 1) Access pattern is predictable to you but not to hardware (pointer chasing, indirect indexing), 2) Stride is larger than hardware can detect (often >2KB), 3) Access pattern changes rapidly (hardware needs training time), 4) Working on linked structures (trees, graphs) with known traversal order. Rely on hardware when: 1) Sequential or small-stride access (hardware handles this well), 2) Pattern is simple enough for prefetcher to learn, 3) Code needs to be portable (software prefetch effectiveness varies by CPU), 4) Don't want prefetch overhead in non-hot paths.

95% confidence
A

Use eager evaluation when: 1) Result will definitely be used, 2) Computation is cheap relative to tracking laziness, 3) Memory for intermediate results is acceptable, 4) Want predictable timing (no surprise delays), 5) Debugging is easier with immediate execution. Use lazy evaluation when: 1) Result may not be needed (conditional use), 2) Computation is expensive and avoidable, 3) Working with infinite or large sequences, 4) Building composable pipelines (filter, map, reduce), 5) Memory-constrained environment. Overhead: lazy evaluation adds thunk/closure creation cost. Don't use lazy for simple operations that will always execute.

95% confidence
A

Use branchless (predication, conditional moves) when: 1) Branch is unpredictable (misprediction rate >15-20%), 2) Both paths are cheap (< 5-10 cycles combined), 3) No side effects occur from speculatively computing wrong path, 4) Code is in a hot loop executed millions of times. Keep branching when: 1) Branch is highly predictable (>90% one direction), 2) One path is significantly more expensive than the other, 3) Skipped path has side effects (memory writes, I/O, exceptions), 4) Branchless version requires many more instructions, negating the benefit.

95% confidence
A

Use SIMD compress/expand (AVX-512 VPCOMPRESSD/VPEXPANDD) when: 1) Filtering arrays based on condition (sparse to dense or vice versa), 2) AVX-512 is available with good performance, 3) Processing large arrays where SIMD overhead is amortized, 4) Selectivity is moderate (10-90% kept). Use scalar when: 1) No AVX-512 or using AVX2 (emulation is complex and slow), 2) Selectivity is extreme (nearly all kept or nearly all filtered), 3) Small arrays where SIMD setup dominates, 4) Need portable code. AVX2 workaround: use pext/pdep for compression but slower than native AVX-512.

95% confidence
A

Cache-oblivious wins when: 1) Multiple cache levels exist and tuning for one hurts others, 2) Actual cache available varies (shared with other processes), 3) Data size varies across calls (one tuned block size doesn't fit all), 4) Virtual memory paging matters (cache-oblivious often optimizes for disk too), 5) TLB pressure is significant (cache-oblivious recursive structure often has better locality). Cache-aware typically wins when: single dominant cache level, dedicated cores, known data sizes, ability to tune extensively. Hybrid approach: cache-aware at top level, cache-oblivious for base cases.

95% confidence
A

Use reader-writer locks when: 1) Reads significantly outnumber writes (>10:1 ratio), 2) Read critical section is long enough that contention matters, 3) Multiple concurrent readers provide measurable benefit, 4) Write operations are infrequent. Use plain mutex when: 1) Read/write ratio is low or operations are short, 2) RW lock overhead exceeds benefit (RW locks are more complex), 3) Writes are frequent (writers starve with many readers), 4) Single-threaded read performance is adequate. Warning: RW locks can have writer starvation; use fair variants if writes must make progress. Uncontended RW lock is slower than uncontended mutex (~2-3x).

95% confidence
A

Prefer vertical (lane-parallel) operations when: 1) Processing independent data streams, 2) Same operation applies to all elements, 3) Data is naturally packed by operation type (SoA layout). Use horizontal (cross-lane) operations when: 1) Computing reductions (sum, min, max across vector), 2) Data arrives in AoS format requiring field extraction, 3) Shuffling/permuting data between lanes, 4) Dot products of small vectors. Minimize horizontal ops because they're typically 3-10x slower than vertical. Restructure algorithms to batch horizontal ops or convert to vertical form.

95% confidence
A

Apply false sharing avoidance when: 1) Different threads frequently write to adjacent memory locations, 2) Performance counters show high L1D cache misses or coherence traffic, 3) Scaling is poor despite independent work, 4) Data structures are arrays of small objects accessed by thread index. Techniques: 1) Pad structures to cache line (64 bytes typical), 2) Use alignas(64) on per-thread data, 3) Separate hot and cold data, 4) Use thread-local storage instead of shared array. Cost: increased memory usage. Don't pad everything; profile first to identify actual false sharing hotspots.

95% confidence
A

Combine unrolling with vectorization when: 1) SIMD width doesn't fully utilize execution units (unroll 2-4x SIMD operations to hide latency), 2) Loop has multiple independent SIMD operations that can execute in parallel, 3) Memory bandwidth is not the bottleneck and CPU has multiple vector execution units, 4) Unrolling enables better instruction scheduling between vector operations. Avoid combining when: memory bandwidth is saturated, register pressure is already high (causes spills), or loop body is already complex enough that compiler cannot schedule efficiently.

95% confidence
A

Use branch-free (conditional move, SIMD min/max) when: 1) Comparisons are unpredictable (random data), 2) Comparing many independent pairs (SIMD opportunity), 3) Code is in hot loop with high iteration count, 4) Both values are already in registers. Use branching when: 1) One outcome is much more likely (>90%), 2) Computing unused value is expensive, 3) Single comparison (setup overhead of branchless not amortized), 4) Comparison involves memory that doesn't need to be loaded if branch skipped. Note: compilers often generate cmov automatically; check assembly before manual optimization.

95% confidence
A

Use strength reduction when: 1) Computation can be converted to cheaper operation (multiply->shift, divide->multiply by reciprocal), 2) Lookup table would be large (>L1 cache, causing misses), 3) Memory bandwidth is the bottleneck, 4) Input values are not bounded to small range. Use lookup tables when: 1) Computation is expensive (trig functions, complex formulas), 2) Input domain is small enough for table to fit in cache (<4KB for L1), 3) Table has good temporal locality (values reused), 4) Computation cannot be simplified algebraically, 5) Precision requirements allow table interpolation.

95% confidence
A

Use compiler vector extensions (GCC vector types, Clang ext_vector_type) when: 1) Want portable SIMD across x86/ARM/etc, 2) Operations are standard arithmetic (+, -, *, /), 3) Compiler can optimize vector operations well, 4) Don't need specific instruction control. Use explicit intrinsics when: 1) Need instructions without vector extension equivalent (shuffles, special math), 2) Targeting specific microarchitecture optimizations, 3) Compiler generates suboptimal code from vector types, 4) Need precise control over instruction selection. Hybrid works: use vector types for common ops, intrinsics for specialized operations.
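
A short GCC/Clang vector-extension sketch (vector_size is a compiler extension, not ISO C); the same source can build to AVX on x86 or NEON on ARM:

    typedef float v8f __attribute__((vector_size(32)));   /* 8 floats per vector */

    /* y += a * x, eight lanes per iteration; nvec counts whole vectors. */
    void saxpy_vec(v8f *y, const v8f *x, int nvec, float a) {
        v8f av = {a, a, a, a, a, a, a, a};   /* broadcast the scalar */
        for (int i = 0; i < nvec; i++)
            y[i] += av * x[i];
    }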

95% confidence
A

Prefetch distance calculation: cycles_ahead = memory_latency / cycles_per_iteration. Typical values: L2 prefetch (50-100 cycles ahead), main memory (200-400 cycles ahead). Practical guidance: 1) Start with 16-64 cache lines ahead for main memory, 2) For L2, 4-16 lines ahead, 3) Adjust based on iteration time (faster iterations need more lookahead), 4) Too close: data not ready in time, 5) Too far: prefetched data evicted before use. Optimal distance depends on: memory latency, cache sizes, iteration cost, contention. Always profile: wrong distance can hurt due to cache pollution and prefetch instruction overhead.

95% confidence
A

Use speculative execution when: 1) Speculation is cheap and usually correct (>70% hit rate), 2) Recovery from wrong speculation is fast, 3) Latency is more important than throughput, 4) Parallel resources are available for speculation, 5) Verification can happen in parallel with dependent work. Wait for conditions when: 1) Speculation is often wrong (<50% success), 2) Wrong speculation has side effects that are hard to undo, 3) Resources are scarce (speculation wastes them), 4) Speculative work is expensive relative to wait time, 5) Correctness is paramount. Examples: branch prediction (CPU), request speculation (databases), prefetching (memory subsystem).

95% confidence
A

A branch is predictable enough when: 1) Pattern repeats regularly (TTTTTTTT or TFTFTFTF), 2) Same direction taken >90% of the time, 3) Pattern fits in branch history table (typically 2-4K entries), 4) Loop-carried pattern with fixed iteration count. Measure with CPU performance counters (branch-misses event). Rule of thumb: >5% misprediction rate on a hot branch warrants considering branchless. Modern predictors handle: nested loops, simple alternating patterns, correlated branches. They struggle with: random patterns, data-dependent branches with high entropy, very long patterns.

95% confidence
A

Use row-major when: 1) Language convention is row-major (C, C++, Python/NumPy default), 2) Algorithms traverse rows (image processing row by row), 3) Interfacing with row-major libraries (most C libraries). Use column-major when: 1) Language convention is column-major (Fortran, MATLAB, Julia), 2) Matrix operations are column-oriented (solving linear systems), 3) Using BLAS/LAPACK (optimized for column-major). Key insight: match storage to access pattern. If you iterate over columns but store row-major, you get cache misses on every access. Profile memory access patterns, not just language defaults.
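
The access-pattern point in C (row-major): the first loop nest walks memory sequentially, the second jumps a full row per step and misses the cache:

    #define ROWS 1024
    #define COLS 1024
    static float m[ROWS][COLS];   /* C is row-major: m[i][j+1] is adjacent in memory */

    float sum_rows_fast(void) {            /* sequential, prefetcher-friendly */
        float s = 0.0f;
        for (int i = 0; i < ROWS; i++)
            for (int j = 0; j < COLS; j++)
                s += m[i][j];
        return s;
    }

    float sum_cols_slow(void) {            /* 4KB stride per access */
        float s = 0.0f;
        for (int j = 0; j < COLS; j++)
            for (int i = 0; i < ROWS; i++)
                s += m[i][j];
        return s;
    }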

95% confidence
A

Horizontal reduction is acceptable when: 1) Performed once after processing many elements (amortized), 2) Reduction is final result, not intermediate in hot loop, 3) Alternative scalar code would require loading each element individually, 4) Using efficient reduction patterns (pairwise for accuracy, tree for speed). Avoid horizontal reduction when: 1) Inside inner loop (restructure to accumulate vertically, reduce once), 2) SIMD width is very large (AVX-512 reduction is expensive), 3) Reduced value feeds back into next SIMD iteration (creates dependency). Cost: ~3-5 cycles for 128-bit, ~5-8 for 256-bit, ~10-15 for 512-bit reductions.

95% confidence
A

Use NUMA-aware allocation when: 1) Running on multi-socket system (check with numactl --hardware), 2) Data is accessed primarily by specific threads that can be pinned to nodes, 3) Memory bandwidth is the bottleneck, 4) Dataset is large enough that cross-node traffic matters (>L3 cache). Use default allocation when: 1) Single-socket system, 2) Data is accessed by all threads equally, 3) Application is not memory-bandwidth bound, 4) Thread-to-core mapping is dynamic. NUMA overhead: 50-100% latency penalty for remote access, 50% bandwidth reduction. First-touch policy: allocate in thread that will primarily use the data.

95% confidence
A

Choose cache tiling when: 1) Data access pattern is predictable but reuses data multiple times (matrix multiply, stencil codes), 2) Working set exceeds cache size but can be partitioned into cache-fitting blocks, 3) Algorithm structure allows blocking without significant code complexity, 4) Temporal locality is more important than spatial locality. Choose prefetching when: 1) Access pattern is streaming (one-pass, no reuse), 2) Memory access is predictable but spread across large address range, 3) Hardware prefetcher cannot detect the pattern (indirect access, large strides), 4) You need to hide memory latency but data doesn't fit in cache anyway.

95% confidence
A

Use switch/jump table when: 1) Cases are dense integers (0, 1, 2...), 2) Branch predictor can learn pattern (repeated same cases), 3) Need compiler to inline case bodies, 4) Dispatch is in moderately hot path. Use function pointers when: 1) Cases are sparse or non-integer keys, 2) Functions are in different compilation units (can't inline anyway), 3) Need runtime configurability (plugins, callbacks), 4) Polymorphic behavior with clear interface. Performance: dense switch compiles to jump table (~same as function pointer array), but switch allows inlining. Indirect call has ~15-25 cycle penalty if mispredicted.

95% confidence
A

Size thresholds: 1) <4KB: Likely fits in L1D cache, good for frequently accessed tables, 2) 4KB-256KB: L2 cache territory, acceptable if access has locality, 3) 256KB-8MB: L3 cache, only if access pattern has strong locality or table is shared across cores, 4) >8MB: Will cause cache misses, often slower than computation. Key factors: access pattern (random vs sequential), reuse frequency, cache contention from other data. Test empirically: if table causes >5% L1 miss rate increase, consider computation instead. Modern CPUs can often compute faster than random memory access.

95% confidence
A

Use mmap when: 1) Random access pattern (no sequential read-ahead needed), 2) Multiple processes share same file (shared mapping), 3) File fits in address space and accesses have locality, 4) Want to leverage OS page cache automatically, 5) Treating file as array simplifies code. Use read/write when: 1) Sequential processing (read-ahead optimizations), 2) Need control over buffering and read size, 3) File is huge relative to address space, 4) Processing without page fault overhead is critical, 5) File is on network filesystem (mmap semantics problematic). Note: mmap has page fault overhead (~1000+ cycles) per new page accessed.

95% confidence
A

Use multiplication by reciprocal when: 1) Dividing by same constant multiple times (compute reciprocal once), 2) Floating-point precision loss is acceptable (1-2 ULP typically), 3) Division is in hot loop (division ~15-25 cycles, multiply ~4-5), 4) Compiler doesn't auto-optimize (check assembly). Keep division when: 1) Divisor changes each iteration (reciprocal computation overhead), 2) Need exact results (financial, deterministic simulation), 3) Division by variable with potential for divide-by-zero (reciprocal of 0 = inf, different behavior), 4) Integer division (requires different approach: magic numbers, not simple reciprocal).
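
A sketch of hoisting the reciprocal out of a hot loop; note that without flags such as -ffast-math / -freciprocal-math, compilers will not make this transformation on their own because it slightly changes results:

```cpp
#include <cstddef>

// Divisor is loop-invariant: pay for one division outside the loop,
// then multiply inside it. Results may differ by a couple of ULPs.
void scale_by(float* x, std::size_t n, float divisor) {
    const float inv = 1.0f / divisor;    // one divide, outside the hot loop
    for (std::size_t i = 0; i < n; ++i)
        x[i] *= inv;                     // ~4-5 cycle multiply instead of ~15-25 cycle divide
}
```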

95% confidence
A

Use PGO when: 1) Application has stable hot paths that training can capture, 2) Representative workload is available for profiling, 3) Willing to add profiling step to build process, 4) Performance gains justify build complexity (typically 10-30% for complex code). Use generic optimization when: 1) Workload varies significantly between runs, 2) Cannot create representative training workload, 3) Build simplicity is prioritized, 4) Code is already vectorized/optimized and PGO gains are marginal. PGO helps most with: branch prediction hints, function layout, inlining decisions, register allocation. Modern equivalent: AutoFDO uses production profiles.

95% confidence
A

Inline when: 1) Function is small (<10-20 instructions), 2) Function is called in hot path, 3) Inlining enables further optimizations (constant propagation, dead code elimination), 4) Function has parameters that are often constants, 5) Call overhead (stack frame, parameter passing) is significant relative to work done. Keep as call when: 1) Function is large (inlining causes code bloat), 2) Function is called from many sites (instruction cache pressure), 3) Function is rarely called (cold path), 4) Recursion is involved, 5) Function address is taken (function pointers, callbacks).

95% confidence
A

Use SIMD masking when: 1) Edge cases are scattered throughout data (predication per lane), 2) Both paths have similar cost, 3) Using AVX-512 (first-class mask support) or AVX2 blend, 4) Branching would be unpredictable. Use branching when: 1) Edge cases cluster (process main batch then handle edges), 2) Edge path is much more expensive (division, function call), 3) Edge cases are rare (<1-5%), 4) Edge handling has side effects. Hybrid: branch on vector-level condition (any/all edge cases), then use masking within the SIMD path. This catches common all-normal case efficiently.

95% confidence
A

Inlining is hurting when: 1) Instruction cache miss rate increases significantly, 2) Binary size grows substantially (>20-30% for hot code), 3) Compile times become excessive, 4) Profile shows icache stalls in previously fast code, 5) Same function inlined at many call sites causes code duplication. Diagnose with: instruction cache miss counters (L1-icache-load-misses), comparing binary sizes before/after, profiling showing unexpected icache bottlenecks. Fix by: marking large functions with noinline attribute, using link-time optimization (LTO) which can make better decisions, reducing aggressive inline thresholds in compiler flags.

95% confidence
A

Use loop fusion when: 1) Loops iterate over same range, 2) Combined loop body fits in instruction cache, 3) Data from first loop is immediately used by second (improves locality), 4) Register pressure allows holding intermediate values. Keep loops separate when: 1) Individual loops vectorize better separately, 2) Combined loop exceeds register capacity (causes spills), 3) Loops have different optimal tiling factors, 4) First loop produces data consumed by many subsequent operations, 5) Parallelization strategy differs between loops. Profile both: fusion reduces memory traffic but can hurt vectorization.
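
A before/after sketch of fusion on two elementwise loops; the intermediate array in the separate version is exactly what the fused version keeps in a register:

```cpp
#include <cstddef>

// Separate loops: 'tmp' is written out by the first loop and re-read by the
// second, doubling the memory traffic over the array.
void two_passes(const float* a, float* tmp, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) tmp[i] = a[i] * 2.0f;
    for (std::size_t i = 0; i < n; ++i) out[i] = tmp[i] + 1.0f;
}

// Fused loop: the intermediate stays in a register, so the data is streamed
// through once. Worth it only if the fused body still vectorizes and register
// pressure stays reasonable.
void fused(const float* a, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        float t = a[i] * 2.0f;
        out[i] = t + 1.0f;
    }
}
```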

95% confidence
A

Manually unroll when: 1) Compiler doesn't unroll despite #pragma hints, 2) You need specific unroll factor for SIMD alignment, 3) Unrolling enables manual prefetching at specific offsets, 4) Profiling shows the loop is hot and compiler under-optimized, 5) You need control over instruction scheduling between unrolled iterations. Trust compiler when: 1) Loop is straightforward (simple bounds, no complex control flow), 2) You use appropriate optimization flags (-O3, -funroll-loops), 3) Profile-guided optimization (PGO) is available, 4) Code needs to be portable across compilers, 5) Loop bounds vary (compiler can handle epilogue).

95% confidence
A

Use huge pages (2MB or 1GB) when: 1) Working set is large (>100MB), 2) Access pattern is random across large range (TLB misses are bottleneck), 3) Can preallocate memory (huge pages harder to allocate dynamically), 4) Application is long-running (amortize setup cost). Use regular pages when: 1) Working set is small or has good locality, 2) Memory usage is dynamic and unpredictable, 3) Memory is shared with other processes (huge pages can cause fragmentation), 4) Using memory-mapped files (file size alignment constraints). Check TLB miss rate (perf stat -e dTLB-load-misses); if >1% and random access, try huge pages.
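
A Linux sketch using transparent huge pages via madvise(MADV_HUGEPAGE) on a large anonymous mapping; this is a hint the kernel may ignore, and explicit MAP_HUGETLB (with preallocated huge pages) is the stricter alternative:

```cpp
#include <sys/mman.h>
#include <cstddef>
#include <cstdio>

// Ask for transparent huge pages on a large, long-lived, randomly accessed
// allocation; on failure or refusal the region simply stays on 4KB pages.
void* alloc_huge(std::size_t bytes) {
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return nullptr;
    if (madvise(p, bytes, MADV_HUGEPAGE) != 0)
        std::perror("madvise(MADV_HUGEPAGE)");   // non-fatal hint
    return p;
}

int main() {
    const std::size_t bytes = std::size_t(1) << 30;   // 1GB working set
    void* p = alloc_huge(bytes);
    if (!p) return 1;
    // ... touch and use p ...
    munmap(p, bytes);
}
```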

95% confidence
A

Use streaming (non-temporal) stores when: 1) Writing large amounts of data that won't be read again soon, 2) You want to avoid polluting cache with write-only data, 3) Write bandwidth is the bottleneck and you can saturate memory bus, 4) Data size significantly exceeds LLC (typically >10x). Use regular stores with prefetching when: 1) Written data will be read back shortly, 2) Writes are scattered or small (streaming stores require aligned, sequential writes), 3) You're updating existing cached data, 4) Write combining buffers are limited and you can't fill full cache lines.
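
A sketch of non-temporal stores with AVX intrinsics, assuming a 32-byte-aligned destination that will not be read back soon; the trailing sfence orders the weakly ordered streaming stores before later operations:

```cpp
#include <immintrin.h>
#include <cstddef>

// Fill a large, write-only buffer without pulling its cache lines into the
// cache hierarchy. Streaming stores want aligned, sequential, full-line writes.
void fill_stream(float* dst, std::size_t n, float value) {
    const __m256 v = _mm256_set1_ps(value);
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8)
        _mm256_stream_ps(dst + i, v);   // bypasses the cache (write-combining)
    _mm_sfence();                       // make streamed data globally visible in order
    for (; i < n; ++i)
        dst[i] = value;                 // scalar tail uses regular stores
}
```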

95% confidence
A

Use computed goto (GCC extension) when: 1) Building high-performance interpreter (20-30% faster than switch), 2) Dispatch is the dominant cost, 3) GCC/Clang are your target compilers, 4) Willing to sacrifice portability, 5) Many opcodes with variable execution times. Use switch when: 1) Portability required (MSVC doesn't support computed goto), 2) Compiler optimizes switch well for your case, 3) Opcode count is small (<50), 4) Maintenance and readability matter more than last 20% performance. Alternative: tail-call dispatch (each handler calls next) can approach computed goto performance with better portability.
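
A toy computed-goto dispatcher, assuming GCC or Clang (label-as-value `&&label` and `goto *` are extensions); the opcode set and program are illustrative:

```cpp
#include <cstdio>

// Each handler jumps directly to the next opcode's handler, giving the branch
// predictor one indirect branch per handler instead of one shared switch branch.
enum Op { OP_INC, OP_DEC, OP_HALT };

int run(const unsigned char* code) {
    static void* dispatch[] = { &&do_inc, &&do_dec, &&do_halt };
    int acc = 0;
    const unsigned char* pc = code;

    #define NEXT() goto *dispatch[*pc++]
    NEXT();
do_inc:  ++acc; NEXT();
do_dec:  --acc; NEXT();
do_halt: return acc;
    #undef NEXT
}

int main() {
    const unsigned char prog[] = { OP_INC, OP_INC, OP_DEC, OP_HALT };
    std::printf("%d\n", run(prog));   // prints 1
}
```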

95% confidence
A

Use scalar when: 1) Processing fewer than 2-4 elements (SIMD setup/extraction overhead dominates), 2) Data requires gather/scatter that's slower than scalar loads, 3) Operations involve many branches/conditionals that mask most lanes, 4) Data alignment cannot be guaranteed and unaligned SIMD is slow, 5) SIMD version requires expensive horizontal operations (reductions across lanes). SIMD break-even points: typically 4+ floats for SSE, 8+ for AVX, 16+ for AVX-512. Exception: if scalar code is followed by SIMD, may be worth vectorizing small sizes to avoid transition penalties.

95% confidence
A

Use buffered I/O (standard) when: 1) Access pattern has temporal locality (rereading data), 2) Small random reads/writes (buffer coalescing helps), 3) Want OS to manage caching, 4) Don't need precise I/O timing control. Use direct I/O (O_DIRECT) when: 1) Implementing your own caching layer (databases), 2) Streaming large files once (avoid polluting page cache), 3) Need predictable I/O latency (no page cache eviction delays), 4) Memory is limited and cache pressure is high, 5) Benchmarking raw device performance. Direct I/O requires aligned buffers and has higher per-request overhead.

95% confidence
A

Use gather/scatter when: 1) Data layout cannot be changed (external APIs, legacy code), 2) Access pattern is truly irregular (sparse matrices, indirect indexing), 3) Gathered elements are processed enough to amortize gather cost, 4) Alternative is scalar loop with same random access pattern. Restructure data when: 1) You control the data layout, 2) Gathers would be frequent in hot path (AVX2 gather ~15-25 cycles vs 3 for packed load), 3) Same data accessed multiple times (restructure once, benefit many times), 4) Data naturally fits SoA or AoSoA without major code changes.

95% confidence
A

Use loop fission when: 1) Loop body is too large for vectorization, 2) Different parts have different optimization opportunities, 3) Register pressure causes spills in unified loop, 4) Want to parallelize parts independently, 5) Some iterations need different cache behavior (streaming vs reuse). Keep unified when: 1) Loop body has good locality that would be lost, 2) Fission would require multiple passes over data (bandwidth limited), 3) Loop carries dependencies between would-be-split parts, 4) Iteration overhead would multiply. Fission increases memory traffic but enables targeted optimization.

95% confidence
A

Use static scheduling when: 1) Work per iteration is uniform and predictable, 2) Iteration count is known at compile time, 3) You want minimal scheduling overhead, 4) Load balancing is not a concern, 5) Cache affinity matters (each thread processes same memory region). Use dynamic scheduling when: 1) Work varies significantly between iterations, 2) Iteration times are unpredictable (data-dependent), 3) Some iterations may block on I/O or locks, 4) Hardware has heterogeneous performance (power throttling, NUMA effects), 5) You're processing a work queue with varying task sizes.
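
A sketch contrasting the two OpenMP schedule clauses; the kernels and the chunk size of 16 are illustrative:

```cpp
#include <omp.h>
#include <cmath>
#include <cstddef>

// Uniform work -> static: zero scheduling overhead, stable cache affinity.
void uniform_work(double* x, std::size_t n) {
    #pragma omp parallel for schedule(static)
    for (long long i = 0; i < (long long)n; ++i)
        x[i] = std::sqrt(x[i]);                     // same cost every iteration
}

// Data-dependent work -> dynamic with a modest chunk: balances load without
// paying per-iteration scheduling cost.
void irregular_work(double* x, const int* iters, std::size_t n) {
    #pragma omp parallel for schedule(dynamic, 16)
    for (long long i = 0; i < (long long)n; ++i)
        for (int k = 0; k < iters[i]; ++k)          // cost varies per element
            x[i] = std::sqrt(x[i] + 1.0);
}
```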

95% confidence
A

Use thread-local storage when: 1) Each thread maintains independent state (counters, buffers, caches), 2) Want to eliminate synchronization overhead entirely, 3) Combining thread-local results is infrequent (aggregate at end), 4) Memory overhead of per-thread copies is acceptable. Use global with locking when: 1) Threads must see each other's updates immediately, 2) State is inherently shared (work queue, shared cache), 3) Memory is constrained (can't duplicate per thread), 4) Operations are infrequent (locking overhead acceptable). TLS is ~3 cycles on modern systems; uncontended lock is ~15-25 cycles.
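
A sketch of thread-local accumulation with a single atomic merge per thread; the counter workload is illustrative:

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Each thread bumps its own counter with no synchronization; totals are
// combined once per thread instead of paying an atomic RMW per increment.
std::atomic<long long> g_total{0};
thread_local long long t_count = 0;

void worker(int items) {
    for (int i = 0; i < items; ++i)
        ++t_count;                                               // plain increment, no contention
    g_total.fetch_add(t_count, std::memory_order_relaxed);       // one atomic per thread
}

int main() {
    std::vector<std::thread> pool;
    for (int t = 0; t < 4; ++t) pool.emplace_back(worker, 1'000'000);
    for (auto& th : pool) th.join();
    return g_total.load() == 4'000'000 ? 0 : 1;
}
```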

95% confidence
A

Use stack allocation when: 1) Size is known at compile time and small (<1MB typically), 2) Object lifetime matches function scope, 3) Performance is critical (stack allocation is essentially free, a single stack-pointer adjustment, vs ~100-1000+ cycles for malloc), 4) You want guaranteed deallocation (no memory leaks), 5) Recursive depth is bounded and known. Use malloc/heap when: 1) Size is determined at runtime or is large, 2) Object must outlive the creating function, 3) Size exceeds safe stack limits (risk of stack overflow), 4) Need to resize (realloc), 5) Shared between threads with different lifetimes. Consider alloca for dynamic stack allocation, with caution.


95% confidence
A

Use SIMD shuffles when: 1) Permutation pattern is fixed at compile time, 2) Using AVX2+ with powerful shuffle instructions (VPERM, VSHUF), 3) Need to permute within or across lanes efficiently, 4) Data is already in SIMD registers. Use table-based when: 1) Permutation varies at runtime (shuffle control loaded from a LUT), 2) Pattern doesn't map to available shuffle instructions, 3) Implementing arbitrary byte-level permutation, 4) Preprocessing time is available to build an optimal sequence. Hybrid: pshufb with a table-loaded control vector is flexible and fast when the LUT fits in cache.

95% confidence
A

Use AoS when: 1) You frequently access all fields of a single entity together, 2) Entities are accessed randomly by index, 3) Cache line utilization is good because you use most fields per access, 4) Code readability and maintainability are priorities, 5) Object-oriented design with methods operating on complete entities. Use SoA when: 1) Operations process one or few fields across many entities, 2) SIMD operations are common (vectorizing same operation across entities), 3) Different fields have different access frequencies, 4) Memory bandwidth is critical and you want to minimize cache line waste.
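
A layout sketch contrasting the two; the particle fields are illustrative:

```cpp
#include <cstddef>
#include <vector>

// AoS: natural for per-entity logic, but a loop touching only x/y drags the
// whole 32-byte struct through the cache.
struct ParticleAoS { float x, y, z, mass, vx, vy, vz, pad; };

// SoA: each field is a contiguous array, so a kernel that updates only x/y
// reads exactly the bytes it needs and vectorizes trivially.
struct ParticlesSoA {
    std::vector<float> x, y, z, mass, vx, vy, vz;
};

void advance(ParticlesSoA& p, float dt, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {   // contiguous, SIMD-friendly
        p.x[i] += p.vx[i] * dt;
        p.y[i] += p.vy[i] * dt;
    }
}
```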

95% confidence
A

Use CAS/atomics when: 1) Operation is simple (counter, flag, pointer swap), 2) Contention is low to moderate, 3) Critical section would be very short (< 100 cycles), 4) Want to avoid kernel transitions (mutex may sleep), 5) Building lock-free data structures. Use mutexes when: 1) Critical section is complex or long, 2) Need to protect multiple related operations atomically, 3) Contention is high (spinning wastes CPU), 4) Operations include blocking calls (I/O, allocation), 5) Simpler correctness reasoning needed. CAS loops can cause livelock under high contention; mutexes guarantee progress.
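
A small CAS-loop sketch (lock-free "store the maximum"), illustrating the retry pattern and why high contention makes such loops expensive:

```cpp
#include <atomic>

// Retry only when another thread raced in between the load and the CAS;
// under heavy contention each retry re-bounces the cache line, which is
// when a mutex becomes the better choice.
void atomic_max(std::atomic<int>& target, int value) {
    int current = target.load(std::memory_order_relaxed);
    while (current < value &&
           !target.compare_exchange_weak(current, value,
                                         std::memory_order_relaxed)) {
        // compare_exchange_weak reloads 'current' on failure; the loop re-checks it.
    }
}
```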

95% confidence
A

Thresholds depend on work per item: 1) Trivial work (few cycles): need millions of items, often better sequential, 2) Light work (100-1000 cycles): 10K-100K items to amortize thread overhead, 3) Medium work (1K-10K cycles): 1K-10K items sufficient, 4) Heavy work (>10K cycles): even 100 items may benefit. Key factors: thread creation/synchronization cost (~10K-100K cycles), cache effects (parallel may thrash shared cache), memory bandwidth (may saturate with few threads). Practical rule: if total work < 1 million cycles, stay sequential unless profiling shows benefit.

95% confidence
A

Use bounds checking when: 1) Input comes from untrusted sources, 2) Index computation is complex or error-prone, 3) Debugging or development builds, 4) Security is critical (buffer overflows), 5) Performance cost is negligible (cold paths, complex processing per element). Remove bounds checking when: 1) Proven safe by construction (loop from 0 to len-1), 2) Inner loop where check dominates computation time, 3) Already validated at higher level, 4) Using memory-safe language with compiler optimization. Technique: check once before loop, use unchecked access inside. Profile to confirm checking is actually the bottleneck.

95% confidence
A

Prefer AVX-512 when: 1) Algorithm is compute-bound (utilize double width), 2) Can fill all 512 bits with useful work, 3) Using mask operations (AVX-512 masks are more efficient), 4) Running on server-class CPUs with good AVX-512 support. Prefer AVX2 when: 1) Running on consumer CPUs (many throttle frequency for AVX-512), 2) Cannot fill 512 bits (wasting execution resources), 3) Memory-bound anyway (wider SIMD doesn't help), 4) Need consistent performance across CPU generations, 5) Power/thermal constraints matter. Test both: on some CPUs, AVX2 at higher frequency beats throttled AVX-512.

95% confidence
A

Use integer when: 1) Values are naturally discrete (counts, indices, flags), 2) Exact computation required (no rounding), 3) Targeting older hardware or embedded systems with weak FPU, 4) Division is common (integer division can use magic multiply, while FP divide is slow). Use floating-point when: 1) Values represent continuous quantities, 2) Range varies significantly (float handles 10^38 range), 3) Multiplication/addition dominant (modern FPUs match or exceed integer), 4) Using SIMD (FP SIMD often better supported), 5) Converting to/from int frequently (conversion has cost).

95% confidence
A

Use exceptions when: 1) Errors are exceptional (rare), 2) Multiple call levels would need to pass error codes up, 3) Constructor/destructor errors need handling (can't return codes), 4) Cleaner code logic flow is prioritized, 5) Using RAII for resource management. Use error codes when: 1) Errors are common (expected conditions), 2) Zero-cost error path is required (exceptions have throw cost), 3) Interfacing with C code, 4) Real-time constraints (exception unwinding is unpredictable), 5) Embedded systems with limited runtime. Note: modern C++ exceptions have near-zero cost when not thrown; thrown exceptions cost ~1000+ cycles for unwinding.

95% confidence
A

Choose dynamic(1) when: iterations vary wildly and any iteration could be the slowest; minimum overhead matters less than perfect load balancing. Choose dynamic(chunk) when: variation is moderate and you want to reduce scheduling overhead versus chunk=1. Choose guided when: load is somewhat imbalanced, especially when the expensive iterations fall toward the end of the range, and you want lower overhead than dynamic(1): guided starts with large chunks (roughly iterations/threads) and shrinks them geometrically, so the fine-grained tail evens out the load at the end. Be wary of guided when the first iterations are the heaviest, since the first (largest) chunk can dominate. Rule of thumb: start with guided, profile, switch to dynamic if load imbalance persists at the end of the parallel region.

95% confidence
A

Use OpenMP when: 1) Parallelism is loop-based or fork-join, 2) Want portable parallel code (supported by major compilers), 3) Incremental parallelization of existing serial code, 4) Task-based parallelism with clear dependencies, 5) Team has mixed parallel programming experience. Use manual threading when: 1) Need fine control over thread affinity/priority, 2) Parallelism pattern doesn't fit OpenMP model, 3) Building long-running thread pools with custom scheduling, 4) Integrating with existing threading framework, 5) Need deterministic thread behavior for debugging. OpenMP has ~1-5 microsecond overhead per parallel region.

95% confidence
A

Use memory pools when: 1) Allocating many objects of same size, 2) Allocation/deallocation is in hot path (pools can be <10 cycles), 3) Want to avoid fragmentation, 4) Need bulk deallocation (release entire pool at once), 5) Objects have similar lifetimes. Stick with malloc when: 1) Object sizes vary significantly, 2) Allocation is rare (cold path), 3) Memory usage is unpredictable, 4) Pool management complexity isn't worth the gain, 5) Using language with good allocator (jemalloc, tcmalloc). Pool overhead: initial setup, pool size decisions, potential memory waste from granularity.
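
A minimal fixed-size pool sketch with an intrusive free list; production pools add alignment control, growth, and thread safety, none of which is shown here:

```cpp
#include <cstddef>
#include <vector>

// Carve same-sized, pointer-aligned blocks out of one slab and keep freed
// blocks on an intrusive free list: acquire/release are a handful of
// instructions, and the whole pool is dropped at once on destruction.
class FixedPool {
public:
    FixedPool(std::size_t block_size, std::size_t count)
        : block_(round_up(block_size < sizeof(void*) ? sizeof(void*) : block_size)),
          slab_(block_ * count) {
        for (std::size_t i = 0; i < count; ++i)
            release(slab_.data() + i * block_);      // seed the free list
    }
    void* acquire() {                                // pop head of free list
        if (!head_) return nullptr;                  // pool exhausted
        void* p = head_;
        head_ = *static_cast<void**>(head_);
        return p;
    }
    void release(void* p) {                          // push onto free list
        *static_cast<void**>(p) = head_;
        head_ = p;
    }
private:
    static std::size_t round_up(std::size_t s) {     // keep blocks pointer-aligned
        return (s + alignof(void*) - 1) & ~(alignof(void*) - 1);
    }
    std::size_t block_;
    std::vector<unsigned char> slab_;
    void* head_ = nullptr;
};
```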

95% confidence
A

Use batch processing when: 1) Data arrives in discrete chunks, 2) Processing has setup overhead amortized over batch, 3) Can tolerate latency (not real-time), 4) Memory is sufficient for batch accumulation, 5) Operations benefit from sorting/grouping (reduce random access). Use streaming when: 1) Data is continuous or unbounded, 2) Latency matters (real-time requirements), 3) Memory is limited (can't buffer), 4) Operations are independent per item (no benefit from batching), 5) Want simpler programming model. Hybrid: micro-batching (small batches with low latency) combines benefits when tuned properly.

95% confidence
A

Use AoSoA when: 1) You need SIMD efficiency but also access multiple fields per entity, 2) SIMD width matches your natural processing batch size (e.g., 8 floats for AVX), 3) You want cache-friendly access while enabling vectorization, 4) Processing entities in small batches that fit SIMD registers, 5) Typical case: particle systems, ECS game engines, physics simulations. Structure: group entities into small arrays (size = SIMD width), then array of these groups. Avoid AoSoA when: random single-entity access is common, or data access patterns don't align with SIMD batching.
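
A layout sketch of AoSoA with packets sized to an 8-wide AVX register; the field names and the scalar inner loop (which a compiler can map directly onto SIMD lanes) are illustrative:

```cpp
#include <cstddef>

// Each packet holds 8 entities field-by-field: one aligned vector load grabs
// x (or vx) for a whole SIMD batch, while all fields of those 8 entities stay
// on nearby cache lines.
constexpr std::size_t W = 8;                 // SIMD width in floats (AVX)

struct ParticlePacket {
    alignas(32) float x[W];
    alignas(32) float y[W];
    alignas(32) float vx[W];
    alignas(32) float vy[W];
};

void advance(ParticlePacket* packets, std::size_t num_packets, float dt) {
    for (std::size_t p = 0; p < num_packets; ++p)
        for (std::size_t i = 0; i < W; ++i) {   // inner loop maps 1:1 to SIMD lanes
            packets[p].x[i] += packets[p].vx[i] * dt;
            packets[p].y[i] += packets[p].vy[i] * dt;
        }
}
```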

95% confidence
A

Use hardware popcount (POPCNT instruction) when: available (most x86-64 since 2008, ARM since ARMv8). Use lookup table when: 1) No hardware support, 2) Processing bytes (256-entry table fits in cache), 3) Counting many values with good locality. Use bit manipulation (SWAR) when: 1) No hardware support and memory is constrained, 2) Processing wide values where table would be too large, 3) SIMD version needed (parallel popcount across lanes). Performance: hardware ~1 cycle, table ~3-4 cycles (with hit), bit manipulation ~10-15 cycles for 64-bit. SIMD popcount via pshufb+paddb is excellent for bulk data.
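
Sketches of the hardware and SWAR paths, assuming a GCC/Clang-style __builtin_popcountll for the hardware case:

```cpp
#include <cstdint>
#include <cstddef>

// Hardware path: __builtin_popcountll compiles to a single POPCNT/CNT
// instruction when the target supports it (e.g. -mpopcnt or -march=x86-64-v2).
inline unsigned popcount_hw(std::uint64_t x) {
    return static_cast<unsigned>(__builtin_popcountll(x));
}

// SWAR fallback: pure bit manipulation, roughly a dozen ops, no table and no
// special hardware; useful on targets without a popcount instruction.
inline unsigned popcount_swar(std::uint64_t x) {
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    return static_cast<unsigned>((x * 0x0101010101010101ULL) >> 56);
}

std::size_t count_bits(const std::uint64_t* data, std::size_t n) {
    std::size_t total = 0;
    for (std::size_t i = 0; i < n; ++i)
        total += popcount_hw(data[i]);
    return total;
}
```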

95% confidence
A

Use cache-aware when: 1) Cache sizes are known at compile/deploy time, 2) You can tune parameters for specific hardware, 3) Maximum performance is critical and you can afford tuning, 4) Algorithm has natural blocking parameters (matrix operations), 5) Memory hierarchy is simple (embedded systems). Use cache-oblivious when: 1) Code must run well on diverse hardware, 2) Tuning is impractical (library code, many deployment targets), 3) Memory hierarchy is complex (multiple cache levels, NUMA), 4) Algorithm naturally decomposes recursively, 5) Portability matters more than last 10-20% performance.

95% confidence
A

Use spinlocks when: 1) Critical section is very short (<1 microsecond), 2) Running on dedicated cores (spinning doesn't steal from other work), 3) Lock holder won't be preempted (kernel or real-time context), 4) Contention is rare (uncontended fast path is few cycles). Use sleeping locks (mutex) when: 1) Critical section is longer, 2) Lock holder might sleep or be preempted, 3) Oversubscribed system (spinning wastes shared CPU), 4) Fairness matters (spinlocks often unfair), 5) Power efficiency matters (spinning prevents CPU sleep). Hybrid: spin briefly then sleep (adaptive mutex) for moderate sections.
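
A test-and-test-and-set spinlock sketch in standard C++ atomics; the yield in the wait loop stands in for a pause/cpu_relax hint, and real code would usually prefer std::mutex unless profiling says otherwise:

```cpp
#include <atomic>
#include <thread>

// Only appropriate for very short critical sections on cores that will not be
// preempted while holding the lock.
class Spinlock {
public:
    void lock() {
        while (flag_.exchange(true, std::memory_order_acquire)) {
            // Spin on a plain load first so waiting cores don't ping-pong the
            // cache line with read-modify-write traffic.
            while (flag_.load(std::memory_order_relaxed))
                std::this_thread::yield();   // or a pause/cpu_relax intrinsic
        }
    }
    void unlock() { flag_.store(false, std::memory_order_release); }
private:
    std::atomic<bool> flag_{false};
};
```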

95% confidence