Software prefetch (PREFETCH) cost: Instruction overhead: 1-4 cycles to issue. No immediate benefit - just initiates memory fetch. Prefetch must arrive before data is needed: prefetch_distance = cache_miss_latency / loop_iteration_cycles. Example: 200 cycle miss latency, 7 cycles per iteration = prefetch 28 iterations ahead. Too early: data evicted before use. Too late: no hiding of latency. Cost model: effective if hiding latency exceeds instruction overhead. For irregular access patterns, prefetch is often counterproductive. Modern CPUs have good hardware prefetchers for regular patterns - software prefetch mainly helps irregular but predictable patterns.
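A minimal sketch of placing a software prefetch in a streaming loop, using the 28-iteration distance from the example above. `__builtin_prefetch` is the GCC/Clang builtin (x86's `_mm_prefetch` is the intrinsic equivalent); the function name and loop body are illustrative, and the distance must be tuned by measurement.

```cpp
#include <cstddef>

// Sketch: process a large array while prefetching PREFETCH_DIST iterations
// ahead. 28 comes from the example above (200-cycle miss / ~7 cycles per
// iteration); the right value is workload-dependent.
constexpr std::size_t PREFETCH_DIST = 28;

void scale(float* data, std::size_t n, float factor) {
    for (std::size_t i = 0; i < n; ++i) {
        if (i + PREFETCH_DIST < n) {
            // GCC/Clang builtin; on x86 this lowers to a PREFETCHT0-style hint.
            __builtin_prefetch(&data[i + PREFETCH_DIST], /*rw=*/0, /*locality=*/3);
        }
        data[i] *= factor;   // a few cycles of work per element
    }
}
```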
Performance Cost Models FAQ & Answers
75 expert Performance Cost Models answers researched from official documentation. Every answer cites authoritative sources you can verify.
Cost Models
Modulo has the same cost as division since it uses the same hardware (DIV/IDIV produces both quotient and remainder). For 32-bit modulo: 26-35 cycles latency, throughput of one every 6 cycles on Intel. Key optimizations: (1) Modulo by power of 2 becomes AND operation (1 cycle): x % 8 = x & 7. (2) For constant divisors, compilers use multiply-by-reciprocal: about 4 cycles total (multiply + shift). (3) For runtime divisors, precompute reciprocal if dividing many values by same divisor. Avoid modulo in tight loops.
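An illustrative sketch of the three cases (function names are hypothetical); at -O2 a modern compiler applies the AND and reciprocal transformations automatically for the first two.

```cpp
#include <cstdint>

// Power-of-two modulo: typically compiled to a single AND (x & 7), ~1 cycle.
uint32_t mod8(uint32_t x)  { return x % 8; }

// Constant divisor: compiled to multiply-by-reciprocal + shift plus a small
// fixup, avoiding the 26-35 cycle DIV.
uint32_t mod10(uint32_t x) { return x % 10; }

// Runtime divisor: the compiler must emit a real DIV here.
uint32_t mod_var(uint32_t x, uint32_t d) { return x % d; }
```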
DRAM refresh overhead: Each row refreshed every 64ms (DDR4 standard). Refresh takes 350ns, blocks access to that bank. Average impact: 5-10% of bandwidth lost to refresh. Worst case: request hits refreshing bank, adds 350ns (1000 cycles) to latency. Temperature effects: above 85C, refresh rate doubles (per DDR4 spec). Cost model: for latency-sensitive code, 99th percentile latency can be 2-3x median due to refresh collisions. Mitigation: modern memory controllers schedule around refresh when possible. For real-time systems: account for refresh-induced jitter. Server-grade memory: higher refresh rates for reliability can increase performance impact.
SIMD speedup estimation: Theoretical maximum: vector_width / scalar_width. SSE: 4x for float, 2x for double. AVX: 8x for float, 4x for double. AVX-512: 16x for float, 8x for double. Real speedup factors: Vectorization overhead (setup, remainder): reduces by 10-20%. Memory-bound code: limited by bandwidth, not compute. Complex control flow: may not vectorize. Measured typical speedups: compute-bound: 60-80% of theoretical. Memory-bound: 20-50% of theoretical (bandwidth limited). Mixed: 40-70% of theoretical. Cost model: actual_speedup = min(theoretical_speedup, memory_bandwidth_ratio). For bandwidth-bound: speedup = bytes_loaded_per_scalar / bytes_loaded_per_SIMD (often ~1-2x, not 8x).
Integer to string conversion cost: Naive div/mod by 10 approach: ~26 cycles per digit (division cost). 64-bit integer (up to 20 digits): ~520 cycles worst case. Optimized approaches: lookup table for 00-99: ~10 cycles per 2 digits. Multiply by reciprocal: ~5 cycles per digit. Modern libraries (fmt, to_chars): 50-100 cycles for typical integers. Cost model: digit_count = floor(log10(n)) + 1. Naive: digit_count * 26 cycles. Optimized: digit_count * 5-10 cycles. sprintf overhead: adds 100-500 cycles for format parsing. For high throughput: use specialized integer formatting (fmt library is 5-10x faster than sprintf).
Arithmetic intensity (AI) = FLOPs / Bytes moved from memory. Count operations and data movement: FLOPs = adds + multiplies + divides + sqrts (count FMA as 2 FLOPs). Bytes = unique cache lines touched * 64 (for cold cache) or actual bytes for hot cache. Example - dot product of N floats: FLOPs = 2N (one mul + one add per element). Bytes = 2 * 4N = 8N (read two vectors). AI = 2N / 8N = 0.25 FLOPs/byte. Example - matrix multiply (NxN, naive): FLOPs = 2N^3. Bytes = 3 * 4N^2 (read two matrices, write one). AI = 2N^3 / 12N^2 = N/6 FLOPs/byte. For N=1000: AI = 166 FLOPs/byte (compute-bound).
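A worked check of the two examples above as a small program (the vector length is an arbitrary placeholder; the dot-product AI is independent of N).

```cpp
#include <cstdio>

int main() {
    // Dot product: FLOPs = 2N, bytes = 8N (two 4-byte float loads per element).
    const double n = 1'000'000.0;              // placeholder vector length
    const double dot_flops = 2.0 * n;          // one mul + one add per element
    const double dot_bytes = 2.0 * 4.0 * n;
    std::printf("dot product AI = %.2f FLOPs/byte\n", dot_flops / dot_bytes); // 0.25

    // Naive matrix multiply, N = 1000: FLOPs = 2N^3, bytes = 3 * 4N^2.
    const double N = 1000.0;
    const double mm_flops = 2.0 * N * N * N;
    const double mm_bytes = 3.0 * 4.0 * N * N; // read two matrices, write one
    std::printf("matmul AI = %.1f FLOPs/byte\n", mm_flops / mm_bytes);        // ~166.7
    return 0;
}
```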
Memory throughput at each level: L1 cache: 64-128 bytes/cycle (2 loads + 1 store of 32-64 bytes each). L2 cache: 32-64 bytes/cycle. L3 cache: 16-32 bytes/cycle per core. DRAM: ~10-20 bytes/cycle per core (but shared across all cores). Example: DDR4-3200 dual channel = 51.2 GB/s total. At 3 GHz = 17 bytes/cycle shared. Single thread typically achieves 5-8 bytes/cycle from DRAM due to limited memory-level parallelism. Cost model: for bandwidth-bound code, time = bytes / effective_bandwidth_bytes_per_cycle cycles.
L2 cache access latency is 12-14 cycles on modern CPUs (approximately 3-5 nanoseconds). L2 cache is typically 256KB-1MB per core. This is roughly 3x slower than L1 cache. For cost estimation: L2 miss but L1 hit adds about 8-10 cycles over L1 baseline. Memory access pattern analysis: if your working set fits in L2 (256KB-1MB per core), expect average latency around 12-14 cycles. L2 bandwidth is typically 32-64 bytes per cycle.
Floating-point to string is much more expensive than integer: Simple cases (small integers, simple decimals): 200-500 cycles. Complex cases (large exponents, many decimal places): 1000-5000 cycles. Full precision (17 digits for double): often requires arbitrary precision arithmetic. Algorithms: Grisu2/Grisu3: 100-300 cycles for simple cases. Ryu (newer): ~100 cycles average, handles all cases correctly. Dragonbox: similar to Ryu, different tradeoffs. Cost model: ~5-10x slower than integer conversion. printf/sprintf: adds parsing overhead, total 500-2000 cycles. For high throughput: use Ryu or Dragonbox-based libraries (10x faster than naive approaches).
Roofline model: Attainable_FLOPS = min(Peak_FLOPS, Arithmetic_Intensity * Memory_Bandwidth). Where Arithmetic_Intensity (AI) = Work (FLOPs) / Data (Bytes). Example: Peak = 200 GFLOPS, BW = 50 GB/s. If AI = 1 FLOP/byte: Attainable = min(200, 1 * 50) = 50 GFLOPS (memory-bound). If AI = 10 FLOPS/byte: Attainable = min(200, 10 * 50) = 200 GFLOPS (compute-bound). Ridge point (crossover): 200/50 = 4 FLOPS/byte. Hierarchical roofline uses different bandwidths for L1/L2/L3/DRAM to identify which cache level is the bottleneck.
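The roofline formula as a small helper, using the 200 GFLOPS / 50 GB/s numbers from the example (function name and defaults are illustrative).

```cpp
#include <algorithm>
#include <cstdio>

// Attainable performance = min(peak compute, AI * memory bandwidth).
double attainable_gflops(double ai_flops_per_byte,
                         double peak_gflops = 200.0,
                         double bandwidth_gbs = 50.0) {
    return std::min(peak_gflops, ai_flops_per_byte * bandwidth_gbs);
}

int main() {
    std::printf("AI=1:  %.0f GFLOPS\n", attainable_gflops(1.0));   // 50, memory-bound
    std::printf("AI=10: %.0f GFLOPS\n", attainable_gflops(10.0));  // 200, compute-bound
    std::printf("ridge point: %.0f FLOPs/byte\n", 200.0 / 50.0);   // 4
    return 0;
}
```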
CPU performance notation explained: Latency: cycles from instruction start to result availability. Used for dependency chains. Throughput (reciprocal): cycles between starting consecutive independent instructions. Example: 4L1T means 4 cycles latency, 1 cycle throughput (start one per cycle). 4L0.5T means 4 cycles latency, 0.5 cycle throughput (start two per cycle). Throughput 0.33 means 3 instructions per cycle. Reading Agner Fog tables: columns show latency and reciprocal throughput separately. uops.info format: similar, also shows micro-op breakdown. Cost model application: dependent chain: multiply latencies. Independent operations: divide by throughput. Real code: somewhere between, use profiling to measure actual performance.
Floating-point division is expensive and not fully pipelined: Single precision (DIVSS): Latency 11 cycles, throughput 0.33/cycle (one every 3 cycles). Double precision (DIVSD): Latency 13-14 cycles, throughput 0.25/cycle (one every 4 cycles). This is 3-4x slower than multiplication. Optimization: multiply by reciprocal when dividing by constant. For approximate division, use RCPSS (4 cycles) then Newton-Raphson refinement. For cost estimation in loops: N divisions = N * 3-4 cycles minimum (throughput bound), or N * 11-14 cycles if dependent (latency bound).
Network operation costs: Syscall overhead: 500-1500 cycles per send/recv. Localhost roundtrip: 10-50 microseconds. Same datacenter roundtrip: 0.5-2 milliseconds. Cross-datacenter: 10-100+ milliseconds. TCP overhead: connection setup 3-way handshake adds 1.5x RTT. Kernel network stack: ~10 microseconds processing per packet. Cost model: network_time = syscall + RTT + size/bandwidth. For small messages: RTT dominates. For large transfers: bandwidth dominates. Optimization: batch small messages, use TCP_NODELAY to disable Nagle, use kernel bypass (DPDK) for lowest latency. 10 Gbps network: 1.25 GB/s, or roughly 2.4 cycles per byte at 3 GHz.
Cache thrashing occurs when working set exceeds cache, causing repeated evictions: Thrashing signatures: L1 miss rate > 10%, L3 miss rate > 5%. Cost model: if working_set > cache_size, every access potentially misses. Effective latency approaches next level: L1 thrash -> L2 latency (12 cycles). L2 thrash -> L3 latency (40 cycles). L3 thrash -> DRAM latency (200 cycles). Example: matrix multiply without blocking, N=1000. Working set per row: 8KB (1000 doubles). L1 (32KB) holds 4 rows - heavy L1 thrashing. Performance: 10-50x slower than cache-blocked version. Detection: perf stat -e cache-misses,cache-references. Fix: restructure algorithm for cache-oblivious or explicit blocking to fit working set in cache.
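A sketch of the blocking fix mentioned above for row-major matrix multiply. The 64-element tile is an assumption sized so three double tiles stay cache-resident, and C is assumed zero-initialized; tune the block size by measurement.

```cpp
#include <algorithm>
#include <cstddef>

// Loop blocking (tiling): each BLOCK x BLOCK tile of A, B, and C is reused
// while it is still in cache instead of streaming whole rows repeatedly.
constexpr std::size_t BLOCK = 64;   // 64*64*8 bytes * 3 tiles ~= 96 KB (targets L2)

void matmul_blocked(const double* A, const double* B, double* C, std::size_t n) {
    for (std::size_t ii = 0; ii < n; ii += BLOCK)
        for (std::size_t kk = 0; kk < n; kk += BLOCK)
            for (std::size_t jj = 0; jj < n; jj += BLOCK)
                for (std::size_t i = ii; i < std::min(ii + BLOCK, n); ++i)
                    for (std::size_t k = kk; k < std::min(kk + BLOCK, n); ++k) {
                        const double a = A[i * n + k];
                        for (std::size_t j = jj; j < std::min(jj + BLOCK, n); ++j)
                            C[i * n + j] += a * B[k * n + j];  // tiles stay cached
                    }
}
```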
Float vs double on modern x86: Scalar operations: same latency and throughput (both use 64-bit FP unit). SIMD operations: float has 2x throughput (8 floats vs 4 doubles in AVX). Memory: float uses half the bandwidth and cache. Specific costs: float ADD/MUL: 4 cycles latency, 2/cycle throughput (same as double). Division: float ~11 cycles, double ~13-14 cycles (float slightly faster). SIMD ADD (AVX): 8 floats or 4 doubles per instruction, same latency. Cost model: memory-bound code benefits 2x from float (half the bytes). Compute-bound SIMD code: float 2x faster. Pure scalar compute: roughly equal. Choose float for performance unless precision requires double.
Virtual function call overhead is 10-20 cycles in the best case (warm cache, predicted branch). Components: (1) Load vptr from object: 4 cycles if L1 hit. (2) Load function pointer from vtable: 4 cycles if L1 hit. (3) Indirect branch: 1-3 cycles if predicted, 15-20 cycles if mispredicted. Cold cache worst case: 100+ cycles (two cache misses + misprediction). Rule of thumb: virtual calls are 3-10x slower than direct calls when cache is warm. In tight loops, devirtualization or template polymorphism can eliminate overhead entirely.
Basic function call overhead is 2-5 cycles for the CALL/RET instructions themselves, but total overhead includes: (1) Stack frame setup: PUSH RBP, MOV RBP,RSP = 2-3 cycles. (2) Register saves if needed: 1 cycle per register. (3) Stack alignment: potentially 1-2 cycles. (4) Return: POP RBP, RET = 2-3 cycles. Typical total: 5-15 cycles for small functions. However, if function is not inlined and causes instruction cache miss, add 10-50+ cycles. For tiny functions (1-3 instructions), call overhead can exceed function body cost by 5-10x.
Instruction mix by workload: Scientific/HPC: 30-50% FP operations, 20-30% loads/stores, 10-20% branches. Integer-heavy (databases): 40-50% integer ALU, 30-40% loads/stores, 10-15% branches. Web servers: heavy load/store (40-50%), many branches (15-20%), string operations. Games: 30-40% FP/SIMD, 30% loads/stores, 15-20% branches. Compilers: 50%+ branches (control flow heavy), moderate loads/stores. Cost model impact: high branch count = misprediction dominates. High load/store = cache behavior dominates. High FP/SIMD = execution unit throughput matters. Profile your workload to understand which category applies, then optimize accordingly.
Min/max operation costs: Integer: compare + conditional move = 2 cycles. Or using bit tricks: 3-4 cycles for branchless. Floating point (MINSS/MAXSS): 3-4 cycles latency. SIMD (VPMINSB, etc.): 1 cycle for packed integers, 4 cycles for packed floats. Branching min/max: 1 cycle if predicted, 15-20 cycles if mispredicted. Cost model: for predictable comparisons (e.g., clamping), branches may be faster. For random comparisons, branchless (CMOV or SIMD) wins. Reduction to find min/max of array: N comparisons, but SIMD can do 8-16 per instruction. SIMD min/max of N elements: N/16 cycles for AVX2, plus horizontal reduction (10 cycles).
Indirect branch cost depends on prediction: Predicted correctly: 1-2 cycles (similar to direct branch). Mispredicted: 15-20 cycles (full pipeline flush). Prediction accuracy depends on call site pattern: always same target = near 100% predicted. few targets, stable pattern = well predicted. many targets or random = poorly predicted. Cost model: average_cost = 1 + (1 - prediction_rate) * 15. For virtual function calls: usually well-predicted if object type is consistent. For switch statements compiled to jump tables: depends on case distribution. Profile-guided optimization (PGO) significantly improves indirect branch prediction.
Hash table lookup cost: Hash computation: 5-50 cycles depending on hash function (CRC32: ~5, SHA: ~100+). Table lookup: 4-200 cycles depending on table size vs cache. Collision handling: adds one lookup per collision. Best case (small table in L1, no collision): ~10-15 cycles. Typical case (table in L3, avg 0.5 collisions): ~50-80 cycles. Worst case (table in DRAM, several collisions): ~400+ cycles. Cost model: lookup_cost = hash_cycles + expected_probes * memory_latency_for_table_size. Keep tables sized to fit in cache when possible. For high-performance: use cache-line-aligned buckets, minimize pointer chasing.
Bitwise operation costs: AND, OR, XOR, NOT: 1 cycle latency, 3-4 per cycle throughput. Shifts (SHL, SHR, SAR): 1 cycle latency, 2 per cycle throughput. Variable shift: 1-2 cycles latency (depends on whether count is CL register). Rotate (ROL, ROR): 1 cycle latency. Population count (POPCNT): 3 cycles latency, 1 per cycle throughput. Leading zeros (LZCNT): 3 cycles latency. Bit test (BT): 1 cycle for register, more for memory. Cost model: bitwise operations are extremely cheap - essentially free compared to memory access. Use bit manipulation for flags, masks, and compact data structures. SIMD bitwise: processes 256-512 bits in 1 cycle.
Loop unrolling tradeoffs: Benefits: reduces loop overhead (test, branch) by factor of unroll. A 16x unroll eliminates ~94% of loop branches. Typical speedup: 1.2-2x for unroll factor 4-8. Costs: code size grows linearly with unroll factor. Excessive unrolling causes: instruction cache misses, defeats micro-op cache (typically holds ~1500 micro-ops). Sweet spot: unroll 4-8x usually optimal, beyond 16x rarely helps. Cost model: unrolled_cycles = base_cycles / unroll_factor + overhead. But if code exceeds L1I cache (32KB), add i-cache miss penalty (10-20 cycles per miss). Diminishing returns beyond instruction_cache_size / loop_body_size iterations.
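A sketch of a 4x manual unroll (illustrative function name); in practice compilers apply this transformation themselves at higher optimization levels, so check the generated code before unrolling by hand.

```cpp
#include <cstddef>

// 4x unroll of a simple update loop: one test-and-branch per four elements
// instead of one per element, plus a scalar remainder loop.
void axpy_unrolled(float* y, const float* x, float a, std::size_t n) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {     // unrolled body: four independent updates
        y[i]     += a * x[i];
        y[i + 1] += a * x[i + 1];
        y[i + 2] += a * x[i + 2];
        y[i + 3] += a * x[i + 3];
    }
    for (; i < n; ++i)               // remainder
        y[i] += a * x[i];
}
```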
NUMA remote memory access costs 1.5-2x more than local: Local node latency: 90-100 nanoseconds (300 cycles at 3 GHz). Remote node latency: 150-250 nanoseconds (450-750 cycles). Bandwidth reduction: 1.5-2x lower for remote access. Measured examples: local 118ns vs remote 242ns (2.0x). Under contention: remote can reach 1200 cycles vs 300 local (4x). Cost model: for NUMA-aware allocation, bind memory to same node as compute thread. Random access across NUMA domains: average_latency = local_latency * local_fraction + remote_latency * remote_fraction.
Floating-point square root cost: Single precision (SQRTSS): Latency 12-15 cycles, throughput ~0.33/cycle. Double precision (SQRTSD): Latency 15-20 cycles, throughput ~0.25/cycle. Not fully pipelined - a new sqrt can start only every 3-4 cycles in the same unit. Approximate alternatives: RSQRTSS (reciprocal sqrt): 4 cycles, ~12-bit precision. RSQRTSS + Newton-Raphson: 8-10 cycles, full precision. For cost estimation: treat sqrt like division - expensive, avoid in tight loops. If precision allows, use rsqrt approximation (3-4x faster).
System call overhead: 150-1500 cycles depending on the syscall and CPU mitigations. Minimal syscall (e.g., getpid): ~760 cycles, ~250 nanoseconds. Typical syscall (e.g., read): 1000-1500 cycles. With Spectre/Meltdown mitigations: can add 100-500 cycles. vDSO optimization (clock_gettime, gettimeofday): ~150 cycles, ~50 nanoseconds - about 10x faster than true syscall. For cost estimation: budget 500-1000 cycles per syscall in hot paths. Batch operations when possible to amortize overhead. Prefer vDSO functions for high-frequency calls.
Reduction operation costs: Scalar sum of N elements: latency-bound at ~4 cycles per add (dependency chain). Total: 4N cycles. Multi-accumulator (8 independent accumulators, enough to cover 4-cycle latency at 2 adds/cycle): throughput-bound at 2 adds/cycle. Total: N/2 cycles (8x faster). SIMD reduction (AVX with 8 floats): vector adds at 8 elements per instruction, then horizontal reduce. Vertical: N/8 vector adds = N/16 cycles. Horizontal reduction: ~10-15 cycles (shuffle + add tree). Total: N/16 + 15 cycles. Cost model: scalar_reduction = N * op_latency. Optimized: N / min(accumulators / op_latency, throughput) + horizontal_reduction. Memory-bound check: if N * element_size > L3, time = N * element_size / bandwidth.
Garbage collection pause costs: Minor GC (young generation): 1-10 milliseconds typically. Major GC (full heap): 100ms to several seconds depending on heap size. Concurrent collectors (G1, ZGC, Shenandoah): target 10ms max pause. Cost model: pause_time proportional to live_objects for copying collectors. For stop-the-world GC: pause = heap_size / scan_rate (~1GB/s typical). Rule of thumb: 1ms per 100MB of live data for minor GC. Mitigation: reduce allocation rate, tune heap size, use low-pause collectors. For latency-critical: budget 10-20% of cycle time for GC. Allocation cost: JIT-optimized bump pointer = 10-20 cycles. Measured: Java allocation can be faster than malloc.
C++ exception cost (zero-cost exception model used by modern compilers): No exception thrown: 0 cycles overhead in hot path (tables consulted only on throw). Exception thrown: extremely expensive, 10,000-100,000+ cycles. Components of throw cost: stack unwinding, destructor calls, table lookups, dynamic type matching. Exception construction: typical exception object allocation + string formatting = 1000+ cycles. Cost model: never use exceptions for control flow. Throw rate should be < 0.1% for negligible impact. Code size: exception tables add 10-20% to binary size. For performance-critical code: prefer error codes or std::expected, reserve exceptions for truly exceptional cases.
Floating-point ADD/MUL on modern CPUs: Single precision (float): Latency 3-4 cycles, throughput 2 per cycle. Double precision (double): Latency 3-4 cycles, throughput 2 per cycle. FMA (fused multiply-add): Latency 4-5 cycles, throughput 2 per cycle. Intel Skylake: ADD latency 4 cycles, MUL latency 4 cycles, FMA latency 4 cycles - all symmetric. For cost estimation: floating-point math is nearly as fast as integer math on modern CPUs. Dependency chains are the limiting factor, not throughput. Use FMA when possible (a*b+c in one instruction).
Array of Structs (AoS) vs Struct of Arrays (SoA): AoS: struct Point { float x, y, z; } points[N]; - loads 12 bytes per point, may waste cache if only accessing x. SoA: struct Points { float x[N], y[N], z[N]; }; - loads only needed components, SIMD-friendly. Cache efficiency: if using 1 field: AoS wastes 2/3 of loaded cache lines. SIMD vectorization: SoA naturally aligns for SIMD (8 x values contiguous). AoS requires gather or shuffle. Cost model: AoS processing all fields: ~equal to SoA. AoS processing one field: 3x memory bandwidth wasted (for 3-field struct). SoA with SIMD: potential 8x speedup from vectorization. Rule: use SoA for hot data processed in bulk; AoS for random access of complete records.
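A minimal illustration of the two layouts (struct and function names are hypothetical). Summing only x from the AoS layout drags y and z through the cache, while the SoA version touches only the bytes it uses and vectorizes cleanly.

```cpp
#include <cstddef>
#include <vector>

// AoS: convenient for random access to whole records.
struct PointAoS { float x, y, z; };

float sum_x_aos(const std::vector<PointAoS>& pts) {
    float s = 0.0f;
    for (const auto& p : pts) s += p.x;   // loads 12 bytes per point, uses 4
    return s;
}

// SoA: each component is contiguous and SIMD-friendly.
struct PointsSoA { std::vector<float> x, y, z; };

float sum_x_soa(const PointsSoA& pts) {
    float s = 0.0f;
    for (float v : pts.x) s += v;         // loads only the component it needs
    return s;
}
```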
Trigonometric function costs on x86: Hardware x87 (FSIN/FCOS): 50-120 cycles, variable based on input range. Library implementations (libm): typically 50-100 cycles for double precision. Fast approximations (polynomial): 10-30 cycles for single precision, reduced accuracy. SIMD vectorized (Intel SVML): 10 cycles per element for 8 floats. Cost model: use fast approximations when accuracy allows (games, graphics). Library sin/cos: assume 80 cycles per call. For bulk computation: use SIMD versions (vectorize the loop). sincos() computes both for price of one (100 cycles). Avoid sin/cos in tight loops if possible; precompute lookup tables.
TLB miss penalty ranges from 20-150 cycles depending on page table depth and caching. Best case (page table in L2): 20-40 cycles. Typical case: 40-80 cycles. Worst case (page table in DRAM): 100-150+ cycles. x86-64 uses 4-level page tables, requiring up to 4 memory accesses. With virtualization (nested page tables): 6x more lookups, penalty can exceed 500 cycles. Mitigation: use huge pages (2MB/1GB) to reduce TLB pressure. TLB typically holds 1500-2000 4KB page entries or 32-64 huge page entries.
Prefetch distance formula: distance = (cache_miss_latency * IPC) / instructions_per_element. Or simpler: distance = miss_latency_cycles / cycles_per_iteration. Example: L3 miss = 200 cycles, loop body = 25 cycles, distance = 200/25 = 8 iterations ahead. For arrays: prefetch &array[i + distance] while processing array[i]. Tuning: measure actual cache miss rate. Start with distance = 8-16, adjust based on profiling. Too short: still waiting for data. Too long: prefetched data evicted before use. Bandwidth consideration: don't prefetch faster than memory bandwidth allows.
String operation costs: Comparison (strcmp): best case 1 cycle (first char differs), worst case = length cycles + cache misses. SIMD comparison: 16-32 chars per cycle with SSE/AVX. Copy (memcpy): for small strings (<64 bytes): 10-30 cycles. For large strings: limited by memory bandwidth (10-20 GB/s). SIMD copy: 32-64 bytes per cycle from L1. Search (strstr): naive O(nm), optimized (SIMD + hashing) approaches O(n). Cost model: memcpy of N bytes = max(20, N / effective_bandwidth_bytes_per_cycle) cycles. For N=1KB from L1: ~30-50 cycles. For N=1MB from DRAM: ~50,000 cycles at 20 GB/s. Small string optimization (SSO) in std::string: avoids allocation for strings <15-23 chars.
An integer ADD instruction has a latency of 1 cycle on modern x86 CPUs (Intel and AMD). The throughput is typically 3-4 operations per cycle due to multiple execution units. This means while each individual ADD takes 1 cycle to produce its result, the CPU can execute 3-4 independent ADDs simultaneously. For dependency chains, count 1 cycle per ADD. For independent operations, divide total ADDs by 3-4 for approximate cycle count.
Unaligned access cost on modern x86: Within cache line: 0 extra cycles (handled by hardware). Crossing cache line boundary: +4-10 cycles (two cache accesses needed). Crossing page boundary: very expensive, potentially 100+ cycles (two TLB lookups). SIMD unaligned loads (VMOVDQU vs VMOVDQA): historically 1-2 cycles more, now essentially free on Haswell+. Cost model: ensure arrays are aligned to 16/32/64 bytes for SIMD. For structures with mixed sizes, use alignas() or manual padding. Unaligned atomics may not be atomic (split across cache lines). Rule: align to max(element_size, SIMD_width) for best performance.
Cache line size is 64 bytes on x86 (Intel/AMD) and 128 bytes on Apple M-series. Memory is transferred in whole cache lines - accessing 1 byte loads 64 bytes. Performance implications: (1) Sequential access: effectively 64 bytes per cache miss. (2) Strided access: if stride >= 64 bytes, every access is a cache miss. (3) False sharing: two threads modifying different variables on same cache line cause ping-pong. Cost model: array traversal with stride S bytes: cache_misses = array_size / max(S, 64). For stride 8 (double): miss every 8 elements. For stride 64+: miss every element.
Memory hierarchy cost ratios (normalized to L1=1): L1 hit: 1x (4 cycles baseline). L2 hit: 3x (12 cycles). L3 hit: 10x (40 cycles). DRAM: 50-75x (200-300 cycles). NUMA remote: 75-150x (300-600 cycles). Bandwidth ratios: L1: 64-128 bytes/cycle. L2: 32-64 bytes/cycle (0.5x L1). L3: 16-32 bytes/cycle (0.25x L1). DRAM: 10-20 bytes/cycle shared (0.1-0.15x L1). Cost model rule of thumb: each cache level is 3-4x slower than previous. Missing all caches is 50-100x slower than L1 hit. Design data structures to maximize L1/L2 hits - the latency difference is dramatic.
Little's Law: Concurrency = Throughput * Latency. For memory systems: Outstanding_requests = Bandwidth * Latency. Example: to achieve 50 GB/s with 100ns latency: Requests = (50e9 bytes/s) * (100e-9 s) / 64 bytes = 78 cache lines in flight. A single core typically supports 10-12 outstanding L1D misses. To saturate memory bandwidth, need multiple cores or threads. Cost model: max_single_thread_bandwidth = max_outstanding_misses * cache_line_size / memory_latency. With 10 misses, 64B lines, 100ns: 10 * 64 / 100ns = 6.4 GB/s per thread (far below 50 GB/s peak). This explains why single-threaded code rarely achieves peak memory bandwidth.
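The same Little's Law arithmetic as a tiny program, reproducing the 78-lines-in-flight and 6.4 GB/s figures above.

```cpp
#include <cstdio>

int main() {
    const double latency_s  = 100e-9;   // 100 ns DRAM latency
    const double line_bytes = 64.0;     // cache line size

    // Cache lines that must be in flight to sustain 50 GB/s:
    const double needed = 50e9 * latency_s / line_bytes;        // ~78

    // Bandwidth achievable with only 10 outstanding misses per thread:
    const double per_thread = 10 * line_bytes / latency_s;      // 6.4e9 B/s

    std::printf("lines in flight for 50 GB/s: %.0f\n", needed);
    std::printf("10-miss single-thread bandwidth: %.1f GB/s\n", per_thread / 1e9);
    return 0;
}
```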
malloc/free cost varies by size and allocator: Small allocation (<256 bytes): 20-100 cycles with modern allocators (tcmalloc, jemalloc). Medium allocation (<1MB): 100-500 cycles. Large allocation (>1MB, uses mmap): 1000+ cycles, involves syscall. Multithreaded overhead: glibc ptmalloc has lock contention; tcmalloc/jemalloc use per-thread caches. Google data: malloc consumes ~7% of datacenter CPU cycles. Cost model: minimize allocations in hot paths. Use object pools, stack allocation, or arena allocators. Reusing buffers eliminates allocation overhead entirely.
L1 cache access latency is 4-5 cycles on modern Intel and AMD processors (approximately 1-1.5 nanoseconds at 3-4 GHz). L1 data cache (L1D) is typically 32-48KB per core. L1 instruction cache (L1I) is typically 32KB per core. The L1 cache can sustain 2 loads and 1 store per cycle. For cost estimation: assume 4 cycles for any memory access that hits L1. This is your best-case memory latency and baseline for comparison with other cache levels.
CMOV vs branch tradeoffs: CMOV: 1-2 cycles latency, always executed (no prediction). Branch: 1 cycle if correctly predicted, 15-20 cycles if mispredicted. Break-even point: if branch prediction accuracy < 85-90%, CMOV wins. If accuracy > 95%, branch is typically faster. Cost model: branch_cost = 1 + (1 - prediction_rate) * misprediction_penalty. CMOV_cost = 2 cycles (always). Example: 75% predictable branch: 1 + 0.25 * 17 = 5.25 cycles average (CMOV wins). 99% predictable branch: 1 + 0.01 * 17 = 1.17 cycles (branch wins). Compiler heuristics: often conservative with CMOV. Use profile-guided optimization to help compiler choose correctly.
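A common illustration of the trade-off, assuming a conditional-accumulation loop (hypothetical function names). Whether the ternary actually becomes CMOV or a masked SIMD add is the compiler's decision, so verify the assembly when it matters.

```cpp
#include <cstdint>
#include <vector>

// Branchy form: ~1 cycle per element when the predicate is predictable,
// 15-20 cycles per element on mispredicted (random) data.
int64_t sum_if_branch(const std::vector<int32_t>& v, int32_t threshold) {
    int64_t s = 0;
    for (int32_t x : v)
        if (x >= threshold) s += x;      // data-dependent branch
    return s;
}

// Branchless form: a flat ~2-3 cycles per element regardless of the data.
int64_t sum_if_branchless(const std::vector<int32_t>& v, int32_t threshold) {
    int64_t s = 0;
    for (int32_t x : v)
        s += (x >= threshold) ? x : 0;   // typically compiled to CMOV or a mask
    return s;
}
```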
SIMD permute costs vary by type: In-lane permute (within 128-bit lane): 1 cycle latency, 1/cycle throughput. Examples: PSHUFB, PSHUFD. Cross-lane permute (across 128-bit boundaries): 3 cycles latency, 1/cycle throughput. Examples: VPERMPS, VPERMD. Full permute (AVX-512 VPERMB): 3-6 cycles. Cost model for data rearrangement: prefer in-lane shuffles (1 cycle) over cross-lane (3 cycles). Two in-lane shuffles often faster than one cross-lane if possible. VPERMPS for float permutation: 3-cycle latency, handles arbitrary 8-element permutation in AVX2.
Context switch direct cost: 1-3 microseconds (3,000-10,000 cycles at 3 GHz). Components: (1) Save registers and state: 100-200 cycles. (2) Scheduler decision: 200-500 cycles. (3) Load new page tables: 100-200 cycles. (4) Restore registers: 100-200 cycles. (5) Pipeline refill: 10-50 cycles. Indirect costs dominate: TLB flush (100+ cycles per subsequent miss), cache pollution (potentially 1000s of cycles to rewarm). With virtualization: 2.5-3x more expensive. Rule of thumb: budget 30 microseconds total cost including indirect effects.
Page fault costs: Minor fault (page in memory, just needs mapping): 1-10 microseconds (3,000-30,000 cycles). Major fault (page must be read from disk): roughly 0.1-1 millisecond for SSD (hundreds of thousands to a few million cycles). HDD: 5-15 milliseconds (15-45 million cycles). Kernel overhead per fault: ~1000-5000 cycles for trap handling. Cost model: first touch of allocated memory triggers minor faults. mmap'd file access triggers major faults if not cached. For performance: pre-fault memory with memset, use madvise(MADV_WILLNEED) for file mappings. Huge pages reduce fault frequency by 512x (2MB vs 4KB pages). Detection: perf stat -e page-faults shows fault count.
Atomic CAS (LOCK CMPXCHG) cost varies dramatically by contention: Uncontended (exclusive cache line): 12-40 cycles. Intel: ~35 cycles on Core 2, ~20 cycles on Skylake. AMD: ~12-15 cycles on Zen. Contended (cache line bouncing): 70-200+ cycles per operation due to cache coherency traffic. Highly contended: effectively serializes, can be 100x slower than uncontended. For cost estimation: budget 20-40 cycles for uncontended atomics, but design to minimize contention. A contended atomic is worse than a mutex in most cases.
File I/O cost model: Syscall overhead (read/write): 500-1500 cycles per call. Disk latency (SSD): 50-200 microseconds (150,000-600,000 cycles). Disk latency (HDD): 5-10 milliseconds (15,000,000-30,000,000 cycles). SSD throughput: 500MB/s - 7GB/s depending on interface (SATA vs NVMe). Cost model: small_read_time = syscall + disk_latency + size/bandwidth. For many small reads: syscall overhead dominates. For large sequential reads: bandwidth dominates. Optimization: batch small reads, use mmap for random access, use async I/O. Buffered I/O (fread): fewer syscalls but memory copy overhead. For maximum throughput: use direct I/O with aligned buffers, multiple in-flight requests.
SIMD operations process multiple elements with similar latency to scalar: SSE (128-bit): 4 floats or 2 doubles per instruction. Same latency as scalar (3-4 cycles for add/mul). AVX (256-bit): 8 floats or 4 doubles per instruction. Same latency, double throughput vs SSE. AVX-512 (512-bit): 16 floats or 8 doubles. May cause frequency throttling on some CPUs. Cost model: SIMD_time = Scalar_time / Vector_width (ideal). Real speedup typically 2-6x due to overhead, alignment, and remainder handling. For N elements: cycles = (N / vector_width) / instructions_per_cycle (throughput bound) or (N / vector_width) * latency (latency bound).
Branch misprediction penalty is 15-20 cycles on modern Intel and AMD processors. Intel Skylake: approximately 16-17 cycles. AMD Zen 1/2: approximately 19 cycles. When mispredicted, the pipeline must be flushed and refilled from the correct target. For cost estimation: if branch prediction accuracy is P, average cost = (1-P) * 15-20 cycles per branch. Random branches (50% predictable) cost about 8-10 cycles on average. Well-predicted branches (>95%) cost <1 cycle average. Avoid data-dependent branches in hot loops.
Thread creation/destruction costs: Linux pthread_create: 10-50 microseconds (30,000-150,000 cycles). Windows CreateThread: 20-100 microseconds. Thread destruction: similar cost to creation. Stack allocation (typically 1-8MB): may cause page faults if touched. Cost model: thread_overhead = creation_time + work_time + destruction_time. Break-even: work must exceed 100 microseconds to justify thread creation. For short tasks: use thread pool to amortize creation cost. Pool dispatch overhead: ~1-5 microseconds (mutex + condition variable). Maximum useful threads: typically number_of_cores for CPU-bound work, more for I/O-bound. Oversubscription causes context switch overhead (30 microseconds each).
L3 cache access latency is 30-50 cycles on modern CPUs (approximately 10-20 nanoseconds). L3 is shared across all cores, typically 8-96MB total. Accessing a local L3 slice takes about 20 cycles, while accessing a remote slice (owned by another core) can take 40+ cycles. For cost estimation: L3 hit adds 25-45 cycles over L1 baseline. This is 5-10x slower than L1. If working set exceeds L2 but fits in L3, expect average latency around 35-40 cycles.
Memory bandwidth formula: Required_BW = (bytes_read + bytes_written) * iterations / time. For array sum of N floats: reads 4N bytes, no writes. At 50 GB/s bandwidth: minimum time = 4N / 50e9 seconds. Compare to compute: N additions at 8 adds/cycle on 3GHz = N/24e9 seconds. Memory-bound when 4N/50e9 > N/24e9, simplifying: 4/50 > 1/24, or 0.08 > 0.04 - always true for simple operations. Rule of thumb: operation is memory-bound if arithmetic intensity < 10-15 FLOPS/byte (the 'ridge point').
Division by constant is optimized by compilers: Compile-time constant divisor: replaced with multiply + shift, ~4-6 cycles. Power of 2: becomes shift, 1 cycle. Runtime variable divisor: full division, 26-40+ cycles. Savings: 5-10x faster for constant divisors. Cost model for repeated division: If dividing N values by same runtime divisor: precompute reciprocal once, then N multiplications. libdivide library automates this. Single division: full cost unavoidable. Modulo by constant: also optimized, same ~4-6 cycles. Example: x / 7 becomes (x * 0x2492492...) >> 32, about 4 cycles. Use const or constexpr for divisors when possible to enable optimization.
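A sketch of the three situations (hypothetical names). The loop variant uses a double reciprocal as an approximation; exact results for arbitrary runtime divisors need a libdivide-style fixup.

```cpp
#include <cstddef>
#include <cstdint>

// Compile-time constant divisor: compiled to multiply + shift, no DIV emitted.
uint32_t div_by_7(uint32_t x) { return x / 7; }        // ~4-6 cycles

// Runtime divisor: a real 26-40 cycle DIV per call.
uint32_t div_by_d(uint32_t x, uint32_t d) { return x / d; }

// Many values, same runtime divisor: pay the expensive operation once.
void div_all(uint32_t* out, const uint32_t* in, std::size_t n, uint32_t d) {
    const double inv = 1.0 / d;                        // one division up front
    for (std::size_t i = 0; i < n; ++i)
        out[i] = static_cast<uint32_t>(in[i] * inv);   // approximate; exactness
                                                       // needs libdivide-style fixup
}
```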
Memory bandwidth = Memory_Clock * 2 (DDR) * Bus_Width * Channels / 8. Example for DDR4-3200 dual channel: 1600 MHz * 2 * 64 bits * 2 channels / 8 = 51.2 GB/s. DDR5-6400 dual channel: 3200 * 2 * 64 * 2 / 8 = 102.4 GB/s. Real-world efficiency is 70-90% of theoretical maximum due to refresh cycles, command overhead, and access patterns. Single-threaded code typically achieves only 15-30% of peak bandwidth due to limited memory-level parallelism (need ~64 outstanding requests to saturate bandwidth).
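The bandwidth formula as a small helper reproducing the two examples above (function name is illustrative).

```cpp
#include <cstdio>

// Theoretical DDR bandwidth = memory_clock * 2 (DDR) * bus_width_bits * channels / 8.
double ddr_bandwidth_gbs(double mem_clock_mhz, int bus_bits, int channels) {
    return mem_clock_mhz * 1e6 * 2.0 * bus_bits * channels / 8.0 / 1e9;
}

int main() {
    std::printf("DDR4-3200 dual channel: %.1f GB/s\n", ddr_bandwidth_gbs(1600, 64, 2)); // 51.2
    std::printf("DDR5-6400 dual channel: %.1f GB/s\n", ddr_bandwidth_gbs(3200, 64, 2)); // 102.4
    return 0;
}
```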
AVX-512 instructions can cause CPU frequency reduction: Light AVX-512 (most instructions): 0-100 MHz reduction. Heavy AVX-512 (FMA, multiply): 100-200 MHz reduction. On Skylake-X: ~0.85x base frequency for heavy AVX-512. Throttling transition: 10-20 microseconds on early Xeon, near-zero on 3rd gen Xeon. Cost model: if AVX-512 section is short, transition overhead may exceed benefit. Break-even: typically need 100+ microseconds of AVX-512 work. Non-AVX code after AVX-512 runs at reduced frequency until transition back. Consider AVX2 for short bursts.
Gather/scatter is much slower than contiguous access: Contiguous vector load: 4-7 cycles from L1 cache for 256-512 bits. Gather (AVX2 VGATHERDPS): ~32 cycles for 8 elements from L1 - approximately 8x slower. Each gather element is a separate cache access. AVX-512 gather: improved to ~1.2-1.8x speedup vs scalar in best cases. Cost model: gather_cycles = elements * L1_latency (approximately). Only use gather when data is truly scattered. For rearrangement, prefer: load contiguous + permute (3 cycles) over gather. Scatter (AVX-512 only): similar cost to gather, avoid when possible.
Pointer chasing is latency-bound because each load depends on previous: If list fits in L1: ~4 cycles per node. If list fits in L2: ~12 cycles per node. If list fits in L3: ~40 cycles per node. If list exceeds L3: ~200 cycles per node (DRAM latency). Total cost = nodes * memory_latency_at_that_cache_level. Example: 1000-node list in DRAM = 200,000 cycles = 67 microseconds at 3 GHz. Optimization: convert to array when possible (sequential access is 10-50x faster). Software prefetching rarely helps (can't prefetch next without completing current load). B-trees beat linked lists precisely because of this cost model.
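A minimal illustration of why the traversal cost differs (hypothetical types and names): the list walk serializes on each load, while the array sum exposes independent addresses the hardware can prefetch and overlap.

```cpp
#include <cstddef>

struct Node { int value; Node* next; };

// Linked list: the next address comes from the previous load, so the cost is
// nodes * memory latency at whatever cache level the nodes live in.
long sum_list(const Node* head) {
    long s = 0;
    for (const Node* p = head; p != nullptr; p = p->next)
        s += p->value;               // fully serialized loads
    return s;
}

// Array: addresses are independent, so misses overlap and the prefetcher helps.
long sum_array(const int* values, std::size_t n) {
    long s = 0;
    for (std::size_t i = 0; i < n; ++i)
        s += values[i];              // bandwidth-bound, typically 10-50x faster
    return s;
}
```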
Mutex lock/unlock cost depends on contention: Uncontended (fast path): 15-25 nanoseconds (45-75 cycles). Uses single atomic instruction, no syscall. Contended (must wait): 1-10 microseconds (3,000-30,000 cycles). Includes context switch if blocking. Heavily contended: can serialize to 100x slower than uncontended. Linux futex fast path: ~25ns. Windows CRITICAL_SECTION: ~23ns. For cost estimation: uncontended mutex adds ~50-100 cycles per lock/unlock pair. Design to minimize contention; use lock-free structures or fine-grained locking for hot paths.
Latency-bound: each operation depends on previous, limited by instruction latency. Example: sum += array[i] - each add waits for previous add (3-4 cycles each). Rate: 1 add per 4 cycles = 0.25 adds/cycle. Throughput-bound: operations are independent, limited by execution units. Example: several partial sums merged at the end. Rate: up to 2 adds/cycle (both FP add units busy). Speedup: up to 8x over the single chain, but saturating 2 adds/cycle at 4-cycle latency needs latency * throughput = 8 independent accumulators; 4 accumulators give ~1 add/cycle (4x). Cost model: dependency_chain_length * latency vs total_ops / throughput. Take maximum. Optimization: unroll loops with multiple accumulators to convert latency-bound to throughput-bound. Typical improvement: 4-8x for reduction operations.
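A sketch of both forms (hypothetical names). Note that reordering floating-point additions can change the result slightly, which is why compilers only apply this transformation automatically under -ffast-math or similar flags.

```cpp
#include <cstddef>

// Latency-bound: one dependency chain, ~1 add per FP-add latency.
float sum_single(const float* a, std::size_t n) {
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i) s += a[i];
    return s;
}

// Throughput-bound: four independent chains keep the FP add units busier
// (~4x here; fully saturating 2 adds/cycle at 4-cycle latency needs ~8 chains
// or SIMD accumulators on top).
float sum_multi(const float* a, std::size_t n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i]; s1 += a[i + 1]; s2 += a[i + 2]; s3 += a[i + 3];
    }
    float s = (s0 + s1) + (s2 + s3);
    for (; i < n; ++i) s += a[i];    // remainder
    return s;
}
```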
Memory barrier/fence costs on x86: MFENCE (full barrier): 30-100 cycles on Intel, varies by microarchitecture. SFENCE (store barrier): 5-30 cycles. LFENCE (load barrier): near zero cycles (serializing, mainly for timing). LOCK prefix (implicit barrier): 12-35 cycles (same as atomic operation). On x86, most barriers are cheap due to strong memory model - stores already have release semantics. Cost model: MFENCE dominates when used; avoid in tight loops. std::atomic with sequential consistency uses MFENCE on stores. Prefer acquire/release semantics when possible (free on x86). ARM/PowerPC: barriers are more expensive (weaker memory model), 50-200 cycles typical.
Integer absolute value costs: Branching (if x < 0) x = -x: 1 cycle if predicted, 15-20 if mispredicted. Branchless using CMOV: 2-3 cycles. Branchless arithmetic: (x ^ (x >> 31)) - (x >> 31) = 3 cycles for 32-bit. SIMD (PABSD): 1 cycle for 8 int32s in AVX2. Cost model: for random signs (50/50), branchless wins. For mostly positive or mostly negative, branch may win due to prediction. Unsigned has no absolute value - already positive. Floating point: ANDPS with sign mask = 1 cycle (just clears sign bit). For bulk processing: always use SIMD PABS instructions (1 cycle per 8-16 elements).
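The branchless formula from above as a function (arithmetic right shift of a negative signed value is what mainstream compilers produce; C++20 guarantees it).

```cpp
#include <cstdint>

// x >> 31 replicates the sign bit: 0 for x >= 0, -1 for x < 0. XOR then
// subtract conditionally negates. Note INT32_MIN has no positive counterpart
// and is returned unchanged.
int32_t abs_branchless(int32_t x) {
    const int32_t mask = x >> 31;
    return (x ^ mask) - mask;    // ~3 cycles, no branch
}
```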
Cache miss costs (cycles added over L1 hit baseline of 4 cycles): L1 miss, L2 hit: +8-10 cycles (total ~12-14). L1+L2 miss, L3 hit: +26-36 cycles (total ~30-40). L1+L2+L3 miss, DRAM: +150-250 cycles (total ~160-260). Cost model for array access: if array_size <= 32KB (L1): 4 cycles/access. if array_size <= 256KB (L2): mix of 4 and 12 cycles. if array_size <= 8MB (L3): mix including ~40 cycle accesses. if array_size > L3: approaching 200+ cycles for random access. Sequential access amortizes: one miss per 64 bytes (8 doubles), so effective cost = miss_latency / 8 per element.
Sorting cost model (comparison-based): Comparisons: O(N log N), typically 1.5N log N for quicksort, 1.0N log N for mergesort. Memory accesses: quicksort mostly in-place, mergesort needs N extra space. Cache behavior: quicksort better cache locality, mergesort more predictable. At scale (N > L3/sizeof(element)): Quicksort: ~2N log N * 200 cycles (random access). Mergesort: more sequential, ~1.5N log N * 40 cycles (better prefetching). Radix sort (non-comparison): O(N * key_width), better for integers. Example: sorting 1M 64-bit integers. std::sort: ~20-30 million comparisons (1.5N log N with N = 1M), ~200ms. Radix sort: 8 passes * 1M = 8M operations, ~50ms. Rule: radix sort wins for large arrays of fixed-size keys.
Type conversion costs on x86: int to float (CVTSI2SS): 4-6 cycles latency. float to int (CVTTSS2SI): 4-6 cycles latency. int32 to int64 (sign extend): 1 cycle. float to double (CVTSS2SD): 2-5 cycles (a real conversion instruction, not just a register rename). double to float: 4 cycles (rounding). int64 to float: 4-6 cycles. Narrowing conversions may have additional checking cost in safe languages. Cost model: conversions are cheap relative to memory access but expensive relative to basic ALU. Avoid conversions in inner loops when possible. SIMD conversion: 8 int32 to 8 float in 3 cycles (VCVTDQ2PS). Bulk data: convert once, operate in target format.
Memory operation costs: memset (REP STOSB optimized): 32-64 bytes per cycle for large sizes. memcpy (REP MOVSB optimized): 16-32 bytes per cycle for large sizes. Small sizes (<64 bytes): 10-30 cycles overhead. SIMD explicit copy: can achieve 32-64 bytes per cycle from L1. Cost model: Large memcpy (N bytes) from L1: N / 32 cycles. From L3: N / 16 cycles (limited by L3 bandwidth). From DRAM: N / (DRAM_bandwidth_bytes_per_cycle). Example: memcpy of 1MB from DRAM at 50 GB/s: 1MB / 17 bytes-per-cycle = 62,000 cycles = 20 microseconds at 3 GHz. For small copies: inline code often faster than function call overhead. memmove (handles overlap): similar cost to memcpy.
Use arithmetic intensity (AI) = FLOPs / Bytes transferred. Ridge point = Peak_FLOPS / Peak_Bandwidth. If AI < ridge_point: memory-bound. If AI > ridge_point: compute-bound. Modern x86 example: Peak = 100 GFLOPS, Bandwidth = 50 GB/s, Ridge = 2 FLOPS/byte. For dot product (2 FLOPs per 8 bytes loaded): AI = 0.25 < 2, memory-bound. For matrix multiply (2N FLOPs per element, amortized): AI can exceed ridge with blocking. Cost model: Memory_bound_time = Bytes / Bandwidth. Compute_bound_time = FLOPs / Peak_FLOPS. Actual_time = max(memory_time, compute_time).
False sharing occurs when threads modify different variables on the same cache line. Cost: 40-200 cycles per access instead of 4 cycles (10-50x slowdown). The cache line ping-pongs between cores via coherency protocol. Each modification invalidates other cores' copies. Measured impact: 25-50% performance degradation typical, can be 10x in extreme cases. Detection: look for high cache coherency traffic in perf counters (HITM events). Prevention: pad shared structures to cache line boundaries (64 bytes on x86, 128 bytes on ARM). In C++: alignas(64) or compiler-specific attributes.
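A typical padding fix, assuming one counter per thread (names are illustrative); alignas(64) gives each counter its own x86 cache line.

```cpp
#include <atomic>

// Without padding, adjacent counters share a cache line and every increment
// ping-pongs that line between cores. alignas(64) pads each counter to a full
// line on x86 (use 128 on Apple M-series; C++17 also offers
// std::hardware_destructive_interference_size as a portable constant).
struct alignas(64) PaddedCounter {
    std::atomic<long> value{0};
};
static_assert(sizeof(PaddedCounter) == 64, "one counter per cache line");

PaddedCounter per_thread_counters[16];   // one slot per thread, no false sharing
```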
Integer multiplication (IMUL) has a latency of 3-4 cycles on modern Intel and AMD CPUs. On Intel Skylake and later, 64-bit IMUL has 3-cycle latency with throughput of 1 per cycle. The 128-bit result (RDX:RAX) from MUL r64 has 3-cycle latency for the low 64 bits (RAX) and 4-cycle latency for the high 64 bits (RDX). AMD Zen processors show similar characteristics. For cost estimation: count 3 cycles per dependent multiply, or total_multiplies / throughput for independent operations.
Main memory (DRAM) access latency is 150-300+ cycles (50-100 nanoseconds) on modern systems. This is approximately 50-100x slower than L1 cache. At 3 GHz, 100ns equals 300 cycles. Latency components: L3 miss detection (10-20 cycles) + memory controller (10-20 cycles) + DRAM access (40-60ns). For cost estimation: every cache miss that goes to DRAM costs roughly 200 cycles. A random access pattern in array larger than L3 will average close to DRAM latency.
Execution time estimation: time = max(compute_time, memory_time, branch_time). Compute_time = sum(instruction_count[i] * cycles[i]) / IPC. Memory_time = cache_misses * miss_penalty / memory_level_parallelism. Branch_time = mispredictions * misprediction_penalty. Simplified model: total_cycles = instructions / IPC + L3_misses * 200 + mispredictions * 17. Example: 1M instructions at IPC=2, 1000 L3 misses, 5000 mispredictions. Cycles = 1M/2 + 1000 * 200 + 5000 * 17 = 500K + 200K + 85K = 785K cycles. Time at 3GHz = 262 microseconds. Validate with: perf stat -e cycles,instructions,cache-misses,branch-misses. Adjust model based on measured vs predicted discrepancies.
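The simplified model as a helper function, reproducing the worked example (names and default penalties are taken from the text).

```cpp
#include <cstdio>

// cycles = instructions / IPC + L3_misses * miss_penalty + mispredictions * branch_penalty
double estimate_cycles(double instructions, double ipc,
                       double l3_misses, double mispredictions,
                       double miss_penalty = 200.0, double branch_penalty = 17.0) {
    return instructions / ipc + l3_misses * miss_penalty
         + mispredictions * branch_penalty;
}

int main() {
    // 1M instructions at IPC = 2, 1000 L3 misses, 5000 mispredictions.
    const double cycles = estimate_cycles(1e6, 2.0, 1000, 5000);   // 785,000
    std::printf("%.0f cycles, %.0f us at 3 GHz\n", cycles, cycles / 3e9 * 1e6);
    return 0;
}
```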
Log and exp function costs: Library exp/log (double): 50-100 cycles typically. Fast approximations: 15-30 cycles with reduced precision. SIMD (Intel SVML): ~10-15 cycles per element vectorized. pow(x,y) = exp(y * log(x)): ~200 cycles (two transcendentals). exp2/log2: slightly faster than exp/ln (base matches hardware). Cost model: transcendental = ~80 cycles rule of thumb. For bulk computation: always vectorize with SIMD libraries. Integer power (x^n for integer n): use exponentiation by squaring, log2(n) multiplies = ~10 cycles for typical n. Avoid pow() for integer exponents - use x * x * x or a multiplication loop.
Integer division is expensive: 32-bit DIV takes 26 cycles on Intel Skylake, and 8-bit DIV takes about 25 cycles on Coffee Lake. 64-bit division can take 35-90 cycles depending on operand values. AMD processors are generally faster: Zen 2/3 significantly improved division performance. Throughput is very low (one division every 6+ cycles). For cost estimation: assume 26-40 cycles per 32-bit division in dependency chains. Avoid division in hot loops; use multiplication by reciprocal when divisor is constant.
Binary search cost model: Comparisons: log2(N). Memory accesses: log2(N), all potentially cache misses (random access pattern). If array fits in L1 (N < 4K for 4-byte elements): ~4 cycles per comparison. If array fits in L3 (N < 2M elements): ~40 cycles per comparison. If array exceeds L3: ~200 cycles per comparison (DRAM). Example: binary search in 1 million integers exceeding L3. Comparisons = 20. Cost = 20 * 200 = 4000 cycles = 1.3 microseconds. Optimization: for small arrays, linear search may win due to prefetching. For huge sorted arrays, consider B-tree (multiple elements per cache line) or interpolation search.
IPC varies dramatically by workload: Compute-bound vectorized code: 2-4 IPC (utilizing multiple execution units). Integer-heavy general code: 1-2 IPC. Memory-bound code: 0.2-0.5 IPC (waiting for cache misses). Branch-heavy code: 0.5-1 IPC (misprediction stalls). Database workloads: often 0.3-0.7 IPC (random memory access). Modern CPUs can retire 4-6 instructions per cycle theoretically. Rule of thumb: IPC < 0.7 indicates optimization opportunity (likely memory-bound). Measure with: perf stat -e instructions,cycles your_program. IPC = instructions / cycles.