
Performance Cost Models FAQ & Answers

75 expert answers on performance cost models, researched from official documentation. Every answer cites authoritative sources you can verify.

Cost Models

75 questions
A

Software prefetch (PREFETCH) cost: Instruction overhead: 1-4 cycles to issue. No immediate benefit - just initiates memory fetch. Prefetch must arrive before data is needed: prefetch_distance = cache_miss_latency / loop_iteration_cycles. Example: 200 cycle miss latency, 7 cycles per iteration = prefetch 28 iterations ahead. Too early: data evicted before use. Too late: no hiding of latency. Cost model: effective if hiding latency exceeds instruction overhead. For irregular access patterns, prefetch is often counterproductive. Modern CPUs have good hardware prefetchers for regular patterns - software prefetch mainly helps irregular but predictable patterns.

95% confidence
A

Modulo has the same cost as division since it uses the same hardware (DIV/IDIV produces both quotient and remainder). For 32-bit modulo: 26-35 cycles latency, throughput of one every 6 cycles on Intel. Key optimizations: (1) Modulo by power of 2 becomes AND operation (1 cycle): x % 8 = x & 7. (2) For constant divisors, compilers use multiply-by-reciprocal: about 4 cycles total (multiply + shift). (3) For runtime divisors, precompute reciprocal if dividing many values by same divisor. Avoid modulo in tight loops.
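A minimal C++ sketch of the cheap cases called out above (function names are illustrative):

```cpp
#include <cstdint>

// Modulo by a power of two compiles to a single AND: x % 8 == x & 7.
inline uint32_t mod8(uint32_t x) { return x & 7u; }

// A compile-time constant divisor lets the compiler emit multiply-by-reciprocal
// plus shifts (~4 cycles) instead of a 26+ cycle DIV.
inline uint32_t mod7(uint32_t x) { return x % 7u; }

// A runtime divisor forces a real DIV/IDIV on every call.
inline uint32_t mod_n(uint32_t x, uint32_t n) { return x % n; }
```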

95% confidence
A

DRAM refresh overhead: Each row refreshed every 64ms (DDR4 standard). Refresh takes 350ns, blocks access to that bank. Average impact: 5-10% of bandwidth lost to refresh. Worst case: request hits refreshing bank, adds 350ns (1000 cycles) to latency. Temperature effects: above 85C, refresh rate doubles (per DDR4 spec). Cost model: for latency-sensitive code, 99th percentile latency can be 2-3x median due to refresh collisions. Mitigation: modern memory controllers schedule around refresh when possible. For real-time systems: account for refresh-induced jitter. Server-grade memory: higher refresh rates for reliability can increase performance impact.

95% confidence
A

SIMD speedup estimation: Theoretical maximum: vector_width / scalar_width. SSE: 4x for float, 2x for double. AVX: 8x for float, 4x for double. AVX-512: 16x for float, 8x for double. Real speedup factors: Vectorization overhead (setup, remainder): reduces by 10-20%. Memory-bound code: limited by bandwidth, not compute. Complex control flow: may not vectorize. Measured typical speedups: compute-bound: 60-80% of theoretical. Memory-bound: 20-50% of theoretical (bandwidth limited). Mixed: 40-70% of theoretical. Cost model: actual_speedup = min(theoretical_speedup, memory_bandwidth_ratio). For bandwidth-bound: speedup = bytes_loaded_per_scalar / bytes_loaded_per_SIMD (often ~1-2x, not 8x).

95% confidence
A

Integer to string conversion cost: Naive div/mod by 10 approach: ~26 cycles per digit (division cost). 64-bit integer (up to 20 digits): ~520 cycles worst case. Optimized approaches: lookup table for 00-99: ~10 cycles per 2 digits. Multiply by reciprocal: ~5 cycles per digit. Modern libraries (fmt, to_chars): 50-100 cycles for typical integers. Cost model: digit_count = floor(log10(n)) + 1. Naive: digit_count * 26 cycles. Optimized: digit_count * 5-10 cycles. sprintf overhead: adds 100-500 cycles for format parsing. For high throughput: use specialized integer formatting (fmt library is 5-10x faster than sprintf).

95% confidence
A

Arithmetic intensity (AI) = FLOPs / Bytes moved from memory. Count operations and data movement: FLOPs = adds + multiplies + divides + sqrts (count FMA as 2 FLOPs). Bytes = unique cache lines touched * 64 (for cold cache) or actual bytes for hot cache. Example - dot product of N floats: FLOPs = 2N (one mul + one add per element). Bytes = 2 * 4N = 8N (read two vectors). AI = 2N / 8N = 0.25 FLOPs/byte. Example - matrix multiply (NxN, naive): FLOPs = 2N^3. Bytes = 3 * 4N^2 (read two matrices, write one). AI = 2N^3 / 12N^2 = N/6 FLOPs/byte. For N=1000: AI = 166 FLOPs/byte (compute-bound).
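The dot-product count above as a tiny C++ sketch (names are illustrative):

```cpp
#include <cstddef>

// Arithmetic intensity of a dot product of two float arrays of length n:
// 2n FLOPs (one mul + one add per element) over 8n bytes read
// => 0.25 FLOPs/byte, independent of n.
double dot_product_arithmetic_intensity(std::size_t n) {
    const double flops = 2.0 * static_cast<double>(n);
    const double bytes = 2.0 * sizeof(float) * static_cast<double>(n);
    return flops / bytes;   // 0.25
}
```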

95% confidence
A

Memory throughput at each level: L1 cache: 64-128 bytes/cycle (2 loads + 1 store of 32-64 bytes each). L2 cache: 32-64 bytes/cycle. L3 cache: 16-32 bytes/cycle per core. DRAM: ~10-20 bytes/cycle per core (but shared across all cores). Example: DDR4-3200 dual channel = 51.2 GB/s total. At 3 GHz = 17 bytes/cycle shared. Single thread typically achieves 5-8 bytes/cycle from DRAM due to limited memory-level parallelism. Cost model: for bandwidth-bound code, time = bytes / effective_bandwidth_bytes_per_cycle cycles.

95% confidence
A

L2 cache access latency is 12-14 cycles on modern CPUs (approximately 3-5 nanoseconds). L2 cache is typically 256KB-1MB per core. This is roughly 3x slower than L1 cache. For cost estimation: L2 miss but L1 hit adds about 8-10 cycles over L1 baseline. Memory access pattern analysis: if your working set fits in L2 (256KB-1MB per core), expect average latency around 12-14 cycles. L2 bandwidth is typically 32-64 bytes per cycle.

95% confidence
A

Floating-point to string is much more expensive than integer: Simple cases (small integers, simple decimals): 200-500 cycles. Complex cases (large exponents, many decimal places): 1000-5000 cycles. Full precision (17 digits for double): often requires arbitrary precision arithmetic. Algorithms: Grisu2/Grisu3: 100-300 cycles for simple cases. Ryu (newer): ~100 cycles average, handles all cases correctly. Dragonbox: similar to Ryu, different tradeoffs. Cost model: ~5-10x slower than integer conversion. printf/sprintf: adds parsing overhead, total 500-2000 cycles. For high throughput: use Ryu or Dragonbox-based libraries (10x faster than naive approaches).

95% confidence
A

Roofline model: Attainable_FLOPS = min(Peak_FLOPS, Arithmetic_Intensity * Memory_Bandwidth). Where Arithmetic_Intensity (AI) = Work (FLOPs) / Data (Bytes). Example: Peak = 200 GFLOPS, BW = 50 GB/s. If AI = 1 FLOP/byte: Attainable = min(200, 1 * 50) = 50 GFLOPS (memory-bound). If AI = 10 FLOPS/byte: Attainable = min(200, 10 * 50) = 200 GFLOPS (compute-bound). Ridge point (crossover): 200/50 = 4 FLOPS/byte. Hierarchical roofline uses different bandwidths for L1/L2/L3/DRAM to identify which cache level is the bottleneck.
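The roofline formula translates directly into code; a hedged C++ sketch (units: GFLOPS, GB/s, FLOPs/byte):

```cpp
#include <algorithm>

// Attainable performance under the roofline model.
double attainable_gflops(double peak_gflops, double bandwidth_gbs,
                         double arithmetic_intensity) {
    return std::min(peak_gflops, arithmetic_intensity * bandwidth_gbs);
}
// attainable_gflops(200, 50, 1.0)  == 50  (memory-bound)
// attainable_gflops(200, 50, 10.0) == 200 (compute-bound)
```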

95% confidence
A

CPU performance notation explained: Latency: cycles from instruction start to result availability. Used for dependency chains. Throughput (reciprocal): cycles between starting consecutive independent instructions. Example: 4L1T means 4 cycles latency, 1 cycle reciprocal throughput (start one per cycle). 4L0.5T means 4 cycles latency, 0.5 cycle reciprocal throughput (start two per cycle). A reciprocal throughput of 0.33 means 3 instructions per cycle. Reading Agner Fog tables: columns show latency and reciprocal throughput separately. uops.info format: similar, also shows the micro-op breakdown. Cost model application: dependent chain: sum the latencies along the chain. Independent operations: multiply the instruction count by reciprocal throughput. Real code falls somewhere in between; use profiling to measure actual performance.
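A tiny C++ sketch of how the two numbers are applied (illustrative helper names):

```cpp
// Dependent chain: each op waits for the previous result, so latencies add up.
double chain_cycles(double op_count, double latency) {
    return op_count * latency;               // e.g. 100 * 4 = 400 cycles
}

// Independent ops: limited by issue rate, i.e. reciprocal throughput.
double independent_cycles(double op_count, double reciprocal_throughput) {
    return op_count * reciprocal_throughput; // e.g. 100 * 0.5 = 50 cycles
}
```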

95% confidence
A

Floating-point division is expensive and not fully pipelined: Single precision (DIVSS): Latency 11 cycles, throughput 0.33/cycle (one every 3 cycles). Double precision (DIVSD): Latency 13-14 cycles, throughput 0.25/cycle (one every 4 cycles). This is 3-4x slower than multiplication. Optimization: multiply by reciprocal when dividing by constant. For approximate division, use RCPSS (4 cycles) then Newton-Raphson refinement. For cost estimation in loops: N divisions = N * 3-4 cycles minimum (throughput bound), or N * 11-14 cycles if dependent (latency bound).

95% confidence
A

Network operation costs: Syscall overhead: 500-1500 cycles per send/recv. Localhost roundtrip: 10-50 microseconds. Same-datacenter roundtrip: 0.5-2 milliseconds. Cross-datacenter: 10-100+ milliseconds. TCP overhead: connection setup (3-way handshake) adds 1.5x RTT. Kernel network stack: ~10 microseconds processing per packet. Cost model: network_time = syscall + RTT + size/bandwidth. For small messages: RTT dominates. For large transfers: bandwidth dominates. Optimization: batch small messages, use TCP_NODELAY to disable Nagle, use kernel bypass (DPDK) for lowest latency. 10 Gbps network: 1.25 GB/s, i.e. roughly 2.4 cycles per byte of wire time at 3 GHz.

95% confidence
A

Cache thrashing occurs when working set exceeds cache, causing repeated evictions: Thrashing signatures: L1 miss rate > 10%, L3 miss rate > 5%. Cost model: if working_set > cache_size, every access potentially misses. Effective latency approaches next level: L1 thrash -> L2 latency (12 cycles). L2 thrash -> L3 latency (40 cycles). L3 thrash -> DRAM latency (200 cycles). Example: matrix multiply without blocking, N=1000. Working set per row: 8KB (1000 doubles). L1 (32KB) holds 4 rows - heavy L1 thrashing. Performance: 10-50x slower than cache-blocked version. Detection: perf stat -e cache-misses,cache-references. Fix: restructure algorithm for cache-oblivious or explicit blocking to fit working set in cache.
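A minimal cache-blocking sketch for the matrix-multiply example; the tile edge T is a tunable assumption chosen so the three T x T tiles stay cache-resident:

```cpp
#include <algorithm>

// Blocked (tiled) matrix multiply, C += A * B, all N x N row-major doubles.
void matmul_blocked(const double* A, const double* B, double* C, int N) {
    constexpr int T = 64;  // 3 tiles * 64*64*8 bytes = 96 KB, L2-resident (assumption)
    for (int ii = 0; ii < N; ii += T)
        for (int kk = 0; kk < N; kk += T)
            for (int jj = 0; jj < N; jj += T)
                for (int i = ii; i < std::min(ii + T, N); ++i)
                    for (int k = kk; k < std::min(kk + T, N); ++k) {
                        const double a = A[i * N + k];
                        for (int j = jj; j < std::min(jj + T, N); ++j)
                            C[i * N + j] += a * B[k * N + j];  // streams row segments of B and C
                    }
}
```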

95% confidence
A

Float vs double on modern x86: Scalar operations: same latency and throughput (both use 64-bit FP unit). SIMD operations: float has 2x throughput (8 floats vs 4 doubles in AVX). Memory: float uses half the bandwidth and cache. Specific costs: float ADD/MUL: 4 cycles latency, 2/cycle throughput (same as double). Division: float ~11 cycles, double ~13-14 cycles (float slightly faster). SIMD ADD (AVX): 8 floats or 4 doubles per instruction, same latency. Cost model: memory-bound code benefits 2x from float (half the bytes). Compute-bound SIMD code: float 2x faster. Pure scalar compute: roughly equal. Choose float for performance unless precision requires double.

95% confidence
A

Virtual function call overhead is 10-20 cycles in the best case (warm cache, predicted branch). Components: (1) Load vptr from object: 4 cycles if L1 hit. (2) Load function pointer from vtable: 4 cycles if L1 hit. (3) Indirect branch: 1-3 cycles if predicted, 15-20 cycles if mispredicted. Cold cache worst case: 100+ cycles (two cache misses + misprediction). Rule of thumb: virtual calls are 3-10x slower than direct calls when cache is warm. In tight loops, devirtualization or template polymorphism can eliminate overhead entirely.

95% confidence
A

Basic function call overhead is 2-5 cycles for the CALL/RET instructions themselves, but total overhead includes: (1) Stack frame setup: PUSH RBP, MOV RBP,RSP = 2-3 cycles. (2) Register saves if needed: 1 cycle per register. (3) Stack alignment: potentially 1-2 cycles. (4) Return: POP RBP, RET = 2-3 cycles. Typical total: 5-15 cycles for small functions. However, if function is not inlined and causes instruction cache miss, add 10-50+ cycles. For tiny functions (1-3 instructions), call overhead can exceed function body cost by 5-10x.

95% confidence
A

Instruction mix by workload: Scientific/HPC: 30-50% FP operations, 20-30% loads/stores, 10-20% branches. Integer-heavy (databases): 40-50% integer ALU, 30-40% loads/stores, 10-15% branches. Web servers: heavy load/store (40-50%), many branches (15-20%), string operations. Games: 30-40% FP/SIMD, 30% loads/stores, 15-20% branches. Compilers: 50%+ branches (control flow heavy), moderate loads/stores. Cost model impact: high branch count = misprediction dominates. High load/store = cache behavior dominates. High FP/SIMD = execution unit throughput matters. Profile your workload to understand which category applies, then optimize accordingly.

95% confidence
A

Min/max operation costs: Integer: compare + conditional move = 2 cycles. Or using bit tricks: 3-4 cycles for branchless. Floating point (MINSS/MAXSS): 3-4 cycles latency. SIMD (VPMINSB, etc.): 1 cycle for packed integers, 4 cycles for packed floats. Branching min/max: 1 cycle if predicted, 15-20 cycles if mispredicted. Cost model: for predictable comparisons (e.g., clamping), branches may be faster. For random comparisons, branchless (CMOV or SIMD) wins. Reduction to find min/max of array: N comparisons, but SIMD can do 8-16 per instruction. SIMD min/max of N elements: N/16 cycles for AVX2, plus horizontal reduction (10 cycles).

95% confidence
A

Indirect branch cost depends on prediction: Predicted correctly: 1-2 cycles (similar to direct branch). Mispredicted: 15-20 cycles (full pipeline flush). Prediction accuracy depends on call site pattern: always same target = near 100% predicted. few targets, stable pattern = well predicted. many targets or random = poorly predicted. Cost model: average_cost = 1 + (1 - prediction_rate) * 15. For virtual function calls: usually well-predicted if object type is consistent. For switch statements compiled to jump tables: depends on case distribution. Profile-guided optimization (PGO) significantly improves indirect branch prediction.

95% confidence
A

Hash table lookup cost: Hash computation: 5-50 cycles depending on hash function (CRC32: ~5, SHA: ~100+). Table lookup: 4-200 cycles depending on table size vs cache. Collision handling: adds one lookup per collision. Best case (small table in L1, no collision): ~10-15 cycles. Typical case (table in L3, avg 0.5 collisions): ~50-80 cycles. Worst case (table in DRAM, several collisions): ~400+ cycles. Cost model: lookup_cost = hash_cycles + expected_probes * memory_latency_for_table_size. Keep tables sized to fit in cache when possible. For high-performance: use cache-line-aligned buckets, minimize pointer chasing.

95% confidence
A

Bitwise operation costs: AND, OR, XOR, NOT: 1 cycle latency, 3-4 per cycle throughput. Shifts (SHL, SHR, SAR): 1 cycle latency, 2 per cycle throughput. Variable shift: 1-2 cycles latency (depends on whether count is CL register). Rotate (ROL, ROR): 1 cycle latency. Population count (POPCNT): 3 cycles latency, 1 per cycle throughput. Leading zeros (LZCNT): 3 cycles latency. Bit test (BT): 1 cycle for register, more for memory. Cost model: bitwise operations are extremely cheap - essentially free compared to memory access. Use bit manipulation for flags, masks, and compact data structures. SIMD bitwise: processes 256-512 bits in 1 cycle.

95% confidence
A

Loop unrolling tradeoffs: Benefits: reduces loop overhead (test, branch) by the unroll factor. A 16x unroll eliminates ~94% of the loop branches. Typical speedup: 1.2-2x for unroll factor 4-8. Costs: code size grows linearly with unroll factor. Excessive unrolling causes: instruction cache misses, defeats micro-op cache (typically holds ~1500 micro-ops). Sweet spot: unroll 4-8x usually optimal, beyond 16x rarely helps. Cost model: unrolled_cycles = base_cycles / unroll_factor + overhead. But if code exceeds L1I cache (32KB), add i-cache miss penalty (10-20 cycles per miss). Diminishing returns beyond instruction_cache_size / loop_body_size iterations.

95% confidence
A

NUMA remote memory access costs 1.5-2x more than local: Local node latency: 90-100 nanoseconds (300 cycles at 3 GHz). Remote node latency: 150-250 nanoseconds (450-750 cycles). Bandwidth reduction: 1.5-2x lower for remote access. Measured examples: local 118ns vs remote 242ns (2.0x). Under contention: remote can reach 1200 cycles vs 300 local (4x). Cost model: for NUMA-aware allocation, bind memory to same node as compute thread. Random access across NUMA domains: average_latency = local_latency * local_fraction + remote_latency * remote_fraction.

95% confidence
A

Floating-point square root cost: Single precision (SQRTSS): Latency 12-15 cycles, throughput ~0.33/cycle. Double precision (SQRTSD): Latency 15-20 cycles, throughput ~0.25/cycle. Only partially pipelined - a new sqrt can start only every 3-4 cycles, far below the add/mul issue rate. Approximate alternatives: RSQRTSS (reciprocal sqrt): 4 cycles, ~12-bit precision. RSQRTSS + Newton-Raphson: 8-10 cycles, full precision. For cost estimation: treat sqrt like division - expensive, avoid in tight loops. If precision allows, use rsqrt approximation (3-4x faster).

95% confidence
A

System call overhead: 150-1500 cycles depending on the syscall and CPU mitigations. Minimal syscall (e.g., getpid): ~760 cycles, ~250 nanoseconds. Typical syscall (e.g., read): 1000-1500 cycles. With Spectre/Meltdown mitigations: can add 100-500 cycles. vDSO optimization (clock_gettime, gettimeofday): ~150 cycles, ~50 nanoseconds - about 10x faster than true syscall. For cost estimation: budget 500-1000 cycles per syscall in hot paths. Batch operations when possible to amortize overhead. Prefer vDSO functions for high-frequency calls.

95% confidence
A

Reduction operation costs: Scalar sum of N elements: latency-bound at ~4 cycles per add (dependency chain). Total: 4N cycles. Multi-accumulator (enough independent accumulators to cover latency x throughput, i.e. 8 for a 4-cycle, 2-per-cycle adder): throughput-bound at 2 adds/cycle. Total: N/2 cycles (8x faster). SIMD reduction (AVX with 8 floats): vector adds at 8 elements per instruction, then horizontal reduce. Vertical: N/8 vector adds = N/16 cycles. Horizontal reduction: ~10-15 cycles (shuffle + add tree). Total: N/16 + 15 cycles. Cost model: scalar_reduction = N * op_latency. Optimized: N / (accumulators * throughput) + horizontal_reduction. Memory-bound check: if N * element_size > L3, time = N * element_size / bandwidth.
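A C++ sketch of the multi-accumulator idea (four chains shown, which already gives roughly 4x; fully saturating two FP adders at 4-cycle latency needs 8 chains, per the note above):

```cpp
#include <cstddef>

// Four independent partial sums break the single-accumulator dependency chain.
float sum_multi_acc(const float* a, std::size_t n) {
    float s0 = 0.f, s1 = 0.f, s2 = 0.f, s3 = 0.f;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; ++i) s0 += a[i];   // remainder
    return (s0 + s1) + (s2 + s3);    // small horizontal reduction at the end
}
```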

95% confidence
A

Garbage collection pause costs: Minor GC (young generation): 1-10 milliseconds typically. Major GC (full heap): 100ms to several seconds depending on heap size. Concurrent collectors (G1, ZGC, Shenandoah): target 10ms max pause. Cost model: pause_time proportional to live_objects for copying collectors. For stop-the-world GC: pause = heap_size / scan_rate (~1GB/s typical). Rule of thumb: 1ms per 100MB of live data for minor GC. Mitigation: reduce allocation rate, tune heap size, use low-pause collectors. For latency-critical: budget 10-20% of cycle time for GC. Allocation cost: JIT-optimized bump pointer = 10-20 cycles. Measured: Java allocation can be faster than malloc.

95% confidence
A

C++ exception cost (zero-cost exception model used by modern compilers): No exception thrown: 0 cycles overhead in hot path (tables consulted only on throw). Exception thrown: extremely expensive, 10,000-100,000+ cycles. Components of throw cost: stack unwinding, destructor calls, table lookups, dynamic type matching. Exception construction: typical exception object allocation + string formatting = 1000+ cycles. Cost model: never use exceptions for control flow. Throw rate should be < 0.1% for negligible impact. Code size: exception tables add 10-20% to binary size. For performance-critical code: prefer error codes or std::expected, reserve exceptions for truly exceptional cases.

95% confidence
A

Floating-point ADD/MUL on modern CPUs: Single precision (float): Latency 3-4 cycles, throughput 2 per cycle. Double precision (double): Latency 3-4 cycles, throughput 2 per cycle. FMA (fused multiply-add): Latency 4-5 cycles, throughput 2 per cycle. Intel Skylake: ADD latency 4 cycles, MUL latency 4 cycles, FMA latency 4 cycles - all symmetric. For cost estimation: floating-point math is nearly as fast as integer math on modern CPUs. Dependency chains are the limiting factor, not throughput. Use FMA when possible (a*b+c in one instruction).

95% confidence
A

Array of Structs (AoS) vs Struct of Arrays (SoA): AoS: struct Point { float x, y, z; } points[N]; - loads 12 bytes per point, may waste cache if only accessing x. SoA: struct Points { float x[N], y[N], z[N]; }; - loads only needed components, SIMD-friendly. Cache efficiency: if using 1 field: AoS wastes 2/3 of loaded cache lines. SIMD vectorization: SoA naturally aligns for SIMD (8 x values contiguous). AoS requires gather or shuffle. Cost model: AoS processing all fields: ~equal to SoA. AoS processing one field: 3x memory bandwidth wasted (for 3-field struct). SoA with SIMD: potential 8x speedup from vectorization. Rule: use SoA for hot data processed in bulk; AoS for random access of complete records.
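The two layouts in C++ (minimal sketch; field names are illustrative):

```cpp
#include <vector>

// AoS: all fields of one point are adjacent - good for random access to whole records.
struct PointAoS { float x, y, z; };
std::vector<PointAoS> points_aos;

// SoA: each field is contiguous across points - only the needed component is loaded,
// and 8 consecutive x values map directly onto one AVX register.
struct PointsSoA {
    std::vector<float> x, y, z;
};
PointsSoA points_soa;
```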

95% confidence
A

Trigonometric function costs on x86: Hardware x87 (FSIN/FCOS): 50-120 cycles, variable based on input range. Library implementations (libm): typically 50-100 cycles for double precision. Fast approximations (polynomial): 10-30 cycles for single precision, reduced accuracy. SIMD vectorized (Intel SVML): 10 cycles per element for 8 floats. Cost model: use fast approximations when accuracy allows (games, graphics). Library sin/cos: assume 80 cycles per call. For bulk computation: use SIMD versions (vectorize the loop). sincos() computes both for price of one (100 cycles). Avoid sin/cos in tight loops if possible; precompute lookup tables.

95% confidence
A

TLB miss penalty ranges from 20-150 cycles depending on page table depth and caching. Best case (page table in L2): 20-40 cycles. Typical case: 40-80 cycles. Worst case (page table in DRAM): 100-150+ cycles. x86-64 uses 4-level page tables, requiring up to 4 memory accesses. With virtualization (nested page tables): 6x more lookups, penalty can exceed 500 cycles. Mitigation: use huge pages (2MB/1GB) to reduce TLB pressure. TLB typically holds 1500-2000 4KB page entries or 32-64 huge page entries.

95% confidence
A

Prefetch distance formula: distance = (cache_miss_latency * IPC) / instructions_per_element. Or simpler: distance = miss_latency_cycles / cycles_per_iteration. Example: L3 miss = 200 cycles, loop body = 25 cycles, distance = 200/25 = 8 iterations ahead. For arrays: prefetch &array[i + distance] while processing array[i]. Tuning: measure actual cache miss rate. Start with distance = 8-16, adjust based on profiling. Too short: still waiting for data. Too long: prefetched data evicted before use. Bandwidth consideration: don't prefetch faster than memory bandwidth allows.
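A hedged C++ sketch of the pattern (the distance of 8 follows the 200/25 example; on a simple sequential stream like this the hardware prefetcher already does the job, so treat it as illustrative):

```cpp
#include <cstddef>
#include <xmmintrin.h>   // _mm_prefetch, _MM_HINT_T0

void scale(float* a, std::size_t n, float k) {
    constexpr std::size_t DIST = 8;  // miss_latency / cycles_per_iteration (assumption)
    for (std::size_t i = 0; i < n; ++i) {
        if (i + DIST < n)
            _mm_prefetch(reinterpret_cast<const char*>(&a[i + DIST]), _MM_HINT_T0);
        a[i] *= k;   // work on the current element while the prefetch is in flight
    }
}
```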

95% confidence
A

An integer ADD instruction has a latency of 1 cycle on modern x86 CPUs (Intel and AMD). The throughput is typically 3-4 operations per cycle due to multiple execution units. This means while each individual ADD takes 1 cycle to produce its result, the CPU can execute 3-4 independent ADDs simultaneously. For dependency chains, count 1 cycle per ADD. For independent operations, divide total ADDs by 3-4 for approximate cycle count.

95% confidence
A

Unaligned access cost on modern x86: Within cache line: 0 extra cycles (handled by hardware). Crossing cache line boundary: +4-10 cycles (two cache accesses needed). Crossing page boundary: very expensive, potentially 100+ cycles (two TLB lookups). SIMD unaligned loads (VMOVDQU vs VMOVDQA): historically 1-2 cycles more, now essentially free on Haswell+. Cost model: ensure arrays are aligned to 16/32/64 bytes for SIMD. For structures with mixed sizes, use alignas() or manual padding. Unaligned atomics may not be atomic (split across cache lines). Rule: align to max(element_size, SIMD_width) for best performance.

95% confidence
A

Cache line size is 64 bytes on x86 (Intel/AMD) and 128 bytes on Apple M-series. Memory is transferred in whole cache lines - accessing 1 byte loads 64 bytes. Performance implications: (1) Sequential access: effectively 64 bytes per cache miss. (2) Strided access: if stride >= 64 bytes, every access is a cache miss. (3) False sharing: two threads modifying different variables on same cache line cause ping-pong. Cost model: array traversal with stride S bytes: cache_misses = array_size / max(S, 64). For stride 8 (double): miss every 8 elements. For stride 64+: miss every element.

95% confidence
A

Memory hierarchy cost ratios (normalized to L1=1): L1 hit: 1x (4 cycles baseline). L2 hit: 3x (12 cycles). L3 hit: 10x (40 cycles). DRAM: 50-75x (200-300 cycles). NUMA remote: 75-150x (300-600 cycles). Bandwidth ratios: L1: 64-128 bytes/cycle. L2: 32-64 bytes/cycle (0.5x L1). L3: 16-32 bytes/cycle (0.25x L1). DRAM: 10-20 bytes/cycle shared (0.1-0.15x L1). Cost model rule of thumb: each cache level is 3-4x slower than previous. Missing all caches is 50-100x slower than L1 hit. Design data structures to maximize L1/L2 hits - the latency difference is dramatic.

95% confidence
A

Little's Law: Concurrency = Throughput * Latency. For memory systems: Outstanding_requests = Bandwidth * Latency. Example: to achieve 50 GB/s with 100ns latency: Requests = (50e9 bytes/s) * (100e-9 s) / 64 bytes = 78 cache lines in flight. A single core typically supports 10-12 outstanding L1D misses. To saturate memory bandwidth, need multiple cores or threads. Cost model: max_single_thread_bandwidth = max_outstanding_misses * cache_line_size / memory_latency. With 10 misses, 64B lines, 100ns: 10 * 64 / 100ns = 6.4 GB/s per thread (far below 50 GB/s peak). This explains why single-threaded code rarely achieves peak memory bandwidth.
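The per-thread bandwidth bound as a one-liner (C++ sketch; bytes per nanosecond is numerically GB/s):

```cpp
// Max single-thread bandwidth = outstanding_misses * line_size / memory_latency.
double max_thread_bw_gbs(double outstanding_misses, double line_bytes,
                         double latency_ns) {
    return outstanding_misses * line_bytes / latency_ns;  // bytes/ns == GB/s
}
// max_thread_bw_gbs(10, 64, 100) == 6.4
```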

95% confidence
A

malloc/free cost varies by size and allocator: Small allocation (<256 bytes): 20-100 cycles with modern allocators (tcmalloc, jemalloc). Medium allocation (<1MB): 100-500 cycles. Large allocation (>1MB, uses mmap): 1000+ cycles, involves syscall. Multithreaded overhead: glibc ptmalloc has lock contention; tcmalloc/jemalloc use per-thread caches. Google data: malloc consumes ~7% of datacenter CPU cycles. Cost model: minimize allocations in hot paths. Use object pools, stack allocation, or arena allocators. Reusing buffers eliminates allocation overhead entirely.

95% confidence
A

L1 cache access latency is 4-5 cycles on modern Intel and AMD processors (approximately 1-1.5 nanoseconds at 3-4 GHz). L1 data cache (L1D) is typically 32-48KB per core. L1 instruction cache (L1I) is typically 32KB per core. The L1 cache can sustain 2 loads and 1 store per cycle. For cost estimation: assume 4 cycles for any memory access that hits L1. This is your best-case memory latency and baseline for comparison with other cache levels.

95% confidence
A

CMOV vs branch tradeoffs: CMOV: 1-2 cycles latency, always executed (no prediction). Branch: 1 cycle if correctly predicted, 15-20 cycles if mispredicted. Break-even point: if branch prediction accuracy < 85-90%, CMOV wins. If accuracy > 95%, branch is typically faster. Cost model: branch_cost = 1 + (1 - prediction_rate) * misprediction_penalty. CMOV_cost = 2 cycles (always). Example: 75% predictable branch: 1 + 0.25 * 17 = 5.25 cycles average (CMOV wins). 99% predictable branch: 1 + 0.01 * 17 = 1.17 cycles (branch wins). Compiler heuristics: often conservative with CMOV. Use profile-guided optimization to help compiler choose correctly.
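The break-even comparison as code (C++ sketch; the 17-cycle penalty and 2-cycle CMOV cost are the figures used above):

```cpp
// Expected cost of a conditional branch vs. an always-executed CMOV.
double branch_cost(double prediction_rate, double mispredict_penalty = 17.0) {
    return 1.0 + (1.0 - prediction_rate) * mispredict_penalty;
}
constexpr double kCmovCost = 2.0;
// branch_cost(0.75) == 5.25  -> CMOV wins
// branch_cost(0.99) == 1.17  -> branch wins
```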

95% confidence
A

SIMD permute costs vary by type: In-lane permute (within 128-bit lane): 1 cycle latency, 1/cycle throughput. Examples: PSHUFB, PSHUFD. Cross-lane permute (across 128-bit boundaries): 3 cycles latency, 1/cycle throughput. Examples: VPERMPS, VPERMD. Full permute (AVX-512 VPERMB): 3-6 cycles. Cost model for data rearrangement: prefer in-lane shuffles (1 cycle) over cross-lane (3 cycles). Two in-lane shuffles often faster than one cross-lane if possible. VPERMPS for float permutation: 3-cycle latency, handles arbitrary 8-element permutation in AVX2.

95% confidence
A

Context switch direct cost: 1-3 microseconds (3,000-10,000 cycles at 3 GHz). Components: (1) Save registers and state: 100-200 cycles. (2) Scheduler decision: 200-500 cycles. (3) Load new page tables: 100-200 cycles. (4) Restore registers: 100-200 cycles. (5) Pipeline refill: 10-50 cycles. Indirect costs dominate: TLB flush (100+ cycles per subsequent miss), cache pollution (potentially 1000s of cycles to rewarm). With virtualization: 2.5-3x more expensive. Rule of thumb: budget 30 microseconds total cost including indirect effects.

95% confidence
A

Page fault costs: Minor fault (page in memory, just needs mapping): 1-10 microseconds (3,000-30,000 cycles). Major fault (page must be read from disk): 1-10 milliseconds for SSD (3-30 million cycles). HDD: 10-100 milliseconds (30-300 million cycles). Kernel overhead per fault: ~1000-5000 cycles for trap handling. Cost model: first touch of allocated memory triggers minor faults. mmap'd file access triggers major faults if not cached. For performance: pre-fault memory with memset, use madvise(MADV_WILLNEED) for file mappings. Huge pages reduce fault frequency by 512x (2MB vs 4KB pages). Detection: perf stat -e page-faults shows fault count.

95% confidence
A

Atomic CAS (LOCK CMPXCHG) cost varies dramatically by contention: Uncontended (exclusive cache line): 12-40 cycles. Intel: ~35 cycles on Core 2, ~20 cycles on Skylake. AMD: ~12-15 cycles on Zen. Contended (cache line bouncing): 70-200+ cycles per operation due to cache coherency traffic. Highly contended: effectively serializes, can be 100x slower than uncontended. For cost estimation: budget 20-40 cycles for uncontended atomics, but design to minimize contention. A contended atomic is worse than a mutex in most cases.

95% confidence
A

File I/O cost model: Syscall overhead (read/write): 500-1500 cycles per call. Disk latency (SSD): 50-200 microseconds (150,000-600,000 cycles). Disk latency (HDD): 5-10 milliseconds (15,000,000-30,000,000 cycles). SSD throughput: 500MB/s - 7GB/s depending on interface (SATA vs NVMe). Cost model: small_read_time = syscall + disk_latency + size/bandwidth. For many small reads: syscall overhead dominates. For large sequential reads: bandwidth dominates. Optimization: batch small reads, use mmap for random access, use async I/O. Buffered I/O (fread): fewer syscalls but memory copy overhead. For maximum throughput: use direct I/O with aligned buffers, multiple in-flight requests.

95% confidence
A

SIMD operations process multiple elements with similar latency to scalar: SSE (128-bit): 4 floats or 2 doubles per instruction. Same latency as scalar (3-4 cycles for add/mul). AVX (256-bit): 8 floats or 4 doubles per instruction. Same latency, double throughput vs SSE. AVX-512 (512-bit): 16 floats or 8 doubles. May cause frequency throttling on some CPUs. Cost model: SIMD_time = Scalar_time / Vector_width (ideal). Real speedup typically 2-6x due to overhead, alignment, and remainder handling. For N elements: cycles = (N / vector_width) * reciprocal_throughput for independent operations (throughput bound), or (N / vector_width) * latency for a dependent chain (latency bound).

95% confidence
A

Branch misprediction penalty is 15-20 cycles on modern Intel and AMD processors. Intel Skylake: approximately 16-17 cycles. AMD Zen 1/2: approximately 19 cycles. When mispredicted, the pipeline must be flushed and refilled from the correct target. For cost estimation: if branch prediction accuracy is P, average cost = (1-P) * 15-20 cycles per branch. Random branches (50% predictable) cost about 8-10 cycles on average. Well-predicted branches (>95%) cost <1 cycle average. Avoid data-dependent branches in hot loops.

95% confidence
A

Thread creation/destruction costs: Linux pthread_create: 10-50 microseconds (30,000-150,000 cycles). Windows CreateThread: 20-100 microseconds. Thread destruction: similar cost to creation. Stack allocation (typically 1-8MB): may cause page faults if touched. Cost model: thread_overhead = creation_time + work_time + destruction_time. Break-even: work must exceed 100 microseconds to justify thread creation. For short tasks: use thread pool to amortize creation cost. Pool dispatch overhead: ~1-5 microseconds (mutex + condition variable). Maximum useful threads: typically number_of_cores for CPU-bound work, more for I/O-bound. Oversubscription causes context switch overhead (30 microseconds each).

95% confidence
A

L3 cache access latency is 30-50 cycles on modern CPUs (approximately 10-20 nanoseconds). L3 is shared across all cores, typically 8-96MB total. Accessing a local L3 slice takes about 20 cycles, while accessing a remote slice (owned by another core) can take 40+ cycles. For cost estimation: L3 hit adds 25-45 cycles over L1 baseline. This is 5-10x slower than L1. If working set exceeds L2 but fits in L3, expect average latency around 35-40 cycles.

95% confidence
A

Memory bandwidth formula: Required_BW = (bytes_read + bytes_written) * iterations / time. For array sum of N floats: reads 4N bytes, no writes. At 50 GB/s bandwidth: minimum time = 4N / 50e9 seconds. Compare to compute: N additions at 8 adds/cycle on 3GHz = N/24e9 seconds. Memory-bound when 4N/50e9 > N/24e9, simplifying: 4/50 > 1/24, or 0.08 > 0.04 - always true for simple operations. Rule of thumb: operation is memory-bound if arithmetic intensity < 10-15 FLOPS/byte (the 'ridge point').

95% confidence
A

Division by constant is optimized by compilers: Compile-time constant divisor: replaced with multiply + shift, ~4-6 cycles. Power of 2: becomes shift, 1 cycle. Runtime variable divisor: full division, 26-40+ cycles. Savings: 5-10x faster for constant divisors. Cost model for repeated division: If dividing N values by same runtime divisor: precompute reciprocal once, then N multiplications. libdivide library automates this. Single division: full cost unavoidable. Modulo by constant: also optimized, same ~4-6 cycles. Example: x / 7 becomes (x * 0x2492492...) >> 32, about 4 cycles. Use const or constexpr for divisors when possible to enable optimization.
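A small C++ sketch showing what the compiler can and cannot optimize (illustrative function names):

```cpp
#include <cstdint>

// Divisor visible at compile time: emitted as multiply-by-reciprocal + shift.
uint32_t div7(uint32_t x) { return x / 7u; }    // ~4-6 cycles, no DIV
// Power-of-two divisor: a single shift.
uint32_t div8(uint32_t x) { return x >> 3; }    // 1 cycle
// Runtime divisor: a full DIV every call. If the same divisor is reused many
// times, precompute a reciprocal (or use a library such as libdivide).
uint32_t div_n(uint32_t x, uint32_t n) { return x / n; }
```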

95% confidence
A

Memory bandwidth = Memory_Clock * 2 (DDR) * Bus_Width * Channels / 8. Example for DDR4-3200 dual channel: 1600 MHz * 2 * 64 bits * 2 channels / 8 = 51.2 GB/s. DDR5-6400 dual channel: 3200 * 2 * 64 * 2 / 8 = 102.4 GB/s. Real-world efficiency is 70-90% of theoretical maximum due to refresh cycles, command overhead, and access patterns. Single-threaded code typically achieves only 15-30% of peak bandwidth due to limited memory-level parallelism (need ~64 outstanding requests to saturate bandwidth).
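The formula as a helper (C++ sketch; returns GB/s, inputs mirror the text):

```cpp
// Theoretical DDR bandwidth: clock_MHz * 2 (DDR) * bus_bits * channels / 8 bytes, in GB/s.
double ddr_bandwidth_gbs(double memory_clock_mhz, int bus_width_bits, int channels) {
    return memory_clock_mhz * 2.0 * bus_width_bits * channels / 8.0 / 1000.0;
}
// ddr_bandwidth_gbs(1600, 64, 2) == 51.2   (DDR4-3200 dual channel)
// ddr_bandwidth_gbs(3200, 64, 2) == 102.4  (DDR5-6400 dual channel)
```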

95% confidence
A

AVX-512 instructions can cause CPU frequency reduction: Light AVX-512 (most instructions): 0-100 MHz reduction. Heavy AVX-512 (FMA, multiply): 100-200 MHz reduction. On Skylake-X: ~0.85x base frequency for heavy AVX-512. Throttling transition: 10-20 microseconds on early Xeon, near-zero on 3rd gen Xeon. Cost model: if AVX-512 section is short, transition overhead may exceed benefit. Break-even: typically need 100+ microseconds of AVX-512 work. Non-AVX code after AVX-512 runs at reduced frequency until transition back. Consider AVX2 for short bursts.

95% confidence
A

Gather/scatter is much slower than contiguous access: Contiguous vector load: 4-7 cycles from L1 cache for 256-512 bits. Gather (AVX2 VGATHERPS): ~32 cycles for 8 elements from L1 - approximately 8x slower. Each gather element is a separate cache access. AVX-512 gather: improved to ~1.2-1.8x speedup vs scalar in best cases. Cost model: gather_cycles = elements * L1_latency (approximately). Only use gather when data is truly scattered. For rearrangement, prefer: load contiguous + permute (3 cycles) over gather. Scatter (AVX-512 only): similar cost to gather, avoid when possible.

95% confidence
A

Pointer chasing is latency-bound because each load depends on previous: If list fits in L1: ~4 cycles per node. If list fits in L2: ~12 cycles per node. If list fits in L3: ~40 cycles per node. If list exceeds L3: ~200 cycles per node (DRAM latency). Total cost = nodes * memory_latency_at_that_cache_level. Example: 1000-node list in DRAM = 200,000 cycles = 67 microseconds at 3 GHz. Optimization: convert to array when possible (sequential access is 10-50x faster). Software prefetching rarely helps (can't prefetch next without completing current load). B-trees beat linked lists precisely because of this cost model.
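A minimal C++ illustration of why traversal is latency-bound: the address of each load comes out of the previous load, so nothing overlaps:

```cpp
struct Node {
    Node* next;
    long  value;
};

// One fully dependent load per node: total time ~= node_count * memory_latency.
long sum_list(const Node* n) {
    long sum = 0;
    for (; n != nullptr; n = n->next)   // next address unknown until this load completes
        sum += n->value;
    return sum;
}
```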

95% confidence
A

Mutex lock/unlock cost depends on contention: Uncontended (fast path): 15-25 nanoseconds (45-75 cycles). Uses single atomic instruction, no syscall. Contended (must wait): 1-10 microseconds (3,000-30,000 cycles). Includes context switch if blocking. Heavily contended: can serialize to 100x slower than uncontended. Linux futex fast path: ~25ns. Windows CRITICAL_SECTION: ~23ns. For cost estimation: uncontended mutex adds ~50-100 cycles per lock/unlock pair. Design to minimize contention; use lock-free structures or fine-grained locking for hot paths.

95% confidence
A

Latency-bound: each operation depends on the previous one, limited by instruction latency. Example: sum += array[i] - each add waits for the previous add (3-4 cycles each). Rate: 1 add per 4 cycles = 0.25 adds/cycle. Throughput-bound: operations are independent, limited by execution units. Example: eight independent partial sums merged at the end. Rate: 2 adds/cycle (using both FP add units). Speedup: up to 8x, since covering a 4-cycle latency at 2 adds/cycle takes 8 accumulators. Cost model: take the maximum of dependency_chain_length * latency and total_ops / throughput. Optimization: unroll loops with multiple accumulators to convert latency-bound code to throughput-bound. Typical improvement: 4-8x for reduction operations.

95% confidence
A

Memory barrier/fence costs on x86: MFENCE (full barrier): 30-100 cycles on Intel, varies by microarchitecture. SFENCE (store barrier): 5-30 cycles. LFENCE (load barrier): near zero cycles (serializing, mainly for timing). LOCK prefix (implicit barrier): 12-35 cycles (same as atomic operation). On x86, most barriers are cheap due to strong memory model - stores already have release semantics. Cost model: MFENCE dominates when used; avoid in tight loops. std::atomic with sequential consistency uses MFENCE on stores. Prefer acquire/release semantics when possible (free on x86). ARM/PowerPC: barriers are more expensive (weaker memory model), 50-200 cycles typical.

95% confidence
A

Integer absolute value costs: Branching (if x < 0) x = -x: 1 cycle if predicted, 15-20 if mispredicted. Branchless using CMOV: 2-3 cycles. Branchless arithmetic: (x ^ (x >> 31)) - (x >> 31) = 3 cycles for 32-bit. SIMD (PABSD): 1 cycle for 8 int32s in AVX2. Cost model: for random signs (50/50), branchless wins. For mostly positive or mostly negative, branch may win due to prediction. Unsigned has no absolute value - already positive. Floating point: ANDPS with sign mask = 1 cycle (just clears sign bit). For bulk processing: always use SIMD PABS instructions (1 cycle per 8-16 elements).
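The branchless 32-bit form from above as compilable C++ (sketch; like std::abs, it overflows for INT32_MIN):

```cpp
#include <cstdint>

int32_t abs_branchless(int32_t x) {
    const int32_t mask = x >> 31;   // 0 for non-negative, -1 for negative (arithmetic shift)
    return (x ^ mask) - mask;       // flips the bits and adds 1 only when negative
}
```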

95% confidence
A

Cache miss costs (cycles added over L1 hit baseline of 4 cycles): L1 miss, L2 hit: +8-10 cycles (total ~12-14). L1+L2 miss, L3 hit: +26-36 cycles (total ~30-40). L1+L2+L3 miss, DRAM: +150-250 cycles (total ~160-260). Cost model for array access: if array_size <= 32KB (L1): 4 cycles/access. if array_size <= 256KB (L2): mix of 4 and 12 cycles. if array_size <= 8MB (L3): mix including ~40 cycle accesses. if array_size > L3: approaching 200+ cycles for random access. Sequential access amortizes: one miss per 64 bytes (8 doubles), so effective cost = miss_latency / 8 per element.

95% confidence
A

Sorting cost model (comparison-based): Comparisons: O(N log N), typically 1.5N log N for quicksort, 1.0N log N for mergesort. Memory accesses: quicksort mostly in-place, mergesort needs N extra space. Cache behavior: quicksort better cache locality, mergesort more predictable. At scale (N > L3/sizeof(element)): Quicksort: ~2N log N * 200 cycles (random access). Mergesort: more sequential, ~1.5N log N * 40 cycles (better prefetching). Radix sort (non-comparison): O(N * key_width), better for integers. Example: sorting 1M 64-bit integers. std::sort: ~20-30 million comparisons (N log2 N is about 20M), ~200ms. Radix sort: 8 passes * 1M = 8M operations, ~50ms. Rule: radix sort wins for large arrays of fixed-size keys.

95% confidence
A

Type conversion costs on x86: int to float (CVTSI2SS): 4-6 cycles latency. float to int (CVTTSS2SI): 4-6 cycles latency. int32 to int64 (sign extend): 1 cycle. float to double: 1 cycle (just register renaming on modern CPUs). double to float: 4 cycles (rounding). int64 to float: 4-6 cycles. Narrowing conversions may have additional checking cost in safe languages. Cost model: conversions are cheap relative to memory access but expensive relative to basic ALU. Avoid conversions in inner loops when possible. SIMD conversion: 8 int32 to 8 float in 3 cycles (VCVTDQ2PS). Bulk data: convert once, operate in target format.

95% confidence
A

Memory operation costs: memset (REP STOSB optimized): 32-64 bytes per cycle for large sizes. memcpy (REP MOVSB optimized): 16-32 bytes per cycle for large sizes. Small sizes (<64 bytes): 10-30 cycles overhead. SIMD explicit copy: can achieve 32-64 bytes per cycle from L1. Cost model: Large memcpy (N bytes) from L1: N / 32 cycles. From L3: N / 16 cycles (limited by L3 bandwidth). From DRAM: N / (DRAM_bandwidth_bytes_per_cycle). Example: memcpy of 1MB from DRAM at 50 GB/s: 1MB / 17 bytes-per-cycle = 62,000 cycles = 20 microseconds at 3 GHz. For small copies: inline code often faster than function call overhead. memmove (handles overlap): similar cost to memcpy.

95% confidence
A

Use arithmetic intensity (AI) = FLOPs / Bytes transferred. Ridge point = Peak_FLOPS / Peak_Bandwidth. If AI < ridge_point: memory-bound. If AI > ridge_point: compute-bound. Modern x86 example: Peak = 100 GFLOPS, Bandwidth = 50 GB/s, Ridge = 2 FLOPS/byte. For dot product (2 FLOPs per 8 bytes loaded): AI = 0.25 < 2, memory-bound. For matrix multiply (2N FLOPs per element, amortized): AI can exceed ridge with blocking. Cost model: Memory_bound_time = Bytes / Bandwidth. Compute_bound_time = FLOPs / Peak_FLOPS. Actual_time = max(memory_time, compute_time).

95% confidence
A

False sharing occurs when threads modify different variables on the same cache line. Cost: 40-200 cycles per access instead of 4 cycles (10-50x slowdown). The cache line ping-pongs between cores via coherency protocol. Each modification invalidates other cores' copies. Measured impact: 25-50% performance degradation typical, can be 10x in extreme cases. Detection: look for high cache coherency traffic in perf counters (HITM events). Prevention: pad shared structures to cache line boundaries (64 bytes on x86, 128 bytes on ARM). In C++: alignas(64) or compiler-specific attributes.
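A common padding pattern in C++ (sketch; 64-byte lines assumed, per the text):

```cpp
#include <atomic>
#include <cstdint>

// alignas(64) forces each counter onto its own cache line (sizeof becomes a
// multiple of the alignment), so per-thread increments stop ping-ponging lines.
struct alignas(64) PaddedCounter {
    std::atomic<uint64_t> value{0};
};

PaddedCounter per_thread_counters[8];   // one slot per thread, no false sharing
```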

95% confidence
A

Integer multiplication (IMUL) has a latency of 3-4 cycles on modern Intel and AMD CPUs. On Intel Skylake and later, 64-bit IMUL has 3-cycle latency with throughput of 1 per cycle. The 128-bit result (RDX:RAX) from MUL r64 has 3-cycle latency for the low 64 bits (RAX) and 4-cycle latency for the high 64 bits (RDX). AMD Zen processors show similar characteristics. For cost estimation: count 3 cycles per dependent multiply, or total_multiplies / throughput for independent operations.

95% confidence
A

Main memory (DRAM) access latency is 150-300+ cycles (50-100 nanoseconds) on modern systems. This is approximately 50-100x slower than L1 cache. At 3 GHz, 100ns equals 300 cycles. Latency components: L3 miss detection (10-20 cycles) + memory controller (10-20 cycles) + DRAM access (40-60ns). For cost estimation: every cache miss that goes to DRAM costs roughly 200 cycles. A random access pattern in array larger than L3 will average close to DRAM latency.

95% confidence
A

Execution time estimation: time = max(compute_time, memory_time, branch_time). Compute_time = sum(instruction_count[i] * cycles[i]) / IPC. Memory_time = cache_misses * miss_penalty / memory_level_parallelism. Branch_time = mispredictions * misprediction_penalty. Simplified model: total_cycles = instructions / IPC + L3_misses * 200 + mispredictions * 17. Example: 1M instructions at IPC=2, 1000 L3 misses, 5000 mispredictions. Cycles = 1M/2 + 1000 * 200 + 5000 * 17 = 500K + 200K + 85K = 785K cycles. Time at 3 GHz = 262 microseconds. Validate with: perf stat -e cycles,instructions,cache-misses,branch-misses. Adjust the model based on measured vs predicted discrepancies.
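The simplified model as a function (C++ sketch, constants taken from the text):

```cpp
// total_cycles = instructions / IPC + L3_misses * 200 + mispredictions * 17
double estimate_cycles(double instructions, double ipc,
                       double l3_misses, double branch_mispredicts) {
    return instructions / ipc + l3_misses * 200.0 + branch_mispredicts * 17.0;
}
// estimate_cycles(1e6, 2.0, 1000, 5000) == 785000 cycles  (~262 us at 3 GHz)
```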

95% confidence
A

Log and exp function costs: Library exp/log (double): 50-100 cycles typically. Fast approximations: 15-30 cycles with reduced precision. SIMD (Intel SVML): ~10-15 cycles per element vectorized. pow(x,y) = exp(y * log(x)): ~200 cycles (two transcendentals). exp2/log2: slightly faster than exp/ln (base matches hardware). Cost model: transcendental = ~80 cycles rule of thumb. For bulk computation: always vectorize with SIMD libraries. Integer power (x^n for integer n): use exponentiation by squaring, log2(n) multiplies = ~10 cycles for typical n. Avoid pow() for integer exponents - use x*x*x or a multiplication loop.

95% confidence
A

Integer division is expensive: 32-bit DIV takes 26 cycles on Intel Skylake, and 8-bit DIV takes about 25 cycles on Coffee Lake. 64-bit division can take 35-90 cycles depending on operand values. AMD processors are generally faster: Zen 2/3 significantly improved division performance. Throughput is very low (one division every 6+ cycles). For cost estimation: assume 26-40 cycles per 32-bit division in dependency chains. Avoid division in hot loops; use multiplication by reciprocal when divisor is constant.

95% confidence
A

IPC varies dramatically by workload: Compute-bound vectorized code: 2-4 IPC (utilizing multiple execution units). Integer-heavy general code: 1-2 IPC. Memory-bound code: 0.2-0.5 IPC (waiting for cache misses). Branch-heavy code: 0.5-1 IPC (misprediction stalls). Database workloads: often 0.3-0.7 IPC (random memory access). Modern CPUs can retire 4-6 instructions per cycle theoretically. Rule of thumb: IPC < 0.7 indicates optimization opportunity (likely memory-bound). Measure with: perf stat -e instructions,cycles your_program. IPC = instructions / cycles.

95% confidence