pattern_transformations 75 Q&As

Pattern Transformations FAQ & Answers

75 expert Pattern Transformations answers researched from official documentation. Every answer cites authoritative sources you can verify.

Pattern Transformations

75 questions
A

BEFORE: uint32_t hash = 0; for(i=0;i<len;i++) hash = hash*31 + data[i];. AFTER (process 8 bytes per step): Use the CRC32 instruction: for(i=0;i<len;i+=8) hash = (uint32_t)_mm_crc32_u64(hash, *(const uint64_t*)&data[i]);. Or xxHash/MurmurHash3-style SIMD: process 32-byte blocks with AVX2, vectorized multiplication and mixing. Example (simplified xxHash-like): __m256i acc = seed_vec; for(i=0;i<len;i+=32) { __m256i block = _mm256_loadu_si256((const __m256i*)&input[i]); acc = _mm256_add_epi64(acc, _mm256_mul_epu32(block, prime_vec)); acc = _mm256_xor_si256(acc, _mm256_srli_epi64(acc, 17)); }. Speedup: 5-10x. Modern hash functions (xxHash3, wyhash) achieve >10GB/s using SIMD.

95% confidence
A

BEFORE: for(i=0; i<n; i++) arr[i] = 0;. AFTER: memset(arr, 0, n * sizeof(*arr)); or SIMD: __m256i zero = _mm256_setzero_si256(); for(i=0; i<n; i+=8) _mm256_storeu_si256((__m256i*)&arr[i], zero);. For large zeroing (>1MB), use non-temporal stores: _mm256_stream_si256. Special case: calloc() may get zero pages from OS without memset (lazy allocation). For non-zero patterns: __m256i pattern = _mm256_set1_epi32(value);. Speedup: Similar to memcpy, 3-5x over naive loop. The compiler may optimize arr = {} or std::fill to memset internally. For partial zeroing of structs, use = {} initialization which compilers optimize well.

95% confidence
A

BEFORE (linear): sum = 0; for(i=0; i<n; i++) sum += a[i]; (serial dependency chain, n iterations). AFTER (tree reduction): Step 1: Parallel pairwise sum: b[i] = a[2i] + a[2i+1] for i in [0,n/2). Step 2: Repeat on b until single element. In SIMD: __m256 acc = _mm256_loadu_ps(arr); for(i=8; i<n; i+=8) acc = _mm256_add_ps(acc, _mm256_loadu_ps(&arr[i])); then reduce 8->4->2->1. Tree depth is log2(n) vs n for linear. Speedup: For 1M elements, log2(1M)=20 steps with max parallelism vs 1M serial adds. Practical speedup: 4-8x with SIMD, even more on GPU. All parallel reduction algorithms (MPI_Reduce, CUDA reduction kernels) use tree structure.
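
A minimal compilable sketch of the vertical-accumulate / tree-reduce pattern described above (assumes AVX2 and n a multiple of 8; the function name sum_avx2 is illustrative):

#include <immintrin.h>
#include <stddef.h>

/* Sum n floats: one vertical accumulator in the loop, then a single
   8 -> 4 -> 2 -> 1 tree reduction at the end. */
static float sum_avx2(const float *arr, size_t n) {
    __m256 acc = _mm256_setzero_ps();
    for (size_t i = 0; i < n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(&arr[i]));
    __m128 lo = _mm256_castps256_ps128(acc);
    __m128 hi = _mm256_extractf128_ps(acc, 1);
    __m128 s  = _mm_add_ps(lo, hi);                 /* 4 partial sums */
    s = _mm_add_ps(s, _mm_movehl_ps(s, s));         /* 2 partial sums */
    s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 1));     /* final sum in lane 0 */
    return _mm_cvtss_f32(s);
}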

95% confidence
A

BEFORE: for(i=0;i<N;i++) for(j=0;j<N;j++) for(k=0;k<N;k++) C[i][j] += A[i][k] * B[k][j];. AFTER: #define BLOCK 64 for(ii=0;ii<N;ii+=BLOCK) for(jj=0;jj<N;jj+=BLOCK) for(kk=0;kk<N;kk+=BLOCK) for(i=ii;i<ii+BLOCK;i++) for(j=jj;j<jj+BLOCK;j++) for(k=kk;k<kk+BLOCK;k++) C[i][j] += A[i][k] * B[k][j];. Block size chosen so 3 blocks fit in L1 cache: 3*64*64*8 bytes = 96KB for doubles (too large), use BLOCK=32 for 24KB. Speedup: 2-10x for large matrices. The reordered loops ensure A[i][k:k+BLOCK] and B[k:k+BLOCK][j] remain in cache. Further optimize with SIMD, unrolling inner loops, and prefetching. This is the basis of BLAS Level 3 operations.
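
A compilable sketch of the blocked multiply under the assumptions above (BLOCK=32 for doubles, N a multiple of BLOCK). Note the inner loops here use the i-k-j order so the innermost loop is unit-stride, a common refinement of the i-j-k form shown above:

#include <stddef.h>

#define BLOCK 32   /* 3 * 32 * 32 * 8 bytes = 24KB, fits a typical 32KB L1d */

/* C += A * B for N x N row-major matrices of doubles. */
static void matmul_blocked(size_t N, const double *A, const double *B, double *C) {
    for (size_t ii = 0; ii < N; ii += BLOCK)
        for (size_t kk = 0; kk < N; kk += BLOCK)
            for (size_t jj = 0; jj < N; jj += BLOCK)
                for (size_t i = ii; i < ii + BLOCK; i++)
                    for (size_t k = kk; k < kk + BLOCK; k++) {
                        double a = A[i * N + k];
                        for (size_t j = jj; j < jj + BLOCK; j++)   /* unit-stride inner loop */
                            C[i * N + j] += a * B[k * N + j];
                    }
}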

95% confidence
A

BEFORE: for(i=0; i<n; i++) { sum += data[indices[i]]; } (random access pattern). AFTER: Step 1: Sort indices with payload, Step 2: Access sequentially, Step 3: Unsort if needed. Or use prefetching: for(i=0; i<n; i++) { __builtin_prefetch(&data[indices[i+8]], 0, 1); sum += data[indices[i]]; }. For GPU: restructure to ensure threads in a warp access consecutive addresses. BEFORE (GPU): val = data[threadIdx.x * stride]; AFTER: val = data[blockIdx.x * blockDim.x + threadIdx.x];. Speedup: 5-50x depending on access pattern. Random access achieves ~1% of sequential bandwidth due to cache line waste (load 64 bytes, use 4). Sorting indices can provide 3-10x speedup even with sort overhead for large datasets.

95% confidence
A

BEFORE: int count = 0; while(x) { count += x & 1; x >>= 1; } (32 iterations worst case). AFTER: Use hardware instruction via __builtin_popcount(x) or POPCNT instruction directly. Without hardware: int count = x - ((x >> 1) & 0x55555555); count = (count & 0x33333333) + ((count >> 2) & 0x33333333); count = (count + (count >> 4)) & 0x0f0f0f0f; count = (count * 0x01010101) >> 24;. SIMD: _mm_popcnt_u64 or vpshufb lookup table for bytes then sum. Speedup: Loop is 100+ cycles, POPCNT is 1 cycle (3 cycle latency). The bit manipulation version is ~12 cycles. Enable POPCNT with -mpopcnt or -march=native. Check support: __builtin_cpu_supports("popcnt").

95% confidence
A

BEFORE: uint32_t morton = 0; for(i=0; i<16; i++) morton |= ((x & (1<<i)) << i) | ((y & (1<<i)) << (i+1));. AFTER: Use parallel bit deposit (PDEP) instruction: uint64_t morton = _pdep_u32(x, 0x55555555) | _pdep_u32(y, 0xAAAAAAAA);. Without BMI2: x = (x | (x << 8)) & 0x00FF00FF; x = (x | (x << 4)) & 0x0F0F0F0F; x = (x | (x << 2)) & 0x33333333; x = (x | (x << 1)) & 0x55555555; (same for y, then OR). Speedup: Loop is 64+ operations, PDEP is 1 instruction. Morton codes enable Z-order curves for spatial locality in 2D/3D data, improving cache performance for spatial queries. Check BMI2 support: __builtin_cpu_supports("bmi2").

95% confidence
A

BEFORE: result = a + t * (b - a); (3 ops: sub, mul, add). AFTER: result = fma(t, b, fma(-t, a, a)); or result = fma(t, b - a, a);. Best form: result = a + t * (b - a); let compiler use FMA. With SIMD: __m256 result = _mm256_fmadd_ps(t, _mm256_sub_ps(b, a), a);. Alternative formulation: result = (1-t)*a + t*b; becomes result = fma(t, b, fma(-t, a, a)); for better numerical stability near t=1. The a + t*(b-a) form has better stability near t=0. Speedup: FMA reduces 3 operations to 2, with better precision. For animation, color blending, and physics interpolation. Ensure -ffp-contract=fast or use explicit fma() to guarantee fusion.

95% confidence
A

BEFORE: result = a * b + c; (2 operations: MUL then ADD, potential intermediate rounding). AFTER: result = fma(a, b, c); or use compiler flag -ffp-contract=fast. FMA computes a*b+c in a single instruction with a single rounding. In intrinsics: __m256 r = _mm256_fmadd_ps(a, b, c);. Benefits: (1) Single cycle throughput vs 2 cycles for separate MUL+ADD on Haswell+, (2) Higher precision - no intermediate rounding, (3) 2x FLOPS potential. Variants: fmadd (a*b+c), fmsub (a*b-c), fnmadd (-a*b+c), fnmsub (-a*b-c). Speedup: 1.5-2x for FMA-bound code. Available on x86 since Haswell (2013), ARM since Cortex-A15. Check with: __builtin_cpu_supports("fma").

95% confidence
A

BEFORE: Extract bits at positions defined by mask: uint32_t result = 0, j = 0; for(i=0; i<32; i++) if(mask & (1<<i)) result |= ((src >> i) & 1) << j++;. AFTER: uint32_t result = _pext_u32(src, mask); extracts bits where mask has 1s and packs them contiguously. Example: _pext_u32(0xABCD, 0x0F0F) extracts nibbles B and D, producing 0x00BD. The inverse, PDEP, deposits bits: _pdep_u32(0x00BD, 0x0F0F) produces 0x0B0D. Speedup: Loop is 32+ iterations, PEXT is 1 instruction (3 cycles on Intel). Applications: extracting bit fields, implementing chess move generators, parsing packed formats. Note: PEXT/PDEP are slow on AMD Zen1/2 (~18 cycles), fast on Zen3+ and all Intel.

95% confidence
A

BEFORE: Byte-by-byte state machine checking continuation bytes, overlong encodings, surrogate pairs. AFTER: Use SIMD lookup tables for byte classification. Core algorithm (simdjson approach): 1) Classify each byte (ASCII, 2-byte start, 3-byte start, 4-byte start, continuation). 2) Use _mm256_shuffle_epi8 as 16-entry lookup for byte->class. 3) Compute expected continuation count, compare with actual. 4) Check for overlong encodings and invalid ranges using comparisons. Implementation validates 32-64 bytes per iteration. Speedup: 10-20x. Validating 1GB UTF-8 takes ~50ms with SIMD vs ~800ms scalar. See simdjson and simdutf libraries for production implementations.

95% confidence
A

BEFORE: for(i=1; i<n; i++) prefix[i] = prefix[i-1] + arr[i];. AFTER (SIMD parallel prefix within a 128-bit lane): __m128 x = _mm_loadu_ps(arr); x = _mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 4))); (shift by one float) x = _mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 8))); (shift by two floats). For AVX, do the same in each 128-bit lane, then broadcast the last element of the low lane and add it to the high lane as a cross-lane fix-up. For larger arrays, compute local prefix sums in blocks, then adjust each block by adding the sum of all previous blocks. The Blelloch scan performs O(n) total work and runs in O(n/p + log p) time on p processors. Speedup: 2-4x for SIMD within a block, more with parallelization. Used in stream compaction, sorting, histogram computation.
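
A compilable sketch of the in-register step (SSE, 4 floats) with the casts the byte-shift intrinsics require; this is the building block that the blocked scan applies per chunk:

#include <immintrin.h>

/* Inclusive prefix sum of 4 floats in one __m128 register. */
static __m128 prefix_sum4(__m128 x) {
    /* add the vector shifted up by one float: [0, a0, a1, a2] */
    x = _mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 4)));
    /* add the vector shifted up by two floats: [0, 0, s0, s1] */
    x = _mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 8)));
    return x;  /* lane i now holds a0 + ... + ai */
}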

95% confidence
A

BEFORE: int result = x / 3;. AFTER (unsigned): result = ((uint64_t)x * 0xAAAAAAABULL) >> 33;. AFTER (signed): More complex due to rounding toward zero. The magic constant 0xAAAAAAAB = ceil(2^33 / 3). For x/7: multiply by 0x124924925 and shift right 35 (the magic needs 33 bits, so use a 64-bit multiply). For x/10: multiply by 0xCCCCCCCD and shift right 35. Compilers generate this automatically for constant divisors (inspect assembly!). The technique from Granlund/Montgomery 1994 handles any constant, adding a fix-up step when the magic does not fit in a word. Speedup: Division is 20-90 cycles, multiply-shift is 3-4 cycles (5-20x faster). For repeated division by the same dynamic value, compute the reciprocal once (inv = ((1ULL << 32) + d - 1) / d; result = (x * inv) >> 32;) - note this simple form is only approximate for large x; libraries like libdivide compute an exact runtime magic.
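
The magic constants are easy to verify by brute force; a small harness (standard C, takes a few seconds to sweep all 2^32 inputs; the /7 and /10 constants can be checked the same way):

#include <stdint.h>
#include <stdio.h>

/* Check the multiply-shift form of x/3 against real division for every 32-bit x. */
int main(void) {
    for (uint64_t x = 0; x <= 0xFFFFFFFFull; x++) {
        uint32_t q = (uint32_t)((x * 0xAAAAAAABull) >> 33);
        if (q != (uint32_t)x / 3) {
            printf("mismatch at %llu\n", (unsigned long long)x);
            return 1;
        }
    }
    printf("x/3 magic verified for all 32-bit inputs\n");
    return 0;
}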

95% confidence
A

BEFORE: int factorial(int n) { if(n<=1) return 1; return n * factorial(n-1); }. AFTER: int factorial(int n) { int result=1; while(n>1) { result*=n; n--; } return result; }. For tree traversal: BEFORE: void dfs(Node* n) { if(!n) return; process(n); dfs(n->left); dfs(n->right); }. AFTER: stack<Node*> s; s.push(root); while(!s.empty()) { Node* n=s.top(); s.pop(); if(!n) continue; process(n); s.push(n->right); s.push(n->left); }. Speedup: 1.5-3x from avoiding function call overhead and potential stack overflow. Use explicit stack sized to max expected depth. Tail recursion can be optimized by compiler (-O2), but complex recursion requires manual transformation.

95% confidence
A

BEFORE: int floor_val = (int)floor(x);. AFTER: int floor_val = (int)x - (x < (int)x); handles negative correctly. Or: floor_val = x >= 0 ? (int)x : (int)x - 1;. For SIMD: _mm256_floor_ps then _mm256_cvttps_epi32 (requires SSE4.1/AVX). Faster when already in integer math: floor(a/b) for positive a,b is simply a/b. For ceil: ceil_val = (int)x + (x > (int)x);. Or: ceil(a/b) = (a + b - 1) / b for positive integers. Speedup: floor() function call is 10-20 cycles, cast with adjustment is 2-3 cycles. The SIMD round functions (_mm256_round_ps with _MM_FROUND_FLOOR) are single instructions. Use -ffast-math to allow compiler floor optimization.

95% confidence
A

BEFORE: float rsqrt = 1.0f / sqrtf(x);. AFTER (fast inverse square root): float fast_rsqrt(float x) { int i = *(int*)&x; i = 0x5f3759df - (i >> 1); float y = *(float*)&i; y = y * (1.5f - 0.5f * x * y * y); return y; }. One Newton-Raphson iteration (use memcpy for the type punning to stay within strict-aliasing rules). The magic constant approximates by exploiting IEEE 754 float format. Modern CPUs: _mm_rsqrt_ps provides hardware approximation (~12 bits accuracy), follow with Newton-Raphson for more precision. Speedup: 4x over sqrtf+division. Used in graphics normalization, physics engines. Note: Modern SSE/AVX rsqrt instructions are preferred over the integer trick, as they're faster and more accurate.

95% confidence
A

BEFORE: if(condition) count++;. AFTER: count += condition; or count += (int)(condition != 0);. The boolean expression evaluates to 0 or 1 in C/C++, which directly adds to count. For counting set bits in array: BEFORE: for(i=0;i<n;i++) if(arr[i]) count++;. AFTER: for(i=0;i<n;i++) count += (arr[i] != 0);. SIMD: __m256i mask = _mm256_cmpgt_epi32(arr, zero); count_vec = _mm256_sub_epi32(count_vec, mask); (subtract -1 to add 1 where true). Speedup: 1.5-3x when condition is unpredictable. Compilers often generate this transformation automatically, but explicit form guarantees it. Profile to confirm branch misprediction before optimizing.

95% confidence
A

BEFORE: for(i=0;i<n;i++) hist[data[i]]++;. AFTER (parallel with private histograms): #pragma omp parallel { int local_hist[256] = {0}; #pragma omp for for(i=0;i<n;i++) local_hist[data[i]]++; #pragma omp critical for(j=0;j<256;j++) hist[j] += local_hist[j]; }. SIMD approach: Use conflict detection (_mm512_conflict_epi32) and masked accumulation. Alternative: Sort data first, then count runs (better for SIMD). Speedup: Near-linear with threads for large n. The key insight is avoiding atomic operations on shared histogram by using thread-local copies and merging. For small bin counts (256), merge overhead is negligible.
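
A compilable version of the private-histogram pattern (assumes OpenMP, compile with -fopenmp; the function name histogram256 is illustrative):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Each thread fills its own local histogram, then the copies are merged once. */
void histogram256(const uint8_t *data, size_t n, uint64_t hist[256]) {
    memset(hist, 0, 256 * sizeof(uint64_t));
    #pragma omp parallel
    {
        uint64_t local[256] = {0};
        #pragma omp for nowait
        for (long long i = 0; i < (long long)n; i++)
            local[data[i]]++;
        #pragma omp critical
        for (int j = 0; j < 256; j++)
            hist[j] += local[j];
    }
}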

95% confidence
A

BEFORE: int16_t result = a + b; if(result > 32767) result = 32767; if(result < -32768) result = -32768;. AFTER (SIMD): __m256i result = _mm256_adds_epi16(a, b); for signed, _mm256_adds_epu16 for unsigned. These automatically saturate instead of wrapping. For scalar: int32_t sum = (int32_t)a + b; result = (sum > 32767) ? 32767 : (sum < -32768) ? -32768 : sum; with branchless: int32_t sum = a + b; sum = sum < -32768 ? -32768 : sum; sum = sum > 32767 ? 32767 : sum;. ARM NEON: vqaddq_s16 (saturating add). Speedup: 2-4x with SIMD saturation instructions. Essential for audio (preventing clipping), image processing (pixel clamping), DSP applications.

95% confidence
A

BEFORE: uint32_t next_pow2 = 1; while(next_pow2 < x) next_pow2 *= 2;. AFTER: x--; x |= x >> 1; x |= x >> 2; x |= x >> 4; x |= x >> 8; x |= x >> 16; x++;. This fills all bits below the highest set bit with 1s, then adds 1 to get next power of 2. For round down to power of 2: y = 1 << (31 - __builtin_clz(x)); using leading zero count. Alternative: next_pow2 = 1 << (32 - __builtin_clz(x - 1)); for x > 1. Speedup: Loop is O(log n), bit manipulation is O(1). Use cases: hash table sizing, memory allocator bucket sizes, FFT length requirements. For 64-bit, extend the pattern with x |= x >> 32;.
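
A compilable form of the bit-smearing round-up for 64-bit values (the x==0 case is handled explicitly because the decrement would wrap):

#include <stdint.h>

/* Round x up to the next power of two; returns x unchanged if it already is one. */
static uint64_t next_pow2_u64(uint64_t x) {
    if (x == 0) return 1;
    x--;
    x |= x >> 1;  x |= x >> 2;  x |= x >> 4;
    x |= x >> 8;  x |= x >> 16; x |= x >> 32;
    return x + 1;
}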

95% confidence
A

BEFORE: int ctz = 0; while((x & 1) == 0 && ctz < 32) { x >>= 1; ctz++; } (up to 32 iterations). AFTER: int ctz = __builtin_ctz(x); compiles to BSF (Bit Scan Forward) or TZCNT instruction. TZCNT (BMI1) is preferred: defined for x=0 (returns operand size), constant latency. BSF has undefined result for x=0. For 64-bit: __builtin_ctzll(x) uses BSF/TZCNT on 64-bit operand. The de Bruijn method without hardware: static const int table[32] = {...}; return table[((x & -x) * 0x077CB531U) >> 27];. Speedup: Loop is 32 cycles worst case, hardware instruction is 1-3 cycles. Use for finding lowest set bit position, extracting rightmost 1 bit.

95% confidence
A

BEFORE (direct): for(i=0;i<n;i++) for(j=0;j<k;j++) out[i] += in[i+j] * kernel[j]; O(n*k) complexity. AFTER (FFT): FFT(in), FFT(kernel), pointwise multiply, IFFT(result). O(n log n) complexity. Use when kernel size k > ~64. Implementation: pad both to next power of 2 >= n+k-1, use FFT library (FFTW, Intel MKL). Speedup: For n=1M, k=1K: direct is 10^9 ops, FFT is ~60M ops (15x faster). For small kernels (k<16), direct convolution with SIMD is faster. Libraries like cuDNN use FFT internally for large convolutions. The crossover point depends on FFT implementation efficiency.

95% confidence
A

BEFORE: for(i=0;i<n;i++) if(cond[i]) out[i] = val[i];. AFTER (AVX-512): __mmask16 mask = _mm512_cmpneq_epi32_mask(cond_vec, zero); _mm512_mask_storeu_ps(out, mask, val);. For AVX2: _mm256_maskstore_ps(out, _mm256_cmpgt_epi32(cond, zero), val); or blend: __m256 mask = _mm256_castsi256_ps(_mm256_cmpgt_epi32(cond, zero)); __m256 result = _mm256_blendv_ps(_mm256_loadu_ps(out), val, mask); _mm256_storeu_ps(out, result);. The blendv approach reads existing values and selectively replaces them. Speedup: 2-4x with proper masking. AVX-512 masking is more efficient as it doesn't require loading existing values. Masked stores also suppress faults on inactive lanes, enabling safe boundary handling.

95% confidence
A

BEFORE (AoS): struct Pixel { uint8_t r, g, b, a; } pixels[n]; Processing interleaved RGBA requires gather/scatter. AFTER (SoA): uint8_t r[n], g[n], b[n], a[n];. Deinterleave: __m256i rgbargba = _mm256_loadu_si256(src); // 8 pixels __m256i shuffled = _mm256_shuffle_epi8(rgbargba, deinterleave_mask); // group channels. Or process AoS using AVX2 shuffle to extract channels in-place. For conversion: for(i=0;i<n;i+=4) { uint32_t* p = (uint32_t*)&pixels[i]; for(j=0;j<4;j++) { r[i+j]=p[j]&0xFF; g[i+j]=(p[j]>>8)&0xFF; ... } }. Speedup: 2-4x for channel-independent operations. Keep SoA internally, convert at boundaries.

95% confidence
A

BEFORE: uint64_t product = (uint64_t)a * b; uint32_t high = product >> 32; (fine for 32-bit operands). AFTER (64-bit operands): unsigned __int128 prod = (unsigned __int128)a * b; uint64_t high = prod >> 64; using the __uint128_t support in GCC/Clang, or uint64_t high; _mulx_u64(a, b, &high); (BMI2 MULX), or __umulh(a, b) on MSVC. SIMD: _mm256_mulhi_epu16 for 16-bit lanes; _mm256_mul_epu32 returns full 64-bit products of the even 32-bit lanes. For modular arithmetic and Montgomery multiplication, mulhi is essential. Speedup: Avoiding emulated 128-bit arithmetic can be 1.5-2x faster on 32-bit systems. On 64-bit, compilers handle the cast-and-shift idiom well, but direct mulhi intrinsics guarantee optimal code generation.

95% confidence
A

BEFORE: if(x<a) f0(); else if(x<b) f1(); else if(x<c) f2(); else if(x<d) f3(); else f4();. AFTER (binary decision tree for uniform distribution): if(x<c) { if(x<a) f0(); else if(x<b) f1(); else f2(); } else { if(x<d) f3(); else f4(); }. This ensures average 2-3 comparisons instead of worst-case 4. For sorted thresholds: use binary search then dispatch. For very many cases: int idx = binary_search(thresholds, x); handlers[idx]();. Speedup: O(n) to O(log n) comparisons for n cases. The balanced tree minimizes expected comparisons when all branches are equally likely. Profile branch frequencies to optimize tree shape for skewed distributions.

95% confidence
A

BEFORE: uint64_t result = 1; for(i=0; i<exp; i++) result = (result * base) % mod; (O(exp) multiplications). AFTER (square-and-multiply): uint64_t result = 1; while(exp > 0) { if(exp & 1) result = (result * base) % mod; base = (base * base) % mod; exp >>= 1; }. O(log exp) multiplications. Further optimize with Montgomery multiplication to avoid modulo. SIMD: Limited applicability, but multiple independent exponentiations can be parallelized. Speedup: O(exp) to O(log exp). For exp=1000000, from 1M mults to ~20 mults (50000x faster). This is the standard algorithm for RSA, Diffie-Hellman, and other cryptographic operations.
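
A compilable square-and-multiply sketch (assumes GCC/Clang for the unsigned __int128 intermediate so the 64-bit modular multiply cannot overflow):

#include <stdint.h>

/* Computes base^exp mod mod in O(log exp) multiplications. */
static uint64_t powmod(uint64_t base, uint64_t exp, uint64_t mod) {
    uint64_t result = 1 % mod;
    base %= mod;
    while (exp > 0) {
        if (exp & 1)
            result = (uint64_t)(((unsigned __int128)result * base) % mod);
        base = (uint64_t)(((unsigned __int128)base * base) % mod);
        exp >>= 1;
    }
    return result;
}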

95% confidence
A

BEFORE: for(i=0;i<n;i++) arr[i] = value;. AFTER (AVX): __m256 val_vec = _mm256_set1_ps(value); for(i=0;i<n;i+=8) _mm256_storeu_ps(&arr[i], val_vec);. For large arrays (>L2 cache), use non-temporal stores: _mm256_stream_ps(&arr[i], val_vec); to avoid cache pollution. For sequential values (0,1,2,...): __m256i indices = _mm256_setr_epi32(0,1,2,3,4,5,6,7); __m256i increment = _mm256_set1_epi32(8); for(i=0;i<n;i+=8) { _mm256_storeu_si256((__m256i*)&arr[i], indices); indices = _mm256_add_epi32(indices, increment); }. Speedup: 4-8x for fill, near memory bandwidth for streaming stores. memset uses this internally for byte patterns.

95% confidence
A

BEFORE: bool any = (a[0] || a[1] || a[2] || ... || a[n-1]);. AFTER (SIMD): __m256i zero = _mm256_setzero_si256(); __m256i acc = zero; for(i=0;i<n;i+=8) acc = _mm256_or_si256(acc, _mm256_loadu_si256(&a[i])); bool any = !_mm256_testz_si256(acc, acc);. The VPTEST instruction sets ZF if all bits are zero. For short-circuit evaluation (early exit when found): for(i=0;i<n;i+=8) { __m256i v = _mm256_loadu_si256(&a[i]); if(!_mm256_testz_si256(v, v)) return true; }. Speedup: 8x for non-short-circuit check. For all-of: use AND instead of OR, check all bits are 1. Similar pattern works for finding if any element meets a condition via comparison masks.

95% confidence
A

BEFORE: unsigned extract_bits(unsigned x, int start, int len) { return (x >> start) & ((1 << len) - 1); }. AFTER: Use BFE (Bit Field Extract) instruction via intrinsic: _bextr_u32(x, start, len) (BMI1). For constant start/len, compilers optimize the shift-and-mask. Without BMI1: precompute mask: unsigned masks[33] = {0, 1, 3, 7, 15, ...}; return (x >> start) & masks[len];. SIMD: No direct support, use shift and AND. Speedup: BFE is 1 instruction vs 3 for shift-AND-mask. Essential for parsing packed binary formats, compression algorithms, bit manipulation. Check BMI1 support: __builtin_cpu_supports("bmi"). AMD has supported BFE since Piledriver (2012), Intel since Haswell (2013).

95% confidence
A

BEFORE: if(x >= low && x <= high) in_range();. AFTER: if((unsigned)(x - low) <= (unsigned)(high - low)) in_range();. This uses unsigned arithmetic to combine two comparisons into one. Works because if x < low, then x - low wraps to large unsigned value > (high - low). If x > high, then x - low > high - low directly. Speedup: 1 comparison + 1 subtraction vs 2 comparisons + AND. Most significant when checking array bounds: if((unsigned)index < array_size). Compilers often generate this optimization for signed range checks with -O2. For SIMD: AVX2 has no unsigned compares, so XOR both operands with 0x80000000 and use _mm256_cmpgt_epi32 (or combine _mm256_max_epu32 with _mm256_cmpeq_epi32); AVX-512 adds unsigned compares such as _mm256_cmpgt_epu32_mask.

95% confidence
A

BEFORE: struct Node { int val; Node* next; }; Node* p = head; while(p) { process(p->val); p = p->next; } (pointer chasing, cache-hostile). AFTER: Store data in contiguous array: int arr[n]; for(i=0; i<n; i++) process(arr[i]);. If order matters, use array of indices for logical next pointers: int next[n]; for(i=start; i!=-1; i=next[i]) process(arr[i]);. Or flatten: copy list to array, process array, rebuild list if needed. Speedup: 3-10x. Linked list traversal achieves ~5% of memory bandwidth due to pointer chasing latency (one cache miss per node). Arrays enable prefetching and SIMD. Only use linked lists when O(1) insertion/deletion is critical and cache locality isn't.

95% confidence
A

BEFORE: for(i=0;i<n;i++) if(arr[i] == target) return i;. AFTER (AVX2): __m256i target_vec = _mm256_set1_epi32(target); for(i=0; i<n; i+=8) { __m256i data = _mm256_loadu_si256((__m256i*)&arr[i]); __m256i cmp = _mm256_cmpeq_epi32(data, target_vec); int mask = _mm256_movemask_ps(_mm256_castsi256_ps(cmp)); if(mask) return i + __builtin_ctz(mask); } return -1;. Checks 8 elements per iteration. For bytes: check 32 per iteration with _mm256_cmpeq_epi8 and _mm256_movemask_epi8. Speedup: 4-8x for large arrays. Critical insight: early-exit preserves first-match semantics. For find-all, remove early exit and collect all positions.

95% confidence
A

BEFORE: for(i=0; i<N; i++) for(j=0; j<N; j++) for(k=0; k<N; k++) C[i][j] += A[i][k] * B[k][j]; (cache-thrashing for large N). AFTER: for(ii=0; ii<N; ii+=BLOCK) for(jj=0; jj<N; jj+=BLOCK) for(kk=0; kk<N; kk+=BLOCK) for(i=ii; i<min(ii+BLOCK,N); i++) for(j=jj; j<min(jj+BLOCK,N); j++) for(k=kk; k<min(kk+BLOCK,N); k++) C[i][j] += A[i][k] * B[k][j];. Choose BLOCK so 3*BLOCK*BLOCK*sizeof(element) fits in L1 cache (~32KB). For doubles: BLOCK=32-64. Speedup: 2-10x for matrices larger than cache. Reduces cache misses from O(N^3) to O(N^3/BLOCK). This is the foundation of high-performance BLAS implementations.

95% confidence
A

BEFORE (horizontal): __m256 sum = _mm256_hadd_ps(a, b); (adds adjacent pairs within vector, crosses lanes). AFTER (vertical): Accumulate using vertical adds throughout loop, single horizontal reduction at end. Loop: __m256 acc = _mm256_setzero_ps(); for(i=0; i<n; i+=8) acc = _mm256_add_ps(acc, _mm256_loadu_ps(&arr[i]));. Final reduction: __m128 lo = _mm256_extractf128_ps(acc, 0); __m128 hi = _mm256_extractf128_ps(acc, 1); __m128 sum128 = _mm_add_ps(lo, hi); sum128 = _mm_hadd_ps(sum128, sum128); sum128 = _mm_hadd_ps(sum128, sum128); float result = _mm_cvtss_f32(sum128);. Speedup: 3-5x. Horizontal ops have 3-7 cycle latency vs 1 cycle for vertical. Minimize horizontal operations; do them once at the end.

95% confidence
A

BEFORE: while(*p) { if(*p == '"') handle_quote(); else if(*p == '\\') handle_escape(); p++; }. AFTER (SIMD character search): __m256i quote = _mm256_set1_epi8('"'); __m256i backslash = _mm256_set1_epi8('\\'); while(p < end) { __m256i chunk = _mm256_loadu_si256((const __m256i*)p); __m256i q = _mm256_cmpeq_epi8(chunk, quote); __m256i b = _mm256_cmpeq_epi8(chunk, backslash); int qm = _mm256_movemask_epi8(q); int bm = _mm256_movemask_epi8(b); if(qm | bm) { handle_special(p, qm, bm); } p += 32; }. This is how simdjson achieves 2-4GB/s JSON parsing. Speedup: 4-10x for parsing-heavy workloads. The key insight: scan for special characters in bulk, then handle them individually.

95% confidence
A

BEFORE: int clz = 0; while((x & 0x80000000) == 0 && clz < 32) { x <<= 1; clz++; }. AFTER: int clz = __builtin_clz(x); compiles to BSR (Bit Scan Reverse) + subtraction or LZCNT instruction. LZCNT (ABM/BMI) directly returns leading zero count, defined for x=0 (returns 32/64). BSR finds highest set bit position, then 31-BSR gives leading zeros. For floor(log2(x)): Use 31 - __builtin_clz(x) when x > 0. For ceiling(log2(x)): 32 - __builtin_clz(x - 1) when x > 1. Speedup: Loop 32 cycles vs hardware 1-3 cycles. Applications: finding number magnitude, normalizing floating-point, fast log2 approximation.

95% confidence
A

BEFORE: for(i=0;i<n;i++) gray[i] = 0.299f*r[i] + 0.587f*g[i] + 0.114f*b[i];. AFTER (SoA with AVX): __m256 coef_r = _mm256_set1_ps(0.299f); __m256 coef_g = _mm256_set1_ps(0.587f); __m256 coef_b = _mm256_set1_ps(0.114f); for(i=0;i<n;i+=8) { __m256 rv = _mm256_loadu_ps(&r[i]); __m256 gv = _mm256_loadu_ps(&g[i]); __m256 bv = _mm256_loadu_ps(&b[i]); __m256 gray_v = _mm256_fmadd_ps(rv, coef_r, _mm256_fmadd_ps(gv, coef_g, _mm256_mul_ps(bv, coef_b))); _mm256_storeu_ps(&gray[i], gray_v); }. For packed RGB bytes: deinterleave first, convert to float, compute, convert back. Speedup: 4-8x. Use FMA for 3-multiply-add pattern.

95% confidence
A

BEFORE: switch(opcode) { case 0: fn0(); break; case 1: fn1(); break; ... case N: fnN(); break; }. AFTER: typedef void (*Handler)(void); Handler table[N+1] = {fn0, fn1, ..., fnN}; table[opcode]();. For small dense ranges, jump tables are most efficient. For sparse ranges: use perfect hashing or binary search. SIMD lookup: _mm256_permutevar8x32_epi32 for 8-entry tables, vpshufb for 16-entry byte tables. Speedup: Eliminates branch misprediction chain (N/2 mispredictions on average for N cases). Jump table is O(1), switch can be O(N) worst case. Compilers often generate jump tables automatically for dense switch statements (check -O2 assembly).

95% confidence
A

BEFORE: qsort(arr, n, sizeof(int), compare); (pointer chasing, cache-unfriendly for large n). AFTER: Use radix sort for integers: void radix_sort(uint32_t* arr, int n) { uint32_t* aux = malloc(n * sizeof(uint32_t)); for(int shift=0; shift<32; shift+=8) { int count[256]={0}; for(int i=0;i<n;i++) count[(arr[i]>>shift)&0xFF]++; for(int i=1;i<256;i++) count[i]+=count[i-1]; for(int i=n-1;i>=0;i--) aux[--count[(arr[i]>>shift)&0xFF]]=arr[i]; swap(arr,aux); } }. Speedup: O(n log n) comparisons vs O(n*w/r) operations, where w=key bits, r=radix bits. For n=1M 32-bit integers, radix sort is 2-5x faster than quicksort due to sequential memory access and predictable branches.
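
A compilable version of the sketch above (standard C; an explicit ping-pong buffer swap replaces the swap(arr,aux) shorthand, and the four passes leave the sorted data back in arr):

#include <stdint.h>
#include <stdlib.h>

/* LSD radix sort of 32-bit unsigned keys, 8-bit digits, 4 counting passes. */
void radix_sort_u32(uint32_t *arr, size_t n) {
    uint32_t *aux = malloc(n * sizeof(uint32_t));
    uint32_t *src = arr, *dst = aux;
    for (int shift = 0; shift < 32; shift += 8) {
        size_t count[256] = {0};
        for (size_t i = 0; i < n; i++)
            count[(src[i] >> shift) & 0xFF]++;
        for (int b = 1; b < 256; b++)       /* cumulative counts */
            count[b] += count[b - 1];
        for (size_t i = n; i-- > 0; )       /* backward walk keeps the sort stable */
            dst[--count[(src[i] >> shift) & 0xFF]] = src[i];
        uint32_t *tmp = src; src = dst; dst = tmp;   /* ping-pong buffers */
    }
    free(aux);
}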

95% confidence
A

BEFORE: for(i=0; i<n; i++) { if(a[i]>0) sum += a[i]; }. AFTER: for(i=0; i<n; i++) { sum += a[i] & -(a[i]>0); } or using arithmetic selection: sum += (a[i]>0) ? a[i] : 0; which compilers convert to CMOV. The bitwise version works because -(a[i]>0) produces all 1s (0xFFFFFFFF) when true, all 0s when false. AND with the value keeps or zeros it. Speedup: 2-4x when branch misprediction rate exceeds 20%. On modern CPUs, a mispredicted branch costs 15-20 cycles. Branchless code has constant latency of 2-3 cycles regardless of data patterns. Profile with perf stat to check branch-misses; if above 5%, consider branchless. Best for random/unpredictable data; predictable patterns may be faster with branches due to speculative execution.

95% confidence
A

BEFORE: void inorder(Node* n) { if(!n) return; inorder(n->left); process(n); inorder(n->right); } (O(h) stack space). AFTER (Morris traversal): Node* curr = root; while(curr) { if(!curr->left) { process(curr); curr = curr->right; } else { Node* pred = curr->left; while(pred->right && pred->right != curr) pred = pred->right; if(!pred->right) { pred->right = curr; curr = curr->left; } else { pred->right = NULL; process(curr); curr = curr->right; } } }. O(1) space by temporarily modifying tree structure. Speedup: Not faster (2x more pointer operations), but eliminates stack overflow risk for deep trees. Used when memory is extremely constrained or tree depth is unbounded.

95% confidence
A

BEFORE: for(i=0; i<n; i++) max_val = (arr[i] > max_val) ? arr[i] : max_val;. AFTER (SSE): __m128 max_vec = _mm_set1_ps(-FLT_MAX); for(i=0; i<n; i+=4) { max_vec = _mm_max_ps(max_vec, _mm_loadu_ps(&arr[i])); }. Then horizontal reduction of max_vec. For integer: _mm_max_epi32 (SSE4.1), _mm_max_epu8 (unsigned bytes). For min: _mm_min_ps, _mm_min_epi32. Scalar branchless: max = a - ((a-b) & ((a-b) >> 31));. The SIMD versions are inherently branchless and process 4-16 elements per instruction. Speedup: 4-8x with SSE/AVX. Use _mm256_max_ps for AVX (8 floats) or _mm512_max_ps for AVX-512 (16 floats).

95% confidence
A

BEFORE: result = x / 7; (integer division: 20-90 cycles). AFTER: result = ((uint64_t)x * 0x124924925ULL) >> 35; (multiply-shift: 3-4 cycles). For floating-point: result = x * 0.142857142857f; (1/7). Compilers do this automatically for constant divisors using the technique from 'Division by Invariant Integers using Multiplication' (Granlund/Montgomery 1994). The magic constant and shift are precomputed. For power-of-2 divisors: x/8 becomes x>>3 for unsigned, (x + ((x>>31)&7)) >> 3 for signed (handles negative rounding). Speedup: 5-20x for integer division. Always prefer multiplication by reciprocal for floating-point hot paths. Use compiler explorer to verify the transformation occurs.

95% confidence
A

BEFORE: for(i=0;i<n;i++) result[i] = data[i] / divisor;. AFTER: float recip = 1.0f / divisor; for(i=0;i<n;i++) result[i] = data[i] * recip;. For integer: uint32_t recip = ((1ULL << 32) + divisor - 1) / divisor; for(i=0;i<n;i++) result[i] = ((uint64_t)data[i] * recip) >> 32;. SIMD: __m256 recip_vec = _mm256_set1_ps(1.0f / divisor); result = _mm256_mul_ps(data, recip_vec);. Speedup: Division is 10-20 cycles, multiply is 4-5 cycles (2-4x faster). Essential when dividing many values by the same divisor. Precision consideration: the floating-point reciprocal has rounding error, and the simple integer reciprocal shown here is only exact for a limited input range; for exact integer division by an arbitrary runtime divisor, use a full magic-number scheme (e.g. libdivide).

95% confidence
A

BEFORE: for(i=0;i<n;i++) for(j=0;j<n;j++) if(matrix[i][j]) process(i, j, matrix[i][j]); O(n^2) even for sparse. AFTER (CSR format): int row_ptr[n+1], col_idx[nnz]; float values[nnz]; for(i=0;i<n;i++) for(k=row_ptr[i]; k<row_ptr[i+1]; k++) process(i, col_idx[k], values[k]); O(nnz). SpMV (sparse matrix-vector multiply): for(i=0;i<n;i++) { y[i] = 0; for(k=row_ptr[i];k<row_ptr[i+1];k++) y[i] += values[k] * x[col_idx[k]]; }. Speedup: For 99% sparse 1000x1000 matrix, from 1M iterations to 10K (100x faster). CSR is the standard format for scientific computing, graph algorithms, and sparse linear algebra.

95% confidence
A

BEFORE: if(x < min) x = min; else if(x > max) x = max;. AFTER (scalar branchless): x = x < min ? min : (x > max ? max : x); compilers generate CMOV. AFTER (SIMD): __m256 clamped = _mm256_min_ps(_mm256_max_ps(values, min_vec), max_vec);. Double min/max is the standard pattern. For integers: _mm256_min_epi32/_mm256_max_epi32 (AVX2; SSE4.1 provides the 128-bit _mm_min_epi32/_mm_max_epi32). Saturating arithmetic for specific ranges: _mm256_adds_epi16 clamps to [-32768, 32767] automatically. Speedup: 4-8x with SIMD. Clamp is ubiquitous in graphics (color clamping), audio (sample limiting), and physics (bounds checking). The nested min(max()) pattern works for any ordered type with min/max operations.

95% confidence
A

BEFORE: int abs_val = (x < 0) ? -x : x; (branch). AFTER: int mask = x >> 31; int abs_val = (x + mask) ^ mask;. Explanation: For positive x, mask=0, result=(x+0)^0=x. For negative x, mask=-1 (all 1s), result=(x-1)^(-1). XOR with -1 flips all bits, and (x-1) with flipped bits equals -x (two's complement). Alternative: abs_val = (x ^ mask) - mask;. For floating-point: Clear sign bit directly: *(uint32_t*)&f &= 0x7FFFFFFF;. SIMD: _mm256_andnot_ps(sign_mask, vec) where sign_mask = _mm256_set1_ps(-0.0f). Speedup: 1.5-2x when branches mispredict. Many compilers optimize abs() to branchless form automatically.

95% confidence
A

BEFORE: int fib(int n) { if(n<=1) return n; return fib(n-1)+fib(n-2); } O(2^n). AFTER (matrix exponentiation): [[F(n+1), F(n)], [F(n), F(n-1)]] = [[1,1],[1,0]]^n. Use square-and-multiply for matrix power: O(log n). Matrix multiply is 8 multiplications + 4 additions. For n=1000000, naive recursion is impossible, matrix method computes in ~60 matrix multiplications. Speedup: O(2^n) to O(log n), exponentially faster. This pattern applies to any linear recurrence: a(n) = c1*a(n-1) + c2*a(n-2) + ... can be expressed as matrix power. Used in competitive programming and computing large Fibonacci numbers modulo prime.

95% confidence
A

BEFORE: for(i=0; i<16; i++) data[indices[i]] = values[i];. AFTER (AVX-512): __m512i idx = _mm512_loadu_si512(indices); __m512 vals = _mm512_loadu_ps(values); _mm512_i32scatter_ps(data, idx, vals, sizeof(float));. Mask variant: _mm512_mask_i32scatter_ps(data, mask, idx, vals, scale). Important: with duplicate indices, the lane stores happen in order from lowest to highest, so the highest conflicting lane wins; read-modify-write patterns (e.g. histogram accumulation) silently lose updates - use _mm512_conflict_epi32 to detect and handle conflicts. Speedup: Limited. Scatter is primarily for code simplification, not performance. It serializes stores internally. Only AVX-512 has scatter; AVX2 does not. Consider keeping data in SIMD registers and scattering only at boundaries.

95% confidence
A

BEFORE: for(i=0; i<n; i++) { double scale = sin(theta) * cos(phi); result[i] = data[i] * scale; }. AFTER: double scale = sin(theta) * cos(phi); for(i=0; i<n; i++) { result[i] = data[i] * scale; }. For array-based invariants: BEFORE: for(i=0; i<n; i++) { len = strlen(str); if(i < len) process(str[i]); }. AFTER: len = strlen(str); for(i=0; i<n && i<len; i++) process(str[i]);. Speedup: Depends on invariant cost. For sin/cos: 100+ cycles saved per iteration. For strlen on 1KB string: 1000 cycles saved per iteration. Compilers perform basic LICM (Loop Invariant Code Motion) at -O2+, but may miss function calls without __attribute__((const)) or complex expressions.

95% confidence
A

BEFORE: uint32_t swap = ((x >> 24) & 0xFF) | ((x >> 8) & 0xFF00) | ((x << 8) & 0xFF0000) | ((x << 24) & 0xFF000000);. AFTER: uint32_t swap = __builtin_bswap32(x); compiles to single BSWAP instruction. For 16-bit: __builtin_bswap16(x) or use ROL by 8 bits. For 64-bit: __builtin_bswap64(x). SIMD byte shuffle: _mm_shuffle_epi8(vec, shuffle_mask) with mask reversing byte order within each element. Speedup: Shift-and-OR is 8+ operations, BSWAP is 1 instruction (1-2 cycles). Critical for network protocols (ntohl/htonl), file format parsing, cross-platform data exchange. Use htobe32/be32toh (POSIX) or std::byteswap (C++23) for portability.

95% confidence
A

BEFORE: float dot = 0; for(i=0; i<n; i++) dot += a[i] * b[i];. AFTER (AVX): __m256 sum = _mm256_setzero_ps(); for(i=0; i<n; i+=8) { sum = _mm256_fmadd_ps(_mm256_loadu_ps(&a[i]), _mm256_loadu_ps(&b[i]), sum); }. Horizontal reduction: __m128 lo = _mm256_castps256_ps128(sum); __m128 hi = _mm256_extractf128_ps(sum, 1); __m128 r = _mm_add_ps(lo, hi); r = _mm_hadd_ps(r, r); r = _mm_hadd_ps(r, r); float dot = _mm_cvtss_f32(r);. For SSE4.1, single vector: _mm_dp_ps(a, b, 0xF1) computes dot product directly but only for 4 elements. Speedup: 4-8x. Use FMA (_mm256_fmadd_ps) instead of separate multiply-add for 2x throughput.
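
A compilable FMA dot product with a scalar tail for lengths not divisible by 8 (assumes AVX2+FMA; compile with -mavx2 -mfma or -march=native):

#include <immintrin.h>
#include <stddef.h>

static float dot_avx2(const float *a, const float *b, size_t n) {
    __m256 acc = _mm256_setzero_ps();
    size_t i = 0;
    for (; i + 8 <= n; i += 8)   /* fused multiply-add into one accumulator */
        acc = _mm256_fmadd_ps(_mm256_loadu_ps(&a[i]), _mm256_loadu_ps(&b[i]), acc);
    __m128 lo = _mm256_castps256_ps128(acc);
    __m128 hi = _mm256_extractf128_ps(acc, 1);
    __m128 s  = _mm_add_ps(lo, hi);
    s = _mm_hadd_ps(s, s);
    s = _mm_hadd_ps(s, s);
    float dot = _mm_cvtss_f32(s);
    for (; i < n; i++)           /* scalar tail */
        dot += a[i] * b[i];
    return dot;
}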

95% confidence
A

BEFORE: bool is_pow2 = false; for(int p=1; p>0; p<<=1) if(x==p) { is_pow2=true; break; }. AFTER: bool is_pow2 = x && !(x & (x - 1));. Explanation: x-1 flips all bits from the lowest set bit down. AND with x is zero only if there was exactly one set bit. The x && handles the x=0 case. Alternative: is_pow2 = __builtin_popcount(x) == 1;. For finding which power: int log2 = __builtin_ctz(x); when x is known to be power of 2. Speedup: O(1) vs O(log n). Essential for hash table operations, memory alignment checks, bit manipulation algorithms. The pattern x & (x-1) also clears the lowest set bit, useful for iteration: while(x) { process(__builtin_ctz(x)); x &= x-1; }.

95% confidence
A

BEFORE: result = x * 8;. AFTER: result = x << 3;. General pattern: x * (2^n) = x << n. Compilers do this automatically, but understanding helps when reading assembly or writing SIMD. For SIMD: _mm256_slli_epi32(vec, 3) shifts all 8 integers left by 3. Combined patterns: x * 10 = (x << 3) + (x << 1) = x*8 + x*2. x * 7 = (x << 3) - x = x*8 - x. Speedup: Shift is 1 cycle, multiply is 3-4 cycles on modern x86. However, modern CPUs have fast multipliers, so only matters in extremely hot paths. For division by power-of-2: unsigned x/8 = x >> 3; signed requires adjustment for negative numbers.

95% confidence
A

BEFORE: sum = a[0]; for(i=1; i<n; i++) sum += a[i]; (serial dependency chain, 3-4 cycles per add). AFTER: sum0=sum1=sum2=sum3=0; for(i=0; i<n; i+=4) { sum0+=a[i]; sum1+=a[i+1]; sum2+=a[i+2]; sum3+=a[i+3]; } sum=sum0+sum1+sum2+sum3;. This creates 4 independent dependency chains that execute in parallel via out-of-order execution. Speedup: 2-4x on modern CPUs with 4+ execution ports. The critical insight: floating-point addition is associative mathematically but not in IEEE 754 (slight precision differences). GCC -ffast-math or -fassociative-math enables automatic reassociation. For exact results, use Kahan summation instead.
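
A compilable sketch with four accumulators and a scalar tail (note the result can differ from strict left-to-right summation in the last bits, as discussed above):

#include <stddef.h>

static double sum4(const double *a, size_t n) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;   /* four independent dependency chains */
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];     s1 += a[i + 1];
        s2 += a[i + 2]; s3 += a[i + 3];
    }
    double s = (s0 + s1) + (s2 + s3);
    for (; i < n; i++) s += a[i];            /* tail */
    return s;
}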

95% confidence
A

BEFORE: if(a > b) max = a; else max = b;. AFTER using subtraction and sign bit: int diff = a - b; int mask = diff >> 31; max = a - (diff & mask);. Explanation: If a>b, diff>0, mask=0, max=a-0=a. If a<=b, diff<=0, mask=-1 (all 1s), max=a-diff=a-(a-b)=b. Alternative using XOR: max = a ^ ((a ^ b) & mask);. For min: min = b + (diff & mask);. These compile to pure arithmetic without branches. Speedup: 2-3x when branches mispredict. Compilers generate CMOV for simple ternary operators, but complex conditions may need manual transformation. Profile to verify branch misprediction is the bottleneck before optimizing.

95% confidence
A

BEFORE: result = a[0] + a[1]*x + a[2]*x*x + a[3]*x*x*x + ...;. AFTER (Horner's method): result = a[n]; for(i=n-1;i>=0;i--) result = result*x + a[i];. Or: result = a[n]; result = result*x + a[n-1]; result = result*x + a[n-2]; .... Horner's method uses n multiplications and n additions instead of n(n+1)/2 multiplications. With FMA: for(i=n-1;i>=0;i--) result = fma(result, x, a[i]);. Speedup: O(n^2) multiplies to O(n). For degree-7 polynomial: 28 muls -> 7 muls (4x faster). This is the standard method for polynomial evaluation in numerical computing. Estrin's method offers more parallelism for SIMD but requires more operations.
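
A compilable Horner evaluation using fma, with coefficients ordered c[0] + c[1]*x + ... + c[degree]*x^degree (link with -lm):

#include <math.h>
#include <stddef.h>

static double horner(const double *c, size_t degree, double x) {
    double result = c[degree];
    for (size_t i = degree; i-- > 0; )   /* one fused multiply-add per coefficient */
        result = fma(result, x, c[i]);
    return result;
}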

95% confidence
A

BEFORE: size_t len = 0; while(str[len]) len++;. AFTER (SSE2): __m128i zero = _mm_setzero_si128(); size_t i = 0; while(1) { __m128i chunk = _mm_loadu_si128((__m128i*)(str + i)); __m128i cmp = _mm_cmpeq_epi8(chunk, zero); int mask = _mm_movemask_epi8(cmp); if(mask) return i + __builtin_ctz(mask); i += 16; }. This checks 16 bytes per iteration. PCMPISTRI (SSE4.2) handles null termination implicitly: return _mm_cmpistri(_mm_loadu_si128(str), zero, _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH);. Speedup: 8-16x for long strings. glibc strlen uses this approach with alignment handling. Watch for reading past string end crossing page boundary - align start to 16 bytes.
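
A compilable SSE2 sketch that handles the page-boundary issue by aligning the first load (assumes GCC/Clang builtins; aligned 16-byte loads never cross a page, so reading a few bytes past the terminator stays within mapped memory, and match bits before the string start are masked off):

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

static size_t strlen_sse2(const char *str) {
    const char *p = (const char *)((uintptr_t)str & ~(uintptr_t)15);
    const __m128i zero = _mm_setzero_si128();
    /* First aligned block: drop match bits that precede the start of the string. */
    unsigned mask = (unsigned)_mm_movemask_epi8(
        _mm_cmpeq_epi8(_mm_load_si128((const __m128i *)p), zero));
    mask &= ~0u << (str - p);
    while (mask == 0) {
        p += 16;
        mask = (unsigned)_mm_movemask_epi8(
            _mm_cmpeq_epi8(_mm_load_si128((const __m128i *)p), zero));
    }
    return (size_t)((p + __builtin_ctz(mask)) - str);
}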

95% confidence
A

BEFORE (Euclidean): while(b) { int t = b; b = a % b; a = t; } return a;. AFTER (Binary GCD): int shift = __builtin_ctz(a | b); a >>= __builtin_ctz(a); while(b) { b >>= __builtin_ctz(b); if(a > b) { int t = a; a = b; b = t; } b -= a; } return a << shift;. Binary GCD replaces expensive division/modulo with cheap shifts and subtraction. Speedup: 2-4x on modern CPUs. The ctz (count trailing zeros) efficiently finds factors of 2. While Euclidean is simpler and compilers optimize division well, binary GCD has more predictable performance and is preferred in cryptographic implementations to avoid timing attacks.

95% confidence
A

BEFORE: for(j=0; j<N; j++) for(i=0; i<M; i++) sum += matrix[i][j]; (stride of N elements between accesses, cache thrashing). AFTER: Either transpose first, then access row-major: transpose(matrix, transposed); for(j=0; j<N; j++) for(i=0; i<M; i++) sum += transposed[j][i];. Or interchange loops: for(i=0; i<M; i++) for(j=0; j<N; j++) sum += matrix[i][j];. In-place transpose for square matrices: for(i=0; i<N; i++) for(j=i+1; j<N; j++) swap(matrix[i][j], matrix[j][i]);. Speedup: 3-10x depending on stride and cache size. Strided access with stride >= cache line wastes entire cache line per access. Blocking/tiling helps when full transpose isn't feasible.

95% confidence
A

BEFORE: if(condition) x = a; else x = b;. AFTER: mask = -(int)(condition); x = (a & mask) | (b & ~mask);. The expression -(int)(condition) converts boolean to all-1s or all-0s mask. When condition is true: mask=0xFFFFFFFF, ~mask=0, so x = (a & 0xFF...F) | (b & 0) = a. When false: mask=0, ~mask=0xFF...F, so x = (a & 0) | (b & 0xFF...F) = b. Alternative using XOR: x = b ^ ((a ^ b) & mask);. Speedup: 2-3x for random conditions. This pattern is essential for cryptographic code (constant-time operations) and SIMD where all lanes must execute the same path. Compilers often generate this automatically from ternary operator when optimizing.

95% confidence
A

BEFORE: for(i=0; i<8; i++) result[i] = data[indices[i]];. AFTER (AVX2): __m256i idx = _mm256_loadu_si256((__m256i*)indices); __m256 result = _mm256_i32gather_ps(data, idx, sizeof(float));. Scale parameter (4 for float) handles element size. AVX-512 adds mask support: _mm512_mask_i32gather_ps(src, mask, idx, base, scale). Speedup: Varies widely. Gather is NOT parallel memory access - it serializes internally. Effective when: indices fit in cache, or when combined with other SIMD operations. For truly random access, explicit loads may be faster. Benchmark your specific case. Gather is 12-20 cycles on Intel, faster on AMD Zen4+.

95% confidence
A

BEFORE: for(i=0; i<n; i++) dst[i] = src[i];. AFTER: memcpy(dst, src, n * sizeof(*dst)); or SIMD: for(i=0; i<n; i+=8) { _mm256_storeu_ps(&dst[i], _mm256_loadu_ps(&src[i])); }. For large copies (>1MB), use non-temporal stores: _mm256_stream_ps(dst, _mm256_loadu_ps(src)); bypasses cache to avoid polluting it. For tiny copies (<64 bytes), rep movsb may be optimal on modern Intel (ERMSB). Speedup: Naive loop achieves ~20% bandwidth, optimized memcpy achieves >90%. glibc memcpy uses SIMD with runtime CPU detection. For moves (overlapping): memmove handles overlap correctly; memcpy may not. Use __builtin_memcpy for compiler optimization opportunities.

95% confidence
A

BEFORE: crc = 0xFFFFFFFF; for each bit: crc = (crc >> 1) ^ (polynomial & -(crc & 1));. AFTER (table lookup, 1 byte at a time): static uint32_t table[256]; // precomputed for(i=0;i<len;i++) crc = (crc >> 8) ^ table[(crc ^ data[i]) & 0xFF];. Table generation: for(i=0;i<256;i++) { crc=i; for(j=0;j<8;j++) crc = (crc>>1) ^ (poly & -(crc&1)); table[i]=crc; }. For more speed: 4-way table (slicing-by-4) processes 4 bytes per iteration. Modern CPUs: use CRC32 instruction _mm_crc32_u64 for hardware CRC32C. Speedup: 8x with table lookup, 50x+ with hardware instruction. CRC32C achieves >10GB/s with hardware support.

95% confidence
A

BEFORE: if(fabs(a - b) < epsilon) (expensive fabs, floating-point subtract). AFTER for IEEE 754 positive floats: Reinterpret as integers and compare: int32_t ia = *(int32_t*)&a; int32_t ib = *(int32_t*)&b; if(abs(ia - ib) < ulps). This uses ULPs (Units in Last Place) for comparison. Works because IEEE 754 floats are ordered like integers when positive. For signed floats, adjust: if(ia < 0) ia = 0x80000000 - ia;. SIMD: Cast to integer, compare with _mm256_cmpgt_epi32. Speedup: 1.5-2x for comparison-heavy code. This technique is used in physics engines and numerical software. Caveat: Fails for NaN and infinity; add special handling if needed.

95% confidence
A

BEFORE: int cmp = strcmp(a, b); (byte-by-byte comparison). AFTER (SSE4.2): #define FLAGS (_SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH | _SIDD_NEGATIVE_POLARITY) int cmp = 0; for(i=0; ; i+=16) { __m128i va = _mm_loadu_si128((__m128i*)&a[i]); __m128i vb = _mm_loadu_si128((__m128i*)&b[i]); int idx = _mm_cmpistri(va, vb, FLAGS); if(idx < 16) { cmp = (unsigned char)a[i+idx] - (unsigned char)b[i+idx]; break; } if(_mm_cmpistrz(va, vb, FLAGS)) break; }. PCMPISTRI compares 16 bytes and handles the null terminator implicitly (the flags must be a compile-time constant, hence the #define). Speedup: 2-4x for long strings. For known-length (memcmp style): use _mm_cmpeq_epi8 and _mm_movemask_epi8. glibc uses this approach for optimized string functions.

95% confidence
A

BEFORE: remainder = x % 16; (division instruction, 20+ cycles). AFTER: remainder = x & 15; (AND instruction, 1 cycle). General pattern: x % (2^n) = x & ((1 << n) - 1) for unsigned integers. For signed integers, the pattern is more complex due to negative number representation: remainder = ((x % n) + n) % n or use: int mask = n - 1; remainder = x & mask; if (x < 0 && remainder) remainder |= ~mask; Speedup: 10-20x. This is why hash tables use power-of-2 sizes. Compilers optimize x % CONST automatically when CONST is power of 2. For non-power-of-2, combine with Barrett reduction for repeated modulo by same divisor.

95% confidence
A

BEFORE: j=0; for(i=0;i<n;i++) if(pred(arr[i])) out[j++] = arr[i];. AFTER (AVX2 with a shuffle table): __m256i data = _mm256_loadu_si256(src); __m256i mask = predicate_simd(data); int m = _mm256_movemask_ps(_mm256_castsi256_ps(mask)); __m256i indices = _mm256_loadu_si256(&shuffle_table[m]); __m256i compacted = _mm256_permutevar8x32_epi32(data, indices); _mm256_storeu_si256(dst, compacted); dst += __builtin_popcount(m);. Requires a precomputed 256-entry shuffle table, one entry per possible 8-bit mask. Speedup: 2-5x. Used in filtering, removing whitespace, extracting valid elements. AVX-512 has VPCOMPRESSD which does this in one instruction: _mm512_mask_compress_epi32.
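
On AVX-512 the whole pattern collapses to a compare plus a compress-store; a compilable sketch that keeps elements greater than a threshold (assumes AVX-512F, GCC/Clang builtins, and n a multiple of 16):

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

static size_t compact_gt(const int32_t *src, int32_t *dst, size_t n, int32_t threshold) {
    size_t out = 0;
    const __m512i thr = _mm512_set1_epi32(threshold);
    for (size_t i = 0; i < n; i += 16) {
        __m512i v = _mm512_loadu_si512(&src[i]);
        __mmask16 m = _mm512_cmpgt_epi32_mask(v, thr);          /* which lanes to keep */
        _mm512_mask_compressstoreu_epi32(&dst[out], m, v);      /* pack them to the front */
        out += (size_t)__builtin_popcount((unsigned)m);
    }
    return out;   /* number of kept elements */
}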

95% confidence
A

BEFORE (AoS): struct Particle { float x, y, z, vx, vy, vz, mass; }; Particle particles[N]; for(i=0; i<N; i++) particles[i].x += dt * particles[i].vx;. AFTER (SoA): struct Particles { float x[N], y[N], z[N], vx[N], vy[N], vz[N], mass[N]; }; Particles p; for(i=0; i<N; i++) p.x[i] += dt * p.vx[i];. Speedup: 2-4x for SIMD operations, 1.5-2x for scalar due to cache efficiency. AoS loads entire struct (16+ bytes) when you need one field (4 bytes), wasting 75% bandwidth. SoA enables: (1) SIMD processing of contiguous x values, (2) Better cache utilization when accessing single field across many objects, (3) Streaming stores. Use AoS when all fields accessed together; SoA when iterating over single field.

95% confidence
A

BEFORE: if(a && b) return 3; else if(a && !b) return 2; else if(!a && b) return 1; else return 0;. AFTER: int table[2][2] = {{0, 1}, {2, 3}}; return table[a != 0][b != 0];. For multi-variable conditions: pack bits into index: int idx = (a?4:0) | (b?2:0) | (c?1:0); return table[idx];. This eliminates all branches. For character classification: bool is_alpha[256]; return is_alpha[(unsigned char)c];. Speedup: Eliminates O(n) branch mispredictions for n conditions. Best when: conditions are data-dependent (unpredictable), table fits in cache (< 64KB), and access pattern is random. Tables trade memory for speed.

95% confidence
A

BEFORE: if(x>0) return 1; else if(x<0) return -1; else return 0;. AFTER: int sign = (x > 0) - (x < 0); which compilers turn into branch-free code. A pure bit-manipulation version: int sign = (x >> 31) | ((unsigned)-x >> 31);. Explanation: (x >> 31) is -1 for negative, 0 otherwise. ((unsigned)-x >> 31) is 1 for positive (since -x is negative), 0 otherwise. OR combines them. For floating-point: copysign(1.0, x) returns +1.0 or -1.0 (doesn't return 0 for x=0). SIMD: Compare against zero, mask to -1/0/+1. Speedup: 1.5-2x when branches mispredict. Most useful in physics simulations, smoothstep functions.

95% confidence
A

BEFORE: for each 3 bytes, split into 4 6-bit values, look up in table. AFTER (SSE/AVX): Load 12 bytes (4 groups of 3). Reshuffle to align 6-bit fields: __m128i shuffled = _mm_shuffle_epi8(input, shuffle_mask); Shift and mask to extract: __m128i indices = ...; Use _mm_shuffle_epi8 as 16-entry lookup table for encoding. Or use comparison and add-if-greater for the base64 alphabet ranges (A-Z, a-z, 0-9, +/). Speedup: 5-10x. The key insight: base64 is a deterministic character-by-character transformation, perfect for SIMD. Modern implementations (like Turbo-Base64) achieve 4-8 GB/s encode speed. See: https://github.com/lemire/fastbase64 for production implementations.

95% confidence