
SIMD Programming FAQ & Answers

33 expert SIMD programming answers researched from official documentation. Every answer cites authoritative sources you can verify.

SIMD for Specific Algorithms

5 questions
A

SIMD image convolution pattern: 1) Load rows of pixels into vector registers. 2) For a 3x3 kernel: keep 3 rows in registers and slide the window horizontally. 3) Multiply-accumulate with the kernel weights using FMA. 4) Handle borders with padding (replicate edge, zero, wrap) or skip them. Example for a horizontal blur (1D kernel [1,2,1]/4): load 8+ pixels, build left- and right-shifted copies using palignr or shuffles, then add with weights. __m256i row = _mm256_loadu_si256((const __m256i*)src); __m256i left = shift_left(row); __m256i right = shift_right(row); /* pseudocode helpers for the palignr/shuffle step */ __m256i sum = _mm256_add_epi16(row, row); sum = _mm256_add_epi16(sum, left); sum = _mm256_add_epi16(sum, right); __m256i result = _mm256_srli_epi16(sum, 2); For color images: process R,G,B planes separately (SoA layout), or use shuffles to deinterleave RGBRGB... into RRR..., GGG..., BBB... before processing. Libraries: OpenCV uses SIMD internally; Intel IPP is highly optimized.
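
A minimal sketch of the [1,2,1]/4 horizontal blur above on 16-bit pixels. It uses unaligned loads at src-1 and src+1 instead of in-register shifts; the function name blur16, the assumption that n is a multiple of 16, and the assumption that src has one padded element on each side are mine.

  #include <immintrin.h>
  #include <stdint.h>
  #include <stddef.h>

  // out[i] = (src[i-1] + 2*src[i] + src[i+1]) >> 2 for 16-bit pixels.
  void blur16(const uint16_t *src, uint16_t *out, size_t n) {
      for (size_t i = 0; i < n; i += 16) {
          __m256i left   = _mm256_loadu_si256((const __m256i *)(src + i - 1));
          __m256i center = _mm256_loadu_si256((const __m256i *)(src + i));
          __m256i right  = _mm256_loadu_si256((const __m256i *)(src + i + 1));
          __m256i sum = _mm256_add_epi16(_mm256_add_epi16(center, center),
                                         _mm256_add_epi16(left, right));
          _mm256_storeu_si256((__m256i *)(out + i), _mm256_srli_epi16(sum, 2));
      }
  }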

95% confidence
A

SIMD dot product: multiply element-wise, then sum. AVX2 implementation for float arrays: __m256 sum = _mm256_setzero_ps(); for(size_t i=0; i<n; i+=8) { __m256 a = _mm256_loadu_ps(&arr1[i]); __m256 b = _mm256_loadu_ps(&arr2[i]); sum = _mm256_fmadd_ps(a, b, sum); } float result = horizontal_sum(sum); Key optimizations: 1) Use FMA (fused multiply-add) _mm256_fmadd_ps(a,b,c) = a*b+c in one instruction - better precision and throughput. 2) Use multiple accumulators (2-4) to hide FMA latency. 3) Unroll loop for better instruction-level parallelism. 4) Handle remainder with masked load or scalar loop. For AVX-512: 16 floats per iteration, use _mm512_fmadd_ps and _mm512_reduce_add_ps for final sum. FMA throughput: 2/cycle on modern Intel, so 32 FLOPs/cycle with AVX2, 64 with AVX-512.
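
A sketch of the FMA dot product with two independent accumulators, as described above. The function name dot_avx2 and the assumption that n is a multiple of 16 (a real version would add a remainder loop) are mine.

  #include <immintrin.h>
  #include <stddef.h>

  float dot_avx2(const float *a, const float *b, size_t n) {
      __m256 acc0 = _mm256_setzero_ps(), acc1 = _mm256_setzero_ps();
      for (size_t i = 0; i < n; i += 16) {
          acc0 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i),     _mm256_loadu_ps(b + i),     acc0);
          acc1 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 8), _mm256_loadu_ps(b + i + 8), acc1);
      }
      __m256 acc = _mm256_add_ps(acc0, acc1);                 // combine accumulators
      __m128 s = _mm_add_ps(_mm256_castps256_ps128(acc),      // fold 8 lanes to 4
                            _mm256_extractf128_ps(acc, 1));
      s = _mm_hadd_ps(s, s);                                  // 4 -> 2
      s = _mm_hadd_ps(s, s);                                  // 2 -> 1
      return _mm_cvtss_f32(s);
  }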

95% confidence
A

Prefix sum computes running totals: output[i] = sum(input[0..i]). SIMD approach: 1) In-register prefix via shift-and-add, log2(width) steps. With SSE (4 floats): add a copy shifted left by one element, then by two elements, using _mm_slli_si128 on the bit pattern (casting between __m128 and __m128i). With AVX (8 floats), _mm256_slli_si256 only shifts within each 128-bit lane, so after the per-lane prefix you must broadcast lane 0's last element and add it into lane 1. 2) For arrays: compute prefix sums in blocks, then add block sums as offsets. Two-pass algorithm: Pass 1: compute local prefix sums per block and save block totals. Pass 2: prefix-sum the block totals and add each to its block's elements. Complexity: O(n) work, O(log n) depth for the parallel version. SIMD achieves roughly 2.5x speedup over scalar on a single core. AVX-512 lacks a direct full-width byte shift; use valignd for element shifts. Prefix sum is inherently less SIMD-friendly than reductions due to lane dependencies - each output depends on all previous inputs. GPU implementations use more sophisticated algorithms (Blelloch scan).
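
A minimal sketch of the in-register shift-and-add step for the simpler SSE case (4 floats); the function name prefix4 is mine, and the AVX 8-float version would add the cross-lane fix-up mentioned above.

  #include <immintrin.h>

  // Inclusive prefix sum of 4 floats: [a,b,c,d] -> [a, a+b, a+b+c, a+b+c+d].
  __m128 prefix4(__m128 x) {
      // add a copy shifted left by one element (4 bytes)
      x = _mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 4)));
      // add a copy shifted left by two elements (8 bytes)
      x = _mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 8)));
      return x;
  }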

95% confidence
A

Sorting networks use fixed compare-exchange sequences independent of data values, making them SIMD-friendly. Bitonic sort is the classic example: it recursively builds bitonic sequences (ascending then descending), then merges them. SIMD implementation: 1) Pack elements into vectors. 2) Compare-exchange via _mm256_min_ps/_mm256_max_ps pairs. 3) Permute elements for the next comparison stage using shuffle/blend. For 8 floats in AVX: the bitonic network needs 24 compare-exchanges in 6 stages, and each stage is fully parallel within SIMD. Use a sorting network for small arrays (base case, n<=256), then go hybrid with quicksort for larger arrays. AVX-512 enables 16-element networks. Performance: reported gains range from 10-20% over std::sort for small arrays up to 8x on Intel Skylake for small arrays with a vectorized quicksort plus bitonic merge (about 4x for large arrays vs the STL and 1.4x vs Intel IPP).
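
A sketch of the basic compare-exchange building block described in step 2; the helper name compare_exchange is mine, and the shuffle/blend permutations between network stages are not shown.

  #include <immintrin.h>

  // After this step, every lane of *lo holds the smaller of the pair and *hi the larger.
  static inline void compare_exchange(__m256 *lo, __m256 *hi) {
      __m256 mn = _mm256_min_ps(*lo, *hi);
      __m256 mx = _mm256_max_ps(*lo, *hi);
      *lo = mn;
      *hi = mx;
  }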

95% confidence
A

SIMD quicksort vectorizes the partition step: 1) Broadcast the pivot to a vector: __m256 pivot = _mm256_set1_ps(p). 2) Load 8 elements and compare: __m256 mask = _mm256_cmp_ps(data, pivot, _CMP_LT_OS). 3) Use movemask to get a scalar bitmask: int m = _mm256_movemask_ps(mask). 4) Compress/expand elements using a lookup table or AVX-512 compress. AVX-512 has native compress: _mm512_maskz_compress_ps packs the elements where mask=1 contiguously into a register, and _mm512_mask_compressstoreu_ps writes them directly to memory. AVX2 requires a lookup table mapping the comparison bitmask to a shuffle control. Vectorized partition processes 8+ elements per iteration instead of 1. Fall back to insertion sort for small partitions, or combine with bitonic sort for the base case (n<=64). Gueron-Krasnov AVX2 quicksort: 2x speedup over std::sort. Google Highway's vqsort provides a portable high-performance implementation.
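
A hedged sketch of one AVX-512 partition step using compress stores; the function name partition16 and the two-cursor interface are my assumptions, and the surrounding quicksort bookkeeping is omitted.

  #include <immintrin.h>

  // Write the 16 loaded elements that are < pivot to *lo and the rest to *hi, advancing both cursors.
  void partition16(const float *src, float pivot, float **lo, float **hi) {
      __m512 data = _mm512_loadu_ps(src);
      __m512 pv   = _mm512_set1_ps(pivot);
      __mmask16 lt = _mm512_cmp_ps_mask(data, pv, _CMP_LT_OS);
      _mm512_mask_compressstoreu_ps(*lo, lt, data);             // elements < pivot, packed
      _mm512_mask_compressstoreu_ps(*hi, (__mmask16)~lt, data); // elements >= pivot, packed
      int below = _mm_popcnt_u32((unsigned)lt);
      *lo += below;
      *hi += 16 - below;
  }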

95% confidence

Vector Operations

4 questions
A

Use store intrinsics matching your alignment: Aligned stores: _mm_store_ps (16-byte), _mm256_store_ps (32-byte), _mm512_store_ps (64-byte aligned required). Unaligned stores: _mm_storeu_ps, _mm256_storeu_ps, _mm512_storeu_ps (any alignment). Non-temporal stores (bypass cache, for write-only data): _mm_stream_ps, _mm256_stream_ps, _mm512_stream_ps - require aligned addresses. Integer variants: _mm_store_si128, _mm256_store_si256. Example: _mm256_storeu_ps(output_array, result_vec); stores 8 floats. Masked stores (AVX-512): _mm512_mask_storeu_ps(addr, mask, vec) only writes lanes where mask bit is 1. For single-element extraction: _mm_cvtss_f32 extracts lowest float, _mm256_extractf128_ps extracts 128-bit half. ARM NEON: vst1q_f32 for 128-bit stores.
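
A small sketch of the non-temporal path mentioned above; the function name fill_nt and the assumption that dst is 32-byte aligned are mine, and streaming stores are followed by a store fence before the data is considered globally visible.

  #include <immintrin.h>
  #include <stddef.h>

  // Write-only fill that bypasses the cache with _mm256_stream_ps.
  void fill_nt(float *dst, size_t n, float value) {
      __m256 v = _mm256_set1_ps(value);
      size_t i = 0;
      for (; i + 8 <= n; i += 8)
          _mm256_stream_ps(dst + i, v);   // non-temporal store, needs 32-byte alignment
      _mm_sfence();                       // order the streaming stores
      for (; i < n; i++)                  // scalar tail
          dst[i] = value;
  }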

95% confidence
A

_mm_load_ps requires 16-byte aligned memory addresses and generates the MOVAPS instruction. _mm_loadu_ps works with any alignment and generates MOVUPS. On early SSE CPUs, aligned loads were significantly faster. On modern CPUs (Sandy Bridge and later), both perform identically when the data is aligned; unaligned loads only incur a penalty when crossing cache line boundaries. Best practice: default to unaligned loads (_mm_loadu_ps); use the aligned variant only when you can guarantee alignment, where it also serves as a cheap correctness check and helps on older systems. The aligned variant causes a segmentation fault/access violation if the address isn't properly aligned. The same pattern applies to 256-bit (_mm256_load_ps vs _mm256_loadu_ps) and 512-bit variants. For integer loads: _mm_load_si128 (aligned) vs _mm_loadu_si128 (unaligned).

95% confidence
A

Use broadcast intrinsics to replicate a single value across all vector lanes: SSE: _mm_set1_ps(float) creates [f,f,f,f]. AVX: _mm256_set1_ps(float) creates [f,f,f,f,f,f,f,f], _mm256_broadcast_ss(&float) loads from memory. AVX-512: _mm512_set1_ps(float) creates 16 copies. For integers: _mm256_set1_epi32(int) broadcasts 32-bit int to all lanes. _mm256_broadcast_ss loads and broadcasts from memory in one instruction (VBROADCASTSS), potentially more efficient than set1 which may require multiple operations. Example: __m256 scale = _mm256_set1_ps(2.5f); creates a vector [2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5] for scaling operations. ARM NEON: vdupq_n_f32(float) duplicates scalar to all lanes.
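
A short usage sketch of the broadcast-then-operate pattern; the function name scale and the assumption that n is a multiple of 8 are mine.

  #include <immintrin.h>
  #include <stddef.h>

  // Multiply every element of a by the same scalar s.
  void scale(float *a, size_t n, float s) {
      __m256 vs = _mm256_set1_ps(s);                  // broadcast s to all 8 lanes
      for (size_t i = 0; i < n; i += 8)
          _mm256_storeu_ps(a + i, _mm256_mul_ps(_mm256_loadu_ps(a + i), vs));
  }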

95% confidence
A

Shuffle and permute rearrange elements within or between vector registers. Shuffle typically uses an immediate (compile-time constant) control: _mm_shuffle_ps(a, b, imm8) selects 4 floats from two 128-bit sources based on imm8 encoding. _mm256_shuffle_ps operates on 128-bit halves independently. Permute allows runtime-variable indices: _mm256_permutevar8x32_ps(vec, idx) rearranges 8 floats according to index vector. _mm256_permute_ps uses immediate for within-lane permutation. Key distinction: in-lane operations (cheap, 1 cycle latency) vs cross-lane operations (expensive, 3+ cycles). _mm256_permute2f128_ps swaps/selects 128-bit halves. AVX-512 adds powerful permutes: _mm512_permutexvar_ps allows any-to-any element movement. ARM NEON: vtbl for table lookup permutation.
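
A small sketch of a runtime-index, cross-lane permute; the helper name reverse8 is mine.

  #include <immintrin.h>

  // Reverse the 8 floats in an AVX register using a runtime index vector (AVX2).
  __m256 reverse8(__m256 v) {
      __m256i idx = _mm256_setr_epi32(7, 6, 5, 4, 3, 2, 1, 0);
      return _mm256_permutevar8x32_ps(v, idx);   // cross-lane permute
  }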

95% confidence

Data Alignment

3 questions
A

SIMD registers have natural alignment requirements: 128-bit SSE requires 16-byte, 256-bit AVX requires 32-byte, and 512-bit AVX-512 requires 64-byte alignment. Misaligned access causes: Historical SSE: General Protection Fault (#GP) when aligned load/store instructions see a misaligned address. Modern CPUs: no fault with unaligned instructions, but a performance penalty when crossing cache line boundaries (64 bytes). AVX-512: some instructions always require 64-byte alignment and will #GP(0) otherwise. Alignment matters because cache lines are 64 bytes, so a misaligned 64-byte load touches 2 cache lines; memory controllers optimize for aligned transfers; and store forwarding works best with aligned data. Use alignas(32) in C++11, __attribute__((aligned(32))) in GCC, __declspec(align(32)) in MSVC, or posix_memalign()/aligned_alloc() for dynamic allocation.

95% confidence
A

Multiple methods for aligned memory allocation: C11: void* aligned_alloc(size_t alignment, size_t size) - size must be a multiple of alignment. POSIX: posix_memalign(&ptr, alignment, size) - returns 0 on success. Windows: _aligned_malloc(size, alignment), free with _aligned_free(). C++17: std::aligned_alloc or operator new with std::align_val_t. Compiler-specific: __attribute__((aligned(N))) for stack variables (GCC/Clang), __declspec(align(N)) for MSVC. C++11: alignas(N) specifier. Example for AVX: float* data = (float*)aligned_alloc(32, num_floats * sizeof(float)); For stack arrays: alignas(32) float arr[256]; Always ensure size is a multiple of the alignment for aligned_alloc. Free aligned memory with free() (C11/POSIX) or _aligned_free() (Windows).
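
A small sketch of the C11 path above, with the size rounded up to satisfy aligned_alloc's multiple-of-alignment rule; the helper name alloc_avx_buffer is mine.

  #include <stdlib.h>
  #include <stddef.h>

  // Returns a 32-byte aligned float buffer suitable for _mm256_load_ps/_mm256_store_ps; free() with free().
  float *alloc_avx_buffer(size_t num_floats) {
      size_t bytes = (num_floats * sizeof(float) + 31) & ~(size_t)31;  // round up to 32
      return (float *)aligned_alloc(32, bytes);
  }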

95% confidence
A

Alignment requirements by instruction set: SSE (128-bit): 16-byte alignment for _mm_load_ps and similar aligned operations. AVX (256-bit): 32-byte alignment recommended for optimal performance, though AVX relaxed requirements for most instructions - misaligned access works but may be slower. AVX-512 (512-bit): 64-byte alignment for best performance, some instructions require it. Practical guideline: always align to the vector width you're using. For mixed code targeting multiple ISAs, align to the largest (64 bytes covers all). Cache line alignment (64 bytes) is generally optimal since it prevents loads from spanning two cache lines. Structures containing SIMD types should use alignas or compiler attributes. Memory pools and allocators should respect these alignments for SIMD-heavy code.

95% confidence

AVX-512 Features

3 questions
A

AVX-512CD (Conflict Detection) enables vectorizing loops with potential write conflicts via vpconflictd/vpconflictq instructions. Use case: histogram computation where multiple lanes might increment the same bin. vpconflictd compares each element against all previous elements in the vector, producing a bitmask of conflicts per lane. Pattern for histogram: gather current bin values, detect conflicts with vpconflictd, resolve conflicts iteratively (process non-conflicting lanes, mask them out, repeat), scatter updated values. Without conflict detection, such loops require scalar fallback. Performance depends on conflict rate: best when conflicts rare (all different indices), worst when all elements same index. Works well for sparse data, image processing, database aggregation. Available on Skylake-X, Ice Lake, and later Intel CPUs. Alternative: atomic operations (slower), or restructure algorithm to avoid conflicts.
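
A hedged sketch of the conflict-detection idea for a histogram: it takes the fast gather/increment/scatter path only when vpconflictd reports no duplicate indices in the block, and otherwise falls back to scalar (the fully iterative in-vector resolution is more involved). The function name histogram16 and the assumption that n is a multiple of 16 are mine.

  #include <immintrin.h>
  #include <stddef.h>

  void histogram16(int *hist, const int *indices, size_t n) {
      for (size_t i = 0; i < n; i += 16) {
          __m512i idx  = _mm512_loadu_si512(indices + i);
          __m512i conf = _mm512_conflict_epi32(idx);                  // per-lane conflict bitmasks
          __mmask16 has_conf = _mm512_test_epi32_mask(conf, conf);    // lanes matching an earlier lane
          if (has_conf == 0) {
              __m512i bins = _mm512_i32gather_epi32(idx, hist, 4);    // gather current counts
              bins = _mm512_add_epi32(bins, _mm512_set1_epi32(1));
              _mm512_i32scatter_epi32(hist, idx, bins, 4);            // scatter updated counts
          } else {
              for (int k = 0; k < 16; k++)                            // scalar fallback for this block
                  hist[indices[i + k]]++;
          }
      }
  }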

95% confidence
A

AVX-512 is modular with multiple subsets: AVX-512F (Foundation): core 512-bit operations, required for all AVX-512 CPUs. AVX-512CD: conflict detection for vectorizing histogram-like loops. AVX-512BW: byte and word (8/16-bit) operations. AVX-512DQ: doubleword and quadword enhancements. AVX-512VL: allows AVX-512 features on 128/256-bit registers. AVX-512VNNI: neural network instructions (int8/int16 dot products). AVX-512BF16: bfloat16 for ML. AVX-512FP16: half-precision float. CPU support: Skylake-X (2017): F, CD, BW, DQ, VL. Cascade Lake: adds VNNI. Ice Lake: adds VBMI2, VPCLMULQDQ, GFNI, VAES. Sapphire Rapids: adds FP16, BF16. AMD Zen 4 (2022): F, CD, BW, DQ, VL, VNNI, BF16. Check support with __builtin_cpu_supports("avx512f") in GCC or via CPUID.
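
A minimal runtime check using the GCC/Clang builtin mentioned above; the specific pair of features tested is just an example.

  #include <stdio.h>

  int main(void) {
      if (__builtin_cpu_supports("avx512f") && __builtin_cpu_supports("avx512bw"))
          puts("AVX-512 F+BW available");
      else
          puts("falling back to an AVX2/SSE code path");
      return 0;
  }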

95% confidence
A

AVX-512 masking replaces branches with predicated execution. Create a mask from a comparison: __mmask16 mask = _mm512_cmp_ps_mask(a, b, _CMP_GT_OS); // a > b. Use the mask in operations: Zeroing: __m512 r = _mm512_maskz_add_ps(mask, x, y); // zero where mask=0. Merging: __m512 r = _mm512_mask_add_ps(old, mask, x, y); // keep old where mask=0. Mask operations: _kand_mask16(m1, m2) - AND masks, _knot_mask16(m) - invert, _mm512_mask2int(m) - convert to int. Example conditional assignment: result = _mm512_mask_blend_ps(mask, else_vals, then_vals); Masks enable vectorized if-else without branches, efficient handling of boundary conditions, and sparse computations (skip zero elements). Memory operations: _mm512_mask_loadu_ps and _mm512_mask_storeu_ps only access memory for active lanes. Fault suppression: masked-off lanes do not fault even if their addresses are invalid.
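
A short sketch of branch-free conditional assignment with a mask: clamp negative values to zero. The function name clamp_neg and the assumption that n is a multiple of 16 are mine.

  #include <immintrin.h>
  #include <stddef.h>

  void clamp_neg(float *a, size_t n) {
      __m512 zero = _mm512_setzero_ps();
      for (size_t i = 0; i < n; i += 16) {
          __m512 v = _mm512_loadu_ps(a + i);
          __mmask16 neg = _mm512_cmp_ps_mask(v, zero, _CMP_LT_OS);  // lanes where v < 0
          v = _mm512_mask_mov_ps(v, neg, zero);                     // overwrite only those lanes
          _mm512_storeu_ps(a + i, v);
      }
  }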

95% confidence

SIMD Performance Pitfalls

2 questions
A

AVX 256-bit registers are logically divided into two 128-bit lanes. Operations crossing this boundary (lane-crossing) incur 3-cycle latency vs 1-cycle for in-lane operations. Affected instructions: _mm256_permute2f128_ps (swaps/selects 128-bit halves), cross-lane shuffles, 256-bit horizontal adds. Most AVX/AVX2 instructions operate on lanes independently - _mm256_shuffle_ps applies same shuffle to both halves separately. To minimize penalty: structure algorithms to work within lanes when possible, batch lane-crossing operations together, use lane-crossing only for final reduction steps. AVX-512 has 4 lanes (128-bit each) with similar considerations. The vpermd/vpermps instructions (full cross-lane permute) have 3-cycle latency. For reductions: accumulate vertically, cross lanes only at the end. Profile with VTune to identify lane-crossing bottlenecks.

95% confidence
A

AVX-512 instructions consume more power, causing CPU frequency reduction (throttling) on Intel CPUs. Three license levels: L0 (normal), L1 (light AVX-512, ~100MHz drop), L2 (heavy AVX-512, ~200MHz+ drop). Heavy instructions: 512-bit FMA, multiplies, some shuffles. Light instructions: 512-bit adds, logical ops, loads/stores. Mitigation strategies: 1) Use 256-bit AVX2 for short bursts surrounded by scalar code. 2) For sustained vectorized code, AVX-512 throughput often wins despite lower frequency. 3) Avoid mixing scalar and AVX-512 frequently (frequency transitions take microseconds). 4) On Ice Lake and newer, throttling is reduced. Profile actual performance - 512-bit at reduced frequency often beats 256-bit at full frequency for vectorizable workloads. Server CPUs (Xeon) throttle less aggressively than desktop parts. AMD Zen4 has no AVX-512 throttling.

95% confidence

Auto-vectorization

2 questions
A

Compiler hints for better auto-vectorization: 1) Use the restrict keyword on pointers to indicate no aliasing: void f(float* restrict a, float* restrict b). 2) #pragma omp simd before loops to request vectorization (ignores assumed dependencies). 3) #pragma GCC ivdep or #pragma ivdep to ignore assumed vector dependencies. 4) Align data: __attribute__((aligned(32))) or alignas(32). 5) Use #pragma omp simd aligned(a,b:32) to specify alignment. 6) Keep loop bodies simple - avoid function calls and complex control flow. 7) Use counted loops with known bounds when possible. 8) Prefer array indexing over pointer arithmetic. Compiler flags: -O3 enables aggressive vectorization, -march=native targets the current CPU. Check reports: -fopt-info-vec-missed (GCC) shows why vectorization failed; -Rpass-analysis=loop-vectorize (Clang) provides detailed analysis.
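
A small sketch combining hints 1 and 2 above (compile with -O3 -fopenmp-simd, or -fopenmp); the saxpy example itself is mine.

  // restrict promises the compiler that x and y do not alias; omp simd requests vectorization.
  void saxpy(int n, float a, const float *restrict x, float *restrict y) {
      #pragma omp simd
      for (int i = 0; i < n; i++)
          y[i] = a * x[i] + y[i];
  }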

95% confidence
A

#pragma omp simd is an OpenMP 4.0+ directive that requests SIMD vectorization of the following loop. It tells the compiler the loop is safe to vectorize even where it cannot prove that itself. Basic usage: #pragma omp simd followed by a for loop. Optional clauses: simdlen(N): vector length hint (e.g., simdlen(8) for 8-wide). safelen(N): maximum safe vectorization width due to dependencies. aligned(ptr:N): declares pointer alignment. linear(var:step): variable increases linearly each iteration. reduction(op:var): handles reduction operations. private(var): each lane gets a private copy. Example: #pragma omp simd aligned(a,b:32) simdlen(8) reduction(+:sum) for(int i=0; i<n; i++) sum += a[i]*b[i]; Compile with -fopenmp or -fopenmp-simd (SIMD without threading). Differs from #pragma ivdep: omp simd directs vectorization on the programmer's guarantee that it is safe, while ivdep only dismisses assumed (unproven) dependencies and still respects proven ones.
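
The clause example above arranged as a compilable function (compile with -fopenmp-simd); the function name dot is mine, and the aligned clause assumes callers really pass 32-byte-aligned pointers.

  float dot(int n, const float *a, const float *b) {
      float sum = 0.0f;
      #pragma omp simd aligned(a, b : 32) simdlen(8) reduction(+:sum)
      for (int i = 0; i < n; i++)
          sum += a[i] * b[i];
      return sum;
  }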

95% confidence

Manual SIMD Optimization

2 questions
A

Use auto-vectorization when: simple loops with obvious patterns, portability across ISAs matters, code maintainability is priority, compiler does good job (check reports). Use manual intrinsics when: auto-vectorizer fails or produces suboptimal code, complex algorithms (sorting, parsing, compression), need specific instruction sequences, maximum performance critical, algorithm has SIMD-friendly structure not recognized by compiler. Hybrid approach: write scalar reference implementation, let compiler vectorize simple parts, use intrinsics for hot spots. Check auto-vectorization first with compiler reports before writing intrinsics. Intrinsics downsides: harder to maintain, CPU-specific (need separate paths for SSE/AVX/NEON), more bugs. Modern compilers vectorize well for: reductions, simple stencils, element-wise operations. They struggle with: horizontal operations, complex shuffles, conditional logic, gather/scatter patterns.

95% confidence
A

Efficient SIMD reduction pattern: 1) Initialize vector accumulator (e.g., _mm256_setzero_ps for sum). 2) Main loop: accumulate vertically (_mm256_add_ps(acc, data)). 3) After loop: single horizontal reduction of accumulator. For sum with AVX2 floats: __m256 acc = _mm256_setzero_ps(); for(i=0; i<n; i+=8) acc = _mm256_add_ps(acc, _mm256_loadu_ps(&arr[i])); then horizontal sum of 8-element acc. For max: _mm256_max_ps vertically, horizontal max at end. Multiple accumulators hide latency: use 2-4 independent accumulators, combine before horizontal step. This exploits instruction-level parallelism - CPU can execute multiple adds in flight. AVX-512 provides _mm512_reduce_add_ps, _mm512_reduce_max_ps for direct reduction. Handle remainder with scalar loop or masked final iteration.
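
A sketch of the pattern with two accumulators and a deliberately simple scalar finish; the function name sum_avx2 and the assumption that n is a multiple of 16 are mine.

  #include <immintrin.h>
  #include <stddef.h>

  float sum_avx2(const float *arr, size_t n) {
      __m256 acc0 = _mm256_setzero_ps();
      __m256 acc1 = _mm256_setzero_ps();
      for (size_t i = 0; i < n; i += 16) {                 // vertical accumulation only
          acc0 = _mm256_add_ps(acc0, _mm256_loadu_ps(arr + i));
          acc1 = _mm256_add_ps(acc1, _mm256_loadu_ps(arr + i + 8));
      }
      __m256 acc = _mm256_add_ps(acc0, acc1);              // combine accumulators once
      float tmp[8];
      _mm256_storeu_ps(tmp, acc);                          // single horizontal step at the end
      float total = 0.0f;
      for (int k = 0; k < 8; k++) total += tmp[k];
      return total;
  }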

95% confidence

Vectorization Patterns

2 questions
A

Loop vectorization transforms scalar loops to process multiple iterations simultaneously using SIMD. The compiler analyzes loops for: countable iterations (known trip count), no loop-carried dependencies (each iteration independent), simple memory access patterns (consecutive or strided). The vectorizer: unrolls by vector width (e.g., 8 for AVX floats), replaces scalar operations with vector equivalents, handles remainder iterations. Example transformation: for(i=0;i<n;i++) a[i]=b[i]+c[i]; becomes SIMD adds processing 8 elements per iteration. LLVM/GCC have two vectorizers: Loop Vectorizer (transforms entire loops) and SLP Vectorizer (packs independent scalar ops). Enable with -O2 or higher, -ftree-vectorize (GCC), /O2 (MSVC). Check vectorization reports: -fopt-info-vec (GCC), -Rpass=loop-vectorize (Clang).
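
Roughly what the transformation above amounts to, written out with intrinsics for illustration; the function name vadd and the assumption that n is a multiple of 8 are mine.

  #include <immintrin.h>
  #include <stddef.h>

  // Hand-written equivalent of the vectorized form of: for(i) a[i] = b[i] + c[i];
  void vadd(float *a, const float *b, const float *c, size_t n) {
      for (size_t i = 0; i < n; i += 8) {
          __m256 vb = _mm256_loadu_ps(b + i);
          __m256 vc = _mm256_loadu_ps(c + i);
          _mm256_storeu_ps(a + i, _mm256_add_ps(vb, vc));
      }
  }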

95% confidence
A

SLP vectorization packs independent isomorphic (same operation) scalar instructions into vector operations within straight-line code, without requiring loops. The compiler identifies groups of similar operations on different data that can execute in parallel. Example: four separate float additions a=x+1, b=y+2, c=z+3, d=w+4 become one vector add. SLP works in three steps: 1) Pack heuristic selects groups of independent, similar instructions. 2) Reorder instructions so dependencies precede packed groups. 3) Replace packed scalars with vector instructions. SLP complements loop vectorization - loops are unrolled first, then SLP packs the unrolled operations. goSLP (LLVM improvement) achieves 7.58% speedup on SPEC2017fp over standard SLP. Enabled by default in LLVM and GCC at -O2 and higher. Disable with -fno-tree-slp-vectorize.
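
The four-addition example above written as straight-line code that the SLP vectorizer can pack (no loop involved); the function name slp_candidate is mine.

  // Four independent, isomorphic float adds - a candidate for one vector add at -O2 and above.
  void slp_candidate(float *out, const float *x) {
      out[0] = x[0] + 1.0f;
      out[1] = x[1] + 2.0f;
      out[2] = x[2] + 3.0f;
      out[3] = x[3] + 4.0f;
  }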

95% confidence

Handling Non-Power-of-2 Data

2 questions
A

For arrays not divisible by the vector width, use these strategies: 1) Scalar remainder loop: vectorize the main loop, process leftover elements with scalar code. for(i=0; i+8<=n; i+=8) { /* SIMD */ } for(; i<n; i++) { /* scalar */ }. 2) Masking (AVX-512): use mask registers to disable lanes for the final partial vector. __mmask8 mask = (1<<remainder)-1; _mm256_mask_storeu_ps(...) (256-bit masked stores require AVX-512VL). 3) Padding: allocate extra elements, process full vectors, ignore the padding in results. 4) Overlapping final iteration: process the last full vector even if it overlaps the previous one (safe for read-only data, needs care for writes). AVX-512 masking is preferred as it avoids scalar code and uses the full vector width throughout. The mask suppresses memory faults for out-of-bounds lanes, making edge handling cleaner.
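
A sketch of strategy 2 for a 512-bit sum: the tail is handled by a masked load instead of a scalar loop. The function name sum_masked is mine.

  #include <immintrin.h>
  #include <stddef.h>

  float sum_masked(const float *a, size_t n) {
      __m512 acc = _mm512_setzero_ps();
      size_t i = 0;
      for (; i + 16 <= n; i += 16)
          acc = _mm512_add_ps(acc, _mm512_loadu_ps(a + i));
      if (i < n) {
          __mmask16 m = (__mmask16)((1u << (n - i)) - 1);             // low (n-i) lanes active
          acc = _mm512_add_ps(acc, _mm512_maskz_loadu_ps(m, a + i));  // inactive lanes read as 0
      }
      return _mm512_reduce_add_ps(acc);
  }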

95% confidence
A

AVX-512 introduces 8 dedicated 64-bit mask registers (k0-k7) for per-element predication. Each bit controls one vector lane. For 512-bit registers: 64 bits for bytes, 32 for words, 16 for dwords, 8 for qwords. k0 is special - when used as writemask, it means no masking (all elements active). Masking modes: Zeroing (_maskz intrinsics): inactive lanes written as zero. Merging (_mask intrinsics): inactive lanes retain destination's previous value. Example: __mmask16 m = 0xFF00; __m512 result = _mm512_maskz_add_ps(m, a, b); adds only upper 8 floats, zeros lower 8. Mask operations: _mm512_cmp_ps_mask compares and produces mask, _kand_mask16 performs AND on masks. Masks enable efficient handling of conditionals, remainder loops, and sparse data without branching. Memory fault suppression: masked loads/stores don't fault on invalid addresses for inactive lanes.

95% confidence

Horizontal vs Vertical Operations

2 questions
A

Vertical operations process corresponding lanes between vectors independently - lane 0 of result depends only on lane 0 of inputs. Example: _mm256_add_ps(a,b) adds a[0]+b[0], a[1]+b[1], etc. in parallel. These are fast (1 cycle typically) and scale perfectly with vector width. Horizontal operations combine elements within the same vector. Example: summing all elements of a vector (reduction). _mm256_hadd_ps adds adjacent pairs: [a0+a1, a2+a3, b0+b1, b2+b3, a4+a5, a6+a7, b4+b5, b6+b7] - note it doesn't produce a scalar! Horizontal ops are slower (3+ cycles) and don't scale with width. Best practice: accumulate using vertical ops throughout the loop, perform single horizontal reduction at the end. AVX-512 provides true reductions: _mm512_reduce_add_ps sums all 16 floats to scalar.

95% confidence
A

For AVX 256-bit float vector reduction to scalar: 1) Extract high 128-bits: __m128 hi = _mm256_extractf128_ps(v, 1); 2) Add to low 128-bits: __m128 sum128 = _mm_add_ps(_mm256_castps256_ps128(v), hi); 3) Horizontal add twice: sum128 = _mm_hadd_ps(sum128, sum128); sum128 = _mm_hadd_ps(sum128, sum128); 4) Extract scalar: float result = _mm_cvtss_f32(sum128); Alternative using shuffles (sometimes faster): sum high/low halves, then use _mm_shuffle_ps to bring elements together for final adds. For loop accumulation: maintain vector accumulator, do single horizontal sum after loop. AVX-512 simplifies this: float sum = _mm512_reduce_add_ps(v); - single intrinsic for 16-element reduction. For integers, similar patterns with _mm_hadd_epi32 or _mm512_reduce_add_epi32.
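
The extract/hadd sequence above collected into a compilable helper; the name hsum256 is mine.

  #include <immintrin.h>

  float hsum256(__m256 v) {
      __m128 hi = _mm256_extractf128_ps(v, 1);                 // upper 4 floats
      __m128 s  = _mm_add_ps(_mm256_castps256_ps128(v), hi);   // 8 -> 4
      s = _mm_hadd_ps(s, s);                                   // 4 -> 2
      s = _mm_hadd_ps(s, s);                                   // 2 -> 1
      return _mm_cvtss_f32(s);
  }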

95% confidence

ARM SIMD

2 questions
A

SVE VLA programming uses runtime-determined vector length instead of compile-time fixed width. Key concepts: svcntw() returns number of 32-bit words in vector (varies by hardware). svwhilelt_b32(i, n) creates predicate for loop iteration. svld1_f32(pred, ptr) loads with predication. Loop pattern: svbool_t pg = svwhilelt_b32(i, n); while(svptest_any(svptrue_b32(), pg)) { svfloat32_t v = svld1_f32(pg, &a[i]); /* process */ svst1_f32(pg, &out[i], result); i += svcntw(); pg = svwhilelt_b32(i, n); } The final iteration automatically masks inactive lanes. Benefits: same binary optimal on 128-bit mobile and 512-bit server SVE. No remainder loop needed - predication handles it. Downsides: harder to reason about performance, some algorithms need known vector length. Use ACLE (ARM C Language Extensions) intrinsics or let compiler auto-vectorize with SVE target.
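
The loop pattern above laid out as a compilable ACLE sketch (build with an SVE target such as -march=armv8-a+sve); the function name vla_mul and the element-wise multiply it performs are my choices.

  #include <arm_sve.h>
  #include <stdint.h>

  // a[i] *= b[i] for 0 <= i < n, vector-length agnostic; the predicate masks the final partial vector.
  void vla_mul(float *a, const float *b, int64_t n) {
      int64_t i = 0;
      svbool_t pg = svwhilelt_b32(i, n);
      while (svptest_any(svptrue_b32(), pg)) {
          svfloat32_t va = svld1_f32(pg, a + i);
          svfloat32_t vb = svld1_f32(pg, b + i);
          svst1_f32(pg, a + i, svmul_f32_x(pg, va, vb));
          i += (int64_t)svcntw();          // elements per vector, known only at run time
          pg = svwhilelt_b32(i, n);
      }
  }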

95% confidence
A

NEON and SVE can interoperate because SVE Z registers' lower 128 bits alias NEON V registers. GCC 14+ provides arm_neon_sve_bridge.h with: svset_neonq(sv_vec, neon_vec) - sets NEON vector into SVE vector's first 128 bits. svget_neonq(sv_vec) - extracts first 128 bits as NEON vector. Use case: leverage optimized NEON library functions within SVE code, or migrate NEON code to SVE incrementally. Example: float32x4_t neon_v = vld1q_f32(ptr); svfloat32_t sve_v = svset_neonq_f32(svundef_f32(), neon_v); /* SVE operations */ float32x4_t result = svget_neonq_f32(sve_v); Caveat: mixing resets SVE vector elements beyond 128 bits. For portable code, prefer SVE throughout or use SIMDe/Highway for abstraction. AWS Graviton3 supports both NEON and SVE, enabling this interop pattern.
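
The bridge example above arranged as a compilable snippet (assuming GCC 14+/recent LLVM with SVE enabled); the helper names are mine.

  #include <arm_neon.h>
  #include <arm_sve.h>
  #include <arm_neon_sve_bridge.h>

  // Place a NEON vector in the low 128 bits of an SVE vector (upper bits come from svundef_f32()).
  svfloat32_t neon_to_sve(const float *ptr) {
      float32x4_t neon_v = vld1q_f32(ptr);
      return svset_neonq_f32(svundef_f32(), neon_v);
  }

  // Read back the low 128 bits of an SVE vector as a NEON vector.
  float32x4_t sve_to_neon(svfloat32_t v) {
      return svget_neonq_f32(v);
  }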

95% confidence

Gather/Scatter Operations

2 questions
A

Gather loads elements from non-contiguous memory locations into a vector using an index vector; scatter stores vector elements to non-contiguous locations. AVX2 gather: _mm256_i32gather_ps(base, vindex, scale) loads from byte address base + vindex[i]*scale for each lane (use scale = sizeof(float) for ordinary indexing). There is no AVX2 scatter - that requires AVX-512. AVX-512 scatter: _mm512_i32scatter_ps(base, vindex, data, scale) stores data[i] to base + vindex[i]*scale. Use cases: sparse matrix operations, histogram binning, lookup table access, indirect array indexing. Performance: significantly slower than contiguous loads (each lane may touch a different cache line). Broadwell shows 0.95-1.2x speedup vs scalar; Skylake and newer show 1.2-1.8x. Best when computation is heavy relative to memory access and the data is truly scattered. Avoid it for data that can be rearranged to be contiguous or permuted after a contiguous load.
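
A minimal gather sketch for table lookups; the helper name gather8 is mine, and it assumes idx holds 8 valid indices into table.

  #include <immintrin.h>

  // Load table[idx[0..7]] into one AVX register (scale 4 = sizeof(float)).
  __m256 gather8(const float *table, const int *idx) {
      __m256i vidx = _mm256_loadu_si256((const __m256i *)idx);
      return _mm256_i32gather_ps(table, vidx, 4);
  }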

95% confidence
A

Prefer contiguous load + permute when: data can be loaded in chunks and rearranged, access pattern is known at compile time, multiple operations will use the permuted data. Gather is appropriate when: indices are computed at runtime, data is truly sparse in memory, only a few elements are needed from large arrays. Performance comparison on modern CPUs: contiguous 256-bit load is ~4 cycles, gather can be 12-20+ cycles depending on cache behavior. Rule of thumb: if elements span more than 2-3 cache lines, gather overhead grows significantly. For lookup tables fitting in L1 cache, gather performs reasonably well. For histogram-style scatter with potential conflicts, AVX-512 conflict detection (vpconflictd) is needed. Profile your specific access pattern - gather performance varies widely by CPU generation and memory access locality.

95% confidence

Comparison and Selection

1 question
A

Blend and select both choose between two vectors based on a mask, with subtle differences: Blend (_mm256_blend_ps): Uses immediate constant mask (compile-time). Fast - single instruction, 1 cycle latency. Limited to fixed patterns. Example: _mm256_blend_ps(a, b, 0b11110000) takes b[7:4], a[3:0]. Blendv (_mm256_blendv_ps): Uses vector mask (runtime variable). Selection based on sign bit (MSB) of each mask element. Slightly slower (1-2 cycles). Example: __m256 mask = _mm256_cmp_ps(x, zero, _CMP_LT_OS); result = _mm256_blendv_ps(pos_val, neg_val, mask); AVX-512 mask blend: _mm512_mask_blend_ps(kmask, a, b) uses mask register. Cleaner: mask bit=0 selects a, bit=1 selects b (not sign-bit based). Select typically refers to the operation conceptually, while blend is Intel's instruction name. For bitwise per-bit selection: use AND/ANDN/OR pattern: result = (a & mask) | (b & ~mask).
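
A small sketch of runtime selection with blendv, following the sign-bit rule described above; the helper name select_lt0 is mine.

  #include <immintrin.h>

  // Per lane: out = (x < 0) ? a : b. The comparison produces all-ones (MSB set) where x < 0.
  __m256 select_lt0(__m256 x, __m256 a, __m256 b) {
      __m256 mask = _mm256_cmp_ps(x, _mm256_setzero_ps(), _CMP_LT_OS);
      return _mm256_blendv_ps(b, a, mask);   // MSB of mask set -> take from a, else b
  }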

95% confidence

Data Type Operations

1 question
A

SIMD and endianness considerations: x86 is little-endian: the lowest address holds the least significant byte. Vector loads preserve memory order: _mm_loadu_si128 from [0x01,0x02,0x03,0x04,...] puts 0x01 in byte 0 of lane 0. Within elements: the 32-bit value 0x04030201 has 0x01 at the lowest address. Cross-platform (ARM): ARM can be big- or little-endian; NEON uses the native endianness. When reading big-endian data (network protocols, file formats) on a little-endian machine, a byte swap is needed. SSSE3 PSHUFB can reverse byte order: __m128i shuf = _mm_setr_epi8(3,2,1,0,7,6,5,4,11,10,9,8,15,14,13,12); swapped = _mm_shuffle_epi8(data, shuf); // reverses each 32-bit word. For 16-bit words: a different shuffle pattern. For 64-bit: _mm_shuffle_epi8 with [7,6,5,4,3,2,1,0,15,14,13,12,11,10,9,8]. Alternative: scalar byte-swap builtins (e.g., __builtin_bswap32) per element, or a shift+mask+or combination.
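
The PSHUFB pattern above collected into a helper; the name bswap32x4 is mine.

  #include <immintrin.h>

  // Byte-swap four 32-bit words (big-endian <-> little-endian) in one SSSE3 shuffle.
  __m128i bswap32x4(__m128i v) {
      const __m128i shuf = _mm_setr_epi8(3, 2, 1, 0, 7, 6, 5, 4,
                                         11, 10, 9, 8, 15, 14, 13, 12);
      return _mm_shuffle_epi8(v, shuf);
  }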

95% confidence