Top "Simd" questions

Single instruction, multiple data (SIMD) is a form of parallelism in which each instruction operates on a small chunk, or vector, of data elements at once.
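
As a rough illustration of that definition, here is a minimal sketch, assuming an x86 or x86-64 target with SSE available. The helper name `add4` is hypothetical and exists only for this example; the intrinsics themselves (`_mm_loadu_ps`, `_mm_add_ps`, `_mm_storeu_ps`) come from `<xmmintrin.h>`.

```cpp
#include <xmmintrin.h>  // SSE intrinsics (assumes an x86 / x86-64 target)
#include <array>

// Hypothetical helper for illustration: adds two vectors of four floats.
// The four additions are expressed as one packed SSE add.
std::array<float, 4> add4(const std::array<float, 4>& a,
                          const std::array<float, 4>& b) {
    __m128 va = _mm_loadu_ps(a.data());  // load 4 floats, no alignment required
    __m128 vb = _mm_loadu_ps(b.data());
    __m128 vr = _mm_add_ps(va, vb);      // one packed add covers all 4 lanes
    std::array<float, 4> r;
    _mm_storeu_ps(r.data(), vr);         // store the 4 results back to memory
    return r;
}
```

A scalar version would need a loop with four separate additions; here a single packed add covers all four lanes, which is the "multiple data" part of the name.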

Number of Compute Units corresponding to the number of work groups

I need some clarification. I'm developing OpenCL code on my laptop, which has a small NVIDIA GPU (310M). When I query the …

opencl nvidia simd

Why is vectorization faster, in general, than loops?

Why, at the lowest level of the hardware performing operations and the general underlying operations involved (i.e.: things general …

performance language-agnostic vectorization simd low-level

Good portable SIMD library

Can anyone recommend a portable SIMD library that provides a C/C++ API, works on Intel and AMD extensions and Visual …

c++ open-source cross-platform simd

Get member of __m128 by index?

I've got some code, originally given to me by someone working with MSVC, and I'm trying to get it to …

c++ clang sse simd intrinsics

C++ SSE SIMD framework

Does anyone know an open-source C++ x86 SIMD intrinsics library? Intel supplies exactly what I need in their integrated performance …

c++ sse simd intrinsics

How to use the multiply and accumulate intrinsics in ARM Cortex-A8?

How do I use the multiply-accumulate intrinsics provided by GCC? float32x4_t vmlaq_f32 (float32x4_t , float32x4_t , …

c arm simd intrinsics neon

SSE (SIMD): multiply vector by scalar

A common operation in my program is scaling vectors by a scalar (V*s, e.g. [1,2,3,4]*2 == [2,4,6,8]). Is there …

c x86 sse simd

Intel AVX: 256-bit version of dot product for double precision floating point variables

The Intel Advanced Vector Extensions (AVX) offer no dot product in the 256-bit version (YMM register) for double precision floating …

c++ performance simd avx

SSE multiplication of 4 32-bit integers

How do I multiply four 32-bit integers by another four 32-bit integers? I couldn't find any instruction that does this.

x86 sse simd multiplication sse2

SSE-copy, AVX-copy and std::copy performance

I tried to improve the performance of a copy operation using SSE and AVX: #include <immintrin.h> const int sz = 1024; …

c++ performance sse simd avx