Top "Sse" questions

SSE (Streaming SIMD Extensions) was the first of many similarly-named vector extensions to the x86 instruction set.

Header files for x86 SIMD intrinsics

Which header files provide the intrinsics for the different x86 SIMD instruction set extensions (MMX, SSE, AVX, ...)? It seems impossible …

x86 header-files sse simd intrinsics
How to determine if memory is aligned?

I am new to optimizing code with SSE/SSE2 instructions and until now I have not gotten very far. To …

c optimization memory sse simd
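
For readers skimming this entry, one common way to test alignment is to convert the pointer to an integer and look at its low bits. The following is a minimal sketch, not taken from the linked question: the helper name `is_aligned` is hypothetical, and it assumes the platform provides `uintptr_t` and that the pointer-to-integer conversion preserves the address (implementation-defined in C, but true on mainstream x86 compilers).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if ptr is a multiple of `alignment`
   bytes, 0 otherwise. `alignment` must be a nonzero power of two,
   e.g. 16 for SSE aligned loads and stores such as movaps. */
static int is_aligned(const void *ptr, size_t alignment)
{
    /* Precondition: a power of two has exactly one bit set. */
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);

    /* For a power-of-two alignment, the address is aligned exactly
       when all bits below that power of two are zero. */
    return ((uintptr_t)ptr & (alignment - 1)) == 0;
}
```

A caller might use `is_aligned(buffer, 16)` to choose between an aligned and an unaligned load path.
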
How to check if a CPU supports the SSE3 instruction set?

Is the following code valid to check if a CPU supports the SSE3 instruction set? Using the IsProcessorFeaturePresent() function apparently …

c++ sse instruction-set avx cpuid
Why is SSE scalar sqrt(x) slower than rsqrt(x) * x?

I've been profiling some of our core math on an Intel Core Duo, and while looking at various approaches to …

performance assembly floating-point x86 sse
How to detect SSE/SSE2/AVX/AVX2/AVX-512/AVX-128-FMA/KCVI availability at compile-time?

I'm trying to optimize some matrix computations and I was wondering if it was possible to detect at compile-time if …

gcc clang sse avx avx512
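
As a rough illustration of what compile-time detection can look like, GCC and Clang predefine macros such as `__SSE__`, `__SSE2__`, `__AVX__`, `__AVX2__` and `__AVX512F__` when the matching extension is enabled by the target flags (for example `-msse2` or `-march=native`). The sketch below is not from the linked question; the function name `compiled_simd_level` is hypothetical, and other compilers may not define the same set of macros.

```c
/* Hypothetical helper: reports the widest x86 vector extension that
   was enabled when this translation unit was compiled. It describes
   compiler flags only, not what the CPU running the binary supports. */
static const char *compiled_simd_level(void)
{
#if defined(__AVX512F__)
    return "AVX-512F";
#elif defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#elif defined(__SSE2__)
    return "SSE2";
#elif defined(__SSE__)
    return "SSE";
#else
    return "none";
#endif
}
```

The same `#if defined(...)` tests can guard alternative implementations of a kernel. Run-time detection is a separate problem, because the binary may execute on a CPU that differs from the build target.
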
What is the meaning of "non temporal" memory accesses in x86?

This is a somewhat low-level question. In x86 assembly there are two SSE instructions: MOVDQA xmmi, m128 and MOVNTDQA xmmi, …

x86 sse assembly
SSE intrinsic functions reference

Does anyone know of a reference listing the operation of the SSE intrinsic functions for gcc, i.e. the functions …

c++ c gcc sse simd
Using SSE instructions

I have a loop written in C++ which is executed for each element of a big integer array. Inside the …

c++ optimization assembly processor sse
Fastest way to do horizontal SSE vector sum (or other reduction)

Given a vector of three (or four) floats, what is the fastest way to sum them? Is SSE (movaps, shuffle, …

assembly optimization floating-point sse simd
How to use Fused Multiply-Add (FMA) instructions with SSE/AVX

I have learned that some Intel/AMD CPUs can do simultaneous multiply and add with SSE/AVX: FLOPS per cycle …

c sse cpu-architecture avx fma