I'm confused on how many flops per cycle per core can be done with Sandy-Bridge and Haswell. As I understand it with SSE it should be 4 flops per cycle per core for SSE and 8 flops per cycle per core for AVX/AVX2.
This seems to be verified here, How do I achieve the theoretical maximum of 4 FLOPs per cycle? ,and here, Sandy-Bridge CPU specification.
However the link below seems to indicate that Sandy-bridge can do 16 flops per cycle per core and Haswell 32 flops per cycle per core http://www.extremetech.com/computing/136219-intels-haswell-is-an-unprecedented-threat-to-nvidia-amd.
Can someone explain this to me?
Edit: I understand now why I was confused. I thought the term FLOP only referred to single floating point (SP). I see now that the test at How do I achieve the theoretical maximum of 4 FLOPs per cycle? are actually on double floating point (DP) so they achieve 4 DP FLOPs/cycle for SSE and 8 DP FLOPs/cycle for AVX. It would be interesting to redo these test on SP.
Here are theoretical max FLOPs counts (per core) for a number of recent processor microarchitectures and explanation how to achieve them.
In general, to calculate this look up the throughput of the FMA instruction(s) e.g. on https://agner.org/optimize/ or any other microbenchmark result, and multiply
(FMAs per clock) * (vector elements / instruction) * 2 (FLOPs / FMA)
.
Note that achieving this in real code requires very careful tuning (like loop unrolling), and near-zero cache misses, and no bottlenecks on anything else. Modern CPUs have such high FMA throughput that there isn't much room for other instructions to store the results, or to feed them with input. e.g. 2 SIMD loads per clock is also the limit for most x86 CPUs, so a dot product will bottleneck on 2 loads per 1 FMA. A carefully-tuned dense matrix multiply can come close to achieving these numbers, though.
If your workload includes any ADD/SUB or MUL that can't be contracted into FMAs, the theoretical max numbers aren't an appropriate goal for your workload. Haswell/Broadwell have 2-per-clock SIMD FP multiply (on the FMA units), but only 1 per clock SIMD FP add (on a separate vector FP add unit with lower latency). Skylake dropped the separate SIMD FP adder, running add/mul/fma the same at 4c latency, 2-per-clock throughput, for any vector width.
Note that Celeron/Pentium versions of recent microarchitectures don't support AVX or FMA instructions, only SSE4.2.
Intel Core 2 and Nehalem (SSE/SSE2):
Intel Sandy Bridge/Ivy Bridge (AVX1):
Intel Haswell/Broadwell/Skylake/Kaby Lake/Coffee/... (AVX+FMA3):
Intel Skylake-X/Skylake-EP/Cascade Lake/etc (AVX512F) with 1 FMA units: some Xeon Bronze/Silver
Intel Skylake-X/Skylake-EP/Cascade Lake/etc (AVX512F) with 2 FMA units: Xeon Gold/Platinum, and i7/i9 high-end desktop (HEDT) chips.
Future: Intel Cooper Lake (successor to Cascade Lake) is expected to introduce Brain Float, a float16 format for neural-network workloads, with support for actual SIMD computation on it, unlike the current F16C extension that only has support for load/store with conversion to float32. This should double the FLOP/cycle throughput vs. single-precision on the same hardware.
Current Intel chips only have actual computation directly on standard float16 in the iGPU.
AMD K10:
AMD Bulldozer/Piledriver/Steamroller/Excavator, per module (two cores):
AMD Ryzen
Intel Atom (Bonnell/45nm, Saltwell/32nm, Silvermont/22nm):
AMD Bobcat:
AMD Jaguar:
ARM Cortex-A9:
ARM Cortex-A15:
Qualcomm Krait:
IBM PowerPC A2 (Blue Gene/Q), per core:
IBM PowerPC A2 (Blue Gene/Q), per thread:
Intel Xeon Phi (Knights Corner), per core:
Intel Xeon Phi (Knights Corner), per thread:
Intel Xeon Phi (Knights Landing), per core:
The reason why there are per-thread and per-core datum for IBM Blue Gene/Q and Intel Xeon Phi (Knights Corner) is that these cores have a higher instruction issue rate when running more than one thread per core.