AVX 512 vs AVX2 performance for simple array processing loops

Vojtěch Melda Meluzín · Sep 26, 2018 · Viewed 10.3k times

I'm currently working on some optimizations and comparing vectorization possibilities for DSP applications, which seem ideal for AVX512 since these are just simple uncorrelated array-processing loops. But on a new i9 I didn't measure any reasonable improvement when using AVX512 compared to AVX2. Any pointers? Any good results? (BTW, I tried MSVC/Clang/ICL; no noticeable difference between them, and the AVX512 code often actually seems slower.)

Answer

Peter Cordes · Sep 26, 2018

This seems too broad, but there are actually some microarchitectural details worth mentioning.

Note that AVX512VL (Vector Length) lets you use new AVX512 instructions (like packed uint64_t <-> double conversion, mask registers, etc.) on 128- and 256-bit vectors. Modern compilers typically auto-vectorize with 256-bit vectors when tuning for Skylake-AVX512 (aka Skylake-X), e.g. with gcc -march=native or gcc -march=skylake-avx512, unless you override the tuning options to set the preferred vector width to 512 for code where the tradeoffs are worth it. See @zam's answer.
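
For a rough idea of what that looks like in practice, here's a minimal sketch (the kernel and function name are placeholders I made up, not from your code): a pure-vertical loop plus the GCC/Clang option that raises the preferred vector width.

    // scale_add.cpp -- trivial DSP-style loop, ideal for auto-vectorization.
    //   g++ -O3 -march=skylake-avx512 scale_add.cpp
    //       -> typically vectorizes with 256-bit YMM vectors (default tuning)
    //   g++ -O3 -march=skylake-avx512 -mprefer-vector-width=512 scale_add.cpp
    //       -> asks the compiler to use 512-bit ZMM vectors instead
    #include <cstddef>

    void scale_add(float *dst, const float *a, const float *b, float k, std::size_t n)
    {
        // Simple, uncorrelated, pure-vertical work: every iteration is independent,
        // so the compiler can widen this to whatever vector width it prefers.
        for (std::size_t i = 0; i < n; ++i)
            dst[i] = a[i] * k + b[i];
    }

Check the generated asm (YMM vs. ZMM registers) to confirm which width you actually got.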


Some major issues with 512-bit vectors (as opposed to 256-bit vectors using AVX512 instructions, like vpxord ymm30, ymm29, ymm10) on Skylake-X are:

  • Aligning your data to the vector width matters more than with AVX2: every misaligned 512-bit load crosses a cache-line boundary, instead of only every other one while looping over an array. In practice it makes a bigger difference; I forget the exact results of something I tested a while ago, but maybe a 20% slowdown from misalignment vs. under 5% with 256-bit vectors. (The sketch after this list allocates its buffers with 64-byte alignment.)

  • Running 512-bit uops shuts down the vector ALU on port 1 (but not the integer execution units on port 1). Some Skylake-X CPUs (e.g. Xeon Bronze) only have 1-per-clock 512-bit FMA throughput, but i7 / i9 Skylake-X CPUs, and the higher-end Xeons, have an extra 512-bit FMA unit on port 5 that powers up in AVX512 "mode".

    So plan accordingly: you won't get double speed from widening to AVX512, and the bottleneck in your code might now be in the back-end.

  • Running 512-bit uops also limits your max Turbo, so wall-clock speedups can be lower than core-clock-cycle speedups. There are two levels of Turbo reduction: one for running any 512-bit operation at all, and a further one for heavy 512-bit work like sustained FMAs.

  • The FP divide/sqrt execution unit for vsqrtps/pd zmm and vdivps/pd zmm is not full width; it's only 128 bits wide, so the ratio of div/sqrt vs. multiply throughput is worse by about another factor of 2. See Floating point division vs floating point multiplication. SKX throughput for vsqrtps xmm/ymm/zmm is one per 3/6/12 cycles; double precision has the same ratios but worse throughput and latency.

    Up to 256-bit YMM vectors, the latency is the same as XMM (12 cycles for sqrt), but for 512-bit ZMM the latency goes up to 20 cycles, and it takes 3 uops. (See https://agner.org/optimize/ for instruction tables.)

    If you bottleneck on the divider and can't get enough other instructions into the mix, VRSQRT14PS is worth considering, even if you need a Newton iteration to get enough precision (see the sketch after this list). Note that AVX512's approximate 1/sqrt(x) has more guaranteed-accuracy bits than the AVX/SSE version.
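
To make those last two points concrete, here's a minimal intrinsics sketch of my own (assuming a C++17 compiler, n a multiple of 16, and strictly positive input): it allocates 64-byte-aligned buffers and computes 1/sqrt(x) with VRSQRT14PS plus one Newton-Raphson step, instead of touching the narrow divide unit.

    // compile with e.g. -O3 -march=skylake-avx512
    #include <immintrin.h>
    #include <cstdlib>
    #include <cstddef>

    // 64-byte alignment matches the 512-bit vector width, so no 512-bit
    // load/store ever splits across a cache-line boundary.
    float *alloc_floats(std::size_t n)   // n must make the size a multiple of 64
    {
        return static_cast<float *>(std::aligned_alloc(64, n * sizeof(float)));
    }

    // dst[i] = 1 / sqrt(src[i]) for n elements (n a multiple of 16).
    void recip_sqrt(float *dst, const float *src, std::size_t n)
    {
        const __m512 half  = _mm512_set1_ps(0.5f);
        const __m512 three = _mm512_set1_ps(3.0f);
        for (std::size_t i = 0; i < n; i += 16) {
            __m512 x = _mm512_load_ps(src + i);   // aligned 512-bit load
            __m512 r = _mm512_rsqrt14_ps(x);      // ~14-bit-accurate approximation
            // One Newton-Raphson step: r = 0.5 * r * (3 - x*r*r)
            r = _mm512_mul_ps(_mm512_mul_ps(half, r),
                              _mm512_fnmadd_ps(_mm512_mul_ps(x, r), r, three));
            _mm512_store_ps(dst + i, r);
        }
    }

One Newton step roughly doubles the number of correct bits, so starting from ~14 bits it gets close to full single precision; check whether that's accurate enough for your use case.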


As far as auto-vectorization goes, if any shuffles are required, compilers might do a worse job with wider vectors. For simple pure-vertical stuff, compilers can do OK with AVX512.

Your previous question had a sin function; if the compiler / SIMD math library only has a 256-bit version of it, maybe it won't auto-vectorize with AVX512 at all.

If AVX512 doesn't help, maybe you're bottlenecked on memory bandwidth. Profile with performance counters and find out. Or try more repeats of smaller buffer sizes and see if it speeds up significantly when your data is hot in cache. If so, try to cache-block your code, or increase computational intensity by doing more in one pass over the data.
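
Here's the "do more per pass" idea as a sketch (stage A and stage B are arbitrary stand-ins for whatever per-sample processing your pipeline actually does):

    #include <cstddef>

    // Two separate sweeps: if the buffer is much bigger than L2, it streams
    // through DRAM twice.
    void two_passes(float *buf, std::size_t n)
    {
        for (std::size_t i = 0; i < n; ++i) buf[i] *= 0.5f;                   // stage A
        for (std::size_t i = 0; i < n; ++i) buf[i] = buf[i] * buf[i] + 1.0f;  // stage B
    }

    // Fused: each element goes through both stages while it's still in
    // registers / L1d, halving the DRAM traffic for the same amount of work.
    void fused(float *buf, std::size_t n)
    {
        for (std::size_t i = 0; i < n; ++i) {
            float x = buf[i] * 0.5f;   // stage A
            buf[i] = x * x + 1.0f;     // stage B
        }
    }

If fusing the stages isn't practical, the other option is to keep them separate but run each one over an L2-sized block of the buffer at a time (cache blocking).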

AVX512 does double the theoretical max FMA throughput on an i9 (and integer multiply, and many other things that run on the same execution units), making the mismatch between DRAM bandwidth and execution throughput twice as big. So there's twice as much to gain from making better use of L2 / L1d cache.

Working with data while it's already loaded in registers is good.