What is the difference in CPU cycles (or, in essence, in 'speed') between
x /= y;
and
#include <cmath>
x = sqrt(y);
EDIT: I know the operations aren't equivalent, I'm just arbitrarily proposing x /= y
as a benchmark for x = sqrt(y)
The answer depends on your target platform. Assuming you are on one of the common x86 CPUs, see http://instlatx64.atw.hu/, a collection of measured instruction latencies (how long the CPU takes to produce a result once its arguments are available) and of how the instructions are pipelined, for many x86 and x86_64 processors. If your target is not x86, you can measure the cost yourself or consult your CPU's documentation.
First, get a disassembly of your operations, either from the compiler (e.g. with gcc: gcc file.c -O3 -S -o file.asm)
or by disassembling the compiled binary, e.g. with the help of a debugger.
Remember that your operation also loads and stores a value, and that cost must be counted on top of the instruction itself.
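As a minimal sketch of what to feed the compiler, the two statements from the question can be wrapped in small functions. The function names and the file name ops.cpp are made up for illustration. Since the code is C++, something like g++ -O3 -S ops.cpp -o ops.asm would be used instead of gcc file.c.

```cpp
#include <cmath>

// Hypothetical wrappers (names are illustrative, not from the question).
// Each isolates one operation so it is easy to find in the assembly listing.

// Corresponds to: x /= y;
double divide_by(double x, double y) {
    return x / y;
}

// Corresponds to: x = sqrt(y);
double square_root(double y) {
    return std::sqrt(y);
}
```

On x86-64 you would typically expect to find divsd and sqrtsd in the listing. GCC may also emit a fallback call to the library sqrt to set errno for negative inputs; compiling with -fno-math-errno usually removes it.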
Here are two examples from friweb.hu:
For Core 2 Duo E6700 latency (L) of SQRT (both x87, SSE and SSE2 versions)
of DIVIDE (of floating point numbers):
For newer processors, the cost is less and is almost the same for DIV and for SQRT, e.g. for Sandy Bridge Intel CPU:
Floating-point SQRT is
Floating-point DIVIDE is
SQRT is even a tick faster for 32-bit.
So: on older CPUs, sqrt itself is 30-50% slower than fdiv; on newer CPUs the cost is almost the same. Both operations are cheaper on newer CPUs than they were on older ones. Longer floating-point formats take more time, e.g. 64-bit needs about twice the time of 32-bit, but 80-bit is cheap compared with 64-bit.
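To see which of these cases your own CPU falls into, a rough latency measurement can be sketched as below. This is only an illustration under assumptions: the helper names are invented, the start values and divisor are arbitrary, and it reports wall-clock nanoseconds rather than cycles. Each iteration consumes the previous result, so the loop measures latency, not throughput.

```cpp
#include <chrono>
#include <cmath>

// Hypothetical helper (not from the original answer): runs n dependent
// applications of op, each consuming the previous result, and returns
// the average wall-clock nanoseconds per application.
template <typename Op>
double ns_per_op(Op op, double start, long n) {
    volatile double seed = start;  // volatile read hides the input from the optimizer
    volatile double sink = 0.0;    // volatile write keeps the loop from being removed
    const auto t0 = std::chrono::steady_clock::now();
    double x = seed;
    for (long i = 0; i < n; ++i) {
        x = op(x);  // the next iteration depends on this result
    }
    sink = x;
    const auto t1 = std::chrono::steady_clock::now();
    (void)sink;
    return std::chrono::duration<double, std::nano>(t1 - t0).count() /
           static_cast<double>(n);
}

// Average time per step of a chain of divisions, x /= y.
double div_ns(long n) {
    volatile double divisor = 1.0000001;  // arbitrary divisor, read at run time
    const double y = divisor;
    return ns_per_op([y](double x) { return x / y; }, 1.0e6, n);
}

// Average time per step of a chain of x = sqrt(x).
double sqrt_ns(long n) {
    return ns_per_op([](double x) { return std::sqrt(x); }, 1.0e300, n);
}
```

Calling div_ns(10000000) and sqrt_ns(10000000) from main and printing both numbers gives a first comparison. Multiplying by the clock frequency in GHz converts to approximate cycles, and the figures include loop overhead, so treat them as indicative only.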
Also, newer CPUs have vector operations (SSE, SSE2, AVX) that run at the same speed as the scalar (x87) ones. A vector holds 2-4 values of the same type. If you can arrange your loop to apply the same operation to several FP values at once, you will get more performance out of the CPU.
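A loop shaped for this could look like the sketch below. The function names are invented for illustration, and whether the compiler actually emits packed instructions (such as sqrtps or divps) depends on the compiler, the flags, and the target. With GCC, -O3 is generally needed, and the sqrt loop usually also needs -fno-math-errno.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical helpers: the same operation is applied to every element and
// no iteration depends on another, which is what lets a compiler process
// several floats per instruction.

// out[i] = sqrt(in[i]) for i in [0, n)
void sqrt_all(const float* in, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = std::sqrt(in[i]);
    }
}

// out[i] = num[i] / den[i] for i in [0, n)
void divide_all(const float* num, const float* den, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = num[i] / den[i];
    }
}
```

Checking the -S output of these functions shows whether the loop was vectorized on your setup.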