There is _mm_div_ps for floating-point values division, there is _mm_mullo_epi16 for integer multiplication. But is there something for integer division (16 bits value)? How can i conduct such division?
Please see Agner Fog's vectorclass he has implemented a fast algorithm to do integer division with SSE/AVX for 8-bit, 16-bit, and 32-bit words (but not 64-bit) http://www.agner.org/optimize/#vectorclass
Look in the file vectori128.h for the code and a description of the algoirthm as his well written manual VectorClass.pdf
Here is a fragment describing the algorithm from his manual.
"Integer division There are no instructions in the x86 instruction set and its extensions that are useful for integer vector division, and such instructions would be quite slow if they existed. Therefore, the vector class library is using an algorithm for fast integer division. The basic principle of this algorithm can be expressed in this formula: a / b ≈ a * (2n / b) >> n This calculation goes through the following steps: 1. find a suitable value for n 2. calculate 2n / b 3. calculate necessary corrections for rounding errors 4. do the multiplication and shift-right and apply corrections for rounding errors
This formula is advantageous if multiple numbers are divided by the same divisor b. Steps 1, 2 and 3 need only be done once while step 4 is repeated for each value of the dividend a. The mathematical details are described in the file vectori128.h. (See also T. Granlund and P. L. Montgomery: Division by Invariant Integers Using Multiplication, Proceedings of the SIGPLAN."...
Edit: near the end of the file vectori128.h shows how to do short division with a scalar variable "It takes more time to compute the parameters used for fast division than to do the division. Therefore, it is advantageous to use the same divisor object multiple times. For example, to divide 80 unsigned short integers by 10:
short x = 10;
uint16_t dividends[80], quotients[80]; // numbers to work with
Divisor_us div10(x); // make divisor object for dividing by 10
Vec8us temp; // temporary vector
for (int i = 0; i < 80; i += 8) { // loop for 4 elements per iteration
temp.load(dividends+i); // load 4 elements
temp /= div10; // divide each element by 10
temp.store(quotients+i); // store 4 elements
}
"
Edit: integer division by a vector of shorts
#include <stdio.h>
#include "vectorclass.h"
int main() {
short numa[] = {10, 20, 30, 40, 50, 60, 70, 80};
short dena[] = {10, 20, 30, 40, 50, 60, 70, 80};
Vec8s num = Vec8s().load(numa);
Vec8s den = Vec8s().load(dena);
Vec4f num_low = to_float(extend_low(num));
Vec4f num_high = to_float(extend_high(num));
Vec4f den_low = to_float(extend_low(den));
Vec4f den_high = to_float(extend_high(den));
Vec4f qf_low = num_low/den_low;
Vec4f qf_high = num_high/den_high;
Vec4i q_low = truncate_to_int(qf_low);
Vec4i q_high = truncate_to_int(qf_high);
Vec8s q = compress(q_low, q_high);
for(int i=0; i<8; i++) {
printf("%d ", q[i]);
} printf("\n");
}