When should streams be preferred over traditional loops for best performance? Do streams take advantage of branch-prediction?

Kishore Bandi picture Kishore Bandi · Dec 22, 2016 · Viewed 8.8k times · Source

I just read about Branch-Prediction and wanted to try how this works with Java 8 Streams.

However the performance with Streams is always turning out to be worse than traditional loops.

int totalSize = 32768;
int filterValue = 1280;
int[] array = new int[totalSize];
Random rnd = new Random(0);
int loopCount = 10000;

for (int i = 0; i < totalSize; i++) {
    // array[i] = rnd.nextInt() % 2560; // Unsorted Data
    array[i] = i; // Sorted Data
}

long start = System.nanoTime();
long sum = 0;
for (int j = 0; j < loopCount; j++) {
    for (int c = 0; c < totalSize; ++c) {
        sum += array[c] >= filterValue ? array[c] : 0;
    }
}
long total = System.nanoTime() - start;
System.out.printf("Conditional Operator Time : %d ns, (%f sec) %n", total, total / Math.pow(10, 9));

start = System.nanoTime();
sum = 0;
for (int j = 0; j < loopCount; j++) {
    for (int c = 0; c < totalSize; ++c) {
        if (array[c] >= filterValue) {
            sum += array[c];
        }
    }
}
total = System.nanoTime() - start;
System.out.printf("Branch Statement Time : %d ns, (%f sec) %n", total, total / Math.pow(10, 9));

start = System.nanoTime();
sum = 0;
for (int j = 0; j < loopCount; j++) {
    sum += Arrays.stream(array).filter(value -> value >= filterValue).sum();
}
total = System.nanoTime() - start;
System.out.printf("Streams Time : %d ns, (%f sec) %n", total, total / Math.pow(10, 9));

start = System.nanoTime();
sum = 0;
for (int j = 0; j < loopCount; j++) {
    sum += Arrays.stream(array).parallel().filter(value -> value >= filterValue).sum();
}
total = System.nanoTime() - start;
System.out.printf("Parallel Streams Time : %d ns, (%f sec) %n", total, total / Math.pow(10, 9));

Output :

  1. For Sorted-Array :

    Conditional Operator Time : 294062652 ns, (0.294063 sec) 
    Branch Statement Time : 272992442 ns, (0.272992 sec) 
    Streams Time : 806579913 ns, (0.806580 sec) 
    Parallel Streams Time : 2316150852 ns, (2.316151 sec) 
    
  2. For Un-Sorted Array:

    Conditional Operator Time : 367304250 ns, (0.367304 sec) 
    Branch Statement Time : 906073542 ns, (0.906074 sec) 
    Streams Time : 1268648265 ns, (1.268648 sec) 
    Parallel Streams Time : 2420482313 ns, (2.420482 sec) 
    

I tried the same code using List:
list.stream() instead of Arrays.stream(array)
list.get(c) instead of array[c]

Output :

  1. For Sorted-List :

    Conditional Operator Time : 860514446 ns, (0.860514 sec) 
    Branch Statement Time : 663458668 ns, (0.663459 sec) 
    Streams Time : 2085657481 ns, (2.085657 sec) 
    Parallel Streams Time : 5026680680 ns, (5.026681 sec) 
    
  2. For Un-Sorted List

    Conditional Operator Time : 704120976 ns, (0.704121 sec) 
    Branch Statement Time : 1327838248 ns, (1.327838 sec) 
    Streams Time : 1857880764 ns, (1.857881 sec) 
    Parallel Streams Time : 2504468688 ns, (2.504469 sec) 
    

I referred to few blogs this & this which suggest the same performance issue w.r.t streams.

  1. I agree to the point that programming with streams is nice and easier for some scenarios but when we're losing out on performance, why do we need to use them? Is there something I'm missing out on?
  2. Which is the scenario in which streams perform equal to loops? Is it only in the case where your function defined takes a lot of time, resulting in a negligible loop performance?
  3. In none of the scenario's I could see streams taking advantage of branch-prediction (I tried with sorted and unordered streams, but of no use. It gave more than double the performance impact compared to normal streams)?

Answer

Peter Lawrey picture Peter Lawrey · Dec 22, 2016

I agree to the point that programming with streams is nice and easier for some scenarios but when we're losing out on performance, why do we need to use them?

Performance is rarely an issue. It would be usual for 10% of your streams would need to be rewritten as loops to get the performance you need.

Is there something I'm missing out on?

Using parallelStream() is much easier using streams and possibly more efficient as it's hard to write efficient concurrent code.

Which is the scenario in which streams perform equal to loops? Is it only in the case where your function defined takes a lot of time, resulting in a negligible loop performance?

Your benchmark is flawed in the sense that the code hasn't been compiled when it starts. I would do the whole test in a loop as JMH does, or I would use JMH.

In none of the scenario's I could see streams taking advantage of branch-prediction

Branch prediction is a CPU feature not a JVM or streams feature.