If I have:
List<Integer> listInts = { 1, 1, 3, 77, 2, 19, 77, 123, 14, 123... }
in Java what is an efficient way of creating a List<Integer> listDistinctInts
containing only the distinct values from listInts
?
My immediate thought is to create a Set<Integer> setInts
containing all the values from listInts
then call List<Integer> listDistinctInts = new ArrayList<>(setInts);
But this seems potentially inefficient - is there a better solution using Java 7?
I'm not using Java 8, but I believe using it I could do something like this(?):
List<Integer> listDistinctInts = listInts.stream().distinct().collect(Collectors.toList());
Would this be more performant than the approach above and/or is there any more efficient way of doing this in Java 8?
Finally, (and I'm aware that asking multiple questions might be frowned upon but it's directly related) if I only cared about the count of distinct elements in listInts
is there a more efficient way to get that value (in Java 7 and 8) - without first creating a list or set of all the distinct elements?
I'm most interested in native Java ways of accomplishing this and avoiding re-inventing any wheels but would consider hand-rolled code or libraries if they offer better clarity or performance. I've read this related question Java - Distinct List of Objects but it's not entirely clear about the differences in performance between the Java 7 and 8 approaches or whether there might be better techniques?
I've now MicroBenchmarked most of the proposed options from the excellent answers provided. Like most non-trivial performance related questions the answer as to which is best is "it depends".
All my testing was performed with JMH Java Microbenchmarking Harness.
Most of these tests were performed using JDK 1.8, although I performed some of the tests with JDK 1.7 too just to ensure that its performance wasn't too different (it was almost identical). I tested the following techniques taken from the answers supplied so far:
1. Java 8 Stream - The solution using stream()
I had proprosed as a possibility if using Java8:
public List<Integer> testJava8Stream(List<Integer> listInts) {
return listInts.stream().distinct().collect(Collectors.toList());
}
pros modern Java 8 approach, no 3rd party dependencies
cons Requires Java 8
2. Adding To List - The solution proposed by Victor2748 where a new list is constructed and added to, if and only if the list doesn't already contain the value. Note that I also preallocate the destination list at the size of the original (the max possible) to prevent any reallocations:
public List<Integer> testAddingToList(List<Integer> listInts) {
List<Integer> listDistinctInts = new ArrayList<>(listInts.size());
for(Integer i : listInts)
{
if( !listDistinctInts.contains(i) ) { listDistinctInts.add(i); }
}
return listDistinctInts;
}
pros Works in any Java version, no need to create a Set and then copy, no 3rd party deps
cons Needs to repeatedly check the List for existing values as we build it
3. GS Collections Fast (now Eclipse collections) - The solution proposed by Craig P. Motlin using the GS Collections library and their custom List type FastList
:
public List<Integer> testGsCollectionsFast(FastList listFast)
{
return listFast.distinct();
}
pros Reportedly very quick, simple expressive code, works in Java 7 and 8
cons Requires 3rd party library and a FastList
rather than a regular List<Integer>
4. GS Collections Adapted - The FastList solution wasn't quite comparing like-for-like because it needed a FastList
passed to the method rather than a good ol' ArrayList<Integer>
so I also tested the adapter method Craig proposed:
public List<Integer> testGsCollectionsAdapted(List<Integer> listInts)
{
return listAdapter.adapt(listInts).distinct();
}
pros Doesn't require a FastList
, works in Java 7 and 8
cons Has to adapt List so may not perform as well, needs 3rd party library
5. Guava ImmutableSet - The method proposed by Louis Wasserman in comments, and by 卢声远 Shengyuan Lu in their answer using Guava:
public List<Integer> testGuavaImmutable(List<Integer> listInts)
{
return ImmutableSet.copyOf(listInts).asList();
}
pros Reportedly very fast, works in Java 7 or 8
cons Returns an Immutable List
, can't handle nulls in the input List, and requires 3rd party library
7. HashSet - My original idea (also recommended by EverV0id, ulix and Radiodef)
public List<Integer> testHashSet(List<Integer> listInts)
{
return new ArrayList<Integer>(new HashSet<Integer>(listInts));
}
pros Works in Java 7 and 8, no 3rd party dependencies
cons Doesn't retain original order of list, has to construct set then copy to list.
6. LinkedHashSet - Since the HashSet
solution didn't preserve the order of the Integers in the original list I also tested a version which uses LinkedHashSet to preserve order:
public List<Integer> testLinkedHashSet(List<Integer> listInts)
{
return new ArrayList<Integer>(new LinkedHashSet<Integer>(listInts));
}
pros Retains original ordering, works in Java 7 and 8, no 3rd party dependencies
cons Unlikely to be as fast as regular HashSet
approach
Here are my results for various different sizes of listInts
(results ordered from slowest to fastest):
1. taking distinct from ArrayList of 100,000 random ints between 0-50,000 (ie. big list, some duplicates)
Benchmark Mode Samples Mean Mean error Units
AddingToList thrpt 10 0.505 0.012 ops/s
Java8Stream thrpt 10 234.932 31.959 ops/s
LinkedHashSet thrpt 10 262.185 16.679 ops/s
HashSet thrpt 10 264.295 24.154 ops/s
GsCollectionsAdapted thrpt 10 357.998 18.468 ops/s
GsCollectionsFast thrpt 10 363.443 40.089 ops/s
GuavaImmutable thrpt 10 469.423 26.056 ops/s
2. taking distinct from ArrayList of 1000 random ints between 0-50 (ie. medium list, many duplicates)
Benchmark Mode Samples Mean Mean error Units
AddingToList thrpt 10 32794.698 1154.113 ops/s
HashSet thrpt 10 61622.073 2752.557 ops/s
LinkedHashSet thrpt 10 67155.865 1690.119 ops/s
Java8Stream thrpt 10 87440.902 13517.925 ops/s
GsCollectionsFast thrpt 10 103490.738 35302.201 ops/s
GsCollectionsAdapted thrpt 10 143135.973 4733.601 ops/s
GuavaImmutable thrpt 10 186301.330 13421.850 ops/s
3. taking distinct from ArrayList of 100 random ints between 0-100 (ie. small list, some duplicates)
Benchmark Mode Samples Mean Mean error Units
AddingToList thrpt 10 278435.085 14229.285 ops/s
Java8Stream thrpt 10 397664.052 24282.858 ops/s
LinkedHashSet thrpt 10 462701.618 20098.435 ops/s
GsCollectionsAdapted thrpt 10 477097.125 15212.580 ops/s
GsCollectionsFast thrpt 10 511248.923 48155.211 ops/s
HashSet thrpt 10 512003.713 25886.696 ops/s
GuavaImmutable thrpt 10 1082006.560 18716.012 ops/s
4. taking distinct from ArrayList of 10 random ints between 0-50 (ie. tiny list, few duplicates)
Benchmark Mode Samples Mean Mean error Units
Java8Stream thrpt 10 2739774.758 306124.297 ops/s
LinkedHashSet thrpt 10 3607479.332 150331.918 ops/s
HashSet thrpt 10 4238393.657 185624.358 ops/s
GsCollectionsAdapted thrpt 10 5919254.755 495444.800 ops/s
GsCollectionsFast thrpt 10 7916079.963 1708778.450 ops/s
AddingToList thrpt 10 7931479.667 966331.036 ops/s
GuavaImmutable thrpt 10 9021621.880 845936.861 ops/s
If you're only taking the distinct items from a list once, and the list isn't very long any of these methods should be adequate.
The most efficient general approaches came from the 3rd party libraries: GS Collections and Guava performed admirably.
You may need to consider the size of your list and the likely number of duplicates when selecting the most performant method.
The naive approach of adding to a new list only if the value isn't already in it works great for tiny lists, but as soon as you have more than a handful of values in the input list it performs the worst of the methods tried.
The Guava ImmutableSet.copyOf(listInts).asList()
method works the fastest in most situations. But take note of the restrictions: the returned list is Immutable
and the input list cannot contain nulls.
The HashSet
method performs the best of the non 3rd party approaches and usually better than Java 8 streams, but reorders the integers (which may or may not be an issue depending on your use-case).
The LinkedHashSet
approach keeps the ordering but unsurprisingly was usually worse than the HashSet method.
Both the HashSet
and LinkedHashSet
methods will perform worse when using lists of data types that have complex HashCode calculations, so do your own profiling if you're trying to select distinct Foo
s from a List<Foo>
.
If you already have GS Collections as a dependency then it performs very well and is more flexible than the ImmutableList Guava approach. If you don't have it as a dependency, it's worth considering adding it if the performance of selecting distinct items is critical to the performance of your application.
Disappointingly Java 8 streams seemed to perform fairly poorly. There may be a better way to code the distinct()
call than the way I used, so comments or other answers are of course welcome.
NB. I'm no expert at MicroBenchmarking, so if anyone finds flaws in my results or methodology please notify me and I'll endeavour to correct the Answer.