Creating distinct list from existing list in Java 7 and 8?

Matt Coubrough picture Matt Coubrough · Dec 14, 2014 · Viewed 31.2k times · Source

If I have:

List<Integer> listInts = { 1, 1, 3, 77, 2, 19, 77, 123, 14, 123... }

in Java what is an efficient way of creating a List<Integer> listDistinctInts containing only the distinct values from listInts?

My immediate thought is to create a Set<Integer> setInts containing all the values from listInts then call List<Integer> listDistinctInts = new ArrayList<>(setInts);

But this seems potentially inefficient - is there a better solution using Java 7?

I'm not using Java 8, but I believe using it I could do something like this(?):

List<Integer> listDistinctInts = listInts.stream().distinct().collect(Collectors.toList());

Would this be more performant than the approach above and/or is there any more efficient way of doing this in Java 8?

Finally, (and I'm aware that asking multiple questions might be frowned upon but it's directly related) if I only cared about the count of distinct elements in listInts is there a more efficient way to get that value (in Java 7 and 8) - without first creating a list or set of all the distinct elements?

I'm most interested in native Java ways of accomplishing this and avoiding re-inventing any wheels but would consider hand-rolled code or libraries if they offer better clarity or performance. I've read this related question Java - Distinct List of Objects but it's not entirely clear about the differences in performance between the Java 7 and 8 approaches or whether there might be better techniques?

Answer

Matt Coubrough picture Matt Coubrough · Dec 16, 2014

I've now MicroBenchmarked most of the proposed options from the excellent answers provided. Like most non-trivial performance related questions the answer as to which is best is "it depends".

All my testing was performed with JMH Java Microbenchmarking Harness.

Most of these tests were performed using JDK 1.8, although I performed some of the tests with JDK 1.7 too just to ensure that its performance wasn't too different (it was almost identical). I tested the following techniques taken from the answers supplied so far:


1. Java 8 Stream - The solution using stream() I had proprosed as a possibility if using Java8:

public List<Integer> testJava8Stream(List<Integer> listInts) {
    return listInts.stream().distinct().collect(Collectors.toList());
}

pros modern Java 8 approach, no 3rd party dependencies

cons Requires Java 8


2. Adding To List - The solution proposed by Victor2748 where a new list is constructed and added to, if and only if the list doesn't already contain the value. Note that I also preallocate the destination list at the size of the original (the max possible) to prevent any reallocations:

public List<Integer> testAddingToList(List<Integer> listInts) {
    List<Integer> listDistinctInts = new ArrayList<>(listInts.size());
    for(Integer i : listInts)
    {
        if( !listDistinctInts.contains(i) ) { listDistinctInts.add(i); }
    }
    return listDistinctInts;
}

pros Works in any Java version, no need to create a Set and then copy, no 3rd party deps

cons Needs to repeatedly check the List for existing values as we build it


3. GS Collections Fast (now Eclipse collections) - The solution proposed by Craig P. Motlin using the GS Collections library and their custom List type FastList:

public List<Integer> testGsCollectionsFast(FastList listFast)
{
    return listFast.distinct();
}

pros Reportedly very quick, simple expressive code, works in Java 7 and 8

cons Requires 3rd party library and a FastList rather than a regular List<Integer>


4. GS Collections Adapted - The FastList solution wasn't quite comparing like-for-like because it needed a FastList passed to the method rather than a good ol' ArrayList<Integer> so I also tested the adapter method Craig proposed:

public List<Integer> testGsCollectionsAdapted(List<Integer> listInts)
{
    return listAdapter.adapt(listInts).distinct();
}

pros Doesn't require a FastList, works in Java 7 and 8

cons Has to adapt List so may not perform as well, needs 3rd party library


5. Guava ImmutableSet - The method proposed by Louis Wasserman in comments, and by 卢声远 Shengyuan Lu in their answer using Guava:

public List<Integer> testGuavaImmutable(List<Integer> listInts)
{
    return ImmutableSet.copyOf(listInts).asList();
}

pros Reportedly very fast, works in Java 7 or 8

cons Returns an Immutable List, can't handle nulls in the input List, and requires 3rd party library


7. HashSet - My original idea (also recommended by EverV0id, ulix and Radiodef)

public List<Integer> testHashSet(List<Integer> listInts)
{
    return new ArrayList<Integer>(new HashSet<Integer>(listInts));
}

pros Works in Java 7 and 8, no 3rd party dependencies

cons Doesn't retain original order of list, has to construct set then copy to list.


6. LinkedHashSet - Since the HashSet solution didn't preserve the order of the Integers in the original list I also tested a version which uses LinkedHashSet to preserve order:

public List<Integer> testLinkedHashSet(List<Integer> listInts)
{
    return new ArrayList<Integer>(new LinkedHashSet<Integer>(listInts));
}

pros Retains original ordering, works in Java 7 and 8, no 3rd party dependencies

cons Unlikely to be as fast as regular HashSet approach


Results

Here are my results for various different sizes of listInts (results ordered from slowest to fastest):

1. taking distinct from ArrayList of 100,000 random ints between 0-50,000 (ie. big list, some duplicates)

Benchmark                Mode       Samples     Mean   Mean error    Units

AddingToList            thrpt        10        0.505        0.012    ops/s
Java8Stream             thrpt        10      234.932       31.959    ops/s
LinkedHashSet           thrpt        10      262.185       16.679    ops/s
HashSet                 thrpt        10      264.295       24.154    ops/s
GsCollectionsAdapted    thrpt        10      357.998       18.468    ops/s
GsCollectionsFast       thrpt        10      363.443       40.089    ops/s
GuavaImmutable          thrpt        10      469.423       26.056    ops/s

2. taking distinct from ArrayList of 1000 random ints between 0-50 (ie. medium list, many duplicates)

Benchmark                Mode       Samples     Mean   Mean error    Units

AddingToList            thrpt        10    32794.698     1154.113    ops/s
HashSet                 thrpt        10    61622.073     2752.557    ops/s
LinkedHashSet           thrpt        10    67155.865     1690.119    ops/s
Java8Stream             thrpt        10    87440.902    13517.925    ops/s
GsCollectionsFast       thrpt        10   103490.738    35302.201    ops/s
GsCollectionsAdapted    thrpt        10   143135.973     4733.601    ops/s
GuavaImmutable          thrpt        10   186301.330    13421.850    ops/s

3. taking distinct from ArrayList of 100 random ints between 0-100 (ie. small list, some duplicates)

Benchmark                Mode       Samples     Mean   Mean error    Units

AddingToList            thrpt        10   278435.085    14229.285    ops/s
Java8Stream             thrpt        10   397664.052    24282.858    ops/s
LinkedHashSet           thrpt        10   462701.618    20098.435    ops/s
GsCollectionsAdapted    thrpt        10   477097.125    15212.580    ops/s
GsCollectionsFast       thrpt        10   511248.923    48155.211    ops/s
HashSet                 thrpt        10   512003.713    25886.696    ops/s
GuavaImmutable          thrpt        10  1082006.560    18716.012    ops/s

4. taking distinct from ArrayList of 10 random ints between 0-50 (ie. tiny list, few duplicates)

Benchmark                Mode       Samples     Mean   Mean error    Units

Java8Stream             thrpt        10  2739774.758   306124.297    ops/s
LinkedHashSet           thrpt        10  3607479.332   150331.918    ops/s
HashSet                 thrpt        10  4238393.657   185624.358    ops/s
GsCollectionsAdapted    thrpt        10  5919254.755   495444.800    ops/s
GsCollectionsFast       thrpt        10  7916079.963  1708778.450    ops/s
AddingToList            thrpt        10  7931479.667   966331.036    ops/s
GuavaImmutable          thrpt        10  9021621.880   845936.861    ops/s

Conclusions

  • If you're only taking the distinct items from a list once, and the list isn't very long any of these methods should be adequate.

  • The most efficient general approaches came from the 3rd party libraries: GS Collections and Guava performed admirably.

  • You may need to consider the size of your list and the likely number of duplicates when selecting the most performant method.

  • The naive approach of adding to a new list only if the value isn't already in it works great for tiny lists, but as soon as you have more than a handful of values in the input list it performs the worst of the methods tried.

  • The Guava ImmutableSet.copyOf(listInts).asList() method works the fastest in most situations. But take note of the restrictions: the returned list is Immutable and the input list cannot contain nulls.

  • The HashSet method performs the best of the non 3rd party approaches and usually better than Java 8 streams, but reorders the integers (which may or may not be an issue depending on your use-case).

  • The LinkedHashSet approach keeps the ordering but unsurprisingly was usually worse than the HashSet method.

  • Both the HashSet and LinkedHashSet methods will perform worse when using lists of data types that have complex HashCode calculations, so do your own profiling if you're trying to select distinct Foos from a List<Foo>.

  • If you already have GS Collections as a dependency then it performs very well and is more flexible than the ImmutableList Guava approach. If you don't have it as a dependency, it's worth considering adding it if the performance of selecting distinct items is critical to the performance of your application.

  • Disappointingly Java 8 streams seemed to perform fairly poorly. There may be a better way to code the distinct() call than the way I used, so comments or other answers are of course welcome.

NB. I'm no expert at MicroBenchmarking, so if anyone finds flaws in my results or methodology please notify me and I'll endeavour to correct the Answer.