Suppose I have an array
a = np.array([1, 2, 1, 3, 3, 3, 0])
How can I (efficiently, Pythonically) find which elements of a
are duplicates (i.e., non-unique values)? In this case the result would be array([1, 3, 3])
or possibly array([1, 3])
if efficient.
I've come up with a few methods that appear to work:
m = np.zeros_like(a, dtype=bool)
m[np.unique(a, return_index=True)[1]] = True
a[~m]
a[~np.in1d(np.arange(len(a)), np.unique(a, return_index=True)[1], assume_unique=True)]
This one is cute but probably illegal (as a
isn't actually unique):
np.setxor1d(a, np.unique(a), assume_unique=True)
u, i = np.unique(a, return_inverse=True)
u[np.bincount(i) > 1]
s = np.sort(a, axis=None)
s[:-1][s[1:] == s[:-1]]
s = pd.Series(a)
s[s.duplicated()]
Is there anything I've missed? I'm not necessarily looking for a numpy-only solution, but it has to work with numpy data types and be efficient on medium-sized data sets (up to 10 million in size).
Testing with a 10 million size data set (on a 2.8GHz Xeon):
a = np.random.randint(10**7, size=10**7)
The fastest is sorting, at 1.1s. The dubious xor1d
is second at 2.6s, followed by masking and Pandas Series.duplicated
at 3.1s, bincount
at 5.6s, and in1d
and senderle's setdiff1d
both at 7.3s. Steven's Counter
is only a little slower, at 10.5s; trailing behind are Burhan's Counter.most_common
at 110s and DSM's Counter
subtraction at 360s.
I'm going to use sorting for performance, but I'm accepting Steven's answer because the performance is acceptable and it feels clearer and more Pythonic.
Edit: discovered the Pandas solution. If Pandas is available it's clear and performs well.
As of numpy version 1.9.0, np.unique
has an argument return_counts
which greatly simplifies your task:
u, c = np.unique(a, return_counts=True)
dup = u[c > 1]
This is similar to using Counter
, except you get a pair of arrays instead of a mapping. I'd be curious to see how they perform relative to each other.
It's probably worth mentioning that even though np.unique
is quite fast in practice due to its numpyness, it has worse algorithmic complexity than the Counter
solution. np.unique
is sort-based, so runs asymptotically in O(n log n)
time. Counter
is hash-based, so has O(n)
complexity. This will not matter much for anything but the largest datasets.