Determining duplicate values in an array

Suppose I have an array

``````a = np.array([1, 2, 1, 3, 3, 3, 0])
``````

How can I (efficiently, Pythonically) find which elements of `a` are duplicates (i.e., non-unique values)? In this case the result would be `array([1, 3, 3])`, or possibly `array([1, 3])` if that's more efficient.

I’ve come up with a few methods that appear to work:

Masking

``````m = np.zeros_like(a, dtype=bool)
m[np.unique(a, return_index=True)[1]] = True
a[~m]
``````

Set operations

``````a[~np.in1d(np.arange(len(a)), np.unique(a, return_index=True)[1], assume_unique=True)]
``````

This one is cute but probably illegal (as `a` isn’t actually unique):

``````np.setxor1d(a, np.unique(a), assume_unique=True)
``````

Histograms

``````u, i = np.unique(a, return_inverse=True)
u[np.bincount(i) > 1]
``````

Sorting

``````s = np.sort(a, axis=None)
s[:-1][s[1:] == s[:-1]]
``````

Pandas

``````s = pd.Series(a)
s[s.duplicated()]
``````

Is there anything I’ve missed? I’m not necessarily looking for a numpy-only solution, but it has to work with numpy data types and be efficient on medium-sized data sets (up to 10 million elements).

Conclusions

Testing with a 10-million-element data set (on a 2.8 GHz Xeon):

``````a = np.random.randint(10**7, size=10**7)
``````

The fastest is sorting, at 1.1s. The dubious `xor1d` is second at 2.6s, followed by masking and Pandas `Series.duplicated` at 3.1s, `bincount` at 5.6s, and `in1d` and senderle’s `setdiff1d` both at 7.3s. Steven’s `Counter` is only a little slower, at 10.5s; trailing behind are Burhan’s `Counter.most_common` at 110s and DSM’s `Counter` subtraction at 360s.

I’m going to use sorting for performance, but I’m accepting Steven’s answer because the performance is acceptable and it feels clearer and more Pythonic.

Edit: discovered the Pandas solution. If Pandas is available it’s clear and performs well.
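
For anyone who wants to reproduce the comparison, here is a minimal sketch of the kind of timing harness involved (the helper names are mine; absolute numbers will vary with hardware and numpy version):

``````import timeit

import numpy as np

a = np.random.randint(10**7, size=10**7)

def dup_sorting(arr):
    # Sorting method: adjacent equal elements are duplicates.
    s = np.sort(arr, axis=None)
    return s[:-1][s[1:] == s[:-1]]

def dup_bincount(arr):
    # Histogram method: count occurrences of each unique value.
    u, i = np.unique(arr, return_inverse=True)
    return u[np.bincount(i) > 1]

for f in (dup_sorting, dup_bincount):
    t = timeit.timeit(lambda: f(a), number=3) / 3
    print(f'{f.__name__}: {t:.2f} s')
``````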

As of numpy version 1.9.0, `np.unique` has an argument `return_counts` which greatly simplifies your task:

``````u, c = np.unique(a, return_counts=True)
dup = u[c > 1]
``````

This is similar to using `Counter`, except you get a pair of arrays instead of a mapping. I’d be curious to see how they perform relative to each other.

It’s probably worth mentioning that even though `np.unique` is quite fast in practice due to its numpyness, it has worse algorithmic complexity than the `Counter` solution. `np.unique` is sort-based, so runs asymptotically in `O(n log n)` time. `Counter` is hash-based, so has `O(n)` complexity. This will not matter much for anything but the largest datasets.
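
If you want to check that tradeoff empirically, a direct head-to-head is easy to sketch (my comparison, with the caveat that results depend on data size and dtype):

``````from collections import Counter
from timeit import timeit

import numpy as np

a = np.random.randint(10**6, size=10**6)

# Sort-based, O(n log n), but runs as vectorized C code.
t_unique = timeit(lambda: np.unique(a, return_counts=True), number=10) / 10

# Hash-based, O(n), but touches each element from Python.
t_counter = timeit(lambda: Counter(a), number=10) / 10

print(f'np.unique: {t_unique:.3f} s, Counter: {t_counter:.3f} s')
``````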

I think this is most clearly done outside of `numpy`. You’ll have to time it against your `numpy` solutions if you are concerned with speed.

``````>>> import numpy as np
>>> from collections import Counter
>>> a = np.array([1, 2, 1, 3, 3, 3, 0])
>>> [item for item, count in Counter(a).items() if count > 1]
[1, 3]
``````

Note: this is similar to Burhan Khalid’s answer, but the use of `items` without subscripting in the condition should be faster.

People have already suggested `Counter` variants, but here’s one which doesn’t use a listcomp:

``````>>> from collections import Counter
>>> a = [1, 2, 1, 3, 3, 3, 0]
>>> list((Counter(a) - Counter(set(a))).keys())
[1, 3]
``````

[Posted not because it’s efficient — it’s not — but because I think it’s cute that you can subtract `Counter` instances.]

For Python 2.7+

``````>>> import numpy
>>> from collections import Counter
>>> n = numpy.array([1,1,2,3,3,3,0])
>>> [x[0] for x in Counter(n).most_common() if x[1] > 1]
[3, 1]
``````

Here’s another approach using set operations that I think is a bit more straightforward than the ones you offer:

``````>>> indices = np.setdiff1d(np.arange(len(a)), np.unique(a, return_index=True)[1])
>>> a[indices]
array([1, 3, 3])
``````

I suppose you’re asking for `numpy`-only solutions, since if that’s not the case, it’s very difficult to argue with just using a `Counter` instead. I think you should make that requirement explicit though.

If `a` is made up of small integers, you can use `numpy.bincount` directly:

``````import numpy as np

a = np.array([3, 2, 2, 0, 4, 3])
counts = np.bincount(a)
print(np.where(counts > 1)[0])
# [2 3]
``````

This is very similar to your “histogram” method, which is the one I would use if `a` were not made up of small integers.
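
To illustrate why: `np.bincount(a)` allocates one counter for every integer up to `max(a)`, so a few large values waste a lot of memory, while the `return_inverse` histogram only counts values that actually occur. A small example of the safe variant:

``````import numpy as np

a = np.array([10**9, 2, 10**9])

# np.bincount(a) would try to allocate ~10**9 counters here.
u, inv = np.unique(a, return_inverse=True)
print(u[np.bincount(inv) > 1])  # [1000000000]
``````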

If the array is a sorted numpy array, then just do:

``````a = np.array([1, 2, 2, 3, 4, 5, 5, 6])
rep_el = a[:-1][np.diff(a) == 0]  # np.diff output is one shorter than a, so index a[:-1]
``````

I’m adding my solution to the pile for this 3-year-old question because none of the existing solutions fit what I wanted, or they used libraries besides numpy. This method finds both the indices of duplicates and the values for distinct sets of duplicates.

``````import numpy as np

A = np.array([1, 2, 3, 4, 4, 4, 5, 6, 6, 7, 8])

# Record the indices where each unique element occurs.
list_of_dup_inds = [np.where(A == u)[0] for u in np.unique(A)]

# Filter out non-duplicates.
list_of_dup_inds = [inds for inds in list_of_dup_inds if len(inds) > 1]

for inds in list_of_dup_inds:
    print(inds, A[inds])
# >> [3 4 5] [4 4 4]
# >> [7 8] [6 6]
``````

``````>>> import numpy as np
>>> a = np.array([1, 2, 2, 2, 2, 3])
>>> uniques, uniq_idx, counts = np.unique(a, return_index=True, return_counts=True)
>>> duplicates = a[uniq_idx[counts >= 2]]  # <-- get duplicates
``````

If you also want to get the orphans:

``````>>> orphans = a[uniq_idx[counts == 1]]
``````

Combination of Pandas and NumPy, utilizing `value_counts()`:

``````import pandas as pd
import numpy as np

arr=np.array(('a','b','b','c','a'))
pd.Series(arr).value_counts()
``````

OUTPUT:

``````a    2
b    2
c    1
``````
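
To reduce that to just the duplicated values, filter the counts (a small follow-up using the same `arr`):

``````vc = pd.Series(arr).value_counts()
print(vc[vc > 1].index.to_numpy())  # e.g. ['a' 'b']
``````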

The answers/resolutions are collected from Stack Overflow and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.