df = pd.DataFrame({'Col1': ['Bob', 'Joe', 'Bill', 'Mary', 'Joe'],
'Col2': ['Joe', 'Steve', 'Bob', 'Bob', 'Steve'],
'Col3': np.random.random(5)})
What is the best way to return the unique values of 'Col1' and 'Col2'?
The desired output is
'Bob', 'Joe', 'Bill', 'Mary', 'Steve'
pd.unique
returns the unique values from an input array, or DataFrame column or index.
The input to this function needs to be one-dimensional, so multiple columns will need to be combined. The simplest way is to select the columns you want and then view the values in a flattened NumPy array. The whole operation looks like this:
>>> pd.unique(df[['Col1', 'Col2']].values.ravel('K'))
array(['Bob', 'Joe', 'Bill', 'Mary', 'Steve'], dtype=object)
Note that ravel()
is an array method that returns a view (if possible) of a multidimensional array. The argument 'K'
tells the method to flatten the array in the order the elements are stored in the memory (pandas typically stores underlying arrays in Fortran-contiguous order; columns before rows). This can be significantly faster than using the method's default 'C' order.
An alternative way is to select the columns and pass them to np.unique
:
>>> np.unique(df[['Col1', 'Col2']].values)
array(['Bill', 'Bob', 'Joe', 'Mary', 'Steve'], dtype=object)
There is no need to use ravel()
here as the method handles multidimensional arrays. Even so, this is likely to be slower than pd.unique
as it uses a sort-based algorithm rather than a hashtable to identify unique values.
The difference in speed is significant for larger DataFrames (especially if there are only a handful of unique values):
>>> df1 = pd.concat([df]*100000, ignore_index=True) # DataFrame with 500000 rows
>>> %timeit np.unique(df1[['Col1', 'Col2']].values)
1 loop, best of 3: 1.12 s per loop
>>> %timeit pd.unique(df1[['Col1', 'Col2']].values.ravel('K'))
10 loops, best of 3: 38.9 ms per loop
>>> %timeit pd.unique(df1[['Col1', 'Col2']].values.ravel()) # ravel using C order
10 loops, best of 3: 49.9 ms per loop