How to “select distinct” across multiple data frame columns in pandas?

Each Answer to this Q is separated by one/two green lines.

I’m looking for a way to do the equivalent to the SQL

SELECT DISTINCT col1, col2 FROM dataframe_table

The pandas sql comparison doesn’t have anything about distinct.

.unique() only works for a single column, so I suppose I could concat the columns, or put them in a list/tuple and compare that way, but this seems like something pandas should do in a more native way.

Am I missing something obvious, or is there no way to do this?

You can use the drop_duplicates method to get the unique rows in a DataFrame:

In [29]: df = pd.DataFrame({'a':[1,2,1,2], 'b':[3,4,3,5]})

In [30]: df
   a  b
0  1  3
1  2  4
2  1  3
3  2  5

In [32]: df.drop_duplicates()
   a  b
0  1  3
1  2  4
3  2  5

You can also provide the subset keyword argument if you only want to use certain columns to determine uniqueness. See the docstring.

I’ve tried different solutions. First was:

a_df=np.unique(df[['col1','col2']], axis=0)

and it works well for not object data
Another way to do this and to avoid error (for object columns type) is to apply drop_duplicates()


You can also use SQL to do this, but it worked very slow in my case:

from pandasql import sqldf
q="""SELECT DISTINCT col1, col2 FROM df;"""
pysqldf = lambda q: sqldf(q, globals())
a_df = pysqldf(q)

To solve a similar problem, I’m using groupby:

print(f"Distinct entries: {len(df.groupby(['col1', 'col2']))}")

Whether that’s appropriate will depend on what you want to do with the result, though (in my case, I just wanted the equivalent of COUNT DISTINCT as shown).

There is no unique method for a df, if the number of unique values for each column were the same then the following would work: df.apply(pd.Series.unique) but if not then you will get an error. Another approach would be to store the values in a dict which is keyed on the column name:

In [111]:
df = pd.DataFrame({'a':[0,1,2,2,4], 'b':[1,1,1,2,2]})
for col in df:
    d[col] = df[col].unique()

{'a': array([0, 1, 2, 4], dtype=int64), 'b': array([1, 2], dtype=int64)}

I think use drop duplicate sometimes will not be so useful depending dataframe.

I found this:

[in] df['col_1'].unique()
[out] array(['A', 'B', 'C'], dtype=object)

And worked for me!

You can take the sets of the columns and just subtract the smaller set from the larger set:

distinct_values = set(df['a'])-set(df['b'])

The answers/resolutions are collected from stackoverflow, are licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0 .

Leave a Reply

Your email address will not be published.