pandas Tutorial => Select distinct rows across dataframe

Example

Let

df = pd.DataFrame({'col_1':['A','B','A','B','C'], 'col_2':[3,4,3,5,6]})
df
# Output:
#   col_1  col_2
# 0     A      3
# 1     B      4
# 2     A      3
# 3     B      5
# 4     C      6

To get the distinct values in col_1 you can use Series.unique()

df['col_1'].unique()
# Output:
# array(['A', 'B', 'C'], dtype=object)

But Series.unique() works only for a single column.

To simulate the select unique col_1, col_2 of SQL you can use DataFrame.drop_duplicates():

df.drop_duplicates()
#   col_1  col_2
# 0     A      3
# 1     B      4
# 3     B      5
# 4     C      6

This will get you all the unique rows in the dataframe. So if

df = pd.DataFrame({'col_1':['A','B','A','B','C'], 'col_2':[3,4,3,5,6], 'col_3':[0,0.1,0.2,0.3,0.4]})
df
# Output:
#   col_1  col_2  col_3
# 0     A      3    0.0
# 1     B      4    0.1
# 2     A      3    0.2
# 3     B      5    0.3
# 4     C      6    0.4

df.drop_duplicates()
#   col_1  col_2  col_3
# 0     A      3    0.0
# 1     B      4    0.1
# 2     A      3    0.2
# 3     B      5    0.3
# 4     C      6    0.4

To specify the columns to consider when selecting unique records, pass them as arguments

df = pd.DataFrame({'col_1':['A','B','A','B','C'], 'col_2':[3,4,3,5,6], 'col_3':[0,0.1,0.2,0.3,0.4]})
df.drop_duplicates(['col_1','col_2'])
# Output:
#   col_1  col_2  col_3
# 0     A      3    0.0
# 1     B      4    0.1
# 3     B      5    0.3
# 4     C      6    0.4

# skip last column
# df.drop_duplicates(['col_1','col_2'])[['col_1','col_2']]
#   col_1  col_2
# 0     A      3
# 1     B      4
# 3     B      5
# 4     C      6

Source: How to “select distinct” across multiple data frame columns in pandas?.

PDF - Download pandas for free

Previous Next

pandas

Fastest Entity Framework Extensions

Example

Got any pandas Question?

pandas

pandas Indexing and selecting data Select distinct rows across dataframe

Fastest Entity Framework Extensions

Example

Got any pandas Question?