
Detect Duplicate Labels Using Python Pandas Library
Pandas is used to work with large data sets. In a pandas data table, the rows and columns are indexed by names, and those names are called labels. When we work with real datasets, some of these labels may be duplicated.
Duplicate labels can lead to incorrect conclusions about our data and to unexpected outputs. Here, label duplication means that a row or column index name is repeated more than once.
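As a small illustration of how duplicate labels can produce unexpected results, the sketch below selects one column by name from a hypothetical DataFrame (the name df and its values are made up for this example). With a unique label the selection is a Series, but with a duplicated label the same syntax returns a DataFrame containing every matching column.

```python
import pandas as pd

# Hypothetical frame for illustration: the label "A" appears twice.
df = pd.DataFrame([[6, 1, 2], [8, 4, 5]], columns=["A", "A", "B"])

unique_pick = df["B"]  # "B" is unique, so this is a Series
dup_pick = df["A"]     # "A" is duplicated, so this is a DataFrame

print(type(unique_pick).__name__)
print(type(dup_pick).__name__, dup_pick.shape)
```

Code that expects a Series from df["A"] may therefore behave differently once a duplicate label slips into the data.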
Let’s take an example to identify the duplicate labels in a DataFrame.
Identifying duplicates in column labels
Example
import pandas as pd

df1 = pd.DataFrame([[6, 1, 2, 7], [8, 4, 5, 9]], columns=["A", "A", "B", "C"])
print(df1)
print(df1.columns.is_unique)
Explanation
We created a DataFrame with a 2x4 shape. To verify whether the columns contain any duplicate labels, we use DataFrame.columns.is_unique. It returns a boolean: True if every column label is unique, and False if any label is duplicated.
Output
   A  A  B  C
0  6  1  2  7
1  8  4  5  9
False
This output block shows the DataFrame df1, and the boolean False indicates that there is a duplicate label in the columns of df1.
By using the duplicated method, we can also find out which labels in our DataFrame are duplicates and filter them out, as in the following block.
df1.columns[~df1.columns.duplicated()]
df1.columns returns the column names as an Index, and the duplicated() method returns an array of boolean values in which True marks a label that has already appeared earlier. Negating that array with ~ and using it to filter df1.columns gives the unique list of column labels.
Index(['A', 'B', 'C'], dtype='object')
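The same boolean mask can also be applied to the DataFrame itself. The following is a minimal sketch, reusing df1 from the example above, that lists which column labels are repeated and keeps only the first column for each label. The variable names dup_labels and deduped are chosen here for illustration.

```python
import pandas as pd

df1 = pd.DataFrame([[6, 1, 2, 7], [8, 4, 5, 9]], columns=["A", "A", "B", "C"])

# True marks every repeat of a label that was already seen.
mask = df1.columns.duplicated()

# Labels that occur more than once, each reported a single time.
dup_labels = df1.columns[mask].unique()

# Keep only the first column for each label.
deduped = df1.loc[:, ~mask]

print(list(dup_labels))
print(deduped)
```

Note that this drops the second "A" column entirely, so it is only appropriate when the repeated columns are genuinely redundant.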
Identifying duplicates in index labels
In the same way as for column labels, we can also identify duplicates in the index (row labels).
Example
import pandas as pd

f = pd.DataFrame({"A": [0, 1, 2, 3, 4]}, index=["x", "y", "x", "y", "z"])
print(f)
print()
print(f.index.duplicated())  # boolean array marking repeated labels
unique_f = f[~f.index.duplicated()]  # keep the first row for each label
print()
print(unique_f)  # DataFrame without duplicate index labels
Explanation
The DataFrame “f” has been created with some duplicate labels in its index. f.index.duplicated() returns an array of boolean values in which True marks each repeated index label. By negating this array and using it as a filter, we can remove the rows with duplicate labels from our DataFrame, keeping the first occurrence of each label.
Output
   A
x  0
y  1
x  2
y  3
z  4

[False False  True  True False]

   A
x  0
y  1
z  4
The first block is the DataFrame “f” with duplicated index labels, and the array of boolean values marks the duplicates. The final block is the DataFrame “f” with only the first row kept for each index label.
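By default, duplicated() treats the first occurrence of a label as the original and marks only the later ones. The method also accepts a keep parameter, and the sketch below, which reuses the DataFrame “f” from the example, shows its other two settings: keep="last" and keep=False. The names all_dups and last_kept are chosen here for illustration.

```python
import pandas as pd

f = pd.DataFrame({"A": [0, 1, 2, 3, 4]}, index=["x", "y", "x", "y", "z"])

# keep=False marks every occurrence of a repeated label, not only the repeats.
all_dups = f.index.duplicated(keep=False)

# keep="last" treats the last occurrence as the original and marks earlier ones.
last_kept = f[~f.index.duplicated(keep="last")]

print(f[all_dups])   # every row whose label appears more than once
print(last_kept)     # the last row for each label
```

Using keep=False is handy for inspecting all conflicting rows before deciding which one to retain.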