Data Cleanups
Data
• Clean Data

     TV      Radio   Newspaper   Sales
2    44.5    39      45          10
3    17      45      69          9
4    151     41      58          19
5    180     10      58          12
Data Cleaning
There are many chances for data to be incorrect, duplicated, or mislabeled.
If the data is wrong, outcomes and algorithms are unreliable, even though they may look correct.
Data cleaning is the process of changing or eliminating garbage, incorrect, duplicate, corrupted, or incomplete data in a dataset.

Examples of dirty data (a dash marks an empty cell):

TV      Radio   Newspaper   Sales
230.1   37.8    69.2        -6
44.5    -       45          10
17      45      69          9
-       41      58          19
180     10      58          12

TV      Radio   Newspaper   Sales
230.1   37.8    na          22.1
44.5    39      45          10
17      45      69          9
151     41      58          19
180     10      58          12

TV      Radio   Newspaper   Sales
-       37.8    69.2        22/01/2019
44.5    39      45          10
17      45      69          9
151     41      58          19
180     -       58          5000
Data Cleaning
Data cleaning means fixing bad data in your data set.
Bad data could be:
1. Empty cells
2. Data in wrong format
3. Wrong data
4. Duplicates
5. Headers
6. Outliers
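A quick audit can show which of these problems a table actually has. The following is a minimal sketch, assuming a small hypothetical table modeled on the advertising data (the values are illustrative). The 1.5 x IQR rule used for outliers is a common rule of thumb, not something the slides prescribe.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data modeled on the advertising table (illustrative values)
df = pd.DataFrame({
    "TV": [230.1, 44.5, 17.0, np.nan, 180.0, 17.0],
    "Radio": [37.8, np.nan, 45.0, 41.0, 10.0, 45.0],
    "Newspaper": [69.2, 45.0, 69.0, 58.0, 58.0, 69.0],
    "Sales": [-6.0, 10.0, 9.0, 19.0, 5000.0, 9.0],
})

# 1. Empty cells: count missing values per column
missing_per_column = df.isnull().sum()

# 2. Wrong format: numeric columns stored as text show up with dtype 'object'
column_types = df.dtypes

# 3. Wrong data: sales figures cannot be negative
negative_sales = int((df["Sales"] < 0).sum())

# 4. Duplicates: rows identical to an earlier row
duplicate_rows = int(df.duplicated().sum())

# 6. Outliers: values outside 1.5 * IQR of the quartiles (assumed rule of thumb)
q1, q3 = df["Sales"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["Sales"] < q1 - 1.5 * iqr) | (df["Sales"] > q3 + 1.5 * iqr)]

print(missing_per_column)
print(column_types)
print(negative_sales, duplicate_rows, len(outliers))
```

Each check only reports a problem; how to fix it depends on the data and is a separate decision.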
Data Cleaning
Empty Cells
Empty cells can potentially give you a wrong result
when you analyze data.
a) Remove Rows
b) Impute
c)
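import pandas as pd

# File = r'   <- path left incomplete; 'data.csv' is read instead
df = pd.read_csv('data.csv')

# a) Remove rows: dropna() returns a new DataFrame without the rows that contain empty cells
new_df = df.dropna()
print(new_df.to_string())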
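Removing every row with any empty cell can discard a lot of data. As a minimal sketch, assuming a hypothetical three-row table, the example below compares a plain dropna() with the subset parameter, which only requires the listed columns to be present.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (illustrative values)
df = pd.DataFrame({
    "TV": [230.1, 44.5, np.nan],
    "Radio": [37.8, np.nan, 41.0],
    "Sales": [22.1, 10.0, 19.0],
})

# Drop every row that has at least one empty cell
all_complete = df.dropna()

# Drop a row only when Radio is empty; empty cells in other columns are kept
radio_known = df.dropna(subset=["Radio"])

print(len(df), len(all_complete), len(radio_known))
```

Both calls return new DataFrames, so df itself still has all three rows.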
# Count the empty cells in each column before choosing a strategy
df.isnull().sum()
# b) Impute: replace empty cells in "column" with the column median (the 0.5 quantile)
df.loc[df["column"].isnull(), "column"] = df["column"].quantile(0.5)
Data Cleaning A) Empty Cells
df['Age'] = df['Age'].fillna(df['Age'].mean())
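The mean and the median can give quite different fill values when a column contains an extreme value. A minimal sketch, assuming a hypothetical Age column with one empty cell and one unusually large entry:

```python
import numpy as np
import pandas as pd

# Hypothetical ages: one empty cell and one extreme value (90)
ages = pd.Series([20.0, 22.0, 24.0, np.nan, 90.0], name="Age")

# Mean of the known values: (20 + 22 + 24 + 90) / 4 = 39.0
mean_filled = ages.fillna(ages.mean())

# Median of the known values: midpoint of 22 and 24 = 23.0
median_filled = ages.fillna(ages.median())

print(mean_filled[3], median_filled[3])
```

The median is less affected by the extreme value, which is one reason to prefer it when outliers are suspected.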
Data Cleaning B) Data of Wrong Format
Cells with data of wrong format can make it
difficult, or even impossible, to analyze data.
df['Date'] = pd.to_datetime(df['Date'])
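Conversion fails when a cell cannot be parsed. One possible approach, sketched here on hypothetical data, is errors="coerce", which turns unparseable cells into missing values so they can be handled like empty cells. The date format string is an assumption about the input.

```python
import pandas as pd

# Hypothetical data: one unparseable date and one 'na' in a numeric column
df = pd.DataFrame({
    "Date": ["2019-01-22", "2019-01-23", "not a date"],
    "Sales": ["22.1", "na", "10"],
})

# Unparseable dates become NaT instead of raising an error
df["Date"] = pd.to_datetime(df["Date"], format="%Y-%m-%d", errors="coerce")

# Non-numeric text becomes NaN and the column becomes numeric
df["Sales"] = pd.to_numeric(df["Sales"], errors="coerce")

print(df.dtypes)
print(df.isnull().sum())
```

After coercion, the bad cells show up in df.isnull().sum() and can be removed or imputed.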
Data Cleaning C) Wrong Data
# Cap implausibly large values: any Duration above 120 is set to 120
for x in df.index:
    if df.loc[x, "Duration"] > 120:
        df.loc[x, "Duration"] = 120
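The same cap can be written without an explicit loop. A minimal sketch, assuming a hypothetical Duration column, using Series.clip:

```python
import pandas as pd

# Hypothetical durations; 450 and 130 are above the cap of 120
df = pd.DataFrame({"Duration": [60, 45, 450, 120, 130]})

# Replace every value above 120 with 120; smaller values are unchanged
df["Duration"] = df["Duration"].clip(upper=120)

print(df["Duration"].tolist())
```

The vectorized form is shorter and usually faster on large tables than iterating over df.index.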
Data Cleaning D) Duplicates
# Remove rows that are exact duplicates of an earlier row
df = df.drop_duplicates()

# Dropping Based on a Subset of Columns
df = df.sort_values(by='Date Modified', ascending=False)
df = df.drop_duplicates(subset=['Name', 'Age'], keep='first')

Note: dropna(inplace = True) will NOT return a new DataFrame, but it will remove all rows containing NULL values from the original DataFrame.
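Sorting before dropping decides which copy of a duplicate survives. A minimal sketch, assuming a hypothetical table with a City column added for illustration, in which the most recently modified record for each Name and Age pair is kept:

```python
import pandas as pd

# Hypothetical records: Ann appears twice with different modification dates
df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Ann"],
    "Age": [30, 25, 30],
    "City": ["Oslo", "Rome", "Paris"],
    "Date Modified": pd.to_datetime(["2019-01-01", "2019-01-05", "2019-01-22"]),
})

# Newest rows first, so keep='first' retains the latest version of each person
df = df.sort_values(by="Date Modified", ascending=False)
latest = df.drop_duplicates(subset=["Name", "Age"], keep="first")

print(latest)
```

With ascending=True instead, the oldest record would be the one kept.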
Data Cleaning E) Trimming White Space and Special Characters
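A minimal sketch of trimming, assuming a hypothetical text column called name. Which characters count as "special" is an assumption here: everything except letters, digits, and spaces.

```python
import pandas as pd

# Hypothetical listing names with stray white space and special characters
df = pd.DataFrame({"name": ["  Cozy loft ", "Sunny#room!! ", "Flat\t"]})

# Remove leading and trailing white space (spaces, tabs, newlines)
df["name"] = df["name"].str.strip()

# Replace runs of special characters with a single space, then trim again
df["name"] = (
    df["name"]
    .str.replace(r"[^A-Za-z0-9 ]+", " ", regex=True)
    .str.strip()
)

print(df["name"].tolist())
```

The pattern would need adjusting for data where punctuation or non-ASCII letters are meaningful.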
# Rename column headers: map each old name to its new name
new_col = {'name': 'listing_name',
           'number_of_reviews': 'reviews'}
df.rename(columns=new_col, inplace=True)
df.head()