Data Cleanups

Uploaded by

Mr Quainoo
Copyright
© © All Rights Reserved

Data Cleaning

Clean Data

     TV      Radio   Newspaper   Sales
1    230.1   37.8    69.2        22.1
2    44.5    39      45          10
3    17      45      69          9
4    151     41      58          19
5    180     10      58          12
Data Cleaning
There are many chances for data to be incorrect, duplicated, or mislabeled.
If data is wrong, outcomes and algorithms are unreliable, even though they may look correct.
Data cleaning is the process of changing or eliminating garbage, incorrect, duplicate, corrupted, or incomplete data in a dataset.

TV      Radio   Newspaper   Sales
230.1   37.8    69.2        -6
44.5            45          10
17      45      69          9
        41      58          19
180     10      58          12

TV      Radio   Newspaper   Sales
230.1   37.8    na          22.1
44.5    39      45          10
17      45      69          9
151     41      58          19
180     10      58          12

TV      Radio   Newspaper   Sales
        37.8    69.2        22/01/2019
44.5    39      45          10
17      45      69          9
151     41      58          19
180             58          5000
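To see how such problems surface in practice, the sketch below loads a subset of the dirty rows shown on this slide and counts what is wrong. The CSV text is typed in by hand from the slide, and treating "na" as a missing-value marker is an assumption; pandas does not do that for lowercase "na" by default.

```python
import io
import pandas as pd

# A subset of the dirty rows from the slide, written out as CSV text.
# An empty field stands for an empty cell.
raw = """TV,Radio,Newspaper,Sales
230.1,37.8,69.2,-6
44.5,,45,10
17,45,69,9
,41,58,19
180,10,58,12
230.1,37.8,na,22.1
,37.8,69.2,22/01/2019
180,,58,5000
"""

# Assumption: "na" should count as missing, so it is passed explicitly.
df = pd.read_csv(io.StringIO(raw), na_values=["na"])

# Empty cells per column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# The date string makes pandas read "Sales" as text. Coercing to numbers
# turns the entry that cannot be parsed into NaN, so it can be counted.
sales = pd.to_numeric(df["Sales"], errors="coerce")
bad_format = int(sales.isnull().sum())

# Values that parse but are implausible, such as negative sales.
negative = int((sales < 0).sum())
print(bad_format, negative)
```

A value such as 5000 parses without complaint, so it would only be caught by a separate plausibility or outlier check.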
Data Cleaning
Data cleaning means fixing bad data in your data set.
Bad data could be:
1. Empty cells
2. Data in wrong format
3. Wrong data
4. Duplicates
5. Headers
6. Outliers
Data Cleaning
Empty Cells
Empty cells can potentially give you a wrong result when you analyze data.

a) Remove Rows
b) Impute
c)

import pandas as pd

File=r’

df = pd.read_csv('data.csv')

new_df = df.dropna()

print(new_df.to_string())
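The row-removal option can be tried without a file on disk. The sketch below uses a small hand-made DataFrame as a hypothetical stand-in for data.csv and contrasts dropna(), which returns a new DataFrame, with dropna(inplace=True), which changes the original.

```python
import pandas as pd

# Hypothetical stand-in for data.csv, so the example runs without a file.
df = pd.DataFrame({
    "TV": [230.1, 44.5, None, 180.0],
    "Radio": [37.8, None, 41.0, 10.0],
    "Sales": [22.1, 10.0, 19.0, 12.0],
})

# dropna() returns a new DataFrame and leaves df untouched.
new_df = df.dropna()
rows_before = len(df)
rows_in_copy = len(new_df)
print(rows_before, rows_in_copy)

# With inplace=True the rows are removed from df itself.
df.dropna(inplace=True)
rows_after = len(df)
print(rows_after)
```

Removing rows is simple, but every dropped row also discards the valid values it contained, which is why imputation is offered as the alternative.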
Data Cleaning
  Empty Cells
Empty cells can potentially give you a wrong result
when you analyze data.

a)Remove Rows
b)Impute

df.isnull().sum()
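Before choosing between removing rows and imputing, it helps to know how much is missing. A minimal sketch, using made-up values for illustration: isnull().sum() counts empty cells per column, and isnull().mean() gives the same information as a share of rows.

```python
import pandas as pd

# Made-up values for illustration only.
df = pd.DataFrame({
    "TV": [230.1, None, 17.0, None],
    "Radio": [37.8, 39.0, None, 41.0],
    "Sales": [22.1, 10.0, 9.0, 19.0],
})

# Number of empty cells in each column.
missing_counts = df.isnull().sum()
print(missing_counts)

# Share of rows that are empty in each column (0.0 to 1.0).
missing_share = df.isnull().mean()
print(missing_share)
```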
Data Cleaning
  Empty Cells
Empty cells can potentially give you a wrong result
when you analyze data.

Impute with Median

There are various ways to impute missing values of a column; replacing null values with the 50th percentile (the median) of the column is a widely used method. See the code below:

df.loc[df["column"].isnull(), "column"] = df["column"].quantile(0.5)
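As a runnable illustration of median imputation, the sketch below applies the same pattern to a small made-up "Radio" column. The column name and values are assumptions chosen for the example.

```python
import pandas as pd

# Made-up column with one empty cell.
df = pd.DataFrame({"Radio": [37.8, None, 45.0, 41.0, 10.0]})

# The 50th percentile is the median; empty cells are ignored when it is computed.
median_value = df["Radio"].quantile(0.5)

# Write the median into the rows where "Radio" is empty.
df.loc[df["Radio"].isnull(), "Radio"] = median_value
print(df)
```

The median is less sensitive to extreme values than the mean, which may matter when the column also contains outliers.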
Data Cleaning A) Empty Cells
Empty cells can potentially give you a wrong result when you analyze data.

Impute with Mean and Mode

The mode or the mean value of the column can also be used to replace the empty cells. Other approaches make use of the backfill and forward fill methods for missing value imputation.

Now, dropna(inplace = True) will NOT return a new DataFrame, but it will remove all rows containing NULL values from the original DataFrame.

df = df.fillna(0)

df = df.fillna({'Name': 'Someone', 'Age': 25, 'Location': 'USA'})

df['Age'] = df['Age'].fillna(df['Age'].mean())
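The mean, mode, forward fill, and backfill options can be compared side by side. This is a sketch on made-up data: the 'Age' and 'Location' column names follow the slide, while the values are hypothetical.

```python
import pandas as pd

# Hypothetical values; only the column names come from the slide.
df = pd.DataFrame({
    "Age": [20.0, None, 40.0, None],
    "Location": ["USA", None, "USA", "Canada"],
})

# Mean for a numeric column.
age_mean = df["Age"].fillna(df["Age"].mean())

# Mode for a text column; mode() returns a Series, so take its first entry.
location_mode = df["Location"].fillna(df["Location"].mode()[0])

# Forward fill copies the previous valid value down;
# backfill copies the next valid value up.
age_ffill = df["Age"].ffill()
age_bfill = df["Age"].bfill()

print(age_mean.tolist())
print(location_mode.tolist())
print(age_ffill.tolist())
print(age_bfill.tolist())
```

Backfill leaves the last row empty here because no later value exists to copy, so forward fill and backfill usually make sense only when the row order is meaningful, as in time series.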
Data Cleaning B) Data of Wrong Format
Cells with data of wrong format can make it difficult, or even impossible, to analyze data.

To fix it, you have two options: remove the rows, or convert all cells in the columns into the same format.

Data Cleaning B) Data of Wrong Format
Let's try to convert all cells in the 'Date' column into dates.

Pandas has a to_datetime() method for this:

df['Date'] = pd.to_datetime(df['Date'])

Remove rows with a NULL value in the "Date" column:
df.dropna(subset=['Date'], inplace = True)

Or fill the empty dates with a default value instead:
df['Date'] = df['Date'].fillna('01/01/2022')
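One detail worth illustrating: by default to_datetime() raises an error when it meets a value it cannot parse. The sketch below passes errors="coerce", an option the slide does not show, so that such values become empty (NaT) and can then be removed with dropna(). The sample values are made up.

```python
import pandas as pd

# Made-up data with one unparseable entry and one empty cell.
df = pd.DataFrame({
    "Date": ["2020/12/01", "not a date", None, "2020/12/04"],
    "Duration": [60, 45, 30, 60],
})

# errors="coerce" converts unparseable values to NaT instead of raising.
df["Date"] = pd.to_datetime(df["Date"], errors="coerce")
unparsed = int(df["Date"].isnull().sum())
print(unparsed)

# Keep only the rows that now hold a valid date.
cleaned = df.dropna(subset=["Date"])
print(cleaned)
```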
Data Cleaning C) Wrong Data
Data can also just be wrong, like if someone registered "199" instead of "1.99".

Replace values above the limit:
for x in df.index:
  if df.loc[x, "Duration"] > 120:
    df.loc[x, "Duration"] = 120

Or remove the rows that contain them:
for x in df.index:
  if df.loc[x, "Duration"] > 120:
    df.drop(x, inplace = True)
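The same two fixes can be written without an explicit loop. This is an alternative sketch, not the slide's method: clip() caps the values and a boolean filter removes the rows. The "Duration" values are made up, and the limit of 120 follows the slide.

```python
import pandas as pd

# Made-up durations; values above 120 are treated as wrong.
df = pd.DataFrame({"Duration": [60, 450, 45, 120, 300]})

# Option 1: cap every value at 120.
capped = df["Duration"].clip(upper=120)
print(capped.tolist())

# Option 2: keep only the rows at or below the limit.
filtered = df[df["Duration"] <= 120]
print(filtered["Duration"].tolist())
```

Vectorized operations like these are generally faster than row-by-row loops on large DataFrames.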
Data Cleaning D) Duplicates
 

Returns True for every row that is a duplicate, otherwise False:
print(df.duplicated())

Example
Remove all duplicates:
df.drop_duplicates(inplace = True)

Or assign the result back:
df = df.drop_duplicates()

# Dropping Based on a Subset of Columns
df = df.sort_values(by='Date Modified', ascending=False)
df = df.drop_duplicates(subset=['Name', 'Age'], keep='first')
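To make the difference between full-row and subset-based duplicates concrete, here is a sketch on made-up records. The column names follow the slide; the names and dates are hypothetical.

```python
import pandas as pd

# Hypothetical records; the last row is an exact copy of the second.
df = pd.DataFrame({
    "Name": ["Ama", "Kofi", "Ama", "Kofi"],
    "Age": [30, 25, 30, 25],
    "Date Modified": ["2022-01-01", "2022-01-05", "2022-03-01", "2022-01-05"],
})

# True only where the whole row repeats an earlier row.
dup_flags = df.duplicated()
print(dup_flags.tolist())

# Removing exact duplicates keeps both "Ama" rows, since their dates differ.
exact = df.drop_duplicates()

# Sorting newest first and keeping the first row per (Name, Age)
# leaves only the most recently modified record for each person.
latest = (
    df.sort_values(by="Date Modified", ascending=False)
      .drop_duplicates(subset=["Name", "Age"], keep="first")
)
print(latest)
```

The dates here are ISO-formatted strings, which sort correctly as text; other date formats would need converting with to_datetime() first.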
Data Cleaning E) Trimming White Space and special characters
 

df['Favorite Color'] = df['Favorite Color'].str.strip()
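The slide title also mentions special characters, which str.strip() alone does not remove from the middle of a value. A minimal sketch on made-up values, assuming that only letters, digits, and spaces should be kept:

```python
import pandas as pd

# Made-up values with stray whitespace and punctuation.
df = pd.DataFrame({"Favorite Color": ["  blue ", "green!!", " #red\t"]})

# Remove leading and trailing whitespace (spaces, tabs, newlines).
stripped = df["Favorite Color"].str.strip()
print(stripped.tolist())

# Assumption: anything that is not a letter, digit, or space is unwanted.
# regex=True is passed explicitly because the default has differed
# between pandas versions.
cleaned = stripped.str.replace(r"[^A-Za-z0-9 ]", "", regex=True)
print(cleaned.tolist())
```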

Data Cleaning F) Change to Title Case

df['Location'] = df['Location'].str.title()
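A short sketch of what title casing does, on made-up 'Location' values. It also shows a side effect worth knowing: abbreviations written in capitals are lowered after their first letter.

```python
import pandas as pd

# Made-up values with inconsistent casing.
df = pd.DataFrame({"Location": ["new york", "LOS ANGELES", "USA"]})

# str.title() capitalizes the first letter of each word and lowercases the rest.
df["Location"] = df["Location"].str.title()
print(df["Location"].tolist())
```

Because "USA" becomes "Usa", columns that mix place names and abbreviations may need a follow-up replacement for the known abbreviations.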

Data Cleaning E) Trimming White Space and special characters
 

Replacing Text in Strings in Pandas


In the 'Region' column, the word "Region" is redundant. In this example, you'll learn how to replace some text in a column. In particular, you'll learn how to remove a given substring in a larger string. For this, we can use the aptly named .str.replace() method. The method takes a string we want to replace and a string that we want to substitute with. Because we want to remove a substring, we'll simply pass in an empty string to substitute with.

# Replacing a Substring in Pandas


df['Region'] = df['Region'].str.replace('Region ', '')
print(df)
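The distinction between .str.replace() and the plain .replace() method is easy to miss, so the sketch below shows both on made-up 'Region' values. Passing regex=False is an addition here, to make sure the text is matched literally whatever the pandas version.

```python
import pandas as pd

# Made-up values for illustration.
df = pd.DataFrame({"Region": ["Region A", "Region B", "Region C"]})

# Series.replace() matches whole cell values, so nothing changes here.
whole_value = df["Region"].replace("Region ", "")
print(whole_value.tolist())

# Series.str.replace() works inside each string and removes the substring.
df["Region"] = df["Region"].str.replace("Region ", "", regex=False)
print(df["Region"].tolist())
```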
Data Cleaning F) Column Renaming
 

new_col = {'name':'listing_name',
'number_of_reviews':'reviews'}

df.rename(columns=new_col, inplace=True)
df.head()
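A self-contained version of the renaming step might look as follows. The mapping comes from the slide; the sample row and the extra "price" column are hypothetical. The last part uses the errors="raise" option of rename(), which is not shown on the slide, to catch a misspelled column name.

```python
import pandas as pd

# Hypothetical listing data; only the two renamed columns come from the slide.
df = pd.DataFrame({
    "name": ["Cosy flat"],
    "number_of_reviews": [12],
    "price": [80],
})

new_col = {"name": "listing_name", "number_of_reviews": "reviews"}

# Columns not named in the mapping are left as they are.
renamed = df.rename(columns=new_col)
print(list(renamed.columns))

# By default a key that matches no column is silently ignored;
# errors="raise" turns that into a KeyError instead.
try:
    df.rename(columns={"nmae": "listing_name"}, errors="raise")
    typo_caught = False
except KeyError:
    typo_caught = True
print(typo_caught)
```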
