0% found this document useful (0 votes)
28 views

ch4 Slides PDF

The document discusses cleaning data in Python. It covers converting between data types, handling categorical data, cleaning numeric values stored as strings, dropping duplicate rows, counting and dropping missing values, and filling in missing values using values like the mean or median. The document also demonstrates using regular expressions to validate string values and writing functions to apply complex cleaning steps to DataFrames.

Uploaded by

NourheneMbarek
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
28 views

ch4 Slides PDF

The document discusses cleaning data in Python. It covers converting between data types, handling categorical data, cleaning numeric values stored as strings, dropping duplicate rows, counting and dropping missing values, and filling in missing values using values like the mean or median. The document also demonstrates using regular expressions to validate string values and writing functions to apply complex cleaning steps to DataFrames.

Uploaded by

NourheneMbarek
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 44

CLEANING DATA IN PYTHON

Data types
Cleaning Data in Python

Prepare and clean data


Cleaning Data in Python

Data types
In [1]: print(df.dtypes)
name object
sex object
treatment a object
treatment b int64
dtype: object

● There may be times we want to convert from one type to


another
● Numeric columns can be strings, or vice versa
Cleaning Data in Python

Converting data types


In [2]: df['treatment b'] = df['treatment b'].astype(str)

In [3]: df['sex'] = df['sex'].astype('category')

In [4]: df.dtypes
Out[4]:
name object
sex category
treatment a object
treatment b object
dtype: object
Cleaning Data in Python

Categorical data
● Converting categorical data to ‘category’ dtype:
● Can make the DataFrame smaller in memory
● Can make them be utilized by other Python libraries
for analysis
Cleaning Data in Python

Cleaning data
● Numeric data loaded as a string
Cleaning Data in Python

Cleaning bad data


In [5]: df['treatment a'] = pd.to_numeric(df['treatment a'],
...: errors='coerce')

In [6]: df.dtypes
Out[6]:
name object
sex category
treatment a float64
treatment b object
dtype: object
CLEANING DATA IN PYTHON

Let’s practice!
CLEANING DATA IN PYTHON

Using regular
expressions to
clean strings
Cleaning Data in Python

String manipulation
● Much of data cleaning involves string manipulation
● Most of the world’s data is unstructured text
● Also have to do string manipulation to make
datasets consistent with one another
Cleaning Data in Python

Validate values
● 17
● $17
● $17.89
● $17.895
Cleaning Data in Python

String manipulation
● Many built-in and external libraries
● ‘re’ library for regular expressions
● A formal way of specifying a pa!ern
● Sequence of characters
● Pa!ern matching
● Similar to globbing
Cleaning Data in Python

Example match
● 17 12345678901 \d *

● $17 $12345678901 \$ \d*


● $17.00 $12345678901.42 \$\d* \.\d*
● $17.89 $12345678901.24 \$\d*\.\d{2}
● $17.895 $12345678901.999 ^\$\d*\.\d{2}$

“I have 17.89 USD”


Cleaning Data in Python

Using regular expressions


● Compile the pa!ern
● Use the compiled pa!ern to match values
● This lets us use the pa!ern over and over again
● Useful since we want to match values down
a column of values
Cleaning Data in Python

Using regular expressions


In [1]: import re

In [2]: pattern = re.compile('\$\d*\.\d{2}')

In [3]: result = pattern.match('$17.89')

In [4]: bool(result)
True
CLEANING DATA IN PYTHON

Let’s practice!
CLEANING DATA IN PYTHON

Using functions to
clean data
Cleaning Data in Python

Complex cleaning
● Cleaning step requires multiple steps
● Extract number from string
● Perform transformation on extracted number
● Python function
Cleaning Data in Python

Apply
In [1]: print(df)
treatment a treatment b
Daniel 18 42
John 12 31
Jane 24 27

In [2]: df.apply(np.mean, axis=0)


Out[2]:
treatment a 18.000000
treatment b 33.333333
dtype: float64
Cleaning Data in Python

Apply
In [3]: print(df)
treatment a treatment b
Daniel 18 42
John 12 31
Jane 24 27

In [4]: df.apply(np.mean, axis=1)


Out[4]:
Daniel 30.0
John 21.5
Jane 25.5
dtype: float64
Cleaning Data in Python

Applying functions
Cleaning Data in Python

Write the regular expression


In [5]: import re

In [6]: from numpy import NaN

In [7]: pattern = re.compile('^\$\d*\.\d{2}$')


Cleaning Data in Python

Writing a function
! example.py

def my_function(input1, input2):

# Function Body

return value
Cleaning Data in Python

Write the function


diff_money.py

def diff_money(row, pattern):

icost = row['Initial Cost']


tef = row['Total Est. Fee']

if bool(pattern.match(icost)) and bool(pattern.match(tef)):

icost = icost.replace("$", "")


tef = tef.replace("$", "")

icost = float(icost)
tef = float(tef)

return icost - tef


else:

return(NaN)
Cleaning Data in Python

Write the function


In [8]: df_subset['diff'] = df_subset.apply(diff_money,
...: axis=1,
...: pattern=pattern)

In [9]: print(df_subset.head())
Job # Doc # Borough Initial Cost Total Est. Fee diff
0 121577873 2 MANHATTAN $75000.00 $986.00 74014.0
1 520129502 1 STATEN ISLAND $0.00 $1144.00 -1144.0
2 121601560 1 MANHATTAN $30000.00 $522.50 29477.5
3 121601203 1 MANHATTAN $1500.00 $225.00 1275.0
4 121601338 1 MANHATTAN $19500.00 $389.50 19110.5
CLEANING DATA IN PYTHON

Let’s practice!
CLEANING DATA IN PYTHON

Duplicate and
missing data
Cleaning Data in Python

Duplicate data
● Can skew results
● ‘.drop_duplicates()’ method
Cleaning Data in Python

Drop duplicates
In [1]: df = df.drop_duplicates()

In [2]: print(df)
name sex treatment a treatment b
0 Daniel male - 42
1 John male 12 31
2 Jane female 24 27
Cleaning Data in Python

Missing data

● Leave as-is
● Drop them
● Fill missing value
Cleaning Data in Python

Count missing values


In [3]: tips_nan.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
total_bill 202 non-null float64
tip 220 non-null float64
sex 234 non-null object
smoker 229 non-null object
day 243 non-null object
time 227 non-null object
size 231 non-null float64
dtypes: float64(3), object(4)
memory usage: 13.4+ KB
None
Cleaning Data in Python

Drop missing values


In [4]: tips_dropped = tips_nan.dropna()

In [5]: tips_dropped.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 147 entries, 0 to 243
Data columns (total 7 columns):
total_bill 147 non-null float64
tip 147 non-null float64
sex 147 non-null object
smoker 147 non-null object
day 147 non-null object
time 147 non-null object
size 147 non-null float64
dtypes: float64(3), object(4)
memory usage: 9.2+ KB
Cleaning Data in Python

Fill missing values with .fillna()


● Fill with provided value
● Use a summary statistic
Cleaning Data in Python

Fill missing values


In [6]: tips_nan['sex'] = tips_nan['sex'].fillna('missing')

In [7]: tips_nan[['total_bill', 'size']] = tips_nan[['total_bill',


...: 'size']].fillna(0)

In [8]: tips_nan.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
total_bill 244 non-null float64
tip 220 non-null float64
sex 244 non-null object
smoker 229 non-null object
day 243 non-null object
time 227 non-null object
size 244 non-null float64
dtypes: float64(3), object(4)
memory usage: 13.4+ KB
Cleaning Data in Python

Fill missing values with a test statistic


● Careful when using test statistics to fill
● Have to make sure the value you are filling in makes sense
● Median is a be!er statistic in the presence of outliers
Cleaning Data in Python

Fill missing values with a test statistic


In [9]: mean_value = tips_nan['tip'].mean()

In [10]: print(mean_value)
2.964681818181819

In [11]: tips_nan['tip'] = tips_nan['tip'].fillna(mean_value)

In [12]: tips_nan.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
total_bill 244 non-null float64
tip 244 non-null float64
sex 244 non-null object
smoker 229 non-null object
day 243 non-null object
time 227 non-null object
size 244 non-null float64
dtypes: float64(3), object(4)
memory usage: 13.4+ KB
CLEANING DATA IN PYTHON

Let’s practice!
CLEANING DATA IN PYTHON

Testing with
asserts
Cleaning Data in Python

Assert statements
● Programmatically vs visually checking
● If we drop or fill NaNs, we expect 0 missing values
● We can write an assert statement to verify this
● We can detect early warnings and errors
● This gives us confidence that our code is running
correctly
Cleaning Data in Python

Asserts
In [1]: assert 1 == 1

In [2]: assert 1 == 2
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-65-a810b3a4aded> in <module>()
----> 1 assert 1 == 2

AssertionError:
Cleaning Data in Python

Google stock data


Cleaning Data in Python

Test column
In [1]: assert google.Close.notnull().all()
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-49-eec77130a77f> in <module>()
----> 1 assert google.Close.notnull().all()

AssertionError:
Cleaning Data in Python

Test column
In [1]: google_0 = google.fillna(value=0)

In [2]: assert google_0.Close.notnull().all()


CLEANING DATA IN PYTHON

Let’s practice!

You might also like