ch4 Slides PDF
ch4 Slides PDF
Data types
Cleaning Data in Python
Data types
In [1]: print(df.dtypes)
name object
sex object
treatment a object
treatment b int64
dtype: object
In [4]: df.dtypes
Out[4]:
name object
sex category
treatment a object
treatment b object
dtype: object
Cleaning Data in Python
Categorical data
● Converting categorical data to ‘category’ dtype:
● Can make the DataFrame smaller in memory
● Can make them be utilized by other Python libraries
for analysis
Cleaning Data in Python
Cleaning data
● Numeric data loaded as a string
Cleaning Data in Python
In [6]: df.dtypes
Out[6]:
name object
sex category
treatment a float64
treatment b object
dtype: object
CLEANING DATA IN PYTHON
Let’s practice!
CLEANING DATA IN PYTHON
Using regular
expressions to
clean strings
Cleaning Data in Python
String manipulation
● Much of data cleaning involves string manipulation
● Most of the world’s data is unstructured text
● Also have to do string manipulation to make
datasets consistent with one another
Cleaning Data in Python
Validate values
● 17
● $17
● $17.89
● $17.895
Cleaning Data in Python
String manipulation
● Many built-in and external libraries
● ‘re’ library for regular expressions
● A formal way of specifying a pa!ern
● Sequence of characters
● Pa!ern matching
● Similar to globbing
Cleaning Data in Python
Example match
● 17 12345678901 \d *
In [4]: bool(result)
True
CLEANING DATA IN PYTHON
Let’s practice!
CLEANING DATA IN PYTHON
Using functions to
clean data
Cleaning Data in Python
Complex cleaning
● Cleaning step requires multiple steps
● Extract number from string
● Perform transformation on extracted number
● Python function
Cleaning Data in Python
Apply
In [1]: print(df)
treatment a treatment b
Daniel 18 42
John 12 31
Jane 24 27
Apply
In [3]: print(df)
treatment a treatment b
Daniel 18 42
John 12 31
Jane 24 27
Applying functions
Cleaning Data in Python
Writing a function
! example.py
# Function Body
return value
Cleaning Data in Python
icost = float(icost)
tef = float(tef)
return(NaN)
Cleaning Data in Python
In [9]: print(df_subset.head())
Job # Doc # Borough Initial Cost Total Est. Fee diff
0 121577873 2 MANHATTAN $75000.00 $986.00 74014.0
1 520129502 1 STATEN ISLAND $0.00 $1144.00 -1144.0
2 121601560 1 MANHATTAN $30000.00 $522.50 29477.5
3 121601203 1 MANHATTAN $1500.00 $225.00 1275.0
4 121601338 1 MANHATTAN $19500.00 $389.50 19110.5
CLEANING DATA IN PYTHON
Let’s practice!
CLEANING DATA IN PYTHON
Duplicate and
missing data
Cleaning Data in Python
Duplicate data
● Can skew results
● ‘.drop_duplicates()’ method
Cleaning Data in Python
Drop duplicates
In [1]: df = df.drop_duplicates()
In [2]: print(df)
name sex treatment a treatment b
0 Daniel male - 42
1 John male 12 31
2 Jane female 24 27
Cleaning Data in Python
Missing data
● Leave as-is
● Drop them
● Fill missing value
Cleaning Data in Python
In [5]: tips_dropped.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 147 entries, 0 to 243
Data columns (total 7 columns):
total_bill 147 non-null float64
tip 147 non-null float64
sex 147 non-null object
smoker 147 non-null object
day 147 non-null object
time 147 non-null object
size 147 non-null float64
dtypes: float64(3), object(4)
memory usage: 9.2+ KB
Cleaning Data in Python
In [8]: tips_nan.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
total_bill 244 non-null float64
tip 220 non-null float64
sex 244 non-null object
smoker 229 non-null object
day 243 non-null object
time 227 non-null object
size 244 non-null float64
dtypes: float64(3), object(4)
memory usage: 13.4+ KB
Cleaning Data in Python
In [10]: print(mean_value)
2.964681818181819
In [12]: tips_nan.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
total_bill 244 non-null float64
tip 244 non-null float64
sex 244 non-null object
smoker 229 non-null object
day 243 non-null object
time 227 non-null object
size 244 non-null float64
dtypes: float64(3), object(4)
memory usage: 13.4+ KB
CLEANING DATA IN PYTHON
Let’s practice!
CLEANING DATA IN PYTHON
Testing with
asserts
Cleaning Data in Python
Assert statements
● Programmatically vs visually checking
● If we drop or fill NaNs, we expect 0 missing values
● We can write an assert statement to verify this
● We can detect early warnings and errors
● This gives us confidence that our code is running
correctly
Cleaning Data in Python
Asserts
In [1]: assert 1 == 1
In [2]: assert 1 == 2
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-65-a810b3a4aded> in <module>()
----> 1 assert 1 == 2
AssertionError:
Cleaning Data in Python
Test column
In [1]: assert google.Close.notnull().all()
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-49-eec77130a77f> in <module>()
----> 1 assert google.Close.notnull().all()
AssertionError:
Cleaning Data in Python
Test column
In [1]: google_0 = google.fillna(value=0)
Let’s practice!