DSBA - Exploratory Data Analysis v2
[email protected]
AB5D4F1ITD
Exploratory Data Analysis (EDA)
Introduction to EDA
Describe Data (Descriptive Analytics)
Data Pre-processing
Data Visualization
Data Preparation
Quantum (volume) of data
Features of the data
Understand each feature in the data with the help of the Data Dictionary
Know the central tendency and data distribution of each feature
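The steps above can be sketched with pandas; the DataFrame and column names here are hypothetical, used only for illustration:

```python
import pandas as pd

# Hypothetical data set (names and values are assumptions for illustration)
df = pd.DataFrame({
    "Age": [25, 32, 47, 51, 38],
    "Salary": [30000.0, 45000.0, 80000.0, 62000.0, 52000.0],
})

# Quantum of data: number of rows and columns
print(df.shape)

# Features of the data and their datatypes
print(df.dtypes)

# Central tendency (mean, median via the 50% row) and spread of each feature
print(df.describe())
```

`describe()` is a convenient first pass: it reports count, mean, standard deviation, min/max, and quartiles for every numeric feature in one call.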
A practical data set generally has a lot of “noise” and/or “undesired” data points which
might impact the outcome; hence pre-processing is an important step
As these “noise” elements are so well amalgamated with the complete data set, the
cleansing process is governed largely by the data scientist’s ability
These noise elements are in the form of
Bad values
Anomalies (Not valid or not adhering to business rules)
Missing values
Not Useful Data
Numeric Fields:
Check if the datatype of every numeric feature/column is valid
A ‘Salary Amount’ field is expected to be numeric, with data type float
But if the data type appears as ‘Object’, there is bad data which has to be cleaned
Check range of values
An ‘Age’ field may be expected to have a minimum value of 0 and a maximum of 60
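A minimal sketch of both numeric checks, assuming hypothetical column names and values:

```python
import pandas as pd

# Hypothetical field with one bad entry; it loads as Object, not float
df = pd.DataFrame({"Salary Amount": ["30000", "45000.5", "abc", "62000"]})
assert df["Salary Amount"].dtype == object  # Object datatype signals bad data

# Coerce to numeric: invalid entries become NaN and can be inspected
salary = pd.to_numeric(df["Salary Amount"], errors="coerce")
bad_rows = df[salary.isna()]
print(bad_rows)

# Range check on a numeric field such as Age (expected range 0-60)
age = pd.Series([25, -3, 47, 130])
out_of_range = age[(age < 0) | (age > 60)]
print(out_of_range)
```

`pd.to_numeric(..., errors="coerce")` is a common way to surface the rows responsible for an unexpected Object dtype without losing the rest of the column.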
Categorical Fields:
Check the categorical levels of each feature/column with “Object” datatype
Levels may contain special characters like “?”, “-”, “!” or invalid categories which do not
represent the feature
Understanding the meaning and relevance of each feature, together with business knowledge,
plays an important role in identifying other anomalies in data
In finance, the business expects financial ratios to be within a defined range
For loan data, some features like,
Fixed Obligation to Income Ratio (“FOIR”) is expected to be in a range of 0-1
Net Loan to Value Ratio (“Net_LTV”) from 0-100, etc.
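Such business-rule ranges can be checked mechanically; the loan records below are hypothetical, but the column names follow the text:

```python
import pandas as pd

# Hypothetical loan records (values are assumptions for illustration)
loans = pd.DataFrame({
    "FOIR": [0.45, 0.80, 1.30, 0.10],       # expected range 0-1
    "Net_LTV": [75.0, 102.5, 60.0, 40.0],   # expected range 0-100
})

# Flag rows violating either business-rule range
violations = loans[
    (loans["FOIR"] < 0) | (loans["FOIR"] > 1)
    | (loans["Net_LTV"] < 0) | (loans["Net_LTV"] > 100)
]
print(violations)
```

Flagged rows are then reviewed with the business team: a value outside the expected range may be a data-entry error or a genuine exception.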
Scaling
Transformation
Outliers Detection & Treatment
Data Encoding
For negatively skewed features, Square and Exponential transformations are used;
Log, Cube Root, and Square Root transformations suit positively skewed features
If data is transformed, results are obtained in terms of the transformed data
Hence, care should be taken to reverse the transformation to conclude the results
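A minimal sketch of transforming and then reversing, using a log transform on hypothetical skewed amounts:

```python
import numpy as np

# Hypothetical positively skewed amounts (one long-tail value)
amounts = np.array([100.0, 200.0, 400.0, 10000.0])

# Log transform compresses the long right tail
logged = np.log(amounts)

# Any summary computed here is in log space...
mean_log = logged.mean()

# ...so reverse the transformation before reporting the result
back_transformed = np.exp(mean_log)
print(back_transformed)
```

Note that reversing after averaging yields the geometric, not arithmetic, mean of the original values; this is exactly why the slide warns that results are obtained in terms of the transformed data.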
Outliers are data points whose values are significantly different from the rest of the
values in the feature
An outlier might be a valid data point or may have been caused by an error
If we consider the heights of students in class 7, most may be in a range of 4.8 feet to 5.4 feet.
However, there may be 1 or 2 students who are around 4 feet or around 6 feet
During data entry, extra zeros may have been added to an amount field, making it different from the others
Most of the data provided for fraud detection will have very few records where fraud has occurred.
There is a high chance that these records get identified as outliers
Hence, it is important to analyze outliers before deciding on a treatment
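One common (though not the only) way to detect such points is the IQR rule, sketched here on the class-7 heights example from above:

```python
import pandas as pd

# Hypothetical heights (feet) of class-7 students, echoing the example above
heights = pd.Series([4.9, 5.0, 5.1, 5.2, 5.3, 5.4, 4.0, 6.0])

# IQR rule: values beyond 1.5 * IQR from the quartiles are flagged
q1, q3 = heights.quantile(0.25), heights.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = heights[(heights < lower) | (heights > upper)]
print(outliers)
```

The flagged points are candidates only; per the text, each must be analyzed (valid rare value vs. data-entry error, and in fraud data possibly the very records of interest) before any treatment such as capping or removal.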
“Object” and/or “Categorical” variables whose values are labels, like
Male/Female, are not allowed in models; hence they need to be “encoded” in
numeric format
There are primarily two types of encoding:
One Hot Encoding
Each category is converted to a column containing only boolean values
Recommended if there are few categorical levels within the field (fewer than 25)
Label Encoding
When there are too many levels/categories in a variable in the dataset
When the labels are “Ordinal”, like a “Satisfaction Score”
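Both encodings can be sketched with pandas; the fields, levels, and the ordinal mapping below are hypothetical:

```python
import pandas as pd

# Hypothetical data with one nominal and one ordinal field
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Satisfaction": ["Low", "High", "Medium", "Low"],
})

# One-hot encoding: one boolean column per level (suits few levels)
one_hot = pd.get_dummies(df["Gender"], prefix="Gender")
print(one_hot)

# Label encoding for an ordinal field: map levels to ordered integers
# (the order Low < Medium < High is an assumption for illustration)
order = {"Low": 0, "Medium": 1, "High": 2}
df["Satisfaction_code"] = df["Satisfaction"].map(order)
print(df)
```

An explicit mapping dict is preferable to an automatic label encoder for ordinal fields, because it guarantees the integers respect the business ordering of the levels.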