DSBA - Exploratory Data Analysis v2
[email protected]
AB5D4F1ITD
Exploratory Data Analysis (EDA)
Introduction to EDA
Describe Data (Descriptive Analytics)
Data Pre-processing
Data Visualization
Data Preparation
Quantum (volume) of data
Features of the data
Understand each feature in the data with the help of the Data Dictionary
Know the central tendency and data distribution of each feature
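The steps above can be sketched with pandas; the DataFrame and column names here are hypothetical, used only for illustration:

```python
import pandas as pd

# Hypothetical data set (names and values are assumptions for illustration)
df = pd.DataFrame({
    "Age": [25, 32, 47, 51, 38],
    "Salary": [30000.0, 45000.0, 80000.0, 62000.0, 52000.0],
})

# Quantum of data: number of rows and columns
print(df.shape)

# Features of the data and their datatypes
print(df.dtypes)

# Central tendency (mean, median via the 50% row) and spread of each feature
print(df.describe())
```

`describe()` is a convenient first pass: it reports count, mean, standard deviation, min/max, and quartiles for every numeric feature in one call.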
A practical data set generally has a lot of “noise” and/or “undesired” data points which
might impact the outcome; hence pre-processing is an important step
As these “noise” elements are so well amalgamated with the complete data set, the
cleansing process is governed largely by the data scientist’s ability
These noise elements are in the form of
Bad values
Anomalies (Not valid or not adhering to business rules)
Missing values
Not Useful Data
Numeric Fields:
Check if the datatype of every numeric feature/column is valid
A ‘Salary Amount’ field is expected to be numeric, with data type float
But if the data type appears as ‘Object’, there is bad data which has to be cleaned
Check range of values
An ‘Age’ field may be expected to have a minimum value of 0 and a maximum of 60
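A minimal sketch of both numeric checks, assuming hypothetical column names and values:

```python
import pandas as pd

# Hypothetical field with one bad entry; it loads as Object, not float
df = pd.DataFrame({"Salary Amount": ["30000", "45000.5", "abc", "62000"]})
assert df["Salary Amount"].dtype == object  # Object datatype signals bad data

# Coerce to numeric: invalid entries become NaN and can be inspected
salary = pd.to_numeric(df["Salary Amount"], errors="coerce")
bad_rows = df[salary.isna()]
print(bad_rows)

# Range check on a numeric field such as Age (expected range 0-60)
age = pd.Series([25, -3, 47, 130])
out_of_range = age[(age < 0) | (age > 60)]
print(out_of_range)
```

`pd.to_numeric(..., errors="coerce")` is a common way to surface the rows responsible for an unexpected Object dtype without losing the rest of the column.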
Categorical Fields:
Check the categorical levels of each feature/column with “Object” datatype
Levels may contain special characters like “?”, “-”, “!” or invalid categories which do not
represent the feature
Understanding the meaning and relevance of each feature, together with business knowledge,
plays an important role in identifying other anomalies in data
In finance, the business expects financial ratios to be within a defined range
For loan data, some features like,
Fixed Obligation to Income Ratio (“FOIR”) is expected to be in a range of 0-1
Net Loan to Value Ratio (“Net_LTV”) from 0-100, etc.
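Such business-rule ranges can be checked mechanically; the loan records below are hypothetical, but the column names follow the text:

```python
import pandas as pd

# Hypothetical loan records (values are assumptions for illustration)
loans = pd.DataFrame({
    "FOIR": [0.45, 0.80, 1.30, 0.10],       # expected range 0-1
    "Net_LTV": [75.0, 102.5, 60.0, 40.0],   # expected range 0-100
})

# Flag rows violating either business-rule range
violations = loans[
    (loans["FOIR"] < 0) | (loans["FOIR"] > 1)
    | (loans["Net_LTV"] < 0) | (loans["Net_LTV"] > 100)
]
print(violations)
```

Flagged rows are then reviewed with the business team: a value outside the expected range may be a data-entry error or a genuine exception.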
Scaling
Transformation
Outliers Detection & Treatment
Data Encoding
For negatively skewed features, Square and Exponential transformations are used;
Log, Cube Root, and Square Root transformations suit positively skewed features
If data is transformed, results are obtained in terms of the transformed data
Hence, care should be taken to reverse the transformation to conclude the results
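A minimal sketch of transforming and then reversing, using a log transform on hypothetical skewed amounts:

```python
import numpy as np

# Hypothetical positively skewed amounts (one long-tail value)
amounts = np.array([100.0, 200.0, 400.0, 10000.0])

# Log transform compresses the long right tail
logged = np.log(amounts)

# Any summary computed here is in log space...
mean_log = logged.mean()

# ...so reverse the transformation before reporting the result
back_transformed = np.exp(mean_log)
print(back_transformed)
```

Note that reversing after averaging yields the geometric, not arithmetic, mean of the original values; this is exactly why the slide warns that results are obtained in terms of the transformed data.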
Outliers are data points whose values are significantly different from the rest of the
values in the feature
An outlier might be a valid data point or may have been caused by an error
If we consider the heights of students in class 7, most may be in a range of 4.8 feet to 5.4 feet.
However, there may be 1 or 2 students who are around 4 feet or around 6 feet
During data entry, extra zeros may have been added to an amount field, making it different from the others
Most of the data provided for fraud detection will have very few records where fraud has occurred.
There is a high chance that these records get identified as outliers
Hence, it is important to analyze outliers before deciding on a treatment
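One common (though not the only) way to detect such points is the IQR rule, sketched here on the class-7 heights example from above:

```python
import pandas as pd

# Hypothetical heights (feet) of class-7 students, echoing the example above
heights = pd.Series([4.9, 5.0, 5.1, 5.2, 5.3, 5.4, 4.0, 6.0])

# IQR rule: values beyond 1.5 * IQR from the quartiles are flagged
q1, q3 = heights.quantile(0.25), heights.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = heights[(heights < lower) | (heights > upper)]
print(outliers)
```

The flagged points are candidates only; per the text, each must be analyzed (valid rare value vs. data-entry error, and in fraud data possibly the very records of interest) before any treatment such as capping or removal.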
“Object” and/or “Categorical” variables whose values are labels, like
Male/Female, are not allowed in models; hence they need to be “encoded” in
numeric format
There are primarily two types of encoding:
One Hot Encoding
Each category is converted to a column containing only boolean values
Recommended if there are few categorical levels within the field (fewer than 25)
Label Encoding
When there are too many levels/categories in a variable in the dataset
When the labels are “Ordinal”, like a “Satisfaction Score”
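Both encodings can be sketched with pandas; the fields, levels, and the ordinal mapping below are hypothetical:

```python
import pandas as pd

# Hypothetical data with one nominal and one ordinal field
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Satisfaction": ["Low", "High", "Medium", "Low"],
})

# One-hot encoding: one boolean column per level (suits few levels)
one_hot = pd.get_dummies(df["Gender"], prefix="Gender")
print(one_hot)

# Label encoding for an ordinal field: map levels to ordered integers
# (the order Low < Medium < High is an assumption for illustration)
order = {"Low": 0, "Medium": 1, "High": 2}
df["Satisfaction_code"] = df["Satisfaction"].map(order)
print(df)
```

An explicit mapping dict is preferable to an automatic label encoder for ordinal fields, because it guarantees the integers respect the business ordering of the levels.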