IMPDAV
AY 2024-2025 SEM-II
Unit III - Syllabus
• Exploratory data analysis is a data analytics process that aims to understand the data in depth and
learn the different data characteristics, often using visual means. This allows you to get a better
feel of your data and find useful patterns.
Exploratory Data Analysis
Importance in Data Analysis
• It helps you gather insights, make better sense of the data, and remove irregularities and
unnecessary values from data.
• Helps you prepare your dataset for analysis.
• Helps a machine learning model make better predictions on the dataset.
• Gives you more accurate results.
• Helps you choose a better machine learning model.
Goals of EDA
Discover patterns and trends.
Spot errors, anomalies, and outliers.
Visualize relationships between variables.
e.g., a raw scatterplot vs. a cleaned-up, annotated version.
Steps Involved in Exploratory Data Analysis
1. Data Collection - Data collection is an essential part of exploratory data analysis. It refers to the
process of finding and loading data into our system. Good, reliable data can be found on various
public sites or bought from private organizations. Some reliable sites for data collection are
Kaggle, Github, Machine Learning Repository, etc.
• The data depicted below represents the housing dataset available on Kaggle. It contains
information on houses and their sale prices.
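A minimal sketch of the loading step, assuming the data comes as a CSV file. With the real Kaggle download you would pass a file path to `pd.read_csv` (the name `train.csv` in the comment is an assumption). Here a tiny in-memory sample with invented values and assumed column names stands in, so the snippet runs on its own.

```python
import io

import pandas as pd

# Stand-in for the downloaded file; the rows are invented for illustration.
# With the real dataset you would pass a path, e.g. pd.read_csv("train.csv").
csv_text = """Id,LotArea,OverallQual,TotalBsmtSF,SalePrice
1,8450,7,856,208500
2,9600,6,1262,181500
3,11250,7,920,223500
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)   # (rows, columns)
print(df.head())  # first few records
df.info()         # column types and non-null counts
```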
2. Data Cleaning - Data cleaning refers to removing unwanted variables and values
from your dataset and eliminating any irregularities in it. Such anomalies can
disproportionately skew the data and, hence, adversely affect the results. Some
steps that can be done to clean data are:
● Removing missing values, outliers, and unnecessary rows/ columns.
● Re-indexing and reformatting our data.
Now, it's time to clean the housing dataset. First, check the number of
missing values in each column and the percentage of the column that they make up.
Then drop the columns that are missing more than 15% of their data. Some
variables, such as 'PoolQC', 'MiscFeature', and 'Alley', are missing a significant
chunk of their data and are removed by this rule.
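The missing-value check and the 15% rule can be sketched as follows. The toy frame and its values are invented; only the column names are borrowed from the housing example, and the threshold is the 15% mentioned in the text.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the housing data; values are invented.
df = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550, 14260],
    "PoolQC": [np.nan, np.nan, np.nan, "Gd", np.nan],
    "Alley": [np.nan, "Grvl", np.nan, np.nan, np.nan],
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
})

# Count and percentage of missing values per column, worst first.
missing = pd.DataFrame({
    "count": df.isna().sum(),
    "percent": df.isna().mean() * 100,
}).sort_values("percent", ascending=False)
print(missing)

# Drop every column that is missing more than 15% of its values.
threshold = 15
to_drop = missing.index[missing["percent"] > threshold]
cleaned = df.drop(columns=to_drop)
print(list(cleaned.columns))
```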
Your final dataset after cleaning looks as shown below. You now have only 63 columns of
importance.
Univariate Analysis
In Univariate Analysis, you analyze data of just one variable. A variable in your dataset
refers to a single feature/ column. You can do this with graphical or non-graphical means
by finding specific mathematical values in the data. Some visual methods include:
● Histograms: Bar plots in which the frequency of values in each interval (bin) is
represented with rectangular bars.
● Box plots: Summarize a distribution with a box spanning the first to third quartile, a line at the median, and whiskers showing the spread.
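The numbers behind these two plots can be computed without drawing anything, which covers the non-graphical route as well. This is a sketch on an invented sample; the choice of 4 bins and the 1.5 * IQR whisker rule are illustrative assumptions.

```python
import numpy as np

# Invented sample of sale prices (in thousands).
prices = np.array([120, 135, 140, 150, 155, 160, 165, 180, 210, 400])

# Histogram: count how many values fall into each of 4 equal-width bins.
counts, edges = np.histogram(prices, bins=4)
for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:6.1f} to {right:6.1f}: {'#' * int(n)}")

# Box plot: quartiles, IQR, and the 1.5 * IQR whisker limits.
q1, median, q3 = np.percentile(prices, [25, 50, 75])
iqr = q3 - q1
print("Q1, median, Q3:", q1, median, q3)
print("whisker limits:", q1 - 1.5 * iqr, q3 + 1.5 * iqr)
```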
Right skew
Also known as positive skew, this distribution has a longer tail on the right
side of its peak. The mean is typically greater than the median.
Left skew
Also known as negative skew, this distribution has a longer tail on the left
side of its peak. The mean is typically less than the median.
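A quick way to check the direction of skew is to compare the mean with the median and look at the sign of the skewness statistic. The two samples below are invented for illustration.

```python
import pandas as pd

# Invented samples: one with a long right tail, one with a long left tail.
right_skewed = pd.Series([1, 2, 2, 3, 3, 3, 4, 20])
left_skewed = pd.Series([-20, -4, -3, -3, -3, -2, -2, -1])

for name, s in [("right", right_skewed), ("left", left_skewed)]:
    print(name, "mean:", s.mean(), "median:", s.median(),
          "skew:", round(s.skew(), 2))
```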
• High kurtosis: A narrow box with long whiskers indicates high kurtosis. This means the
distribution has a narrow peak and many extreme values.
• Low kurtosis: A wide box with short whiskers indicates low kurtosis. This means the
distribution has a broad peak and few extreme values.
• Normal distribution
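Kurtosis can also be measured directly instead of being read off a box plot. In this sketch, pandas `kurt()` reports excess kurtosis, which is about 0 for a normal distribution, positive for heavy tails, and negative for flat distributions. Both samples are invented.

```python
import pandas as pd

# Invented samples: most values at the centre plus a few extremes,
# versus values spread evenly with no extremes.
heavy_tails = pd.Series([-10, -1, 0, 0, 0, 0, 0, 0, 1, 10])
flat = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

heavy_kurt = heavy_tails.kurt()  # excess kurtosis (normal distribution ~ 0)
flat_kurt = flat.kurt()

print("heavy tails:", round(heavy_kurt, 2))
print("flat:       ", round(flat_kurt, 2))
```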
• From the graph, you can see that the distribution
deviates from the normal distribution and is positively
skewed.
• The above figure shows that the lower range values fall in a
similar range and are not too far from 0. Meanwhile, the higher
range values lie far from 0.
• You cannot consider all of them outliers, but you have to
be careful with the last two values that are above 7.
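The low-range and high-range values discussed above are standardized scores. A minimal sketch of that computation, on invented prices with two unusually expensive houses at the end, could look like this.

```python
import numpy as np

# Invented sale prices (in thousands); the last two are unusually high.
prices = np.array([95, 110, 120, 125, 130, 135, 140, 150, 160, 175,
                   180, 190, 200, 210, 220, 230, 240, 250, 700, 750],
                  dtype=float)

# Standardize: subtract the mean and divide by the standard deviation.
z = (prices - prices.mean()) / prices.std()
z_sorted = np.sort(z)

print("lowest 5 z-scores: ", np.round(z_sorted[:5], 2))
print("highest 5 z-scores:", np.round(z_sorted[-5:], 2))
```

In this sample the lowest scores sit close together near 0, while the two highest scores stand well apart from the rest.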
Tools and Libraries
Python: Pandas, Matplotlib, Seaborn, Plotly.
R: ggplot2, dplyr.
Visualization tools: Tableau, Power BI.
Deleting Outliers
Bivariate Analysis
Now, plot a scatter plot of the Basement area vs. the Sale Price and see their
relationship. Again, you can see that the greater the basement area, the higher
the sale price.
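A sketch of such a scatter plot with Matplotlib, plus a correlation coefficient to put a number on the trend. The column names `TotalBsmtSF` and `SalePrice` are assumed from the Kaggle housing data, and the values are invented.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Invented values; column names are assumptions.
df = pd.DataFrame({
    "TotalBsmtSF": [0, 600, 850, 920, 1100, 1260, 1500, 1700],
    "SalePrice": [90000, 130000, 150000, 165000,
                  185000, 200000, 245000, 280000],
})

fig, ax = plt.subplots()
ax.scatter(df["TotalBsmtSF"], df["SalePrice"])
ax.set_xlabel("Basement area (sq ft)")
ax.set_ylabel("Sale price")
ax.set_title("Basement area vs. sale price")
# In a notebook the figure renders inline; in a script, call plt.show()
# or fig.savefig(...) to see it.

# Pearson correlation summarizes the strength of the linear relationship.
corr = df["TotalBsmtSF"].corr(df["SalePrice"])
print(round(corr, 3))
```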
Moving ahead, plot a boxplot of the Sales Price with Overall Quality. The overall
quality feature is categorical here. It falls in the range of 1 to 10. Here, you can
see the increase in sales price as the quality increases. The rise looks a bit like
an exponential curve.
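The quartiles that each box in such a plot would draw can be computed per quality level. This is a sketch on invented data; `OverallQual` and `SalePrice` are assumed column names.

```python
import pandas as pd

# Invented values; column names are assumptions.
df = pd.DataFrame({
    "OverallQual": [3, 3, 5, 5, 5, 7, 7, 9, 9, 10],
    "SalePrice": [80000, 90000, 130000, 140000, 150000,
                  200000, 220000, 350000, 380000, 520000],
})

# One row per quality level: the quartiles a box plot would display.
summary = df.groupby("OverallQual")["SalePrice"].describe()[["25%", "50%", "75%"]]
print(summary)

# To draw the plot itself (requires matplotlib):
# df.boxplot(column="SalePrice", by="OverallQual")
```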
Advanced EDA Techniques
●Outlier Detection
●Time Series Analysis
●Dimensionality Reduction (PCA)
●Real-world Examples
●Outlier Detection - Ensuring data quality and reliability is crucial
for making informed decisions and extracting meaningful insights.
However, datasets often contain irregularities known as outliers,
which can significantly impact the integrity and accuracy of
analyses. This makes outlier detection a crucial task in data analysis.
Types of Outliers - Outliers can be classified into various types based
on their characteristics:
4.Contextual Outliers: These are data points that are considered outliers in a
specific context. For example, a high temperature may be normal in summer
but an outlier in winter.
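The temperature example can be sketched by scoring each reading against its own season instead of the whole dataset. The readings are invented, and the cut-off of 2 standard deviations is an illustrative assumption, not a fixed rule.

```python
import pandas as pd

# Invented readings: 30 degrees is ordinary in summer but not in winter.
df = pd.DataFrame({
    "season": ["summer"] * 10 + ["winter"] * 10,
    "temp_c": [30, 31, 29, 32, 30, 31, 29, 30, 31, 30,
               2, 3, 1, 2, 3, 2, 1, 3, 2, 30],
})

# Z-score of each reading relative to its own season.
grouped = df.groupby("season")["temp_c"]
df["z_in_season"] = (df["temp_c"] - grouped.transform("mean")) / grouped.transform("std")

# Flag readings more than 2 standard deviations from their seasonal mean.
contextual = df[df["z_in_season"].abs() > 2]
print(contextual)
```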
● https://www.kaggle.com/code/prashant111/eda-logistic-regression-pca
Advanced EDA Techniques Application
Advanced Exploratory Data Analysis (EDA) in real-world scenarios includes using techniques like:
● Interaction plots to examine complex relationships between multiple variables,
● Time series analysis to identify patterns in data over time,
● Dimensionality reduction to visualize high-dimensional data,
● Outlier detection using advanced statistical methods, and
● Clustering algorithms to identify distinct groups within a dataset.
These are often applied in fields like customer churn prediction,
fraud detection, healthcare analytics, and market research.
A. Customer Churn Analysis:
● Interaction plots: Visualizing how factors like customer
tenure, monthly usage, and recent support interactions
combine to influence churn probability.
● Time series analysis: Identifying patterns in customer
behavior over time to predict churn risk based on
usage trends.
● Clustering: Grouping customers with similar
characteristics to target churn prevention strategies.
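As one possible illustration of the clustering step, the sketch below groups a handful of invented customers with k-means. The two features (tenure and monthly usage), the choice of k-means, and the choice of two clusters are all assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Columns: tenure in months, monthly usage in hours (invented values).
customers = np.array([
    [2, 5], [3, 4], [1, 6], [4, 5],           # short tenure, low usage
    [36, 40], [40, 38], [38, 42], [42, 41],   # long tenure, high usage
], dtype=float)

# Scale the features so both contribute equally to the distance metric.
scaled = StandardScaler().fit_transform(customers)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(scaled)
print(labels)  # one cluster id per customer
```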
B. Healthcare Analytics:
• Dimensionality reduction: Analyzing large medical
datasets with many variables using techniques like
Principal Component Analysis (PCA) to identify key
factors impacting patient outcomes.
• Outlier detection: Identifying unusual patient data
points (e.g., extreme lab values) that could signal
potential health issues.
• Survival analysis: Studying factors influencing patient
survival rates using time-to-event analysis.
Advanced EDA Techniques:
1. Interaction Plot - Used to visualize how two or more variables interact
with each other.
• Example: Interaction between marketing spend and customer age on
sales.
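The table behind an interaction plot is the mean outcome for every combination of the two factors. The sketch below builds it for the marketing example; the age groups, spend levels, and sales figures are invented.

```python
import pandas as pd

# Invented observations for two age groups at two spend levels.
df = pd.DataFrame({
    "age_group": ["18-34"] * 4 + ["35+"] * 4,
    "spend": ["low", "low", "high", "high"] * 2,
    "sales": [100, 110, 200, 210, 100, 105, 120, 125],
})

# Mean sales for every combination of spend level and age group.
table = df.pivot_table(index="spend", columns="age_group",
                       values="sales", aggfunc="mean")
table = table.loc[["low", "high"]]
print(table)

# Effect of raising spend within each age group.
# Unequal effects across groups signal an interaction.
effect = table.loc["high"] - table.loc["low"]
print(effect)
```

Plotting each column of `table` as a line gives the interaction plot; non-parallel lines correspond to the unequal effects printed above.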
2. Time Series Analysis Plot
Shows how a variable changes over time.
• Example: Stock market trends, COVID-19 cases over time.
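A small sketch of two common first steps with time-indexed data: a rolling mean to smooth short-term fluctuation and resampling to a coarser period. The daily counts are invented (a linear trend plus a repeating weekly pattern).

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2024-01-01", periods=28, freq="D")

# Invented daily counts: upward trend plus a weekly pattern.
values = np.arange(28) * 2 + np.tile([0, 5, 3, 8, 2, -4, -6], 4)
series = pd.Series(values, index=dates, name="daily_cases")

# A 7-day rolling mean smooths out the weekly pattern and exposes the trend.
rolling = series.rolling(window=7).mean()

# Weekly totals give a coarser view of the same data.
weekly = series.resample("W").sum()

print(rolling.dropna().head())
print(weekly)
```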
3. Dimensionality Reduction (PCA, t-SNE, UMAP)
Used to visualize high-dimensional data in a lower-dimensional space.
• Example: PCA visualization of customer segmentation.
A. Interpreting the PCA Cluster Plot
• The X and Y axes represent Principal Component 1 and Principal Component 2, the two directions that
capture the most variance in the data.
• Each point represents a data sample, colored by the cluster it belongs to.
• Even though the data originally had more features (e.g., 5D or 10D), it is compressed to
2D while preserving as much of the structure (variance) as possible.
B. Advantages of PCA
• Reduces noise and redundancy in the data.
• Speeds up computations in machine learning models.
• Aids visualization of complex datasets.
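The compression to 2D described above can be sketched with scikit-learn. The 5-feature dataset is synthetic (two hidden factors plus noise), and standardizing before PCA is a common convention assumed here.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic 5-feature data: two hidden factors plus a little noise.
factors = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = factors @ mixing + 0.1 * rng.normal(size=(200, 5))

# PCA is sensitive to feature scale, so standardize first.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

print(X_2d.shape)                     # 200 samples, 2 components
print(pca.explained_variance_ratio_)  # share of variance per component
```

The columns of `X_2d` are the coordinates plotted on the X and Y axes of a PCA scatter plot.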
4. Outlier Detection (Boxplot, Z-score, Isolation Forest)
Identifies anomalies in data distribution.
• Example: Detecting fraud in credit card transactions.
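A hedged sketch of the Isolation Forest approach on invented transactions. The two features, the `contamination` value (the expected share of anomalies), and the planted extreme transaction are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Invented transactions: 200 typical ones plus one extreme one appended last.
# Columns: amount, hour of day.
typical = np.column_stack([
    rng.normal(50, 10, size=200),
    rng.normal(14, 3, size=200),
])
X = np.vstack([typical, [[5000.0, 3.0]]])

# contamination is the assumed fraction of anomalies in the data.
model = IsolationForest(contamination=0.01, random_state=0)
labels = model.fit_predict(X)  # -1 = anomaly, 1 = normal

print(np.where(labels == -1)[0])  # row indices flagged as anomalies
```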
Imputation Methods:
o Mean/Median Imputation: Fill in missing values with the mean/median of the column.
o Mode Imputation: Fill categorical missing values with the most frequent value.
o KNN Imputation: Use K-Nearest Neighbours to predict missing values.
o Multiple Imputation: Create multiple datasets with different imputed values.
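Three of these methods can be sketched on a toy table. The values are invented, and `n_neighbors=2` is an arbitrary choice for the example. Multiple imputation is left out because it requires a fuller modelling setup.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Invented table with gaps in numeric and categorical columns.
df = pd.DataFrame({
    "area": [50.0, 60.0, np.nan, 80.0, 95.0],
    "rooms": [2.0, 2.0, 3.0, np.nan, 5.0],
    "zone": ["A", "B", "A", None, "A"],
})

# Median imputation for a numeric column.
area_median = df["area"].fillna(df["area"].median())

# Mode imputation for a categorical column.
zone_mode = df["zone"].fillna(df["zone"].mode()[0])

# KNN imputation: fill each gap from the 2 most similar rows.
knn = KNNImputer(n_neighbors=2)
numeric_knn = pd.DataFrame(
    knn.fit_transform(df[["area", "rooms"]]),
    columns=["area", "rooms"],
)

print(area_median.tolist())
print(zone_mode.tolist())
print(numeric_knn)
```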
Solutions:
Visualization Techniques:
o Boxplots and Z-scores help detect outliers.
o Interquartile Range (IQR): Values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR are considered outliers.
Transformations:
o Log transformation or Winsorization to cap extreme values.
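The IQR rule and the two transformations can be sketched together. The sample is invented. Capping at the IQR fences is used here as a simple stand-in for Winsorization, which more commonly caps at chosen percentiles.

```python
import numpy as np
import pandas as pd

# Invented sample with one extreme value.
s = pd.Series([12, 14, 15, 15, 16, 17, 18, 19, 20, 95], dtype=float)

# IQR rule: flag values outside the 1.5 * IQR fences.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = s[(s < lower) | (s > upper)]
print("outliers:", outliers.tolist())

# Capping: pull extreme values back to the fences instead of dropping them.
capped = s.clip(lower=lower, upper=upper)

# Log transform: compress large values (log1p also handles zeros).
logged = np.log1p(s)

print("max after capping:", capped.max())
print("max after log1p:  ", round(logged.max(), 2))
```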
1. Understand the Data Context: Know the domain to guide cleaning and transformations.
4. Data Scaling & Normalization: Helps in models that rely on distance calculations (e.g., KNN,
SVM).
5. Automate EDA with Tools: Pandas Profiling (now published as ydata-profiling), Sweetviz, AutoViz for rapid insights.
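For the scaling and normalization point, a minimal sketch with scikit-learn shows the two most common options. The feature values are invented.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Columns: area in sq ft, number of rooms (invented values).
X = np.array([[800, 2], [1200, 3], [1500, 3], [2400, 5]], dtype=float)

# Standardization: each column gets zero mean and unit variance.
standardized = StandardScaler().fit_transform(X)

# Min-max normalization: each column is rescaled to the range [0, 1].
normalized = MinMaxScaler().fit_transform(X)

print(np.round(standardized, 2))
print(np.round(normalized, 2))
```

Without scaling, the area column would dominate any distance calculation simply because its numbers are larger.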
●Visual Demonstrations
1. Introduction to Tools like Jupyter Notebooks, R Shiny, etc.
R Shiny (R)
• Web-based interactive dashboards for EDA and data
visualization.
• Ideal for building dynamic reports that update with user
input.
• Used in data science, finance, and healthcare analytics.
Other Tools
• Tableau / Power BI: Drag-and-drop interactive EDA.
• Google Colab: Cloud-based Jupyter alternative with free
GPU/TPU.
• Streamlit / Dash: Python frameworks for custom web-based data apps.
3. Visual Demonstrations
Option 1: Pandas Profiling (Automated EDA)
• Generates a full report of data insights, including:
• Missing values, distributions, correlations, and key statistics.
https://colab.research.google.com/drive/1wBByojR4cefelJ1T7z85hEVo1GFcosPB?usp=sharing
Python Libraries for Analysis and Visualization
• NumPy, Pandas, Seaborn, and Sklearn are among the most widely used
Python libraries.
• NumPy is a library for scientific computing, Pandas is a library for
data analysis, Seaborn is a library for data visualization, and Sklearn
is a library for machine learning.
• Each library provides powerful yet simple data manipulation and analysis
tools. With these libraries, engineers can quickly build capable
applications that harness the power of data science.
1. NumPy (numpy)
Key Features:
• Supports multi-dimensional arrays.
https://colab.research.google.com/drive/1gv_iUCnb301Zqh7UPI9Eq0ga4TJ6-GPn?usp=sharing
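A brief sketch of what multi-dimensional array support looks like in practice. The array contents are arbitrary.

```python
import numpy as np

# A 2-D array with 3 rows and 4 columns.
a = np.arange(12).reshape(3, 4)
print(a.shape, a.ndim)

# Operations can run along a chosen axis without explicit loops.
col_means = a.mean(axis=0)  # one mean per column
print(col_means)

# Slicing works per dimension: every row, second column.
print(a[:, 1])

# The same 12 values viewed as a 3-D array.
cube = a.reshape(2, 3, 2)
print(cube.ndim)
```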
2. Pandas (pandas)
Purpose: Data manipulation & analysis, primarily using DataFrames & Series.
Key Features:
• Handles missing data efficiently.
https://colab.research.google.com/drive/1gv_iUCnb301Zqh7UPI9Eq0ga4TJ6-GPn?usp=sharing
3. Seaborn (seaborn)
Key Features:
• Attractive & informative statistical graphics.
https://colab.research.google.com/drive/1gv_iUCnb301Zqh7UPI9Eq0ga4TJ6-GPn?usp=sharing
4. Scikit-Learn (sklearn)
Key Features:
• Provides algorithms for classification, regression, clustering.
https://colab.research.google.com/drive/1gv_iUCnb301Zqh7UPI9Eq0ga4TJ6-GPn?usp=sharing
• An API (Application Programming Interface) is often used to access, retrieve, or send data
between different systems or platforms for analysis and visualization.
• In data analysis, APIs fetch data from online sources, databases, or other systems.
Examples:
2. Web Services:
• A type of API that operates over a network (commonly the internet) to enable communication
between different systems.
• Web services typically use standard protocols like HTTP/HTTPS to send and receive data.
• Formats: Most web services provide data in structured formats like JSON or XML, which are
easy to process in Python.
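To show why JSON is easy to process, the sketch below parses a payload shaped like a typical web service response and flattens it into a table. The payload, its field names, and its values are invented.

```python
import json

import pandas as pd

# Invented payload shaped like a typical JSON API response.
payload = """
{"city": "Example City",
 "readings": [
   {"date": "2025-01-01", "temp_c": 24.5, "wind": {"speed_kmh": 9}},
   {"date": "2025-01-02", "temp_c": 26.0, "wind": {"speed_kmh": 12}}
 ]}
"""

data = json.loads(payload)  # JSON text -> Python dicts and lists

# Flatten the nested records into columns (nested keys are joined with dots).
df = pd.json_normalize(data["readings"])
df["date"] = pd.to_datetime(df["date"])
print(df)
```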
Invoke APIs and Web Services
• Python provides libraries like requests, urllib, and others to simplify this
process.
1. Access Live/Real-Time Data: APIs allow analysts to work with up-to-date datasets
from external services (e.g., social media platforms, financial systems, or weather
services).
3. Diverse Data Sources: APIs make combining multiple data sources into a single
analysis easy, enriching the EDA process.
3. urllib: Another library for accessing web services, often more detailed but less user-friendly than
requests.
import requests
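Building on that import, a helper for pulling JSON records into a DataFrame might look like the sketch below. The function name `fetch_table`, its `session` parameter, and the URL in the usage comment are hypothetical; the service is assumed to return a JSON list of records.

```python
import pandas as pd
import requests


def fetch_table(url, params=None, session=None):
    """GET a JSON list of records from `url` and return it as a DataFrame."""
    http = session or requests  # allows passing a requests.Session or a test stub
    response = http.get(url, params=params, timeout=10)
    response.raise_for_status()  # raise an error on 4xx/5xx responses
    return pd.DataFrame(response.json())


# Hypothetical usage (api.example.com is a placeholder, not a real service):
# df = fetch_table("https://api.example.com/v1/prices", params={"symbol": "ABC"})
# print(df.describe())
```

Setting a timeout and calling `raise_for_status()` keeps a slow or failing service from silently feeding bad data into the analysis.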
Integrating APIs and web services with EDA techniques allows analysts to work
efficiently with dynamic and diverse datasets.
Python libraries for Analysis
Applications in EDA
• Problem Solving:
https://docs.google.com/document/u/3/d/16ZDBRZvOQAegQ8dqZdm-lEG4o7MVmAvD/edit?usp=drive_web&ouid=107372573615082269577&rtpof=true