IMPDAV

The document outlines the syllabus for Unit III of a Data Analytics and Visualization course at MIT School of Computing, focusing on Exploratory Data Analysis (EDA) techniques and tools. It covers the importance of EDA, steps involved in data collection and cleaning, as well as univariate and bivariate analysis methods using Python libraries. Advanced EDA techniques such as outlier detection, time series analysis, and dimensionality reduction are also discussed, along with real-world applications and challenges faced in EDA.

Uploaded by

GAYATRI BHOSALE
Copyright
© © All Rights Reserved

MIT Art, Design and Technology University

MIT School of Computing, Pune


21BTCS027 - Data Analytics and Visualization

Class - T.Y. (SEM-II), Core


Unit – III EDA FOR ANALYSIS AND
VISUALIZATION
Prof. Dr. Aditya Pai H
Prof. Shubhangi Divekar
Prof. Revati Deshpande

AY 2024-2025 SEM-II
Unit III - Syllabus

Unit III – EDA FOR ANALYSIS AND VISUALIZATION


Exploratory Data Analysis: Basic, Examples, Techniques.
Python libraries for Analysis: Pandas and Numpy, and invoke
APIs and Web Services. Visualize using Python: Matplotlib,
Seaborn, and Folium.
Exploratory Data Analysis
Basic Concepts of EDA

● Overview of Descriptive Statistics


● Central Tendency and Dispersion Measures
● Key Concepts: Mean, Median, Variance, Standard Deviation
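The key measures listed above can be computed in a few lines with Pandas. This is a minimal sketch; the price values are made up for illustration and are not taken from the course dataset.

```python
import pandas as pd

# Hypothetical sample: seven house prices in thousands (illustrative values only).
prices = pd.Series([120, 135, 150, 150, 165, 180, 400])

mean = prices.mean()        # central tendency: arithmetic average
median = prices.median()    # central tendency: middle value, robust to the 400 entry
variance = prices.var()     # dispersion: sample variance (Pandas uses ddof=1 by default)
std_dev = prices.std()      # dispersion: sample standard deviation

print(f"mean={mean:.2f}, median={median:.2f}")
print(f"variance={variance:.2f}, std={std_dev:.2f}")
```

The single large value pulls the mean above the median, which is why both measures are usually reported together.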
Exploratory Data Analysis
Definition of EDA

• Exploratory data analysis is a data analytics process that aims to understand the data in depth and
learn the different data characteristics, often using visual means. This allows you to get a better
feel of your data and find useful patterns.
Exploratory Data Analysis
Importance in the Data Analysis

• It helps you gather insights, make better sense of the data, and remove irregularities and unnecessary values.
• It helps you prepare your dataset for analysis.
• It allows a machine learning model to make better predictions on the dataset.
• It gives you more accurate results.
• It helps you choose a better machine learning model.
Exploratory Data Analysis
Goals of EDA
 Discover patterns and trends.
 Spot errors, anomalies, and outliers.
 Visualize relationships between variables.
 e.g., a raw scatterplot vs. a cleaned-up, annotated version.
Exploratory Data Analysis
Steps Involved in Exploratory Data Analysis

1. Data Collection - Data collection is an essential part of exploratory data analysis. It refers to the
process of finding and loading data into our system. Good, reliable data can be found on various
public sites or bought from private organizations. Some reliable sites for data collection are
Kaggle, Github, Machine Learning Repository, etc.

• The data depicted below represents the housing dataset available on Kaggle. It contains
information on houses and their sale prices.
Exploratory Data Analysis
Steps Involved in Exploratory Data Analysis

2. Data Cleaning - Data cleaning refers to removing unwanted variables and values
from your dataset and eliminating any irregularities in it. Such anomalies can
disproportionately skew the data and, hence, adversely affect the results. Some
steps that can be done to clean data are:
● Removing missing values, outliers, and unnecessary rows/ columns.
● Re-indexing and reformatting our data.

Now, it’s time to clean the housing dataset. First, check the number of missing values in each column and the percentage of that column's values they represent.
Exploratory Data Analysis
Steps Involved in Exploratory Data Analysis

3. Finding Missing Values

Drop the columns that are missing more than 15% of their data. Some variables, such as 'PoolQC', 'MiscFeature', and 'Alley', are missing a significant chunk of their values and fall well above this threshold.
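The missing-value check and the 15% rule can be sketched as follows. The DataFrame is a small hypothetical stand-in for the housing data (only the column names are borrowed from the text), so the numbers are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the housing data (values are made up).
df = pd.DataFrame({
    "SalePrice": [200, 180, 250, 300, 150, 220, 275, 190, 210, 260],
    "LotFrontage": [65, np.nan, 80, 70, 60, 75, 85, 68, 72, 90],   # 10% missing
    "PoolQC": [np.nan, np.nan, "Gd", np.nan, np.nan,
               np.nan, np.nan, np.nan, np.nan, np.nan],            # 90% missing
})

# Count and percentage of missing values per column, worst first.
total = df.isnull().sum().sort_values(ascending=False)
percent = (df.isnull().mean() * 100).sort_values(ascending=False)
missing = pd.concat([total, percent], axis=1, keys=["Total", "Percent"])
print(missing)

# Drop every column that is missing more than 15% of its values.
to_drop = missing[missing["Percent"] > 15].index
cleaned = df.drop(columns=to_drop)
print(list(cleaned.columns))
```

The 15% cut-off comes from the slide; in practice the threshold is a judgment call that depends on how important the column is.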
Exploratory Data Analysis
Steps Involved in Exploratory Data Analysis

4. Dropping Missing Values


Exploratory Data Analysis
Steps Involved in Exploratory Data Analysis

Your final dataset after cleaning looks as shown below. You now have only 63 columns of
importance.
Exploratory Data Analysis
Univariate Analysis
In Univariate Analysis, you analyze data of just one variable. A variable in your dataset
refers to a single feature/ column. You can do this with graphical or non-graphical means
by finding specific mathematical values in the data. Some visual methods include:

● Histograms: Bar plots in which the frequency of data is represented with rectangle
bars.
● Box plots: Here, the information is represented in the form of boxes.

Let's make a histogram out of our SalePrice column.
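A minimal sketch of such a histogram with Matplotlib. The SalePrice values here are synthetic (drawn from a log-normal distribution, an assumption chosen only because it gives a right-skewed shape similar to typical price data).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script also runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical SalePrice values: a log-normal sample gives a right-skewed shape.
rng = np.random.default_rng(seed=0)
sale_price = rng.lognormal(mean=12.0, sigma=0.4, size=1000)

fig, ax = plt.subplots(figsize=(6, 4))
# hist() returns the bar heights, the bin edges, and the drawn patches.
counts, bin_edges, _ = ax.hist(sale_price, bins=30, edgecolor="black")
ax.set_xlabel("SalePrice")
ax.set_ylabel("Frequency")
ax.set_title("Distribution of SalePrice (synthetic data)")
fig.savefig("saleprice_hist.png")
```

In a Jupyter notebook the backend line can be dropped and `plt.show()` used instead of saving to a file.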


Exploratory Data Analysis

Univariate Analysis
Right skew
Also known as positive skew, this distribution has a longer tail on the right
side of its peak. The mean of the data is greater than the median.

Left skew
Also known as negative skew, this distribution has a longer tail on the left
side of its peak. The mean of the data is less than the median.

Zero skew (symmetric distribution)

A symmetrical distribution, such as the normal distribution, where the data graph is the same on both sides of a central point.
Exploratory Data Analysis

Univariate Analysis
• High kurtosis
• A narrow box with long whiskers indicates high kurtosis. This means the distribution has a narrow peak and heavy tails with many extreme values.
• Low kurtosis
• A wide box with short whiskers indicates low kurtosis. This means the distribution has a broad peak and few extreme values.
• Normal distribution
• A bell-shaped curve with a kurtosis of 3 (an excess kurtosis of 0). This is the reference level of kurtosis, neither too heavy nor too light in the tails.
Exploratory Data Analysis

Univariate Analysis
• From the graph, you can say that the distribution deviates from the normal and is positively skewed.

• Now, find the Skewness and Kurtosis of the data.

Skewness and Kurtosis in your data
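Both values can be read off a Pandas Series directly. The sketch below assumes synthetic, right-skewed prices rather than the real SalePrice column. Note that `Series.kurt()` reports excess kurtosis, so a normal distribution scores about 0 rather than 3.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed prices (log-normal sample, for illustration only).
rng = np.random.default_rng(seed=0)
sale_price = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=1000))

skewness = sale_price.skew()   # > 0 means a longer right tail
kurtosis = sale_price.kurt()   # excess kurtosis: about 0 for a normal distribution

print(f"Skewness: {skewness:.3f}")
print(f"Kurtosis: {kurtosis:.3f}")
print("mean > median:", sale_price.mean() > sale_price.median())
```

A positive skewness together with a mean above the median matches the description of a right-skewed distribution.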
Exploratory Data Analysis
Univariate Analysis - To understand exactly which values are outliers, you need to establish a threshold. To do this, you first standardize the data so that it has a mean of 0 and a standard deviation of 1.

• The above figure shows that the lower range values fall in a similar range and are not too far from 0. Meanwhile, all the higher range values are far from 0.

• You cannot conclude that all of them are outliers, but you have to be careful with the last two values, which are above 7.
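Standardization and the inspection of the lowest and highest scores can be sketched with NumPy. The prices are hypothetical, with two deliberately extreme values at the end, so the z-scores will differ from those in the slide's figure.

```python
import numpy as np

# Hypothetical prices; the last two are deliberately extreme.
sale_price = np.array([120, 130, 135, 140, 145, 150, 155, 160, 165, 170,
                       175, 180, 185, 190, 200, 210, 220, 230, 600, 750], dtype=float)

# Standardize: subtract the mean and divide by the standard deviation.
scaled = (sale_price - sale_price.mean()) / sale_price.std()

print("mean after scaling:", round(float(scaled.mean()), 6))
print("std after scaling:", round(float(scaled.std()), 6))

# Compare the two ends of the standardized distribution.
ordered = np.sort(scaled)
print("lowest 5 z-scores:", ordered[:5].round(2))
print("highest 5 z-scores:", ordered[-5:].round(2))
```

The low end stays close to 0 while the two largest values stand clearly apart, which is the pattern to look for before deciding on a threshold.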
Exploratory Data Analysis
Tools and Libraries
 Python: Pandas, Matplotlib, Seaborn, Plotly.
 R: ggplot2, dplyr.
 Visualization tools: Tableau, Power BI.
Exploratory Data Analysis

Bivariate Analysis - Here, you compare two variables to find how one feature affects the other. This is done with scatter plots, which plot individual data points, or with correlation matrices, which show the correlation as color hues. You can also use boxplots.
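A correlation matrix can be computed with Pandas before it is drawn. The sketch below uses a hypothetical housing-style table (column names are illustrative); the resulting matrix could be passed to Seaborn's `heatmap` to show the correlations as hues.

```python
import pandas as pd

# Hypothetical housing-style data (values are made up).
df = pd.DataFrame({
    "TotalBsmtSF": [600, 800, 850, 1000, 1200, 1500, 1700, 2000],
    "OverallQual": [4, 5, 5, 6, 7, 7, 8, 9],
    "SalePrice": [110, 140, 150, 175, 210, 250, 290, 340],
})

# Pearson correlation between every pair of numeric columns.
corr = df.corr()
print(corr.round(2))

# Correlation of each feature with the target, strongest first.
print(corr["SalePrice"].drop("SalePrice").sort_values(ascending=False))
```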
Exploratory Data Analysis

Bivariate Analysis - Now, plot a scatter plot of the Basement area vs. the Sales Price and see their relationship. You can see that the greater the basement area, the higher the sales price.
Exploratory Data Analysis
Bivariate Analysis

Now, delete the last two values as they are outliers.

Deleting Outliers
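One way to delete the two extreme points is to sort by the feature and drop the rows with the largest values. This is a sketch under assumptions: the table, the `Id` column, and the values are hypothetical, and the original slide's code is not available.

```python
import pandas as pd

# Hypothetical data: the last two rows have very large basements but low prices.
df = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5, 6],
    "TotalBsmtSF": [800, 950, 1100, 1300, 5600, 6100],
    "SalePrice": [140, 160, 185, 215, 170, 160],
})

# The two largest basement areas sit far from the rest and break the trend.
outlier_ids = df.sort_values(by="TotalBsmtSF", ascending=False)["Id"].head(2)

# Keep every row whose Id is not in the outlier list.
cleaned = df[~df["Id"].isin(outlier_ids)].reset_index(drop=True)
print(cleaned)
```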
Exploratory Data Analysis
Bivariate Analysis
Now, plot a scatter plot of the Basement area vs. the Sales Price and see their
relationship. Again, you can see that the greater the basement area, the more
the sales price.
Exploratory Data Analysis
Bivariate Analysis
Moving ahead, plot a boxplot of the Sales Price with Overall Quality. The overall
quality feature is categorical here. It falls in the range of 1 to 10. Here, you can
see the increase in sales price as the quality increases. The rise looks a bit like
an exponential curve.
Exploratory Data Analysis
Advanced EDA Techniques
●Outlier Detection
●Time Series Analysis
●Dimensionality Reduction (PCA)
●Real-world Examples
Exploratory Data Analysis
Advanced EDA Techniques
●Outlier Detection - Ensuring data quality and reliability is crucial
for making informed decisions and extracting meaningful insights.
However, datasets often contain irregularities known as outliers,
which can significantly impact the integrity and accuracy of
analyses. This makes outlier detection a crucial task in data analysis.
Exploratory Data Analysis
Advanced EDA Techniques
Outlier Detection.
Types of Outliers - Outliers can be classified into various types based
on their characteristics:

1. Univariate Outliers: These are outliers that occur in a single variable or feature.

2. Multivariate Outliers: These outliers occur when considering multiple variables simultaneously. A data point may not be an outlier in any single dimension but can be an outlier when considering multiple dimensions.
Exploratory Data Analysis
Advanced EDA Techniques
Outlier Detection.
Types of Outliers

3. Global Outliers: Also known as point anomalies, these data points significantly differ from the rest of the dataset.

4. Contextual Outliers: These are data points that are considered outliers in a specific context. For example, a high temperature may be normal in summer but an outlier in winter.

5. Collective Outliers: A collection of data points that deviate significantly from the rest of the dataset, even if individual points within the collection are not outliers.
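The temperature example for contextual outliers can be sketched by comparing a global z-score with a per-season z-score. The readings and the threshold of 2 are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical temperature readings; one winter day reads 30 °C.
temps = pd.DataFrame({
    "season": ["winter"] * 8 + ["summer"] * 8,
    "temp_c": [2, 4, 3, 5, 1, 3, 4, 30,
               29, 31, 30, 32, 28, 30, 31, 30],
})

# Global z-score: compares each reading with all readings.
global_z = (temps["temp_c"] - temps["temp_c"].mean()) / temps["temp_c"].std()

# Contextual z-score: compares each reading only with its own season.
by_season = temps.groupby("season")["temp_c"]
context_z = (temps["temp_c"] - by_season.transform("mean")) / by_season.transform("std")

temps["global_outlier"] = global_z.abs() > 2
temps["contextual_outlier"] = context_z.abs() > 2
print(temps[temps["contextual_outlier"]])
```

The 30 °C winter reading is unremarkable against the whole dataset but stands out once it is judged only against other winter days.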
Exploratory Data Analysis
Advanced EDA Techniques
●Time Series Analysis - In Exploratory Data Analysis (EDA), "time
series analysis" refers to the process of examining data collected
over time to identify patterns, trends, seasonality, and outliers by
visualizing the data through techniques like line plots,
autocorrelation plots, and decomposition, which helps in
understanding the underlying structure of the time series data and
guiding further analysis or modeling decisions.
Exploratory Data Analysis
Advanced EDA Techniques
• Time Series Analysis - The obvious graph to start with is the time
plot. That is, the observations are plotted against the time they
were observed, with consecutive observations joined by lines.
• In Python, we can use Pandas and Matplotlib:
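A minimal sketch of a time plot, assuming a synthetic monthly series with an upward trend and a yearly cycle (the data, the column name `sales`, and the output file name are all illustrative). A 12-month rolling mean is added to make the trend easier to see.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script also runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical monthly series: upward trend plus a yearly seasonal cycle.
dates = pd.date_range(start="2022-01-01", periods=36, freq="MS")
months = np.arange(36)
values = 100 + 2 * months + 10 * np.sin(2 * np.pi * months / 12)
series = pd.Series(values, index=dates, name="sales")

# A 12-month rolling mean averages out the seasonality and exposes the trend.
trend = series.rolling(window=12).mean()

# Time plot: observations against time, consecutive points joined by lines.
fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(series.index, series.values, label="Observed")
ax.plot(trend.index, trend.values, label="12-month rolling mean")
ax.set_xlabel("Date")
ax.set_ylabel("Sales")
ax.legend()
fig.savefig("time_plot.png")
```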
Exploratory Data Analysis
Advanced EDA Techniques
●Dimensionality Reduction (PCA)
●In Exploratory Data Analysis (EDA), dimensionality
reduction using Principal Component Analysis (PCA)
is a technique used to transform high-dimensional
data into a lower-dimensional space, allowing for
easier visualization and identification of patterns
within complex datasets, while still preserving the
most important information from the original data.
Exploratory Data Analysis
Advanced EDA Techniques
●Dimensionality Reduction (PCA)
●Principal Component Analysis (PCA) is a
dimensionality reduction technique that can be used to
reduce a larger set of feature variables into a smaller
set that still contains most of the variance in the larger
set.
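The idea can be sketched with plain NumPy by taking the singular value decomposition of the centered data. This is an illustration on synthetic data, not the method used in the linked notebook; scikit-learn's `PCA` class offers the same reduction as a ready-made estimator.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical data: 200 samples, 5 features driven by only 2 hidden factors.
hidden = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = hidden @ mixing + 0.05 * rng.normal(size=(200, 5))

# Step 1: center each feature at zero.
X_centered = X - X.mean(axis=0)

# Step 2: SVD of the centered data; the rows of Vt are the principal directions.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Step 3: share of the total variance captured by each component.
explained_ratio = S**2 / np.sum(S**2)

# Step 4: project the data onto the first two components.
X_2d = X_centered @ Vt[:2].T

print("explained variance ratio:", explained_ratio.round(3))
print("reduced shape:", X_2d.shape)
```

Because the five features were built from two hidden factors, the first two components carry nearly all of the variance.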

●https://www.kaggle.com/code/prashant111/eda-logistic-regression-pca
Exploratory Data Analysis
Advanced EDA Techniques Application
● Advanced Exploratory Data Analysis (EDA) in real-world scenarios includes techniques like:
● Interaction plots to examine complex relationships between multiple variables,
● Time series analysis to identify patterns in data over time,
● Dimensionality reduction to visualize high-dimensional data,
● Outlier detection using advanced statistical methods, and
● Clustering algorithms to identify distinct groups within a dataset.
These are often applied in fields like customer churn prediction, fraud detection, healthcare analytics, and market research.
Exploratory Data Analysis
Advanced EDA Techniques Application
A. Customer Churn Analysis:
●Interaction plots: Visualizing how factors like customer
tenure, monthly usage, and recent support interactions
combine to influence churn probability.
●Time series analysis: Identifying patterns in customer
behavior over time to predict churn risk based on
usage trends.
●Clustering: Grouping customers with similar
characteristics to target churn prevention strategies.
Exploratory Data Analysis
Advanced EDA Techniques
B. Healthcare Analytics:
• Dimensionality reduction: Analyzing large medical
datasets with many variables using techniques like
Principal Component Analysis (PCA) to identify key
factors impacting patient outcomes.
• Outlier detection: Identifying unusual patient data
points (e.g., extreme lab values) that could signal
potential health issues.
• Survival analysis: Studying factors influencing patient
survival rates using time-to-event analysis.
Exploratory Data Analysis
Advanced EDA Techniques:
1. Interaction Plot - Used to visualize how two or more variables interact
with each other.
• Example: Interaction between marketing spend and customer age on
sales.
Exploratory Data Analysis
Advanced EDA Techniques:
2. Time Series Analysis Plot
Shows how a variable changes over time.
• Example: Stock market trends, COVID-19 cases over time.
Exploratory Data Analysis
Advanced EDA Techniques:
3. Dimensionality Reduction (PCA, t-SNE, UMAP)
Used to visualize high-dimensional data in a lower-dimensional space.
• Example: PCA visualization of customer segmentation.
Exploratory Data Analysis
Advanced EDA Techniques:
3. Dimensionality Reduction (PCA, t-SNE, UMAP)
A. Interpreting the PCA Cluster Plot
• The X and Y axes represent Principal Component 1 and Principal Component 2, which
contain the most variance in the data.
• Each point represents a data sample, colored by the cluster it belongs to.
• Even though the data originally had more features (e.g., 5D or 10D), we compressed it to
2D while preserving the structure.
B. Advantages of PCA
• Reduces noise and redundancy in the data.
• Speeds up computations in machine learning models.
• Aids visualization of complex datasets.
Exploratory Data Analysis
Advanced EDA Techniques:
4. Outlier Detection (Boxplot, Z-score, Isolation Forest)
Identifies anomalies in data distribution.
• Example: Detecting fraud in credit card transactions.
Exploratory Data Analysis

Advanced EDA Techniques:


4. Outlier Detection (Boxplot)

Normal Transactions (Inside the Box & Whiskers)


• The middle 50% of credit card transactions fall inside the box (the IQR), and most of the rest fall within the whiskers.
• These are regular spending patterns that follow normal
behavior.
Exploratory Data Analysis

Advanced EDA Techniques:


4. Outlier Detection (Boxplot)

Suspicious Transactions (Outliers - Dots Beyond the Whiskers)


• Transactions outside the whiskers are considered anomalies.
• These may indicate fraudulent activity, such as:
• Unusually high transactions (e.g., a user who normally spends $50 suddenly spends
$5,000).
• Multiple small transactions in a short time (indicative of fraudsters testing a stolen
card).
• Spending in unfamiliar locations (geographical anomalies).
Exploratory Data Analysis

Advanced EDA Techniques:


5. Clustering (K-Means, DBSCAN, Hierarchical Clustering)
Groups similar data points.
• Example: Customer segmentation in market research.
Exploratory Data Analysis
Advanced EDA Techniques:
5. Clustering (K-Means, DBSCAN, Hierarchical Clustering)
The scatter plot above shows the results of applying K-Means clustering for customer
segmentation based on:
• Annual Income ($1000s) (X-axis)
• Spending Score (1-100) (Y-axis)
Interpretation
• Customers are grouped into 4 clusters, represented by different colors.
• Cluster Centroids (black 'X' markers) indicate the center of each group.
• This segmentation helps businesses identify customer behavior patterns, such as:
• High-income, high-spending customers (Luxury buyers)
• Low-income, low-spending customers (Budget-conscious buyers)
• High-income, low-spending customers (Potential luxury market)
• Low-income, high-spending customers (Discount seekers)
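A segmentation like the one described can be sketched with scikit-learn's `KMeans`. The customer data below is synthetic (four groups generated around made-up income and spending values), so the cluster numbering and centroids are illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(seed=0)

# Hypothetical customers: four groups around (income in $1000s, spending score).
centers = np.array([[25, 20], [25, 80], [90, 20], [90, 80]])
X = np.vstack([c + rng.normal(scale=5, size=(50, 2)) for c in centers])

# Fit K-Means with 4 clusters; random_state makes the initialization repeatable.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

# Report each centroid and how many customers fall in its cluster.
for i, (income, score) in enumerate(kmeans.cluster_centers_):
    size = int(np.sum(labels == i))
    print(f"cluster {i}: income={income:.1f}, spending={score:.1f}, customers={size}")
```

The number of clusters is fixed at 4 here to match the slide; on real data it is usually chosen with a diagnostic such as the elbow method.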
Exploratory Data Analysis
Challenges in EDA
●Dealing with Missing Data
●Addressing Outliers
●Handling Skewed Distributions
●Strategies and Best Practices
Exploratory Data Analysis
1. Dealing with Missing Data
Problem:
 Missing values can lead to biased analysis and reduce model
performance.
 Causes: Human errors, data corruption, sensor failures, or
incomplete records.
Exploratory Data Analysis
1. Dealing with Missing Data
Solutions:

 Imputation Methods:
o Mean/Median Imputation: Fill in missing values with the mean/median of the column.
o Mode Imputation: Fill categorical missing values with the most frequent value.
o KNN Imputation: Use K-Nearest Neighbours to predict missing values.
o Multiple Imputation: Create multiple datasets with different imputed values.

 Dropping Missing Data: Drop rows when only a few values are missing at random, or drop a column when most of its values are missing.

 Domain-specific handling (e.g., using business rules to infer missing values).
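Median and mode imputation can be sketched with Pandas `fillna`. The table is hypothetical; for KNN imputation, scikit-learn provides `KNNImputer`, which is not shown here.

```python
import numpy as np
import pandas as pd

# Hypothetical records with gaps in a numeric and a categorical column.
df = pd.DataFrame({
    "age": [25, 30, np.nan, 40, 35, np.nan, 200],   # numeric, with one extreme value
    "city": ["Pune", "Mumbai", "Pune", None, "Pune", "Delhi", None],
})

# Median imputation: less sensitive to the extreme value 200 than the mean.
df["age_filled"] = df["age"].fillna(df["age"].median())

# Mode imputation: fill categorical gaps with the most frequent category.
most_common_city = df["city"].mode()[0]
df["city_filled"] = df["city"].fillna(most_common_city)

print(df)
```

Writing the result to new columns keeps the original values available for comparison.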


Exploratory Data Analysis
2. Addressing Outliers
Problem:
 Outliers can skew results and lead to incorrect conclusions.
 Causes: Errors in data entry, fraud, rare but valid occurrences.
Exploratory Data Analysis
2. Addressing Outliers

Solutions:

 Detection Techniques:
o Boxplots and Z-scores help detect outliers.

o Interquartile Range (IQR): Values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR are considered outliers.

 Transformations:
o Log transformation to compress extreme values, or Winsorization to cap them.

 Machine Learning Approaches:
o Isolation Forest, DBSCAN, One-Class SVM for anomaly detection.
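The IQR rule and a simple capping step can be sketched with NumPy. The transaction amounts are made up, and capping at the IQR fences is one possible choice (Winsorization more commonly caps at fixed percentiles).

```python
import numpy as np

# Hypothetical transaction amounts with two extreme values.
amounts = np.array([42, 45, 47, 50, 51, 52, 53, 55, 58, 60, 250, 5000], dtype=float)

# Quartiles and the fences at 1.5 * IQR beyond them.
q1, q3 = np.percentile(amounts, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag values below the lower fence or above the upper fence.
is_outlier = (amounts < lower) | (amounts > upper)
print("fences:", lower, upper)
print("outliers:", amounts[is_outlier])

# Cap extreme values at the fences instead of deleting them.
capped = np.clip(amounts, lower, upper)
print("max after capping:", capped.max())
```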
Exploratory Data Analysis
3. Handling Skewed Distributions
Problem:
 Highly skewed data affects the performance of statistical tests and
machine learning models.
 Right-skewed: Income, sales, transaction amounts (heavy tail on the
right).
 Left-skewed: Negative reviews, rare events (heavy tail on the left).
Exploratory Data Analysis
3. Handling Skewed Distributions
Solutions:
 Transformation Methods:
o Log Transformation: Reduces right-skew.
o Box-Cox Transformation: Normalizes both left- and right-skewed data; requires strictly positive values.
o Square Root & Reciprocal Transformations: Adjust distributions
with mild skew.
 Binning Data: Converting continuous data into categorical bins.
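The effect of a log transformation on right skew can be checked by comparing skewness before and after. The income values are synthetic (a log-normal sample), chosen as an assumption because a log transform makes such data roughly symmetric.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed incomes.
rng = np.random.default_rng(seed=0)
income = pd.Series(rng.lognormal(mean=10.0, sigma=0.8, size=2000))

# log1p computes log(1 + x), which stays defined when x is 0.
log_income = np.log1p(income)

print(f"skewness before: {income.skew():.2f}")
print(f"skewness after:  {log_income.skew():.2f}")
```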
Exploratory Data Analysis
4. Strategies and Best Practices

Best Practices for EDA:

1. Understand the Data Context: Know the domain to guide cleaning and transformations.

2. Use Visualization Techniques: Histograms, Boxplots, Pairplots, and Correlation Heatmaps to explore patterns.

3. Feature Engineering: Create meaningful features to improve analysis.

4. Data Scaling & Normalization: Helps in models that rely on distance calculations (e.g., KNN,
SVM).

5. Automate EDA with Tools: Pandas Profiling (now published as ydata-profiling), Sweetviz, and AutoViz for rapid insights.
Exploratory Data Analysis

Interactive EDA Tools

●Introduction to Tools like Jupyter Notebooks, R Shiny, etc.

●Benefits of Interactive Exploration

●Visual Demonstrations
Exploratory Data Analysis
1. Introduction to Tools like Jupyter Notebooks, R
Shiny, etc.

Jupyter Notebooks (Python)


• Interactive coding environment for Python, R, and Julia.
• Supports live visualizations (Matplotlib, Seaborn, Plotly).
• Allows step-by-step data exploration with Markdown
documentation.
Exploratory Data Analysis
1. Introduction to Tools like Jupyter Notebooks, R
Shiny, etc.

R Shiny (R)
• Web-based interactive dashboards for EDA and data
visualization.
• Ideal for building dynamic reports that update with user
input.
• Used in data science, finance, and healthcare analytics.
Exploratory Data Analysis
1. Introduction to Tools like Jupyter Notebooks, R
Shiny, etc.

Other Tools
• Tableau / Power BI: Drag-and-drop interactive EDA.
• Google Colab: Cloud-based Jupyter alternative with free
GPU/TPU.
• Streamlit / Dash: Python frameworks for custom web-
based data apps.
Exploratory Data Analysis

2. Benefits of Interactive Exploration

 Real-Time Analysis → Immediate feedback on data trends.
 Dynamic Filtering → Select specific ranges, apply filters, and update visualizations.
 Better Collaboration → Share notebooks/dashboards for team analysis.
 Custom Reports → Generate automated insights for decision-making.

Exploratory Data Analysis
3. Visual Demonstrations
• Interactive EDA can be demonstrated live using:
• Jupyter Notebooks with Pandas Profiling
• Plotly
• Streamlit
Exploratory Data Analysis
3. Visual Demonstrations
Option 1: Pandas Profiling (Automated EDA)
• Generates a full report of data insights, including:
• Missing values, distributions, correlations, and key statistics.

Option 2: Plotly (Interactive Graphs)


• Creates dynamic visualizations (scatter plots, histograms, and
bar charts).
• Users can zoom, filter, and hover over data points.
https://colab.research.google.com/drive/1IYD4dgd0pCpcx0ZDmnbAydXdnWdFOTeC?usp=sharing
Exploratory Data Analysis
3. Visual Demonstrations
Option 3: Streamlit (Web App for EDA)
• Builds a lightweight web-based dashboard for exploring
datasets interactively.
• Supports real-time filtering, uploading files, and interactive
charts.
Exploratory Data Analysis

Case Study: Retail Sales Analysis


●Walkthrough of a Retail Sales Dataset
●Application of Various EDA Techniques
●Key Findings and Insights

https://colab.research.google.com/drive/1wBByojR4cefelJ1T7z85hEVo1GFcosPB?usp=sharing
Python Libraries for Analysis and Visualization

• NumPy, Pandas, Seaborn, and Sklearn are a few of the most widely used libraries in Python programming.

• NumPy is a library for scientific computing, Pandas is a library for data analysis, Seaborn is a library for visualizing data, and Sklearn is a library for machine learning.

• Each library provides powerful yet simple data manipulation and analysis tools. With these libraries, engineers can quickly build capable applications that harness the power of data science.
Python Libraries for Analysis and Visualization

1. NumPy (numpy)

Purpose: Numerical computations, handling large arrays & matrices efficiently.

Key Features:
• Supports multi-dimensional arrays.

• Provides mathematical & statistical functions.

• Faster than Python lists due to vectorization.

https://colab.research.google.com/drive/1gv_iUCnb301Zqh7UPI9Eq0ga4TJ6-GPn?usp=sharing
Python Libraries for Analysis and Visualization

2. Pandas (pandas)

Purpose: Data manipulation & analysis, primarily using DataFrames & Series.

Key Features:
• Handles missing data efficiently.

• Supports SQL-like operations on data.

• Works well with CSV, Excel, SQL, and JSON files.

https://colab.research.google.com/drive/1gv_iUCnb301Zqh7UPI9Eq0ga4TJ6-GPn?usp=sharing
Python Libraries for Analysis and Visualization

3. Seaborn (seaborn)

Purpose: Advanced data visualization based on Matplotlib.

Key Features:
• Attractive & informative statistical graphics.

• Built-in themes for styling.

• Integrated with Pandas for easy plotting.

https://colab.research.google.com/drive/1gv_iUCnb301Zqh7UPI9Eq0ga4TJ6-GPn?usp=sharing
Python Libraries for Analysis and Visualization

4. Scikit-Learn (sklearn)

Purpose: Machine learning, data preprocessing, and model evaluation.

Key Features:
• Provides algorithms for classification, regression, clustering.

• Supports feature selection and dimensionality reduction.

• Comes with utilities for train-test splitting and performance evaluation.

https://colab.research.google.com/drive/1gv_iUCnb301Zqh7UPI9Eq0ga4TJ6-GPn?usp=sharing
Python Libraries for Analysis and Visualization

Library | Purpose | Key Functionality
NumPy | Numerical computing, arrays, matrices | np.array(), np.mean(), np.std()
Pandas | Data handling & analysis | pd.DataFrame(), df.describe(), df.groupby()
Seaborn | Statistical data visualization | sns.scatterplot(), sns.histplot()
Scikit-Learn | Machine learning & model evaluation | train_test_split(), LinearRegression()
Invoke APIs and Web Services

• APIs (Application Programming Interfaces) and Web Services allow software applications to communicate with each other over a network.

• In Exploratory Data Analysis (EDA), they are often used to access, retrieve, or send data between different systems or platforms for analysis and visualization.

1. API (Application Programming Interface):

• An API is a set of rules and protocols that allows one software application to interact with another.

• In data analysis, APIs fetch data from online sources, databases, or other systems.

Examples:

1. Weather APIs to fetch weather data.

2. Financial APIs to retrieve stock prices or economic indicators.


Invoke APIs and Web Services

2. Web Services:
• A type of API that operates over a network (commonly the internet) to enable communication
between different systems.

• Web services typically use standard protocols like HTTP/HTTPS to send and receive data.

• Formats: Most web services provide data in structured formats like JSON or XML, which are
easy to process in Python.
Invoke APIs and Web Services

3. Invoking APIs/Web Services:


• Invoking means sending a request to the API endpoint (a URL) and receiving
the response (data).

• Python provides libraries like requests, urllib, and others to simplify this
process.
Invoke APIs and Web Services

Why Use APIs in EDA?

1. Access Live/Real-Time Data: APIs allow analysts to work with up-to-date datasets
from external services (e.g., social media platforms, financial systems, or weather
services).

2. Automation: Automating data retrieval through APIs saves time compared to


manual data collection.

3. Diverse Data Sources: APIs make combining multiple data sources into a single
analysis easy, enriching the EDA process.
Invoke APIs and Web Services

Python Libraries for APIs:

1. requests: Used to send HTTP requests to APIs and receive responses.


https://colab.research.google.com/drive/1-B7tpWiHUg15tOTGmT2UWV7ZPFt53bga?usp=sharing

2. json: Used to parse JSON responses from APIs.

3. urllib: Another library for accessing web services, often more detailed but less user-friendly than
requests.
Invoke APIs and Web Services
import requests

# Step 1: Define the API endpoint
url = "https://jsonplaceholder.typicode.com/posts/1"

# Step 2: Send a GET request to that URL
response = requests.get(url)

# Step 3: Print the status and the response content
print("Status Code:", response.status_code)
print("Response JSON:", response.json())
Invoke APIs and Web Services
OUTPUT
Status Code: 200

Response JSON: {'userId': 1, 'id': 1, 'title': 'sunt aut facere repellat provident occaecati excepturi optio reprehenderit', 'body': 'quia et suscipit\nsuscipit recusandae consequuntur expedita et cum\nreprehenderit molestiae ut ut quas totam\nnostrum rerum est autem sunt rem eveniet architecto'}
Invoke APIs and Web Services
Status Code: 200
• This means the request was successful.

response.json() gives you a dictionary containing the keys userId, id, title, and body.
Invoke APIs and Web Services
❌ If There’s a Problem

Error Type | Example Trigger | What You’ll See
HTTPError | Invalid URL or page not found | "HTTP Error: 404 Client Error: Not Found"
ConnectionError | No internet / API server down | "Connection Error: ..."
Timeout | Slow or unresponsive API | "Timeout Error: ..."
RequestException | Catch-all for anything else | "Something went wrong: ..."
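The error types in the table map onto exception classes in `requests.exceptions`. The sketch below shows one way to produce those messages; the helper name `fetch_json` and the 5-second timeout are assumptions for illustration, not part of the original slides.

```python
import requests


def fetch_json(url, timeout=5):
    """Return the parsed JSON body, or None if the request fails."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()   # raises HTTPError for 4xx/5xx status codes
        return response.json()
    except requests.exceptions.HTTPError as err:
        print("HTTP Error:", err)
    except requests.exceptions.Timeout as err:
        # Checked before ConnectionError because a connect timeout is both.
        print("Timeout Error:", err)
    except requests.exceptions.ConnectionError as err:
        print("Connection Error:", err)
    except requests.exceptions.RequestException as err:
        # Base class of all requests exceptions: the catch-all.
        print("Something went wrong:", err)
    return None


# Example call (needs internet access):
# post = fetch_json("https://jsonplaceholder.typicode.com/posts/1")
```

The more specific exceptions are listed before `RequestException`, since Python runs the first matching `except` clause.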


Invoke APIs and Web Services

Example 2: Invoking an API in Python

Let’s fetch weather data from an example API:


import requests

# API endpoint and parameters
url = "http://api.weatherapi.com/v1/current.json"
params = {
    "key": "YOUR_API_KEY",  # Replace with your API key
    "q": "New York",        # Location
    "aqi": "no",            # Air Quality Index (optional)
}

# Send GET request to the API
response = requests.get(url, params=params)

# Parse the JSON response
if response.status_code == 200:
    data = response.json()
    print("Location:", data['location']['name'])
    print("Temperature (C):", data['current']['temp_c'])
    print("Condition:", data['current']['condition']['text'])
else:
    print("Failed to fetch data. Status Code:", response.status_code)
Invoke APIs and Web Services

How Do APIs Work Here?

1. The client sends a request to the server using a specific endpoint.
2. The server processes the request and returns a response in JSON or XML format.
3. The client processes the response for further analysis.

Anatomy of an API Request

• Endpoint: The URL to which a request is sent (e.g., https://api.example.com/data).
• Request Methods:
  • GET: Fetch data.
  • POST: Send new data.
  • PUT: Update existing data.
  • DELETE: Remove data.
• Response: The server returns data in JSON, XML, or plain text format.
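The parts of a request can be inspected without contacting any server by building and preparing a `requests.Request` object. The endpoint is the placeholder from the text, and the query parameters (`city`, `units`) are hypothetical.

```python
import requests

# Placeholder endpoint and hypothetical query parameters.
endpoint = "https://api.example.com/data"
params = {"city": "Pune", "units": "metric"}

# Build the request object without sending it, to inspect its parts.
request = requests.Request(method="GET", url=endpoint, params=params)
prepared = request.prepare()

print("Method:", prepared.method)
print("URL:   ", prepared.url)
```

The prepared URL shows how the parameters are appended to the endpoint as a query string, which is what the server receives with a GET request.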
Invoke APIs and Web Services

Working with APIs in Python

Objective: How to interact with APIs using Python libraries.

• Python Libraries for APIs: requests
  • A popular library to make HTTP requests.
  • Install using: pip install requests.

• Sending API Requests
  • Syntax: requests.get(endpoint), requests.post(endpoint, data).
  • Parse JSON responses using .json().

• Common API Response Codes
  • 200: Success.
  • 404: Resource not found.
  • 500: Internal server error.
Invoke APIs and Web Services

Practical Example in EDA:

1.Fetch data using APIs (e.g., stock prices).

2.Analyze data trends (e.g., using Pandas and Numpy).

3.Visualize trends (e.g., using Matplotlib, Seaborn).

Integrating APIs and web services with EDA techniques allows analysts to work
efficiently with dynamic and diverse datasets.
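The three steps can be sketched end to end. To keep the example runnable offline, the "fetched" data is a hard-coded JSON string shaped like what a stock-price API might return; the field names `date` and `close` are assumptions.

```python
import json

import pandas as pd

# Hypothetical JSON body, shaped like a response from a stock-price API.
payload = """
[
  {"date": "2025-01-01", "close": 100.0},
  {"date": "2025-01-02", "close": 102.0},
  {"date": "2025-01-03", "close": 101.0},
  {"date": "2025-01-06", "close": 105.0},
  {"date": "2025-01-07", "close": 107.0}
]
"""

# Step 1: fetch. With a live API, response.json() would return this structure.
records = json.loads(payload)

# Step 2: analyze trends with Pandas.
df = pd.DataFrame(records)
df["date"] = pd.to_datetime(df["date"])
df = df.set_index("date")
df["daily_return_pct"] = df["close"].pct_change() * 100
df["rolling_mean_3"] = df["close"].rolling(window=3).mean()
print(df.round(2))

# Step 3: visualize, for example with df["close"].plot() in a notebook.
```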
Python libraries for Analysis

Applications in EDA

• Cleaning real-world messy data for analysis.

• Extracting key metrics and patterns through grouping and aggregation.

• Transforming data to make it analysis-ready.
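Grouping and aggregation, the second application above, can be sketched with Pandas `groupby`. The sales table and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical sales records.
sales = pd.DataFrame({
    "region": ["North", "South", "North", "East", "South", "North"],
    "units": [10, 5, 8, 12, 7, 4],
    "revenue": [1000.0, 450.0, 820.0, 1300.0, 700.0, 390.0],
})

# One row per region: total units sold and average revenue per sale.
summary = (
    sales.groupby("region")
         .agg(total_units=("units", "sum"), mean_revenue=("revenue", "mean"))
         .sort_values("total_units", ascending=False)
)
print(summary)
```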


Visualize using Python: Matplotlib, Seaborn, and Folium.

Video Reference: https://www.youtube.com/watch?v=OOLlVlleaN4

Github Reference: https://github.com/oladapo-joseph/Automobile_Sales_Analysis
Visualize using Python: Matplotlib, Seaborn, and Folium.

1. Matplotlib: Basic Plotting


• Matplotlib is a foundational visualization library for creating static and interactive
plots.
Visualize using Python: Matplotlib, Seaborn, and Folium.

2. Seaborn: Statistical Visualization


• Seaborn builds on Matplotlib and is great for creating complex statistical graphics.
Visualize using Python: Matplotlib, Seaborn, and Folium.

3. Folium: Interactive Maps


• Folium is perfect for creating interactive maps with markers and other features.
Visualize using Python: Matplotlib, Seaborn, and Folium.
ICT Teaching
• Experiential Learning:
https://docs.google.com/document/u/3/d/19I906QSWwJqziroItmA1uTcb3xEjgjVq/edit?usp=drive_web&ouid=107372573615082269577&rtpof=true

• Problem Solving:
https://docs.google.com/document/u/3/d/16ZDBRZvOQAegQ8dqZdm-lEG4o7MVmAvD/edit?usp=drive_web&ouid=107372573615082269577&rtpof=true
