ML Lac0 Notes
Overview
These notes cover Exploratory Data Analysis (EDA) and its application to spatial data: the
fundamentals of EDA, the importance of EDA before modelling, and the specific techniques used
in spatial data analysis.
Learning Objectives
The presentation outlines a typical data analysis workflow, starting from data preparation, which
involves ingesting and cleaning data, followed by EDA to summarize data characteristics using
statistical numbers and visualizations.
Data ingestion involves reading data from various formats using Python libraries.
Data cleaning is emphasized as a crucial step to transform messy data into tidy data suitable for
modeling.
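A minimal sketch of ingestion followed by cleaning, assuming pandas as the library (the notes do not name one here) and an invented five-line table; the column names and values are hypothetical:

```python
import io

import pandas as pd

# Hypothetical messy table: capitalized headers, one missing value, one duplicate row.
raw_csv = io.StringIO(
    "Site,Visitors\n"
    "A,120\n"
    "B,\n"
    "C,95\n"
    "C,95\n"
)

df = pd.read_csv(raw_csv)              # ingest: parse the text into a DataFrame
df.columns = df.columns.str.lower()    # consistent, lower-case column names
df = df.drop_duplicates()              # remove the repeated row for site C
df = df.dropna(subset=["visitors"])    # drop rows with no visitor count
df = df.reset_index(drop=True)

print(df)
```

Dropping incomplete rows is only one option; imputing the missing value is another, and which is appropriate depends on the dataset.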
EDA is described as a method to summarize data characteristics with statistical measures and
visualizations. Key benefits of EDA include:
• Generating hypotheses.
The presentation highlights the importance of combining statistical analysis with visualization to
maximize data insights and uncover underlying structures, citing Anscombe's quartet as an example.
Bi-Variate Analysis
Bi-variate analysis techniques are discussed to understand relationships between two variables.
Methods include:
• Pair plots to show pairwise relationships and identify patterns and outliers.
ESDA applies traditional EDA techniques to spatial datasets, connecting variables to specific
locations or times and considering spatial autocorrelation. Key concepts include:
• Spatial autocorrelation: Describing how variable values are correlated across space.
• Various advanced mapping methods like conditional choropleth maps and Voronoi
diagrams.
• Data collection on potential locations, foot traffic, competitor locations, rent prices, and
demographics.
Key Takeaways
• The transformation of messy data into valuable insights through proper cleaning and
analysis.
This comprehensive presentation provides a solid foundation for understanding and applying
EDA and ESDA techniques in various data analysis scenarios.
Introduction to Exploratory (Spatial) Data Analysis
Mahdi KHODADADZADEH
Assistant Professor
Faculty of Geo-Information Science and Earth Observation (ITC)
Department of Geo-information Processing (GIP)
[email protected]
May 2024
Exploratory Data Analysis
From: https://xkcd.com
This lesson’s learning objectives
Explain to peers
• the fundamentals of E(S)DA
• the importance of E(S)DA before modelling
Apply statistical and visualization methods to different types of data
Develop familiarity with Python
You are a Python master. Congrats!
Magic Box
Model = Algorithm_X(Data)
Magic Box
Data Analysis Workflow
Data Preparation
From: https://davpy.netlify.app/3-data-workflow.html
Ingesting Data
Data Cleaning
From: https://www.openscapes.org/blog/2020/10/12/tidy-data/
Exploratory Data Analysis (EDA)
From: https://en.wikipedia.org/wiki/Anscombe%27s_quartet
Statistics + Visualization
Visualization
• Maximize insight into a data set
• Uncover underlying structure
From: https://en.wikipedia.org/wiki/Anscombe%27s_quartet
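Why statistics alone can mislead is easy to check numerically. The sketch below, assuming NumPy, uses the first two datasets of Anscombe's quartet (values as published on the cited Wikipedia page) and compares their summary statistics; the helper name `summarize` is made up for this illustration:

```python
import numpy as np

# First two datasets of Anscombe's quartet; they share the same x values.
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])


def summarize(x, y):
    """Return mean, sample variance, and Pearson correlation with x."""
    return {
        "mean_y": float(y.mean()),
        "var_y": float(y.var(ddof=1)),
        "corr_xy": float(np.corrcoef(x, y)[0, 1]),
    }


stats1 = summarize(x, y1)
stats2 = summarize(x, y2)
for name in stats1:
    print(f"{name}: set 1 = {stats1[name]:.2f}, set 2 = {stats2[name]:.2f}")
```

The printed statistics agree to two decimals, yet a scatter plot of each set shows one roughly linear cloud and one clear curve, which is the case for always plotting the data as well.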
Univariate Analysis
Univariate Analysis
Five-number summary
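The five-number summary consists of the minimum, first quartile, median, third quartile, and maximum. A minimal sketch, assuming NumPy and its default (linear interpolation) percentile method; the sample values are invented:

```python
import numpy as np

# Hypothetical sample; any 1-D numeric array works.
values = np.array([7, 1, 9, 3, 5, 2, 8, 4, 6], dtype=float)

# Percentiles 0, 25, 50, 75, 100 give the five-number summary.
minimum, q1, median, q3, maximum = np.percentile(values, [0, 25, 50, 75, 100])

print(f"min={minimum}, Q1={q1}, median={median}, Q3={q3}, max={maximum}")
```

Other quartile conventions exist, so different tools may report slightly different Q1 and Q3 for small samples.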
Univariate Analysis
https://towardsdatascience.com/understanding-boxplots-5e2df7bcbd51
Univariate Analysis
Bar plots
From: https://matplotlib.org/
From: https://xkcd.com
Bi-Variate Analysis
Correlation
Relationship between two variables quantitatively
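One common way to quantify such a relationship is the Pearson correlation coefficient. The slide does not say which coefficient it means, so treating it as Pearson's r is an assumption:

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}

% Worked example with invented data: x = (1, 2, 3), y = (1, 3, 2)
% \bar{x} = 2, \bar{y} = 2
% numerator:   (-1)(-1) + (0)(1) + (1)(0) = 1
% denominator: \sqrt{2 \cdot 2} = 2
% r = 1 / 2 = 0.5
```

Values near +1 or -1 indicate a strong linear relationship, and values near 0 indicate little linear relationship.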
Bi-Variate Analysis
Bi-Variate Analysis
Pair-plot
Exploratory Spatial Data Analysis
Applying EDA to geospatial data
Spatial autocorrelation
Spatial autocorrelation
Spatial autocorrelation
From: https://mgimond.github.io/Spatial/spatial-autocorrelation.html
SPATIAL AUTOCORRELATION: MORAN’S I
-1 0* +1
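To see how values toward either end of this scale arise, here is a minimal NumPy sketch of global Moran's I. It assumes binary weights (1 for adjacent cells, 0 otherwise) on a hypothetical row of six cells; the helper names `morans_i` and `chain_weights` are made up for this illustration:

```python
import numpy as np


def morans_i(values, weights):
    """Global Moran's I for n values and an n x n spatial weights matrix."""
    x = np.asarray(values, dtype=float)
    w = np.asarray(weights, dtype=float)
    z = x - x.mean()                 # deviations from the mean
    s0 = w.sum()                     # sum of all weights
    return (len(x) / s0) * (z @ w @ z) / (z @ z)


def chain_weights(n):
    """Binary weights for n cells in a row: each cell neighbors the next one."""
    w = np.zeros((n, n))
    for i in range(n - 1):
        w[i, i + 1] = w[i + 1, i] = 1.0
    return w


w = chain_weights(6)
clustered = morans_i([1, 1, 1, 0, 0, 0], w)     # similar values side by side
alternating = morans_i([1, 0, 1, 0, 1, 0], w)   # dissimilar values side by side

print(f"clustered: {clustered:.2f}, alternating: {alternating:.2f}")
```

The clustered pattern gives a positive value (about 0.6) and the alternating pattern gives -1. Under no spatial autocorrelation the expected value is -1/(n-1), which approaches 0 as n grows.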
Visualization on map
Connection map
From: https://www.data-to-viz.com/story/MapConnection.html
Box map
The box map in the figure separates the three lower outliers (the
observations with zero values) from the other four observations
in the first quartile. They are depicted in dark blue. Similarly, it
separates the six outliers in Manhattan from the eight other
observations in the upper quartile. The upper outliers are
colored dark red.
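The classification behind such a map can be sketched as follows. This assumes the usual box-plot convention that outliers lie more than 1.5 times the interquartile range beyond the quartiles (the slide does not state the hinge), and it uses invented values; `box_map_classes` is a hypothetical helper name:

```python
import numpy as np

LABELS = np.array(
    ["lower outlier", "< 25%", "25-50%", "50-75%", "> 75%", "upper outlier"]
)


def box_map_classes(values, hinge=1.5):
    """Assign each value to one of the six box map categories."""
    x = np.asarray(values, dtype=float)
    q1, q2, q3 = np.percentile(x, [25, 50, 75])
    iqr = q3 - q1
    lower_fence = q1 - hinge * iqr
    upper_fence = q3 + hinge * iqr
    # The first matching condition wins; anything above the upper fence falls through.
    conditions = [x < lower_fence, x <= q1, x <= q2, x <= q3, x <= upper_fence]
    codes = np.select(conditions, [0, 1, 2, 3, 4], default=5)
    return LABELS[codes]


# Hypothetical observations, one per map unit.
observations = [1, 10, 11, 12, 13, 14, 15, 16, 30]
classes = box_map_classes(observations)

for value, label in zip(observations, classes):
    print(value, label)
```

On a map, each category then gets its own color, with the two outlier classes at the extremes of the color ramp.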
ESDA maps
Team Based Learning
Team based learning assignment
Ghelgheli decided to change his job, and as a tea lover, he opted to open a teahouse. He aimed
to find the right location for his business, one where many people passed by and few
competitors were nearby.
Ghelgheli started by collecting data, organizing it into rows and columns within a table on his
computer. However, the data was somewhat messy, containing several missing values and
even some anomalies. Nevertheless, Ghelgheli was enthusiastic about working with such a
dataset. He used some cool techniques to clean the data, extract statistical measures, and
generate plots and maps.
Through his analysis, Ghelgheli pinpointed a suitable location for his teahouse, and soon after
opening, it became a local favorite.
Which data and methods do you think Ghelgheli utilized for his analysis?
What interesting learnings did you derive from Ghelgheli's story?
Can you provide some real-life examples similar to Ghelgheli's experience?
Data Collection: Ghelgheli started by collecting data on potential locations for his
teahouse. This could include foot traffic data, competitor locations, rent prices,
demographic information of the area, etc.
Data Cleaning: The data Ghelgheli collected was described as messy, with missing
and strange values. Ghelgheli likely employed techniques like data imputation,
outlier detection, and data validation to clean the dataset.
Visualization: Ghelgheli created plots and maps to visualize the data. This could
include scatter plots, histograms, heatmaps, and geographical maps to identify
patterns and trends in the data.
Decision Making: Through the analysis, Ghelgheli identified a suitable location for
his teahouse based on the insights gained from the data analysis.
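The steps above can be sketched end to end on a toy table. Everything here is an assumption for illustration: the use of pandas, the column names, the numbers, and the scoring rule (rank by foot traffic minus rank by competitors) are not taken from the story:

```python
import pandas as pd

# Hypothetical candidate sites; every number is invented.
sites = pd.DataFrame({
    "site": ["A", "B", "C", "D"],
    "foot_traffic": [None, 1000.0, 1200.0, 1100.0],  # passers-by per day, one missing
    "competitors": [4, 1, 5, 2],                     # nearby teahouses
})

# Data cleaning: impute the missing count with the median of the observed values.
median_traffic = sites["foot_traffic"].median()
sites["foot_traffic"] = sites["foot_traffic"].fillna(median_traffic)

# Decision rule (an assumption): favor high foot traffic and few competitors.
sites["score"] = sites["foot_traffic"].rank() - sites["competitors"].rank()

ranked = sites.sort_values("score", ascending=False)
best = ranked.iloc[0]["site"]
print(ranked)
print("Best candidate:", best)
```

A real analysis would weigh more factors, such as rent and demographics, and would map the candidates rather than only rank them.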
• The importance of data in decision-making processes