ML Lac0 Notes

The document provides an overview of Exploratory Data Analysis (EDA) and its application in spatial data analysis, emphasizing the importance of EDA in the data analysis workflow. It discusses techniques for data ingestion and cleaning, statistical analysis, visualization methods, and introduces Exploratory Spatial Data Analysis (ESDA) concepts. A case study involving a teahouse location decision illustrates the practical application of EDA techniques in real-world scenarios.

Uploaded by

odpc4979

Introduction to Exploratory (Spatial) Data Analysis - Summary

Overview

This summary covers Exploratory Data Analysis (EDA) and its application to spatial data: the
fundamentals of EDA, the importance of EDA before modelling, and the specific techniques used
in spatial data analysis.

Learning Objectives

The primary learning objectives of the lesson are:

• Explaining the fundamentals and importance of EDA to peers.

• Applying statistical and visualization methods to different types of data.

• Developing familiarity with Python for data analysis tasks.

Data Analysis Workflow

The presentation outlines a typical data analysis workflow, starting from data preparation, which
involves ingesting and cleaning data, followed by EDA to summarize data characteristics using
statistical numbers and visualizations.

Data Ingestion and Cleaning

Data ingestion involves reading data from various formats using Python libraries such as:

• pandas.read_csv() for CSV files.

• pandas.read_excel() for Excel files.

• scipy.io.loadmat() for MATLAB files.

• geopandas.read_file() for shapefiles and GeoJSON files.

• rasterio.open() for GeoTIFF files.

• matplotlib.pyplot.imread() for images.

Data cleaning is emphasized as a crucial step to transform messy data into tidy data suitable for
modeling.

Exploratory Data Analysis (EDA)

EDA is described as a method to summarize data characteristics with statistical measures and
visualizations. Key benefits of EDA include:

• Providing an overview of the data.

• Guiding further analysis and method selection.

• Generating hypotheses.

• Identifying data problems.

• Understanding variable properties and relationships.


Statistical Analysis and Visualization

The presentation highlights the importance of combining statistical analysis with visualization to
maximize data insights and uncover underlying structures. Examples include:

• Histograms and Probability Density Functions (PDFs) for univariate analysis.

• Box plots for summarizing data distributions.

• Bar plots for categorical data.

Bi-Variate Analysis

Bi-variate analysis techniques are discussed to understand relationships between two variables.
Methods include:

• Correlation analysis to quantify relationships.

• 2-D scatter plots to visualize linear relationships.

• Pair plots to show pairwise relationships and identify patterns and outliers.

Exploratory Spatial Data Analysis (ESDA)

ESDA applies traditional EDA techniques to spatial datasets, connecting variables to specific
locations or times and considering spatial autocorrelation. Key concepts include:

• Spatial autocorrelation: Describing how variable values are correlated across space.

o Positive spatial autocorrelation: Similar values cluster together.

o Zero spatial autocorrelation: Random distribution of values.

o Negative spatial autocorrelation: Neighboring values are dissimilar, giving a dispersed pattern.

Visualization Techniques in ESDA

Several ESDA mapping techniques are introduced, including:

• Box maps to identify outliers and visualize data distribution.

• Connection maps to show spatial relationships.

• Various advanced mapping methods like conditional choropleth maps and Voronoi
diagrams.

Case Study: Ghelgheli’s Teahouse

The presentation includes a team-based learning assignment involving a hypothetical scenario
where Ghelgheli, a tea lover, uses data analysis to find a suitable location for his teahouse. The
case study emphasizes:

• Data collection on potential locations, foot traffic, competitor locations, rent prices, and
demographics.

• Data cleaning and imputation to handle missing and anomalous values.

• Statistical analysis to extract descriptive statistics.

• Visualization techniques to identify patterns and trends.

• Decision-making based on data insights to select the best location.

Key Takeaways

The document concludes with several important lessons:

• The critical role of data in decision-making processes.

• The effectiveness of EDA techniques in uncovering insights.

• The transformation of messy data into valuable insights through proper cleaning and
analysis.

This comprehensive presentation provides a solid foundation for understanding and applying
EDA and ESDA techniques in various data analysis scenarios.
Introduction to Exploratory
(Spatial) Data Analysis

Mahdi KHODADADZADEH
Assistant Professor
Faculty of Geo-Information Science and Earth Observation (ITC)
Department of Geo-information Processing (GIP)
[email protected]

May 2024
Exploratory Data Analysis

From: https://xkcd.com

This lesson’s learning objectives

Explain to peers
• the fundamentals of E(S)DA
• the importance of E(S)DA before modelling
Apply statistical and visualization methods to different types of data
Develop familiarity with Python

You are a Python master. Congrats!

Magic Box

Model = Algorithm_X(Data)

You’ve learned how to build a model in Python. Congrats!

Magic Box

But you run into some issues!

Data Analysis Workflow

Data Preparation

From: https://davpy.netlify.app/3-data-workflow.html

Ingesting Data

Getting data in a shape that we can use to start our analysis.

Python:
• Reading comma-separated value (CSV) data: pandas.read_csv()
• Reading an Excel file: pandas.read_excel()
• Reading a MATLAB file: scipy.io.loadmat()
• Reading shapefile and GeoJSON files: geopandas.read_file()
• Reading GeoTIFF: rasterio.open()
• Reading an image: matplotlib.pyplot.imread()

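A minimal sketch of the first reader in the list above, assuming pandas is installed. The CSV text and its column names are invented for illustration, and an in-memory buffer stands in for a file on disk so that the example runs anywhere.

```python
import io

import pandas as pd

# Hypothetical CSV content; a real workflow would pass a file path instead.
csv_text = "site,foot_traffic,rent\nA,1200,3000\nB,,2500\n"

# read_csv accepts a path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.dtypes)        # foot_traffic is read as float because of the empty cell
print(df.isna().sum())  # number of missing values per column
```

The other readers in the list are called in a similar way with a file path, although each returns its own type (for example, geopandas.read_file() returns a GeoDataFrame).
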
Data Cleaning

Data preparation: messy data → tidy data

Rectangular data structures → Data modelling

From: https://www.openscapes.org/blog/2020/10/12/tidy-data/

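One possible illustration of turning messy data into tidy data, assuming pandas. The table, its column names, and the choice to drop the missing observation are all hypothetical; the slide does not prescribe a specific method.

```python
import pandas as pd

# Hypothetical "messy" table: one column per year instead of one row per observation.
wide = pd.DataFrame(
    {
        "site": ["A", "B"],
        "2023": [100.0, 150.0],
        "2024": [120.0, None],  # missing value
    }
)

# Tidy layout: each row is one observation, each column is one variable.
tidy = wide.melt(id_vars="site", var_name="year", value_name="visitors")

# Drop observations without a measurement (imputation is another option).
tidy = tidy.dropna(subset=["visitors"]).reset_index(drop=True)

print(tidy)
```

The resulting rectangular table has one row per site and year, which is the shape most modelling libraries expect.
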
Exploratory Data Analysis (EDA)

EDA aims at summarizing the characteristics of a dataset with statistical numbers and graphs.

Statistical Analysis + Visualization

• Get an overview of the data
• Orient further analysis → choose correct methods/approaches
• Help you to generate hypotheses
• Spot problems in data
• Understand properties of the variables (e.g., mean)
• Understand relationships between variables

Statistics + Visualization

Visualization
• Maximize insight into a data set
• Uncover underlying structure

From: https://en.wikipedia.org/wiki/Anscombe%27s_quartet

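The point of Anscombe's quartet can be checked numerically. The sketch below assumes NumPy and uses the commonly published values of the four datasets: their means and correlations are nearly identical even though the scatter plots look completely different.

```python
import numpy as np

# Commonly published values of Anscombe's quartet.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (
        [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
        [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89],
    ),
}

summary = {}
for name, (x, y) in quartet.items():
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    r = np.corrcoef(x, y)[0, 1]  # Pearson correlation between x and y
    summary[name] = (x.mean(), y.mean(), r)
    print(f"{name}: mean x = {x.mean():.2f}, mean y = {y.mean():.2f}, r = {r:.3f}")
```

Summary statistics alone cannot distinguish these four datasets, which is why the slide pairs statistics with visualization.
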
Univariate Analysis

Mean and Standard Deviation

Histogram and PDF
• A histogram represents the distribution of the data, showing the number of observations that fall within each bin.
• A probability density function (PDF) is the continuous version of the histogram.

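A small sketch of these univariate summaries, assuming NumPy. The sample is synthetic (drawn from a normal distribution with made-up parameters) and the number of bins is an arbitrary choice.

```python
import numpy as np

# Synthetic sample; location, scale, and size are illustrative only.
rng = np.random.default_rng(0)
sample = rng.normal(loc=50, scale=10, size=1000)

mean = sample.mean()
std = sample.std(ddof=1)  # sample standard deviation

# Histogram: number of observations per bin.
counts, edges = np.histogram(sample, bins=20)

# density=True rescales the bars so that their total area is 1,
# which makes the histogram comparable to a PDF.
density, _ = np.histogram(sample, bins=20, density=True)
area = (density * np.diff(edges)).sum()

print(f"mean = {mean:.2f}, std = {std:.2f}, area under density = {area:.3f}")
```
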
Univariate Analysis

Min, Max, Median, Percentile, Quartile

Percentile: Given a vector V of length N, the q-th percentile of V is the value q/100 of the way from the minimum to the maximum in a sorted copy of V.
Quantile: The q-th quantile of V is the value q of the way from the minimum to the maximum in a sorted copy of V. The quartiles are the 0.25, 0.5, and 0.75 quantiles.

Five-number summary: minimum, first quartile, median, third quartile, maximum

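The definitions above can be tried out with numpy.percentile and numpy.quantile, which use linear interpolation between sorted values by default. The data vector is invented for illustration.

```python
import numpy as np

v = np.array([7, 1, 9, 3, 5, 2, 8, 4, 6])  # illustrative data, unsorted on purpose

# Five-number summary: minimum, first quartile, median, third quartile, maximum.
five_numbers = np.percentile(v, [0, 25, 50, 75, 100])

# The same values expressed as quantiles (q between 0 and 1).
as_quantiles = np.quantile(v, [0.0, 0.25, 0.5, 0.75, 1.0])

print(five_numbers)
```
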
Univariate Analysis

Box plot: displays the five-number summary (the minimum, first quartile, median, third quartile, and maximum) of a set of data.
It can tell you about your outliers and what their values are.

https://towardsdatascience.com/understanding-boxplots-5e2df7bcbd51

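A sketch of how a box plot typically flags outliers, assuming the common rule that points more than 1.5 times the inter-quartile range (IQR) beyond the quartiles are outliers. The slide does not state which rule is used, so the multiplier and the data are assumptions.

```python
import numpy as np

data = np.array([2, 3, 4, 5, 6, 7, 8, 9, 10, 40], dtype=float)  # illustrative values

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles (assumed convention).
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = data[(data < lower_fence) | (data > upper_fence)]

print(f"Q1 = {q1}, Q3 = {q3}, fences = ({lower_fence}, {upper_fence})")
print("outliers:", outliers)
```
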
Univariate Analysis

Bar plots

From: https://matplotlib.org/

From: https://xkcd.com

Bi-Variate Analysis

Correlation
Quantifies the relationship between two variables.

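The slide does not name a specific coefficient. As one common choice, the sketch below computes the Pearson correlation coefficient from its definition and compares it with numpy.corrcoef. The helper name pearson_r and the data are hypothetical.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx**2).sum() * (dy**2).sum())


x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]  # illustrative values

r_manual = pearson_r(x, y)
r_numpy = np.corrcoef(x, y)[0, 1]

print(f"r = {r_manual:.4f}")
```

Values near +1 or -1 indicate a strong linear relationship, and values near 0 indicate a weak linear relationship (a non-linear relationship may still exist).
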
Bi-Variate Analysis

2-D Scatter Plots

They show the relationship between two variables, for example whether it is linear.

Bi-Variate Analysis

Pair plot

A pair plot is a visualization that shows pairwise relationships between variables in a dataset. It’s a great way to explore how different variables correlate with each other.

What is a Pair Plot?

A pair plot displays a scatterplot for each variable pair in your dataset, typically with a histogram or kernel density estimate of each single variable on the diagonal.
It’s useful for identifying patterns, correlations, and potential outliers.

Exploratory Spatial Data Analysis

Geospatial data → ESDA

“Traditional” EDA can be applied to spatial datasets for obtaining statistics and basic plots (bar plots, histograms, box plots, ...).
ESDA tools connect a specific variable to a location/time.
They take into account the values of the same variable in different locations/times.

Applying EDA to geospatial data

Spatial autocorrelation

Correlation of a variable with itself across space (in different places in space) → relationships to neighbors

• Positive spatial autocorrelation
o values are similar to their neighbors or other close objects
o clusters of similar values on the map
• Zero or no spatial autocorrelation
o random values of close objects or neighbors
o no clear pattern visually
• Negative spatial autocorrelation
o values are dissimilar to their neighbors or close objects
o dispersed patterns of values on the map

Spatial autocorrelation

From: (Radil, 2011)

Spatial autocorrelation

From: https://mgimond.github.io/Spatial/spatial-autocorrelation.html

SPATIAL AUTOCORRELATION: MORAN’S I

$$
I = \frac{n \sum_i \sum_j w_{ij} (x_i - \bar{x})(x_j - \bar{x})}{\left(\sum_i \sum_j w_{ij}\right) \sum_i (x_i - \bar{x})^2}
$$

• $n$ is the number of cases
• $x_i$ is the variable value at a particular location
• $x_j$ is the variable value at another location
• $\bar{x}$ is the mean of the variable
• $w_{ij}$ is a weight applied to the comparison between location i and location j

Value of I and interpretation:
-1: high negative spatial autocorrelation
0*: no spatial autocorrelation*
+1: high positive spatial autocorrelation

Check out the link below for a more in-depth explanation:
https://rpubs.com/corey_sparks/105700

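A minimal sketch of Moran's I as defined above, assuming NumPy. The two value patterns and the neighbor definition (six locations in a row, each a neighbor of the locations directly next to it, with binary weights) are invented for illustration; the function names are hypothetical.

```python
import numpy as np


def morans_i(x, w):
    """Moran's I for values x and an n-by-n spatial weight matrix w."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    z = x - x.mean()                              # deviations from the mean
    numerator = len(x) * (w * np.outer(z, z)).sum()
    denominator = w.sum() * (z**2).sum()
    return numerator / denominator


def chain_weights(n):
    """Binary weights for n locations in a row: w_ij = 1 if i and j are adjacent."""
    w = np.zeros((n, n))
    idx = np.arange(n - 1)
    w[idx, idx + 1] = 1
    w[idx + 1, idx] = 1
    return w


w = chain_weights(6)
clustered = [1, 1, 1, 5, 5, 5]    # similar values sit next to each other
alternating = [1, 5, 1, 5, 1, 5]  # each value differs from its neighbors

i_clustered = morans_i(clustered, w)
i_alternating = morans_i(alternating, w)

print(f"clustered:   I = {i_clustered:.2f}")
print(f"alternating: I = {i_alternating:.2f}")
```

The clustered pattern gives a positive value and the alternating pattern gives a negative one, matching the interpretation scale on the slide. The result depends on how the weights are defined, so a different neighbor rule would give different numbers.
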
Visualization on map

Connection map

From: https://www.data-to-viz.com/story/MapConnection.html

Box map

A box map (Anselin 1994) is the mapping counterpart of the idea behind a box plot. The point of departure is again a quantile map, more specifically, a quartile map. But the four categories are extended to six bins, to separately identify the lower and upper outliers. The definition of outliers is a function of a multiple of the inter-quartile range (IQR), the difference between the values for the 75th and 25th percentiles. As we will see in a later chapter in our discussion of the box plot, we use two options for these cut-off values, or hinges, 1.5 and 3.0. The box map uses the same convention.

The box map in the figure separates the three lower outliers (the observations with zero values) from the other four observations in the first quartile. They are depicted in dark blue. Similarly, it separates the six outliers in Manhattan from the eight other observations in the upper quartile. The upper outliers are colored dark red.

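The six-bin classification described above can be sketched with NumPy as follows. The values, the bin labels, the hinge of 1.5, and the handling of values that fall exactly on a break are assumptions made for illustration and may differ from what a given mapping tool does.

```python
import numpy as np

# Illustrative values, one per map unit (for example, one per district).
values = np.array([-50, 1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1
hinge = 1.5  # 3.0 is the other option mentioned in the text

# Five breaks define six bins.
breaks = [q1 - hinge * iqr, q1, median, q3, q3 + hinge * iqr]
labels = ["lower outlier", "< 25%", "25% - 50%", "50% - 75%", "> 75%", "upper outlier"]

# digitize returns, for each value, the index of the bin it falls into.
bins = np.digitize(values, breaks)

for value, b in zip(values, bins):
    print(f"{value:7.1f} -> {labels[b]}")
```

Each map unit would then be colored by its bin, so that lower and upper outliers stand out from the four quartile classes.
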
ESDA maps

Some examples of ESDA maps:

• Box map: https://geodacenter.github.io/workbook/3a_mapping/lab3a.html#extreme-value-maps
• Brushing & linking: https://www.spatialanalysisonline.com/HTML/eda__esda_and_estda.htm
• Conditional choropleth mapping: http://publichealthintelligence.org/content/geography-diabetes-us-conditioned-map
• Voronoi analysis: https://www.gislounge.com/voronoi-diagrams-and-gis/
• Cartograms: https://gisgeography.com/cartogram-maps/
• Connection map: https://www.data-to-viz.com/story/MapConnection.html

Team-Based Learning
Team-based learning assignment
Ghelgheli decided to change his job, and as a tea lover, he opted to open a teahouse. He aimed
to find the right location for his business, where many people pass by and few competitors are
nearby.
Ghelgheli started by collecting data, organizing it into rows and columns within a table on his
computer. However, the data was somewhat messy, containing several missing values and
even some anomalies. Nevertheless, Ghelgheli was enthusiastic about working with such a
dataset. He used some cool techniques to clean the data, extract statistical measures, and
generate plots and maps.
Through his analysis, Ghelgheli pinpointed a suitable location for his teahouse, and soon after
opening, it became a local favorite.

Which data and methods do you think Ghelgheli utilized for his analysis?
What interesting learnings did you derive from Ghelgheli's story?
Can you provide some real-life examples similar to Ghelgheli's experience?

Data Collection: Ghelgheli started by collecting data on potential locations for his
teahouse. This could include foot traffic data, competitor locations, rent prices,
demographic information of the area, etc.

Data Cleaning: The data Ghelgheli collected was described as messy, with missing
and strange values. Ghelgheli likely employed techniques like data imputation,
outlier detection, and data validation to clean the dataset.

Statistical Analysis: Ghelgheli extracted statistical measures from the cleaned
dataset. This could involve calculating means, medians, standard deviations, and
other descriptive statistics to understand the characteristics of the data.

Visualization: Ghelgheli created plots and maps to visualize the data. This could
include scatter plots, histograms, heatmaps, and geographical maps to identify
patterns and trends in the data.

Decision Making: Through the analysis, Ghelgheli identified a suitable location for
his teahouse based on the insights gained from the data analysis.

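The steps above could look roughly like the sketch below, assuming pandas and NumPy. Everything in it is hypothetical: the candidate sites, the numbers, the rule that a non-positive rent is an anomaly, median imputation, and the scoring formula are illustrations, not the method used in the story.

```python
import numpy as np
import pandas as pd

# Hypothetical candidate locations with one missing and one anomalous value.
sites = pd.DataFrame(
    {
        "site": ["A", "B", "C", "D"],
        "foot_traffic": [1200.0, 1400.0, np.nan, 1500.0],  # passers-by per day
        "competitors": [5, 1, 2, 6],                       # teahouses nearby
        "rent": [3000.0, 2500.0, -1.0, 4000.0],            # -1 is a data-entry anomaly
    }
)

# Data cleaning: treat impossible rents as missing, then impute with the median.
sites["rent"] = sites["rent"].where(sites["rent"] > 0)
for col in ["foot_traffic", "rent"]:
    sites[col] = sites[col].fillna(sites[col].median())

# Statistical analysis: descriptive statistics of the cleaned table.
print(sites.describe())

# Decision making: an invented score that rewards foot traffic
# and penalizes nearby competitors.
sites["score"] = sites["foot_traffic"] / (1 + sites["competitors"])
best_site = sites.loc[sites["score"].idxmax(), "site"]

print("best site under this score:", best_site)
```

A real analysis would also map the candidates and check whether the imputed values change the ranking.
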
• The importance of data in decision-making processes.

• The power of EDA techniques in uncovering insights and making informed decisions.

• How messy data can be transformed into valuable insights through proper cleaning and analysis.