L4 Exploratory Analysis en

This document provides an overview of exploratory data analysis techniques that will be covered in an Introduction to Data Science course. The course covers topics such as data crawling, preprocessing, cleaning, visualization, machine learning, and analyzing text, image and graph data. The document outlines the learning outcomes, motivation, process and focus of exploratory data analysis. It describes common EDA questions, strategies, techniques and tools for summarizing univariate and bivariate data, including measures of central tendency, variability, frequency distributions, histograms, box plots, and scatter plots.

Uploaded by

Đức Anh

Introduction to Data Science (IT4142E)

Contents
• Lecture 1: Overview of Data Science
• Lecture 2: Data crawling and preprocessing
• Lecture 3: Data cleaning and integration
• Lecture 4: Exploratory data analysis
• Lecture 5: Data visualization
• Lecture 6: Multivariate data visualization
• Lecture 7: Machine learning
• Lecture 8: Big data analysis
• Lecture 9: Capstone Project guidance
• Lecture 10+11: Text, image, graph analysis
• Lecture 12: Evaluation of analysis results

Learning outcomes
• Understand key elements in exploratory data analysis
(EDA)
• Explain and use common summary statistics for EDA
• Plot and explain common graphs and charts for EDA

Motivation
• Before making inferences from data it is essential to
examine all your variables.
• To understand your data
• Why?
• To listen to the data:
• to catch mistakes
• to see patterns in the data
• to find violations of statistical assumptions
• to generate hypotheses
• …and because if you don’t, you will have trouble later

Data science process

1. Formulate a question
2. Gather data
3. Analyze data
4. Product
Source: Foundational Methodology for Data Science, IBM, 2015
Exploratory data analysis (EDA) focus
• The focus is on the data—its structure, outliers, and
models suggested by the data.
• EDA approach makes use of (and shows) all of the
available data. In this sense there is no corresponding
loss of information.
• Summary statistics
• Visualization
• Clustering and anomaly detection
• Dimensionality reduction

EDA definition
• EDA is not a set of techniques, but an
attitude/philosophy about how a data analysis should
be carried out.
• Helps to select the right tool for preprocessing or analysis
• Makes use of humans’ abilities to recognize patterns in data

EDA common questions
• What is a typical value?
• What is the uncertainty for a typical value?
• What is a good distributional fit for a set of numbers?
• Does an engineering modification have an effect?
• Does a factor have an effect?
• What are the most important factors?
• Are measurements coming from different laboratories
equivalent?
• What is the best function for relating a response variable to
a set of factor variables?
• What are the best settings for factors?
• Can we separate signal from noise in time dependent data?
• Can we extract any structure from multivariate data?
• Does the data have outliers?

EDA is an iterative process

• Repeat...
• Identify and prioritize relevant questions in
decreasing order of importance
• Ask questions
• Construct graphics to address questions
• Inspect “answer” and derive new questions

EDA strategy
• Examine variables one by one, then look at the
relationships among the different variables
• Start with graphs, then add numerical summaries of
specific aspects of the data
• Be aware of attribute types
• Categorical vs. Numeric

EDA techniques
• Graphical techniques
• scatter plots, character plots, box plots, histograms, probability
plots, residual plots, and mean plots.
• Quantitative techniques

Describing univariate data

Observations and variables

• Data is a collection of observations
• An attribute is a set of values describing some aspect
across all observations; it is called a variable

Types of variables

Dimensionality of data sets

• Univariate: Measurement made on one variable per
subject
• Bivariate: Measurement made on two variables per
subject
• Multivariate: Measurement made on many variables
per subject

Measures of central tendency
• Measures of Location: estimate a location parameter
for the distribution; i.e., to find a typical or central value
that best describes the data.
• Measures of Scale: characterize the spread, or
variability, of a data set. Measures of scale are simply
attempts to estimate this variability.
• Skewness and Kurtosis

Mean
• To calculate the average value of a set of observations,
sum their values and divide by the number of
observations: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$

Median
• The median is the value of the point which has half the
data smaller than that point and half the data larger
than that point.
• Calculation
• If there are an odd number of observations, find the middle
value
• If there are an even number of observations, find the middle
two values and average them
• Example
• Age of participants: 17 19 21 22 23 23 23 38
• Median = (22+23)/2 = 22.5

Mode
• The mode is the most commonly reported value for a
particular variable
• E.g. 3, 4, 5, 6, 7, 7, 7, 8, 8, 9. Mode = 7
• E.g. 3, 4, 5, 6, 7, 7, 7, 8, 8, 8, 9. Modes = {7, 8} (the data are bimodal; the two modes are not averaged)
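
The three location measures can be checked with a minimal sketch that uses Python's standard `statistics` module on the example values from the slides. The variable names are illustrative only.

```python
import statistics

ages = [17, 19, 21, 22, 23, 23, 23, 38]      # ages from the median example
values = [3, 4, 5, 6, 7, 7, 7, 8, 8, 8, 9]   # values from the second mode example

mean_age = statistics.mean(ages)       # sum of the values / number of values
median_age = statistics.median(ages)   # even count: average of the two middle values
modes = statistics.multimode(values)   # every value tied for the highest count

print("mean:", mean_age)
print("median:", median_age)
print("modes:", modes)
```

Note that the single outlying age (38) pulls the mean above the median, which is the reason the median is preferred for skewed data or data with outliers.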

Which location measure is best?
• Mean is best for symmetric distributions without outliers
• Median is useful for skewed distributions or data with
outliers

Measure of scale: Variance and standard deviation
• Variance: average of squared deviations of values from
the mean

• Standard deviation: simply the square root of the
variance
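
As a rough sketch, the definition above can be computed directly. It assumes the population form (divide by n), which matches "average of squared deviations"; the slide's own formula was not preserved, so the sample form (divide by n - 1) is shown for comparison.

```python
import math
import statistics

ages = [17, 19, 21, 22, 23, 23, 23, 38]  # ages from the median example

mean = sum(ages) / len(ages)

# Population variance: average of the squared deviations from the mean.
pop_var = sum((x - mean) ** 2 for x in ages) / len(ages)
pop_std = math.sqrt(pop_var)  # standard deviation = square root of the variance

# Sample variance divides by n - 1 instead of n (assumption: not stated in the slides).
sample_var = statistics.variance(ages)

print(f"population variance = {pop_var:.4f}, std = {pop_std:.4f}")
print(f"sample variance     = {sample_var:.4f}")
```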

Run sequence plot
• displays observed data in a time sequence.
• The run sequence plot can be used to answer the
following questions
• Are there any shifts in location?
• Are there any shifts in variation?
• Are there any outliers?

Bar charts
• A bar chart displays the relative frequencies of the
different values of a variable.
• More generally, a bar chart presents categorical data
with rectangular bars whose heights or lengths are
proportional to the values that they represent
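
The relative frequencies behind a bar chart can be sketched as follows. The `positions` list is made-up illustration data, not taken from the course files, and the text bars stand in for a real plot.

```python
from collections import Counter

# Hypothetical categorical variable (illustration only).
positions = ["defender", "midfielder", "forward", "midfielder",
             "goalkeeper", "defender", "midfielder", "forward"]

counts = Counter(positions)
total = sum(counts.values())

# Relative frequency of each category: count / number of observations.
relative = {category: count / total for category, count in counts.items()}

# Text bar chart: bar length is proportional to the relative frequency.
for category, share in sorted(relative.items(), key=lambda kv: -kv[1]):
    print(f"{category:<11} {'#' * round(share * 40)} {share:.1%}")
```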

Histogram plot
• A histogram graphically summarizes the distribution
of a univariate data set.
• The histogram can be used to answer the following
questions:
• What kind of population distribution do the data come from?
• Where are the data located?
• How spread out are the data?
• Are the data symmetric or skewed?
• Are there outliers in the data?
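
A histogram is a frequency distribution over equal-width bins. The sketch below builds one by hand for the ages used earlier; the bin width of 5 and the starting edge of 15 are arbitrary choices for illustration.

```python
def histogram(values, bin_width, start):
    """Count how many values fall into each bin [left, left + bin_width)."""
    counts = {}
    for v in values:
        left = start + ((v - start) // bin_width) * bin_width
        counts[left] = counts.get(left, 0) + 1
    return dict(sorted(counts.items()))

ages = [17, 19, 21, 22, 23, 23, 23, 38]
hist = histogram(ages, bin_width=5, start=15)

# Print every bin, including empty ones, so gaps and outliers are visible.
for left in range(15, 40, 5):
    print(f"[{left}, {left + 5}) {'#' * hist.get(left, 0)}")
```

The empty bins between 25 and 35 make the isolated value 38 stand out, which is how a histogram hints at outliers.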

Example of frequency distributions

Box plot
• A box plot displays the lowest value, the lower quartile
(Q1), the median (Q2), the upper quartile (Q3), the
highest value, and the mean.

Box plot (2)

• The box plot can provide answers to the following
questions:
• Is a factor significant?
• Does the location differ between subgroups?
• Does the variation differ between subgroups?
• Are there any outliers?
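
The numbers a box plot draws can be sketched with the standard library. Two assumptions are made here that the slides do not state: quartiles use the "inclusive" interpolation method of `statistics.quantiles`, and outliers are flagged with the common 1.5 × IQR rule.

```python
import statistics

ages = [17, 19, 21, 22, 23, 23, 23, 38]

# Quartiles Q1, Q2 (median), Q3; other interpolation methods give slightly different values.
q1, q2, q3 = statistics.quantiles(ages, n=4, method="inclusive")
summary = (min(ages), q1, q2, q3, max(ages))

# Assumed outlier rule: points beyond 1.5 * IQR from the box.
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in ages if x < lower_fence or x > upper_fence]

print("min, Q1, median, Q3, max:", summary)
print("outliers:", outliers)
```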

Skewness
• Skewness is a measure of asymmetry. A distribution, or
data set, is symmetric if it looks the same to the left
and right of the center point
• Symmetrical distribution

Mean = median = mode = 3

Negative, positive skewness

Kurtosis
• Kurtosis is a measure of whether the data are peaked
or flat relative to a normal distribution. Data sets with
high kurtosis tend to have a distinct peak near the
mean, decline rather rapidly, and have heavy tails.
Data sets with low kurtosis tend to have a flat top near
the mean rather than a sharp peak.
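
Skewness and kurtosis can be sketched from central moments. This assumes the moment-based definitions (third and fourth standardized moments, dividing by n); the slides do not give formulas, and other conventions, such as excess kurtosis, differ by a constant. Both data sets are made up for illustration.

```python
def central_moment(values, order):
    """Average of (x - mean) ** order over all values."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** order for x in values) / len(values)

def skewness(values):
    # Third standardized moment: 0 for symmetric data, > 0 for a long right tail.
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def kurtosis(values):
    # Fourth standardized moment: about 3 for a normal distribution.
    return central_moment(values, 4) / central_moment(values, 2) ** 2

symmetric = [1, 2, 3, 3, 3, 4, 5]       # mean = median = mode = 3
right_skewed = [1, 2, 3, 3, 3, 4, 12]   # one large value stretches the right tail

for name, data in (("symmetric", symmetric), ("right-skewed", right_skewed)):
    print(f"{name:<13} skewness={skewness(data):+.3f} kurtosis={kurtosis(data):.3f}")
```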

Understanding relationships

Scatter plot
• identify whether a relationship exists between two
continuous variables measured on the ratio or interval
scales
• two variables are plotted on the x-and y-axis
• each point is a single observation.

Scatter plot
• Scatter plots can provide answers to the following
questions:
• Are variables X and Y related?
• Are variables X and Y linearly related?
• Are variables X and Y non-linearly related?
• Does the variation in Y change depending on X?
• Are there outliers?
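
One numerical companion to the first two questions is the Pearson correlation coefficient, sketched below on made-up data. It only measures linear association, so a value near 0 does not rule out a non-linear relationship, which is why the plot itself still matters.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

x = [1, 2, 3, 4, 5]
y_positive = [2 * v + 1 for v in x]   # exact increasing line
y_negative = [-v for v in x]          # exact decreasing line
y_curved = [(v - 3) ** 2 for v in x]  # related to x, but not linearly

print("positive linear:", pearson_r(x, y_positive))
print("negative linear:", pearson_r(x, y_negative))
print("curved:         ", pearson_r(x, y_curved))
```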

Scatter plot: No relationship

Scatter plot: Strong linear (positive - negative correlation)

Scatter plot: Sinusoidal relationship (damped)

Scatter plot: variation of Y does not depend on X (homoscedastic)

Scatter plot: Outlier

Scatterplot matrix
• a collection of scatterplots organized into a grid
(or matrix).
• Each scatterplot shows the relationship between a
pair of variables

Lag plot
• For data values $Y_1, Y_2, \ldots, Y_N$, the k-period (or kth) lag
of the value $Y_i$ is defined as the data point that
occurred k time points before time i. That is, $\mathrm{Lag}_k(Y_i) = Y_{i-k}$.
For example, $\mathrm{Lag}_1(Y_2) = Y_1$ and $\mathrm{Lag}_3(Y_{10}) = Y_7$
• Lag plots can provide answers to the following
questions:
• 1. Are the data random?
• 2. Is there serial correlation in the data?
• 3. What is a suitable model for the data?
• 4. Are there outliers in the data?
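
A lag plot draws each value against its k-period lag. The sketch below builds those point pairs and, as a numerical companion to question 2, a lag-k autocorrelation coefficient. The function names and the two test series are illustrative assumptions.

```python
def lag_pairs(series, k=1):
    """Return the (Y[i-k], Y[i]) pairs plotted in a k-period lag plot (k >= 1)."""
    return list(zip(series[:-k], series[k:]))

def autocorrelation(series, k=1):
    """Lag-k sample autocorrelation: near 0 for random data, near +/-1 for strong serial correlation."""
    n = len(series)
    mean = sum(series) / n
    numerator = sum((series[i] - mean) * (series[i - k] - mean) for i in range(k, n))
    denominator = sum((y - mean) ** 2 for y in series)
    return numerator / denominator

trend = list(range(20))      # steadily increasing series
alternating = [1, -1] * 10   # flips sign at every step

print(lag_pairs([1, 2, 3, 4], k=1))
print("trend:      ", autocorrelation(trend, k=1))
print("alternating:", autocorrelation(alternating, k=1))
```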

Lag plot patterns

• Random Data

Data with weak autocorrelation

Data with moderate autocorrelation

Data with high autocorrelation

Sinusoidal data

Contour plots
• Show a three-dimensional surface on a two-dimensional
plane. Contour lines indicate elevations that are the same
• The contour plot is used to answer the question
• How does Z change as a function of X and Y?

Demo

Identifying and understanding groups
Clustering Methods in Exploratory Analysis

Motivation
• Decomposing a data set into simpler subsets helps
make sense of the entire collection of observations
• uncover relationships in the data such as groups of
consumers who buy certain combinations of products
• identify rules from the data
• discover observations dissimilar from those in the major
identified groups (possible errors or anomalies)

Clustering
• A way of grouping together data samples that are
similar in some way - according to some criteria
• A form of unsupervised learning – you generally don’t
have examples demonstrating how the data should be
grouped together

Can we find things that are close together?

• Clustering organizes things that are close into groups

• How do we define close?
• How do we group things?
• How do we visualize the grouping?
• How do we interpret the grouping?

Types of clustering
• Hierarchical clustering
• Flat clustering

Hierarchical clustering
• An agglomerative approach
• Find closest two things
• Put them together
• Find next closest
• Requires
• A defined distance
• A merging approach
• Produces
• A tree showing how close things are to each other
(dendrogram)
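
The agglomerative steps above can be sketched as follows. Two choices are assumptions, since the slides leave them open: Euclidean distance as the defined distance, and single linkage (distance between the closest members) as the merging approach. The recorded merge list is the information a dendrogram would display.

```python
import math

def single_linkage(points, target_clusters=1):
    """Repeatedly merge the two closest clusters; return clusters and merge history."""
    clusters = [[i] for i in range(len(points))]  # start: every point is its own cluster
    merges = []
    while len(clusters) > target_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((list(clusters[a]), list(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]  # put the closest two together
        del clusters[b]
    return clusters, merges

# Made-up 2D points (illustration only).
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
clusters, merges = single_linkage(points, target_clusters=3)
print(clusters)
for left, right, distance in merges:
    print(f"merged {left} + {right} at distance {distance:.2f}")
```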

Distances
• A method of clustering needs a way to measure how
similar observations are to each other.
• Continuous - Euclidean distance
• Continuous - correlation similarity
• Binary - Manhattan distance
• Pick a distance/similarity that makes sense for the
problem

Euclidean distance

Manhattan distance
• The Manhattan distance between two points is the sum of
the lengths of the projections of the line segment between
the points onto the coordinate axes

Cosine distance
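
The three distances can be written in a few lines. The formulas on the slides were not preserved, so this sketch uses the standard definitions, with cosine distance taken as 1 minus cosine similarity (an assumption about the slide's convention).

```python
import math

def euclidean(p, q):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """Sum of the absolute differences along each coordinate axis."""
    return sum(abs(a - b) for a, b in zip(p, q))

def cosine_distance(p, q):
    """1 - cosine similarity: depends on the angle between vectors, not their lengths."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return 1 - dot / (norm_p * norm_q)

print(euclidean((0, 0), (3, 4)))
print(manhattan((0, 0), (3, 4)))
print(cosine_distance((1, 0), (0, 1)))
```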

Agglomerative Hierarchical Clustering Algorithm

Linkage rules

AHC result

K-means clustering
• A partitioning approach
• Fix a number of clusters
• Get “centroids” of each cluster
• Assign things to closest centroid
• Recalculate centroids
• Requires
• A defined distance metric
• A number of clusters
• An initial guess as to cluster centroids
• Produces
• Final estimate of cluster centroids
• An assignment of each point to clusters
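
The loop described above (assign to the closest centroid, recalculate centroids, repeat) can be sketched in plain Python. Euclidean distance, the stopping rule, the data, and the initial centroids are all assumptions for illustration; library implementations add smarter initialization.

```python
import math

def k_means(points, centroids, max_iter=100):
    """Alternate assignment and centroid update until the centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        # Assign each point to its closest centroid.
        labels = [min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Recalculate each centroid as the mean of its assigned points.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels

# Made-up points forming two visible groups, plus an initial guess for the centroids.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, labels = k_means(points, centroids=[(0, 0), (10, 10)])
print("centroids:", centroids)
print("labels:   ", labels)
```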


Dimensionality reduction
Principal Components Analysis and
Singular Value Decomposition

Motivation
• Most machine learning and data mining techniques
may not be effective for high-dimensional data
• Curse of Dimensionality. Irrelevant and redundant features
can “confuse” learners!
• The intrinsic dimension may be small.

Curse of dimensionality
• The required number of samples (to achieve the same
accuracy) grows exponentially with the number of
variables!
• In practice: number of training examples is fixed!
• => the classifier’s performance usually will degrade for a
large number of features!
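
A back-of-the-envelope sketch of the exponential growth, under a simplifying assumption that is not from the slides: to keep the same sampling density, each feature's range is covered by a fixed number of samples per axis, so the total is that number raised to the number of features.

```python
def samples_needed(samples_per_axis, n_features):
    """Grid-coverage estimate: samples_per_axis ** n_features."""
    return samples_per_axis ** n_features

for n_features in (1, 2, 3, 10):
    print(n_features, "features ->", samples_needed(10, n_features), "samples")
```

With a fixed training set, every added feature therefore spreads the same points over a much larger space.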

After a certain point, increasing the dimensionality of the problem by adding
new features would actually degrade the performance of the classifier.
Motivation
• Dimensionality reduction is an effective approach to
downsizing data
• Visualization: projection of high-dimensional data onto 2D or
3D.
• Data compression: efficient storage and retrieval.
• Noise removal: positive effect on query accuracy.

Data compression
Reduce data from 2D to 1D (figure axes: inches and cm)

Data compression (2)
Reduce data from 3D to 2D

Principal Component Analysis (PCA) problem formulation

Reduce from 2-dimension to 1-dimension: Find a direction (a vector)
onto which to project the data so as to minimize the projection error.
Reduce from n-dimension to k-dimension: Find k vectors
onto which to project the data, so as to minimize the projection error.
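
For the 2D to 1D case, the direction can be found in closed form from the 2×2 covariance matrix. The sketch below is one way to do it, not necessarily the method used in the lecture demo. The data are made up: the same lengths recorded in inches and in cm, so the points lie on a line and the projection error should be essentially zero.

```python
import math

def first_principal_direction(points):
    """Return the mean and the unit vector along the direction of largest variance (2D only)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / n
    syy = sum((y - mean_y) ** 2 for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / n
    # Angle of the major axis of a 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return (mean_x, mean_y), (math.cos(theta), math.sin(theta))

def project(points, mean, direction):
    """1D coordinate of each point along the direction (the compressed data)."""
    return [(x - mean[0]) * direction[0] + (y - mean[1]) * direction[1] for x, y in points]

def reconstruct(scores, mean, direction):
    """Map the 1D coordinates back into 2D."""
    return [(mean[0] + z * direction[0], mean[1] + z * direction[1]) for z in scores]

inches = [1.0, 2.0, 3.0, 4.0, 5.0]
points = [(v, 2.54 * v) for v in inches]  # (inches, cm) pairs

mean, direction = first_principal_direction(points)
scores = project(points, mean, direction)
restored = reconstruct(scores, mean, direction)
error = max(math.dist(p, r) for p, r in zip(points, restored))

print("direction:", direction)
print("max projection error:", error)
```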
Demo

References

Thank you for your attention!
Q&A

Exploratory data analysis in Tableau

CitiesExt.csv
• Ten countries with the highest population, bar chart
showing populations
• Pie chart showing relative number of cities with
negative longitude and positive longitude. Label the
two slices “west” for west of the Prime Meridian
(negative longitude), and “east” for east of the Prime
Meridian (positive longitude)
• Is there any relationship between the latitude of
cities in a country (x-axis) and the population of that
country (y-axis)? (scatter plot)

PlayersExt.csv
• Create a bar chart showing the average number of minutes
played by players in each of the four positions.
• Create a stacked bar chart for teams that played more than
4 games, showing their number of wins, draws, and losses.
• Create a pie chart showing the relative percentage of teams
with 0, 1, and 2 red cards. Note: the pie should have three
slices.
• Create a scatterplot of players showing passes (y-axis)
versus minutes (x-axis). (Why are there some lines of dots?)
• Create a map of countries colored light to dark blue based
on how many goals their team made (“goalsFor”).
• Create a pie chart showing the relative percentage of
players making <= 0.25 passes per minute, >= 0.5 passes
per minute, and between 0.25 and 0.5.

Block plot
