L4 Exploratory Analysis en
Introduction to
Data Science
(IT4142E)
Contents
• Lecture 1: Overview of Data Science
• Lecture 2: Data crawling and preprocessing
• Lecture 3: Data cleaning and integration
• Lecture 4: Exploratory data analysis
• Lecture 5: Data visualization
• Lecture 6: Multivariate data visualization
• Lecture 7: Machine learning
• Lecture 8: Big data analysis
• Lecture 9: Capstone Project guidance
• Lecture 10+11: Text, image, graph analysis
• Lecture 12: Evaluation of analysis results
Learning outcomes
• Understand key elements in exploratory data analysis
(EDA)
• Explain and use common summary statistics for EDA
• Plot and explain common graphs and charts for EDA
Motivation
• Before making inferences from data, it is essential to
examine all your variables.
• Why?
• To understand your data
• To listen to the data:
• to catch mistakes
• to see patterns in the data
• to find violations of statistical assumptions
• to generate hypotheses
• …and because if you don’t, you will have trouble later
[Figure: stages of the data science process, including "2. Gather data", "3. Analyze data", and "4. Product"]
Source: Foundational Methodology for Data Science, IBM, 2015
Exploratory data analysis (EDA) focus
• The focus is on the data—its structure, outliers, and
models suggested by the data.
• The EDA approach makes use of (and shows) all of the
available data, so there is no corresponding loss of
information.
• Summary statistics
• Visualization
• Clustering and anomaly detection
• Dimensionality reduction
EDA definition
• EDA is not so much a set of techniques as an
attitude/philosophy about how a data analysis should
be carried out.
• Helps to select the right tool for preprocessing or analysis
• Makes use of humans’ abilities to recognize patterns in data
EDA common questions
• What is a typical value?
• What is the uncertainty for a typical value?
• What is a good distributional fit for a set of numbers?
• Does an engineering modification have an effect?
• Does a factor have an effect?
• What are the most important factors?
• Are measurements coming from different laboratories
equivalent?
• What is the best function for relating a response variable to
a set of factor variables?
• What are the best settings for factors?
• Can we separate signal from noise in time dependent data?
• Can we extract any structure from multivariate data?
• Does the data have outliers?
EDA strategy
• Examine variables one by one, then look at the
relationships among the different variables
• Start with graphs, then add numerical summaries of
specific aspects of the data
• Be aware of attribute types
• Categorical vs. Numeric
EDA techniques
• Graphical techniques
• scatter plots, character plots, box plots, histograms, probability
plots, residual plots, and mean plots.
• Quantitative techniques
Describing univariate data
Types of variables
Measures of central tendency
• Measures of Location: estimate a location parameter
for the distribution; i.e., to find a typical or central value
that best describes the data.
• Measures of Scale: characterize the spread, or
variability, of a data set. Measures of scale are simply
attempts to estimate this variability.
• Skewness and Kurtosis
Mean
• The mean is the average value of a set of observations:
the sum of their values divided by the number of
observations:
$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Median
• The median is the value of the point which has half the
data smaller than that point and half the data larger
than that point.
• Calculation
• If there are an odd number of observations, find the middle
value
• If there are an even number of observations, find the middle
two values and average them
• Example
• Age of participants: 17 19 21 22 23 23 23 38
• Median = (22+23)/2 = 22.5
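The mean and median of the age example can be checked with a few lines of Python. This is a minimal sketch using the standard-library `statistics` module; the variable names are chosen here for illustration.

```python
from statistics import mean, median

# Ages of participants from the example above
ages = [17, 19, 21, 22, 23, 23, 23, 38]

avg = mean(ages)    # sum of the values divided by the number of observations
mid = median(ages)  # even count: average of the two middle values (22 and 23)

print(f"mean = {avg}, median = {mid}")
```

The mean (23.25) sits above the median (22.5) because the single large value 38 pulls it upward.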
Mode
• The mode is the most commonly reported value for a
particular variable
• E.g. 3, 4, 5, 6, 7, 7, 7, 8, 8, 9. Mode = 7
• E.g. 3, 4, 5, 6, 7, 7, 7, 8, 8, 8, 9. Modes = {7, 8} (bimodal; the two modes are not averaged)
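As a quick check of the two examples, the sketch below uses `statistics.multimode` from the Python standard library (Python 3.8+), which returns every value that ties for the highest count. The list names are illustrative.

```python
from statistics import multimode

one_mode = [3, 4, 5, 6, 7, 7, 7, 8, 8, 9]
two_modes = [3, 4, 5, 6, 7, 7, 7, 8, 8, 8, 9]

modes_a = multimode(one_mode)   # a single most frequent value
modes_b = multimode(two_modes)  # 7 and 8 both occur three times

print(modes_a, modes_b)
```

When two values tie, both are reported as modes of the data.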
Which location measure is best?
• Mean is best for symmetric distributions without outliers
• Median is useful for skewed distributions or data with
outliers
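To see why the median is preferred when outliers are present, the sketch below reuses the age data from the median example and compares both measures with and without the value 38. Treating 38 as the outlier is an assumption made for this illustration.

```python
from statistics import mean, median

ages = [17, 19, 21, 22, 23, 23, 23, 38]  # 38 lies far from the other values
without_38 = ages[:-1]                   # same data with the value 38 dropped

# How much does each measure move when the extreme value is included?
mean_shift = mean(ages) - mean(without_38)
median_shift = median(ages) - median(without_38)

print(f"mean shifts by {mean_shift:.2f}, median shifts by {median_shift:.2f}")
```

The mean moves by roughly two years while the median moves by half a year, which is the robustness the slide refers to.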
Run sequence plot
• displays observed data in a time sequence.
• The run sequence plot can be used to answer the
following questions
• Are there any shifts in location?
• Are there any shifts in variation?
• Are there any outliers?
Bar charts
• A bar chart displays the relative frequencies for the
different values.
• Equivalently, it presents categorical data with rectangular
bars whose heights or lengths are proportional to the values
that they represent
Histogram plot
• A histogram graphically summarizes the distribution
of a univariate data set.
• The histogram can be used to answer the following
questions:
• What kind of population distribution do the data come from?
• Where are the data located?
• How spread out are the data?
• Are the data symmetric or skewed?
• Are there outliers in the data?
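The numbers behind a histogram are just counts per bin. The sketch below is a minimal pure-Python version with equal-width bins; the function name `histogram_counts`, the choice of three bins, and the reuse of the age data are assumptions for illustration, and a plotting library would normally do this step for you.

```python
def histogram_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins.

    Assumes the values are not all identical (otherwise the bin width is 0).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value goes into the last bin instead of opening a new one
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


ages = [17, 19, 21, 22, 23, 23, 23, 38]
counts = histogram_counts(ages, n_bins=3)

# Text rendering: one '#' per observation in the bin
for i, c in enumerate(counts):
    print(f"bin {i}: {'#' * c}")
```

Here most observations land in the first bin and one lands in the last, with an empty bin in between, which is how a possible outlier shows up in a histogram.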
Box plot
• A box plot displays the lowest value, the lower quartile
(Q1), the median (Q2), the upper quartile (Q3), the
highest value, and (in some variants) the mean.
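The quantities a box plot draws can be computed directly. This sketch uses `statistics.quantiles` (Python 3.8+) with its default "exclusive" method; other tools use slightly different quartile conventions, so Q1 and Q3 may differ a little between libraries. The data reuse and the `summary` dictionary are illustrative choices.

```python
from statistics import mean, quantiles

ages = [17, 19, 21, 22, 23, 23, 23, 38]

# Three cut points that split the sorted data into four equal-probability groups
q1, q2, q3 = quantiles(ages, n=4)

summary = {
    "min": min(ages),
    "Q1": q1,
    "median": q2,
    "Q3": q3,
    "max": max(ages),
    "mean": mean(ages),
}

for name, value in summary.items():
    print(f"{name:>6}: {value}")
```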
Skewness
• Skewness is a measure of asymmetry. A distribution, or
data set, is symmetric if it looks the same to the left
and right of the center point.
• Symmetrical distribution
Mean = median = mode = 3
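One common way to put a number on skewness is the moment-based coefficient, the third central moment divided by the 1.5 power of the second. The slides do not say which formula they use, so this choice and both sample data sets are assumptions for illustration.

```python
def skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (0 for perfectly symmetric data)."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - m) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


symmetric = [1, 2, 2, 3, 3, 3, 4, 4, 5]        # mean = median = mode = 3
right_skewed = [17, 19, 21, 22, 23, 23, 23, 38]  # long tail to the right

skew_sym = skewness(symmetric)
skew_right = skewness(right_skewed)

print(f"symmetric: {skew_sym:.3f}, right-skewed: {skew_right:.3f}")
```

A value near zero indicates symmetry; a positive value indicates a longer right tail.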
Kurtosis
• Kurtosis is a measure of whether the data are peaked
or flat relative to a normal distribution. Data sets with
high kurtosis tend to have a distinct peak near the
mean, decline rather rapidly, and have heavy tails.
Data sets with low kurtosis tend to have a flat top near
the mean rather than a sharp peak.
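A minimal sketch of a kurtosis calculation follows. It assumes the moment-based definition (fourth central moment over the squared second) and reports excess kurtosis, i.e. the value minus 3, so that a normal distribution scores 0. The two tiny data sets are made up for illustration.

```python
def excess_kurtosis(values):
    """Moment-based kurtosis minus 3 (a normal distribution gives 0)."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n  # second central moment
    m4 = sum((v - m) ** 4 for v in values) / n  # fourth central moment
    return m4 / m2 ** 2 - 3


flat = [1, 2, 3, 4, 5]                     # evenly spread, no peak
peaked = [-10, 0, 0, 0, 0, 0, 0, 0, 10]    # sharp peak with two extreme values

k_flat = excess_kurtosis(flat)
k_peaked = excess_kurtosis(peaked)

print(f"flat: {k_flat:.2f}, peaked: {k_peaked:.2f}")
```

The flat data give a negative excess kurtosis and the peaked, heavy-tailed data give a positive one.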
Understanding relationships
Scatter plot
• identify whether a relationship exists between two
continuous variables measured on the ratio or interval
scales
• two variables are plotted on the x-and y-axis
• each point is a single observation.
Scatter plot
• Scatter plots can provide answers to the following
questions:
• Are variables X and Y related?
• Are variables X and Y linearly related?
• Are variables X and Y non-linearly related?
• Does the variation in Y change depending on X?
• Are there outliers?
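A numerical companion to the scatter plot is the Pearson correlation coefficient, which measures linear association only. The slides do not name this statistic, so it is brought in here as an assumption; the sketch shows why a plot is still needed, since a clear non-linear relationship can have zero correlation.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


x = [-2, -1, 0, 1, 2]
y_linear = [2 * v + 1 for v in x]   # exact linear relationship
y_quadratic = [v ** 2 for v in x]   # exact, but non-linear, relationship

r_linear = pearson_r(x, y_linear)
r_quadratic = pearson_r(x, y_quadratic)

print(f"linear: r = {r_linear:.2f}, quadratic: r = {r_quadratic:.2f}")
```

The quadratic case has r = 0 even though Y is fully determined by X, so "are X and Y related?" cannot be answered from the correlation alone.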
Scatter plot: No relationship
Scatter plot: Sinusoidal relationship
(damped)
Scatter plot: Outlier
Scatterplot matrix
• a collection of scatterplots organized into a grid
(or matrix).
• Each scatterplot shows the relationship between a
pair of variables
Lag plot
• For data values $Y_1, Y_2, \ldots, Y_N$, the k-period (or kth) lag
of the value $Y_i$ is defined as the data point that
occurred k time points before time i. That is, $\mathrm{Lag}_k(Y_i) = Y_{i-k}$.
For example, $\mathrm{Lag}_1(Y_2) = Y_1$ and $\mathrm{Lag}_3(Y_{10}) = Y_7$.
• Lag plots can provide answers to the following
questions:
• 1. Are the data random?
• 2. Is there serial correlation in the data?
• 3. What is a suitable model for the data?
• 4. Are there outliers in the data?
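The points of a lag plot are simply pairs of each value with the value k steps earlier. A minimal sketch, assuming the series is a plain Python list in time order; the helper name `lag_pairs` and the toy series are illustrative.

```python
def lag_pairs(series, k=1):
    """Return (Y[i-k], Y[i]) pairs: the points drawn in a lag-k plot (k >= 1)."""
    return list(zip(series[:-k], series[k:]))


# Toy series where the value equals its time index: Y_1 = 1, ..., Y_10 = 10
series = list(range(1, 11))

pairs_lag1 = lag_pairs(series, k=1)
pairs_lag3 = lag_pairs(series, k=3)

print(pairs_lag1[0])   # pairs Y_1 with Y_2
print(pairs_lag3[-1])  # pairs Y_7 with Y_10
```

Plotting the first element of each pair on the x-axis and the second on the y-axis gives the lag plot. Random data produce a structureless cloud, while serially correlated data line up along a curve.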
Data with weak autocorrelation
Data with high autocorrelation
Sinusoidal data
Contour plots
• show a three-dimensional surface on a two-
dimensional plane. Contour lines indicate elevations
that are the same
• The contour plot is used to answer the question
• How does Z change as a function of X and Y?
Demo
Identifying and understanding groups
Clustering Methods in Exploratory Analysis
Motivation
• Decomposing a data set into simpler subsets helps
make sense of the entire collection of observations
• uncover relationships in the data such as groups of
consumers who buy certain combinations of products
• identify rules from the data
• discover observations dissimilar from those in the major
identified groups (possible errors or anomalies)
Clustering
• A way of grouping together data samples that are
similar in some way - according to some criteria
• A form of unsupervised learning – you generally don’t
have examples demonstrating how the data should be
grouped together
Types of clustering
• Hierarchical clustering
• Flat clustering
Hierarchical clustering
• An agglomerative approach
• Find closest two things
• Put them together
• Find next closest
• Requires
• A defined distance
• A merging approach
• Produces
• A tree showing how close things are to each other
(dendrogram)
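The "find closest two things, put them together, find next closest" loop can be sketched in a few lines. This is a naive illustration, assuming Euclidean distance and single linkage (cluster distance is the distance of the closest pair of members) as the merging approach; the function names and sample points are hypothetical, and real libraries use much faster algorithms.

```python
from itertools import combinations
from math import dist


def single_linkage(a, b, points):
    """Distance between clusters a and b: their closest pair of members."""
    return min(dist(points[i], points[j]) for i in a for j in b)


def agglomerate(points):
    """Repeatedly merge the two closest clusters; return the merge history."""
    clusters = [(i,) for i in range(len(points))]  # start: one cluster per point
    merges = []
    while len(clusters) > 1:
        # Find the closest two clusters ...
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: single_linkage(pair[0], pair[1], points))
        # ... and put them together
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
        merges.append((a, b, single_linkage(a, b, points)))
    return merges


points = [(0, 0), (0, 1), (5, 5), (5, 6.5), (20, 20)]
merges = agglomerate(points)

for a, b, d in merges:
    print(f"merge {a} + {b} at distance {d:.2f}")
```

The merge history, read from first to last, is the information a dendrogram draws from bottom to top: the height of each join is the distance at which the two clusters were merged.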
Distances
• A method of clustering needs a way to measure how
similar observations are to each other.
• Continuous - Euclidean distance
• Continuous - correlation similarity
• Binary - Manhattan distance
• Pick a distance/similarity that makes sense for the
problem
Euclidean distance
Manhattan distance
• The sum of the lengths of the projections of the line
segment between the points onto the coordinate axes
Cosine distance
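The three distances named in these slides can be sketched in plain Python as follows. The formulas on the original slides are not available in this text, so the definitions below are the commonly used ones and are stated as an assumption; in particular, cosine distance is taken to be one minus the cosine similarity.

```python
from math import sqrt


def euclidean(p, q):
    """Straight-line distance: square root of the summed squared differences."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    """Sum of absolute coordinate differences (projections onto the axes)."""
    return sum(abs(a - b) for a, b in zip(p, q))


def cosine_distance(p, q):
    """1 - cosine similarity; depends on direction only, not on length."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = sqrt(sum(a * a for a in p))
    norm_q = sqrt(sum(b * b for b in q))
    return 1 - dot / (norm_p * norm_q)


d_euc = euclidean((0, 0), (3, 4))
d_man = manhattan((0, 0), (3, 4))
d_cos_orthogonal = cosine_distance((1, 0), (0, 1))
d_cos_parallel = cosine_distance((1, 2), (2, 4))

print(d_euc, d_man, d_cos_orthogonal, round(d_cos_parallel, 12))
```

Vectors pointing the same way have a cosine distance near 0 regardless of their lengths, which is why this measure is popular for comparing documents of different sizes.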
Agglomerative Hierarchical Clustering Algorithm
Linkage rules
AHC result
K-means clustering
• A partitioning approach
• Fix a number of clusters
• Get “centroids” of each cluster
• Assign things to closest centroid
• Recalculate centroids
• Requires
• A defined distance metric
• A number of clusters
• An initial guess as to cluster centroids
• Produces
• Final estimate of cluster centroids
• An assignment of each point to clusters
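The assign-then-recalculate loop described above can be sketched as below. This is a minimal illustration assuming Euclidean distance and user-supplied initial centroids; the function names, the sample points, and the stopping rule (stop when centroids no longer change) are choices made here, not taken from the slides.

```python
from math import dist


def assign(points, centroids):
    """Label each point with the index of its closest centroid."""
    return [min(range(len(centroids)), key=lambda c: dist(p, centroids[c]))
            for p in points]


def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and centroid recalculation until nothing changes."""
    centroids = list(centroids)
    for _ in range(max_iter):
        labels = assign(points, centroids)
        updated = []
        for c, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # New centroid = coordinate-wise mean; an empty cluster keeps its old one
            updated.append(tuple(sum(xs) / len(members) for xs in zip(*members))
                           if members else old)
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(points, centroids)


points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
final_centroids, labels = kmeans(points, centroids=[(0, 0), (10, 10)])

print(final_centroids)
print(labels)
```

The result depends on the initial guess, so in practice the procedure is often run several times from different starting centroids.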
Dimensionality reduction
Principal Components Analysis and
Singular Value Decomposition
Motivation
• Most machine learning and data mining techniques
may not be effective for high-dimensional data
• Curse of Dimensionality. Irrelevant and redundant features
can “confuse” learners!
• The intrinsic dimension may be small.
Curse of dimensionality
• The required number of samples (to achieve the same
accuracy) grows exponentially with the number of
variables!
• In practice: number of training examples is fixed!
• => the classifier’s performance usually will degrade for a
large number of features!
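One simple way to see the exponential growth is to count grid cells. Assuming each variable is split into the same number of bins (10 here, an arbitrary choice for illustration), covering the space with at least one sample per cell needs as many samples as there are cells.

```python
def grid_cells(bins_per_variable, n_variables):
    """Number of cells when every variable is split into the same number of bins."""
    return bins_per_variable ** n_variables


cells_by_dimension = {d: grid_cells(10, d) for d in (1, 2, 3, 10)}

for d, cells in cells_by_dimension.items():
    print(f"{d:>2} variables -> {cells:,} cells to cover")
```

With a fixed number of training examples, most cells are empty once the number of variables grows, which is one intuition for why performance degrades.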
Data compression
[Figures: two examples of reducing data from 2D to 1D; axis unit: cm]
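Reducing 2D data to 1D can be sketched with principal components analysis on two variables. For the 2x2 case the principal direction has a closed form, which is used below in place of a general SVD routine; the function name and sample points are assumptions for illustration.

```python
from math import atan2, cos, sin


def pca_2d_to_1d(points):
    """Project 2D points onto their first principal component."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    centered = [(x - mx, y - my) for x, y in points]

    # Entries of the 2x2 covariance matrix
    sxx = sum(x * x for x, _ in centered) / n
    syy = sum(y * y for _, y in centered) / n
    sxy = sum(x * y for x, y in centered) / n

    # Angle of the direction of largest variance (closed form for 2x2)
    theta = 0.5 * atan2(2 * sxy, sxx - syy)
    direction = (cos(theta), sin(theta))

    # 1D coordinates: signed length of each centered point along that direction
    scores = [x * direction[0] + y * direction[1] for x, y in centered]
    return (mx, my), direction, scores


points = [(1, 2), (2, 4), (3, 6), (4, 8)]  # lies exactly on the line y = 2x
center, direction, scores = pca_2d_to_1d(points)

# Map the 1D scores back to 2D to see how much information was lost
reconstructed = [(center[0] + s * direction[0], center[1] + s * direction[1])
                 for s in scores]
max_error = max(abs(px - rx) + abs(py - ry)
                for (px, py), (rx, ry) in zip(points, reconstructed))

print(direction, scores, max_error)
```

Because these points lie exactly on a line, one number per point is enough to reconstruct them. For noisy data the reconstruction error equals the variation discarded by dropping the second component.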
Thank you for your attention!
Q&A
CitiesExt.csv
• Ten countries with the highest population, bar chart
showing populations
• Pie chart showing relative number of cities with
negative longitude and positive longitude. Label the
two slices “west” for west of the Prime Meridian
(negative longitude), and “east” for east of the Prime
Meridian (positive longitude)
• Is there any relationship between the latitude of
cities in a country (x-axis) and the population of that
country (y-axis)? (scatter plot)
PlayersExt.csv
• Create a bar chart showing the average number of minutes
played by players in each of the four positions.
• Create a stacked bar chart for teams that played more than
4 games, showing their number of wins, draws, and losses.
• Create a pie chart showing the relative percentage of teams
with 0, 1, and 2 red cards. Note: the pie should have three
slices.
• Create a scatterplot of players showing passes (y-axis)
versus minutes (x-axis). (Why are there some lines of dots?)
• Create a map of countries colored light to dark blue based
on how many goals their team made (“goalsFor”).
• Create a pie chart showing the relative percentage of
players making <= 0.25 passes per minute, >= 0.5 passes
per minute, and between 0.25 and 0.5.
Block plot