2. Multivariate Analysis & Visualization
Time Series Data
Data with a start and an end, and one or more values at each time.
No quantitative relationship receives more attention than values changing over time.
A statistic from 15 years of newspaper analysis: more than 75% of the graphics feature time series analysis.
Ex
Time Series
Stock performance over time (day/hour)
Fiscal spending over time (year)
patterns
Rate of change
Co-variation
Cycles
Exceptions
Trend: the overall tendency of a series of values to increase or decrease.
Line Charts
Anatomy of
Line Chart
Drawn by
- first, mapping each data item (time and value) to a 2D point on a Cartesian coordinate grid, and
- then connecting all of these points with a line.
Typically, the y-axis has a quantitative value, while the x-axis is a timescale or a sequence of intervals.
https://datavizcatalogue.com/methods/line_graph.html
Line charts are one of the oldest forms of data
visualization
https://www.eea.europa.eu/data-and-maps/daviz/learn-more/chart-dos-and-donts
Time Series Displays
Line Graphs
Area Chart
Bar Graphs
Point Plots
Radar Graphs
Heatmaps
Box plots
Scatter plots
Area chart: represents the change in one or more quantities over time.
Similar to a line graph: data points are plotted and then connected by line segments to show the value of a quantity at several different times.
Different from a line graph: the area between the x-axis and the line is filled in with color or shading.
Anatomy of
Area Chart
https://datavizcatalogue.com/methods/area_graph.html
Area Chart
Primary Use: Changes over time
Data type: quantitative values
Visual property: height, slope, area
Area charts are a good choice to use when
you want to show a trend over time but aren't
as concerned with showing exact values.
William Playfair [1786] is credited with inventing the area chart as well as the line, bar, and pie charts.
Area Chart
Wiki
“The Scottish Scoundrel Who Changed How We
See Data: When he wasn’t blackmailing lords
and being sued for libel, William Playfair
invented the pie chart, the bar graph, and the
line graph.” Atlas Obscura (June 28, 2016).
Playfair
failed at silversmithing,
falsely claimed to have invented the semaphore
telegraphy,
tried blackmailing a Scottish lord,
sold tracts of American land he didn't actually
own to French nobility, and
died in poverty and obscurity.
http://conversableeconomist.blogspot.com/2017/08/william-playfair-inventor-of-bar-graph.html
Difference Area Chart
Exports and imports to and from Denmark & Norway from 1700 to 1780 (Wiki)
Overlapping Area Chart
https://en.wikipedia.org/wiki/Area_chart
Stacked Area Chart
https://www.theguardian.com/environment/2018/dec/05/brutal-
news-global-carbon-emissions-jump-to-all-time-high-in-2018
Stacked Area Chart: Anatomy
https://datavizcatalogue.com/methods/stacked_area_graph.html
Stacked Area Chart
Stream
Graph/Stream
Chart
http://nvd3.org/examples/stackedArea.html
Stacked Chart vs. Stream chart
Stream Graph/Stream Chart:
Anatomy
https://datavizcatalogue.com/methods/stream_graph.html
Making Sense of Stream Graph
Cons:
Cluttered with large datasets.
It's impossible to read the exact values visualised.
http://www.visualisingdata.com/2010/08/making-sense-of-streamgraphs/
Stream Chart / Stream Graph
The overall effect
is artful.
The New York Times
produced this
fascinating graphic
showing box
office receipts for films
from 1986-2008.
https://blog.datawrapper.de/area-charts/
Time Series Displays
Line Graphs/Area Chart
Bar Graphs
Point Plots
Radar Graphs
Heatmaps
Radar/Spider Graphs: the circular shape can be used to show the cyclical nature of time.
Candlestick chart: used a lot in the financial world. It visualizes the change of price information against time by providing multiple price parameters at the same time.
Candle stick chart
Candle stick
https://en.wikipedia.org/wiki/Candlestick_chart
Sample market data and
Candle Stick chart
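As a rough illustration of the "multiple price parameters" behind each candle, the sketch below groups raw prices into one candle per day. It assumes the usual four parameters (open, high, low, close); the function name `to_ohlc` and the sample prices are made up for this example.

```python
def to_ohlc(ticks):
    """Group (day, price) ticks, already in time order, into one candle per day."""
    candles = {}
    for day, price in ticks:
        if day not in candles:
            # First price of the day opens the candle.
            candles[day] = {"open": price, "high": price, "low": price, "close": price}
        else:
            candle = candles[day]
            candle["high"] = max(candle["high"], price)
            candle["low"] = min(candle["low"], price)
            candle["close"] = price  # last price seen so far closes the candle
    return candles

# Hypothetical sample market data.
ticks = [("Mon", 100.0), ("Mon", 104.0), ("Mon", 98.5), ("Mon", 101.0),
         ("Tue", 101.5), ("Tue", 99.0), ("Tue", 103.0)]
candles = to_ohlc(ticks)
for day, candle in candles.items():
    print(day, candle)
```

The same grouping idea applies when aggregating a series to coarser time intervals (hour to day, day to month).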
Time Series Analysis and Visualization: Best Practices
Use Line Chart
Line chart with point plot: use for non-regular time intervals
Bar chart: use for comparing specific x-axis values
Aggregating to various time intervals
Point Plots
Radar Graphs
Bar plot
Time Series Viz:
Timeline Chart
Events (Nominal data) happening as a
function of time.
Time Series Viz: Timeline Chart
Source: https://www.cnn.com/2023/02/03/economy/january-jobs-report-final/index.html
Bar Chart
Frequency of Letters in English Text
Primary Use: Comparing categories, comparing changes over time
Data Types: Discrete or categorical independent variable and quantitative dependent variable
Visual Properties: Length/Height, Color-hue
Bar Chart
Presents time series or categorical data with
rectangular bars with heights or lengths proportional
to the values that they represent.
The bars can be plotted vertically or horizontally.
Used for comparing.
Bar Chart
Source: https://datavizcatalogue.com/methods/bar_chart.html
Source https://www.cdc.gov/measles/cases-outbreaks.html
Effective visualization: Vertical or horizontal
With long category names or a large number of categories, a vertical bar chart can be unintuitive.
Effective visualization: Vertical or horizontal
Artistic
Rendition
using Shapes
Span Chart
Source:
Report from Center for Excellence in Education
https://www.cee.org/about-us/cee-index-
excellence-stem-education
Anatomy of Span Chart
https://datavizcatalogue.com/methods/span_chart.html
Multivariate Bar Chart
BAR CHARTS CAN PRESENT AND COMPARE MORE THAN ONE GROUP
OF DATA AT A TIME.
Population Pyramid
Bivariate Horizontal Bar chart:
Independent dimension:
Categorical (ordinal) data
Dependent dimension:
Quantitative data
Population: Male and Female
Source:
https://www.populationpyramid.net/india/2020/
Population Pyramid
Source: https://www.visualcapitalist.com/population-pyramids-compared/
Grouped bar charts are good for comparing
between each element in the categories, and
comparing elements across categories or
changes over time.
Grouped Bar Chart / Clustered Bar Chart
Source:
https://www.reuters.com/world/india/despite-world-beating-
growth-indias-lack-jobs-threatens-its-young-2023-05-30/
Group Bar Chart
Primary Use:
Comparing Category, Comparing
changes over time
Data Types:
Discrete or categorical
Independent Variable
and Quantitative Dependent
Variable
Visual Properties:
Length/Height, Color-hue
Presents more than one group of data at a
time.
Compares Part-to-whole relationship over time.
Stacked
Grouped
Bar Chart
https://www.pewresearch.org/fact-tank/2023/01/03/118th-congress-has-a-record-number-of-women/ft_22-01-03_womencongress_1/
A variation on the stacked bar chart is one in
which the stacks diverge from a central
baseline in opposite directions.
Used to show differences in opposing
sentiments or groups
Diverging
Bar Chart
Bar Chart
Shows variation among multiple groups as
percentages rather than absolute numbers.
Grouped
Bar Chart
General Guideline
Ordering:
Use the bars in any order for nominal categories,
but in order, for ordinal categories.
Bar Chart
Source:
https://exceljet.net/glossary/ordinal-data
Limit the number of bars.
Bar Chart
Too crowded:
Not ideal
General Guideline
Order:
Use the bars in any order for nominal categories; keep the natural order for ordinal categories.
Limit the number of bars.
Use Horizontal:
When there is a large number of categories and insufficient space to fit all the columns required for a vertical bar chart across the page.
When categories have long names (difficult to put them below vertical bars).
Some choose a horizontal chart for nominal categories and a vertical chart for ordinal categories.
Some call vertical charts Column Charts and horizontal charts Bar Charts (ex: Excel).
Plotly Express bar chart Support
With bar charts it's easy to distort the proportion by changing the
scale.
Bar Charts are Good, but …
Source:
https://tradingeconomics.com/india/enrolment-in-upper-secondary-education-both-sexes-number-wb-data.html
Misleading charts 1
see http://viz.wtf/
Bar Chart: Best Practices
General Guideline
Order:
Use the bars in any order for nominal categories; keep the natural order for ordinal categories.
Limit the number of bars.
Use Horizontal:
When there is a large number of categories and insufficient space to fit all the columns required for a vertical bar chart across the page.
When categories have long names (difficult to put them below vertical bars).
Some choose a horizontal chart for nominal categories and a vertical chart for ordinal categories.
Some call vertical charts Column Charts and horizontal charts Bar Charts (ex: Excel).
In general, do not show a bar chart without a zero baseline.
Bar Charts are Good, but …
Confusing label
https://www.idealmedicalcare.org/1103-doctor-suicides-13-reasons-why/
Bar Charts are Good, but …
Source:
https://viz.wtf/post/90690163613/bar-chart-table#notes
Bar Chart: Best Practices
Make proper labeling
Misleading Charts!
see http://viz.wtf/
Bar Chart: Best Practices
General Guideline
Order: ...
Limit the number of bars.
Use Horizontal: ...
Label your scales correctly
Do not use multiple scales
See http://viz.wtf/
Variations to Bar Chart
Lollipop Chart
Very similar to a normal bar chart: the bar is replaced by a line anchored at the x-axis, with a dot at the end to mark the value.
Conveys the same information as a bar chart.
https://datavizproject.com/data-type/lollipop-chart/
Lollipop Chart
The dot at the end may be replaced by another symbol.
As in a bar chart, the coordinates may be flipped to make the segments horizontal.
Source: http://www.datarevelations.com/tag/ryan-sleeper
Radial Bar (column) Chart
Uses a grid of concentric circles to plot bars
on. Each circle on the graph represents a
value on a scale, while the radial dividers (lines
spanning from the center) are used for each
category.
Same purpose as the bar-chart
Re: https://datavizcatalogue.com/methods/radial_column_chart.html
Circular Bar Chart
Pros: Compact
Cons: the problem with Radial Bar Charts is
that the bar lengths can be misinterpreted.
Each bar on the outside gets relatively
longer than the last, even if they represent
the same value.
Our visual systems are better at interpreting
straight lines, so the Cartesian bar chart is a
better choice for comparing values.
Therefore, Radial Bar Charts are used
primarily for aesthetic reasons.
Polar Area diagram (Coxcomb chart)
Example: UK temperatures in 2012.
Same purpose as the bar chart.
Suitable for cyclic data
Constant angle division
Radius proportional to the value.
Pros: Good for cyclical data.
Cons: The area of the sectors is proportional to the squared radius, so it amplifies the data.
http://prcweb.co.uk/radialbarchart/
This chart was famously used by Florence
Nightingale to communicate about the
avoidable deaths of soldiers during the
Crimean war.
In her chart, the area represented the data, not
the radius.
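A small numeric sketch of the radius-versus-area point. It assumes sectors with a constant angle, so the drawn area grows with the square of the radius; the helper names and the values 1, 4, 9 are invented for illustration.

```python
import math

def sector_radius(value, by_area=True):
    """Radius for a fixed-angle sector.

    by_area=True  : radius = sqrt(value), so the drawn AREA is proportional to the value.
    by_area=False : radius = value, so the drawn area is proportional to value squared.
    """
    return math.sqrt(value) if by_area else float(value)

def drawn_area_ratios(radii):
    """Sector areas relative to the first sector (area grows with radius squared)."""
    return [(r / radii[0]) ** 2 for r in radii]

values = [1, 4, 9]
radius_mapped = [sector_radius(v, by_area=False) for v in values]
area_mapped = [sector_radius(v, by_area=True) for v in values]

print(drawn_area_ratios(radius_mapped))  # the largest sector looks 81 times bigger, not 9
print(drawn_area_ratios(area_mapped))    # drawn areas follow the data
```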
Nightingale’s
Rose Chart
https://datavizcatalogue.com/methods/nightingale_rose_chart.html
Funnel Chart
https://plotly.com/python/funnel-charts/
Ranking
ORDERED COMPARISON.
Bar Chart for Ranking
Bar charts are great for ranking.
Done by sorting the data.
Plotting libraries such as Plotly also allow you to draw the bars in an ordered fashion.
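A minimal sketch of ranking by sorting before plotting. The region names and numbers are invented, and the text bars only stand in for a real horizontal bar chart.

```python
# Hypothetical category totals.
sales = {"East": 120, "North": 310, "South": 95, "West": 240}

# Sort categories by value, largest first, to get the ranking order.
ranked = sorted(sales.items(), key=lambda item: item[1], reverse=True)

# Print a crude horizontal bar chart in ranked order (one '#' per 10 units).
for rank, (name, value) in enumerate(ranked, start=1):
    print(f"{rank}. {name:<6} {'#' * (value // 10)} {value}")
```

Passing the categories to a plotting library in this sorted order is usually enough to get a ranked bar chart.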
Ranking
Patterns in Ranking
Uniform: all values are roughly the same.
Different:
Uniformly different: the differences between adjacent values are the same.
Non-uniformly different: the differences between adjacent values vary.
Increasingly/decreasingly different
Alternating difference: differences begin small, shift to large and then shift back to small.
Exceptional: one or more values are exceptionally different from the next.
Lollipop chart for ranking
Visual Analysis of 2D Data
Part-to-whole
Two of the simplest types of analysis involve comparing parts with the whole and ranking.
Source: https://datavizcatalogue.com/methods/pie_chart.html
How do individual parts make up the whole of something?
ex: Faculty, Student proportion of BITS.
Categorical subdivisions are measured as a ratio to the whole (i.e., a percentage out of 100%).
Pie Chart
Wiki
The earliest known pie chart is generally
credited to William Playfair's Statistical
Breviary of 1801.
Simple and very good way to show Part-to-
whole relationship
Compare relative sizes.
Pie Chart
https://en.wikipedia.org/wiki/William_Playfair
Doughnut (Donut) Chart
https://www.pewresearch.org/religion/2021/06/
29/religion-in-india-tolerance-and-segregation/
Wiki
3D Pie Chart (Not to be used)
https://developers.google.com/chart/interactive/docs/gallery/piechart
Infamous
MacWorld's
iPhone Pie
Chart
src: http://r-statistics.co/Top50-Ggplot2-Visualizations-MasterList-R-Code.html#Waffle%20Chart
Waffle Chart
Part-to-whole relationship
Mapping:
area, color- hue, symbol
Pie Chart: To use or not to use
https://en.wikipedia.org/wiki/Pie_chart
Most used and most criticized of all charts!
Why criticism?
Our eyes are great at comparing differences in 2-D location and differences in line length, but not 2-D areas and angles.
Pie Chart Problems
http://speakingppt.com/2013/03/18/why-tufte-is-flat-out-wrong-about-pie-charts/
Pie Chart may be preferable if…
When you’re comparing percentages, bars are NOT more effective than pie charts.
Which one communicates more quickly?
http://speakingppt.com/2013/03/18/why-tufte-is-flat-out-wrong-about-pie-charts/
Part-to-whole and Ranking relationship
Pie charts can be good.
Bar charts may be more precise for ranking and part-to-whole (when plotting % values).
Pareto Charts: Part-to-whole and
ranking relationship
Pareto charts are useful for
ranking relationship and to
examine the cumulative
contribution of parts to the whole.
May be considered to be the best
compromise for showing Part-to-
whole and ranking.
Pareto Chart
A Pareto chart contains both bars and a line graph, where individual values are represented in descending order by bars, and the cumulative total is represented by the line. [Wiki]
The name comes from the Pareto Principle (also known as the 80/20 rule), which states that, for many events, roughly 80% of the effects come from 20% of the causes. It is named after the Italian economist Vilfredo Pareto.
ex: ~80% of federal income tax is paid by 20% of earners.
https://en.wikipedia.org/wiki/Pareto_chart
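The two ingredients of a Pareto chart (bars in descending order and a cumulative line) can be computed as in the sketch below. The defect counts and the name `pareto_table` are made up for illustration.

```python
def pareto_table(counts):
    """Return (category, count, cumulative percent) rows, largest count first."""
    total = sum(counts.values())
    rows, running = [], 0
    for category, count in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        running += count
        rows.append((category, count, 100.0 * running / total))
    return rows

# Hypothetical defect counts.
defects = {"dent": 30, "other": 8, "scratch": 50, "misprint": 12}
rows = pareto_table(defects)
for category, count, cumulative in rows:
    print(f"{category:<9} {count:>3}  cumulative {cumulative:5.1f}%")
```

The counts give the bar heights and the cumulative percentages give the points of the line.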
Ranking Changes over Time:
Slope Charts
Slope charts:
dot chart for ranking
line chart for time sequence
https://blogs.sas.com/content/iml/2020/03/02/deviation-plot-baseline.html
Diverging Bars: Z-score
Deviation is sometimes computed as a standard score, also called Z-score or Z-value. Its computation is as follows:
$$z = \frac{x - \mu}{\sigma}$$
Diverging Bars: T-score
A T-score is a type of normalized score, usually used for psychometric tests, in which the mean is 50 and the standard deviation is 10.
Psychometric tests:
• Intelligence.
• Aptitudes and skills.
• Personality.
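A short sketch of both scores, assuming the population standard deviation is used and that the T-score is obtained by rescaling the Z-score as T = 50 + 10z (which gives a mean of 50 and a standard deviation of 10). The raw scores are invented.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Standard scores: distance from the mean in units of standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def t_scores(values):
    """Rescale z-scores to mean 50 and standard deviation 10 (assumed T = 50 + 10z)."""
    return [50 + 10 * z for z in z_scores(values)]

raw = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical test scores: mean 5, std 2
z = z_scores(raw)
t = t_scores(raw)
print(z)
print(t)
```

In a diverging bar chart, negative z-scores would extend to one side of the baseline and positive ones to the other.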
Deviation as a function of time
Deviation analysis
Ref: https://clauswilke.com/dataviz/visualizing-associations.html#associations-scatterplots
Scatter Plot
https://datavizcatalogue.com/methods/scatterplot.html
Strength of Correlation
https://en.wikipedia.org/wiki/Pearson_correlation_coefficient
Variance
Measures how far a set of numbers is spread out; a measure of how much something changes.
Variance describes how much a random variable differs from its expected value (mean).
Defined as the average of the squares of the differences between the individual (observed) and the expected value.
It is never negative.
Note: The variance is not the average difference from the expected value.
Population Variance
$$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N} \quad \text{where} \quad \mu = \frac{\sum x_i}{N}$$
Sample Variance
$$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N - 1}$$
https://simple.m.wikipedia.org/wiki/Variance
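A quick numeric check of the population and sample variance, written out by hand; the eight data values are made up. The shortcut form (mean of squares minus squared mean) is included as an assumption about how the population formula can be rearranged.

```python
data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations
n = len(data)
mu = sum(data) / n

# Sum of squared deviations from the mean.
sum_sq_dev = sum((x - mu) ** 2 for x in data)

population_var = sum_sq_dev / n        # divide by N
sample_var = sum_sq_dev / (n - 1)      # divide by N - 1
shortcut_var = sum(x * x for x in data) / n - mu ** 2  # mean of squares minus squared mean
std_dev = population_var ** 0.5        # standard deviation = square root of variance

print(population_var, sample_var, shortcut_var, std_dev)
```

Python's `statistics.pvariance` and `statistics.variance` compute the same two quantities.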
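Variance and standard deviation
Population Variance
$$
\begin{aligned}
\sigma^2 &= \frac{\sum (x_i - \mu)^2}{N} \quad \text{where} \quad \mu = \frac{\sum x_i}{N} \\
&= \frac{\sum \left(x_i^2 - 2\mu x_i + \mu^2\right)}{N} \\
&= \frac{\sum x_i^2}{N} - \frac{2\mu \sum x_i}{N} + \frac{N\mu^2}{N} \\
&= \frac{\sum x_i^2}{N} - 2\mu^2 + \mu^2 \\
&= \frac{\sum x_i^2}{N} - \mu^2
\end{aligned}
$$
Sample Variance
$$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N - 1} = \frac{\sum x_i^2}{N - 1} - \frac{N}{N - 1}\mu^2$$
$$\text{standard deviation} = \sqrt{\text{variance}}$$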
Covariance
A measure of the relationship between two random variables.
Evaluates to what extent the variables change together.
In other words, it’s a measure of the variance between two variables.
A positive covariance means that both move together while a negative covariance means they move inversely.
Note: It does not assess the dependency between variables.
Population Covariance
$$cov(x, y) = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N} \quad \text{where} \quad \mu_x = \frac{\sum x_i}{N} \text{ and } \mu_y = \frac{\sum y_i}{N}$$
Sample Covariance
$$cov(x, y) = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N - 1}$$
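A small sketch of the covariance computation, showing the sign for variables that move together and for variables that move inversely. The function name and the data are illustrative only.

```python
def covariance(xs, ys, sample=False):
    """Population covariance by default; divide by N - 1 when sample=True."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (n - 1 if sample else n)

xs = [1, 2, 3, 4, 5]
ys_up = [2, 4, 6, 8, 10]     # rises with x
ys_down = [10, 8, 6, 4, 2]   # falls as x rises

print(covariance(xs, ys_up))               # positive: move together
print(covariance(xs, ys_down))             # negative: move inversely
print(covariance(xs, ys_up, sample=True))  # sample version
```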
Correlation Coefficient r
Standardized x (or z-score of x): $z_x = \frac{x - \mu_x}{\sigma_x}$
and standardized y (or z-score of y): $z_y = \frac{y - \mu_y}{\sigma_y}$
$$r_{xy} = \frac{1}{N} \sum_i z_{x,i}\, z_{y,i}$$
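The correlation coefficient as the average product of z-scores can be sketched as follows. The sketch assumes population standard deviations (dividing by N), and the data are made up.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Average product of the z-scores of x and y (population standard deviations)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    std_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    std_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return sum(((x - mean_x) / std_x) * ((y - mean_y) / std_y)
               for x, y in zip(xs, ys)) / n

xs = [1, 2, 3, 4, 5]
r_up = pearson_r(xs, [2, 4, 6, 8, 10])    # points on a rising straight line
r_down = pearson_r(xs, [10, 8, 6, 4, 2])  # points on a falling straight line
r_weak = pearson_r(xs, [2, 9, 1, 8, 3])   # no clear linear pattern
print(r_up, r_down, r_weak)
```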
Correlation r - Interpretation
Predictive modeling:
Correlation analysis can be used to identify which variables are most
strongly related to an outcome, and this information can be used to
build predictive models.
Correlation Analysis
Line of best fit: Trend line
‘Best fit’ means that the sum of the squared differences between the actual Y values and the predicted Y values for X is a minimum.
Why not minimize the plain differences between the actual and predicted Y values?
Positive differences would offset negative ones.
Ref: https://clauswilke.com/dataviz/visualizing-trends.html
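A minimal least-squares sketch for a straight line y = a + bx, using the standard closed-form slope and intercept. The data points are invented; the last lines show that the plain residuals cancel out while the squared residuals do not.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y = a + b*x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    b = s_xy / s_xx
    a = mean_y - b * mean_x
    return a, b

xs = [1, 2, 3, 4]
ys = [2, 3, 5, 6]
a, b = fit_line(xs, ys)

residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
sum_residuals = sum(residuals)                      # positive and negative cancel
sum_squared = sum(r * r for r in residuals)         # the quantity actually minimized
print(a, b, sum_residuals, sum_squared)
```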
Use of Correlation Analysis (contd…)
Predictive modeling:
Correlation analysis can be used to identify which variables are most
strongly related to an outcome, and this information can be used to
build predictive models.
Identifying confounding variables:
Correlation analysis can help identify variables that may be influencing
the relationship between two other variables, thus allowing researchers
to control for these confounding variables in their analysis.
Least Squares Regression
http://mathworld.wolfram.com/LeastSquaresFitting.html
http://mathworld.wolfram.com/LeastSquaresFittingPerpendicularOffsets.html
Least Squares Minimization
see: http://mathworld.wolfram.com/LeastSquaresFitting.html
Best-fit line coefficients and correlation coefficient
see: http://mathworld.wolfram.com/LeastSquaresFitting.html
Linear Regression
https://en.wikipedia.org/wiki/Local_regression
Local Regression
https://en.wikipedia.org/wiki/Local_regression
Local Regression
$$\bar{x} = \frac{\sum w_i x_i}{\sum w_i}, \qquad \bar{y} = \frac{\sum w_i y_i}{\sum w_i}, \qquad a = \bar{y} - b\bar{x}$$
Upcoming
Assignment
Table Scraping from Web
Correlation Analysis
Recap: Scatter Plot
https://datavizcatalogue.com/methods/scatterplot.html
Recap: Variance, Covariance
Variance:
Computation: $var(x) = \sigma_x^2 = \frac{\sum (x_i - \mu)^2}{N}$
Correlation Analysis
Note: The extreme values +1 and -1
indicate perfect linear relationship (points
lie exactly along a straight line)
Linear Regression
Local fitting
For the fit at point x, the fit is made using points in a neighborhood of x.
The size of the neighborhood is controlled by the parameter “span”.
The resulting smooth curve is called the LOESS curve.
See: https://clauswilke.com/dataviz/visualizing-trends.html
Source: https://en.wikipedia.org/wiki/Local_regression
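A rough sketch of local fitting at a single point, under several assumptions that are not in the slides: the neighborhood is the nearest `span` fraction of the points, the weights are tricube weights (a common LOESS choice), and a weighted straight line is fitted. The function name `loess_point` and the test data are hypothetical.

```python
def loess_point(x0, xs, ys, span=0.5):
    """Estimate y at x0 from a weighted line fitted to the nearest span*n points."""
    n = len(xs)
    k = max(3, int(round(span * n)))  # neighborhood size (at least 3 points)
    nearest = sorted(range(n), key=lambda i: abs(xs[i] - x0))[:k]
    h = max(abs(xs[i] - x0) for i in nearest) or 1.0  # neighborhood half-width

    # Tricube weights: close points count most, the farthest point gets weight 0.
    w = [(1 - (abs(xs[i] - x0) / h) ** 3) ** 3 for i in nearest]
    sw = sum(w)
    if sw == 0:  # degenerate neighborhood: fall back to a plain average
        return sum(ys[i] for i in nearest) / k

    # Weighted means, then weighted slope b and intercept a = y_bar - b * x_bar.
    x_bar = sum(wi * xs[i] for wi, i in zip(w, nearest)) / sw
    y_bar = sum(wi * ys[i] for wi, i in zip(w, nearest)) / sw
    s_xx = sum(wi * (xs[i] - x_bar) ** 2 for wi, i in zip(w, nearest))
    if s_xx == 0:
        return y_bar
    b = sum(wi * (xs[i] - x_bar) * (ys[i] - y_bar) for wi, i in zip(w, nearest)) / s_xx
    a = y_bar - b * x_bar
    return a + b * x0

xs = list(range(10))
ys = [2 * x + 1 for x in xs]  # exactly linear data, so the local fit should recover it
smoothed = [loess_point(x, xs, ys, span=0.5) for x in xs]
print(smoothed)
```

Evaluating `loess_point` over a grid of x values and joining the results gives the smooth curve; a larger `span` gives a smoother curve.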
Visual Analysis of 2D Data
CLUSTERING AND CLUSTER ANALYSIS
Cluster and Cluster Analysis
https://en.wikipedia.org/wiki/Cluster_analysis
Visualizing Bi-Variate Data
https://www.xarg.org/2018/04/how-to-plot-a-covariance-error-ellipse/
https://www.visiondummy.com/2014/04/draw-error-ellipse-representing-covariance-matrix/
Cluster Analysis: Application
k-Means Clustering:
1. Start with k randomly selected points as centroids (each centroid defines one cluster),
either randomly generated
or randomly selected from the data points.
2. Assign each data point to its nearest centroid, based on the squared Euclidean distance.
3. Recalculate the centroid of the group assigned to each centroid by computing the mean of the points in the group.
4. Perform steps 2 and 3 iteratively until one of the following conditions is met:
the centroids have stabilized (there is no change in their values because the clustering has been successful),
or the defined number of iterations has been reached.
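The steps above can be sketched in plain Python as follows. This is a minimal illustration, not a production implementation: initial centroids are sampled from the data points, an empty cluster simply keeps its old centroid, and the function name, `seed` parameter and sample points are all assumptions made for the example.

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Basic k-means on a list of equal-length tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # step 1: k random data points
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Step 2: assign each point to the nearest centroid (squared Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Step 3: move each centroid to the mean of its assigned points.
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                new_centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(old)  # empty cluster: keep the old centroid
        # Step 4: stop when the centroids have stabilized.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Two well-separated, made-up groups of points.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
centroids, clusters = kmeans(points, k=2)
print(sorted(centroids))
```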
Choosing the value of “k”
https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set
k-Means++
https://en.wikipedia.org/wiki/Contour_line
Contour Plot
Contour Plot
Density Contour Plot
Multivariate Data Visualization
Multivariate Data
Example:
Multiple 1D Data: Distribution analysis
Source: https://blogs.sas.com/content/graphicallyspeaking/2013/03/24/custom-box-plot
Multiple 1D Data: Distribution analysis
https://datavizcatalogue.com/methods/violin_plot.html
Multiple Time series data
Scatter plot Matrix: for Multivariate Correlation analysis
A scatterplot is designed to compare only two variables.
A scatterplot matrix addresses the problem: it explores how several variables interact with each other.
Correlation Analysis for multiple
variables
Correlation Display for multiple
variables
How do we visualize Multivariate
Data?
Example:
Multivariate Analysis
https://plotly.com/python/3d-scatter-plots/ (px.scatter_3d)
Bubble Chart
https://observablehq.com/d/005a613631862b1b
Parallel Coordinates
Ideal for comparing many variables together and seeing the relationships between them.
Each variable is given its own axis and all the axes are placed in parallel to each other.
Source: https://datavizcatalogue.com/methods/parallel_coordinates.html
Parallel Coordinates
Dataviz Catalogue
How to visually read Parallel Coordinates
https://en.wikipedia.org/wiki/Radar_chart
Dimension Reduction
https://towardsdatascience.com/https-medium-com-abdullatif-h-dimensionality-reduction-for-dummies
Principal Component Analysis (PCA)
https://medium.com/@aptrishu/understanding-principle-component-analysis-e32be0253ef0
Variance, Covariance (Recap)
Variance
A measure of variability; it simply measures how spread out the data set is.
Mathematically: the average squared deviation from the mean.
Covariance
A measure of the extent to which corresponding elements from two sets of ordered data move in the same or opposite direction.
The covariance matrix is symmetric.
As discussed earlier, we want the data to be spread out, i.e. it should have high variance along the dimensions.
If two dimensions are independent of each other, their covariance is zero.
Covariance Matrix
Let us assume that the data is reshaped in a form where dimensions are
Covariance Matrix of X?
$$C_X = \frac{1}{n} X X^T$$
PCA Analysis
Let’s say the dataset X has zero mean and its covariance matrix is $C_X$.
That means $C_X = \frac{1}{n} X X^T$.
We want to transform X to Y such that its covariance matrix $C_Y$ is a diagonal matrix.
Let $Y = PX$.
Then
$$C_Y = \frac{1}{n} Y Y^T = \frac{1}{n} (PX)(PX)^T = \frac{1}{n} P X X^T P^T = P \left(\frac{1}{n} X X^T\right) P^T$$
$$C_Y = P\, C_X P^T$$
What P to choose such that CY is a diagonal matrix?
In linear algebra, eigen decomposition is the factorization of a
matrix into a diagonal form, whereby the matrix is represented in
terms of its eigenvalues and eigenvectors.
https://en.wikipedia.org/wiki/Eigendecomposition_of_a_matrix#Real_symmetric_matrices
Eigenvector and Eigenvalue of a
matrix
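If $Av = \lambda v$, where A is a matrix, v is a nonzero vector and $\lambda$ is a
scalar, then v and $\lambda$ are respectively an eigenvector and
eigenvalue of A.
$Av = \lambda I v$
So $(A - \lambda I)v = 0$
That means the determinant of $(A - \lambda I)$ is 0.
The determinant is a degree-“m” polynomial in $\lambda$ and has
“m” roots (counted with multiplicity).
Hence, an m×m matrix A has “m” eigenvalues, and a real symmetric A (such as a
covariance matrix) also has “m” orthogonal eigenvectors.
$AV = V\Lambda$ or $A = V \Lambda V^{-1}$, where the columns of V are the eigenvectors and $\Lambda$ is the diagonal matrix of eigenvalues.
See: https://www.mghassany.com/MLcourse/principal-components-analysis.html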
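The eigenvector relations above can be checked numerically. This is a minimal sketch using NumPy's `numpy.linalg.eigh`, which is intended for symmetric matrices; the 2×2 matrix is a made-up example.

```python
import numpy as np

# Hypothetical symmetric 2x2 matrix (covariance matrices are symmetric).
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh returns eigenvalues in ascending order and eigenvectors as columns of V.
eigenvalues, V = np.linalg.eigh(A)
Lambda = np.diag(eigenvalues)

# Each column v of V satisfies A v = lambda v.
for lam, v in zip(eigenvalues, V.T):
    assert np.allclose(A @ v, lam * v)

# A V = V Lambda, and therefore A = V Lambda V^-1.
reconstructed = V @ Lambda @ np.linalg.inv(V)
print(eigenvalues)
```

For a symmetric matrix the eigenvectors are orthonormal, so the inverse of V equals its transpose.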
PCA Analysis
https://en.wikipedia.org/wiki/Eigendecomposition_of_a_matrix#Real_symmetric_matrices
PCA Demo
https://setosa.io/ev/principal-component-analysis/
PCA Dimension Reduction of IRIS
Dataset
PCA: simplified explanation
https://builtin.com/data-science/step-step-explanation-principal-
component-analysis
PCA Analysis (recap)
Compute
The covariance matrix of Y, $\frac{1}{n} Y Y^T$, is a diagonal matrix.
Multidimensional Scaling
Multidimensional Scaling (MDS)
Maps data from n-D space to k-D space such that the change in the
distances between the data points is minimized.
Given a distance matrix with the distances between each pair of
objects in a set, and a chosen number of dimensions, k, an
MDS algorithm places each object into k-dimensional space such
that the between-object distances are preserved as well as
possible.
For scatter-plot visualization we can choose k = 2.
https://en.m.wikipedia.org/wiki/Multidimensional_scaling
MDS Algorithm
Mathematically:
MDS takes an input matrix giving dissimilarities (say, distance) between
pairs of items and outputs a coordinate matrix whose configuration
minimizes a loss function called strain.
Compute the Euclidean distance between data item i and data item j
and minimize the stress function:
General MDS Algorithm
A typical strain/stress function: $\sum_{i<j} \left(d_{ij} - d_{ij}^{fitted}\right)^2$, where
the sum is over all pairs of the N data points.
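To make the stress function concrete, the sketch below computes it as a sum of squared differences between the original pairwise distances and the distances in a low-dimensional layout. The points, the naive "drop one coordinate" layout, and the helper names `pairwise_distances` and `raw_stress` are illustrative assumptions; an MDS algorithm would search for the layout that minimises this value.

```python
import numpy as np

def pairwise_distances(points):
    """Euclidean distance matrix for points given one per row."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def raw_stress(d_original, d_fitted):
    """Sum of squared distance differences, counting each pair i < j once."""
    i, j = np.triu_indices_from(d_original, k=1)
    return ((d_original[i, j] - d_fitted[i, j]) ** 2).sum()

# Hypothetical 3-D points and a naive 2-D layout that just drops the z coordinate.
points_3d = np.array([[0.0, 0.0, 0.0],
                      [1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0],
                      [0.0, 0.0, 1.0]])
layout_2d = points_3d[:, :2]

stress = raw_stress(pairwise_distances(points_3d), pairwise_distances(layout_2d))
print(stress)
```

A stress of zero would mean the layout reproduces every pairwise distance exactly; here the fourth point collapses onto the first, so the stress is positive.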
http://www.analytictech.com/networks/mds.htm
MDS Algorithm: which “k” to choose?
http://www.analytictech.com/networks/mds.htm
Scree plot
Scree is a collection of broken rock fragments at the base
of mountain cliffs, that has accumulated through periodic rockfall from cliff
faces.
https://en.wikipedia.org/wiki/Scree
Uncertainty Visualization
Uncertainty
Uncertainty
The lack of certainty, a state of limited knowledge where it is impossible
to exactly describe the existing state, a future outcome, or more than
one possible outcome. [Wiki]
Ex: Who will win the presidential election?
US Unemployment Rate & Uncertainty
“The unemployment rate declined by 0.2 percentage point to 5.2
percent,” the U.S. Bureau of Labor Statistics reported. [Aug 2021]
How is this rate computed?
From people collecting
unemployment insurance (UI)?
Govt counts every unemployed
person every month?
The government conducts a
monthly survey called the Current
Population Survey (CPS) to
measure the extent of
unemployment in the country.
Measurement of uncertainty
the range of possible values within which the true value lies and the
probability assigned to the possible values.
Often standard deviation and standard error are general measures of
uncertainty around a particular single choice (or the measured value or
a mean, median, or mode).
Wiki
Uncertainty Measure
https://serialmentor.com/dataviz/visualizing-uncertainty.html
Parameter estimates and their
uncertainties.
Bayesian approach:
We have some prior knowledge about the world, and we will use the
sample to update this knowledge.
Frequentist approach:
We make precise statements about the world without having any prior
knowledge in hand.
Standard Deviation vs Standard
Error
Standard deviation (SD) measures the amount of variability,
or dispersion, of a set of data around its mean:
how much spread there is in the data around the mean.
$SD = \sqrt{\frac{\sum_{i=1}^{N} (sample_i - mean)^2}{N-1}}$
Standard error of the mean (SEM) measures how far the sample mean
of the data is likely to be from the true population mean:
how accurate the sample mean is.
The SEM is always smaller than the SD (for N > 1).
$SEM = SD/\sqrt{N}$
SD is a measure of volatility (spread of the measured values).
For normally distributed data, about 68% of the values lie within 1 SD of the mean.
SEM is the standard deviation of the sample mean over repeated samples.
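A short worked example of the two quantities, assuming a made-up sample of eight measurements (the values are illustrative only):

```python
import numpy as np

# Hypothetical sample of N = 8 measurements.
sample = np.array([4.0, 5.0, 6.0, 5.0, 7.0, 3.0, 5.0, 5.0])
N = sample.size

mean = sample.mean()
sd = np.sqrt(((sample - mean) ** 2).sum() / (N - 1))   # same as sample.std(ddof=1)
sem = sd / np.sqrt(N)                                  # shrinks as N grows

print(mean, sd, sem)
```

Collecting more data leaves the SD roughly unchanged but reduces the SEM, which is why the SEM describes the accuracy of the mean rather than the spread of the data.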
Confidence Interval
https://www.dummies.com/education/math/statistics/how-to-calculate-a-confidence-interval-for-a-population-mean-when-you-know-its-standard-deviation/
Confidence Interval in Polling
See: https://www.dummies.com/education/math/statistics/how-to-determine-the-confidence-interval-for-a-population-proportion/
For polling visualization see: https://fivethirtyeight.com/
Error Bars for Uncertainty visualization
Note:
Best fit line does not mean it
is the right fit.
See for example:
(Recap) Best Fit line coefficients
and Correlation coefficient
Measure of Error in Linear Regression
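Standard SSE: $SSE = \sum_i (y_i - \hat{y}_i)^2$
SSE relative to the spread of y-values:
$\frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$
Coefficient of determination:
$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$
where $\hat{y}_i$ is the value predicted by the best-fit line and $\bar{y}$ is the mean of the y-values.
Value range is 0 to 1.
The closer to 1, the better the fit.
Compare it with the correlation coefficient.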
In some statistics books you may find the following equation:
$R^2 = \frac{\sum_i (\hat{y}_i - \bar{y})^2}{\sum_i (y_i - \bar{y})^2}$
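The comparison between the coefficient of determination and the correlation coefficient can be tried out directly. This sketch assumes made-up (x, y) data and uses `numpy.polyfit` for the best-fit line; for a straight-line fit with an intercept, the two printed values agree.

```python
import numpy as np

# Hypothetical (x, y) data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

b, a = np.polyfit(x, y, deg=1)            # slope b, intercept a of the best-fit line
y_hat = a + b * x

sse = ((y - y_hat) ** 2).sum()            # sum of squared errors
sst = ((y - y.mean()) ** 2).sum()         # spread of the y-values
r_squared = 1.0 - sse / sst               # coefficient of determination

r = np.corrcoef(x, y)[0, 1]               # Pearson correlation coefficient
print(r_squared, r ** 2)
```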
(recap) Local Regression
See:
https://clauswilke.com/dataviz/visualizing-trends.html Source:
https://en.wikipedia.org/wiki/Local_regression
(recap) Local Regression
https://en.wikipedia.org/wiki/Local_regression
Visual Analysis of 2D Data
CLUSTERING AND CLUSTER ANALYSIS
Cluster and Cluster Analysis
https://en.wikipedia.org/wiki/Cluster_analysis
Visualizing Bi-Variate Data
https://www.xarg.org/2018/04/how-to-plot-a-covariance-error-ellipse/
https://www.visiondummy.com/2014/04/draw-error-ellipse-representing-covariance-matrix/
Cluster Analysis: Application
https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set
k-Means++
https://en.wikipedia.org/wiki/Contour_line
Contour Plot
Density Contour Plot
Multivariate Data Visualization
Multivariate Data
Example:
(Recap) Multiple 1D Data: Distribution
analysis
Multiple distribution display
Box Plots
Violin Plots
Source: https://blogs.sas.com/content/graphicallyspeaking/2013/03/24/custom-box-plot
https://datavizcatalogue.com/methods/violin_plot.html
(Recap)Multiple Time series data
Multivariate Data
Example:
Scatter plot Matrix: for Multivariate
Correlation analysis
Scatterplot is designed to
compare only two
variables
Scatterplot Matrix (SPLOM)
addresses the problem.
Explores the correlation of
the data dimensions with
each other.
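A scatterplot matrix shows one scatter plot per pair of variables; its numeric counterpart is the matrix of pairwise correlation coefficients. The sketch below computes that matrix with `numpy.corrcoef` on synthetic data (the three variables are made up for illustration).

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data set: 100 observations of 3 variables (one per column).
a = rng.normal(size=100)
b = 2.0 * a + rng.normal(scale=0.1, size=100)   # strongly related to a
c = rng.normal(size=100)                        # generated independently of a and b
data = np.column_stack([a, b, c])

# rowvar=False tells NumPy that each column is a variable.
corr = np.corrcoef(data, rowvar=False)
print(np.round(corr, 2))
```

Each off-diagonal entry summarises the panel at the same position in the scatterplot matrix: values near ±1 correspond to panels where the points fall close to a line.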
Correlation Display for multiple
variables
Scatterplot is designed to
compare only two
variables
Scatterplot Matrix
addresses the problem.
Explores how several
variables interact with
each other.
https://seaborn.pydata.org/generated/seaborn.pairplot.html
Multivariate Analysis in a Single 2D plot
https://plotly.com/python/3d-scatter-plots/ px.scatter_3d
Bubble Chart
An extended scatter plot
- 3 quantitative dimensions
- 1 categorical dimension
Parallel Coordinates
Dataviz Catalogue
Parallel Coordinates
Ideal for comparing many variables together and seeing the
relationships between them.
Each variable is given its own axis, and all the axes are
placed parallel to each other.
Source: https://datavizcatalogue.com/methods/parallel_coordinates.html
How to visually read Parallel Coordinates
https://en.wikipedia.org/wiki/Radar_chart
Dimension
Reduction
Dimension Reduction
https://towardsdatascience.com/https-medium-com-abdullatif-h-dimensionality-reduction-for-dummies
Principal Component Analysis (PCA)
https://medium.com/@aptrishu/understanding-principle-component-analysis-e32be0253ef0
Variance, Covariance (Recap)
Variance
A measure of variability: how spread out the data set is.
Mathematically: the average squared deviation from the
mean.
Covariance
A measure of the extent to which corresponding elements from
two sets of ordered data move in the same or opposite
direction.
The covariance matrix is symmetric.
As discussed earlier, we want the data to be spread out, i.e. it
should have high variance along its dimensions.
If two dimensions are independent of each other, then their
covariance is zero.
Covariance Matrix
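Let us assume that the data is reshaped in the form
$$\begin{pmatrix} x_{1,1} & \cdots & x_{1,n} \\ \vdots & \ddots & \vdots \\ x_{d,1} & \cdots & x_{d,n} \end{pmatrix}$$
where dimensions are rows and observations are columns.
Covariance between two dimensions k and m?
$$\mathrm{cov}(k, m) = \frac{1}{n} \sum_{i=1}^{n} (x_{k,i} - \bar{x}_k)(x_{m,i} - \bar{x}_m)$$
For convenience we will transform the data rows to zero mean:
$$X = \begin{pmatrix} x_{1,1} - \bar{x}_1 & \cdots & x_{1,n} - \bar{x}_1 \\ \vdots & \ddots & \vdots \\ x_{d,1} - \bar{x}_d & \cdots & x_{d,n} - \bar{x}_d \end{pmatrix}$$
Covariance Matrix of X?
$$C_X = \frac{1}{n} X X^T$$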
PCA Analysis
Let’s say the dataset X has zero mean and its covariance matrix is
$C_X$.
That means $C_X = \frac{1}{n} X X^T$
We want to transform X to Y such that its covariance matrix $C_Y$ is a
diagonal matrix.
Let $Y = PX$
Then $C_Y = \frac{1}{n} Y Y^T = \frac{1}{n} (PX)(PX)^T = \frac{1}{n} P X X^T P^T$
$C_Y = P C_X P^T$
Covariance matrices are symmetric.
PCA Analysis
$C_Y = P C_X P^T$
What P to choose such that CY is a diagonal matrix?
In linear algebra, eigen decomposition is the factorization of a
matrix into a diagonal form, whereby the matrix is represented in
terms of its eigenvalues and eigenvectors.
https://en.wikipedia.org/wiki/Eigendecomposition_of_a_matrix#Real_symmetric_matrices
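One way to answer the question of which P makes the covariance of Y diagonal is to take the eigenvectors of the covariance of X as the rows of P. The sketch below checks this numerically with NumPy; the synthetic data, the 1/n normalisation, and the choice to keep two components are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 3 dimensions (rows) x 200 observations (columns),
# with the second row deliberately correlated with the first.
X = rng.normal(size=(3, 200))
X[1] += 0.8 * X[0]
X -= X.mean(axis=1, keepdims=True)        # zero-mean rows

n = X.shape[1]
C_X = X @ X.T / n                          # covariance of X

# Eigendecomposition of the symmetric matrix C_X = V Lambda V^T.
eigenvalues, V = np.linalg.eigh(C_X)
order = np.argsort(eigenvalues)[::-1]      # largest variance first
eigenvalues, V = eigenvalues[order], V[:, order]

P = V.T                                    # rows of P are eigenvectors of C_X
Y = P @ X
C_Y = Y @ Y.T / n                          # equals P C_X P^T

print(np.round(C_Y, 6))                    # off-diagonal entries are ~0
Y_2d = Y[:2]                               # keep the two highest-variance components
```

The diagonal of the transformed covariance holds the eigenvalues, i.e. the variance captured by each principal component, so dropping the last rows of Y discards the least variance.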
Eigenvector and Eigenvalue of a
matrix
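If $Av = \lambda v$, where A is a matrix, v is a nonzero vector and $\lambda$ is a
scalar, then v and $\lambda$ are respectively an eigenvector and
eigenvalue of A.
$Av = \lambda I v$
So $(A - \lambda I)v = 0$
That means the determinant of $(A - \lambda I)$ is 0.
The determinant is a degree-“m” polynomial in $\lambda$ and has
“m” roots (counted with multiplicity).
Hence, an m×m matrix A has “m” eigenvalues, and a real symmetric A (such as a
covariance matrix) also has “m” orthogonal eigenvectors.
$AV = V\Lambda$ or $A = V \Lambda V^{-1}$, where the columns of V are the eigenvectors and $\Lambda$ is the diagonal matrix of eigenvalues.
See: https://www.mghassany.com/MLcourse/principal-components-analysis.html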
PCA Analysis
https://en.wikipedia.org/wiki/Eigendecomposition_of_a_matrix#Real_symmetric_matrices
PCA Demo
https://setosa.io/ev/principal-component-analysis/
PCA Dimension Reduction of IRIS
Dataset
PCA: simplified explanation
https://builtin.com/data-science/step-step-explanation-principal-
component-analysis
PCA Analysis (recap)
Compute
https://en.wikipedia.org/wiki/Principal_component_analysis
Multidimensional Scaling
Multidimensional Scaling (MDS)
Maps data from n-D space to k-D space such that the change in the
distances between the data points is minimized.
Given a distance matrix with the distances between each pair of
objects in a set, and a chosen number of dimensions, k, an
MDS algorithm places each object into k-dimensional space such
that the between-object distances are preserved as well as
possible.
For scatter-plot visualization we can choose k = 2.
https://en.m.wikipedia.org/wiki/Multidimensional_scaling
MDS Algorithm
Mathematically:
MDS takes an input matrix giving dissimilarities (say, distance) between
pairs of items and outputs a coordinate matrix whose configuration
minimizes a loss function called strain.
Compute the Euclidean distance between data item i and data item j
and minimize the stress function:
Metric MDS
Centering the matrix $X$ (one observation per row, $n$ rows): $X_c = X - \bar{X} = \left(I - \frac{1}{n}\mathbf{1}\mathbf{1}^T\right) X$
Ref: https://stats.stackexchange.com/questions/14002/whats-the-difference-between-principal-component-analysis-and-multidimensional
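One common metric MDS variant, classical (Torgerson) MDS, applies the centering matrix on both sides of the squared distance matrix and then takes an eigendecomposition. The slides do not say which variant is meant, so the following is a sketch of that specific technique; the function name `classical_mds` and the test points are assumptions.

```python
import numpy as np

def classical_mds(D, k=2):
    """Embed items in k dimensions from an (n x n) Euclidean distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix I - (1/n) 1 1^T
    B = -0.5 * J @ (D ** 2) @ J                # double-centered Gram matrix
    eigenvalues, V = np.linalg.eigh(B)         # ascending order
    top = np.argsort(eigenvalues)[::-1][:k]    # indices of the k largest eigenvalues
    scale = np.sqrt(np.clip(eigenvalues[top], 0.0, None))
    return V[:, top] * scale                   # one row of coordinates per item

# Hypothetical check: four corners of a unit square, which are already 2-D.
points = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
D = np.sqrt(((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1))

coords = classical_mds(D, k=2)
D_fitted = np.sqrt(((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=-1))
print(np.abs(D - D_fitted).max())
```

The recovered coordinates may be rotated or reflected relative to the original points, since only the distances are preserved.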
General MDS Algorithm
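A typical stress function: $\sum_{i<j} \left(d_{ij} - d_{ij}^{fitted}\right)^2$
or its scaled version $\frac{\sum_{i<j} \left(d_{ij} - d_{ij}^{fitted}\right)^2}{\sum_{i<j} d_{ij}^2}$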
http://www.analytictech.com/networks/mds.htm
MDS Algorithm: which “k” to choose?
http://www.analytictech.com/networks/mds.htm
Scree plot
Scree is a collection of broken rock fragments at the base
of mountain cliffs, that has accumulated through periodic rockfall from cliff
faces.
https://en.wikipedia.org/wiki/Scree
Uncertainty Visualization
Uncertainty
Uncertainty
The lack of certainty, a state of limited knowledge where it is impossible
to exactly describe the existing state, a future outcome, or more than
one possible outcome. [Wiki]
Ex: Who will win the presidential election?
Example
Not Verified
Source: https://www.adda247.com/upsc-exam/unemployment-rate-in-india/
Unemployment
How is it measured?
Does the government count every unemployed person each year?
To do this, every home in the country would have to be contacted—just as in
the population census every 10 years.
This procedure would cost way too much and take far too long to produce the
data.
In addition, people would soon grow tired of having a census taker contact them
every month, year after year, to ask about job-related activities.
Unemployment Measure: In US
The government conducts a monthly survey called the Current Population Survey
(CPS) to measure the extent of unemployment in the country.
About 60,000 eligible households are in the sample for this survey.
This amounts to approximately 110,000 individuals each month, who are asked about their labor force
activities (jobholding and job seeking).
The sample is representative of the entire population of the United States.
All of the counties and independent cities in the country are first grouped into approximately
2,000 geographic areas (sampling units).
A sample of ~800 of these geographic areas is chosen to represent each state and DC.
Every month, one-fourth of the households in the sample are changed, so that no household
is interviewed for more than 4 consecutive months.
After a household is interviewed for 4 consecutive months, it leaves the sample for 8 months,
and then is again interviewed for the same 4 calendar months a year later, before leaving
the sample for good.
As a result, approximately 75 percent of the sample remains the same from month to month
and 50 percent remains the same from year to year.
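The 75 percent and 50 percent figures follow from the 4-months-in, 8-months-out, 4-months-in rotation. A small sketch of that arithmetic, assuming every entry cohort has the same number of households (the helper names are hypothetical):

```python
# A household that enters the sample in month e is interviewed in months
# e..e+3 and again in e+12..e+15 (4 months in, 8 months out, 4 months in).
IN_SAMPLE_OFFSETS = {0, 1, 2, 3, 12, 13, 14, 15}

def cohorts_in_sample(month):
    """Entry months of all cohorts that are interviewed in the given month."""
    return {month - offset for offset in IN_SAMPLE_OFFSETS}

def overlap(month, lag):
    """Fraction of this month's cohorts that are interviewed again `lag` months later."""
    now = cohorts_in_sample(month)
    later = cohorts_in_sample(month + lag)
    return len(now & later) / len(now)

month_to_month = overlap(100, 1)    # 6 of 8 cohorts remain
year_to_year = overlap(100, 12)     # 4 of 8 cohorts remain
print(month_to_month, year_to_year)
```

In any month eight cohorts are in the sample; two of them rotate out the following month, which matches the statement that one-fourth of the households change every month.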
US Bureau of Labor Statistics
Unemployment Measure: In India
https://www.thehindu.com/business/Economy/how-unemployment-is-measured/article67278546.ece
Back to Unemployment Rate &
Uncertainty
Sample and Population:
Population: All the citizens of the country
Parameter: the actual unemployment number collected by asking each
member of the population
Sample: Subset of citizens queried
Estimate: the number computed from the sample.
Uncertainty
Measurement of uncertainty
the range of possible values within which the true value lies and the
probability assigned to the possible values.
Often standard deviation and standard error are general measures of
uncertainty around a particular single choice (or the measured value or
a mean, median, or mode).
Wiki
Uncertainty Measure
https://serialmentor.com/dataviz/visualizing-uncertainty.html
Standard Deviation vs Standard
Error
Standard deviation (SD) measures the amount of variability,
or dispersion, of a set of data around its mean:
how much spread there is in the data around the mean.
$SD = \sqrt{\frac{\sum_{i=1}^{N} (sample_i - mean)^2}{N-1}}$
Standard error of the mean (SEM) measures how far the sample mean
of the data is likely to be from the true population mean:
how accurate the sample mean is.
The SEM is always smaller than the SD (for N > 1).
$SEM = SD/\sqrt{N}$
SD is a measure of volatility (spread of the measured values).
For normally distributed data, about 68% of the values lie within 1 SD of the mean.
SEM is the standard deviation of the sample mean over repeated samples.
Confidence Interval
https://www.dummies.com/education/math/statistics/how-to-calculate-a-confidence-interval-for-a-population-mean-when-you-know-its-standard-deviation/
Confidence Interval in Polling
See: https://www.dummies.com/education/math/statistics/how-to-determine-the-confidence-interval-for-a-population-proportion/
For polling visualization see: https://fivethirtyeight.com/
Error Bars for Uncertainty visualization
$y = a + bx$
where
$a$: intercept
$b$: slope
$a$ and $b$ are computed by least squared error minimization,
where
$Error = \sum_i \left(y_i - (a + b x_i)\right)^2$
Linear Regression revisited
Sum of Squared Residuals (SSR) of linear regression:
$SSR = \sum_i \left(y_i - (a + b x_i)\right)^2$
or
$SSR = \sum_i (y_i - \hat{y}_i)^2$
where $\hat{y}_i$ is the predicted value.
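Error in Estimate of Slope
$SE_{slope} = \sqrt{\frac{1}{n-2} \times \frac{SSR}{\sum_i (x_i - \bar{x})^2}}$
where $SSR = \sum_i (y_i - \hat{y}_i)^2$
Error in Estimate of Intercept
$SE_{intercept} = \sqrt{\frac{SSR}{n-2} \times \left(\frac{1}{n} + \frac{\bar{x}^2}{\sum_i (x_i - \bar{x})^2}\right)}$
where $SSR = \sum_i (y_i - \hat{y}_i)^2$
The standard errors of the slope and intercept are used to calculate the confidence
intervals for the slope and intercept estimates.
The 95% confidence ($(1 - \alpha) \times 100$, with $\alpha = 0.05$) interval is given by:
$intercept \pm t(\alpha/2, n-2) \times SE_{intercept}$
$slope \pm t(\alpha/2, n-2) \times SE_{slope}$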
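The confidence intervals above can be computed by hand for a small data set. This sketch uses made-up data with n = 10 and assumes the standard textbook formulas for the standard errors of the slope and intercept; the critical value 2.306 is the tabulated two-sided 95% t value for 8 degrees of freedom.

```python
import numpy as np

# Hypothetical data with n = 10 points.
x = np.arange(1.0, 11.0)
y = np.array([2.3, 4.1, 5.8, 8.2, 9.9, 12.1, 13.8, 16.2, 18.1, 19.7])
n = x.size

# Least-squares estimates of slope b and intercept a.
sxx = ((x - x.mean()) ** 2).sum()
b = ((x - x.mean()) * (y - y.mean())).sum() / sxx
a = y.mean() - b * x.mean()

ssr = ((y - (a + b * x)) ** 2).sum()                 # sum of squared residuals
se_slope = np.sqrt(ssr / (n - 2) / sxx)
se_intercept = np.sqrt(ssr / (n - 2) * (1.0 / n + x.mean() ** 2 / sxx))

# Two-sided 95% critical value t(alpha/2 = 0.025, n - 2 = 8) from a t-table.
t_crit = 2.306

slope_ci = (b - t_crit * se_slope, b + t_crit * se_slope)
intercept_ci = (a - t_crit * se_intercept, a + t_crit * se_intercept)
print(slope_ci, intercept_ci)
```

For a different sample size the critical value has to be looked up again for n - 2 degrees of freedom.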
Seaborn support
import seaborn as sns
df = sns.load_dataset("iris")  # assumption: seaborn's bundled iris data
sns.regplot(data=df,
            x="sepal_length",
            y="sepal_width"
)
Also available in
sns.lmplot and
sns.jointplot