
Session 3 - 4 Data Visualization Through Basic Statistics

It talks about visually presenting the data in charts, tables, and frequency distributions.

Uploaded by

kishoremorya
Copyright
© All Rights Reserved

Data Visualization through Basic Statistics

Kishore Kumar Morya PhD.


SoM
Lecture Objectives
• Review approaches to visually displaying data.
• Graphics that display key statistical features of measurements from a sample.
• Define the distribution of a set of data.
• Review common basic statistics.
  – Extremes (minimum and maximum)
  – Central tendency (mean, median, mode)
  – Spread (range, variance, standard deviation)
• Review less common basic statistics.
  – Extremes (upper and lower quartiles)
  – Central tendency (mode, Winsorized mean)
  – Spread (interquartile range)

• Baby Name Wizard: http://www.babynamewizard.com/voyager
• Origin of Species, edits: http://benfry.com/traces/
• Netflix Queues: http://www.nytimes.com/interactive/2010/01/10/nyregion/20100110-netflix-map.html?ref=nyregion
• Unemployment Visualization (NYTimes): http://www.nytimes.com/interactive/2009/11/06/business/economy/unemployment-lines.html

Data Visualization Goal
• Understand what makes a visualization effective through the study of core principles
• Critically evaluate a visual representation of data by looking at various examples in media (newspapers, television and so on)
• Gain hands-on experience with visualization tools (Tableau, Many Eyes, Prefuse, Parallel Sets)
• Incorporate visualization principles to build an interactive visualization of your own data

Data Scientists
• Professionals responsible for filtering out the noise and analysing essential information
• Integral part of competitive intelligence, a newly emerging field that encompasses data analysis to help businesses gain a competitive edge
• A shortfall of about 140,000 to 190,000 individuals with analytical expertise is projected
• Glassdoor.com shows average data scientist salaries ranging from $60,000 to $115,000

What is data visualization?
• Visual representation of data
• For exploration, discovery, insight, ...
• An interactive component provides more insight than a static image

Types of data visualization
• Scientific Visualization
  – Structural data: seismic, medical, ...
• Information Visualization
  – No inherent structure: news, stock market, top grossing movies, Facebook connections
• Visual Analytics
  – Use visualization to understand and synthesize large amounts of multimodal data: audio, video, text, images, networks of people, ...
Graphics
The visual portrayal of quantitative information.

Graphics are used to:
• Display the actual data table
• Display quantities derived from the data
• Show what has been learned about the data from other analyses
• Allow one to see what may be occurring in the data over and above what has already been described

Graphical Display Objectives
• Tabulation
• Description
• Illustration
• Exploration

“A picture is worth a thousand words…”
Objectives
As you create graphics, keep the following in mind.

• Avoid distortion of the true story.
• Induce the viewer to think about the substance, not the graph.
• Reveal the data at several layers of detail.
• Encourage the eye to compare different pieces.
• Support the statistical and verbal descriptions of the data.
Nutrient Profiles for Selected Candy
URL: http://www.candyusa.org/nutfact.html
[Figure: candy data as an Excel spreadsheet in standard data format, with the qualitative characteristic and the quantitative characteristics marked.]
Display the data table
[Figure: column chart "Calories in Common Candies"; vertical axis 0 to 250; category labels include AfterDinnerMint, Candy Corn, Chewing Gum, Gummy Bears, LicoriceTwists, PectinSlices, Sour Balls, and Taffy, with several labels truncated (e.g. "MilkChocolateCove…").]

What are the problems with this graph?
Alternate Display
Sorting and expanding the scale of the graph allows all labels to
be seen as well as displaying a characteristic of the data.

[Figure: column chart "Calories in Common Candies", sorted; vertical axis 0 to 250.]
Vertical Display of Data
[Figure: bar chart "Calories in Common Candies" with one row per candy; axis 0 to 250; rows from the top: MilkChocolate Bar, DarkChocolateBar, MilkChocolateMaltedMilkBalls, MilkChocolateCoveredRaisins, Caramels, AfterDinnerMint, LicoriceTwists, SemiSweetChocolateChips, StarlightMints, Lollipop, Chewing Gum.]

In this case, a vertical display allows better comparison of calorie amounts.
Pie Charts
Pie Chart of SatFatC (category: count, percent)
• NoSatFat: 13, 59.1%
• SatFat: 9, 40.9%

Pie Chart of protein (value: count, percent)
• 0: 14, 63.6%
• 1: 3, 13.6%
• 3: 3, 13.6%
• 4: 1, 4.5%
• 6: 1, 4.5%

A pie chart is good for making relative comparisons among pieces of a whole.
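
The percentages on a pie chart are each count divided by the total. A minimal Python sketch, using the protein counts listed for the pie chart (the dictionary name is only illustrative):

```python
# Counts of candies by protein value, as listed for the pie chart.
protein_counts = {0: 14, 1: 3, 3: 3, 4: 1, 6: 1}

total = sum(protein_counts.values())  # 22 candies in all

# Share of the whole for each slice, in percent.
shares = {value: 100 * count / total for value, count in protein_counts.items()}

for value, count in protein_counts.items():
    print(f"{value}: {count} ({shares[value]:.1f}%)")
```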
Statistical Uses of Graphics
Describe Distributions of Measurements
• Box & Whisker plot (Boxplot)
• Histogram

Compare Distributions
• Multiple Box & Whisker plots

Associations and Bivariate Distributions
• Scatter plot
• Symbolic scatter plot

Multidimensional Data Displays
• All pairwise scatter plot
• Rotating scatter plot

Graphical Methods in Support of Statistical Inference
• Regression lines
• Residual plots
• Quantile-quantile plots
• Cumulative distribution function plots
• Confidence and prediction interval plots
• Partial leverage plots
• Smoothed curves

Most of these will be demonstrated at some point in the course.
Basic Statistics
Before we get more into statistical uses of graphics, we need to
define some basic statistics. These statistics are typically referred
to as “descriptive statistics”, although as we will see, they are
much more than that. These basic statistics address specific
aspects of the distribution of the data.

• What is the range of the data?
• When we sort the data, what number might we see in the “middle” of the range of values?
• Over what subrange do we find the bulk of the data?

We will use the calorie data to illustrate.


First, if we sort the data we can immediately
identify the extremes.

Extremes
• Minimum(calories) = 10
• Maximum(calories) = 210

The minimum and maximum are “statistics”.

Reminder: A statistic is a function of the data. In this case, the function is very simple.

10 60 60 60 60 60 70 130 140 140 160 160 160 160 160 160 180 180 200 210 210 210
Range
Range: the difference between the largest and
smallest measurements of a variable.

Extremes
• Minimum(calories) = 10
• Maximum(calories) = 210

Range = 210 - 10 = 200

Tells us something about the spread of the data.

The middle of the range is a measure of the “center” of the data.
Midrange = minimum + (Range/2)
= 10 + 200/2
= 110
Is it a “good” measure of the center of the data?
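
The extremes, range, and midrange can be checked with a few lines of Python. This is only a sketch; the variable names are illustrative and the list is the sorted calorie data shown on the slides.

```python
# Calorie values from the candy data, already sorted.
calories = [10, 60, 60, 60, 60, 60, 70, 130, 140, 140, 160,
            160, 160, 160, 160, 160, 180, 180, 200, 210, 210, 210]

minimum = min(calories)
maximum = max(calories)
data_range = maximum - minimum        # spread between the extremes
midrange = minimum + data_range / 2   # middle of the range

print(minimum, maximum, data_range, midrange)  # 10 210 200 110.0
```

Note that the midrange depends only on the two extreme values, so a single unusual observation can move it a long way.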
Measures of Central Tendency
Estimate the value that is in the center of the “distribution” of the data.

Median = middle value in the sorted list of n numbers, at position (n+1)/2
= unique value at position (n+1)/2 if n is odd, or
= average of the values at positions n/2 and n/2+1 if n is even
= (160 + 160)/2 = 160

Mean = sum of all values divided by the number of values (average)
= (10 + 60 + 60 + 60 + … + 210 + 210)/22
= 133.6

Trimmed mean = mean of the data after a fraction of the smallest and largest values is set aside. Usually the smallest 5% and largest 5% of the values (each count rounded to the nearest integer) are removed for this computation.
= 136.0 (with 10% trimmed, 5% from each tail)

Again – these are statistics (functions of the data)
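
As a sketch of how these three statistics can be computed, the Python below uses the calorie data. The helper `trimmed_mean` and its parameter name are illustrative, and it assumes the number of values dropped per tail is rounded to the nearest integer, as described above.

```python
from statistics import median

calories = [10, 60, 60, 60, 60, 60, 70, 130, 140, 140, 160,
            160, 160, 160, 160, 160, 180, 180, 200, 210, 210, 210]

def trimmed_mean(values, fraction_per_tail=0.05):
    """Mean after dropping a fraction of the values from each tail."""
    ordered = sorted(values)
    k = round(len(ordered) * fraction_per_tail)  # values dropped per tail
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

sample_mean = sum(calories) / len(calories)

print(median(calories))                  # 160.0
print(round(sample_mean, 1))             # 133.6
print(round(trimmed_mean(calories), 1))  # 136.0
```

With 22 values, 5% per tail rounds to one value, so the trimmed mean drops the single 10 and one of the 210s.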


Mathematical Notation
We will need some mathematical notation if we are to make any progress in understanding statistics. In particular, since all statistics are functions of the data, we should be able to represent these statistics symbolically as equations using mathematical notation.

Let Y be the symbolic name of a random variable (i.e. a placeholder for the true name of a variable: weight, gender, time, etc.). Let y_i symbolically represent the i-th value of variable Y observed in the sample. Let the symbol Σ represent summation. Then the sample mean, with symbolic name \bar{y} and n the number of observations, can be expressed as:

\[ \bar{y} = \frac{\sum_{i=1}^{n} y_i}{n} = \frac{y_1 + y_2 + \cdots + y_n}{n} \]
Quartiles
Suppose we divide the sorted data into four equal parts. The values that separate the four parts are known as the quartiles. The first or lower quartile, Q1, is the 25th percentile of the sorted data; the second quartile, Q2, is the median; and the third or upper quartile, Q3, is the 75th percentile. Because n+1 is not always divisible by 4, we estimate the quartiles by linear interpolation between adjacent values.

Here n = 22 and (n+1)/4 = 23/4 = 5.75, hence Q1 lies three quarters of the way between the 5th and 6th observations in the sorted list. The 5th value is 60 and the 6th value is 60, thus
Q1 = 60 + .75(60-60) = 60.

For Q2, (n+1)/2 = 23/2 = 11.5, i.e. halfway between the 11th and 12th observations.
Q2 = 160 + .5(160-160) = 160.

For Q3, 3(n+1)/4 = 3(23)/4 = 69/4 = 17.25, i.e. a quarter of the way between the 17th and 18th observations.
Q3 = 180 + .25(180-180) = 180.

10 60 60 60 60 60 70 130 140 140 160 160 160 160 160 160 180 180 200 210 210 210
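
The interpolation rule above can be sketched in Python as follows. The function name `quantile_n_plus_1` is illustrative, and the function assumes the position (n+1)p falls between 1 and n; statistics packages may use other interpolation conventions.

```python
calories = [10, 60, 60, 60, 60, 60, 70, 130, 140, 140, 160,
            160, 160, 160, 160, 160, 180, 180, 200, 210, 210, 210]

def quantile_n_plus_1(sorted_values, p):
    """Value at 1-based position (n + 1) * p, interpolating linearly."""
    n = len(sorted_values)
    position = (n + 1) * p
    if not 1 <= position <= n:
        raise ValueError("p is too extreme for this sample size")
    lower = int(position)          # 1-based index of the value just below
    fraction = position - lower    # how far to move toward the next value
    if fraction == 0:
        return sorted_values[lower - 1]
    low_value = sorted_values[lower - 1]
    high_value = sorted_values[lower]
    return low_value + fraction * (high_value - low_value)

q1 = quantile_n_plus_1(calories, 0.25)
q2 = quantile_n_plus_1(calories, 0.50)
q3 = quantile_n_plus_1(calories, 0.75)
print(q1, q2, q3)  # 60.0 160.0 180.0
```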
Percentiles
100p-th percentile: the value in a sorted list of the data that has approximately 100p% of the measurements below it and approximately 100(1-p)% above it, for 0 < p < 1. (The p quantile.)
[Figure: distribution function.]

Examples:
Q1 = 25th percentile
Q2 = 50th percentile
Q3 = 75th percentile
Simplified Quartiles
A simpler way to find Q1 & Q3 is as follows:

1. Order the data from the lowest to the highest value, and find the median.
2. Divide the ordered data into the lower half and the upper half, using the
median as the dividing value. (Always exclude the median itself from each
half.)
3. Q1 is just the median of the lower half.
4. Q3 is just the median of the upper half.

Ex: For the candy data we still get Q1=60 and Q3=180.

Ex: {3, 4, 7, 8, 9, 11, 12, 15, 18}.

We get Q1=(4+7)/2=5.5 and Q3=(12+15)/2=13.5.
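
The four steps can be sketched in Python as below. The function name `simple_quartiles` is illustrative; it assumes at least two data values.

```python
from statistics import median

def simple_quartiles(values):
    """Q1 and Q3 as the medians of the lower and upper halves."""
    ordered = sorted(values)
    if len(ordered) < 2:
        raise ValueError("need at least two values")
    half = len(ordered) // 2
    lower_half = ordered[:half]    # values below the median
    upper_half = ordered[-half:]   # for odd n this skips the middle value
    return median(lower_half), median(upper_half)

calories = [10, 60, 60, 60, 60, 60, 70, 130, 140, 140, 160,
            160, 160, 160, 160, 160, 180, 180, 200, 210, 210, 210]

print(simple_quartiles(calories))                          # (60, 180)
print(simple_quartiles([3, 4, 7, 8, 9, 11, 12, 15, 18]))   # (5.5, 13.5)
```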


Measures of Variability
• Range
• Interquartile Range
• Variance
• Standard Deviation

Interquartile Range (IQR): Difference between the third quartile (Q3) and the first quartile (Q1).

Quartiles:
Q1 = 25th percentile = 60
Q2 = 50th percentile = median = 160
Q3 = 75th percentile = 180

IQR = Q3 - Q1 = 180 - 60 = 120


Variance and Standard Deviation

Sample mean:
\[ \bar{y} = \frac{\sum_{i=1}^{n} y_i}{n} \]

Variance: The sum of squared deviations of measurements from their mean, divided by n-1.
\[ s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1} \]

Standard Deviation: The square root of the variance.
\[ s = \sqrt{s^2} \]

Rough approximation for large n: s ≈ range/4.

These measure the spread of the data.
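
A short Python sketch of these formulas applied to the calorie data (variable names are illustrative):

```python
import math

calories = [10, 60, 60, 60, 60, 60, 70, 130, 140, 140, 160,
            160, 160, 160, 160, 160, 180, 180, 200, 210, 210, 210]

n = len(calories)
y_bar = sum(calories) / n

# Sum of squared deviations from the mean, divided by n - 1.
variance = sum((y - y_bar) ** 2 for y in calories) / (n - 1)
std_dev = math.sqrt(variance)

# Rough approximation: range / 4.
rough_std_dev = (max(calories) - min(calories)) / 4

print(round(variance, 1))  # 3662.3
print(round(std_dev, 1))   # 60.5
print(rough_std_dev)       # 50.0
```

For this sample of only 22 values the range/4 figure (50) is in the right neighbourhood but not close to 60.5, which is consistent with it being a rough approximation intended for large n.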
Using Excel Data Analysis Tool
1. Select the Data Analysis Tool.
2. Select Descriptive Statistics.
3. In the menu that appears, enter the Input Range and check the output options desired.

Excel Descriptive Statistics Output
You should be able to easily identify the basic statistics we have described so far.

Note: the variance is not in this list. This is typical of statistics packages. Since the variance is simply the square of the Standard Deviation, it is often considered redundant.

Learn to use the Excel Help files. Type “Statistic” in the Excel Help Keyword dialog for a list of available help topics.
Importing a text data file in standard format into Minitab
[Figure: Minitab window with three labelled parts: pull-down menus, session worksheet with script commands, and spreadsheet-like data area.]
Computing Descriptive Stats
Descriptive Statistics

Variable  N   Mean   Median  TrMean  StDev  SEMean
calories  22  133.6  160.0   136.0   60.5   12.9

Variable  Min   Max    Q1    Q3
calories  10.0  210.0  60.0  180.0
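
The SEMean column is not defined on the slides. Assuming it is the standard error of the mean, i.e. the standard deviation divided by the square root of n, the value in the output can be reproduced with this Python sketch:

```python
import math
from statistics import stdev

calories = [10, 60, 60, 60, 60, 60, 70, 130, 140, 140, 160,
            160, 160, 160, 160, 160, 180, 180, 200, 210, 210, 210]

# Assumption: SEMean = s / sqrt(n), with s the sample standard deviation.
se_mean = stdev(calories) / math.sqrt(len(calories))

print(round(se_mean, 1))  # 12.9
```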


Frequency Table
A tabular representation of a set of data.
A frequency table also describes the distribution of the data and facilitates the estimation of probabilities.

Mode = most abundant value.

The “Histogram” dialog in the Excel Data Analysis Tool can be used to create this table, but it is not straightforward.
Stem and Leaf Plot
Rough grouping or “binning” of the data.
• A printer graph of the frequency table.
• Easy to do by hand.
• Quick visualization of the data.

Histogram of calories   N = 22
Midpoint  Count
      20      1  *
      40      0
      60      5  *****
      80      1  *
     100      0
     120      0
     140      3  ***
     160      6  ******
     180      2  **
     200      1  *
     220      3  ***
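
The printer graph above can be reproduced with a short Python sketch. The binning convention is an assumption inferred from the counts: each midpoint m is taken to cover values from m - 10 up to, but not including, m + 10.

```python
from collections import Counter

calories = [10, 60, 60, 60, 60, 60, 70, 130, 140, 140, 160,
            160, 160, 160, 160, 160, 180, 180, 200, 210, 210, 210]

bin_width = 20

def midpoint(value):
    # Assumed convention: midpoint m covers m - 10 <= value < m + 10.
    return ((value + bin_width // 2) // bin_width) * bin_width

counts = Counter(midpoint(v) for v in calories)

print("Midpoint  Count")
for m in range(20, 221, bin_width):
    print(f"{m:>8} {counts[m]:>6}  {'*' * counts[m]}")

# The mode is the most frequent raw value.
mode_value, mode_count = Counter(calories).most_common(1)[0]
print(mode_value, mode_count)  # 160 6
```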
Box Plot for Calories
A visualization of most of the basic statistics.

[Figure: box plot of calories (SAS Proc Insight); vertical axis 0 to 200; labelled from top to bottom: maximum, 75th percentile (Q3), median (Q2), 25th percentile (Q1), minimum, with the interquartile range spanning Q1 to Q3.]
Percentiles
100p-th percentile: the value in a sorted list of the data that has approximately 100p% of the measurements below it and approximately 100(1-p)% above it, for 0 < p < 1. (The p quantile.)
[Figure: smoothed histogram.]

Examples:
Q1 = 25th percentile
Q2 = 50th percentile
Q3 = 75th percentile

A distribution is said to be symmetric if the distance from the median to the 100p-th percentile is the same as the distance from the median to the 100(1-p)-th percentile. Otherwise the distribution is said to be skewed.
In the case above, the distribution is skewed to the right since the right tail is longer than the left tail.
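
As an illustration of the symmetry criterion, the Python sketch below applies it at p = 0.25 to the candy calorie quartiles computed earlier (not to the pictured curve). Checking a single value of p is only a partial test, and the variable names are illustrative.

```python
# Quartiles of the candy calorie data, from the earlier slides.
q1, med, q3 = 60, 160, 180

lower_distance = med - q1   # median to the 25th percentile
upper_distance = q3 - med   # median to the 75th percentile

if lower_distance == upper_distance:
    shape = "symmetric at the quartiles"
elif upper_distance > lower_distance:
    shape = "skewed to the right"
else:
    shape = "skewed to the left"

print(lower_distance, upper_distance, shape)  # 100 20 skewed to the left
```

For the calorie data the lower distance is larger, which agrees with the mean (133.6) falling below the median (160).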
Frequency Histogram
A graphical presentation of the frequency table where the relative areas of the bars are in proportion to the frequencies.

[Figure: frequency histogram of calories; horizontal axis calories, 0 to 200; vertical axis frequency; the bin width is marked.]
Density Histogram
A density histogram (or simply a histogram) is constructed just like a frequency histogram, but now the total area of the bars sums to one. This is accomplished by rescaling the vertical axis. Instead of frequencies, the vertical axis records the rescaled value of the density.

Histograms have important ties to probability.

[Figure: density histogram; the sum of the shaded area is equal to one.]
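
The rescaling can be sketched in Python: each bar height becomes count / (n × bin width), so that height times width summed over the bars equals one. The bin edges below (width 50, starting at 0) are an assumption chosen for illustration.

```python
calories = [10, 60, 60, 60, 60, 60, 70, 130, 140, 140, 160,
            160, 160, 160, 160, 160, 180, 180, 200, 210, 210, 210]

bin_width = 50
left_edges = list(range(0, 250, bin_width))  # assumed bins [0,50), [50,100), ...

# Frequency histogram: number of values falling in each bin.
counts = [sum(1 for v in calories if lo <= v < lo + bin_width)
          for lo in left_edges]

# Density histogram: rescale so the total bar area is one.
n = len(calories)
densities = [c / (n * bin_width) for c in counts]
total_area = sum(d * bin_width for d in densities)

print(counts)                 # [1, 6, 3, 8, 4]
print(round(total_area, 10))  # 1.0
```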


Number of Bins for Histograms
How we view the “distribution” of a dataset can depend on how much data we have and how it is binned.

[Figure: histograms of the same data with five, six, and eleven bins, overlaid with a smoothed histogram or density curve.]
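
To see the effect of binning on the calorie data, the Python sketch below counts the values using five, six, and eleven equal-width bins. The helper `bin_counts` and the range 0 to 220 are assumptions made for illustration, not the settings used for the figures.

```python
calories = [10, 60, 60, 60, 60, 60, 70, 130, 140, 140, 160,
            160, 160, 160, 160, 160, 180, 180, 200, 210, 210, 210]

def bin_counts(values, low, high, n_bins):
    """Count values in n_bins equal-width bins spanning low to high."""
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for v in values:
        # A value on the top edge is placed in the last bin.
        index = min(int((v - low) / width), n_bins - 1)
        counts[index] += 1
    return counts

for n_bins in (5, 6, 11):
    print(n_bins, bin_counts(calories, 0, 220, n_bins))
```

Comparing the three printed lists shows how the apparent shape of the same 22 values shifts as the bins get narrower.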
Scatterplot
Graphics to examine relationships.

Is the relationship linear or non-linear?

Beware, changing the relative lengths of the axes can change how the relationship is perceived.

[Figure: two scatterplots of calories (0 to 200) against totfat (0 to 15), drawn with different relative axis lengths.]
Matrix Plot
View multiple variables at one time.
• Brushing the plot to identify interesting points.
• Three-D views.

Chernoff Faces
Displaying multiple variables symbolically.