Session 3 - 4 Data Visualization Through Basic Statistics
Statistics
Netflix Queues:
http://www.nytimes.com/interactive/2010/01/10/nyregion/20100110-netflix-map.html?ref=nyregion
“A picture is worth a
thousand words…”
Objectives
As you create graphics keep the following in mind.
[Figure: chart titled "Calories in Common Candies". Category labels overlap or are cut off: Candy Corn, Chewing Gum, Gummy Bears, Licorice Twists, Milk…, MilkChocolateMalte…, Pectin Slices, Sour Balls, Taffy.]
Alternate Display
Sorting and expanding the scale of the graph allows all labels to
be seen as well as displaying a characteristic of the data.
[Figure: sorted chart of the candy calories; the calorie axis runs from 0 to 250.]
Vertical Display of Data
[Figure: "Calories in Common Candies". Labels shown: Milk Chocolate Bar, Dark Chocolate Bar, Milk Chocolate Malted Milk Balls, Milk Chocolate Covered Raisins, Caramels, After Dinner Mint, Licorice Twists, Semi-Sweet Chocolate Chips, Starlight Mints, Lollipop, Chewing Gum.]
[Chart legend, in the form value (count, percent): 3 (3, 13.6%); 1 (3, 13.6%); 6 (1, 4.5%); 0 (14, 63.6%).]
Extremes
• Minimum(calories) = 10
• Maximum(calories) = 210
10 60 60 60 60 60 70 130 140 140 160 160 160 160 160 160 180 180 200 210 210 210
Range
Range: the difference between the largest and
smallest measurements of a variable.
Extremes
• Minimum(calories) = 10
• Maximum(calories) = 210
Range = 210 - 10 = 200
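The extremes and the range can be checked directly from the sorted calorie list. The following is a minimal sketch; the variable names (`calories`, `data_range`) are chosen here for illustration and do not come from the slides.

```python
# Calorie counts for the 22 candies, copied from the sorted list on the slide.
calories = [10, 60, 60, 60, 60, 60, 70, 130, 140, 140, 160,
            160, 160, 160, 160, 160, 180, 180, 200, 210, 210, 210]

minimum = min(calories)          # smallest measurement
maximum = max(calories)          # largest measurement
data_range = maximum - minimum   # range = largest - smallest

print(f"Minimum = {minimum}, Maximum = {maximum}, Range = {data_range}")
```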
Trimmed mean = mean of the data after some fraction of the smallest and
largest values is discarded. Usually the smallest 5% and the largest 5%
of the values are removed, with the number removed from each tail
rounded to the nearest integer.
For the candy data, trimmed mean = 136.0 (with 10% trimmed, 5% in each tail).
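A short sketch of the trimmed mean computation for the candy data follows. The function name `trimmed_mean` and its `tail_fraction` parameter are illustrative assumptions, not names from the slides. With n = 22, 5% of the data is 1.1 values, which rounds to 1 value removed from each tail.

```python
# Calorie counts for the 22 candies, copied from the sorted list on the slide.
calories = [10, 60, 60, 60, 60, 60, 70, 130, 140, 140, 160,
            160, 160, 160, 160, 160, 180, 180, 200, 210, 210, 210]


def trimmed_mean(values, tail_fraction=0.05):
    """Mean after dropping round(tail_fraction * n) values from each tail.

    Note: Python's round() sends exact halves to the nearest even integer,
    which may differ from the rounding convention intended on the slide.
    """
    ordered = sorted(values)
    k = round(tail_fraction * len(ordered))   # values removed per tail
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


plain_mean = sum(calories) / len(calories)
print(f"Mean         = {plain_mean:.1f}")
print(f"Trimmed mean = {trimmed_mean(calories):.1f}")
```

Here the single values 10 and 210 are dropped, so the trimmed mean is the average of the remaining 20 values.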
Here n = 22. For Q1, (n+1)/4 = 23/4 = 5.75, hence Q1 is three quarters of the way between
the 5th and 6th observations in the sorted list. The 5th value is 60 and the 6th
value is 60, thus
Q1 = 60 + .75(60-60) = 60.
For Q2, (n+1)/2 = 23/2 = 11.5, i.e. halfway between the 11th and 12th observations.
Q2 = 160 + .5(160-160) = 160.
For Q3, 3(n+1)/4 = 3(23)/4 = 69/4 = 17.25, i.e. a quarter of the way between the 17th
and 18th observations.
Q3 = 180 + .25(180-180) = 180.
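The position rule used above, (n+1)p followed by linear interpolation between neighboring observations, can be sketched as a small function. The name `quantile_n_plus_1` is hypothetical, and the clamping at the ends of the list is an assumption added so that the function also handles p close to 0 or 1.

```python
# Calorie counts for the 22 candies, copied from the sorted list on the slide.
calories = [10, 60, 60, 60, 60, 60, 70, 130, 140, 140, 160,
            160, 160, 160, 160, 160, 180, 180, 200, 210, 210, 210]


def quantile_n_plus_1(values, p):
    """Quantile at 1-based position (n + 1) * p with linear interpolation."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) * p        # e.g. 5.75 for Q1 when n = 22
    lower = int(position)         # whole part: index of the lower neighbor
    fraction = position - lower   # how far to move toward the upper neighbor
    if lower < 1:                 # assumption: clamp below the first value
        return ordered[0]
    if lower >= n:                # assumption: clamp above the last value
        return ordered[-1]
    below = ordered[lower - 1]    # convert 1-based position to 0-based index
    above = ordered[lower]
    return below + fraction * (above - below)


q1 = quantile_n_plus_1(calories, 0.25)
q2 = quantile_n_plus_1(calories, 0.50)
q3 = quantile_n_plus_1(calories, 0.75)
print(f"Q1 = {q1}, Q2 = {q2}, Q3 = {q3}")
```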
10 60 60 60 60 60 70 130 140 140 160 160 160 160 160 160 180 180 200 210 210 210
Percentiles
100pth Percentile: the value in a sorted list of the data that
has approximately 100p% of the measurements below it
and approximately 100(1-p)% above it, where 0 < p < 1. (Also called the p quantile.)
[Figure: distribution function]
Examples:
Q1 = 25th percentile
Q2 = 50th percentile
Q3 = 75th percentile
Simplified Quartiles
A simpler way to find Q1 & Q3 is as follows:
1. Order the data from the lowest to the highest value, and find the median.
2. Divide the ordered data into the lower half and the upper half, using the
median as the dividing value. (Always exclude the median itself from each
half.)
3. Q1 is just the median of the lower half.
4. Q3 is just the median of the upper half.
Ex: For the candy data we still get Q1=60 and Q3=180.
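The four steps above can be sketched as follows. The function name `simplified_quartiles` is an illustrative assumption. When n is odd the middle observation is left out of both halves; when n is even, as with the candy data, the list simply splits into two equal halves.

```python
from statistics import median

# Calorie counts for the 22 candies, copied from the sorted list on the slide.
calories = [10, 60, 60, 60, 60, 60, 70, 130, 140, 140, 160,
            160, 160, 160, 160, 160, 180, 180, 200, 210, 210, 210]


def simplified_quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves method."""
    ordered = sorted(values)                 # step 1: order the data
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]              # step 2: values below the median
    # For odd n, skip the middle observation so it is in neither half.
    upper_half = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower_half), median(ordered), median(upper_half)


q1, q2, q3 = simplified_quartiles(calories)
print(f"Q1 = {q1}, median = {q2}, Q3 = {q3}")
```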
Quartiles:
Q1 = 25th percentile = 60
Q2 = 50th percentile = median = 160
Q3 = 75th percentile = 180
Sample variance: $s^2 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
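Assuming the formula on this slide is the usual sample variance with divisor n - 1, it can be computed for the candy data as sketched below. The function name `sample_variance` is hypothetical; the result is cross-checked against `statistics.variance` from the Python standard library, which uses the same divisor.

```python
import math
import statistics

# Calorie counts for the 22 candies, copied from the sorted list on the slide.
calories = [10, 60, 60, 60, 60, 60, 70, 130, 140, 140, 160,
            160, 160, 160, 160, 160, 180, 180, 200, 210, 210, 210]


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


s2 = sample_variance(calories)
s = math.sqrt(s2)   # sample standard deviation
print(f"s^2 = {s2:.2f}, s = {s:.2f}")
print(f"statistics.variance gives {statistics.variance(calories):.2f}")
```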
[Figure: software screenshot with two labeled areas.]
• Session worksheet with script commands
• Spreadsheet-like data area
Histogram of calories   N = 22
• A printer graph of the frequency table.
• Easy to do by hand.
• Quick visualization of the data.

Midpoint  Count
      20      1  *
      40      0
      60      5  *****
      80      1  *
     100      0
     120      0
     140      3  ***
     160      6  ******
     180      2  **
     200      1  *
     220      3  ***
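The printer-style histogram above can be reproduced with a few lines of code. This is a sketch under one stated assumption: each bin has width 20 and a value on a bin boundary goes to the upper bin (so 70 is counted at midpoint 80 and 210 at midpoint 220), a convention inferred from the counts shown rather than stated on the slide. The function name `midpoint_counts` is hypothetical.

```python
from collections import Counter

# Calorie counts for the 22 candies, copied from the sorted list on the slide.
calories = [10, 60, 60, 60, 60, 60, 70, 130, 140, 140, 160,
            160, 160, 160, 160, 160, 180, 180, 200, 210, 210, 210]


def midpoint_counts(values, width=20, first_midpoint=20):
    """Return [(midpoint, count), ...] for bins [midpoint - w/2, midpoint + w/2)."""
    start = first_midpoint - width // 2   # left edge of the first bin
    counts = Counter(first_midpoint + width * ((x - start) // width)
                     for x in values)
    last = max(counts)                    # midpoint of the last occupied bin
    return [(m, counts[m]) for m in range(first_midpoint, last + width, width)]


table = midpoint_counts(calories)
print(f"Histogram of calories   N = {len(calories)}")
print("Midpoint  Count")
for midpoint, count in table:
    print(f"{midpoint:>8}  {count:>5}  {'*' * count}")
```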
Box Plot for Calories
A visualization of most of the basic statistics.
[Figure: box plot of calories (SAS Proc Insight), with the maximum and minimum labeled.]
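A box plot is drawn from a handful of summary values: the minimum, Q1, the median, Q3, and the maximum. The sketch below computes those five numbers for the candy data with the Python standard library rather than SAS; `five_number_summary` is a hypothetical helper name. The `"exclusive"` method of `statistics.quantiles` places the quartiles at positions based on n + 1, which matches the (n+1)p rule used earlier in these slides.

```python
from statistics import quantiles

# Calorie counts for the 22 candies, copied from the sorted list on the slide.
calories = [10, 60, 60, 60, 60, 60, 70, 130, 140, 140, 160,
            160, 160, 160, 160, 160, 180, 180, 200, 210, 210, 210]


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, q2, q3 = quantiles(values, n=4, method="exclusive")
    return min(values), q1, q2, q3, max(values)


labels = ["Minimum", "Q1", "Median", "Q3", "Maximum"]
summary = five_number_summary(calories)
for label, value in zip(labels, summary):
    print(f"{label:<8} {value}")
```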
Percentiles
100pth Percentile: the value in a sorted list of the data that has
approximately 100p% of the measurements below it and
approximately 100(1-p)% above it, where 0 < p < 1. (Also called the p quantile.)
[Figure: smoothed histogram]
Examples:
Q1 = 25th percentile
Q2 = 50th percentile
Q3 = 75th percentile
A distribution is said to be symmetric if, for every p, the distance from the median to the 100pth
percentile is the same as the distance from the median to the 100(1-p)th percentile.
Otherwise the distribution is said to be skewed.
In the case above, the distribution is skewed to the right since the right tail is longer than
the left tail.
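The symmetry definition can be tried out numerically. The sketch below applies it to the candy calorie data (not to the smoothed histogram in the figure) by comparing the distance from the median to the 100pth percentile with the distance to the 100(1-p)th percentile. The helper name `tail_distances` is hypothetical, and checking a single p gives only a rough indication of skewness.

```python
from statistics import quantiles

# Calorie counts for the 22 candies, copied from the sorted list on the slide.
calories = [10, 60, 60, 60, 60, 60, 70, 130, 140, 140, 160,
            160, 160, 160, 160, 160, 180, 180, 200, 210, 210, 210]


def tail_distances(values, percent):
    """Distances from the median to the percent-th and (100 - percent)-th
    percentiles, for an integer percent between 1 and 49."""
    cuts = quantiles(values, n=100, method="exclusive")  # cuts[k-1] = kth percentile
    median = cuts[49]
    lower_distance = median - cuts[percent - 1]
    upper_distance = cuts[100 - percent - 1] - median
    return lower_distance, upper_distance


lower, upper = tail_distances(calories, 25)
print(f"median - 25th percentile = {lower}")
print(f"75th percentile - median = {upper}")
if lower == upper:
    print("Symmetric at p = 0.25")
elif lower > upper:
    print("Longer left tail at p = 0.25")
else:
    print("Longer right tail at p = 0.25")
```

For the candy data the two distances differ, so by this definition the data are not symmetric.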
Frequency Histogram
A graphical presentation of the frequency table where the relative areas of the
bars are in proportion to the frequencies.
[Figure: frequency histogram of calories. Vertical axis: frequency. Horizontal axis: calories, with the bin width marked.]
Density Histogram
A density histogram (or simply a histogram) is constructed just like
a frequency histogram, but now the total area of the bars sums to
one. This is accomplished by rescaling the vertical axis. Instead of
frequencies, the vertical axis records the rescaled value of the
density.
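The rescaling can be made concrete: with equal bins of a given width, each bar's height becomes count / (n × width), so that height × width summed over all bars equals one. The sketch below uses the candy data with bins of width 20 starting at 10; both choices, and the function name `density_histogram`, are assumptions made for illustration.

```python
from collections import Counter

# Calorie counts for the 22 candies, copied from the sorted list on the slide.
calories = [10, 60, 60, 60, 60, 60, 70, 130, 140, 140, 160,
            160, 160, 160, 160, 160, 180, 180, 200, 210, 210, 210]


def density_histogram(values, width=20, start=10):
    """Return [(left_edge, right_edge, density), ...] for equal-width bins."""
    n = len(values)
    counts = Counter((x - start) // width for x in values)  # bin index per value
    bins = []
    for index in range(max(counts) + 1):
        left = start + index * width
        density = counts[index] / (n * width)   # rescaled bar height
        bins.append((left, left + width, density))
    return bins


bins = density_histogram(calories)
total_area = sum((right - left) * density for left, right, density in bins)
for left, right, density in bins:
    print(f"[{left:>3}, {right:>3})  density = {density:.5f}")
print(f"Total area = {total_area:.3f}")
```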
Histograms have
important ties to
probability.
[Figure: two scatterplots of calories (vertical axis) against totfat (horizontal axis, 0 to 15), drawn with different axis lengths.]
The lengths of the axes can change how the relationship is perceived.
Matrix Plot
Displaying
multiple variables
symbolically.
Kishore Kumar Morya PhD.
SoM