Basic Statistics: Shyam Karmakar
Shyam Karmakar
May 5, 2005
Contents
1. Descriptive and Inferential Statistics
2. Scale of Measurement
3. Percentiles and Quartiles
4. Measure of Central Tendency
5. Measure of Variability
6. Grouped Data and the Histogram
7. Relations between the Mean and Standard Deviation
8. Methods of Displaying Data
9. Correlation and Regression
10. Cross Tabulation
11. Probability
Copyright © 2004 marketRx, Inc. All rights reserved.
WHAT IS STATISTICS?
The Pth percentile of an ordered data set is the value that has
P% (P percent) of the data points below it and (100 - P)%
above it.
Example (20 ordered observations): the 10th observation is 16, and the
11th observation is also 16. The 50th percentile (the median) lies halfway
between the 10th and 11th values and is thus 16.
Quartiles are the percentage points that break down the ordered
data set into quarters.
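A minimal sketch of these definitions in Python, using the 20 observations from the variance example later in this deck. The slides do not say which interpolation rule they use for quartiles; the standard library's default ("exclusive") method is assumed here, and other conventions can give slightly different values.

```python
import statistics

# The 20 ordered observations from the deck's running example.
data = [6, 9, 10, 12, 13, 14, 14, 15, 16, 16,
        16, 17, 17, 18, 18, 19, 20, 21, 22, 24]

# 50th percentile: halfway between the 10th and 11th ordered values
# (list indices 9 and 10, because Python counts from 0).
median = (data[9] + data[10]) / 2

# Quartiles under the assumed "exclusive" interpolation rule.
q1, q2, q3 = statistics.quantiles(data, n=4)

print("median:", median)
print("quartiles:", q1, q2, q3)
```

The second quartile is the same quantity as the median, so the two results should agree.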
Mean (Average)
[Dot plot of the 20 observations: 6, 9, 10, 12, 13, 14, 14, 15, 16, 16, 16, 17, 17, 18, 18, 19, 20, 21, 22, 24]
Mode = 16
$$ \bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i $$
Where,
Xi = Observed values of the variable X
n = Number of observations (sample size)
[Dot plot of the same 20 observations, with the mean, median, and mode marked]
Mean = 15.85
Median and Mode = 16
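The three measures of central tendency can be checked with a short Python sketch. The data list is the deck's 20-observation example; the variable names are illustrative only.

```python
import statistics

# The 20 observations from the deck's running example.
data = [6, 9, 10, 12, 13, 14, 14, 15, 16, 16,
        16, 17, 17, 18, 18, 19, 20, 21, 22, 24]

mean = sum(data) / len(data)      # sum of observations divided by n
median = statistics.median(data)  # middle of the ordered data
mode = statistics.mode(data)      # most frequent value

print("mean:", mean)
print("median:", median)
print("mode:", mode)
```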
Kurtosis
Measure of flatness or peakedness of a frequency distribution
• Platykurtic (relatively flat)
• Mesokurtic (normal)
• Leptokurtic (relatively peaked)
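One common way to put a number on peakedness is the moment-based excess kurtosis, sketched below. The slides do not state which formula they have in mind, so this choice is an assumption, and both small data sets are hypothetical. Under this measure, values below zero suggest a platykurtic shape, values near zero a mesokurtic shape, and values above zero a leptokurtic shape.

```python
def excess_kurtosis(values):
    """Moment-based excess kurtosis: m4 / m2**2 - 3 (assumed definition)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    return m4 / m2 ** 2 - 3.0

# Hypothetical data: evenly spread values give a flat shape.
flat = [1, 2, 3, 4, 5]
# Hypothetical data: most values at the center plus two extreme values.
peaked = [0, 0, 0, 0, 0, 0, 0, 0, -5, 5]

print("flat:", excess_kurtosis(flat))      # negative: relatively flat
print("peaked:", excess_kurtosis(peaked))  # positive: relatively peaked
```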
[Figure: symmetric and skewed distributions, showing the positions of the mean, median, and mode]
Range
Difference between maximum and minimum values
Interquartile Range
Difference between third and first quartiles (Q3 - Q1)
Variance
Average of the squared deviations from the mean (with divisor n - 1 for a sample)
Standard Deviation
Square root of the variance
$$ \mathrm{Var}(X) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} $$
The variance can never be negative.
Coefficient of variation: $CV = s_x / \bar{X}$
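The measures of variability above can be computed directly, as in this sketch on the deck's 20-observation example. The interquartile range is left out because it depends on the quartile convention, which the slides do not specify.

```python
import math

# The 20 observations from the deck's running example.
data = [6, 9, 10, 12, 13, 14, 14, 15, 16, 16,
        16, 17, 17, 18, 18, 19, 20, 21, 22, 24]

n = len(data)
mean = sum(data) / n

value_range = max(data) - min(data)                       # maximum minus minimum
variance = sum((x - mean) ** 2 for x in data) / (n - 1)   # sample variance
std_dev = math.sqrt(variance)                             # square root of the variance
cv = std_dev / mean                                       # coefficient of variation

print("range:", value_range)
print("variance:", round(variance, 6))
print("standard deviation:", round(std_dev, 2))
print("CV:", round(cv, 4))
```

Because the variance is a sum of squares divided by a positive number, it cannot come out negative.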
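Variance and Standard Deviation
Population Variance
$$ \sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} = \frac{\sum_{i=1}^{N} x_i^2 - \frac{\left(\sum_{i=1}^{N} x_i\right)^2}{N}}{N}, \qquad \sigma = \sqrt{\sigma^2} $$
Sample Variance
$$ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n - 1}, \qquad s = \sqrt{s^2} $$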
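x      x - mean    (x - mean)^2    x^2
6      -9.85       97.0225         36
9      -6.85       46.9225         81
10     -5.85       34.2225         100
12     -3.85       14.8225         144
13     -2.85       8.1225          169
14     -1.85       3.4225          196
14     -1.85       3.4225          196
15     -0.85       0.7225          225
16     0.15        0.0225          256
16     0.15        0.0225          256
16     0.15        0.0225          256
17     1.15        1.3225          289
17     1.15        1.3225          289
18     2.15        4.6225          324
18     2.15        4.6225          324
19     3.15        9.9225          361
20     4.15        17.2225         400
21     5.15        26.5225         441
22     6.15        37.8225         484
24     8.15        66.4225         576
Sum    317         0               378.5500        5403
(The Sum row gives the totals of x, x - mean, (x - mean)^2, and x^2, in that order.)
$$ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{378.55}{20 - 1} = \frac{378.55}{19} = 19.923684 $$
$$ s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n - 1} = \frac{5403 - \frac{317^2}{20}}{20 - 1} = \frac{5403 - \frac{100489}{20}}{19} = \frac{5403 - 5024.45}{19} = \frac{378.55}{19} = 19.923684 $$
$$ s = \sqrt{s^2} = \sqrt{19.923684} = 4.46 $$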
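The two routes to the sample variance (squared deviations from the mean, and the shortcut using the sum of x and the sum of x squared) can be checked numerically. This is a sketch on the same 20 observations; the variable names are illustrative.

```python
import math

# The 20 observations from the worked example.
data = [6, 9, 10, 12, 13, 14, 14, 15, 16, 16,
        16, 17, 17, 18, 18, 19, 20, 21, 22, 24]

n = len(data)
sum_x = sum(data)                   # sum of x
sum_x2 = sum(x * x for x in data)   # sum of x squared
mean = sum_x / n

# Route 1: sum of squared deviations from the mean, divided by n - 1.
var_definition = sum((x - mean) ** 2 for x in data) / (n - 1)

# Route 2: shortcut formula using only the two column totals.
var_shortcut = (sum_x2 - sum_x ** 2 / n) / (n - 1)

std_dev = math.sqrt(var_shortcut)

print("sum x:", sum_x, "sum x^2:", sum_x2)
print("variance (definition):", round(var_definition, 6))
print("variance (shortcut):", round(var_shortcut, 6))
print("standard deviation:", round(std_dev, 2))
```

The shortcut is convenient by hand because it needs only two totals, though it can lose precision in floating point when the mean is large relative to the spread.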
Chebyshev’s Theorem
Applies to any distribution, regardless of shape
Places lower limits on the percentages of observations within a given
number of standard deviations from the mean
Empirical Rule
Applies only to roughly bell-shaped and symmetric distributions
Specifies approximate percentages of observations within a given number
of standard deviations from the mean
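At least the following percentages of observations lie within the given number of standard deviations of the mean:
2 standard deviations: $1 - \frac{1}{2^2} = 1 - \frac{1}{4} = \frac{3}{4} = 75\%$
3 standard deviations: $1 - \frac{1}{3^2} = 1 - \frac{1}{9} = \frac{8}{9} = 89\%$
4 standard deviations: $1 - \frac{1}{4^2} = 1 - \frac{1}{16} = \frac{15}{16} = 94\%$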
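The Chebyshev lower bound can be compared with what a data set actually shows. The sketch below uses the deck's 20-observation example and its sample standard deviation, which is an assumption made for illustration (the theorem itself is stated for a distribution's own mean and standard deviation). The function names are hypothetical.

```python
import math

def chebyshev_lower_bound(k):
    """Minimum fraction of observations within k standard deviations (k > 1)."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k ** 2

def fraction_within(values, k):
    """Observed fraction of values within k sample standard deviations of the mean."""
    n = len(values)
    mean = sum(values) / n
    std_dev = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return sum(1 for v in values if abs(v - mean) <= k * std_dev) / n

data = [6, 9, 10, 12, 13, 14, 14, 15, 16, 16,
        16, 17, 17, 18, 18, 19, 20, 21, 22, 24]

for k in (2, 3, 4):
    print(k, "bound:", round(chebyshev_lower_bound(k), 4),
          "observed:", fraction_within(data, k))
```

The observed fractions are well above the bounds here, which is expected: the theorem gives a guaranteed minimum, not an estimate.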
Exhaustive
• Every observation is assigned to a group
[Figure: histogram of Frequency (vertical axis, 0 to 8) against Familiarity (horizontal axis, 2 to 7)]
Histogram Example
Frequency Histogram
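Frequency table columns: x = Spending Class ($); f(x) = Frequency (number of customers); f(x)/n = Relative Frequency. Totals: 184 customers, relative frequency 1.000.
Cumulative table columns: x = Spending Class ($); F(x) = Cumulative Frequency; F(x)/n = Cumulative Relative Frequency.
Frequency output columns: Value label | Value | Frequency (N) | Percentage | Valid percentage | Cumulative percentage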
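A frequency table with relative and cumulative columns can be built as sketched below. The spending data from the slide did not survive, so the deck's 20-observation example is used instead, and the class width of 5 is a hypothetical choice. Each observation falls in exactly one class, so the grouping is exhaustive.

```python
from collections import Counter

# The 20 observations from the deck's running example (not the spending data).
data = [6, 9, 10, 12, 13, 14, 14, 15, 16, 16,
        16, 17, 17, 18, 18, 19, 20, 21, 22, 24]

class_width = 5  # hypothetical class width chosen for illustration

def class_lower(x):
    """Lower limit of the class [lower, lower + class_width) that contains x."""
    return (x // class_width) * class_width

counts = Counter(class_lower(x) for x in data)
n = len(data)

rows = []
cumulative = 0
for lower in sorted(counts):
    freq = counts[lower]
    cumulative += freq
    # (class lower, class upper, f(x), f(x)/n, F(x), F(x)/n)
    rows.append((lower, lower + class_width, freq, freq / n,
                 cumulative, cumulative / n))

for lower, upper, freq, rel, cum, cum_rel in rows:
    print(f"{lower:>2} to <{upper:<2}  f={freq:<2} f/n={rel:.2f}  F={cum:<2} F/n={cum_rel:.2f}")
```

The last cumulative frequency equals n and the last cumulative relative frequency equals 1, which is a quick check that no observation was dropped.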
Bar Graphs
Heights of rectangles represent group frequencies
Frequency Polygons
Height of line represents frequency
Ogives
Height of line represents cumulative frequency
Time Plots
Represents values over time
[Chart: survey responses, with labels including "Enjoy job, but it is not on my career path" and "My job just pays the bills"; visible values 19.0% and 23.0%]
[Chart: quarterly values, 1Q 2003 to 1Q 2004]
[Bar graph: mean minutes by drug (Remicade, Enbrel, Humira, Kineret, Amevive, Raptiva, Asacol, Pentasa)]
[Bar graph: percent of patients by drug (Videx EC, Zerit_d4T, Sustiva)]
[Bar graph: % change in share for Remicade, Humira, Antegren, and Other]
[Frequency polygon and ogive (cumulative frequency or relative frequency graph) of Sales, horizontal axis 0 to 50]
[Time plot: millions of tons by month]
In simple language, the correlation coefficient measures the
extent to which values of two variables are "proportional" to each other.
The value of the correlation (i.e., correlation coefficient) does not depend on the
specific measurement units used; for example, the correlation between height
and weight will be identical regardless of whether inches and pounds, or
centimeters and kilograms are used as measurement units.
Proportional means linearly related; that is, the correlation is high if the
relationship between the two variables can be approximated by a straight line
(sloped upwards or downwards). This line is called the regression line or least
squares line, because it is chosen so that the sum of the squared distances of
all the data points from the line is as small as possible.
Pearson correlation assumes that the two variables are measured on at least
interval scales. The Pearson product moment correlation coefficient is calculated
as follows:
$$ r_{12} = \frac{\sum_{i} (Y_{i1} - \bar{Y}_1)(Y_{i2} - \bar{Y}_2)}{\left[ \sum_{i} (Y_{i1} - \bar{Y}_1)^2 \, \sum_{i} (Y_{i2} - \bar{Y}_2)^2 \right]^{1/2}} $$
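The product moment formula and the least squares line can be sketched in a few lines of Python. The height and weight figures below are hypothetical, invented only to illustrate that the correlation does not change when the units change; the function names are also illustrative.

```python
import math

def pearson_r(xs, ys):
    """Pearson product moment correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def least_squares_line(xs, ys):
    """Slope and intercept minimizing the sum of squared vertical distances."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical measurements in inches and pounds.
heights_in = [60, 62, 65, 68, 70, 72]
weights_lb = [115, 120, 140, 155, 165, 180]

# The same measurements converted to centimeters and kilograms.
heights_cm = [h * 2.54 for h in heights_in]
weights_kg = [w * 0.45359237 for w in weights_lb]

r_imperial = pearson_r(heights_in, weights_lb)
r_metric = pearson_r(heights_cm, weights_kg)
slope, intercept = least_squares_line(heights_in, weights_lb)

print("r (inches, pounds):", round(r_imperial, 4))
print("r (cm, kg):", round(r_metric, 4))
print("least squares line: weight =", round(intercept, 2), "+", round(slope, 2), "* height")
```

The slope and intercept do depend on the units, unlike the correlation coefficient.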
Example:
Year    Price    Index (base 1992)
1992    225      225*100/225 = 100
1996    240      240*100/225 = 106.6
2000    250      250*100/225 = 111.1
2004    255      255*100/225 = 113.3
Changing the base to 1996, the index for 2000 is 250*100/240 = 104.2, or equivalently 111.1*100/106.6 = 104.2.
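The index calculation in the example can be reproduced as follows. The prices come from the table above; the function name is illustrative. The sketch also checks that re-basing through the 1992 index gives the same result as dividing by the 1996 price directly.

```python
# Prices by year, taken from the example table.
prices = {1992: 225, 1996: 240, 2000: 250, 2004: 255}

def index_numbers(price_by_year, base_year):
    """Index of each year's price relative to the base year (base = 100)."""
    base = price_by_year[base_year]
    return {year: price * 100 / base for year, price in price_by_year.items()}

index_1992 = index_numbers(prices, 1992)
index_1996 = index_numbers(prices, 1996)

# Changing the base from 1992 to 1996 using the index values alone.
rebased_2000 = index_1992[2000] * 100 / index_1992[1996]

for year in sorted(prices):
    print(year, prices[year], round(index_1992[year], 1), round(index_1996[year], 1))
print("2000 re-based to 1996:", round(rebased_2000, 1))
```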
                    Gender
Internet Usage      Male    Female    Row Total
Light (1)           5       10        15
Heavy (2)           10      5         15
Column Total        15      15        30
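A cross tabulation is a count of records for each combination of two categorical variables. The sketch below rebuilds the table's counts from individual records; the record list is constructed to match the cell counts in the table, and the variable names are illustrative.

```python
from collections import Counter

# One (internet usage, gender) pair per respondent, matching the table's counts.
records = (
    [("Light", "Male")] * 5
    + [("Light", "Female")] * 10
    + [("Heavy", "Male")] * 10
    + [("Heavy", "Female")] * 5
)

cells = Counter(records)                                # cell counts
row_totals = Counter(usage for usage, _ in records)     # totals by internet usage
col_totals = Counter(gender for _, gender in records)   # totals by gender

for usage in ("Light", "Heavy"):
    print(usage, cells[(usage, "Male")], cells[(usage, "Female")], row_totals[usage])
print("Column Total", col_totals["Male"], col_totals["Female"], len(records))
```

Dividing each cell by its column total gives column percentages, for example 5 of 15 males (about 33%) are light users.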
Note that probabilities can range from 0 to 1.0. If in your calculations you
arrive at a probability greater than 1.0, you did something wrong.