ch2
[Figure: scattered height values in meters, including 1.5, 1.67, 1.8, 1.4, 1.9, the five values tabulated below]
Subject ID   Height (meters)
1            1.50
2            1.67
3            1.80
4            1.40
5            1.90
Graphical descriptions
MEASURES OF THE CENTER OF THE DATA
RANGE
VARIANCE/STANDARD DEVIATION (σ² / σ ESTIMATES)
Some other ????
Definition:
Numerical descriptive measures computed for the population
are called parameters; their counterparts
computed from the sample are called statistics.
Statistics
Parameters
2.2 Measures of center
• Measurements: 2, 4, 9, 8, 6, 5, 3 n=7
• Sort: 2, 3, 4, 5, 6, 8, 9
• Position: .5(n + 1) = .5(7 + 1) = 4th
Median = 4th measurement in the sorted list = 5
• The set: 2, 4, 9, 8, 6, 5 n=6
• Sort: 2, 4, 5, 6, 8, 9
• Position: .5(n + 1) = .5(6 + 1) = 3.5th
Median = (5 + 6)/2 = 5.5 — average of the 3rd and 4th
measurements
The height example: median=1.67
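The position rule above can be sketched in Python. This is only an illustration: the function name `median_by_position` is made up here and is not part of the course material.

```python
def median_by_position(measurements):
    """Median via the 0.5(n + 1) position rule applied to the sorted data."""
    ordered = sorted(measurements)
    n = len(ordered)
    position = 0.5 * (n + 1)            # 1-based position in the sorted list
    lower = ordered[int(position) - 1]  # value at the whole-number part
    if position == int(position):       # odd n: position is a whole number
        return lower
    upper = ordered[int(position)]      # even n: average the two neighbors
    return (lower + upper) / 2


print(median_by_position([2, 4, 9, 8, 6, 5, 3]))            # 5
print(median_by_position([2, 4, 9, 8, 6, 5]))               # 5.5
print(median_by_position([1.50, 1.67, 1.80, 1.40, 1.90]))   # 1.67
```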
Definition:
The mode is the most frequently occurring
measurement.
• The mode is the measurement which occurs
most frequently.
• Measurements: 2, 4, 9, 8, 8, 5, 3
– The mode is 8, which occurs twice
• Measurements : 2, 2, 9, 8, 8, 5, 3
– There are two modes—8 and 2 (bimodal)
• Measurements : 2, 4, 9, 8, 5, 3
– There is no mode (each value is unique).
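The three cases above can be checked with a short Python sketch. The helper `modes` is a hypothetical name; it follows the convention used here that a data set whose values are all unique has no mode.

```python
from collections import Counter


def modes(measurements):
    """Return all most-frequent values; an empty list means 'no mode'."""
    counts = Counter(measurements)
    highest = max(counts.values())
    if highest == 1:                 # every value is unique
        return []
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([2, 4, 9, 8, 8, 5, 3]))   # [8]
print(modes([2, 2, 9, 8, 8, 5, 3]))   # [2, 8]  (bimodal)
print(modes([2, 4, 9, 8, 5, 3]))      # []      (no mode)
```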
Example
The number of quarts of milk
purchased by 25 households
(ordered):
0 0 1 1 1 1 1 2 2 2 2 2 2 2 2
2 3 3 3 3 3 4 4 4 5
• Mean?
$$\bar{x} = \frac{\sum x_i}{n} = \frac{55}{25} = 2.2$$
• Median?
m=2
• Mode? (Highest peak)
mode = 2
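As a check on the milk example, the three measures can be computed with Python's standard `statistics` module. The variable name `quarts` is chosen here for illustration.

```python
from statistics import mean, median, mode

# Quarts of milk purchased by the 25 households (ordered)
quarts = [0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2,
          2, 3, 3, 3, 3, 3, 4, 4, 4, 5]

print(len(quarts))      # 25
print(mean(quarts))     # 2.2
print(median(quarts))   # 2
print(mode(quarts))     # 2
```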
Effect of extreme values on center
• The mean is more easily affected by
EXTREME values than the median.
– Example:
data: 1, 3, 5, 7, 8
mean = 24/5 = 4.8; median = 5
data (modified): 1, 3, 5, 7, 100
mean = 116/5 = 23.2; median = 5
• The median is used as the measure of center when
the data are skewed.
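The effect of the single extreme value can be reproduced in a few lines of Python (a sketch; the variable names are illustrative):

```python
from statistics import mean, median

original = [1, 3, 5, 7, 8]
modified = [1, 3, 5, 7, 100]   # largest value replaced by an extreme one

print(mean(original), median(original))   # 4.8 5
print(mean(modified), median(modified))   # 23.2 5

# The mean shifts by (100 - 8) / 5 = 18.4; the median does not move.
shift_in_mean = mean(modified) - mean(original)
shift_in_median = median(modified) - median(original)
```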
Relation among mean, median and
skewness
• The mean follows the direction of the skewness (typically):
Symmetric: Mean = Median
Skewed right: Mean > Median
Skewed left: Mean < Median
Deviations from the mean: $x_i - \bar{x}$
$$\bar{x} = \frac{45}{5} = 9$$
[Dot plot of the measurements on an axis from 4 to 14]
Definition:
The variance, σ², of a population of N measurements
is the average of the squared deviations of the
measurements about their mean µ.
$$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$$
Definition:
The variance, s², of a sample of n measurements
is the sum of the squared deviations of the
measurements about their mean, divided by (n – 1).
$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$
Standard deviation
• In forming the variance we squared the
deviations from the mean, thus squaring also
the scale
• To put back the measure of variability into
the original scale of the units we take the
positive square root of the variance
x_i    x_i − x̄    (x_i − x̄)²
5      -4         16
12     3          9
6      -3         9
8      -1         1
14     5          25
Sum 45 0          60

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{60}{4} = 15$$
$$s = \sqrt{s^2} = \sqrt{15} = 3.87$$
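The worked table can be reproduced directly from the definition. A minimal sketch, with `sample_variance` as an illustrative name:

```python
import math


def sample_variance(xs):
    """Sum of squared deviations about the mean, divided by (n - 1)."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)


data = [5, 12, 6, 8, 14]
s2 = sample_variance(data)
s = math.sqrt(s2)

print(s2)            # 15.0
print(round(s, 2))   # 3.87
```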
• Another way of calculating standard
deviations:
Use the Computational Formula:
$$s^2 = \frac{\sum x_i^2 - \frac{(\sum x_i)^2}{n}}{n - 1}$$

x_i    x_i²
5      25
12     144
6      36
8      64
14     196
Sum 45 465

$$s^2 = \frac{465 - \frac{45^2}{5}}{4} = \frac{465 - 405}{4} = 15$$
$$s = \sqrt{s^2} = \sqrt{15} = 3.87$$
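The computational formula needs only the two column sums. A short Python sketch (the function name is illustrative):

```python
import math


def sample_variance_computational(xs):
    """Sample variance from the sum of x and the sum of x squared."""
    n = len(xs)
    sum_x = sum(xs)                    # 45 for the data below
    sum_x2 = sum(x * x for x in xs)    # 465 for the data below
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)


data = [5, 12, 6, 8, 14]
s2 = sample_variance_computational(data)

print(s2)                       # 15.0
print(round(math.sqrt(s2), 2))  # 3.87
```

With floating-point data whose mean is large relative to its spread, this form can lose precision because it subtracts two nearly equal numbers; the definitional formula is usually the safer choice in code.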
Summary & Remarks
• The value of s (stdev) is always greater than or
equal to zero.
• The larger s is, the greater the variability of
the data.
• If s is zero, all measurements must be equal
(no variability/spread).
• Why do we divide by (n-1) instead of n?
Dividing by n tends to underestimate σ²; dividing by
(n-1) gives an unbiased estimate, and the difference
matters most when the sample size is small.
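One way to see what the (n-1) divisor buys is to enumerate every possible sample. The sketch below makes two assumptions for illustration only: the five values 5, 12, 6, 8, 14 are treated as an entire population, and samples of size 2 are drawn with replacement. Exact fractions are used so the averages carry no rounding error.

```python
from fractions import Fraction
from itertools import product

population = [5, 12, 6, 8, 14]    # assumed to be the whole population
N = len(population)
mu = Fraction(sum(population), N)
sigma2 = sum((x - mu) ** 2 for x in population) / N   # population variance


def average_estimate(sample_size, divisor):
    """Average of SS/divisor over all ordered samples drawn with replacement."""
    samples = list(product(population, repeat=sample_size))
    total = Fraction(0)
    for sample in samples:
        xbar = Fraction(sum(sample), sample_size)
        ss = sum((x - xbar) ** 2 for x in sample)
        total += ss / divisor
    return total / len(samples)


n = 2
print(sigma2)                       # 12
print(average_estimate(n, n - 1))   # 12 -> matches sigma^2 on average
print(average_estimate(n, n))       # 6  -> too small on average
```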
2.4 Practical use of the standard
deviation
Tchebysheff’s theorem:
Given a number k greater than or equal to 1
and a set of n measurements, at least
1 − (1/k²) of the measurements will lie within k
standard deviations of the mean.
• For k=2, 1 − (1/k²) = 3/4, so at least 3/4 of the n
measurements fall within 2 standard deviations of the mean (in
the interval µ ± 2σ).
• For k=3, at least 8/9 of the n measurements fall
within µ ± 3σ.
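The bound is simple to tabulate. A minimal sketch, using exact fractions so that 3/4 and 8/9 print as written above; `tchebysheff_lower_bound` is an illustrative name.

```python
from fractions import Fraction


def tchebysheff_lower_bound(k):
    """Smallest fraction of measurements within k standard deviations."""
    if k < 1:
        raise ValueError("the theorem is stated for k >= 1")
    return 1 - 1 / Fraction(k) ** 2


for k in (1, 2, 3):
    print(k, tchebysheff_lower_bound(k))   # 1 0, then 2 3/4, then 3 8/9
```

For k = 1 the bound is 0, so the theorem says nothing useful about the interval µ ± σ.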
Empirical rule
Given a distribution of measurements that is
approximately mound-shaped (bell-shaped):
The interval µ ± σ contains approximately
68% of the measurements.
The interval µ ± 2σ contains approximately
95% of the measurements.
The interval µ ± 3σ contains approximately
99.7% of the measurements.
Tchebysheff’s vs Empirical
• Tchebysheff’s rule is true for all
distributions and gives a lower bound on the
fraction of measurements falling into an
interval.
• The Empirical rule is only a descriptive,
approximate rule. It is more accurate than
Tchebysheff’s when the distribution is bell-shaped.
Example
Time from onset to recurrence of a disease for
50 patients (ex01.22)
2.1   4.4   2.7   32.3  9.9
9.0   2.0   6.6   3.9   1.6
14.7  9.6   16.7  7.4   8.2
19.2  6.9   4.3   3.3   1.2
4.1   18.4  0.2   6.1   13.5
7.4   0.2   8.3   0.3   1.3
14.1  1.0   2.4   2.4   18.0
8.7   24.0  1.4   8.2   5.8
1.6   3.5   11.4  18.0  26.7
3.7   12.6  23.1  5.6   0.4
[Histogram of TIME; vertical axis: Frequency]
x̄ = 8.37, s = 7.67
• Do the actual proportions in the three intervals
(x̄ ± s, x̄ ± 2s, x̄ ± 3s) agree with those given by
Tchebysheff’s Theorem?
Yes. Tchebysheff’s Theorem must be true for any data set.
• Do they agree with the Empirical Rule?
No. Not very well.
• Why or why not?
The data distribution is not very mound-shaped, but skewed right.
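The comparison can be carried out numerically. In the sketch below the list `times` is an assumption about how the 50 recurrence times read (with 3.7 appearing once); it reproduces the stated x̄ = 8.37 and s = 7.67. The names `times` and `proportion_within` are illustrative.

```python
from statistics import mean, stdev

# Assumed reading of the 50 recurrence times (see the note above)
times = [
    2.1, 4.4, 2.7, 32.3, 9.9,
    9.0, 2.0, 6.6, 3.9, 1.6,
    14.7, 9.6, 16.7, 7.4, 8.2,
    19.2, 6.9, 4.3, 3.3, 1.2,
    4.1, 18.4, 0.2, 6.1, 13.5,
    7.4, 0.2, 8.3, 0.3, 1.3,
    14.1, 1.0, 2.4, 2.4, 18.0,
    8.7, 24.0, 1.4, 8.2, 5.8,
    1.6, 3.5, 11.4, 18.0, 26.7,
    3.7, 12.6, 23.1, 5.6, 0.4,
]

xbar = mean(times)   # about 8.37
s = stdev(times)     # about 7.67 (sample standard deviation, n - 1 divisor)


def proportion_within(k):
    """Fraction of the measurements inside xbar +/- k*s."""
    low, high = xbar - k * s, xbar + k * s
    return sum(low <= t <= high for t in times) / len(times)


empirical = {1: 0.68, 2: 0.95, 3: 0.997}
for k in (1, 2, 3):
    tchebysheff = 1 - 1 / k ** 2
    print(k, proportion_within(k), round(tchebysheff, 3), empirical[k])
# Actual proportions: 0.74, 0.94, 0.98
```

Every actual proportion is at least the Tchebysheff bound (0, 0.75, 0.889), while the match with 0.68, 0.95, 0.997 is only rough, as expected for right-skewed data.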
Example
• The length of time for a worker to complete a
specified operation averages 12.8 minutes
with a standard deviation of 1.7 minutes. If
the distribution of times is approximately
mound-shaped, what proportion of workers
will take longer than 16.2 minutes to
complete the task?
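16.2 = 12.8 + 2(1.7), so 16.2 lies 2 standard deviations above the mean.
By the Empirical Rule, about 95% of the times fall between 9.4 and 16.2,
so by symmetry about 47.5% fall between 12.8 and 16.2.
(50 − 47.5)% = 2.5% of workers take longer than 16.2 minutes.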
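The same reasoning as a small Python sketch. It assumes, as the example does, a mound-shaped and roughly symmetric distribution, and it only handles thresholds that sit exactly 1, 2, or 3 standard deviations above the mean; the names are illustrative.

```python
# Empirical Rule: approximate fraction within k standard deviations
EMPIRICAL = {1: 0.68, 2: 0.95, 3: 0.997}


def upper_tail_fraction(mean, sd, threshold):
    """Approximate fraction above a threshold k standard deviations over the mean."""
    k = round((threshold - mean) / sd, 6)   # rounding absorbs float noise
    if k not in EMPIRICAL:
        raise ValueError("threshold must be 1, 2 or 3 standard deviations above the mean")
    # By symmetry, half of what lies outside the interval is in the upper tail.
    return (1 - EMPIRICAL[k]) / 2


tail = upper_tail_fraction(12.8, 1.7, 16.2)
print(round(tail, 3))   # 0.025
```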
2.5 A check on the value of s
• From Tchebysheff’s Theorem and the Empirical Rule,
almost all measurements lie within 2 to 3 standard deviations
of the mean, so the range R spans about 4s to 6s.
• To approximate the stdev use:
s ≈ R / 4
or s ≈ R / 6 for a large data set.
Example (ex01.22)
• The disease recurrence data:
2.1   4.4   2.7   32.3  9.9
9.0   2.0   6.6   3.9   1.6
14.7  9.6   16.7  7.4   8.2
19.2  6.9   4.3   3.3   1.2
4.1   18.4  0.2   6.1   13.5
7.4   0.2   8.3   0.3   1.3
14.1  1.0   2.4   2.4   18.0
8.7   24.0  1.4   8.2   5.8
1.6   3.5   11.4  18.0  26.7
3.7   12.6  23.1  5.6   0.4
R = 32.3-.2 = 32.1
s ≈ R / 4 = 32.1 / 4 = 8.025
Actual s = 7.67
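The range check uses only the largest and smallest measurements, so it fits in a few lines of Python (a sketch; the variable names are illustrative):

```python
largest, smallest = 32.3, 0.2   # extremes of the recurrence data
actual_s = 7.67                 # value stated above

R = largest - smallest
approx_s = R / 4                # rule of thumb: s is roughly R / 4
approx_s_large = R / 6          # alternative for a large data set

print(round(R, 1))               # 32.1
print(round(approx_s, 3))        # 8.025
print(round(approx_s_large, 2))  # 5.35
```

Here R/4 is the closer of the two approximations to the actual s = 7.67, so the computed standard deviation passes the check.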
2.6 Measures of relative standing
Z-scores
• Where does one particular measurement
stand in relation to the other measurements in
the data set?
Definition:
The sample z-score is a measure of relative
standing of individual measurements defined by:
$$z\text{-score} = \frac{x - \bar{x}}{s}$$
• The z-score tells us how many standard
deviations away from the mean the
measurement lies.
Suppose s = 2 and x̄ = 5.
x = 9 lies z = (9 − 5)/2 = 2 standard deviations from the mean.
• From Tchebysheff’s Theorem and the Empirical Rule
– At least 3/4 and more likely 95% of measurements lie
within 2 standard deviations of the mean.
– At least 8/9 and more likely 99.7% of measurements lie
within 3 standard deviations of the mean.
• z-scores between –2 and 2 are not unusual. z-scores should
not be more than 3 in absolute value. z-scores larger than 3
in absolute value would indicate a possible outlier.
[Figure: z-score scale from -3 to 3, with a region labeled "Somewhat unusual"]
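A short Python sketch of the z-score and the rule of thumb above. The function names and the exact wording of the labels are illustrative; the cutoffs 2 and 3 come from the bullets above.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd


def describe(z):
    """Label a z-score using the cutoffs 2 and 3 in absolute value."""
    size = abs(z)
    if size > 3:
        return "possible outlier"
    if size > 2:
        return "somewhat unusual"
    return "not unusual"


print(z_score(9, 5, 2), describe(z_score(9, 5, 2)))   # 2.0 not unusual

# Largest recurrence time, using the stated mean 8.37 and s = 7.67
z_max = z_score(32.3, 8.37, 7.67)
print(round(z_max, 2), describe(z_max))               # 3.12 possible outlier
```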
Percentiles
Definition:
The pth percentile of a set of n ordered
measurements is the value of x that exceeds p% of
the measurements and is less than the remaining
(100-p)%.
[Figure: distribution with p% of the measurements below x, the p-th percentile, and (100-p)% above it]
Example
3. Using the following stem-and-leaf plot, compute the median, Q1, Q3, and IQR.
Stems Leaves
7 34
7 6799
8 123444
8 5667788
9 1223
9 788
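One possible way to work this exercise in Python is sketched below. It rests on two assumptions not stated in the plot: the stem is the tens digit and each leaf is a units digit, and the quartiles sit at positions .25(n + 1) and .75(n + 1), extending the .5(n + 1) rule used for the median, with linear interpolation between neighboring values. Other quartile conventions give slightly different answers.

```python
import math

# Assumption: stem = tens digit, leaf = units digit
stem_leaf = [(7, "34"), (7, "6799"), (8, "123444"),
             (8, "5667788"), (9, "1223"), (9, "788")]
values = sorted(10 * stem + int(leaf)
                for stem, leaves in stem_leaf for leaf in leaves)


def value_at_position(ordered, position):
    """Value at a 1-based, possibly fractional, position in sorted data."""
    lower_index = math.floor(position)
    fraction = position - lower_index
    lower = ordered[lower_index - 1]
    if fraction == 0:
        return lower
    upper = ordered[lower_index]
    return lower + fraction * (upper - lower)   # linear interpolation


n = len(values)                                    # 26
median = value_at_position(values, 0.50 * (n + 1)) # position 13.5
q1 = value_at_position(values, 0.25 * (n + 1))     # position 6.75
q3 = value_at_position(values, 0.75 * (n + 1))     # position 20.25
iqr = q3 - q1

print(median, q1, q3, iqr)   # 85.5 80.5 91.25 10.75
```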
Extra problems
Q: Amount of Food Sold. Suppose the hourly dollar amount of food sold by a local
restaurant follows an approximately mound-shaped distribution, with a mean sales level of
$400 per hour and a standard deviation of $60 per hour.
Q: If the 18th and 19th observations in a set of 25 data values are 42.6 and 43.8, what is the
70th percentile value?