
ch2

Chapter 2 covers measures of center (mean, median, mode) and variability (range, variance, standard deviation) in data analysis. It includes a case study on adult female heights in Canada, demonstrating how to compute these statistics and their implications. The chapter also discusses the empirical rule and Tchebysheff's theorem for understanding data spread.

Chapter 2

• In this chapter we shall learn:


– Measures of center: mean, median, mode
– Measures of variability(spread): range,
variance, standard deviation
– Empirical rule
– Other ways of thinking about spread:
Percentiles, quartiles, inter-quartile range
2.1 Describing data with numerical
measures
• Graphical methods of describing data are
somewhat imprecise.
– Hard to present graphs verbally
– You cannot explain the “how much” (e.g.,
how much skewness, how much is the
center etc…)
A case study
• A case study: we want to study what adult female heights look like in Canada
– What is the characteristic (variable) of interest? Height
– Type of measurement? Quantitative, continuous (recorded in meters in the sample below)
– What are the experimental units on which measurements are taken? Adult female Canadians
– What is the population of experimental units? All adult female Canadians
– How large a sample, and how is it selected? (big topic) …
Suppose we pick a sample of five experimental units:

subject ID   Height (meters)
1            1.50
2            1.67
3            1.80
4            1.40
5            1.90

[Figure: the population of Canadian female heights, and the sample of heights 1.5, 1.67, 1.8, 1.4, 1.9 drawn from it]

Graphical descriptions
MEASURES OF THE CENTER OF THE DATA

Mean(otherwise known as: average, expectation),


Median
Mode
Some other ????

MEASURES OF THE SPREAD(VARIABILITY) OF THE DATA

RANGE
VARIANCE / STANDARD DEVIATION (estimates of σ² / σ)
Some other ????
Definition:
Numerical descriptive measures computed for the population are called parameters; their counterparts computed from the sample are called statistics.

Quantity                       Population (parameter)   Sample (statistic)
Mean (average, expectation)    $\mu$                    $\bar{x}$
Median                         m                        m
Variance                       $\sigma^2$               $s^2$
Standard deviation             $\sigma$                 $s$
2.2 Measures of center

• A measure of center is a number locating the center of the distribution of the population or the sample
Definition:
The arithmetic mean, or average, of a set of n measurements is equal to the sum of the measurements divided by n:
$$\bar{x} = \frac{\sum x_i}{n}$$
where n = number of measurements
$\sum x_i$ = sum of all the measurements
Example
• Suppose our population is all adult Canadians, and our variable (characteristic of interest) is height:
– If we were able to reach all adult Canadians, the sum of their heights divided by the total number of adult Canadians would be the population mean height, µ
– Suppose we only reached 5 people (a sample);
Data: 1.5, 1.67, 1.8, 1.4, 1.9 (in meters)
$$\bar{x} = \frac{\sum x_i}{n} = \frac{1.5 + 1.67 + 1.8 + 1.4 + 1.9}{5} = \frac{8.27}{5} = 1.654$$
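The sample-mean calculation above can be checked with a few lines of Python. This is only a minimal sketch; the names `sample_mean` and `heights_m` are illustrative and do not come from the slides.

```python
def sample_mean(measurements):
    """Arithmetic mean: the sum of the measurements divided by n."""
    n = len(measurements)
    if n == 0:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(measurements) / n


# The five sampled heights from the slide, in meters.
heights_m = [1.5, 1.67, 1.8, 1.4, 1.9]
x_bar = sample_mean(heights_m)
print(round(x_bar, 3))  # 1.654
```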
Definition:
The median of a set of measurements is
the middle measurement when the
measurements are ranked from smallest to
largest.
• The position of the median is
.5(n + 1)
once the measurements have been ordered. If this position is not a whole number, the median is the average of the two middle measurements.
Example

• Measurements: 2, 4, 9, 8, 6, 5, 3 (n = 7)
• Sort: 2, 3, 4, 5, 6, 8, 9
• Position: .5(n + 1) = .5(7 + 1) = 4th
Median = 4th ordered measurement = 5
• Measurements: 2, 4, 9, 8, 6, 5 (n = 6)
• Sort: 2, 4, 5, 6, 8, 9
• Position: .5(n + 1) = .5(6 + 1) = 3.5th
Median = (5 + 6)/2 = 5.5, the average of the 3rd and 4th ordered measurements
The height example: sorted data 1.4, 1.5, 1.67, 1.8, 1.9, so median = 1.67
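The position rule .5(n + 1) can be sketched in Python as follows. The function name `median_by_position` is hypothetical; the logic simply sorts the data, finds the 1-based position, and averages the two neighbours when the position ends in .5.

```python
def median_by_position(measurements):
    """Median via the slide's rule: position .5(n + 1) in the ordered data."""
    ordered = sorted(measurements)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    position = 0.5 * (n + 1)   # 1-based position in the ordered list
    lower = int(position)      # whole part of the position
    if position == lower:      # n is odd: a single middle measurement
        return ordered[lower - 1]
    # n is even: average the measurements on either side of the position
    return (ordered[lower - 1] + ordered[lower]) / 2


print(median_by_position([2, 4, 9, 8, 6, 5, 3]))      # 5
print(median_by_position([2, 4, 9, 8, 6, 5]))         # 5.5
print(median_by_position([1.5, 1.67, 1.8, 1.4, 1.9]))  # 1.67
```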
Definition:
The mode is the most frequently occurring measurement.
• Measurements: 2, 4, 9, 8, 8, 5, 3
– The mode is 8, which occurs twice
• Measurements: 2, 2, 9, 8, 8, 5, 3
– There are two modes, 8 and 2 (bimodal)
• Measurements: 2, 4, 9, 8, 5, 3
– There is no mode (each value is unique).
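The three cases above (one mode, two modes, no mode) can be reproduced with a small counting sketch. The helper name `modes` is illustrative, and returning an empty list when every value is unique is an assumption made to match the "no mode" convention used in these slides.

```python
from collections import Counter


def modes(measurements):
    """Return the most frequent value(s), or [] if every value is unique."""
    if not measurements:
        return []
    counts = Counter(measurements)
    highest = max(counts.values())
    if highest == 1:
        # Assumption: follow the slide's convention that unique values mean "no mode".
        return []
    return sorted(value for value, count in counts.items() if count == highest)


print(modes([2, 4, 9, 8, 8, 5, 3]))  # [8]
print(modes([2, 2, 9, 8, 8, 5, 3]))  # [2, 8]
print(modes([2, 4, 9, 8, 5, 3]))     # []
```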
Example
The number of quarts of milk
purchased by 25 households
(ordered):
0 0 1 1 1 1 1 2 2 2 2 2 2 2 2
2 3 3 3 3 3 4 4 4 5
• Mean?
$$\bar{x} = \frac{\sum x_i}{n} = \frac{55}{25} = 2.2$$
• Median?
m=2
• Mode? (Highest peak)
mode = 2
Effect of extreme values on center
• The mean is more easily affected by extreme values than the median.
– Example:
data: 1, 3, 5, 7, 8
mean = 24/5 = 4.8; median = 5
data (modified): 1, 3, 5, 7, 100
mean = 116/5 = 23.2; median = 5
• The median is preferred as the measure of center when the data are skewed.
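A quick way to see the effect of one extreme value is to compute both measures before and after the change. The sketch below uses Python's standard `statistics` module; the variable names are illustrative.

```python
from statistics import mean, median

original = [1, 3, 5, 7, 8]
modified = [1, 3, 5, 7, 100]  # the largest value replaced by an extreme one

for label, data in (("original", original), ("modified", modified)):
    # The mean moves with the extreme value; the median stays at 5.
    print(label, "mean =", mean(data), "median =", median(data))
```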
Relation among mean, median and
skewness
• The mean is pulled in the direction of the skew (toward the longer tail)
Symmetric: Mean = Median

Skewed right: Mean > Median

Skewed left: Mean < Median


2.3 Measures of variability
• Measures of variability describe how the population or the data spread about their center.
Range
Definition:
The range, R, of a set of n measurements
is the difference between the largest and
smallest measurements.
• Example: Heights of 5 adult Canadians:
1.5, 1.67, 1.8, 1.4, 1.9 (in meters)
• The range is R = 1.9 – 1.4 = .5
• Quick and easy, but only uses 2 of the 5 measurements.
Variance
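• The variance of a set of n measurements measures the average squared deviation of the measurements about their mean.
• Example: Data = {5, 12, 6, 8, 14}
$$\bar{x} = \frac{45}{5} = 9$$
[Figure: dot plot of the data on an axis from 4 to 14, marking the deviations $x_i - \bar{x}$ from the mean]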
Definition:
The variance, $\sigma^2$, of a population of N measurements is the average of the squared deviations of the measurements about their mean µ:
$$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$$

Definition:
The variance, $s^2$, of a sample of n measurements is the sum of the squared deviations of the measurements about their mean, divided by (n – 1):
$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$
Standard deviation
• In forming the variance we squared the
deviations from the mean, thus squaring also
the scale
• To put back the measure of variability into
the original scale of the units we take the
positive square root of the variance

Population standard deviation: $\sigma = \sqrt{\sigma^2}$

Sample standard deviation: $s = \sqrt{s^2}$
Example
• One way of calculating standard deviations:
Use the Definition Formula:

$x_i$    $x_i - \bar{x}$    $(x_i - \bar{x})^2$
5        -4                 16
12       3                  9
6        -3                 9
8        -1                 1
14       5                  25
Sum 45   0                  60

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{60}{4} = 15$$
$$s = \sqrt{s^2} = \sqrt{15} = 3.87$$
• Another way of calculating standard deviations:
Use the Computational Formula:

$x_i$    $x_i^2$
5        25
12       144
6        36
8        64
14       196
Sum 45   465

$$s^2 = \frac{\sum x_i^2 - \frac{(\sum x_i)^2}{n}}{n - 1} = \frac{465 - \frac{45^2}{5}}{4} = 15$$
$$s = \sqrt{s^2} = \sqrt{15} = 3.87$$
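Both routes to the sample variance can be compared in code. The sketch below implements the definition formula and the computational formula side by side; the function names are hypothetical, and the data are the five values from the example.

```python
import math


def variance_definition(xs):
    """Sample variance: sum of squared deviations about the mean, over (n - 1)."""
    n = len(xs)
    if n < 2:
        raise ValueError("at least two measurements are needed")
    x_bar = sum(xs) / n
    return sum((x - x_bar) ** 2 for x in xs) / (n - 1)


def variance_computational(xs):
    """Sample variance via the shortcut: (sum of x^2 - (sum of x)^2 / n) / (n - 1)."""
    n = len(xs)
    if n < 2:
        raise ValueError("at least two measurements are needed")
    return (sum(x * x for x in xs) - sum(xs) ** 2 / n) / (n - 1)


data = [5, 12, 6, 8, 14]
s2_def = variance_definition(data)
s2_comp = variance_computational(data)
s = math.sqrt(s2_def)
print(s2_def, s2_comp, round(s, 2))  # 15.0 15.0 3.87
```

The two formulas are algebraically identical; the computational form only needs the two running totals, the sum of the values and the sum of their squares.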
Summary & Remarks
• The value of s (stdev) is always greater than or equal to zero
• The larger s is, the greater the variability of the data
• If s is zero, all measurements must be equal (no variability/spread)
• Why do we divide by (n-1) instead of n? Dividing by n tends to underestimate the population variance σ²; dividing by (n-1) corrects for this, and the difference matters most when the sample size is small
2.4 Practical use of the standard
deviation
Tchebysheff’s theorem:
Given a number k greater than or equal to 1 and a set of n measurements, at least $1 - 1/k^2$ of the measurements will lie within k standard deviations of the mean.
• For k=2, $1 - 1/k^2 = 3/4$, so at least 3/4 of the n measurements fall within 2 standard deviations of the mean (in the interval µ ± 2σ)
• For k=3, $1 - 1/k^2 = 8/9$, so at least 8/9 of the n measurements fall within µ ± 3σ
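The lower bound in the theorem is a one-line formula, so it is easy to tabulate for several values of k. The function name `tchebysheff_lower_bound` below is illustrative only.

```python
def tchebysheff_lower_bound(k):
    """Minimum fraction of measurements within k standard deviations of the mean."""
    if k < 1:
        raise ValueError("the theorem is stated for k >= 1")
    return 1 - 1 / k ** 2


for k in (1, 2, 3):
    print(k, round(tchebysheff_lower_bound(k), 3))
# 1 0.0
# 2 0.75
# 3 0.889
```

For k = 1 the bound is 0, so the theorem says nothing useful about the interval µ ± σ.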
Empirical rule
Given a distribution of measurements that is approximately mound-shaped (bell-shaped):
The interval µ ± σ contains approximately 68% of the measurements.
The interval µ ± 2σ contains approximately 95% of the measurements.
The interval µ ± 3σ contains approximately 99.7% of the measurements.
Tchebysheff’s vs Empirical
• Tchebysheff’s theorem holds for every distribution and gives a lower bound on the fraction of measurements falling in an interval
• The Empirical rule is only a descriptive approximation. It is more accurate than Tchebysheff’s theorem when the distribution is bell-shaped
Example
Time from onset to recurrence of a disease for
50 patients (ex01.22)
2.1 3.7 12.6 23.1 5.6
9 4.4 2.7 32.3 9.9
14.7 2 6.6 3.9 1.6
19.2 9.6 16.7 7.4 8.2
4.1 6.9 4.3 3.3 1.2
7.4 18.4 0.2 6.1 13.5
14.1 0.2 8.3 0.3 1.3
8.7 1 2.4 2.4 18
1.6 24 1.4 8.2 5.8
3.7 3.5 11.4 18 26.7
0.4

[Figure: Histogram of TIME; x-axis: TIME (0.1 to 30.1), y-axis: Frequency]
$\bar{x} = 8.37$
s = 7.67
Shape? Skewed right
k   $\bar{x} \pm ks$   Interval           Proportion in interval   Tchebysheff    Empirical rule
1   8.37 ± 7.67        0.70 to 16.04      37/50 (.74)              At least 0     ≈ .68
2   8.37 ± 15.34       -6.97 to 23.71     47/50 (.94)              At least .75   ≈ .95
3   8.37 ± 23.01       -14.64 to 31.38    49/50 (.98)              At least .89   ≈ .997

• Do the actual proportions in the three intervals agree with those given by Tchebysheff’s Theorem?
Yes. Tchebysheff’s Theorem must be true for any data set.
• Do they agree with the Empirical Rule?
No. Not very well.
• Why or why not?
The data distribution is not very mound-shaped, but skewed right.
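The proportions in the table can be recomputed directly from the data. The sketch below uses the 50 ordered measurements as they are listed in the quartile example later in this chapter, together with `mean` and `stdev` from Python's standard `statistics` module; the helper `proportion_within` is a hypothetical name.

```python
from statistics import mean, stdev

# The 50 recurrence times, taken from the ordered listing in the quartile example.
times = [
    0.2, 0.2, 0.3, 0.4, 1.0, 1.2, 1.3, 1.4, 1.6, 1.6,
    2.0, 2.1, 2.4, 2.4, 2.7, 3.3, 3.5, 3.7, 3.9, 4.1,
    4.3, 4.4, 5.6, 5.8, 6.1, 6.6, 6.9, 7.4, 7.4, 8.2,
    8.2, 8.3, 8.7, 9.0, 9.6, 9.9, 11.4, 12.6, 13.5, 14.1,
    14.7, 16.7, 18.0, 18.0, 18.4, 19.2, 23.1, 24.0, 26.7, 32.3,
]

x_bar = mean(times)
s = stdev(times)  # sample standard deviation (divides by n - 1)


def proportion_within(data, center, spread, k):
    """Fraction of the data lying in the interval center +/- k * spread."""
    low, high = center - k * spread, center + k * spread
    return sum(low <= x <= high for x in data) / len(data)


print(round(x_bar, 2), round(s, 2))
for k in (1, 2, 3):
    actual = proportion_within(times, x_bar, s, k)
    bound = 1 - 1 / k ** 2  # Tchebysheff's lower bound
    print(k, actual, "Tchebysheff: at least", round(bound, 2))
```

Each actual proportion is at or above the Tchebysheff bound, as it must be, while the match with the Empirical rule is loose because the data are skewed right.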
Example
• The length of time for a worker to complete a
specified operation averages 12.8 minutes
with a standard deviation of 1.7 minutes. If
the distribution of times is approximately
mound-shaped, what proportion of workers
will take longer than 16.2 minutes to
complete the task?
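By the Empirical rule, about 95% of the times lie within 2 standard deviations of the mean: 12.8 ± 2(1.7), i.e. between 9.4 and 16.2
By symmetry, about 47.5% lie between 12.8 and 16.2
So about (50 − 47.5)% = 2.5% of workers take longer than 16.2 minutes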
2.5 A check on the value of s
• From Tchebysheff’s theorem and the Empirical rule, almost all measurements lie within 2 to 3 standard deviations of the mean, so the range spans about 4 to 6 standard deviations: R ≈ 4s to 6s
• To approximate the stdev use:

s ≈ R / 4
or s ≈ R / 6 for a large data set.
Example (ex01.22)
• The disease recurrence data:

R = 32.3-.2 = 32.1
s ≈ R / 4 = 32.1 / 4 = 8.025
Actual s = 7.67
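The range check is simple enough to wrap in a small helper. In this sketch the function name and the `divisor` parameter are illustrative; 4 is the default and 6 is the alternative the slide suggests for a large data set.

```python
def approx_std_from_range(smallest, largest, divisor=4):
    """Rough check on s: the range R divided by 4 (or by 6 for a large data set)."""
    if divisor <= 0:
        raise ValueError("divisor must be positive")
    return (largest - smallest) / divisor


# Disease recurrence data: smallest value 0.2, largest value 32.3.
rough_s = approx_std_from_range(0.2, 32.3)
print(round(rough_s, 3))  # 8.025, compared with the actual s = 7.67
```

The approximation is only meant to catch gross calculation errors, not to replace the computed value of s.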
2.6 Measures of relative standing
Z-scores
• Where does one particular measurement
stand in relation to the other measurements in
the data set?

Definition:
The sample z-score is a measure of relative standing of individual measurements defined by:
$$z\text{-score} = \frac{x - \bar{x}}{s}$$
• The z-score tells us how many standard deviations away from the mean the measurement lies.
Example: suppose $\bar{x} = 5$ and s = 2. Then x = 9 has z = (9 − 5)/2 = 2, so x = 9 lies 2 standard deviations above the mean.
• From Tchebysheff’s Theorem and the Empirical Rule
– At least 3/4 and more likely 95% of measurements lie
within 2 standard deviations of the mean.
– At least 8/9 and more likely 99.7% of measurements lie
within 3 standard deviations of the mean.
• z-scores between –2 and 2 are not unusual. z-scores should
not be more than 3 in absolute value. z-scores larger than 3
in absolute value would indicate a possible outlier.

[Figure: z-score axis from -3 to 3. Between -2 and 2: not unusual; between 2 and 3 in absolute value: somewhat unusual; beyond 3 in absolute value: outlier]
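The z-score and the rule of thumb for unusual values can be sketched as two small functions. The names `z_score` and `describe` are hypothetical, and treating the boundary values |z| = 2 and |z| = 3 as belonging to the milder category is an assumption, since the slides do not say how to classify them.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies from the mean."""
    if std_dev <= 0:
        raise ValueError("the standard deviation must be positive")
    return (x - mean) / std_dev


def describe(z):
    """Label a z-score using the cut-offs at 2 and 3 (boundary handling is assumed)."""
    size = abs(z)
    if size <= 2:
        return "not unusual"
    if size <= 3:
        return "somewhat unusual"
    return "possible outlier"


z = z_score(9, mean=5, std_dev=2)
print(z, describe(z))  # 2.0 not unusual
```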
Percentiles
Definition:
The pth percentile of a set of n ordered
measurements is the value of x that exceeds p% of
the measurements and is less than the remaining
(100-p)%.

[Figure: the p-th percentile marked on the x-axis, with p% of the measurements below it and (100-p)% above it]
Example

• 90% of all Canadians drink 1/4 liter of milk or more per week.
So 1/4 liter of milk is the 10th percentile: 10% drink less than 1/4 liter and 90% drink 1/4 liter or more.

50th Percentile ≡ Median


25th Percentile ≡ Lower Quartile (Q1)
75th Percentile ≡ Upper Quartile (Q3)
Quartiles and the IQR
• The lower quartile (Q1) is the value of x
which is larger than 25% and less than 75%
of the ordered measurements (25th
percentile).
• The upper quartile (Q3) is the value of x
which is larger than 75% and less than 25%
of the ordered measurements (75th
percentile).
• The range of the “middle 50%” of the
measurements is the interquartile range,
IQR = Q3 – Q1
Calculating the sample quartiles
• The lower and upper quartiles (Q1 and Q3) can be calculated as follows:
• The position of Q1 is .25(n + 1)
• The position of Q3 is .75(n + 1)
once the measurements have been ordered. If the positions are not integers, find the quartiles by interpolation.
Example
• Time from onset to recurrence (ordered, reading across each row):
0.2   0.2   0.3   0.4   1.0   1.2   1.3   1.4   1.6   1.6
2.0   2.1   2.4   2.4   2.7   3.3   3.5   3.7   3.9   4.1
4.3   4.4   5.6   5.8   6.1   6.6   6.9   7.4   7.4   8.2
8.2   8.3   8.7   9.0   9.6   9.9   11.4  12.6  13.5  14.1
14.7  16.7  18.0  18.0  18.4  19.2  23.1  24.0  26.7  32.3

Position of Q1 = .25(50 + 1) = 12.75
Position of Q3 = .75(50 + 1) = 38.25
Q1 is 3/4 of the way between the 12th and 13th ordered measurements, or
Q1 = 2.1 + .75(2.4 − 2.1) = 2.325
Q3 is 1/4 of the way between the 38th and 39th ordered measurements, or
Q3 = 12.6 + .25(13.5 − 12.6) = 12.825
Thus
IQR = Q3 – Q1 = 12.825 − 2.325 = 10.5
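The position-and-interpolation procedure for Q1, Q3 and the IQR can be sketched as below. The function names are hypothetical, and clamping positions that fall outside the data to the smallest or largest measurement is an assumption the slides do not address (it only matters for very small samples).

```python
import math


def value_at_position(ordered, position):
    """Value at a 1-based, possibly fractional position, by linear interpolation."""
    lower = math.floor(position)
    fraction = position - lower
    if lower < 1:                 # assumption: clamp below the first measurement
        return ordered[0]
    if lower >= len(ordered):     # assumption: clamp above the last measurement
        return ordered[-1]
    below, above = ordered[lower - 1], ordered[lower]
    return below + fraction * (above - below)


def quartiles(measurements):
    """Return (Q1, Q3, IQR) using positions .25(n + 1) and .75(n + 1)."""
    ordered = sorted(measurements)
    n = len(ordered)
    q1 = value_at_position(ordered, 0.25 * (n + 1))
    q3 = value_at_position(ordered, 0.75 * (n + 1))
    return q1, q3, q3 - q1


times = [
    0.2, 0.2, 0.3, 0.4, 1.0, 1.2, 1.3, 1.4, 1.6, 1.6,
    2.0, 2.1, 2.4, 2.4, 2.7, 3.3, 3.5, 3.7, 3.9, 4.1,
    4.3, 4.4, 5.6, 5.8, 6.1, 6.6, 6.9, 7.4, 7.4, 8.2,
    8.2, 8.3, 8.7, 9.0, 9.6, 9.9, 11.4, 12.6, 13.5, 14.1,
    14.7, 16.7, 18.0, 18.0, 18.4, 19.2, 23.1, 24.0, 26.7, 32.3,
]
q1, q3, iqr = quartiles(times)
print(round(q1, 3), round(q3, 3), round(iqr, 3))  # 2.325 12.825 10.5
```

Statistical software may use slightly different quartile conventions, so results from other tools can differ a little from the (n + 1) rule used here.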
Extra problems
• Find Q1, Q2, Q3 and 30th percentile of the
following set of measurements: 6, 4, 5, 14, 2,
6, 0, 15 and 8.
Extra problems
• You are given that the mean and the range of n=25 measurements are 10 and 8, respectively. If the distribution is approximately mound-shaped, what proportion of the data is expected to be between 4 and 16? Show your work and justify it.
Extra problems
Q: The following data represent the numbers of minutes an athlete spends training per day:
73, 74, 76, 77, 79, 79, 83, 84, 88, 84, 84, 85, 86, 86, 87, 87, 88, 91, 92, 92, 93, 97, 98, 98,
81, and 82.

1. What is the: population, variable of interest, type of the variable?


2. Compute the mean and standard deviation

3. Using the following stem-and-leaf plot, compute the median, Q1, Q3 and IQR

Stems Leaves
7 34
7 6799
8 123444
8 5667788
9 1223
9 788
Extra problems
Q: Amount of Food Sold. Suppose the hourly dollar amount of food sold by a local
restaurant follows an approximately mound-shaped distribution, with a mean sales level of
$400 per hour and a standard deviation of $60 per hour.

1. What is the: population, variable of interest, type of the variable?


2. During what percentage of working hours does this restaurant sell between $280 and $520 worth of food per hour?
Extra problems
Q: Two students are enrolled in different sections of an introductory statistics class at a local
university. The first student, enrolled in the morning section, earns a score of 76 on a
midterm exam where the class mean was 64 with a standard deviation of 8. The second
student, enrolled in the afternoon section, earns a score of 72 on a midterm exam where the
class mean was 60 with a standard deviation of 7.5. If the scores on the midterm exams are
normally distributed, which student scored better relative to his or her classmates?
Extra problems
Q: If the 90th and 91st observations in a set of 100 data values are 158 and 167, respectively,
what is the 90th percentile value?

Q: If the 18th and 19th observations in a set of 25 data values are 42.6 and 43.8, what is the 70th percentile value?
