Lecture 1 - Introduction To Statistics
APPLICATION (QHO430)
Figure: Educational Attainment, Based on a Sample of 78,000 Households in the 2013 Current
Population Survey.
Inferential Statistics Example
Suppose we’d like to know what people think about controls over
the sales of handguns. We can study results from a recent poll of
834 Florida residents.
In that poll, 54.0% of the sampled subjects said they favoured controls
over the sales of handguns.
We are 95% confident that the percentage of all adult Floridians
favouring control over sales of handguns falls between 50.6% and
57.4%.
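The slide does not say how the interval was obtained. As a minimal sketch, assuming the usual large-sample formula for a proportion (sample proportion plus or minus 1.96 standard errors), the quoted limits can be reproduced as follows. The function name `proportion_ci` is illustrative only.

```python
import math

def proportion_ci(p_hat, n, z=1.96):
    """Approximate large-sample confidence interval for a proportion.

    Assumption: z = 1.96 corresponds to 95% confidence.
    """
    standard_error = math.sqrt(p_hat * (1 - p_hat) / n)
    margin = z * standard_error
    return p_hat - margin, p_hat + margin

# Poll values from the example: 54.0% of 834 sampled residents.
lower, upper = proportion_ci(0.54, 834)
print(f"{lower:.1%} to {upper:.1%}")  # 50.6% to 57.4%
```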
Example Parameter versus Statistic
Suppose the percentage of all students on your campus who
have a job is 84.9%.
This value represents a parameter because it is a numerical summary of a
population.
$\bar{x} = \frac{\sum x}{n}$
Example: Centre of the Cereal Sodium Data
We find the mean by adding all the observations and then
dividing this sum by the number of observations, which is 20:
0, 340, 70, 140, 200, 180, 210, 150, 100, 130,
140, 180, 190, 160, 290, 50, 220, 180, 200, 210
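The calculation described above can be sketched in a few lines of Python. The variable names are illustrative; the data are the 20 sodium values listed on the slide.

```python
# Sodium values for the 20 breakfast cereals, as listed above.
sodium = [0, 340, 70, 140, 200, 180, 210, 150, 100, 130,
          140, 180, 190, 160, 290, 50, 220, 180, 200, 210]

total = sum(sodium)         # add all the observations
n = len(sodium)             # number of observations
mean = total / n            # divide the sum by n

print(total, n, mean)       # 3340 20 167.0
```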
An outlier is an observation that falls well above or well below the overall bulk of the data.
Example: CO2 Pollution
CO2 pollution levels in the 9 largest nations, measured in metric tons per person.
The CO2 values have n = 9 observations. The ordered values are: 0.3, 0.4, 0.8, 1.4, 1.8, 2.1,
5.9, 11.6, 16.9. Since n is odd, the median is the middle value, 1.8.
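A short sketch of the median rule, using the nine CO2 values above. The helper `median` is written out here for illustration; it takes the middle value when n is odd and the average of the two middle values when n is even.

```python
def median(values):
    """Middle value of the ordered data (average of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

co2 = [0.3, 0.4, 0.8, 1.4, 1.8, 2.1, 5.9, 11.6, 16.9]
print(median(co2))  # 1.8
```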
• The range is simple to compute and easy to understand, but it uses only
the extreme values and ignores the other values. Therefore, it’s affected
severely by outliers.
Standard Deviation
The deviation of an observation x from the mean is $(x - \bar{x})$, the difference between the
observation and the sample mean.
• Each observation has a deviation from the mean.
• A deviation is positive if the value falls above the mean and negative if the
value falls below the mean.
• The sum of the deviations for all the values in a data set is always zero.
• Summary measures of variability from the mean use either the squared
deviations or their absolute values.
• The average of the squared deviations is called the variance.
• The symbol $\sum (x - \bar{x})^2$ is called a sum of squares.
Standard Deviation Example 1
• For the cereal sodium values, the mean is $\bar{x} = 167$.
• The observation of 210 for Honeycomb has a deviation of 210 − 167 = 43.
The observation of 50 for Honey Smacks has a deviation of 50 − 167 = −117.
The figure shows these deviations.
Figure: Dot Plot for Cereal Sodium Data, Showing Deviations for Two Observations.
Question: When is a deviation positive and when is it negative?
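As an illustration of the points above, the sketch below computes every deviation for the cereal sodium data, picks out the two deviations quoted in the example, and confirms that the deviations sum to zero. The variable names are illustrative.

```python
sodium = [0, 340, 70, 140, 200, 180, 210, 150, 100, 130,
          140, 180, 190, 160, 290, 50, 220, 180, 200, 210]

mean = sum(sodium) / len(sodium)            # 167.0
deviations = [x - mean for x in sodium]     # one deviation per observation

print(210 - mean)        # 43.0   (above the mean, so positive)
print(50 - mean)         # -117.0 (below the mean, so negative)
print(sum(deviations))   # 0.0
```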
The Standard Deviation s of n Observations
Gives a measure of variation by summarizing the deviations of
each observation from the mean and calculating an adjusted
average of these deviations.
$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
The larger the standard deviation, s, the greater
the variability of the data.
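The formula can be followed step by step in code. This is a minimal sketch: `sample_sd` is an illustrative helper, and the result is compared with `statistics.stdev` from the Python standard library, which also divides by n − 1.

```python
import math
import statistics

def sample_sd(values):
    """Sample standard deviation: square root of the sum of squares over (n - 1)."""
    n = len(values)
    mean = sum(values) / n
    sum_of_squares = sum((x - mean) ** 2 for x in values)
    return math.sqrt(sum_of_squares / (n - 1))

sodium = [0, 340, 70, 140, 200, 180, 210, 150, 100, 130,
          140, 180, 190, 160, 290, 50, 220, 180, 200, 210]

print(sample_sd(sodium))          # step-by-step calculation
print(statistics.stdev(sodium))   # library result, for comparison
```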
Standard Deviation Example 2
• Women’s and Men’s Ideal Number of Children
• Men: 0, 0, 0, 2, 4, 4, 4 Women: 0, 2, 2, 2, 2, 2, 4
• Both men and women have a mean of 2 and a range of 4.
• For the men: $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{\frac{24}{6}} = 2.0$.
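The two groups share the same mean and range, so the standard deviation is what separates them. A quick sketch with the standard library (variable names are illustrative) shows the men's answers are more spread out than the women's.

```python
import statistics

men = [0, 0, 0, 2, 4, 4, 4]
women = [0, 2, 2, 2, 2, 2, 4]

for label, data in (("Men", men), ("Women", women)):
    mean = statistics.mean(data)
    data_range = max(data) - min(data)
    sd = statistics.stdev(data)        # divides by n - 1
    print(f"{label}: mean={mean}, range={data_range}, s={sd:.2f}")

# Men:   sum of squares = 24, s = sqrt(24/6) = 2.00
# Women: sum of squares = 8,  s = sqrt(8/6), about 1.15
```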
a. Use Excel to draw a histogram, then comment on the appropriateness of using the
Empirical Rule to make any general statements about drivers’ speeds.
Answer: The data appear to roughly follow a symmetric, bell-shaped distribution, so it is
appropriate to apply the Empirical Rule.
b. Use the Empirical Rule to estimate the % of speeds that are between 26 and 38 mph. (Hint:
the sample mean is approximately 32 mph.)
Answer: 68%, because 26 and 38 mph lie about 1 standard deviation (roughly 6 mph) either side
of the mean, and about 68% of the observations fall within 1 standard deviation of the mean.
Empirical Rule Example
c. Determine the actual % of drivers whose speeds were between 26 and 38 mph.
Answer: 9 × 100/14 ≈ 64%
d. Determine the actual % of drivers whose speeds are 26 mph or higher, so they
exceed the posted speed limit.
Answer: 12 × 100/14 ≈ 86%
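The actual percentages in parts c and d can be checked by counting. The sketch below uses the 14 recorded vehicle speeds from this example; the variable names are illustrative.

```python
speeds = [20, 24, 27, 28, 29, 30, 32, 33, 34, 36, 38, 39, 40, 40]
n = len(speeds)

# Part c: speeds within about one standard deviation of the mean (26 to 38 mph).
between_26_and_38 = [x for x in speeds if 26 <= x <= 38]
pct_between = 100 * len(between_26_and_38) / n

# Part d: speeds of 26 mph or higher.
at_least_26 = [x for x in speeds if x >= 26]
pct_at_least_26 = 100 * len(at_least_26) / n

print(len(between_26_and_38), f"{pct_between:.0f}%")   # 9 64%
print(len(at_least_26), f"{pct_at_least_26:.0f}%")     # 12 86%
```

The observed 64% is close to the 68% suggested by the Empirical Rule, which is reasonable for a sample of only 14 vehicles.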
Using Measures of Position to Describe Variability
Quartiles
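The quartiles split the distribution into four parts. 25% is below the first
quartile (Q1), 25% is between the first quartile and the second quartile
(the median, Q2), 25% is between the second quartile and the third quartile (Q3),
and 25% is above the third quartile.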
The median of the 20 values is the average of the 10th and 11th observations, 180 and 180,
which is Q2 = 180 milligrams.
The first quartile Q1 is the median of the 10 smallest observations (in the top row), which is
the average of 130 and 140, Q1 = 135 milligrams.
The third quartile Q3 is the median of the 10 largest observations (in the bottom row), which is
the average of 200 and 210, Q3 = 205 milligrams.
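The quartiles above come from the "median of each half" method. The sketch below follows that method; the helper names are illustrative, and leaving the median out of both halves when n is odd is an assumption, since the slides only show an even-sized example. Software such as Excel may use a different interpolation rule and give slightly different quartiles.

```python
def median(values):
    """Middle value of the ordered data (average of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method.

    Assumes at least two observations. For odd n the median itself is
    excluded from both halves (an assumption, see the note above).
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower_half = ordered[:half]
    upper_half = ordered[-half:]
    return median(lower_half), median(ordered), median(upper_half)

sodium = [0, 340, 70, 140, 200, 180, 210, 150, 100, 130,
          140, 180, 190, 160, 290, 50, 220, 180, 200, 210]

q1, q2, q3 = quartiles(sodium)
print(q1, q2, q3)  # 135.0 180.0 205.0
```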
Checking for Outliers by Using Quartiles
Step 1 Determine the first and third quartiles of the data.
Step 2 Compute the interquartile range. IQR = Q3 − Q1
Step 3 Determine the fences. Fences serve as cutoff points for determining
outliers.
Lower Fence = Q1 − 1.5(IQR)
Upper Fence = Q3 + 1.5(IQR)
Step 4 If a data value is less than the lower fence or greater than the upper
fence, it is considered an outlier.
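Steps 2 to 4 can be sketched as two small functions. The function names are illustrative. The demonstration applies them to the cereal sodium data, taking Q1 = 135 and Q3 = 205 from the quartile example (Step 1).

```python
def fences(q1, q3):
    """Steps 2 and 3: compute the IQR, then the lower and upper fences."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def find_outliers(values, q1, q3):
    """Step 4: values below the lower fence or above the upper fence."""
    lower, upper = fences(q1, q3)
    return [x for x in values if x < lower or x > upper]

sodium = [0, 340, 70, 140, 200, 180, 210, 150, 100, 130,
          140, 180, 190, 160, 290, 50, 220, 180, 200, 210]

lower_fence, upper_fence = fences(135, 205)
outliers = find_outliers(sodium, 135, 205)

print(lower_fence, upper_fence)  # 30.0 310.0
print(outliers)                  # [0, 340]
```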
Checking for Outliers Example
A group of students collected data on the speed of vehicles travelling through a
construction zone on a state highway, where the posted speed was 25 mph. The
recorded speed of 14 randomly selected vehicles is given below:
20, 24, 27, 28, 29, 30, 32, 33, 34, 36, 38, 39, 40, 40
a. Determine the quartiles. Q1 = 28 mph; Q2 = 32.5 mph; Q3 = 38 mph
c. Determine the lower and upper fences. Are there any outliers, according to this
criterion? With IQR = 38 − 28 = 10 mph, Lower fence = Q1 − 1.5(IQR) = 13 mph and
Upper fence = Q3 + 1.5(IQR) = 53 mph; there are no outliers according to this criterion.
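The arithmetic for this example can be checked directly. This sketch takes the quartiles from part a as given and works through the IQR, the fences, and the outlier check; the variable names are illustrative.

```python
speeds = [20, 24, 27, 28, 29, 30, 32, 33, 34, 36, 38, 39, 40, 40]

q1, q3 = 28, 38                  # quartiles from part a
iqr = q3 - q1                    # interquartile range
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# A speed is flagged only if it lies outside the fences.
outliers = [x for x in speeds if x < lower_fence or x > upper_fence]

print(iqr, lower_fence, upper_fence)  # 10 13.0 53.0
print(outliers)                       # []
```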
Five-Number Summary of Positions
The five-number summary is the basis of a graphical display
called the box plot, and consists of
Minimum value
First Quartile
Median
Third Quartile
Maximum value
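As a sketch, the five values can be collected for the cereal sodium data as follows. The helper names are illustrative, and the quartiles use the median-of-halves method from the quartile example (the data set has an even number of values, so each half has 10 observations).

```python
def median(values):
    """Middle value of the ordered data (average of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum); assumes at least two values."""
    ordered = sorted(values)
    half = len(ordered) // 2
    q1 = median(ordered[:half])     # median of the smaller half
    q3 = median(ordered[-half:])    # median of the larger half
    return ordered[0], q1, median(ordered), q3, ordered[-1]

sodium = [0, 340, 70, 140, 200, 180, 210, 150, 100, 130,
          140, 180, 190, 160, 290, 50, 220, 180, 200, 210]

summary = five_number_summary(sodium)
print(summary)  # (0, 135.0, 180.0, 205.0, 340)
```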
Constructing a Box Plot
A box goes from Q1 to Q3.
A line is drawn inside the box at the median.
A line goes from the lower end of the box to the smallest observation
that is not a potential outlier and from the upper end of the box to the
largest observation that is not a potential outlier.
The potential outliers are shown separately.
Example: Box Plot for Cereal Sodium Data
The figure shows a box plot for the sodium values. Labels are also given for the five-number
summary of positions.
Figure: Box Plot and Five-Number Summary for 20 Breakfast Cereal Sodium Values. The central
box contains the middle 50% of the data. The line in the box marks the median. Whiskers extend from
the box to the smallest and largest observations, which are not identified as potential outliers.
Potential outliers are marked separately. Question: Why is the left whisker drawn down only to 50
rather than to 0?
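One way to explore the question is to compute where the whiskers end. The sketch below assumes the box plot uses the 1.5 × IQR fences described earlier to decide which observations are potential outliers, with Q1 = 135 and Q3 = 205 from the quartile example. The variable names are illustrative.

```python
sodium = [0, 340, 70, 140, 200, 180, 210, 150, 100, 130,
          140, 180, 190, 160, 290, 50, 220, 180, 200, 210]

q1, q3 = 135, 205
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr     # 30.0
upper_fence = q3 + 1.5 * iqr     # 310.0

# Observations inside the fences are not potential outliers.
inside = [x for x in sodium if lower_fence <= x <= upper_fence]
potential_outliers = [x for x in sodium if x < lower_fence or x > upper_fence]

left_whisker_end = min(inside)
right_whisker_end = max(inside)

print(left_whisker_end, right_whisker_end)  # 50 290
print(potential_outliers)                   # [0, 340]
```

Under this assumption the value 0 lies below the lower fence of 30, so it is plotted separately and the left whisker stops at 50, the smallest observation that is not a potential outlier.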
Z-Score
The z-score also identifies position and potential outliers.
The z-score for an observation is the number of standard deviations that it falls
from the mean. A positive z-score indicates the observation is above the mean. A
negative z-score indicates the observation is below the mean. For sample data,
the z-score is calculated as:
$z = \frac{\text{observation} - \text{mean}}{\text{standard deviation}}$
20, 24, 27, 28, 29, 30, 32, 33, 34, 36, 38, 39, 40, 40
a. Use Excel to check that the sample mean and standard deviation for these
vehicle speeds are 32.1 and 6.2, respectively.
b. Find the z-score for a car driving 20 mph in this construction zone. Answer: −1.95
c. Find the z-score for a car driving 40 mph in this construction zone. Answer: +1.27
d. Use the standard deviation criterion to check if there are any outliers.
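Parts a to d can be checked with a short sketch. The slide does not state the cutoff for the standard deviation criterion; the code assumes the common rule that an observation more than 3 standard deviations from the mean is a potential outlier. The helper `z_score` and the variable names are illustrative.

```python
import statistics

def z_score(observation, mean, sd):
    """Number of standard deviations the observation falls from the mean."""
    return (observation - mean) / sd

speeds = [20, 24, 27, 28, 29, 30, 32, 33, 34, 36, 38, 39, 40, 40]

# Part a: sample mean and standard deviation (stdev divides by n - 1).
mean = statistics.mean(speeds)
sd = statistics.stdev(speeds)
print(round(mean, 1), round(sd, 1))        # 32.1 6.2

# Parts b and c, using the rounded values quoted on the slide.
print(round(z_score(20, 32.1, 6.2), 2))    # -1.95
print(round(z_score(40, 32.1, 6.2), 2))    # 1.27

# Part d: assumed criterion, |z| > 3 flags a potential outlier.
flagged = [x for x in speeds if abs(z_score(x, mean, sd)) > 3]
print(flagged)                             # []
```

Using the unrounded mean and standard deviation changes the z-scores slightly in the second decimal place, but the conclusion is the same: no speed is more than 3 standard deviations from the mean.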
Summary
Introduction to the module: Module Descriptor; What we have to achieve; Assessments;
Weekly tasks and expectations; Resources.
Revision of Basic Statistics - What do you mean?
We regularly use some of the most basic statistical measures in conversation, for
example:
Asked at the doctor's: what is your average alcohol consumption in units?
On average, Scottish gin accounts for 70% of the UK's overall gin production.
Common questions: on average, how many times a week do you eat meat?
Mode: What is the most common shoe size for men and women?
Median: Splits the data in half.
Standard deviation: How is the data spread out?
Quartiles and Boxplot
Question?