Statistical Analysis Notes
Data that is not associated with another data set is called one-variable data.
Example: The weights of all the students in your class, their ages, their math marks, or the
distance they travel to get to school.
The mean of a data set is the sum of all of the values divided by the number of values in the set. In
other words, if the data set is $\{x_1, x_2, x_3, \ldots, x_{n-1}, x_n\}$, then the mean is
$\frac{x_1 + x_2 + \cdots + x_n}{n}$.
For a data set with an odd number of values, the median is the middle value of the set when the data is
listed in ascending order. For a data set with an even number of values, the median is the mean of the two
middle values of the set when the data is listed in ascending order.
The mode is the value or values that occur most frequently within a data set.
MEASURES OF DISPERSION
The range of a data set is the difference between the maximum data value and the minimum data value.
The interquartile range is the difference between the third quartile (Q3) and the first quartile (Q1) of
the data set: Q3 - Q1.
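The definitions above can be checked with a short Python sketch. The data set is hypothetical, and the quartiles are taken as the medians of the lower and upper halves of the sorted data, which is one common convention; other calculators and software may use slightly different quartile rules.

```python
from statistics import mean, median, multimode

def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    Assumes at least two values. For an odd number of values the
    middle value is left out of both halves.
    """
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]      # values before the median position
    upper = data[-half:]     # values after the median position
    return median(lower), median(upper)

data = [4, 8, 6, 5, 3, 8, 9]   # hypothetical data set

q1, q3 = quartiles(data)
summary = {
    "mean": mean(data),                 # sum of values / number of values
    "median": median(data),             # middle value of the sorted data
    "mode": multimode(data),            # list of the most frequent value(s)
    "range": max(data) - min(data),     # maximum - minimum
    "iqr": q3 - q1,                     # Q3 - Q1
}
print(summary)
```

The mode is returned as a list because a data set can have more than one mode.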
Example 1: Find the mean, median, range, standard deviation, variance, and the interquartile range for the
data set 3, 12, 6, 8, 7, 2, 9, 5, using a TI-83 graphing calculator. Round the approximate
measures to the nearest hundredth.
MDM4U – STATISTICAL ANALYSIS
Press the down arrow several times to view the following additional information.
From the screen display you can see that, to the nearest hundredth, the measures are:
Mean = _____
Median = _____
Range = maxX - minX = _____ - _____ = _____
Standard deviation = _____
Variance = σ² = (_____)² = _____
Interquartile range = _____ - _____ = _____ - _____ = _____
Example 2: If 10 was added to every value in a data set, then the measure that would change is the
a) range
b) median
c) standard deviation
d) interquartile range
Example 3: Determine the standard deviation of the data set {5, 8, 9, 12, 15, 15, 17, 20, 21, 27} to the
nearest hundredth.
The percentile, or percentile rank, for any data value indicates that a certain percent of values fall
beneath that data value. The percentile for a data value, x, is given by the formula:
Percentile for x = $\frac{B + \frac{1}{2}E}{n} \times 100$, where B = number of data values below x,
E = number of values equal to x, and n = total number of values in the data set.
Percentiles are usually truncated to the nearest whole number.
The 25th percentile is equal to the first quartile, Q1.
The 50th percentile is equal to the second quartile, Q2.
The 75th percentile is equal to the third quartile, Q3.
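As a rough illustration of the percentile formula (B values below x, plus half of the E values equal to x, divided by n and multiplied by 100), here is a minimal Python sketch. The function name and the counts are hypothetical and are not taken from any example in these notes.

```python
def percentile_rank(below, equal, total):
    """Percentile for x = (B + E/2) / n * 100."""
    return (below + 0.5 * equal) * 100 / total

# Hypothetical counts: 12 values below x, 2 values equal to x, 20 values in all.
rank = percentile_rank(below=12, equal=2, total=20)
print(int(rank))   # int() drops any decimal part, giving a whole-number percentile
```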
Example 4: In a set of 414 mathematics scores, 164 are below a score of 50% and 8 are equal to
50%. What is the percentile rank of a score of 50%?
The z-score for a data value of x is the number of standard deviations of the value from the mean
of the data set. For example, a z-score of 1.6 means the data value is 1.6 standard deviations
above the mean, whereas a z-score of -2.1 means the data value is 2.1 standard deviations
below the mean. The z-score for a data value x is given by the following formula:
$z = \frac{x - \mu}{\sigma}$, where $\sigma$ = standard deviation, $\mu$ = mean of the set of data
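The z-score calculation (data value minus the mean, divided by the standard deviation) can be sketched in Python as follows. The function name and the numbers are hypothetical and chosen only to show one value above the mean and one below it.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / std_dev

# Hypothetical values: mean of 70 and standard deviation of 5.
z_above = z_score(80, 70, 5)   # positive: the value is above the mean
z_below = z_score(62, 70, 5)   # negative: the value is below the mean
print(round(z_above, 2), round(z_below, 2))
```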
Example 5: In Kelly's English class, the marks for the 39 students have a normal distribution, with
a mean of 61% and a standard deviation of 11.5. Kelly has a mark of 71%. What is Kelly's
z-score, rounded to the nearest hundredth?
Example 6: Terry had a score of 122 points in a city-wide contest with 3200 participants. There were
473 scores better than Terry's, and 64 scores equal to Terry's. What was Terry's percentile
rank in the contest?
a) 16 b) 17 c) 84 d) 85
Example 7: Karl is 171 cm tall. The mean height of the students in Karl's class is 156 cm, and the standard
deviation of their heights is 2.2 cm. If the heights have a normal distribution, then
determine the z-score for Karl's height, to the nearest hundredth.
GRAPHS
Circle graphs are used to show how one part relates to the whole by splitting the
circle into sectors. The size of each sector is proportional to the amount it
represents.
For example, a circle graph could be used to show how the household
budget is divided up for the various expenses.
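Because each sector is proportional to the amount it represents, its central angle is that amount's share of 360 degrees. The following Python sketch shows the arithmetic; the budget categories and amounts are hypothetical.

```python
def central_angles(amounts):
    """Map each category to its central angle in degrees (share of 360)."""
    total = sum(amounts.values())
    return {name: value * 360 / total for name, value in amounts.items()}

# Hypothetical monthly budget with a total of $2500.
budget = {"housing": 1000, "food": 500, "transport": 250, "other": 750}
angles = central_angles(budget)
for name, angle in angles.items():
    print(name, round(angle))
```

The angles always add up to 360 degrees, which is a quick check on the calculation.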
Bar graphs are used to display relative amounts or frequencies for comparable
categories.
For example, the annual sales of the major auto manufacturers can be
compared with a bar graph.
Box plots are used to summarize the distribution of data by showing the median,
the first and third quartiles, and the minimum and maximum data values.
For example, the test marks data from the stem-and-leaf plot in the
previous example could be summarized using a box plot.
The types of data and the graphs commonly used to display each type are summarized in the following
table:
Type of Data | Description | Example | Type(s) of Graphs
-------------|-------------|---------|------------------
Categorical | Data values can be put into non-overlapping categories. | An example is the number of males and females in a math class. | Circle Graphs, Bar Graphs
Ordinal | Data values can be ranked, or put in order. The data values can be counted but not measured. | For example, a movie rating in which the movie is rated from 1 to 5 stars by a group of moviegoers. | Bar Graphs
Quantitative | Data that applies to an amount, a quantity, or a range of values. Usually measurement units are associated with the data. | For example, the heights of students in the class. | Histograms, Stem-and-Leaf Plots, Box Plots
Example 8: A family has a total monthly budget of $3200. They budget $650 for groceries. If a circle
graph were drawn to display the budget categories, what would be the measure of the central
angle of the sector that represents the groceries? Give the answer to the nearest degree.
Example 9: Which of the following graph types would best display the populations of Canada’s ten
provinces?
a) Box plot
b) Bar Graph
c) Histogram
d) Stem-and-Leaf plot
Example 10: The following data represents the vertical jump heights of physical education students at a
particular elementary school, rounded to the nearest centimetre. Use a TI-83 graphing
calculator to draw a histogram and a box plot for the data.
Height (cm)    Frequency
14             2
15             4
16             6
17             9
18             13
19             15
20             14
21             11
22             8
23             7
24             7
25             4
Using a TI-83 Graphing Calculator
1. Press [2nd] [+] [4:ClrAllLists] [ENTER]
2. Press [STAT] [1:Edit] and enter the height in L1 and the frequency in L2
3. Press [2nd] [Y=] [ENTER]
4. Use the arrows to scroll to On, press [ENTER], scroll to the histogram icon, press [ENTER]
5. Scroll to Xlist:, press [2nd] [1] to choose L1
6. Scroll to Freq:, press [2nd] [2] to choose L2
7. Press [ZOOM], scroll down to 9:ZoomStat, press [ENTER]
The box plot is the fifth type of graph listed in the STAT PLOT menu.
Example 11: The following graph was obtained by entering one-variable data into a TI-83 graphing
calculator.
Which of the following data sets was most likely entered into the calculator to obtain the
given graph?
a) Restaurant ratings on a 10-point scale
b) Sales of different types of vacuum cleaners
c) The number of days with specific high temperatures in September
d) Marks on a math exam compared to amount of time spent studying
When a survey is conducted using a sample of the population, it is unlikely that the sample represents the
entire population exactly. If the same survey was conducted on a different sample of the population, the
chances are that the results would be at least somewhat different.
The confidence level is a measure of how confident the statistician is of the results. A 95% confidence
level means that the statistician is 95% confident of the results. The margin of error establishes the
interval for which the confidence level applies.
Assume a survey was done that reports that 70% of Canadians prefer watching hockey to football on
television. The survey states that it is accurate to within 3 percentage points 95% of the time. What
does this last statement mean? It means that if the survey were conducted 100 times, about 95 times you would
get a result of 70% ± 3%, or between 67% and 73%, of the respondents saying they prefer hockey.
If the sample size in the previous example were 1000 people, and a new survey were done with a sample
size of 2000 people, then the margin of error for the same 95% confidence level would decrease,
possibly to 2%. Increasing the sample size has the effect of decreasing the margin of error, but not
proportionally, i.e., doubling the sample size does not halve the margin of error.
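The effect of sample size can be sketched numerically. The formula used here, z × sqrt(p(1 − p)/n) with z = 1.96 for a 95% confidence level, is the usual normal approximation for a proportion; it is an assumption of this sketch and is not stated in these notes. The proportion 0.70 and the sample sizes 1000 and 2000 come from the hockey survey discussion above.

```python
import math

def margin_of_error(p, n, z=1.96):
    """Approximate margin of error for a sample proportion p and sample size n.

    z = 1.96 corresponds to a 95% confidence level (assumed normal approximation).
    """
    return z * math.sqrt(p * (1 - p) / n)

moe_1000 = margin_of_error(0.70, 1000)
moe_2000 = margin_of_error(0.70, 2000)

# Margins of error as percentages, to the nearest tenth.
print(round(moe_1000 * 100, 1), round(moe_2000 * 100, 1))

# Doubling n multiplies the margin of error by 1/sqrt(2) (about 0.71), not by 0.5.
print(round(moe_2000 / moe_1000, 2))
```

Under this approximation, halving the margin of error would take roughly four times the sample size, since the margin shrinks with the square root of n.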
Dynamic statistical software for a computer, and programs built into some calculators, can efficiently
calculate confidence intervals and margins of error, and demonstrate the relationship between the two as
the sample sizes change. The following example illustrates the procedure with a TI-83 graphing
calculator.
Example 12: In a survey of 170 students, representing a normally distributed population, it was found
that 72 of the students owned a skateboard.
a) If the survey had a 95% confidence level, determine the proportion of students in the
represented population who own a skateboard.
b) Determine the margin of error as a percentage to the nearest tenth.
c) Determine the effect on the margin of error if the sample size were increased to 400.
In order to obtain the answers, follow the given procedure with a TI-83 graphing calculator:
Solutions:
a)
b)
c)
Example 13: If the sample size for a normally distributed population were doubled from 200 to 400, then
the margin of error for a 95% confidence level will
Example 14: The following screen shows information obtained by using the A:1-PropZInt function on a
TI-83 graphing calculator.
For the data that has been entered in the A:1-PropZInt function, what is the margin of error
for the proportion, to the nearest tenth of a percent?
a) 1.8% b) 3.6% c) 7.6% d) 9.5%
The following two graphs compare the sales in a particular year for companies A and B.
Although the two graphs show the same information, they will make different impressions on readers. In
the first graph, it appears that the sales for company B were not much greater than the sales for
company A. However, the second graph shows that the sales for company B were significantly greater
than the sales for company A.
Company A would likely use the first graph to show that they are competitive with company B. On the
other hand, company B would likely use the second graph to show that their sales are far greater than the
sales for company A.
The term "average" has a very loose meaning. Although it is usually interpreted as the measure of the
mean, it could be used to represent any of the measures of central tendency. The mean is not always the
best representation of central tendency, as illustrated in the next example.
Example 14: A president of a small company reports that the average annual income of his employees is
$55 000. However, the president does not report the amounts of their salaries, which are
$28 000, $30 000, $40 000, $42 000, $50 000, $52 000, $98 000, and $100 000. Although
the president correctly reported the mean salary, this report is misleading because only two
of the eight employees earn more than $55 000. A better representation of the "average" in
this situation would be to use the ____________ salary, which is $____________
Example 15: The following data set represents the number of goals scored by individual players on a
professional hockey team during the 2007-2008 season.
The coach of the team reports that the average number of goals scored by his players is
more than 12. Which of the following statements is true?
a) The coach reported the mean number of goals, and it is not misleading.
b) The coach reported the median number of goals and it is misleading; the mode would have
been a better measure of central tendency.
c) The coach reported the mean number of goals and it is misleading; the median would have
been a better measure of central tendency.
d) The coach reported the median number of goals and it is misleading; the mean would have
been a better measure of central tendency.
Example 16: In preparation for a meeting with shareholders, the president of a company wants to prepare
a graph to show that while this year's revenues are less than last year's revenues, they are
not much less.
Which of the following described graphs would best convey his message?
a) A bar graph using a vertical scale that makes the bar representing this year's revenues
slightly shorter than the bar representing last year's revenues.
b) A bar graph using a vertical scale that makes the bar representing this year's revenues
much shorter than the bar representing last year's revenues.
c) A circle graph that shows the sector representing this year's revenues smaller than a
sector representing last year's revenues.
d) A line graph with a line connecting this year's revenues to last year's revenues.
Example 17: Consider the following headline and pictograph appearing in a newspaper.
In two-variable data, a relationship, or function, exists between the two data sets. For some types of
functions, such as a linear function, a correlation coefficient can be used to determine how closely the
two data sets fit this particular relationship.
The linear correlation coefficient, denoted by r, measures the fit of two sets of data to a linear model.
The value of r can be -1 or +1, or any value in between -1 and +1 (-1 ≤ r ≤ +1).
If r is close to +1, then there is a strong positive correlation between the variables and the function.
This means that one variable will increase as the other variable increases. If r = +1, then the relation is a
linear relation with a positive slope.
If r is close to -1, then there is a strong negative correlation between the variables and the
function. This means that one variable will increase while the other variable decreases. If r = -1, then the
relation is a linear relation with a negative slope.
Generally, if |r| > 0.8, then the relationship has a strong correlation, and if |r| < 0.5, the relationship has
a weak correlation.
Example 18: A survey was completed to determine the correlation between the amount of time students
spent studying mathematics in a week and their math marks. The following table shows a
sample of the data collected.
Study Time (h) Math Mark (%)
2 52
2.5 60
3 65
3.5 68
4 68
4.5 65
5 72
5.5 75
6 80
6.5 80
7 78
7.5 74
Use a calculator to determine the linear correlation coefficient, r, and whether there is a
strong positive correlation.
In order to obtain the answers, follow the given procedure with a TI-83 graphing calculator:
3. Press [STAT] [1:Edit] and enter the study time in L1 and the math mark in L2.
4. Press [STAT], use the right arrow to highlight “CALC”, then scroll down to [4:LinReg(ax+b)], press
[ENTER], [ENTER]
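For readers without a TI-83, the same coefficient can be computed with a few lines of Python. This is a minimal sketch that uses the standard Pearson formula (an assumption about what the calculator reports as r) and the study-time data from the table in Example 18. The "moderate" label for values between 0.5 and 0.8 is a name introduced here for illustration; the notes only define "strong" and "weak".

```python
import math

def pearson_r(xs, ys):
    """Linear correlation coefficient r for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def strength(r):
    """Classify |r| using the 0.8 and 0.5 guidelines."""
    if abs(r) > 0.8:
        return "strong"
    if abs(r) < 0.5:
        return "weak"
    return "moderate"   # hypothetical label for the in-between range

study_time = [2, 2.5, 3, 3.5, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5]   # hours
math_mark = [52, 60, 65, 68, 68, 65, 72, 75, 80, 80, 78, 74]    # percent

r = pearson_r(study_time, math_mark)
print(round(r, 2), strength(r), "positive" if r > 0 else "negative")
```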
A contingency table can display the frequencies of data elements that are classified by two variables in
which the rows of the table represent one variable and the columns of the table represent the other
variable.
Example 19: A sample of 40 patients were randomly given either Drug A or Drug B as part of a
pharmaceutical drug trial. The test patients either reacted positively to the drug they were
given or they did not react. The results of this medical trial are summarized in the following
contingency table, in which the numbers represent the frequencies of results.
From the contingency table, it appears that Drug B was more effective than Drug A because a
greater proportion of the users had a positive reaction. However, more studies would be
necessary because it may be that the 11 who reacted positively to Drug A would have no
reaction to Drug B, or that the 16 that reacted with Drug B may also react with Drug A.
Example 20: A survey collected data from adult men comparing the amount of time they spent exercising
to the number of hours of television they watched on a weekly basis. The following sample of
data is from the survey.
Exercising (h)    Television (h)
0                 12
0.5               10
1                 14
2                 10
2.5               10
3                 8
3.5               8
4                 6
4.5               6
5                 5
For the data given in the table, the linear correlation coefficient is approximately
a) -0.90
b) -0.81
c) 0.81
d) 0.90
Example 21: Two experimental groups of tree seedlings were given either Fertilizer X or Fertilizer Y.
They were all planted in the same type of soil, and they received the same amount of water
and sunlight. After two months, the heights of the seedlings were measured to see how many
were taller or shorter than 20 cm tall, which is the expected height of the seedlings without
fertilizer applied. The results of the experiment are shown in the following table.
Which of the following statements best describes the effectiveness of the fertilizers?
Example 22: The following table shows the average wage in each of the ten regions in Canada, as well as
the average number of working days lost per worker in each region due to any cause (e.g.,
illness, injury). The data is adapted from Statistics Canada for 2007.
Region    Average Wage ($)    Days Lost (days)
A         16.91               10.2
B         15.07               9.8
C         17.29               12.0
D         16.58               10.5
E         19.20               12.0
F         21.31               9.3
G         18.39               10.8
H         18.87               10.5
I         22.33               9.0
J         20.37               10.1
Determine the linear correlation coefficient to the nearest hundredth for the relationship between the
average wage and the number of days lost per worker.
Interpret what this value tells you about the relationship.
Hint: A calculator or computer with statistical functions is required for this question.
Example 23: A study examines the speed limits on various roads in a city and the number of accidents
that occur on these roads. The data results in a relationship with a linear correlation
coefficient of 0.36. What does this correlation coefficient mean? Are there any other
pieces of data that might be significant to help describe the relationship between the speed
limits and accidents?
Example 24: Which of the following relationships most likely has a linear correlation coefficient of -0.10?
a) The speeds of cars and their stopping distances
b) The amount of cloud cover at night and the number of visible stars.
c) The amount of water applied to a lawn and the outdoor temperature.
d) The price of a cup of coffee and the price of a muffin at numerous coffee shops.
Two-variable categorical data may be graphed by using double bar graphs or side-by-side circle
graphs.
Two-variable data that is ordinal may also be summarized in a double bar graph.
Two-variable data that is quantitative is most often summarized by a scatter plot and a corresponding
line (or curve) of best fit.
If one of the variables is categorical and the other is quantitative, side-by-side graphs of various
types are possible. Side-by-side box plots are illustrated in the following example.
Example 25: The following sets of math scores are from students who wrote a test before having a review
class and another test after having a review class.
Student Before Review After Review
A 65 84
B 48 65
C 76 98
D 43 67
E 87 86
F 67 76
G 55 44
H 59 68
I 76 87
J 56 78
K 76 66
L 88 89
M 39 58
N 75 54
For the data given in the table above, use a TI-83 (or equivalent) graphing calculator to
a) Draw a scatter plot of the data with the original scores as the independent variable and
the scores after the review as the dependent variable.
b) Graph side-by-side box plots of the data.
c) Determine the median and interquartile range for each set of data.
Solution
a) Using a TI-83 graphing calculator the steps are:
1. Press [2nd] [+] [4:ClrAllLists] [ENTER].
2. Press [STAT] [1:Edit] and enter the “before review” scores in L1 and the “after review”
scores in L2.
3. Press [2nd] [Y=] [ENTER]. Use the arrows to scroll to “On”, press [ENTER], scroll to
the first icon (scatter plot), press [ENTER]
4. Check that the variables are labelled as Xlist:L1 and Ylist:L2. Change if necessary.
5. Press [ZOOM], scroll down to 9:ZoomStat, press [ENTER].
c) Using the TRACE button and the scrolling arrows, you should find that the median for the
first set of data is _____, and that the interquartile range is _____ - _____ = _____.
For the second set of data you should find that the median is _____, and that the
interquartile range is _____ - _____ = _____.
Example 26: A survey collected data from golfers comparing the average number of times they played per
month to their average scores.
Games per Month    Average Score
4                  98
5                  92
6                  89
7                  90
8                  86
9                  85
10                 82
11                 83
12                 78
13                 78
Which of the following statements is most accurate?
a) The data is one-variable data and is best graphed by using a box plot.
b) The data is one-variable and is best graphed by using a bar graph.
c) The data is two-variable and is best graphed by using a scatter plot.
d) The data is two-variable and is best graphed by using side-by-side circle graphs.
Example 27: Patrons of a restaurant were asked to fill out a form rating the restaurant on its food and on
its service. Both ratings were on a scale of 1 to 5. Which of the following graphing choices
would give the owner of the restaurant the best visual summary of the results of the
survey?
a) Double bar graph with one bar for food and one bar for service.
b) Side-by-side circle graphs with one circle for food and one for service.
c) A scatter plot.
d) A histogram.
LINEAR REGRESSION
Recall: If |r| > 0.8, then the relationship has a strong correlation, and if |r| < 0.5, the relationship has a
weak correlation.
Determining the effect of outliers: Outliers are points that are a significant distance from the line
of best fit. The vertical distance from a point, such as an outlier, to the line of best fit is called a
vertical deviation or residual.
An outlier can have a major effect on the value of the correlation coefficient, r, especially
when the number of points is relatively small. When working with regression models you need
to be aware of outliers, their significance, and what may have caused them.
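A line of best fit and its residuals can be sketched in Python as follows. This assumes the line of best fit is the ordinary least-squares line y = ax + b; the data points are hypothetical, and the function name is chosen only for illustration.

```python
def fit_line(xs, ys):
    """Least-squares slope a and intercept b for y = ax + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

xs = [1, 2, 3, 4, 5]   # hypothetical data
ys = [2, 4, 5, 4, 5]

a, b = fit_line(xs, ys)

# Residual = observed y minus the y-value predicted by the line.
residuals = [y - (a * x + b) for x, y in zip(xs, ys)]

print(round(a, 2), round(b, 2))
print([round(res, 2) for res in residuals])
```

A point whose residual is much larger in size than the others is a candidate outlier and is worth checking before relying on the fitted line.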
Example 28: The following data was collected about sales of a popular music CD from various Internet
sites.
Price per CD Number Sold
($) Use a TI-83 calculator to
3. Press [STAT] [1:Edit] and enter the price per CD amounts in L1 and the number sold in L2
4. Press [STAT], use the right arrow to highlight “CALC”, then scroll down to [4:LinReg(ax+b)], press
[ENTER], [ENTER]
You should see the following screen:
To insert the line of best fit, you have a couple of choices. You can type -13.9x + 191.9
into the Y1 = function, or you can paste the linear regression equation into Y1 = by the
following procedure:
c) The TI-83 calculator has an automatic residual feature. It is in the LIST NAMES menu.
You can see the residuals on the main screen or paste them into a number of places. One
of the best places is into L3.
1. Press [STAT] [1:Edit] and use the scrolling arrows to put the cursor on L3.
2. Press [2nd] [STAT], scroll down to 7:RESID, press [ENTER], [ENTER]. The residuals
You can see that the residual for the most extreme outlier, which is the _____ point, is
approximately _____.
Example 29: For the data in the following table, the line of best fit in the form y = ax + b is
approximately
x    3    7    8    9    12    14    16    19    20    25
y    28   26   25   20   18    17    15    12    11    6
Example 30: Which of the following statements regarding outliers is most accurate?
a) The effect of outliers on the equation of the line of best fit is always very slight.
b) The effect of outliers on the equation of the line of best fit is always slight when the
number of data points is small.
c) The effect of outliers on the equation of the line of best fit may be great when the
number of data points is small.
d) The effect of outliers on the equation of the line of best fit is always great when the
number of data points is great.
Example 31: The following stopping distances were measured for a particular brand of motorcycle.
b) Use the equation from part a) to estimate the stopping distance to the nearest tenth of
a metre, when a motorcycle is traveling at 120 km/h.
When analyzing two-variable statistical summaries you should be aware of the following factors and
techniques that may misrepresent the true relationships between the variables.
Example 32: Sonia conducted a study to determine if there was a relationship between the average
amount of time a student spends on the computer each day and his or her math mark. She
surveyed students in her class and obtained data that led to scatter plots with the two lines
of best fit shown below.
1. __________________________________________________________________
__________________________________________________________________
2. __________________________________________________________________
__________________________________________________________________
3. __________________________________________________________________
__________________________________________________________________
4. __________________________________________________________________
__________________________________________________________________
5. __________________________________________________________________
__________________________________________________________________
Example 33: Marco conducted a study with a sample size of 100 where he compared the number of years
people spent in post-secondary education to their starting salary. He drew a scatter plot of
his results and the corresponding line of best fit, which had a correlation coefficient of
0.98.
The line of best fit is shown below.
If Marco wants to make a change to show that years of
post-secondary education do not significantly increase the
starting salary, he could