Statistical Analysis Notes

This document discusses analyzing one-variable data through numerical summaries and graphs. It defines measures of central tendency (mean, median, mode), measures of dispersion (range, interquartile range, variance, standard deviation), and explains how to calculate them. Percentiles and z-scores are also introduced to describe data values relative to a distribution. Different types of graphs are described for displaying categorical, ordinal and quantitative data, including circle graphs, bar graphs, histograms, stem-and-leaf plots and box plots. Examples are provided to demonstrate calculating numerical summaries and interpreting graphs.

Uploaded by

blee47
Copyright
© All Rights Reserved

MDM4U – STATISTICAL ANALYSIS

ANALYSING ONE-VARIABLE DATA

Data that is not associated with another data set is called one-variable data.
• Example: The weights of all the students in your class, their ages, their math marks, or the
distance they travel to get to school.

Several different numerical summaries can be applied to one-variable data.

MEASURES OF CENTRAL TENDENCY

The mean of a data set is the sum of all of the values divided by the number of values in the set. In
other words, if the data set is {x1, x2, x3, ..., xn-1, xn}, then the mean is

$$\mu = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

For a data set with an odd number of values, the median is the middle value of the set when the data is
listed in ascending order. For a data set with an even number of values, the median is the mean of the two
middle values of the set when the data is listed in ascending order.

The mode is the value or values that occur most frequently within a data set.

MEASURES OF DISPERSION

The range of a data set is the difference between the maximum data value and the minimum data value.

The interquartile range is the difference between the third quartile (Q3) and the first quartile (Q1) of
the data set: Q3 – Q1.

The variance (v) is calculated as

$$v = \frac{(x_1 - \mu)^2 + (x_2 - \mu)^2 + \cdots + (x_n - \mu)^2}{n}$$

where $\mu$ represents the mean.

The standard deviation ($\sigma$) is the square root of the variance:

$$\sigma = \sqrt{\frac{(x_1 - \mu)^2 + (x_2 - \mu)^2 + \cdots + (x_n - \mu)^2}{n}}$$

USING TECHNOLOGY TO DETERMINE NUMERICAL SUMMARIES

Example 1: Find the mean, median, range, standard deviation, variance, and the interquartile range for the
data set 3, 12, 6, 8, 7, 2, 9, 5, using a TI-83 graphing calculator. Round the approximate
measures to the nearest hundredth.

Follow the steps using a TI-83 graphing calculator:

• Press [2nd] [+] [4:ClrAllLists] [ENTER]
• Press [STAT] [1:Edit] and enter the data in L1
• Press [STAT] [►] to highlight “CALC”, then [1:1-Var Stats], [ENTER]

You should see the following screen:

Press the down arrow several times to view the following additional information.

From the screen display you can see that, to the nearest hundredth, the measures are:

Mean = _____
Median = _____
Range = maxX - minX = _____ - _____ = _____
Standard deviation = _____
Variance = σ² = _____² = _____
Interquartile range = _____ - _____ = _____ - _____ = _____
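
As a cross-check on the calculator output, the same summaries can be computed in Python. This is a minimal sketch and not part of the original notes: it uses the standard `statistics` module, divides the variance by n to match the formula given earlier, and takes Q1 and Q3 as the medians of the lower and upper halves of the sorted data (an assumption about how the quartiles are found). The helper name `quartiles` is made up for illustration.

```python
from statistics import mean, median, pstdev, pvariance

def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    For an odd number of values the middle value is left out of both halves.
    Assumes at least two values.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    return median(ordered[:half]), median(ordered[-half:])

data = [3, 12, 6, 8, 7, 2, 9, 5]   # data set from Example 1
q1, q3 = quartiles(data)

print("Mean:", round(mean(data), 2))
print("Median:", round(median(data), 2))
print("Range:", max(data) - min(data))
print("Standard deviation:", round(pstdev(data), 2))  # population form, divides by n
print("Variance:", round(pvariance(data), 2))
print("Interquartile range:", q3 - q1)
```

The calculator screen lists two standard deviations; `pstdev` corresponds to the one that divides by n, which is the form defined in these notes.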

Example 2: If 10 was added to every value in a data set, then the measure that would change is the
a) range
b) median
c) standard deviation
d) interquartile range
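
One way to explore a question like Example 2 is to add a constant to a small data set and compare the summaries before and after. The sketch below is an illustration only: the data values are arbitrary, the helper `summary` is a made-up name, and the standard deviation is the population form from Python's `statistics` module. The interquartile range could be compared in the same way.

```python
from statistics import median, pstdev

original = [3, 12, 6, 8, 7, 2, 9, 5]   # illustrative values only
shifted = [x + 10 for x in original]    # add 10 to every value

def summary(values):
    """Return the median, range and population standard deviation."""
    return {
        "median": median(values),
        "range": max(values) - min(values),
        "std_dev": round(pstdev(values), 10),  # rounded to hide float noise
    }

before = summary(original)
after = summary(shifted)
for name in before:
    print(f"{name:8s} before: {before[name]:<8} after: {after[name]}")
```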

Example 3: Determine the standard deviation of the data set {5, 8, 9, 12, 15, 15, 17, 20, 21, 27} to the
nearest hundredth.

ANALYSING ONE-VARIABLE DATA

Recall the following:

• When a data set contains a large number of values and the values are in numerical order, the
distribution of the data values often approaches a normal distribution.
• A normal distribution on a graph has the shape of a bell-shaped curve with the mean, median, and
mode all in the centre of the curve.
• The position of an individual data value in the distribution can be described by using percentiles or
z-scores.

The percentile, or percentile rank, for any data value indicates that a certain percent of values fall
beneath that data value. The percentile for a data value, x, is given by the formula:

$$\text{Percentile for } x = \frac{B + \frac{1}{2}E}{n} \times 100$$

where B = number of data values below x, E = number of values equal to x, and
n = total number of values in the data set.

Percentiles are usually truncated to the nearest whole number.
The 25th percentile is equal to the first quartile, Q1.
The 50th percentile is equal to the second quartile, Q2.
The 75th percentile is equal to the third quartile, Q3.

Example 4: In a set of 414 mathematics scores, 164 are below a score of 50% and 8 are equal to
50%. What is the percentile rank of a score of 50%?
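
The percentile formula translates directly into a short function. This is a minimal sketch using the B, E and n defined above; the function name `percentile_rank` is made up for illustration, and the final truncation follows the note that percentiles are usually truncated.

```python
import math

def percentile_rank(below, equal, total):
    """Percentile for x = (B + E/2) / n * 100, before any truncation."""
    return (below + 0.5 * equal) / total * 100

# Example 4: 414 scores, 164 below 50% and 8 equal to 50%.
rank = percentile_rank(below=164, equal=8, total=414)
print(f"unrounded: {rank:.2f}, truncated: {math.floor(rank)}")
```

When a question gives the number of values above x instead, B can be found first as n minus the number above minus the number equal.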

The z-score for a data value of x is the number of standard deviations of the value from the mean
of the data set. For example, a z-score of 1.6 means the data value is 1.6 standard deviations
above the mean, whereas a z-score of -2.1 means the data value is 2.1 standard deviations
below the mean. The z-score for a data value x is given by the following formula:

$$z = \frac{x - \mu}{\sigma}$$

where $\sigma$ = standard deviation and $\mu$ = mean of the set of data.


Example 5: In Kelly's English class, the marks for the 39 students have a normal distribution, with
a mean of 61% and a standard deviation of 11.5. Kelly has a mark of 71%. What is Kelly's
z-score, rounded to the nearest hundredth?
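
The z-score calculation is a single subtraction and division. A minimal sketch follows, with `z_score` as an illustrative function name that is not from the notes:

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / std_dev

# Example 5: mark of 71%, class mean of 61%, standard deviation of 11.5.
print(round(z_score(71, 61, 11.5), 2))
```

A positive result places the value above the mean and a negative result places it below.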

Example 6: Terry had a score of 122 points in a city-wide contest with 3200 participants. There were
473 scores better than Terry's, and 64 scores equal to Terry's. What was Terry's percentile
rank in the contest?
a) 16 b) 17 c) 84 d) 85


Example 7: Karl is 171 cm tall. The mean height of the students in Karl's class is 156 cm, and the standard
deviation of their heights is 2.2 cm. If the heights have a normal distribution, then
determine the z-score for Karl's height, to the nearest hundredth.

GRAPHS

Circle graphs are used to show how one part relates to the whole by splitting the circle into sectors. The size of each sector is proportional to the amount it represents.
• For example, a circle graph could be used to show how the household budget is divided up for the various expenses.

Bar graphs are used to display relative amounts or frequencies for comparable categories.
• For example, the annual sales of the major auto manufacturers can be compared with a bar graph.

Histograms are used to display relative frequencies of categories of quantitative data. The areas of the bars in a histogram represent the values for each category.
• For example, frequencies of income categories for all wage earners in Ontario can be compared using a histogram.

Stem-and-leaf plots show the distribution of data as it is collected.
• For example, after marking a set of class tests, the teacher can record the scores in a stem-and-leaf plot to see the distribution of the scores.

Box plots are used to summarize the distribution of data by showing the median, the first and third quartiles, and the minimum and maximum data values.
• For example, the test marks data from the stem-and-leaf plot in the previous example could be summarized using a box plot.

The types of data and the graphs commonly used to display each type are summarized in the following
table:

Type of Data | Description | Example | Type(s) of Graphs
Categorical | Data values can be put into non-overlapping categories. | An example is the number of males and females in a math class. | Circle Graphs, Bar Graphs
Ordinal | Data values can be ranked, or put in order. The data values can be counted but not measured. | For example, a movie rating in which the movie is rated from 1 to 5 stars by a group of moviegoers. | Bar Graphs
Quantitative | Data that applies to an amount, a quantity, or a range of values. Usually measurement units are associated with the data. | For example, the heights of students in the class. | Histograms, Stem-and-Leaf Plots, Box Plots

Example 8: A family has a total monthly budget of $3200. They budget $650 for groceries. If a circle
graph were drawn to display the budget categories, what would be the measure of the central
angle of the sector that represents the groceries? Give the answer to the nearest degree.
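
Because each sector is proportional to the amount it represents, its central angle is the same fraction of 360° that the part is of the whole. A minimal sketch follows, with `central_angle` as an illustrative function name:

```python
def central_angle(part, whole):
    """Central angle in degrees of a circle-graph sector."""
    return part / whole * 360

# Example 8: $650 for groceries out of a $3200 monthly budget.
print(round(central_angle(650, 3200)))
```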

Example 9: Which of the following graph types would best display the populations of Canada’s ten
provinces?
a) Box plot
b) Bar Graph
c) Histogram
d) Stem-and-Leaf plot


Example 10: The following data represents the vertical jump heights of physical education students at a
particular elementary school, rounded to the nearest centimetre. Use a TI-83 graphing
calculator to draw a histogram and a box plot for the data.
Height (cm)   Frequency
14            2
15            4
16            6
17            9
18            13
19            15
20            14
21            11
22            8
23            7
24            7
25            4

Using a TI-83 Graphing Calculator

1. Press [2nd] [+] [4:ClrAllLists] [ENTER]
2. Press [STAT] [1:Edit] and enter the height in L1 and the frequency in L2
3. Press [2nd] [Y=] [ENTER]
4. Use the arrows to scroll to On, press [ENTER], scroll to the histogram icon, press [ENTER]
5. Scroll to Xlist:, press [2nd] [1] to choose L1
6. Scroll to Freq:, press [2nd] [2] to choose L2
7. Press [ZOOM], scroll down to 9:ZoomStat, press [ENTER]
8. A histogram is displayed; however, the x-scale needs to be adjusted to 1.
9. Press [WINDOW], scroll down to Xscl=, press [1]
10. Press [GRAPH]

The following graph should appear:

The box plot is the fifth type of graph listed in the STAT
PLOT menu.
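
Without a calculator, the same frequency table can be explored in Python. The sketch below is an illustration under stated assumptions: it prints a rough text histogram instead of a real plot, and it finds the quartiles as the medians of the lower and upper halves of the data. It expands the table into one entry per student, much as choosing L2 for Freq: does on the calculator.

```python
from statistics import median

heights = [14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25]     # cm
frequencies = [2, 4, 6, 9, 13, 15, 14, 11, 8, 7, 7, 4]

# Text histogram: one '#' per student at each height.
for height, freq in zip(heights, frequencies):
    print(f"{height} cm | {'#' * freq} ({freq})")

# One list entry per student; already sorted because heights are ascending.
values = [h for h, f in zip(heights, frequencies) for _ in range(f)]
half = len(values) // 2
five_number = (min(values), median(values[:half]), median(values),
               median(values[-half:]), max(values))
print("min, Q1, median, Q3, max:", five_number)
```

The five values printed at the end are the ones a box plot displays.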


Example 11: The following graph was obtained by entering one-variable data into a TI-83 graphing
calculator.

Which of the following data sets was most likely entered into the calculator to obtain the
given graph?
a) Restaurant ratings on a 10-point scale
b) Sales of different types of vacuum cleaners
c) The number of days with specific high temperatures in September
d) Marks on a math exam compared to amount of time spent studying

INTERPRETING THE MEANING OF STATISTICS

When a survey is conducted using a sample of the population, it is unlikely that the sample represents the
entire population exactly. If the same survey was conducted on a different sample of the population, the
chances are that the results would be at least somewhat different.

The confidence level is a measure of how confident the statistician is of the results. A 95% confidence
level means that the statistician is 95% confident of the results. The margin of error establishes the
interval for which the confidence level applies.

Assume a survey was done that reports that 70% of Canadians prefer watching hockey to football on
television. The survey states that it is accurate to within 3 percentage points 95% of the time. What
does this last statement mean? It means that if the survey were conducted 100 times, about 95 times you would
get a result of 70% ± 3%, or between 67% and 73%, of the respondents saying they prefer hockey.

If the sample size in the previous example were 1000 people, and a new survey were done with a sample
size of 2 000 people, then the margin of error for the same 95% confidence level would decrease,
possibly to 2%. Increasing the sample size has the effect of decreasing the margin of error, but not
proportionally, i.e., doubling the sample size does not halve the margin of error.
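
The effect of sample size can be checked numerically. The notes do not state a formula, so the sketch below assumes the usual normal-approximation margin of error for a proportion, z·sqrt(p(1 − p)/n); the function name `margin_of_error` is made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

def margin_of_error(p, n, confidence=0.95):
    """Margin of error for a sample proportion p with sample size n.

    Assumes the normal approximation, which is not stated in the notes.
    """
    z = NormalDist().inv_cdf((1 + confidence) / 2)  # about 1.96 for 95%
    return z * sqrt(p * (1 - p) / n)

small = margin_of_error(0.70, 1000)
large = margin_of_error(0.70, 2000)
print(f"n = 1000: {small:.1%}")
print(f"n = 2000: {large:.1%}")
print(f"ratio: {large / small:.3f}")  # 1/sqrt(2), so more than one-half
```

Under this formula the margin shrinks with the square root of the sample size, which is why doubling the sample does not halve it.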

Dynamic statistical software for a computer, and programs built into some calculators, can efficiently
calculate confidence intervals and margins of error, and demonstrate the relationship between the two as
the sample sizes change. The following example illustrates the procedure with a TI-83 graphing
calculator.

Example 12: In a survey of 170 students, representing a normally distributed population, it was found
that 72 of the students owned a skateboard.

a) If the survey had a 95% confidence level, determine the proportion of students in the
represented population who own a skateboard.
b) Determine the margin of error as a percentage to the nearest tenth.
c) Determine the effect on the margin of error if the sample size were increased to 400.


In order to obtain the answers, follow the given procedure with a TI-83 graphing calculator:

1. Press [STAT], scroll to TESTS
2. Scroll down to A:1-PropZInt, press [ENTER]
3. Type 72 after x:, 170 after n:, and .95 after C-Level: (if it is different)
You should have the following screen:

4. Scroll the cursor over Calculate and press [ENTER].


The result should be:

Solutions:

a)

b)

c)
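
For readers without the calculator, the calculation can be sketched in Python. This is an illustration under stated assumptions: it uses the standard normal-approximation interval for one proportion, and for part c) it assumes the sample proportion stays the same when the sample grows to 400. The function name `proportion_interval` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def proportion_interval(x, n, confidence=0.95):
    """Return (p_hat, margin, low, high) for x successes in n trials."""
    p_hat = x / n
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, margin, p_hat - margin, p_hat + margin

# Parts a) and b): 72 of 170 students own a skateboard.
p_hat, margin, low, high = proportion_interval(72, 170)
print(f"proportion: {p_hat:.3f}, interval: ({low:.3f}, {high:.3f})")
print(f"margin of error: {margin:.1%}")

# Part c): same sample proportion with n = 400 (an assumption).
_, margin_400, _, _ = proportion_interval(72 / 170 * 400, 400)
print(f"margin of error with n = 400: {margin_400:.1%}")
```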

Example 13: If the sample size for a normally distributed population were doubled from 200 to 400, then
the margin of error for a 95% confidence level will

a) be exactly half the original margin of error


b) be exactly double the original margin of error
c) increase, but it will still be less than double the original margin of error
d) decrease, but it will still be more than one-half the original margin of error


Example 14: The following screen shows information obtained by using the A:1-PropZInt function on a
TI-83 graphing calculator.

For the data that has been entered in the A:1-PropZInt function, what is the margin of error
for the proportion, to the nearest tenth of a percent?
a) 1.8% b) 3.6% c) 7.6% d) 9.5%

UNDERSTANDING STATISTICAL SUMMARIES

The following two graphs compare the sales in a particular year for companies A and B.

Although the two graphs show the same information, they will make different impressions on readers. In
the first graph, it appears that the sales for company B were not much greater than the sales for
company A. However, the second graph shows that the sales for company B were significantly greater
than the sales for company A.

Company A would likely use the first graph to show that they are competitive with company B. On the
other hand, company B would likely use the second graph to show that their sales are far greater than the
sales for company A.

The term "average" has a very loose meaning. Although it is usually interpreted as the measure of the
mean, it could be used to represent any of the measures of central tendency. The mean is not always the
best representation of central tendency, as illustrated in the next example.

Example 14: A president of a small company reports that the average annual income of his employees is
$55 000. However, the president does not report the amounts of their salaries, which are
$28 000, $30 000, $40 000, $42 000, $50 000, $52 000, $98 000, and $100 000. Although
the president correctly reported the mean salary, this report is misleading because only two
of the eight employees earn more than $55 000. A better representation of the "average" in
this situation would be to use the ____________ salary, which is $____________
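
The gap between the measures of central tendency for this salary list can be checked with a few lines of Python. This is a minimal sketch using the standard `statistics` module; the variable names are illustrative.

```python
from statistics import mean, median

salaries = [28_000, 30_000, 40_000, 42_000, 50_000, 52_000, 98_000, 100_000]

average = mean(salaries)
middle = median(salaries)
above = sum(s > average for s in salaries)  # employees earning more than the mean

print("mean:", average)
print("median:", middle)
print("employees above the mean:", above)
```

The two large salaries pull the mean upward while leaving the middle of the ordered list unchanged.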


Example 15: The following data set represents the number of goals scored by individual players on a
professional hockey team during the 2007-2008 season.

51, 50, 35, 22, 21, 16, 8, 7, 6, 6, 6, 3, 3, 2, 2, 2, 1, 1, 1, 0

The coach of the team reports that the average number of goals scored by his players is
more than 12. Which of the following statements is true?

a) The coach reported the mean number of goals, and it is not misleading.
b) The coach reported the median number of goals and it is misleading; the mode would have
been a better measure of central tendency.
c) The coach reported the mean number of goals and it is misleading; the median would have
been a better measure of central tendency.
d) The coach reported the median number of goals and it is misleading; the mean would have
been a better measure of central tendency.

Example 16: In preparation for a meeting with shareholders, the president of a company wants to prepare
a graph to show that while this year's revenues are less than last year's revenues, they are
not much less.
Which of the following described graphs would best convey his message?

a) A bar graph using a vertical scale that makes the bar representing this year's revenues
slightly shorter than the bar representing last year's revenues.
b) A bar graph using a vertical scale that makes the bar representing this year's revenues
much shorter than the bar representing last year's revenues.
c) A circle graph that shows the sector representing this year's revenues smaller than a
sector representing last year's revenues.
d) A line graph with a line connecting this year's revenues to last year's revenues.

Example 17: Consider the following headline and pictograph appearing in a newspaper.

Is the pictograph an accurate representation of the relative teacher earnings? Explain.


ANALYSING TWO-VARIABLE DATA

In two-variable data, a relationship, or function, exists between the two data sets. For some types of
functions, such as a linear function, a correlation coefficient can be used to determine how closely the
two data sets fit this particular relationship.

The linear correlation coefficient, denoted by r, measures the fit of two sets of data to a linear model.
The value of r can be -1 or +1, or any value in between -1 and +1 (-1 ≤ r ≤ +1).

If r is close to +1, then there is a strong positive correlation between the variables.
This means that one variable tends to increase as the other variable increases. If r = +1, then the relation is a
linear relation with a positive slope.

If r is close to -1, then there is a strong negative correlation between the variables.
This means that one variable tends to increase while the other variable decreases. If r = -1, then the
relation is a linear relation with a negative slope.

If r = 0, there is no linear correlation between the variables.

Generally, if |r| > 0.8 then the relationship has a strong correlation, and if |r| < 0.5, the relationship has
a weak correlation.

Example 18: A survey was completed to determine the correlation between the amount of time students
spent studying mathematics in a week and their math marks. The following table shows a
sample of the data collected.
Study Time (h) Math Mark (%)
2 52
2.5 60
3 65
3.5 68
4 68
4.5 65
5 72
5.5 75
6 80
6.5 80
7 78
7.5 74
Use a calculator to determine the linear correlation coefficient, r, and whether there is a
strong positive correlation.

In order to obtain the answers, follow the given procedure with a TI-83 graphing calculator:

1. Press [2nd] [+] [4:ClrAllLists] [ENTER]

2. Press [2nd] [0], scroll down to [DiagnosticOn], press [ENTER]

3. Press [STAT] [1:Edit] and enter the study time in L1 and the math mark in L2.

4. Press [STAT] [►] to highlight “CALC”, then scroll down to [4:LinReg(ax+b)], press [ENTER], [ENTER]

You should see the following screen:

The correlation coefficient is approximately 0.888, which indicates a strong positive correlation.
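
The notes rely on the calculator for r and do not give a formula. As a cross-check, the sketch below assumes the standard Pearson formula, r = Sxy / sqrt(Sxx · Syy), where each S is a sum of products of deviations from the means. The function name `correlation` is made up for illustration.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson linear correlation coefficient r for paired data."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Data from Example 18.
study_hours = [2, 2.5, 3, 3.5, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5]
math_marks = [52, 60, 65, 68, 68, 65, 72, 75, 80, 80, 78, 74]

r = correlation(study_hours, math_marks)
print(f"r = {r:.3f}")
```

The same function can be reused for any of the paired data tables in these notes.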

A contingency table can display the frequencies of data elements that are classified by two variables in
which the rows of the table represent one variable and the columns of the table represent the other
variable.

Example 19: A sample of 40 patients were randomly given either Drug A or Drug B as part of a
pharmaceutical drug trial. The test patients either reacted positively to the drug they were
given or they did not react. The results of this medical trial are summarized in the following
contingency table, in which the numbers represent the frequencies of results.

                    Drug A   Drug B   Row Totals
Positive Reaction     11       16        27
No Reaction            8        5        13
Column Totals         19       21        40

From the contingency table, it appears that Drug B was more effective than Drug A because a
greater proportion of the users had a positive reaction. However, more studies would be
necessary because it may be that the 11 who reacted positively to Drug A would have no
reaction to Drug B, or that the 16 that reacted with Drug B may also react with Drug A.
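
The comparison of proportions behind this conclusion can be made explicit. The sketch below is illustrative only: it divides each drug's positive reactions by its column total, and the dictionary layout and variable names are not from the notes.

```python
# Frequencies from the contingency table in Example 19.
table = {
    "Drug A": {"positive": 11, "none": 8},
    "Drug B": {"positive": 16, "none": 5},
}

shares = {}
for drug, counts in table.items():
    total = counts["positive"] + counts["none"]   # column total
    shares[drug] = counts["positive"] / total
    print(f"{drug}: {counts['positive']}/{total} = {shares[drug]:.1%} positive")
```

Comparing proportions instead of raw counts matters here because the two groups are not the same size.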

Example 20: A survey collected data from adult men comparing the amount of time they spent exercising
to the number of hours of television they watched on a weekly basis. The following sample of
data is from the survey.
Exercising (h)   Television (h)
0                12
0.5              10
1                14
2                10
2.5              10
3                8
3.5              8
4                6
4.5              6
5                5

For the data given in the table, the linear correlation coefficient is approximately
a) -0.90
b) -0.81
c) 0.81
d) 0.90

Example 21: Two experimental groups of tree seedlings were given either Fertilizer X or Fertilizer Y.
They were all planted in the same type of soil, and they received the same amount of water
and sunlight. After two months, the heights of the seedlings were measured to see how many
were taller or shorter than 20 cm tall, which is the expected height of the seedlings without
fertilizer applied. The results of the experiment are shown in the following table.

                Fertilizer X   Fertilizer Y   Row Totals
Above 20 cm          11             13            24
Below 20 cm          14             12            26
Column Totals        25             25            50

Which of the following statements best describes the effectiveness of the fertilizers?

a) Both fertilizers stunt the growth of the seedlings.


b) Fertilizer X is significantly more effective than fertilizer Y.
c) Fertilizer Y is significantly more effective than fertilizer X.
d) Neither fertilizer appears to be more effective than the other.

Example 22: The following table shows the average wage in each of the ten regions in Canada, as well as
the average number of working days lost per worker in each region due to any cause (e.g.,
illness, injury). The data is adapted from Statistics Canada for 2007.
Region   Average Wage ($)   Days Lost (days)
A        16.91              10.2
B        15.07              9.8
C        17.29              12.0
D        16.58              10.5
E        19.20              12.0
F        21.31              9.3
G        18.39              10.8
H        18.87              10.5
I        22.33              9.0
J        20.37              10.1

Determine the linear correlation coefficient to the nearest hundredth for the relationship between the
average wage and the number of days lost per worker.

Interpret what this value tells you about the relationship.

Hint: A calculator or computer with statistical functions is required for this question.

RELATIONSHIPS BETWEEN TWO VARIABLES

Example 23: A study examines the speed limits on various roads in a city and the number of accidents
that occur on these roads. The data results in a relationship with a linear correlation
coefficient of 0.36. What does this correlation coefficient mean? Are there any other
pieces of data that might be significant to help describe the relationship between the speed
limits and accidents?

Example 24: Which of the following relationships most likely has a linear correlation coefficient of -0.10?
a) The speeds of cars and their stopping distances
b) The amount of cloud cover at night and the number of visible stars.
c) The amount of water applied to a lawn and the outdoor temperature.
d) The price of a cup of coffee and the price of a muffin at numerous coffee shops.

GRAPHICAL SUMMARIES OF TWO-VARIABLE DATA

Two-variable data that is categorical is often summarized in a contingency table.

• For example, consider the data summarized in the contingency table below:

                    Drug A   Drug B   Row Totals
Positive Reaction     11       16        27
No Reaction            8        5        13
Column Totals         19       21        40 (Grand Total)

Two-variable categorical data may be graphed by using double bar graphs or side-by-side circle
graphs.
Two-variable data that is ordinal may also be summarized in a double bar graph.

Two-variable data that is quantitative is most often summarized by a scatter plot and a corresponding
line (or curve) of best fit.

If one of the variables is categorical and the other is quantitative, side-by-side graphs of various
types are possible. Side-by-side box plots are illustrated in the following example.

Example 25: The following sets of math scores are from students who wrote a test before having a review
class and another test after having a review class.
Student Before Review After Review
A 65 84
B 48 65
C 76 98
D 43 67
E 87 86
F 67 76
G 55 44
H 59 68
I 76 87
J 56 78
K 76 66
L 88 89
M 39 58
N 75 54

For the data given in the table above, use a TI-83 (or equivalent) graphing calculator to
a) Draw a scatter plot of the data with the original scores as the independent variable and
the scores after the review as the dependent variable.
b) Graph side-by-side box plots of the data.
c) Determine the median and interquartile range for each set of data.

Solution
a) Using a TI-83 graphing calculator the steps are:
1. Press [2nd] [+] [4:ClrAllLists] [ENTER].
2. Press [STAT] [1:Edit] and enter the “before review” scores in L1 and the “after review”
scores in L2.
3. Press [2nd] [Y=] [ENTER]. Use the arrows to scroll to “On”, press [ENTER], scroll to
the first icon (scatter plot), press [ENTER]
4. Check that the variables are labelled as Xlist:L1 and Ylist:L2. Change if necessary.
5. Press [ZOOM], scroll down to 9:ZoomStat, press [ENTER].

You should see the following screen:

b) Using a TI-83 graphing calculator the steps are:

1. Press [2nd] [Y=] [ENTER] to access Plot 1. Use the arrow keys to scroll to “On”, press [ENTER],
scroll to the 5th icon (2nd box plot), press [ENTER].
2. Scroll to Xlist:, press [2nd] [1] to choose L1.
3. Scroll to Freq: and check that it is set to 1, since each score is listed once.
4. Press [2nd] [Y=], then press [2] to access Plot 2. Use the arrows to scroll to
“On”, press [ENTER], scroll to the 5th icon (2nd box plot), press [ENTER].
5. Scroll to Xlist:, press [2nd] [2] to choose L2.
6. Scroll to Freq: and check that it is set to 1.
7. Press [ZOOM], scroll down to 9:ZoomStat, press [ENTER].

You should see the following screen:

c) Using the TRACE button and the scrolling arrows, you should find that the median for the
first set of data is _____, and that the interquartile range is _____ - _____ = _____.

For the second set of data you should find that the median is _____, and that the
interquartile range is _____ - _____ = _____.
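
The values read off the box plots with TRACE can be cross-checked in Python. This is a sketch under one assumption: Q1 and Q3 are taken as the medians of the lower and upper halves of the sorted scores. The function name `box_plot_summary` is made up for illustration.

```python
from statistics import median

def box_plot_summary(scores):
    """Return (min, Q1, median, Q3, max) using medians of the two halves."""
    ordered = sorted(scores)
    half = len(ordered) // 2
    return (ordered[0], median(ordered[:half]), median(ordered),
            median(ordered[-half:]), ordered[-1])

before = [65, 48, 76, 43, 87, 67, 55, 59, 76, 56, 76, 88, 39, 75]
after = [84, 65, 98, 67, 86, 76, 44, 68, 87, 78, 66, 89, 58, 54]

for label, scores in (("Before review", before), ("After review", after)):
    low, q1, med, q3, high = box_plot_summary(scores)
    print(f"{label}: median = {med}, IQR = {q3} - {q1} = {q3 - q1}")
```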

Example 26: A survey collected data from golfers comparing the average number of times they played per
month to their average scores.
Games per Month   Average Score
4                 98
5                 92
6                 89
7                 90
8                 86
9                 85
10                82
11                83
12                78
13                78

Which of the following statements is most accurate?
a) The data is one-variable data and is best graphed by using a box plot.
b) The data is one-variable and is best graphed by using a bar graph.
c) The data is two-variable and is best graphed by using a scatter plot.
d) The data is two-variable and is best graphed by using side-by-side circle graphs.

Example 27: Patrons of a restaurant were asked to fill out a form rating the restaurant on its food and on
its service. Both ratings were on a scale of 1 to 5. Which of the following graphing choices
would give the owner of the restaurant the best visual summary of the results of the
survey?
a) Double bar graph with one bar for food and one bar for service.
b) Side-by-side circle graphs with one circle for food and one for service.
c) A scatter plot.
d) A histogram.

LINEAR REGRESSION

Recall: If r > 0.8 then the relationship has a strong correlation, and if r < 0.5, the relationship has a
weak correlation.

Linear regression involves the following”

 Finding the equation of the linear model (line of best fit).

 Determining the effect of outliers (points that are a significant distance from the line). The
vertical distance from an outlier to a line of best fit is called a vertical deviation or residual.
 An outlier can have a major effect on the value of the regression coefficient, r, especially
when the number of points is relatively small. When working with regression models you need
to be aware of outliers, their significance, and what may have caused them.

Example 28: The following data was collected about sales of a popular music CD from various Internet
sites.
Price per CD ($)   Number Sold
7.95               72
8.49               70
8.99               65
9.25               60
9.49               86
9.99               55
10.49              52
10.99              35
11.25              32
11.49              25

Use a TI-83 calculator to
a) determine the linear equation of the line of best fit with the values of a and b to the nearest tenth.
b) sketch the graphs of the scatter plot of the points and the line of best fit.
c) determine the residual, rounded to the nearest tenth, for the most extreme outlier.

Solution
a) Using a TI-83 graphing calculator the steps are:

1. Press [2nd] [+] [4:ClrAllLists] [ENTER].
2. Press [2nd] [0], scroll down to [DiagnosticOn], press [ENTER].
3. Press [STAT] [1:Edit] and enter the price per CD amounts in L1 and the # sold in L2
4. Press [STAT] [►] to highlight “CALC”, then scroll down to [4:LinReg(ax+b)], press
[ENTER], [ENTER]
You should see the following screen:

You can see that the linear equation is approximately __________________________.

b) Using a TI-83 graphing calculator, the steps are:


1. Press [2nd] [Y=] [ENTER]. Use the arrows to scroll to “On”, press [ENTER], scroll to
the first icon (scatter plot), press [ENTER]
2. Check that the variables are labelled as Xlist:L1 and Ylist:L2. Change if necessary.
3. Press [ZOOM], scroll down to 9:ZoomStat, press [ENTER].

You should see the following screen:

To insert the line of best fit, you have a couple of choices. You can type -13.9x + 191.9
into the Y1 = function, or you can paste the linear regression equation into Y1 = by the
following procedure:

4. Press [Y=] [VARS] [5:Statistics], use the right arrow to highlight “EQ”, then press [1:RegEQ] [ENTER]

You should see the following screen:

5. Press [GRAPH] to graph the line of best fit.

You should see the following screen:

c) The TI-83 calculator has an automatic residual feature. It is in the LIST NAMES menu.
You can see the residuals on the main screen or paste them into a number of places. One
of the best places is into L3.
1. Press [STAT] [1:Edit] and use the scrolling arrows to put the cursor on L3.
2. Press [2nd] [STAT], scroll down to 7:RESID, press [ENTER], [ENTER]. The residuals
should now appear in L3.

You can see that the residual for the most extreme outlier, which is the _____ point, is
approximately _____.
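
The line of best fit and the residuals can also be computed directly. The notes rely on the calculator and give no formula, so this sketch assumes the standard least-squares formulas (slope = Sxy / Sxx, intercept = mean of y minus slope times mean of x) and takes each residual as the observed value minus the predicted value. The function name `least_squares` is made up for illustration.

```python
def least_squares(xs, ys):
    """Return slope a and intercept b of the least-squares line y = ax + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    return a, mean_y - a * mean_x

# Data from Example 28.
prices = [7.95, 8.49, 8.99, 9.25, 9.49, 9.99, 10.49, 10.99, 11.25, 11.49]
sold = [72, 70, 65, 60, 86, 55, 52, 35, 32, 25]

a, b = least_squares(prices, sold)
print(f"y = {a:.1f}x + {b:.1f}")

# Residual = observed value minus the value predicted by the line.
residuals = [y - (a * x + b) for x, y in zip(prices, sold)]
worst = max(range(len(residuals)), key=lambda i: abs(residuals[i]))
print(f"largest residual: point {worst + 1}, {residuals[worst]:.1f}")
```

The point with the largest absolute residual is the one farthest from the line in the vertical direction, which is how this sketch identifies the most extreme outlier.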
Example 29: For the data in the following table, the line of best fit in the form y = ax + b is
approximately

x   3   7   8   9   12   14   16   19   20   25
y   28  26  25  20  18   17   15   12   11   6

a) y = -1.0x – 1 b) y = -1.0x – 31.5 c) y = -1.3x + 31.5 d) y = -1.0x + 31.5

Example 30: Which of the following statements regarding outliers is most accurate?
a) The effect of outliers on the equation of the line of best fit is always very slight.
b) The effect of outliers on the equation of the line of best fit is always slight when the
number of data points is small.
c) The effect of outliers on the equation of the line of best fit may be great when the
number of data points is small.
d) The effect of outliers on the equation of the line of best fit is always great when the
number of data points is great.
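To explore the idea behind Example 30, the sketch below adds one outlier to a small data set and to a large data set and compares how far the slope of the line of best fit moves. All of the data are hypothetical: every point lies exactly on y = 2x except a single added point that sits 28 units above the line.

```python
def slope(points):
    """Slope of the least-squares line through a list of (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx

# Hypothetical data lying exactly on y = 2x.
small = [(x, 2 * x) for x in range(1, 6)]    # 5 points
large = [(x, 2 * x) for x in range(1, 51)]   # 50 points

# Add one outlier to each set, 28 units above the line y = 2x.
small_out = small + [(6, 40)]
large_out = large + [(51, 130)]

shift_small = abs(slope(small_out) - slope(small))
shift_large = abs(slope(large_out) - slope(large))

print(f"Small set: slope {slope(small):.2f} -> {slope(small_out):.2f}")
print(f"Large set: slope {slope(large):.2f} -> {slope(large_out):.2f}")
```

In this made-up case the same vertical miss changes the slope far more when there are only a few points to balance it.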

Example 31: The following stopping distances were measured for a particular brand of motorcycle.

Speed (km/h)    Stopping Distance (m)
20              5.6
30              7.5
40              9.8
50              13.2
60              17.4
70              21.5
80              25.3
90              28.2
a) Use the data in the given table to determine the equation of a linear regression line of the
form D = ax + b, where x is the speed of the motorcycle and D is the stopping distance.
Give the values of a and b to the nearest hundredth.

b) Use the equation from part a) to estimate, to the nearest tenth of a metre, the stopping
distance when the motorcycle is travelling at 120 km/h.
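The two parts of Example 31 can be sketched in Python as a check on calculator work. The sketch assumes the standard least-squares formulas and, as part b) asks, uses the coefficients from part a) after rounding them to the nearest hundredth. The function name is illustrative.

```python
def least_squares(xs, ys):
    """Return (a, b) for the least-squares line D = ax + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    return a, mean_y - a * mean_x

# Data from the table in Example 31.
speeds = [20, 30, 40, 50, 60, 70, 80, 90]                    # km/h
distances = [5.6, 7.5, 9.8, 13.2, 17.4, 21.5, 25.3, 28.2]    # m

a, b = least_squares(speeds, distances)

# Part a): coefficients to the nearest hundredth.
a_r, b_r = round(a, 2), round(b, 2)
print(f"D = {a_r}x + ({b_r})")

# Part b): substitute x = 120 into the rounded equation.
estimate = a_r * 120 + b_r
print(f"Estimated stopping distance at 120 km/h: {estimate:.1f} m")
```

Because 120 km/h lies outside the measured range of 20 to 90 km/h, the estimate is an extrapolation and should be treated with caution.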


PRESENTING TWO-VARIABLE DATA

When analyzing two-variable statistical summaries, you should be aware of the following factors and
techniques that may misrepresent the true relationships between the variables.

1. The sample size is too small to represent the population.
2. The sample is biased so it does not represent the population.
3. A line of best fit is given without a correlation coefficient so there is no way of knowing how well the
line fits the data. The line of best fit may be skewed by one or more extreme outliers.
4. The scales on the axes are stretched or shrunk to show an increased or decreased slope.
5. There are other variables that affect the statistics that are not part of the data.
6. The data may represent an accidental relationship.
7. The data summary does not give a comparison to a standard or uninfluenced group (e.g., in the case of
drug testing).
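Factor 3 can be illustrated with a short calculation: a correlation coefficient reported beside the line of best fit tells the reader how closely the points follow the line. The sketch below computes r for the data in Example 29, assuming the usual formula r = Sxy / sqrt(Sxx × Syy). The function name is illustrative.

```python
import math

def correlation(xs, ys):
    """Correlation coefficient r for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Data from the table in Example 29.
xs = [3, 7, 8, 9, 12, 14, 16, 19, 20, 25]
ys = [28, 26, 25, 20, 18, 17, 15, 12, 11, 6]

r = correlation(xs, ys)
print(f"r = {r:.2f}")
```

A value of r close to -1 or 1 indicates that the points lie close to the line, while a value near 0 indicates that the line describes the data poorly.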

Example 32: Sonia conducted a study to determine if there was a relationship between the average
amount of time a student spends on the computer each day and his or her math mark. She
surveyed students in her class and obtained data that led to scatter plots with the two lines
of best fit shown below.

From the results of her study, Sonia makes the claim, "Boys should spend more time on the
computer to increase their math marks, and girls should spend less time on the computer."

List at least three things that may make Sonia's claim false.

1. __________________________________________________________________

__________________________________________________________________

2. __________________________________________________________________

__________________________________________________________________

3. __________________________________________________________________

__________________________________________________________________

4. __________________________________________________________________

__________________________________________________________________

5. __________________________________________________________________

__________________________________________________________________


Example 33: Marco conducted a study with a sample size of 100 where he compared the number of years
people spent in post-secondary education to their starting salary. He drew a scatter plot of
his results and the corresponding line of best fit, which had a correlation coefficient of
0.98.
The line of best fit is shown below.
If Marco wants to make a change to show that years of post-secondary education do not
significantly increase the starting salary, he could

A. increase the sample size.
B. eliminate some outliers.
C. stretch the horizontal scale on the graph.
D. stretch the vertical scale on the graph.
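The effect of stretching an axis can be sketched numerically. The slope in data units stays the same, but how steep the line looks on paper depends on how many data units each centimetre of an axis represents. The slope and scales below are hypothetical values chosen for illustration, and the function name is illustrative.

```python
import math

def drawn_angle(slope, x_units_per_cm, y_units_per_cm):
    """Angle (degrees) the line makes with the horizontal axis on paper."""
    # A run of 1 cm covers x_units_per_cm data units, which rises
    # slope * x_units_per_cm data units, or that many / y_units_per_cm cm.
    drawn_slope = slope * x_units_per_cm / y_units_per_cm
    return math.degrees(math.atan(drawn_slope))

slope = 5  # hypothetical: 5 salary units per year of education

original = drawn_angle(slope, x_units_per_cm=1, y_units_per_cm=5)
stretched_x = drawn_angle(slope, x_units_per_cm=0.5, y_units_per_cm=5)
stretched_y = drawn_angle(slope, x_units_per_cm=1, y_units_per_cm=2.5)

print(f"Original graph:             {original:.1f} degrees")
print(f"Horizontal scale stretched: {stretched_x:.1f} degrees")
print(f"Vertical scale stretched:   {stretched_y:.1f} degrees")
```

Stretching an axis here means giving each data unit more room on the page, so fewer data units fit in each centimetre.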
Example 34: Rowena wanted to study the relationship between exercise and heart rate. Using the
students in her class, she measured their resting heart rates and then had them skip for
anywhere from 30 s to 5 min. Immediately after they finished skipping, she recorded
their heart rates. Rowena did not get the linear relationship she was expecting. Which of the
following is the most likely cause of the non-linear relationship?

a) Students' resting heart rates (before exercise) vary greatly.
b) Some students have trouble skipping.
c) There would be too many extreme outliers.
d) Heart rate as a function of skipping time is not a linear relationship.
