Week 2 - Linear Regression
1. Produce a scatterplot of quantitative data with appropriate explanatory and response axes;
2. Recognise a linear pattern and the general formula for a straight line;
3. Calculate a predicted value given the equation of the linear regression line;
4. Add a linear line of best fit to data using MS EXCEL, and describe the regression line (equation and correlation);
5. Assess the closeness of fit using the least-squares criterion as reflected in the correlation coefficient;
6. Obtain residual values and interpret their size and distribution about the line in the form of a residual plot.
PRELIMINARY QUESTIONS:
These problems are to help you engage with the lecture material, and also to make sure that everyone is
up to speed before the workshop starts. Please make sure you do them before class each week!
Q.1 State in your own words what is meant by each of the terms listed below. Be specific.
Term Definition
response variable
negative or none
Regression line Straight line that describes how a response variable (Y) changes as an explanatory variable (X) changes
Q.2 What is the general equation of a straight line? Define all the terms in the equation.
y = mx + c
m = gradient/slope
c = y-intercept (the value of y when x = 0)
y = response variable
x = explanatory variable
What is the regression line equation based on the description of the trend in this example?
Q5.2 You use the same bar of soap to shower each morning. The bar weighs 80 g when new, and its weight goes
down by 5 g per day on average. What is the equation of the regression line for predicting weight from days of use?
y = -5x + 80
x = days of use, y = weight (g)
R^2 = 1 (assuming the weight falls by exactly 5 g each day)
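As a quick check of the soap equation, here is a minimal Python sketch. The function name `predicted_soap_weight` and its parameter names are made up for this illustration; the slope (-5 g per day) and intercept (80 g) come from the question.

```python
def predicted_soap_weight(days_used, start_weight_g=80.0, loss_per_day_g=5.0):
    """Predicted weight (g) of the soap bar after `days_used` days.

    Implements weight = -5 * days + 80 from the question; the function
    and parameter names are illustrative, not part of the worksheet.
    """
    return start_weight_g - loss_per_day_g * days_used


# Print predicted weights for a few days of use.
for days in (0, 1, 7, 16):
    print(f"day {days:>2}: {predicted_soap_weight(days)} g")
```

The line reaches 0 g at 16 days, so predictions beyond that point are extrapolations with no physical meaning.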
WORKSHOP PROBLEMS:
Create a scatterplot with a linear trend (similar to plot #1 below). Observe the size of the correlation
coefficient for different scatter patterns. Use “Draw your own line” to draw a line of best fit.
Change the intercept and slope, trying to minimise the sum of the squares of the residuals as shown
by the “relative SS” value. Compare your line with the “Show least-squares line”, which is placed by
calculation. No written answers are required here; just observe the values.
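The least-squares idea in this activity can be sketched in Python: compute the sum of squared residuals for a hand-drawn guess and compare it with the line placed by calculation. The data values and the guessed slope and intercept below are made up purely for illustration, and the function names are not from the applet.

```python
def sum_squared_residuals(xs, ys, slope, intercept):
    """Sum of squared vertical distances between the data and a given line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


def least_squares_line(xs, ys):
    """Slope and intercept of the line that minimises the sum of squared residuals."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Made-up illustrative data (NOT from the worksheet).
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

# A hand-drawn guess versus the calculated least-squares line.
guess_slope, guess_intercept = 1.8, 0.5
best_slope, best_intercept = least_squares_line(xs, ys)

ss_guess = sum_squared_residuals(xs, ys, guess_slope, guess_intercept)
ss_best = sum_squared_residuals(xs, ys, best_slope, best_intercept)

print(f"guess line:         SS = {ss_guess:.4f}")
print(f"least-squares line: SS = {ss_best:.4f}")
```

No guessed line can give a smaller sum of squares than the least-squares line, which is why the applet's calculated line is the benchmark for your own.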
[Figure: three scatterplots]
Plot 1. Quiz score vs chocolate consumption (x-axis: Daily chocolate consumption (g))
Plot 2. Change in pulse rate with exercise (x-axis: Pulse rate before exercise (beats per minute); y-axis: Pulse rate after exercise (beats per minute))
Plot 3. Measured radioactive decay (x-axis: Time (mins); y-axis: Counts per minute)
PLOT 1 2 3
(If approp.)
Using Excel to produce a scatterplot of the data and add a LINEAR line of best fit:
• Plot the response variable (y-axis) against the explanatory variable (x-axis). In Excel, the left-hand column is used as x;
• Select the chart layout that has the line and fx so that you obtain the equation of the line of best fit;
• Note the coefficients, which are the intercept and slope for the equation;
• Obtain the full regression analysis including a residual plot by using Data Analysis/Regression. Note
that it asks for the y-data column first, and you need the data in columns, not rows;
• Check the appropriateness of linearity by interpreting the residual plot. Describe the scatter or pattern
in the residual plot: a random scatter of residuals, positive and negative, along the added line of best fit
indicates that a linear model IS appropriate. This is important. Data can often look linear, but a closer check
Week 2 Page | 2
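For anyone who wants to cross-check the Excel output, the quantities reported by the regression steps above (slope, intercept, correlation, R-squared and residuals) can be sketched in plain Python. This is not the Excel procedure itself; the function name and the data values are made up for illustration only.

```python
from math import sqrt


def linear_regression_summary(xs, ys):
    """Least-squares slope, intercept, correlation r, r squared and residuals."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r = sxy / sqrt(sxx * syy)
    # Residual = observed y minus the y predicted by the fitted line.
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    return {
        "slope": slope,
        "intercept": intercept,
        "r": r,
        "r_squared": r * r,
        "residuals": residuals,
    }


# Made-up illustrative data (NOT the Sparrowhawk data set).
xs = [10, 20, 30, 40, 50, 60]
ys = [25, 22, 21, 17, 16, 12]

summary = linear_regression_summary(xs, ys)
print(f"slope = {summary['slope']:.3f}, intercept = {summary['intercept']:.3f}")
print(f"r = {summary['r']:.3f}, R^2 = {summary['r_squared']:.3f}")

# A text version of the residual plot: look for a mix of + and - signs
# with no systematic curve as x increases.
for x, e in zip(xs, summary["residuals"]):
    print(f"x = {x:>3}  residual = {e:+.2f}")
```

Residuals from a least-squares line always sum to zero (up to rounding), so what matters in the residual plot is their pattern, not their total.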
Q.5 Do Q4.29 (7th ed: Q4.28) from Moore et al, p.121 and with the same data, Q5.39, p.155.
Download the data set “Sparrowhawk” from the Moodle page/Part 1: Exploring Data.
Describe the association between New Adults arriving and the percentage of returning birds:
Negative association: as the percentage of returning birds (x) increases, the number of new adults (y) tends to decrease.
Weak
What does this R-squared value specifically tell us about this association?
ALSO: Apply linear regression analysis using Excel. Include the residual plot.
(Not done in class? You must attach your plots printed out)
ALSO: Describe the residual plot. Is there any trend: is there a curve of data about the line of best
fit OR are the data points randomly scattered either side along the linear trend line?
What does this tell you about fitting a linear model to these data?
Moore Q5.39 a) What is the equation of the linear model for this relationship?
Do not use x and y designations but replace them with a descriptive notation for the variables.
New adults = -0.304 × (percentage return) + 31.934
What does the slope actually indicate (use the actual value to explain size of influence of x on y)?
(y) from the previous return in the percentage of returning birds (x)
Moore 5.39 c) Use the model to predict the new adult number if 45% of adults from the previous
year return:
Predicted value at x = 45:
ŷ = -0.304 × 45 + 31.934
ŷ = 18.254
EXTRA: Verify the value of the residual for the datum point where x = 45. Show the full
calculation of a residual.
Residual at each x = (observed y value) - (predicted ŷ value from the line)
= 17 - 18.254
= -1.254
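The prediction and residual above can be verified with a few lines of Python. The slope, intercept, x = 45 and observed value 17 are taken from the worked answers; the function names are made up for this sketch.

```python
# Coefficients of the fitted line as given in the worked answer above.
SLOPE = -0.304
INTERCEPT = 31.934


def predict_new_adults(percent_return):
    """Predicted number of new adults for a given percentage of returning birds."""
    return SLOPE * percent_return + INTERCEPT


def residual(observed_new_adults, percent_return):
    """Residual = observed value minus the value predicted by the line."""
    return observed_new_adults - predict_new_adults(percent_return)


predicted = predict_new_adults(45)
print(f"predicted new adults at 45% return: {predicted:.3f}")
print(f"residual for the observed value 17: {residual(17, 45):.3f}")
```

A negative residual means the observed point lies below the regression line.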
From this exercise, you should make sure you are able to:
• draw a scatterplot
• obtain a line of best fit for linear data and identify its equation
• obtain a residual plot
• obtain the correlation coefficient
• state what each of the items above tells you about the data.
MARK : /10