
Chapter 7: Looking for Groups of Explanatory Variables through Multiple Regression: Predicting Important Factors in First-Grade Reading

7.4.5 Application activity: Multiple Regression


2.
ANALYZE > REGRESSION > LINEAR. Put the Grade 1 reading score into the "Dependent" box and the five explanatory variables (non-verbal reasoning, working memory, naming speed, L2 phonemic awareness and kindergarten L2 reading) into the "Independent" box. Leave the Method as "Enter".
Open the STATISTICS button and tick "confidence intervals", "casewise diagnostics", "descriptives", "part and partial correlations" and "collinearity diagnostics".
Open the PLOTS button and put SRESID in the "Y" axis box and ZPRED in the "X" axis box and tick "Normal probability plot".
Click the SAVE button and check Mahalanobis and Cook's under the "Distances" box.
Press OK and run the regression.
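If you would rather work from a syntax window than through the menus, the choices above paste out roughly as the command below. This is only a sketch: apart from KINDERL2READING, the variable names (GRADE1READING for the dependent variable and NONVERBAL, WORKMEM, NAMING and PHONAWARE for the other predictors) are placeholders, so substitute the names actually used in the data file.

* Sketch of the menu choices above; all names except KINDERL2READING are placeholders.
REGRESSION
  /DESCRIPTIVES MEAN STDDEV CORR SIG N
  /STATISTICS COEFF OUTS CI(95) R ANOVA COLLIN TOL ZPP
  /DEPENDENT GRADE1READING
  /METHOD=ENTER NONVERBAL WORKMEM NAMING PHONAWARE KINDERL2READING
  /SCATTERPLOT=(*SRESID, *ZPRED)
  /RESIDUALS NORMPROB(ZRESID)
  /CASEWISE PLOT(ZRESID) OUTLIERS(3)
  /SAVE MAHAL COOK.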

Looking at the relations between the dependent variable and the explanatory variables


The correlation between kindergarten L2 reading and the dependent variable is high (r = .807), so it is the strongest single predictor. The correlation between two of the explanatory variables is also high (r = .712), which could point to multicollinearity.

The model in the "Model Summary" box that includes all 5 explanatory variables has an R2 = .672, which is quite high. Of the individual terms of the regression equation, the Coefficients output box shows that only kindergarten L2 reading (KINDERL2READING) is statistical (t = 5.21, p < .0005).
The standardized coefficients are β = -.025 for non-verbal reasoning, β = .12 for working memory, β = -.106 for naming speed, β = -.083
for L2 phonemic awareness, and β = .769 for KINDERL2READING. You will find these coefficients in the “Coefficients” output box.
At the end of the Coefficients output you will find the VIF column. Here no values are over 5, so there does not appear to be a problem with multicollinearity.
The Residuals Statistics output box does not indicate problems with outliers (standardized residuals, Cook's distance or Mahalanobis distance), but the residuals vs. predicted values plot could indicate some heteroscedasticity (values on the right side of the plot are more constrained than values on the left). The P–P plot does show deviation from the straight line, indicating that the residuals may not be normally distributed.

3.
ANALYZE > REGRESSION > LINEAR. Put the Time 2 grammar score into the "Dependent" box.
Put the Time 1 grammar score into the "Independent" box and change the Method to "Stepwise". Press the NEXT button to indicate that you will enter that variable in the first step.
Now put intelligence test scores (INTELLIG) into the "Independent" box and press NEXT. The third step should enter L2CONTA, the fourth ANWR_1, and the last ENWR_1.
Open the STATISTICS button and tick “confidence intervals”, “casewise diagnostics”, “R squared change”, “descriptives” and “collinearity
diagnostics”. Open the PLOTS button and put SRESID in the “Y” axis box and ZPRED in the “X” axis box and tick “Normal probability
plot”. Click the SAVE button and check Mahalanobis and Cook’s under the “Distances” box. Press OK when back to the LINEAR
REGRESSION dialog box and run the regression.
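Pasting these choices gives syntax along the following lines; each block from the dialog becomes its own METHOD subcommand, and the CHANGE keyword is what produces the "R squared change" column. The names GRAM_2 and GRAM_1 for the Time 2 and Time 1 grammar scores are placeholders (the other four names appear in the instructions above).

* Sequential entry, one stepwise block per variable; GRAM_2 and GRAM_1 are placeholder names.
REGRESSION
  /DESCRIPTIVES MEAN STDDEV CORR SIG N
  /STATISTICS COEFF OUTS CI(95) R ANOVA CHANGE COLLIN TOL
  /DEPENDENT GRAM_2
  /METHOD=STEPWISE GRAM_1
  /METHOD=STEPWISE INTELLIG
  /METHOD=STEPWISE L2CONTA
  /METHOD=STEPWISE ANWR_1
  /METHOD=STEPWISE ENWR_1
  /SCATTERPLOT=(*SRESID, *ZPRED)
  /RESIDUALS NORMPROB(ZRESID)
  /CASEWISE PLOT(ZRESID) OUTLIERS(3)
  /SAVE MAHAL COOK.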

In looking at your output, first look at the box labeled “Variables Entered/Removed” and make sure everything was done in steps the way
you wanted. Next look at the “Model Summary” box.
The overall R2 for the model with all 5 variables entered was R2 = .688, adjusted R2 = .672. This explains quite a lot of what is going on! I
will give a table with the results for the change in R2 (found in the “Model Summary” box), the unstandardized coefficients and the
statistical results for each of the variables in the last model (found in the “Coefficients” box).

Dependent variable: Time 2 grammar

Variable          R2 change   Unstandardized coefficient   t-statistic   p-value
Time 1 grammar    .303         .045                          .577         .57
Intelligence      .013         .186                          .845         .40
L2 contact        .006        -.132                        -1.490         .14
ANWR1             .363         .546                         3.458         .001
ENWR1             .004         .213                         1.051         .30

We can compare the strength of the variables by looking at the R2 change. It is clear that, at least entered in this order, Time 1 grammar is highly predictive of Time 2 grammar scores, but even more highly predictive is the score on the Arabic non-word repetition test (its R2 change is even higher than that of the Time 1 grammar). The t-test shows that the ANWR is the only constituent which is statistical. (By the way, French and O'Brien tried reversing the order of the ENWR and ANWR and found that in that case the ENWR received most of the R2 change (.328) and the ANWR just a little (.038). So it is clear that a measure of phonological memory was the big predictor, and which one it was did not matter much.)

In examining regression assumptions, the VIF column shows that in the model with all 5 variables, both of the phonological memory tests received VIF values a little over 10, indicating a problem with multicollinearity. Given what I said above about reversing the order of the two tests, in order to find the optimal model it would be best to choose one or the other of the phonological memory tests (probably the ANWR, since it had a larger R2 change when entered first than the ENWR did when entered first). In the Residuals Statistics box, no standardized
residuals are above 3 (or below -3) so that is good. For Cook’s distance no scores are above 1, and for Mahalanobis’ distance no scores are
above 15, so we do not seem to have any problems with outliers. For normality, looking at the P–P plot, there appears to be a very good fit
of the data to the line, indicating the residuals are normally distributed. For looking at the homoscedasticity requirement, the scatterplot of
residuals vs. predicted values does not show any evidence of data being more constricted on one side over another. This is quite a clean data
set that satisfies all of the assumptions of regression (a rarity!).

4.
A scatterplot matrix of the data (Graphs > Legacy Dialogs > Scatter/Dot, then choose Matrix Scatter and press DEFINE; put all 6 variables into the "Matrix Variables" box) shows that all of the variables may have a linear relationship with OVERALL except for ENROLL, which looks like a vertical line with a few outliers. Opening the regression dialog box, put OVERALL in the "Dependent" box and all of the other variables in the "Independent" box. Leave the Method as "Enter". Open the same buttons and tick the same boxes as described for #2. This model explains R2 = .76 of the variance in overall scores, a large amount. The Coefficients output box indicates (from the t-test) that the statistical factors were Teach and Knowledge only.
Running another regression with just Teach and Knowledge as the two explanatory factors, the R2 is now .74 (not much lower, but a much
simpler equation). Both factors are statistical components of the regression equation (according to the t-test).
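As a syntax sketch, the reduced model looks like the following; TEACH and KNOWLEDGE are my guesses at the variable names behind the Teach and Knowledge factors, so check the spellings in the data file.

* Reduced model with only the two statistical predictors; TEACH and KNOWLEDGE are assumed names.
REGRESSION
  /STATISTICS COEFF OUTS CI(95) R ANOVA COLLIN TOL
  /DEPENDENT OVERALL
  /METHOD=ENTER TEACH KNOWLEDGE
  /SCATTERPLOT=(*SRESID, *ZPRED)
  /RESIDUALS NORMPROB(ZRESID)
  /SAVE MAHAL COOK.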

In looking at regression assumptions: the VIF does not indicate a problem with multicollinearity; the residuals statistics, Cook's distance and Mahalanobis distance do not indicate a problem with outliers or influential points; the P–P plot looks good, indicating normally distributed residuals; and there is no clear heteroscedasticity in the residuals vs. predicted values plot. Overall, this model seems to satisfy the regression assumptions quite well.

5.
First call for a scatterplot matrix using the commands described in #4 above. Look at the intersection of each explanatory variable with the response variable (SWEAR2). A scatterplot matrix of the intersections of SWEAR2 with the explanatory variables (L2FREQ, WEIGHT2, L2_COMP, L2SPEAK) shows a random scattering of points over pretty much the entire graph, which would violate the assumption of linearity. However, since the points are discrete and not jittered, we cannot see their frequency, so there could indeed be linear trends that are not apparent in the scatterplot. In other words, there may be many more points along a line in the plot, but because we can only see 25 discrete points on the scatterplot, we cannot tell how often each point occurs. If we add regression lines to the data (open the Chart Editor, push the ADD FIT LINE AT TOTAL button (or use the menu) and then CLOSE), there do seem to be linear relationships. We will continue with the analysis.
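For reference, the Matrix Scatter dialog pastes as a one-line GRAPH command using the variable names given above (the fit lines still have to be added by hand in the Chart Editor):

* Scatterplot matrix of the response and the four explanatory variables.
GRAPH
  /SCATTERPLOT(MATRIX)=SWEAR2 L2FREQ WEIGHT2 L2_COMP L2SPEAK.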

In the regression, put SWEAR2 in the “Dependent” box and the other variables in the “Independent” box. Leave the Method as “Enter”. Open
the same buttons and tick the same boxes as described for #2.

Looking at the output: the correlations between swearing frequency and the explanatory variables seem to be of acceptable effect size, but not so high as to pose a problem. The Coefficients box shows that only weight given to swearing in L2 (WEIGHT2), L2 speaking ability (L2SPEAK) and L2 frequency of use (L2FREQ) are statistical predictors of swearing frequency. Go back to the ANALYZE > REGRESSION > LINEAR menu and remove L2_COMP from the Independent box. Run the regression again, and the regression equation is:
Swearing frequency = .41 + .23(weight given to swearing in L2) + .21(L2 speaking ability) + .29(L2 frequency of use)
This equation can be obtained by looking at the constant and the unstandardized coefficients in the "Coefficients" box of the output.
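To see the equation at work, you can compute a predicted swearing-frequency score for each case with a COMPUTE command; PRED_SWEAR is just a made-up name for the new variable (checking "Unstandardized" predicted values in the regression's SAVE dialog would give the same numbers):

* Apply the regression equation by hand; PRED_SWEAR is a hypothetical variable name.
COMPUTE PRED_SWEAR = .41 + .23*WEIGHT2 + .21*L2SPEAK + .29*L2FREQ.
EXECUTE.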

This model explains R2 = .29 of the variance in swearing frequency (according to the "Model Summary" box), which is a goodly amount, but there is room for more explanation. The Residuals Statistics box does not indicate any problem with outliers (the maximum standardized residual is not over 3), and Cook's distance is less than 1. For very large samples like this (over 500) there is no problem with Mahalanobis distance unless values are over 25 (Field, 2005), so none of these diagnostics indicates a problem with influential points. The P–P plot looks pretty normal, but the residuals vs. predicted values plot does not look random. It has a clear downward slope to it, indicating a problem with heteroscedasticity in the data.

6. Larson-Hall (2008)
Use the LarsonHall2008.sav file. Open the regression dialog box and put GJTSCORE in the "Dependent" box. Enter the three explanatory variables one at a time into the "Independent" box after you have changed the Method to "Stepwise" (see the instructions in #3 if you can't remember how to do the hierarchical regression).
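In syntax this first ordering is three stepwise blocks; trying the other orderings discussed below is then just a matter of reordering the three METHOD lines.

* Hierarchical stepwise entry; reorder the METHOD lines to test the other orderings.
REGRESSION
  /STATISTICS COEFF OUTS CI(95) R ANOVA CHANGE
  /DEPENDENT GJTSCORE
  /METHOD=STEPWISE TOTALHRS
  /METHOD=STEPWISE RLWSCORE
  /METHOD=STEPWISE APTSCORE.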

With this order (TOTALHRS, RLWSCORE, APTSCORE) the R2 = .12 (fairly low). The R2 change is .034 for hours, .088 for RLW test, and .001
for aptitude.

Now open up the regression dialog box. You could redo the regression by pressing the RESET button, but then you would have to open up all
the sub-dialog boxes as well and tick everything again. It’s probably easiest to just trace back your steps and move each variable out from
the 3 blocks you created.

With this order (RLWSCORE, APTSCORE, TOTALHRS) the R2 = .12. The R2 change is .090 for RLW test, .002 for aptitude, and .031 for hours
of input.

With this order (APTSCORE, RLWSCORE, TOTALHRS), the R2 = .12. The R2 change is .034 for total hours and .088 for RLW test. Aptitude
doesn’t even get included when it is first!

The overall R2 doesn't really change depending on the order, but the R2 change for each variable does vary depending on where it is entered. Aptitude gets very little R2 change in any position, and its largest share comes when it is entered second, after RLW. RLW is the strongest variable, and it gets the most R2 change when it comes first.
