God Helps Those Who Help Themselves
The problem for this class is taken from the journal article:
Jeffrey A. Will and John K. Cochran, "God Helps Those Who Help Themselves?: The
Effects of Religious Affiliation, Religiosity, and Deservedness on Generosity Toward
the Poor." Sociology of Religion, 1995, 56:3, 327-338.
This analysis adds respondent variables to the analysis reported in the article. It
presumes that the dimensions of poverty article has been reviewed.
The data for this problem is available in the data set, DeservingPoor.Sav, which can be
downloaded from the download web page. The data has been recoded from the raw GSS
data to the format presented in the article.
Relationship to be Analyzed
"Our concern in this study, therefore, is to examine the influence of religious variables
on generosity toward the poor...we examine how specific characteristics of poor
families influence generosity and how the effects of deservedness vary across faith
groups." (Page 328)
The question would suggest a three-block hierarchical regression, with deservedness in
the first block, respondent control variables in the second block, and respondent
religious variables in the third block.
However, the author does standard multiple regression, and we will conform to his
analysis.
Since there are a large number of independent variables in the analysis, I ran frequency
distributions on the variables to identify those that had missing data. Variables that did
not have any missing data were excluded from the missing data analysis. Missing values
were present for the variables: age, income, the religious groups (Catholic, Jewish,
etc.), religious identity, and attendance.
The correlation matrix for the valid/missing variables had correlations of 1.0 between
all of the religious groups. This is because the religious groups had the same missing
cases, i.e. missing values for the original religion variables would produce identical
missing values for each of the dummy-coded groups.
Ignoring these 1.00 correlations, we note the next highest correlation is .176. The
correlations for missing values are very weak, so we do not have a missing data process
that will be problematic. Since our sample size is large, we will eliminate all of the
missing cases in our analyses.
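To see why the dummy-coded religious groups correlate at exactly 1.0 in the
valid/missing matrix, the following minimal Python sketch may help. The toy records
and variable names are hypothetical, not taken from DeservingPoor.Sav; the sketch only
illustrates that dummies derived from one source variable share its missing cases.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def missing_flag(values):
    """Valid/missing indicator: 1 when the value is missing, 0 when valid."""
    return [1 if v is None else 0 for v in values]

# Hypothetical toy records; None marks a missing value.
religion = ["Catholic", None, "Jewish", "Catholic", None, "None", "Jewish", "Catholic"]
age = [34, 51, None, 28, 45, 62, 39, None]

# Dummy coding: a missing religion is missing in every derived group.
catholic = [None if r is None else int(r == "Catholic") for r in religion]
jewish = [None if r is None else int(r == "Jewish") for r in religion]

# Correlate the valid/missing indicators, not the variables themselves.
r_dummies = pearson(missing_flag(catholic), missing_flag(jewish))
r_age = pearson(missing_flag(catholic), missing_flag(age))
print("Catholic vs. Jewish missingness:", r_dummies)
print("Catholic vs. age missingness:", r_age)
```

The two religion dummies have identical missing-value indicators, so their
correlation is 1.0 by construction and says nothing about the missing data process.
Only correlations between indicators from different source variables are informative.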
With over 9,000 cases, we exceed the dimensions of the power table, so R² values of
less than 2% will be found to be statistically significant.
For the analysis of respondent categories, we add eight new independent variables:
Religious Preference, Age, Sex, Race, Education, Income, Strength of Religious Identity,
and Church Attendance. When we complete the dummy coding, we will have added a total of
twelve independent variables to the analysis. The ratio of observations to independent
variables is 9,555 divided by 35, which equals 273 cases per independent variable.
Stage 2 Summary: Develop The Analysis Plan: Measurement Issues
Incorporating Nonmetric Data with Dummy Variables
All variables requiring dummy coding were recoded when the data set was constructed.
In particular, religion was coded into a set of dichotomous religious groups, e.g.
Catholic, Jewish, etc.
We do not have any evidence of curvilinear effects at this point in the analysis.
We do not have any evidence at this point in the analysis that we should add interaction
or moderator variables.
All of the variables in the analysis are metric or dummy-coded. Note that Family Savings
was dichotomously coded as either 0 or 1000. This scheme will make the slope more
interpretable.
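The effect of the 0 or 1000 coding on the slope can be sketched in a few lines of
Python. The scores below are made up for illustration, and slope() is a hypothetical
helper for a one-predictor regression, not the SPSS procedure used for this problem.

```python
def slope(x, y):
    """Least-squares slope for a regression of y on a single predictor x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

# Hypothetical data: 1 = family has savings, 0 = no savings.
savings_flag = [0, 1, 0, 1, 1, 0]
outcome = [300, 250, 320, 240, 260, 310]

# The same predictor recoded as 0 or 1000.
savings_dollars = [1000 * s for s in savings_flag]

b_flag = slope(savings_flag, outcome)
b_dollars = slope(savings_dollars, outcome)
print("slope with 0/1 coding:   ", b_flag)
print("slope with 0/1000 coding:", b_dollars)
```

The recoding only rescales the coefficient: the 0/1000 slope is the 0/1 slope divided
by 1000, so B reads as the change in the dependent variable per unit on the 0 to 1000
scale. The fit of the model is unchanged.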
None of these variables are normally distributed and none of the transformations induce
them to normality.
There is no evidence of a nonlinear relationship between these five added variables and
the dependent variable.
We do not pass the homogeneity test for the variables: R_MALE 'Gender Of
Respondent', R_WHITE 'Race Of Respondent', R_CONSPR 'Respondent Is Conservative
Protestant', and R_MODEPR 'Respondent Is Moderate Protestant'.
The only remedy for this problem would be a transformation of the dependent variable,
but given the normal appearance of the histogram of the dependent variable compared
to the histograms of the transformations of the dependent variable, I will forego any
transformations.
First, we look at the test of R Square which represents the relationship between the
dependent variable and the set of independent variables. This analysis tests the
hypothesis that there is no relationship between the dependent variable and the set of
independent variables, i.e. the null hypothesis is: R² = 0. If we cannot reject this null
hypothesis, then our analysis is concluded; there is no relationship between the
dependent variable and the independent variables that we can interpret.
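As a rough sketch of this overall test, the F statistic can be computed from R², the
number of cases, and the number of independent variables. The inputs below reuse
figures quoted in this problem (R² = .188, 9,555 cases, 35 independent variables). The
number of cases SPSS actually uses after missing cases are dropped may be smaller, so
treat the result as approximate rather than as the value in the SPSS output.

```python
def f_statistic(r_squared, n_cases, n_predictors):
    """F statistic for the null hypothesis that R-squared equals zero."""
    df_model = n_predictors
    df_residual = n_cases - n_predictors - 1
    f_value = (r_squared / df_model) / ((1.0 - r_squared) / df_residual)
    return f_value, df_model, df_residual

# Assumed inputs taken from figures quoted in this problem.
f_value, df_model, df_residual = f_statistic(0.188, 9555, 35)
print("F(%d, %d) = %.2f" % (df_model, df_residual, f_value))
```

With this many residual degrees of freedom, even a small R² yields a large F, which
is why very small effects reach statistical significance in a sample of this size.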
If we reject the null hypothesis and conclude that there is a relationship between the
dependent variable and the set of independent variables, then we examine the table of
coefficients to identify which independent variables have a statistically significant
individual relationship with the dependent variable. For each independent variable in
the analysis, a t-test is computed on the slope of the regression line (B) between the
independent variable and the dependent variable. The null hypothesis is that the slope
is zero, i.e. B = 0, implying that the independent variable has no relationship to
scores on the dependent variable.
For the highly significant variables in this analysis, our results concur for all variables,
except mother's education. Our finding for mother's education shows a statistically
significant relationship while the article does not.
Using output from the regression analysis to examine the conformity of the regression
analysis to the regression assumptions is often referred to as "Residual Analysis" because
it focuses on the component of the variance which our regression model cannot explain.
Using the regression equation, we can estimate the value of the dependent variable for
each case in our sample. This estimate will differ from the actual score for each case by
an amount referred to as the residual. Residuals are a measure of unexplained variance
or error that remains in the dependent variable that cannot be explained or predicted
by the regression equation.
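The definition of a residual can be illustrated with a minimal one-predictor
regression in Python. The data and the fit_line() helper are hypothetical and stand in
for the much larger SPSS model; the point is only that each residual is the actual
score minus the predicted score.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    b = sxy / sxx
    return mean_y - b * mean_x, b

# Hypothetical scores for five cases.
x = [1, 2, 3, 4, 5]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

intercept, b = fit_line(x, y)
predicted = [intercept + b * value for value in x]

# Residual = actual score minus the score estimated by the regression equation.
residuals = [actual - estimate for actual, estimate in zip(y, predicted)]

# R-squared is the share of variance that is not left in the residuals.
mean_y = sum(y) / len(y)
ss_residual = sum(e ** 2 for e in residuals)
ss_total = sum((actual - mean_y) ** 2 for actual in y)
r_squared = 1.0 - ss_residual / ss_total
print("residuals:", [round(e, 3) for e in residuals])
print("R-squared:", round(r_squared, 4))
```

The residuals sum to zero for a least-squares fit with an intercept, so the
assumptions are checked through their spread and shape rather than their average.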
In the plot of residuals, we see that the spread of the residuals is constant (same
height) across all of the values for the dependent variable, so we do not have a pattern of
heteroscedasticity.
If we examine the normal p-p plot produced by the regression, the residuals appear to
be normally distributed.
Like the prior problem, this evidence of serial correlation is an artifact of the way the
data set was structured, i.e. each respondent reviewed seven vignettes, which were
added to the data set in sequential order. There is a tendency for a respondent to be
generous or punitive across cases he or she reviewed.
We have 100 outliers on the dependent variable listed for this problem. This amounts to
about 1% of the cases in the sample.
A total of 530 cases had a Cook's distance of 0.00045 or larger, or about 6% of the
sample. Since only 1 percent of the cases were outliers on the dependent variable, it is
likely that the majority of these cases are outliers on the combination of independent
variables. If that is the case, it could be argued that we should not consider omitting
these cases because they are a consequence of the factorial design which randomly
assigned the values or conditions to the independent variables.
If we reject that argument and run the regression without these cases, the results are
even stronger. The R² value increases from 18.8% to 30.5%. Moreover, some of the
individual relationships with independent variables change.
Though it is obvious that the influential cases have a negative impact on the analysis,
we will retain these cases to maintain consistency with the author.
With the exception of mother's education, our interpretation of the coefficients agrees
substantially with the author's.
Impact of multicollinearity
Multicollinearity does not appear to be a problem in this analysis. SPSS did not alert us
to any tolerance problems. The correlations among independent variables were weak or
very weak, a consequence of the factorial design of the study.
The R Square value (.188) drops to an Adjusted R Square (.185), a minor decline that
indicates the model is not overfitted to the data.
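The size of this decline can be checked with the usual adjusted R² formula. The
sketch below assumes the 9,555 cases and 35 independent variables quoted earlier in
this problem; the number of cases actually used in the regression may differ slightly
once missing cases are dropped.

```python
def adjusted_r_squared(r_squared, n_cases, n_predictors):
    """Adjusted R-squared: penalizes R-squared for the number of predictors."""
    return 1.0 - (1.0 - r_squared) * (n_cases - 1) / (n_cases - n_predictors - 1)

# Assumed inputs taken from figures quoted in this problem.
adjusted = adjusted_r_squared(0.188, 9555, 35)
print("Adjusted R-squared:", round(adjusted, 3))
```

With thousands of cases per model, the penalty for 35 predictors is tiny, which is
why the adjusted value stays so close to the unadjusted one.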
Split-sample validation
We can use the same selection variable that we used in the analysis above. The results of
the validation analysis are shown in the table on the next slide.
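A selection variable of this kind can be sketched as a random 0/1 flag per case. The
make_split() helper, the seed, and the 50/50 split are assumptions for illustration;
the selection variable actually used for this problem is not shown here.

```python
import random

def make_split(n_cases, seed=12345):
    """Return a 0/1 selection flag per case, splitting the sample roughly in half."""
    rng = random.Random(seed)  # fixed seed so the split can be reproduced
    return [1 if rng.random() < 0.5 else 0 for _ in range(n_cases)]

# Assumed sample size taken from the figure quoted in this problem.
split = make_split(9555)
n_first = sum(split)
n_second = len(split) - n_first
print("cases in first half: ", n_first)
print("cases in second half:", n_second)
```

The regression is then run once on each half, and the lists of significant
coefficients are compared with those from the full sample.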
The validation analysis supports the generalizability of the model. Some of the variables
in the full model may have required the larger sample size of the full model to achieve
statistical significance. In particular, note that many of the religion variables may not be
as stable and generalizable as the full model would suggest.
Significant Coefficients (p < 0.01)

Number of children in family    Number of children in family    Number of children in family
Father Disabled                 Father Disabled                 Father Disabled
Mother Unemployed, Only         Mother Unemployed, Only         Mother Unemployed, Only
  Minimum Wage Jobs               Minimum Wage Jobs               Minimum Wage Jobs
Family Income                   Family Income                   Family Income
Mother's Education              Mother's Education              Mother's Education
Mother Unemployed, No           Mother Unemployed, No           Mother Unemployed, No
  Transportation                  Transportation                  Transportation
Race of respondent              Race of respondent              Race of respondent
Education of respondent         Education of respondent         Education of respondent
Income of respondent            Income of respondent            Income of respondent
R is No Religion                R is No Religion                R is No Religion
Father Unemployed, Looking      Father Unemployed, Looking
  for Work                        for Work
R is Moderate Protestant                                        R is Moderate Protestant
R is Catholic                                                   R is Catholic
R is Conservative Protestant