Unit 540 Differences Between Two Groups With Answers
19 October, 2023
whether health related issues were more prevalent among female than among male senior
citizens.
A small data set with 60 cases is available on Canvas. The data are made up for
instructional purposes only. They do NOT refer to real people.
1. Download the data set ‘unit_540_practice_data.csv’. Import the data into R using a
script (not the menu!).
library(tidyverse)  # provides read_csv() and the %>% pipe used below
data <- read_csv("unit_540_practice_data.csv")
2. You are interested in the overall differences in health between the two groups (a
dichotomous variable, stored twice in the data set). Which measure would you
calculate first, to check whether there ARE differences between the two groups in
the sample?
The means :-). The difference between the two group means describes the difference between
the groups. The variable is a scale. Make sure to check the distribution, because outliers may
strongly affect the mean!
3. Use the following commands to create a scatter plot and a jitter plot. Which one is
more helpful to understand the data?
Whether to use a scatter plot or a jitter plot depends very much on the number of cases and on
the possible scores these cases can have. If the number of cases is small and the differences
between observations are clearly visible (a ratio variable), a scatter plot is good enough. If,
however, the number of cases is large and the only scores are 1, 2, 3, 4 and 5, a jitter plot is
more helpful. The only reason why the dots are somewhat to the left and right of the two lines
is that it is a jitter plot. The spread is just there to make overlapping observations visible.
The horizontal shifts are not REAL differences!
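The commands for the two plots are not reproduced in this handout. As a rough sketch of the idea, the base R code below uses made-up scores (not the course data; the names `x_jittered`, `n_visible_scatter` and `n_visible_jitter` are only for illustration) to show that jittering shifts the points horizontally by a small random amount and leaves the health scores untouched.

```r
# Made-up illustration data (NOT the course data set): 60 cases, two groups,
# and a health score that can only take the values 1 to 5.
set.seed(540)
group_numeric <- rep(c(0, 1), each = 30)
health <- sample(1:5, size = 60, replace = TRUE)

# jitter() adds a small random amount to x only; health is not changed.
x_jittered <- jitter(group_numeric, amount = 0.1)

op <- par(mfrow = c(1, 2))  # two panels side by side
plot(group_numeric, health, main = "Scatter plot", xlab = "group", ylab = "health")
plot(x_jittered, health, main = "Jitter plot", xlab = "group", ylab = "health")
par(op)

# How many distinct dots can be seen in each plot?
n_visible_scatter <- nrow(unique(data.frame(group_numeric, health)))
n_visible_jitter  <- nrow(unique(data.frame(x_jittered, health)))
c(scatter = n_visible_scatter, jitter = n_visible_jitter)
```

With only five possible scores and two groups, the scatter plot can show at most ten distinct dots, however many cases there are; the jitter plot shows every case.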
4. Include the group means in this jitter plot by using the following code:
Use a numeric version of the variable group in which one of the two groups has the value 0,
and the other the value 1. This is a special dichotomous variable called a ‘dummy’: you
either belong to it (1) (you are male) or you do not (0). That variable is already stored in
the data set under the name group_numeric. There is no inherent reason to pick a 1 or 0 for
one or the other group, although it is sometimes simpler to make a deliberate choice.
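In the practice data the dummy is already available as group_numeric. For cases where only a text label exists, the following is a minimal sketch of how such a dummy could be built and how the group means follow from it. The data frame `toy` and its scores are hypothetical.

```r
# Hypothetical mini data set; labels and scores are made up.
toy <- data.frame(
  group  = c("female", "female", "female", "male", "male", "male"),
  health = c(6, 7, 8, 6, 8, 10)
)

# Dummy: 1 if the case belongs to the category "male", 0 otherwise.
toy$group_numeric <- ifelse(toy$group == "male", 1, 0)

# Mean health per group; these are the values a plot of group means displays.
group_means <- tapply(toy$health, toy$group_numeric, mean)
group_means
```

Swapping the 0 and the 1 would only change which group serves as the reference, not the size of the difference.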
5. Create a linear equation summarizing the differences between the two groups in the
sample.
health = b_0 + b_1 × group
It is very important to use a dummy (0, 1) variable in this context. When you code the groups
as 1 and 2, the ‘intercept’ is meaningless because there is no ‘zero’ point (no group has the
value 0). With the codes 1 and 2 (not 0 and 1), the intercept plus ONCE the ‘slope’ is the mean
for group 1, and the intercept plus twice the ‘slope’ is the mean for group 2. That is not easy
to understand.
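The effect of the coding can be checked on a tiny made-up example. The sketch below (hypothetical scores, illustrative names `dummy_01` and `code_12`) fits the same data twice, once with 0/1 codes and once with 1/2 codes.

```r
# Made-up scores: group A has mean 7, group B has mean 8.
health   <- c(6, 7, 8, 6, 8, 10)
dummy_01 <- c(0, 0, 0, 1, 1, 1)   # 0/1 coding (dummy)
code_12  <- dummy_01 + 1          # 1/2 coding of the same groups

fit_01 <- lm(health ~ dummy_01)
fit_12 <- lm(health ~ code_12)

coef(fit_01)  # intercept = mean of the group coded 0; slope = difference in means
coef(fit_12)  # same slope, but the intercept is no longer a group mean

# With 1/2 coding the group means have to be rebuilt from both coefficients:
b <- unname(coef(fit_12))
mean_group_1 <- b[1] + 1 * b[2]
mean_group_2 <- b[1] + 2 * b[2]
```

The slope is the same in both fits; only the meaning of the intercept changes.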
6. The mean health for group 0 (female) is 6.689, and for the other group it is 6.959.
Using this information, amend / specify the linear equation summarizing the
difference.
The constant is the value when x (the variable associated with the slope) = 0. The second value
(slope) is the difference between the two groups (so NOT the mean of the second group).
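As a small worked example of this step, the two coefficients can be derived directly from the two means given in the question. The variable names are only for illustration.

```r
mean_female <- 6.689   # group 0, from the question
mean_male   <- 6.959   # group 1, from the question

b0 <- mean_female               # constant: expected health when group = 0
b1 <- mean_male - mean_female   # slope: difference between the two groups

cat(sprintf("health = %.3f + %.3f x group\n", b0, b1))
```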
health = b_0 + b_1 × group
By adding male to the group variable in the graph, it becomes clearer what the variable is,
and we can add a sign too. This implies male is coded as 1 and female is coded as 0. The
choice here is completely ‘technical’ and has no substantive meaning.
10. Give the null and (two-sided) alternative hypotheses in words.
H_0: In the population, there is no difference in health between men and women. H_A: In the
population, there is a difference in health between men and women.
11. Give the null and (two-sided) alternative hypotheses in symbols. Since we can use
either means (t test framework) or slopes (b coefficients in a linear model
framework) as a description, give both sets of hypotheses.
H_0: β_1 = 0
H_A: β_1 ≠ 0
H_0: μ_women − μ_men = 0
H_A: μ_women − μ_men ≠ 0
12. Run the regression analysis using lm() with the dummy variable group_numeric as a
predictor and health as a response variable. Store the object as ‘model’. Have a look
at the output using summary(model).
model <- data %>%
  lm(health ~ group_numeric,
     data = .)
summary(model)  # prints the output shown below
##
## Call:
## lm(formula = health ~ group_numeric, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8591 -1.6145 0.2757 1.6583 3.3105
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.6895 0.3349 19.974 <2e-16 ***
## group_numeric 0.2696 0.5531 0.487 0.628
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.065 on 58 degrees of freedom
## Multiple R-squared: 0.00408, Adjusted R-squared: -0.01309
## F-statistic: 0.2376 on 1 and 58 DF, p-value: 0.6278
13. What is the linear equation as found in the data (use the output created by
summary())?
It is: health = 6.689 + 0.27 × group (male = 1)
14. What is the expected level of health for women?
6.689
15. What is the expected level of health for men?
6.959
Suppose you draw a sample, look at the difference between the two group means, put the
sample back in the population, draw a second sample, and so on. The differences you collect
this way form the sampling distribution of the difference in means. The standard error (the
estimated standard deviation of the sampling distribution) of the difference in health between
men and women is based on (1) the variances within the two groups, and on (2) the size (n) of
both groups. You will not be asked to remember the exact formula of this ‘pooled standard
error’. In this case the standard error is 0.5530834. Do you see it in the output?
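You do not need to memorise the formula, but seeing it once may help. The sketch below uses simulated scores (not the course data; the group sizes and the names `g0`, `g1`, `se_diff` and `se_lm` are assumptions for illustration) to compute the pooled standard error by hand and compare it with the standard error that lm() reports for the dummy.

```r
set.seed(1)
g0 <- rnorm(35, mean = 6.7, sd = 2)   # simulated scores, group 0
g1 <- rnorm(25, mean = 7.0, sd = 2)   # simulated scores, group 1
n0 <- length(g0)
n1 <- length(g1)

# Pooled variance: within-group variances weighted by their degrees of freedom
var_pooled <- ((n0 - 1) * var(g0) + (n1 - 1) * var(g1)) / (n0 + n1 - 2)

# Standard error of the difference between the two means
se_diff <- sqrt(var_pooled * (1 / n0 + 1 / n1))

# The same number appears as the Std. Error of the slope in a dummy regression
toy <- data.frame(health = c(g0, g1),
                  group_numeric = rep(c(0, 1), times = c(n0, n1)))
fit <- lm(health ~ group_numeric, data = toy)
se_lm <- coef(summary(fit))["group_numeric", "Std. Error"]

c(by_hand = se_diff, from_lm = se_lm)
```

Both ingredients named above are visible in the formula: larger within-group variances increase the standard error, larger groups decrease it.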
16. Using this s.e., calculate (by hand) the 95% confidence interval for the difference in
health between men and women.
The difference in health between men and women most likely is between -0.8140434 and
1.3540434. The mean difference is given by the slope. Take 1.96 times the s.e., which is the
margin of error (m.o.e.). The slope plus and the slope minus the m.o.e. give the confidence
interval for the slope.
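The hand calculation can be reproduced in a few lines. The sketch below uses the slope (0.2696), the standard error (0.5530834) and the residual degrees of freedom (58) given in this handout, and computes the interval twice: once with the normal value of about 1.96, and once with the critical value of the t distribution, which is what R procedures use. The variable names are only for illustration.

```r
slope <- 0.2696      # estimate for group_numeric in the summary() output
se    <- 0.5530834   # standard error of the difference
df    <- 58          # residual degrees of freedom

crit_z <- qnorm(0.975)     # about 1.96
crit_t <- qt(0.975, df)    # slightly larger than 1.96 for 58 df

ci_z <- slope + c(-1, 1) * crit_z * se   # interval based on 1.96
ci_t <- slope + c(-1, 1) * crit_t * se   # interval based on the t distribution

round(rbind(ci_z, ci_t), 3)
```

Because the t value is a little larger than 1.96, the t-based interval is a little wider. Small discrepancies between a hand calculation and software output may therefore come from this choice as well as from rounding.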
17. Use the t.test procedure to check your answer (rounding, and using 1.96 instead of the
exact t value, may cause small differences (in the second or third position after the
dot :-)) between what you calculate by hand and this output). ALSO: since this is about a
‘difference’ it may be that the lower bound and the upper bound have switched sides and
signs (saying the difference lies between -0.5 and +1.5 is the same as saying it lies
between -1.5 and +0.5 when the groups are subtracted in the opposite order, because
either group can be the ‘smaller’ one).
t_test <- t.test(health ~ group_numeric,
data,
var.equal = TRUE)
t_test
##
## Two Sample t-test
##
## data: health by group_numeric
## t = -0.48748, df = 58, p-value = 0.6278
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -1.3767339 0.8374994
## sample estimates:
## mean in group 0 mean in group 1
## 6.689474 6.959091
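The t value here has the opposite sign of the one in the regression output, while the p-value is identical. The sketch below illustrates on simulated data (not the course data; all names and group sizes are assumptions) that the equal-variance t-test and the dummy regression are the same test, apart from the direction in which the groups are subtracted.

```r
set.seed(2)
toy <- data.frame(
  health        = c(rnorm(35, mean = 6.7, sd = 2), rnorm(25, mean = 7.0, sd = 2)),
  group_numeric = rep(c(0, 1), times = c(35, 25))
)

fit <- lm(health ~ group_numeric, data = toy)
tt  <- t.test(health ~ group_numeric, data = toy, var.equal = TRUE)

t_lm <- coef(summary(fit))["group_numeric", "t value"]
p_lm <- coef(summary(fit))["group_numeric", "Pr(>|t|)"]

# lm() reports group 1 minus group 0; t.test() reports group 0 minus group 1.
c(t_lm = t_lm, t_ttest = unname(tt$statistic))
c(p_lm = p_lm, p_ttest = tt$p.value)

# The interval for the slope is the t-test interval with bounds swapped and negated.
ci_lm    <- unname(confint(fit)["group_numeric", ])
ci_ttest <- as.numeric(tt$conf.int)
rbind(ci_lm, flipped_ttest = -rev(ci_ttest))
```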
18. What do you conclude with regards to the null hypotheses formulated above (using
an alpha of 0.05)? Use relevant output in your answer.
Since the confidence interval (-1.377 to 0.837) includes zero, we cannot reject our null
hypothesis that beta_1 = 0. The chance of finding a difference at least this large, while the
data come from a population where there is no difference at all, is bigger than 5 percent. The
p value (0.628) gives this chance (and shows it is indeed bigger than 5%).
Unequal variances: using Welch t-test.
19. Under what circumstances would you use Welch t-test and NOT the above used two
sample t-test / linear model with dummies?
Only when (a) the distribution in both groups is close to normal (since we have a small
number of cases, this IS important to check, also for the two sample t-test), (b) the variances
of the two groups may be different and (c) the numbers of cases in the two groups are
different.
20. Check the difference between the two groups using Welch t-test. Although this CAN
be done in a linear model framework, we suggest you use the t.test procedure. Do
you see a (substantial) difference in the output, as compared to the ‘ordinary’ two
sample t-test?
t_test_w <- t.test(health ~ group_numeric,
data,
var.equal = FALSE)
t_test_w
##
## Welch Two Sample t-test
##
## data: health by group_numeric
## t = -0.46909, df = 39.062, p-value = 0.6416
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -1.432129 0.892895
## sample estimates:
## mean in group 0 mean in group 1
## 6.689474 6.959091
Because the numbers of cases in both groups are very similar, the differences are minor.
But, for example, the degrees of freedom are different, so the interpretation of the output may
be different in some cases.
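To see where the non-integer degrees of freedom come from, the sketch below computes the Welch standard error and the Welch-Satterthwaite degrees of freedom by hand on simulated data with clearly unequal variances, and compares them with t.test(). The data, group sizes and variable names are assumptions for illustration only.

```r
set.seed(3)
g0 <- rnorm(35, mean = 6.7, sd = 1.5)   # simulated group 0, small spread
g1 <- rnorm(25, mean = 7.0, sd = 3.0)   # simulated group 1, large spread

# Each group contributes its own variance divided by its own n (no pooling)
v0 <- var(g0) / length(g0)
v1 <- var(g1) / length(g1)

se_welch <- sqrt(v0 + v1)
t_welch  <- (mean(g0) - mean(g1)) / se_welch

# Welch-Satterthwaite approximation of the degrees of freedom
df_welch <- (v0 + v1)^2 /
  (v0^2 / (length(g0) - 1) + v1^2 / (length(g1) - 1))

tt_w <- t.test(g0, g1, var.equal = FALSE)
c(df_by_hand = df_welch, df_ttest = unname(tt_w$parameter))
c(t_by_hand = t_welch, t_ttest = unname(tt_w$statistic))
```

The approximated degrees of freedom are never larger than n0 + n1 - 2, the value used by the ordinary two sample t-test.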
<< END OF THE ASSIGNMENT>>