
Unit 540 Differences Between Two Groups With Answers


Assignment for Unit 540

Are two groups really different?


Henk van der Kolk

19 October, 2023

Goals of the practice questions and this assignment


In this assignment, you will learn to compare two groups using a dummy. A dummy is a variable
coded as 0 and 1 that is used, for example, to compare two groups. In most introductory
textbooks this is known as the two sample t-test, but here we present the test in the context
of the linear model. You were already introduced to the one sample t-test in the beginning
of this course (“Is the IQ of the people in this area (our population) different from the IQ in
the rest of the country?” or “Did the physical fitness of this group change over time?”; see
footnote 1).
In the context of using a dummy independent variable, we often check whether the two
groups not only have the same mean, but also have the same variance. Keep in mind that in
linear regression models, we assume that no OTHER variables play a
systematic role (their role is ‘random’). If, however, groups have not only different means
(our model) but also different variances, that may be problematic when estimating the
standard error. It may be an indication that one or more other variables play an important
role too, a role we should not ignore. That is why you will also learn about a (parametric)
variant of the two sample t-test: the Welch test, named after its British inventor,
Bernard Lewis Welch (1911-1989).
# Now only generated for the version WITHOUT answers.
write_csv(data, "unit_540_practice_data.csv")

Differences in health among female (coded 0) and male (coded 1) senior citizens: group means in a linear equation

In a study among senior citizens, researchers are interested in health related issues. They
focused on the difference between female and male respondents (see footnote 2). The
researchers want to know whether health related issues were more prevalent among female
than among male senior citizens.

Footnote 1: In some academic articles you may also encounter a procedure to study one sample at two different time
points (the paired sample t-test). This is the same thing as a one sample t-test on the differences between t1
and t2. Just compute the differences between t2 and t1, take THAT as the observation for an individual and
test whether the mean is different from zero. We will only mention that third procedure and not practice it
separately in this assignment as the procedure was covered in unit 502 and in unit 503. But remember: when
studying a change in individuals, you can use the one sample t-test.

Footnote 2: Other categories (groups) were available in the dataset (based on the question ‘with which gender do you
identify most?’), but were left out because of the low number of cases not picking one of the two categories.
This means that the outcomes of this (fictional) study do not generalize to this part of the population.
A small data set with 60 cases is available on Canvas. The data are made up for
instructional purposes only. They do NOT refer to real people.
1. Download the data set ‘unit_540_practice_data.csv’. Import the data into R using a
script (not the menu!).
library(tidyverse)  # provides read_csv() and the %>% pipe
data <- read_csv("unit_540_practice_data.csv")

# important: use your script to do this, do NOT use the menu.

2. You are interested in the overall differences in health between the two groups (a
dichotomous variable, stored twice in the data set). Which measure would you
calculate first, to check whether there ARE differences between the two groups in
the sample?
The means :-). They describe the difference between the two groups in the sample. The variable
is a scale variable, so a mean can be computed. Make sure to check the distribution, because
there may be outliers strongly affecting the mean!
3. Use the following commands to create a scatter plot and a jitter plot. Which one is
more helpful to understand the data?
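The plotting commands themselves are not reproduced in this copy of the assignment. A minimal sketch of what such commands could look like is given here, assuming ggplot2 and a simulated stand-in data set. The object `sim`, its group labels and its group sizes are illustrative assumptions, not the course data.

```r
library(ggplot2)

# Simulated stand-in for the course data (NOT the real file).
set.seed(540)
sim <- data.frame(
  group  = rep(c("female", "male"), times = c(35, 25)),
  health = round(c(rnorm(35, mean = 6.7, sd = 2),
                   rnorm(25, mean = 7.0, sd = 2)), 1)
)

# Scatter plot: every case is drawn exactly at its group position.
p_scatter <- ggplot(sim, aes(x = group, y = health)) +
  geom_point()

# Jitter plot: small random horizontal noise separates overlapping dots.
# height = 0 keeps the health scores themselves unchanged.
p_jitter <- ggplot(sim, aes(x = group, y = health)) +
  geom_jitter(width = 0.1, height = 0)

p_scatter
p_jitter
```

Setting `height = 0` is a deliberate choice in this sketch: only the horizontal position is perturbed, so the vertical position of each dot still shows the observed score.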

Whether to use a scatterplot or a jitter plot very much depends on the number of cases and on
the possible scores these cases have. If the number of cases is small and the differences
between observations are clearly visible (a ratio variable), a scatterplot is good enough. If,
however, the number of cases is big and scores are only 1, 2, 3, 4 and 5, a jitter plot is more
helpful. The only reason why the dots are somewhat to the left and right of the two lines is
that the plot is a jitter plot. The spread is just there to make the differences more visible.
They are not REAL differences!
4. Include the group means in this jitter plot by using the following code:

Use a numeric version of the variable group in which one of the two groups has the value 0,
and the other the value 1. This is a special dichotomous variable called a ‘dummy’: you
either belong to it (1) (you are male) or you do not (0). That variable is already stored in
the data set under the name group_numeric. There is no inherent reason to pick a 1 or 0 for
one or the other group, although it is sometimes simpler to make a deliberate choice.
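The code for question 4 is not reproduced in this copy. One possible way to add the group means to a jitter plot is sketched here, assuming ggplot2 and simulated data. Only the column names `group_numeric` and `health` come from the assignment; everything else is an illustrative assumption.

```r
library(ggplot2)

# Simulated stand-in for the course data (NOT the real file).
set.seed(540)
sim <- data.frame(
  group_numeric = rep(c(0, 1), times = c(35, 25)),
  health = c(rnorm(35, mean = 6.7, sd = 2),
             rnorm(25, mean = 7.0, sd = 2))
)

p_means <- ggplot(sim, aes(x = group_numeric, y = health)) +
  # Individual cases, spread slightly to the left and right.
  geom_jitter(width = 0.05, height = 0) +
  # One large red dot per group at the group mean.
  stat_summary(fun = mean, geom = "point", colour = "red", size = 4) +
  # Straight line through the two group means (the linear model).
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  scale_x_continuous(breaks = c(0, 1),
                     labels = c("0 (female)", "1 (male)"))

p_means
```

With a 0/1 predictor the fitted line passes exactly through the two group means, which is why the dummy coding links the plot to the linear equation.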
5. Create a linear equation summarizing the differences between the two groups in the
sample.
health = b_0 + b_1 × group

It is very important to use a dummy (0, 1) variable in this context. If you code the groups as 1
and 2, the ‘intercept’ is hard to interpret because no group has the value 0.
With codes 1 and 2 (not 0 and 1), the intercept plus ONCE the ‘slope’ is the mean for group 1,
and the intercept plus twice the ‘slope’ is the mean for group 2. That is not easy to understand.
6. The mean health for group 0 (female) is 6.689, and for the other group it is 6.959.
Using this information, amend / specify the linear equation summarizing the
difference.
The constant is the value when x (the variable associated with the slope) = 0. The second value
(slope) is the difference between the two groups (so NOT the mean of the second group).

health = b_0 + b_1 × group

Here b_0 = 6.689 (the mean of group 0) and b_1 = 6.959 - 6.689 = 0.27, so
health = 6.689 + 0.27 × group.
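The link between group means and coefficients can be checked in R. The sketch below uses simulated data (an assumption, not the course file) to show that the intercept of a dummy regression equals the mean of group 0 and the slope equals the difference between the two group means.

```r
# Simulated stand-in for the course data (NOT the real file).
set.seed(540)
sim <- data.frame(
  group_numeric = rep(c(0, 1), times = c(35, 25)),
  health = c(rnorm(35, mean = 6.7, sd = 2),
             rnorm(25, mean = 7.0, sd = 2))
)

# Group means computed directly.
group_means <- tapply(sim$health, sim$group_numeric, mean)
b0 <- group_means[["0"]]                       # mean of the reference group
b1 <- group_means[["1"]] - group_means[["0"]]  # difference between the means

# The same two numbers come out of the linear model.
fit <- lm(health ~ group_numeric, data = sim)
coef(fit)
c(b0 = b0, b1 = b1)

# The same arithmetic with the means reported in question 6.
6.959 - 6.689   # 0.27
```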

Dummy coding and the selection of a reference category.


7. In this assignment we have coded male to be 1. This implies that the group female is
then called the ‘reference category’. It is the category implicitly present in the
intercept (when the other variable is zero). Group 1 is now compared to the
reference category. What would the estimated linear equation look like if we
switched reference category? (so, had we coded female 1 and male 0).
If we change our dummy variable, the absolute value of the difference between men and
women would not change, but the sign would change. The intercept would be the expected
value for male rather than for female. Check this by creating a reversed dummy and redoing the
analysis.
8. Would our substantive conclusions change if we switched reference category?
No, only the equation would change. The data do not change, so logically also our conclusions
will not change. Only the way we describe the difference in the equation changes.
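A reversed dummy can be created by subtracting the original dummy from 1. The sketch below illustrates the effect on simulated data. The data and the name `group_reversed` are assumptions made for this illustration.

```r
# Simulated stand-in for the course data (NOT the real file).
set.seed(540)
sim <- data.frame(
  group_numeric = rep(c(0, 1), times = c(35, 25)),
  health = c(rnorm(35, mean = 6.7, sd = 2),
             rnorm(25, mean = 7.0, sd = 2))
)

# Reverse the coding: the old 1 becomes 0 and the old 0 becomes 1.
sim$group_reversed <- 1 - sim$group_numeric

fit     <- lm(health ~ group_numeric,  data = sim)
fit_rev <- lm(health ~ group_reversed, data = sim)

coef(fit)      # intercept = mean of old group 0, slope = difference
coef(fit_rev)  # intercept = mean of old group 1, slope = minus that difference

# The test of the slope is the same in both models.
p_value     <- summary(fit)$coefficients["group_numeric", "Pr(>|t|)"]
p_value_rev <- summary(fit_rev)$coefficients["group_reversed", "Pr(>|t|)"]
c(p_value, p_value_rev)
```

Only the description changes: the slope flips sign and the intercept moves to the other group mean, while the p-value stays the same.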

Testing a group difference using the linear model: description.


Linear equations can be used if the dependent variable is a scale variable (so we can
compute a mean) and the independent variable is a dummy (a dichotomy coded 0 and 1).
We will now test whether there is indeed a difference between the two groups in the
population. We will do this in steps. The research question is “Do female and male
individuals in this group have equal health?”. We are not setting an expectation about
the ‘direction’ of the difference (one group has better health than the other), so we will
use a two-sided test in this assignment.
9. Present the hypothesis using boxes and arrows.

By adding male to the group variable in the graph, it becomes clearer what the variable is,
and we can add a sign too. This implies male is coded as 1 and female is coded as 0. The
choice here is completely ‘technical’ and has no substantive meaning.
10. Give the null and (two-sided) alternative hypotheses in words.
H_0: In the population, there is no difference in health between men and women.
H_A: In the population, there is a difference in health between men and women.
11. Give the null and (two-sided) alternative hypotheses in symbols. Since we can use
either means (t test framework) or slopes (b coefficients in a linear model
framework) as a description, give both sets of hypotheses.
H_0: β_1 = 0
H_A: β_1 ≠ 0
H_0: μ_women − μ_men = 0
H_A: μ_women − μ_men ≠ 0

12. Run the regression analysis using lm() with the dummy variable group_numeric as a
predictor and health as a response variable. Store the object as ‘model’. Have a look
at the output using summary(model).
model <- data %>%
  lm(health ~ group_numeric,
     data = .)

summary(model)

##
## Call:
## lm(formula = health ~ group_numeric, data = .)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.8591 -1.6145  0.2757  1.6583  3.3105
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)     6.6895     0.3349  19.974   <2e-16 ***
## group_numeric   0.2696     0.5531   0.487    0.628
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.065 on 58 degrees of freedom
## Multiple R-squared: 0.00408, Adjusted R-squared: -0.01309
## F-statistic: 0.2376 on 1 and 58 DF, p-value: 0.6278

13. What is the linear equation as found in the data (use the output created by
summary())?
It is health = 6.689 + 0.27 × group (with male coded as 1).
14. What is the expected level of health for women?
6.689
15. What is the expected level of health for men?
6.959
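The two expected values follow from filling in 0 and 1 in the estimated equation. A small sketch of that arithmetic, using the coefficients reported in the summary() output (the names `b0`, `b1` and `expected` are only illustrative):

```r
# Coefficients as reported in the summary() output.
b0 <- 6.6895  # intercept: expected health when group_numeric = 0 (female)
b1 <- 0.2696  # slope: difference between male and female

# Fill in 0 (female) and 1 (male) in health = b0 + b1 * group.
expected <- b0 + b1 * c(female = 0, male = 1)
expected  # female 6.6895, male 6.9591
```

For a fitted model object, `predict(model, newdata = data.frame(group_numeric = c(0, 1)))` should give the same two values without rounding.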

Testing a group difference using the linear model: inference


This was descriptive. We will now ‘assess’ the difference. This information is also found in
the summary() you created. We can do inference by constructing a confidence interval (of
the slope) or by testing the null-hypothesis (of no difference, the slope is zero).

Suppose you draw a sample, look at the difference between the two group means, put the
sample back in the population, draw a second sample, and so on. The differences you collect
this way form the sampling distribution of the difference in means. The standard error (the
estimated standard deviation of the sampling distribution) of the difference in health between
men and women is based on (1) the variances within the two groups, and on (2) the size (n) of
both groups. You will not be asked to remember the exact formula of this ‘pooled standard
error’. In this case the standard error is 0.5530834. Do you see it in the output?
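For readers who want to see where such a number comes from, the pooled standard error can be computed by hand. The sketch below uses the standard pooled-variance formula on simulated data (an assumption, not the course file) and compares the result with the standard error of the slope that lm() reports.

```r
# Simulated stand-in for the course data (NOT the real file).
set.seed(540)
sim <- data.frame(
  group_numeric = rep(c(0, 1), times = c(35, 25)),
  health = c(rnorm(35, mean = 6.7, sd = 2),
             rnorm(25, mean = 7.0, sd = 2))
)

n <- tapply(sim$health, sim$group_numeric, length)  # group sizes
v <- tapply(sim$health, sim$group_numeric, var)     # group variances

# Pooled variance: the two variances weighted by their degrees of freedom.
pooled_var <- sum((n - 1) * v) / (sum(n) - 2)

# Standard error of the difference between the two means.
se_pooled <- sqrt(pooled_var * sum(1 / n))

# The same number appears as the standard error of the slope.
fit   <- lm(health ~ group_numeric, data = sim)
se_lm <- summary(fit)$coefficients["group_numeric", "Std. Error"]

c(by_hand = se_pooled, from_lm = se_lm)
```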
16. Using this s.e., calculate (by hand) the 95% confidence interval for the difference in
health between men and women.
The difference in health between men and women most likely is between -0.8140434 and
1.3540434. The estimated difference is given by the slope (0.27). Take 1.96 times the s.e. to
get the margin of error (m.o.e.). The slope minus and the slope plus the m.o.e. give the
confidence interval for the slope.
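The hand calculation can be reproduced in R. The sketch below uses the slope and standard error reported in the output, once with the multiplier 1.96 and once with the exact critical value of the t distribution with 58 degrees of freedom. The second version is what t.test and confint() use, which is why software gives a slightly wider interval than the calculation by hand.

```r
# Values taken from the output in this assignment.
b1 <- 6.959091 - 6.689474  # difference between the group means (the slope)
se <- 0.5530834            # pooled standard error of that difference

# By hand: the multiplier 1.96 from the normal distribution.
ci_z <- b1 + c(-1, 1) * 1.96 * se
ci_z  # approximately -0.814 and 1.354

# Exact: critical value from the t distribution with 58 degrees of freedom.
t_crit <- qt(0.975, df = 58)
ci_t <- b1 + c(-1, 1) * t_crit * se
ci_t  # approximately -0.837 and 1.377
```

The small difference from the answer above in the fourth decimal arises because the answer uses the rounded slope 0.27.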
17. Use the t.test procedure to check your answer. Rounding may cause small differences (in
the second or third decimal) between what you calculate by hand and this output. ALSO:
since this is about a ‘difference’, the lower bound and the upper bound may have
switched sides and signs. Saying that the difference (men minus women) lies between
-0.5 and +1.5 is the same as saying that the difference (women minus men) lies between
-1.5 and +0.5. It only depends on which group mean is subtracted from the other.
t_test <- t.test(health ~ group_numeric,
                 data = data,
                 var.equal = TRUE)

t_test

##
## Two Sample t-test
##
## data: health by group_numeric
## t = -0.48748, df = 58, p-value = 0.6278
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -1.3767339 0.8374994
## sample estimates:
## mean in group 0 mean in group 1
## 6.689474 6.959091

18. What do you conclude with regards to the null hypotheses formulated above (using
an alpha of 0.05)? Use relevant output in your answer.
Since the confidence interval includes zero, we cannot reject our null hypothesis that beta_1 =
0. The probability of finding a difference at least this large, while the data come from a
population where there is no difference at all, is bigger than 5 percent. The p value is this
probability (p = 0.6278, which is indeed bigger than 5%).
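The t value and p value in the output can be reconstructed from the slope and its standard error. A minimal sketch, using the numbers reported above (the variable names are only illustrative):

```r
# Values taken from the output in this assignment.
b1 <- 6.959091 - 6.689474  # estimated difference (the slope)
se <- 0.5530834            # pooled standard error
df <- 58                   # n - 2 = 60 - 2

# t value: how many standard errors the estimate lies from zero.
t_value <- b1 / se
t_value  # approximately 0.487

# Two-sided p value: probability of a t value at least this extreme
# (in either direction) if the null hypothesis is true.
p_value <- 2 * pt(-abs(t_value), df = df)
p_value  # approximately 0.628

p_value > 0.05  # TRUE: do not reject the null hypothesis
```

The t.test output shows the same t value with a minus sign, because it subtracts the mean of group 1 from the mean of group 0.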

Unequal variances: using the Welch t-test.
19. Under what circumstances would you use the Welch t-test and NOT the above used two
sample t-test / linear model with dummies?
Only when (a) the distribution in both groups is close to normal (since we have a small
number of cases, this IS important to check, also for the two sample t-test), (b) the variances
of both groups may be different and (c) the number of cases in both groups is different.
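Conditions (b) and (c) can be inspected descriptively before choosing a test. The sketch below does this on simulated data in which the two groups are deliberately given different sizes and different spreads. All numbers are assumptions for illustration only.

```r
# Simulated stand-in (NOT the real file): unequal sizes and unequal spread.
set.seed(540)
sim <- data.frame(
  group_numeric = rep(c(0, 1), times = c(40, 20)),
  health = c(rnorm(40, mean = 6.7, sd = 1.5),
             rnorm(20, mean = 7.0, sd = 3.0))
)

# (c) Number of cases per group.
group_n <- table(sim$group_numeric)
group_n

# (b) Variance per group, and the ratio of the larger to the smaller one.
group_var <- tapply(sim$health, sim$group_numeric, var)
group_var
var_ratio <- max(group_var) / min(group_var)
var_ratio
```

A ratio close to 1 suggests similar variances. How large a ratio should be before switching to the Welch test is a judgment call that this sketch does not settle.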
20. Check the difference between the two groups using the Welch t-test. Although this CAN
be done in a linear model framework, we suggest you use the t.test procedure. Do
you see a (substantial) difference in the output, as compared to the ‘ordinary’ two
sample t-test?
t_test_w <- t.test(health ~ group_numeric,
                   data = data,
                   var.equal = FALSE)

t_test_w

##
## Welch Two Sample t-test
##
## data: health by group_numeric
## t = -0.46909, df = 39.062, p-value = 0.6416
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -1.432129 0.892895
## sample estimates:
## mean in group 0 mean in group 1
## 6.689474 6.959091

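Because the spread in the two groups is fairly similar, the differences between the two tests
are minor, even though the group sizes differ (the standard errors in the regression output
imply 38 and 22 cases). The degrees of freedom do differ (39.062 instead of 58), so in other
data sets the two tests may lead to different conclusions.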
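The non-integer degrees of freedom in the Welch output come from the Welch-Satterthwaite approximation. The sketch below computes the Welch standard error, t value and degrees of freedom by hand on simulated data (an assumption, not the course file) and compares them with what t.test reports.

```r
# Simulated stand-in (NOT the real file): unequal sizes and unequal spread.
set.seed(540)
sim <- data.frame(
  group_numeric = rep(c(0, 1), times = c(40, 20)),
  health = c(rnorm(40, mean = 6.7, sd = 1.5),
             rnorm(20, mean = 7.0, sd = 3.0))
)

n <- tapply(sim$health, sim$group_numeric, length)  # group sizes
m <- tapply(sim$health, sim$group_numeric, mean)    # group means
v <- tapply(sim$health, sim$group_numeric, var)     # group variances

# Welch standard error: each group contributes its OWN variance (no pooling).
se_welch <- sqrt(sum(v / n))

# t value: mean of group 0 minus mean of group 1, as in the t.test output.
t_welch <- (m[["0"]] - m[["1"]]) / se_welch

# Welch-Satterthwaite approximation of the degrees of freedom.
df_welch <- sum(v / n)^2 / sum((v / n)^2 / (n - 1))

welch <- t.test(health ~ group_numeric, data = sim, var.equal = FALSE)
c(t_by_hand = t_welch, t_from_t_test = unname(welch$statistic))
c(df_by_hand = df_welch, df_from_t_test = unname(welch$parameter))
```

The approximated degrees of freedom are never larger than n - 2, so the Welch test uses a slightly larger critical value than the pooled test.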
<< END OF THE ASSIGNMENT>>
