Data analysis training workshop_Day 3 presentation

The document outlines the structure and content of a virtual Data Analysis Training Workshop held from June 24-26, 2024, presented by Dr. Reesha Kara. It covers hypothesis testing, Pearson correlation analysis, simple linear regression, and includes practical exercises using datasets. Participants are required to submit their home exercise answers by July 12, 2024, in both Excel and Word formats.

Institute of Social and Economic Research, Rhodes University

Data Analysis Training Workshop


24-26 June 2024
Day 3
Virtual workshop

Presented by Dr Reesha Kara


Structure of the day
Section A: Introduction to Hypothesis Testing
1. Pearson Correlation Analysis
2. Simple Linear Regression Analysis
Section B: Home Exercise
Section C: Stata Illustration
Section A: Introduction to hypothesis testing
1. Pearson Correlation Analysis
2. Simple Linear Regression Analysis
1. Pearson Correlation Analysis
1. Measures the extent to which two variables are linearly related
2. They change together at a constant rate
3. For example: as the years increase, so too does the price of bread
4. Common tool used when describing simple relationships between
variables in the dataset
5. Correlations do not imply cause and effect
6. Correlations are useful because if you can find out what
relationships variables have, you can make predictions about future
behaviour
1. Understanding Correlations
1. The correlation coefficient is denoted by r and ranges from -1 to +1
2. When r is closer to +1, it shows a strong positive relationship
between the variables
3. When r is closer to 0 it shows a weak relationship between the
variables
4. When r is closer to -1, it shows a strong negative relationship
between the variables
5. It is highly unlikely to have an r equal to exactly +1 or -1
6. Positive r values indicate a positive correlation (relationship)
7. Negative r values indicate a negative correlation
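The coefficient described above can be computed directly from paired observations. The following is a minimal sketch in Python using only the standard library; the function name `pearson_r` and the year and bread-price values are hypothetical illustrations, not workshop data.

```python
import math

def pearson_r(x, y):
    """Return the Pearson correlation coefficient r for paired samples x and y."""
    if len(x) != len(y):
        raise ValueError("x and y must contain related pairs (equal length)")
    n = len(x)
    if n < 2:
        raise ValueError("at least two pairs are required")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares of deviations from the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when one variable is constant")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical illustration values (not taken from the workshop datasets)
years = [1, 2, 3, 4, 5]
bread_price = [10.0, 11.5, 12.0, 14.0, 15.5]
print(round(pearson_r(years, bread_price), 3))
```

A value near +1 here would be read as a strong positive relationship, in line with points 2 and 6 above.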
1. Assumptions of a Pearson Correlation
1. Both variables should be continuous, on the interval or ratio scale; otherwise, use one of the other types of correlation tests
2. The variables must have related pairs. There must be a datapoint
for each observation across both of the variables
3. Absence of outliers

Other types of correlations:
Kendall rank correlation
Spearman correlation
Point-Biserial correlation
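As an illustration of one of these alternatives, the Spearman correlation can be obtained by ranking each variable and then computing the Pearson coefficient on the ranks. The sketch below assumes tied values receive the average of their ranks; the function names are hypothetical.

```python
import math

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson_r(x, y):
    """Pearson correlation coefficient for paired samples."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation applied to the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))
```

Because it works on ranks, this measure picks up any consistently increasing or decreasing relationship, not only a straight-line one.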
1. Visualising Correlation Coefficients

Positive correlation – as the years increase, the price of bread increases
Negative correlation – the older a person gets, the less hair they have
No correlation – height and exam scores, coffee consumption and intelligence, weight and income
1. Practical Examples
Using dataset 3 from the ‘NIDS 2008_practice data.xlxs’
1. Test whether there is a correlation between age and weight
2. Test whether there is a correlation between height and weight
3. Test whether there is a correlation between age and height
4. Test whether there is a correlation between age and income
5. Test whether there is a correlation between age, weight, height and
income
If r is closer to +1: strong positive correlation
If r is closer to 0: no (or only a weak) correlation
If r is closer to -1: strong negative correlation
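For the last practical example, which involves several variables at once, one option is to compute r for every pair of variables and lay the results out as a matrix. The sketch below does this in Python; the column values are hypothetical placeholders rather than the NIDS practice data, and `correlation_matrix` is an illustrative name.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient for paired samples."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_matrix(columns):
    """Return {name_a: {name_b: r}} for every pair of named columns."""
    names = list(columns)
    return {a: {b: pearson_r(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical values for illustration only (not the NIDS practice data)
data = {
    "age": [25, 32, 41, 47, 53, 60],
    "weight": [62.0, 70.5, 68.0, 81.0, 77.5, 74.0],
    "height": [1.60, 1.72, 1.65, 1.80, 1.75, 1.68],
    "income": [3500, 5200, 6100, 8000, 7400, 4300],
}
matrix = correlation_matrix(data)
for name in data:
    print(name, [round(matrix[name][other], 2) for other in data])
```

Each variable correlates perfectly with itself, so the diagonal is 1, and the matrix is symmetric because the correlation between age and weight equals that between weight and age.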
1. Interpretation of the Correlation Coefficient
1. If r is closer to +1: strong positive correlation
2. If r is closer to 0: no (or only a weak) correlation
3. If r is closer to -1: strong negative correlation
2. Simple Linear Regression Model
1. Various types of regression models:
a. Logistic Regression Model – dependent variable is binary
b. Non-Linear Regression Model – relates two variables in a non-linear relationship
c. Simple Linear Regression Model – linear relationship between two variables
d. Multiple Linear Regression Model – linear relationship between a dependent variable and two or more independent variables
2. Describes the relationship between two variables by fitting a line to
the observed data (scatter plot)
3. Can have a positive, negative or no relationship
2. Simple Linear Regression Model
1. Used to estimate the relationship between two quantitative
variables
2. Can be used to estimate:
a. The strength of the relationship between two variables
For example – the relationship between rainfall and soil erosion
b. The value of the dependent variable at a certain value of the independent variable
For example – the amount of soil erosion at a certain level of rainfall
3. Dependent variable (y) – response or outcome variable
4. Independent variable (x) – predictor or explanatory variable
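Fitting a line to the observed data means choosing an intercept and a slope so that y is approximated by intercept + slope × x. A minimal least-squares sketch in Python follows; the rainfall and erosion figures are hypothetical, as are the names `fit_line` and `predict`.

```python
def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x; returns (intercept, slope)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must be paired and contain at least two points")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    if sxx == 0:
        raise ValueError("the independent variable must not be constant")
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def predict(intercept, slope, x_value):
    """Value of the dependent variable at a given value of the independent variable."""
    return intercept + slope * x_value

# Hypothetical values: rainfall (independent, x) and soil erosion (dependent, y)
rainfall = [10, 20, 30, 40, 50]
erosion = [1.2, 2.1, 2.9, 4.2, 4.8]
intercept, slope = fit_line(rainfall, erosion)
print(predict(intercept, slope, 35))
```

The sign of the slope indicates whether the fitted relationship is positive or negative, and `predict` corresponds to use 2b above.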
Examples
1. Height and weight – as height increases, you'd expect weight to increase
2. Alcohol consumed and blood alcohol content – as alcohol consumption increases, you'd expect one's blood alcohol content to increase
3. Age and crying – as babies get older, you'd expect the amount of crying to decrease
2. Assumptions of a Regression Model
1. The relationship between the dependent and independent variables is linear
2. The residuals (observed minus fitted values) have equal variance across the values of the independent variable
3. There is no correlation between two or more of the independent variables (relevant when the model has more than one independent variable)
4. The residuals are approximately normally distributed
2. Practical Examples
Using dataset 3 from the ‘NIDS 2008_practice data.xlxs’

1. Test whether an increase in age results in an increase in income
a. Null – There is no linear relationship between age and income
b. Alternate – There is a linear relationship between age and income
2. Test whether age has an effect on weight
a. Null – There is no linear relationship between age and weight OR age does not have an effect on weight
b. Alternate – There is a linear relationship between age and weight OR age does have an effect on weight
3. Is height influenced by age?
a. Null – Height is not influenced by age OR there is no linear relationship between height and age
b. Alternate – Height is influenced by age OR there is a linear relationship between height and age
2. Interpretation of the Results
1. Multiple R is the correlation coefficient between the two variables of interest. It shows how strong the linear relationship is.
2. Standard Error is the average distance that the observed values fall from the regression line. The smaller the standard error, the more precise the linear regression is.
3. Significance F is the p-value for the regression model. It needs to be lower than 0.05 for the model to be statistically significant. The null and alternate hypotheses are needed to interpret the statistic.
If the Significance F is < 0.05, then reject the null hypothesis
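To show where these quantities come from, the sketch below computes Multiple R, the standard error of the regression and the F statistic for a simple linear regression in plain Python. It stops at the F statistic: turning F into the Significance F p-value requires the F distribution with 1 and n - 2 degrees of freedom, which is not implemented here. The function name and the sample values are hypothetical.

```python
import math

def regression_summary(x, y):
    """Return Multiple R, R squared, standard error and F for y regressed on x."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least three paired observations")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    fitted = [intercept + slope * a for a in x]
    # Residual (unexplained) and total variation in y
    ss_res = sum((b - f) ** 2 for b, f in zip(y, fitted))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    ss_reg = max(0.0, ss_tot - ss_res)  # variation explained by the line
    r_squared = ss_reg / ss_tot
    mean_sq_res = ss_res / (n - 2)  # residual mean square, n - 2 degrees of freedom
    return {
        "multiple_r": math.sqrt(r_squared),
        "r_squared": r_squared,
        "standard_error": math.sqrt(mean_sq_res),
        "f_statistic": ss_reg / mean_sq_res if mean_sq_res > 0 else math.inf,
    }

# Hypothetical values for illustration only
summary = regression_summary([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(summary)
```

A larger F statistic corresponds to a smaller Significance F, so the decision rule above is unchanged.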
Section B: Home Exercise
Use data from the ‘Home Exercise.xlxs’ data file to do the following exercise
Using data from the sheet titled One-Sample T-Test
State the null and alternative hypotheses for all of the tests, indicate whether each is a one-tailed or two-tailed test, and include a conclusion
1. Test whether the average temperature reading is equal to 45 degrees
2. Test whether the average temperature reading is higher than 20 degrees
3. Is the average number of hours spent asleep equal to 7 hours?
4. Test whether the average number of hours spent asleep is greater than 10 hours
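As a reminder of the calculation behind a one-sample t-test, the sketch below computes the t statistic and its degrees of freedom in Python. The function name and the readings used in the usage note are hypothetical and unrelated to the home exercise data; the p-value or critical value still has to be looked up from the t distribution.

```python
import math

def one_sample_t(sample, mu0):
    """Return (t statistic, degrees of freedom) for H0: population mean = mu0."""
    n = len(sample)
    if n < 2:
        raise ValueError("at least two observations are required")
    mean = sum(sample) / n
    # Sample variance with n - 1 in the denominator
    variance = sum((v - mean) ** 2 for v in sample) / (n - 1)
    if variance == 0:
        raise ValueError("t is undefined when all observations are identical")
    standard_error = math.sqrt(variance / n)
    return (mean - mu0) / standard_error, n - 1
```

For example, `one_sample_t([44, 45, 46, 45, 45], 44)` returns a positive t with 4 degrees of freedom. For an "equal to" question the test is two-tailed; for a "higher than" or "greater than" question it is one-tailed, and only a sufficiently large positive t supports the alternative.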
Using data from the sheet titled Two-Sample T-Test
State the null and alternative hypotheses for all of the tests, indicate whether each is a one-tailed or two-tailed test, and include a conclusion
1. Test whether the average income in City A is different from the average income in City B
2. Test whether the average income of City B is higher than the average income of City A
3. Test whether the average age of people living in City A is the same as the average age of people living in City B
4. Test whether the average age of people living in City A is higher than the average age of people living in City B
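The comparison of two group means can be sketched in the same way. The version below assumes the unequal-variances (Welch) form of the two-sample t-test, which may differ from the variant chosen in the workshop; `welch_t` is a hypothetical name, and the p-value again has to come from the t distribution.

```python
import math

def welch_t(group_a, group_b):
    """Return (t statistic, approximate degrees of freedom) for mean(a) - mean(b)."""
    na, nb = len(group_a), len(group_b)
    if na < 2 or nb < 2:
        raise ValueError("each group needs at least two observations")
    mean_a = sum(group_a) / na
    mean_b = sum(group_b) / nb
    var_a = sum((v - mean_a) ** 2 for v in group_a) / (na - 1)
    var_b = sum((v - mean_b) ** 2 for v in group_b) / (nb - 1)
    # Squared standard error contributed by each group
    se_a, se_b = var_a / na, var_b / nb
    if se_a + se_b == 0:
        raise ValueError("t is undefined when both groups have zero variance")
    t = (mean_a - mean_b) / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (na - 1) + se_b ** 2 / (nb - 1))
    return t, df
```

The sign of t follows the order of the arguments: a negative value means the first group's mean is lower than the second's, which matters for the one-tailed questions.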
Using the data from the sheet titled Chi-Square
1. Test whether there is an association between marital status and highest education level
a. Null hypothesis -
b. Alternate hypothesis -
c. Conclusion -
2. Test whether there is an association between geographic location
and highest education level
a. Null hypothesis -
b. Alternate hypothesis -
c. Conclusion -
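For a test of association between two categorical variables, the chi-square statistic compares the observed counts in a cross-tabulation with the counts expected if the variables were unrelated. A minimal Python sketch, with a hypothetical function name, is given below; the resulting statistic is compared against the chi-square distribution with the returned degrees of freedom.

```python
def chi_square_statistic(table):
    """Return (chi-square statistic, degrees of freedom) for a table of counts.

    `table` is a list of rows, each row a list of observed counts.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    if 0 in row_totals or 0 in col_totals:
        raise ValueError("every row and column needs at least one observation")
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were independent
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df
```

For instance, `chi_square_statistic([[10, 10], [10, 10]])` gives a statistic of 0, because the observed counts match the expected counts exactly.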
Using data from the sheet titled Correlation
1. Test whether there is a correlation between the amount of money
that is spent on take out during a month and the weight of a person
a. Null hypothesis –
b. Alternative hypothesis –
c. Conclusion –

2. Test whether there is a correlation between the amount of money spent on winter clothing and temperature readings
a. Null hypothesis –
b. Alternate hypothesis –
c. Conclusion -
Using data from the sheet titled Regression
1. Test whether yearly income has an influence on the household size
by checking whether it results in an increased number of rooms in
the house
a. Null hypothesis –
b. Alternative hypothesis –
c. Conclusion –

2. Test whether years of education has an effect on monthly income
a. Null hypothesis –
b. Alternative hypothesis –
c. Conclusion -
Submission due: 12 July 2024
Email your work to: [email protected]
• Home exercise answers need to be submitted as follows:
• Excel document
  • Ensure that the calculations are neat and understandable
  • Ensure that the sheets are neat and structured in an orderly manner
  • Contains all of the calculations and explanations required
• Word document
  • Must contain the details included on the PowerPoint slide
  • Must contain all of the interpretations in full
  • No calculations need to be included in this document
  • The document should be neat, tidy and understandable
Section C: Stata Illustration
End of day 3, thank you!
