Inferential Statistics

The document provides an overview of inferential statistics, focusing on hypothesis testing, correlation, and various statistical tests. It explains the concepts of null and alternative hypotheses, types of errors in hypothesis testing, and the importance of p-values. Additionally, it discusses correlation analysis, including the types of correlation and methods for studying correlations.

Inferential Statistics

3/5/2025
Disclaimer
• This presentation contains materials adapted or referenced from
various online sources. These materials are included solely for
educational and informational purposes. I do not claim ownership or
original authorship of the content sourced from external references,
and full credit is due to the respective authors and creators of these
works.

Branches of Biostatistics

Inferential Statistics
This branch of statistics focuses on drawing conclusions from data:
- generalizing from samples to populations
- testing hypotheses
- establishing relationships among variables in a sample, or determining differences between samples or groups
Inferential Statistics

Inferential statistics use sample data to evaluate the credibility of a hypothesis about a population.

A hypothesis is a statement about the expected findings in a study, e.g.:
- establishing a relationship between an outcome and an exposure, or a difference between samples or groups
Inferential Statistics

Null hypothesis:
"the two groups will not differ"

Alternative hypothesis:
"group A will do better than group B" (directional)
"group A and group B will not perform the same" (non-directional)
Possible Outcomes in Hypothesis Testing

Decision    Null is True        Null is False
Accept      Correct Decision    Type II Error
Reject      Type I Error        Correct Decision

Type I Error: rejecting a true null hypothesis

Type II Error: failing to reject (accepting) a false null hypothesis
Guide to deciding on the Hypotheses

p-value
• A measure of confidence in the observed difference
• Allows researchers to determine the probability that the observed difference is due to chance rather than real
• A p-value of LESS than 0.05 (p < 0.05) is the common criterion for statistical significance
• The probability that the results are due to chance alone is less than 5 times out of 100
• One can be 95% certain that the results are real and not due to chance alone
Types of inferential statistical tests

• Parametric tests
• Non-parametric tests
• Tests of association
Parametric tests

• Fulfil the assumption of normality
• Utilize the parameters of the normal distribution
• Executed on quantitative data
Normal distribution properties
• The mean, median and mode are equal (Mean = Median = Mode)
• The total area under the curve is equal to 1
• The normal distribution curve is symmetrical about the center
• Half of the values lie to the left of the center and half to the right
• The normal distribution curve is defined by the mean and standard deviation
• The normal distribution curve has only one peak
• The curve approaches the x-axis but never touches it, extending farther and farther from the mean
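These properties can be checked numerically with Python's standard-library NormalDist (a sketch added for illustration, not part of the original slides):

```python
from statistics import NormalDist

# Standard normal distribution (mean 0, standard deviation 1)
nd = NormalDist(mu=0, sigma=1)

# Symmetry: half of the total area lies on each side of the mean
print(nd.cdf(0))  # 0.5

# The total area under the curve approaches 1 far from the mean
print(nd.cdf(10))  # ~1.0

# Symmetry about the mean: P(X <= -1) equals P(X >= 1)
print(abs(nd.cdf(-1) - (1 - nd.cdf(1))) < 1e-12)  # True
```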
Correlation and
Chi-Square Tests
RESEARCH PROCESS
[Cycle diagram: Research Title and Question → Literature Review → Research Hypothesis → Approach and Designs → Variables and Measurement → Sample size → Sampling → Data collection → Data Analysis → Report / Presentations]
Recap
Variables
• A variable is a characteristic that may assume more than one set of values to which a numerical measure can be assigned
• A variable is a characteristic that changes.
• This differs from a constant, which does not change but remains the same.

Quantitative: Continuous, Discrete
Qualitative: Ordinal, Nominal
Measurement Scales
• Measurement scales are used to quantify numeric values of variables
or observations
There are 4 types of measurement scales:
• Nominal Scale
• Ordinal Scale
• Interval Scale
• Ratio Scale
The Hierarchy of Measurement Scales

• Ratio – absolute zero
• Interval – distance is meaningful
• Ordinal – attributes can be ordered
• Nominal – attributes are only named; weakest

Types of Data

Data
• Quantitative or Numeric: Continuous Data, Discrete Data
• Qualitative or Categorical: Nominal Data, Ordinal Data

Today
Inferential Statistics
• Inferential statistics allows researchers to make decisions or inferences by interpreting data patterns
• Inferential statistics can be classified as either Estimation or Hypothesis testing.
Hypothesis Testing
• Hypothesis testing is a statistical method used to determine whether there is enough evidence in sample data to draw conclusions about a population
• Common tests include t-tests, chi-square tests, ANOVA, and regression analysis. The selection depends on data type, distribution, and sample size
• Depending on data distributions, hypothesis testing can be parametric or nonparametric
Parametric and Nonparametric tests
• The key difference between parametric and nonparametric tests is that parametric tests rely on the statistical distribution of the data, whereas nonparametric tests do not depend on any distribution.
• In the literal meaning of the terms, a parametric statistical test makes assumptions about the parameters (defining properties) of the population distribution(s) from which one's data are drawn.
• In contrast, a non-parametric test makes no such assumptions.
• Nonparametric statistics are most commonly used for variables at the nominal or ordinal level of measurement, that is, for variables that do not have a normal distribution.
Sample vs. Population
Today
• Understanding and interpreting Correlation
• Understanding and interpreting Chi-Square Tests
Correlation
Correlation
• Correlation is a statistical tool that helps to measure, describe
and analyse the degree of relationship between two variables.
• Correlation analysis deals with the association between two or
more variables.
Thinking about lines, What can we measure?:
• Gradient – a measure of how the line slopes
• Intercept – where the line cuts the y axis
• Correlation – a measure of how well the line
fits the data
y
5
Equation for a line: y = 1.5 + 0.5x
y = a + bx 4
3
a is the point at which the line
crosses the y axis (when x=0). 2
1
b is a measure of the slope
(the amount of change in y that 00 5
x
1 2 3 4
occurs with a 1-unit change in x).
• A relationship exists when changes in one variable tend to be accompanied by consistent and predictable changes in the other variable.
• If two variables vary in such a way that movement in one is accompanied by movement in the other, the variables are said to have a cause-and-effect relationship.
• Causation always implies correlation but correlation does not
necessarily imply causation.
• The degree of relationship between the variables under consideration
is measured through the correlation analysis.
• The correlation analysis enable us to have an idea about the degree &
direction of the relationship between the two variables under study.
• A correlation typically evaluates three aspects of the relationship:
• the direction
• the form
• the degree
• The direction of the relationship is measured by the sign of the
correlation (+ or -).
• A positive correlation means that the two variables tend to change
in the same direction; as one increases, the other also tends to
increase.
• A negative correlation means that the two variables tend to change
in opposite directions; as one increases, the other tends to
decrease.

Direction of the Correlation
• Positive relationship – variables change in the same direction; indicated by sign (+).
• As X increases, Y increases
• As X decreases, Y decreases
• E.g., as height increases, so does weight.
• Negative relationship – variables change in opposite directions; indicated by sign (−).
• As X increases, Y decreases
• As X decreases, Y increases
• E.g., as TV time (home video) increases, grades decrease
Positive and negative relationships
Positive or direct relationships
• If the points cluster around a line
that runs from the lower left to upper
right of the graph area, then the
relationship between the two
variables is positive or direct.

Negative or inverse relationships


• If the points tend to cluster around
a line that runs from the upper left
to lower right of the graph, then the
relationship between the two
variables is negative or inverse.
• A correlation typically evaluates three aspects of the relationship:
• the direction
• the form
• the degree
• The most common form of relationship is a straight line or linear
relationship which is measured by the Pearson correlation.
• There can also be a non-linear relationship

• Linear correlation: Correlation is said to be linear when the
amount of change in one variable tends to bear a constant ratio to
the amount of change in the other.
• The graph of the variables having a linear relationship will form a
straight line.
Ex: X = 1, 2, 3, 4, 5, 6, 7, 8
Y = 5, 7, 9, 11, 13, 15, 17, 19
Y = 3 + 2X
• Non-Linear correlation: The correlation would be non-linear if
the amount of change in one variable does not bear a constant
ratio to the amount of change in the other variable.
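A quick sketch of the constant-ratio property for the linear example above (illustrative only):

```python
# In the linear example Y = 3 + 2X, each unit change in X produces a
# constant change of 2 in Y, so the ratio of changes never varies.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [3 + 2 * x for x in xs]
print(ys)  # [5, 7, 9, 11, 13, 15, 17, 19]

# Successive changes in Y per unit change in X are all equal:
diffs = [ys[i + 1] - ys[i] for i in range(len(ys) - 1)]
print(diffs)  # [2, 2, 2, 2, 2, 2, 2]
```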
A perfect positive correlation
[Scatter plot: weight (y-axis) against height (x-axis); the points for subjects A and B lie exactly on an upward-sloping straight line (a linear relationship).]
Linear Correlation
[Four example plots of Y against X: two showing linear relationships, with points along straight lines, and two showing curvilinear relationships, with points along curves.]
• A correlation typically evaluates three aspects of the relationship:
• the direction
• the form
• the degree
• The measure of correlation is called the correlation coefficient (r).
• The degree of relationship (the strength or consistency of the
relationship) is measured by the numerical value of the correlation.
• A value of 1.00 indicates a perfect relationship and a value of zero
indicates no relationship.
• The degree of relationship is expressed by the correlation coefficient, which ranges from −1 to +1 (−1 ≤ r ≤ +1)
Degree of correlation: examples
• High positive correlation: r = +.80 (weight vs. height)
• Moderate positive correlation: r = +0.4 (shoe size vs. weight)
• Perfect negative correlation: r = −1.0 (TV watching per week vs. exam score)
• Moderate negative correlation: r = −.80 (TV watching per week vs. exam score)
• Weak negative correlation: r = −0.2 (shoe size vs. weight)
• No correlation (horizontal line): r = 0.0 (IQ vs. height)
[Panel of example scatter plots for r = +.80, +.60, +.40 and +.20]
Types of Correlation
• Simple correlation: only two variables are studied.
• Multiple correlation: three or more variables are studied.
• Partial correlation: recognizes more than two variables, but considers only two at a time while keeping the others constant.
• Total correlation: based on all the relevant variables, which is normally not feasible.
Hypothesis testing for correlation
• Step 1
• Ho : There is no correlation between variable A and variable B
• Ha : There is a correlation between variable A and variable B (this can be
positive or negative)
• Step 2: Calculate ‘r’ (correlation coefficient)
• Step 3: Check the corresponding p-value
• For example, if r = -0.4 and the P value is 0.007, it can be interpreted
as follows: There is statistically significant evidence of an inverse
correlation between variable A and variable B
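These steps can be sketched in Python. The data here are made up for illustration, and the t statistic (with n − 2 degrees of freedom) is shown as the standard bridge from r to a p-value read from a t table; neither comes from the slides:

```python
from math import sqrt

# Step 1: Ho: no correlation between A and B; Ha: a correlation exists.
# Hypothetical data for illustration:
a = [1, 2, 3, 4, 5]
b = [2, 1, 4, 3, 7]

# Step 2: calculate r
n = len(a)
ma, mb = sum(a) / n, sum(b) / n
sp = sum((x - ma) * (y - mb) for x, y in zip(a, b))    # sum of products
ssa = sum((x - ma) ** 2 for x in a)                    # SS for A
ssb = sum((y - mb) ** 2 for y in b)                    # SS for B
r = sp / sqrt(ssa * ssb)
print(round(r, 3))  # 0.824

# Step 3: the p-value comes from the t distribution with n - 2 df
t = r * sqrt((n - 2) / (1 - r ** 2))
print(round(t, 2))  # 2.52
```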
Methods of Studying Correlation
• Scatter Diagram Method
• Karl Pearson’s Correlation
• Spearman’s Rank correlation coefficient
Scatter Diagram Method
• Scatter Diagram is a graph of observed plotted points where
each points represents the values of X & Y as a coordinate.
It portrays the relationship between these two variables
graphically.
Table 1. BP, weight and age of children

SBP    Weight (kg)    Age
90     38             12.5
88     45             12.1
100    35             13.6
70     50             10.0
80     60             11.2
90     45             12.0
100    30             13.4
102    51             13.8
120    53             16.8
110    40             15.6
89     43             12.3
80     39             12.0
90     41             12.7
100    40             13.7
87     50             12.0
93     56             12.8
82     52             11.6
102    62             14.0
93     39             13.0
86     44             11.9
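As a sketch, the Pearson correlation (introduced later in these slides) between SBP and age in Table 1 can be computed directly; the function name pearson_r is illustrative:

```python
from math import sqrt

# SBP and age for the 20 children in Table 1
sbp = [90, 88, 100, 70, 80, 90, 100, 102, 120, 110,
       89, 80, 90, 100, 87, 93, 82, 102, 93, 86]
age = [12.5, 12.1, 13.6, 10.0, 11.2, 12.0, 13.4, 13.8, 16.8, 15.6,
       12.3, 12.0, 12.7, 13.7, 12.0, 12.8, 11.6, 14.0, 13.0, 11.9]

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sp = sum((a - mx) * (b - my) for a, b in zip(x, y))  # sum of products
    ssx = sum((a - mx) ** 2 for a in x)
    ssy = sum((b - my) ** 2 for b in y)
    return sp / sqrt(ssx * ssy)

r = pearson_r(age, sbp)
print(round(r, 2))  # strong positive correlation (close to 1)
```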
Scatter plot of the relationship between SBP and age of children
[Scatter plot: SBP (y-axis, approx. 70–120) against age (x-axis, approx. 10–18 years); the points cluster closely around an upward-sloping straight line.]
Scatter plot of the relationship between weight and age of children
[Scatter plot: weight in kg (y-axis, approx. 30–60) against age (x-axis); a mistyped age value of 111.6 instead of 11.6 stretches the x-axis to 0–100 and compresses the remaining points.]
Scatter plot of the relationship between weight and age of children
[Scatter plot: the same weight and age data with the age value corrected to 11.6; the points are widely scattered with no clear linear pattern.]
Scatter plot examples:
• Positive relationship: weight vs. height
• Moderate positive correlation: shoe size vs. weight
• Perfect negative correlation: TV watching per week vs. exam score
• Moderate negative correlation: TV watching per week vs. exam score
• Weak negative correlation: shoe size vs. weight
• No correlation: IQ vs. height
Advantages of Scatter Plot
• Simple and non-mathematical method
• First step in investigating the relationship between two variables
Disadvantage of scatter diagram
• Cannot determine the exact degree of correlation
Methods of Studying Correlation
• Scatter Diagram Method
• Karl Pearson’s Correlation
• Spearman’s Rank correlation coefficient
• To compute a correlation you need two scores, X and Y, for each
individual in the sample.
• The Pearson correlation requires that the scores be numerical values
from an interval or ratio scale of measurement.
• Other correlational methods exist for other scales of measurement
(e.g. Spearman’s).

The Pearson Correlation
• The Pearson correlation measures the direction and degree of linear (straight-line) relationship between two variables.
• Pearson's 'r' is the most common correlation coefficient
• The degree of correlation is expressed by the value of the coefficient
• r always lies between −1 and +1 (−1 ≤ r ≤ +1)
• To compute the Pearson correlation r (Method 1), you first measure the variability of the X and Y scores separately by computing the sum of squares for each variable (SSX and SSY).
• Then, the co-variability (tendency for X and Y to vary together) is measured by the sum of products (SP).
• The Pearson correlation r is found by computing the ratio r = SP / √(SSX × SSY).
Sample Table (x = Age, y = Glucose level; x̄ = 41.2, ȳ = 81; Σx = 247, Σy = 486):

x     y     x − x̄    y − ȳ    (x − x̄)²    (y − ȳ)²    (x − x̄)(y − ȳ)
43    99      1.8      18        3.24        324           32.4
21    65    −20.2     −16      408.04        256          323.2
25    79    −16.2      −2      262.44          4           32.4
42    75      0.8      −6        0.64         36           −4.8
57    87     15.8       6      249.64         36           94.8
59    81     17.8       0      316.84          0            0
Σ                              1240.84        656          478
                               (SSX)         (SSY)        (SP)

r = SP / √(SSX × SSY) = 478 / √(1240.84 × 656) ≈ 0.53

Where x̄ = mean of x and ȳ = mean of y
Another method for calculating "r"
Procedure for computing the correlation coefficient (Method 2)
• Calculate the sum (Σ) of the two series 'x' and 'y' (Σx and Σy)
• Square each value of 'x' and 'y', then obtain the sums of the squared values, i.e. Σx² and Σy²
• Multiply each value of x with the corresponding value of y to obtain the product 'xy'
• Then obtain the sum of the products, i.e. Σxy
• Substitute the values into the formula.
Sample question: Find the value of the
correlation coefficient from the following table:
Subject Age x Glucose Level y
1 43 99
2 21 65
3 25 79
4 42 75
5 57 87
6 59 81
Step 1:Make a chart. Use the given data, and add three more
columns: xy, x2, and y2.

Glucose
Subject Age x xy x2 y2
Level y
1 43 99
2 21 65
3 25 79
4 42 75
5 57 87
6 59 81
Step 2: Multiply x and y together to fill the xy column. For
example, row 1 would be 43 × 99 = 4,257.

Glucose
Subject Age x xy x2 y2
Level y
1 43 99 4257
2 21 65 1365
3 25 79 1975
4 42 75 3150
5 57 87 4959
6 59 81 4779
Step 3: Take the square of the numbers in the x column, and put
the result in the x2 column.

Glucose
Subject Age x xy x2 y2
Level y
1 43 99 4257 1849
2 21 65 1365 441
3 25 79 1975 625
4 42 75 3150 1764
5 57 87 4959 3249
6 59 81 4779 3481
Step 4: Take the square of the numbers in the y column, and put
the result in the y2 column.

Glucose
Subject Age x xy x2 y2
Level y
1 43 99 4257 1849 9801
2 21 65 1365 441 4225
3 25 79 1975 625 6241
4 42 75 3150 1764 5625
5 57 87 4959 3249 7569
6 59 81 4779 3481 6561
Step 5: Add up all of the numbers in the columns and put the
result at the bottom of the column. The Greek letter sigma (Σ) is
a short way of saying “sum of.”
Glucose
Subject Age x xy x2 y2
Level y
1 43 99 4257 1849 9801
2 21 65 1365 441 4225
3 25 79 1975 625 6241
4 42 75 3150 1764 5625
5 57 87 4959 3249 7569
6 59 81 4779 3481 6561
Σ 247 486 20485 11409 40022
Step 6: Use the following correlation coefficient formula.
From our table:

Σx = 247
Σy = 486
Σxy = 20,485
Σx² = 11,409
Σy² = 40,022
n is the sample size, in our case = 6

r = [nΣxy − (Σx)(Σy)] / √{[nΣx² − (Σx)²] × [nΣy² − (Σy)²]}

= [6(20,485) − (247 × 486)] / √{[6(11,409) − 247²] × [6(40,022) − 486²]}

= 0.5298
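Method 2 can be sketched directly in Python (the variable names are illustrative):

```python
from math import sqrt

# Age (x) and glucose level (y) for the 6 subjects in the example
x = [43, 21, 25, 42, 57, 59]
y = [99, 65, 79, 75, 87, 81]

n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(a * b for a, b in zip(x, y))
sum_x2 = sum(a * a for a in x)
sum_y2 = sum(b * b for b in y)

# r = [nΣxy − ΣxΣy] / √([nΣx² − (Σx)²][nΣy² − (Σy)²])
r = (n * sum_xy - sum_x * sum_y) / sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)
print(round(r, 4))  # 0.5298
```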
Examples of SPSS output for Correlation

For example: Var 1 = Age, Var 2 = Income, Var 3 = Wt, Var 4 = Ht, Var 5 = Exam score
Assumptions of Pearson’s Correlation Coefficient

• There is a linear relationship between the two variables, i.e. when the two variables are plotted on a scatter diagram, the points form a straight line.
• A cause-and-effect relationship exists between the different forces operating on the items of the two variable series.
Advantages of Pearson’s Coefficient

• It summarizes in one value both the degree and the direction of correlation.

Limitations of Pearson’s Coefficient

• Always assumes a linear relationship
• Interpreting the value of r is difficult.
• The value of the correlation coefficient is affected by extreme values.
• Time-consuming method (if done manually)
Coefficient of Determination
• The convenient way of interpreting the value of correlation
coefficient is to use the square of coefficient of correlation
which is called Coefficient of Determination.
• The Coefficient of Determination = r2.
• Suppose: r = 0.9, r2 = 0.81 this would mean that 81% of the
variation in the dependent variable has been explained by the
independent variable.
• For example, correlation between height and weight or
• Correlation between exercise time and blood pressure or
• Correlation between time spent reading and exam score
Coefficient of Determination
• The maximum value of r2 is 1 because it is possible to explain
all of the variation in y but it is not possible to explain more
than all of it.
• Coefficient of Determination =
Explained variation / Total variation
Coefficient of Determination: An example
• Suppose: r = 0.60
r = 0.30
It does not mean that the first correlation is twice as strong as
the second. The ‘r’ can be understood by computing the value
of r2 .
When r = 0.60 r2 = 0.36 -----(1)
r = 0.30 r2 = 0.09 -----(2)
This implies that in the first case 36% of the total variation is
explained whereas in second case 9% of the total variation is
explained .
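A one-line check of the example arithmetic (illustrative):

```python
# Comparing two correlations by their coefficients of determination (r²)
r1, r2 = 0.60, 0.30
print(round(r1 ** 2, 2))  # 0.36 -> 36% of the total variation explained
print(round(r2 ** 2, 2))  # 0.09 -> 9% explained: four times less, not half
```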
How to Report Correlation
• If you report your statistics in text: r(degrees of freedom) = the r statistic, p = p value. The r statistic should be reported to 2 decimal places. The p values should be reported to 3 decimal places.

• There was a moderately positive correlation between height and weight, r(352) = .51, p < .001

Some things to note:
1. There are two ways to report p values. The first way is to cite the alpha value. The second, preferred way is to report the exact p-value. Note that if your p-value is less than .001, it is conventional to state p < .001 rather than give the exact value.
2. The r statistic should be stated to 2 decimal places.
3. Remember to drop the leading 0 from both r and the p-value (i.e., not 0.34, but rather .34).
4. You don't need to provide the formula for r.
5. Degrees of freedom for r is N − 2 (the number of data points minus 2).
Methods of Studying Correlation
• Scatter Diagram Method
• Karl Pearson’s Correlation
• Spearman’s Rank correlation coefficient
Spearman’s Rank Coefficient of Correlation

• When the variables under study are not capable of quantitative measurement but can be arranged in serial order, Pearson's correlation coefficient cannot be used, but the Spearman rank correlation can.
• R = 1 − (6 ΣD²) / [N (N² − 1)]
• R = rank correlation coefficient
• D = difference of ranks between paired items in the two series
• N = total number of pairs of observations of X and Y
• It is usually denoted by the symbol ρ (rho).

The Spearman Correlation
• The Spearman correlation is used in two general situations:
(1) It measures the relationship between two ordinal variables; that
is, X and Y both consist of ranks.
(2) It measures the consistency of direction of the relationship
between two variables. In this case, the two variables must be
converted to ranks before the Spearman correlation is computed.

The Spearman Correlation (cont.)
The calculation of the Spearman correlation requires:

1. Two variables are observed for each individual.


2. The observations for each variable are rank ordered. Note that the X values and
the Y values are ranked separately.
3. After the variables have been ranked, the Spearman correlation is computed by
either:
a. Using the Pearson formula with the ranked data.
b. Using the special Spearman formula (assuming there are few, if any, tied
ranks).
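A minimal sketch of the special Spearman formula with hypothetical ranks for five candidates (not the MICROSOFT data, which is not reproduced in this copy):

```python
# Hypothetical ranks of 5 candidates in written (X) and oral (Y) skills
rank_x = [1, 2, 3, 4, 5]
rank_y = [2, 1, 4, 3, 5]

n = len(rank_x)
d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))  # ΣD²

# Spearman formula (assumes no tied ranks): R = 1 − 6ΣD² / [N(N² − 1)]
R = 1 - (6 * d2) / (n * (n ** 2 - 1))
print(R)  # 0.8
```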

Class example
• MICROSOFT shortlisted 10 candidates for a final selection interview. They were examined in written and oral communication skills and ranked as follows:

• Find out whether there is any correlation between the written and oral communication skills of the short-listed candidates.
• Recall the formula
• Recall the formula
Interpretation of Rank Correlation Coefficient (R)

• R = 0.82 from the class work
• The value of the rank correlation coefficient, R, ranges from −1 to +1
• If R = +1, there is complete agreement in the order of the ranks and the ranks are in the same direction
• If R = −1, there is complete agreement in the order of the ranks and the ranks are in the opposite direction
• If R = 0, there is no correlation
Another class work (Assignment).
• Problem (Conversion of scores into ranks)
• Calculate the rank correlation to determine the relationship
between performance in a Mathematics test and percentage time
spent in study given by the following data on 8 College students.

Performance in Test 90.0 92.4 98.5 98.3 95.4 91.3 98.0 92.0
% Time spent in study 76.0 74.2 75.0 77.4 78.3 78.8 73.2 76.5

• Remember to convert scores into ranks first!


Limitations of Spearman’s Correlation

• Cannot be used for finding correlation in a grouped frequency distribution.
• This method should not be applied where N exceeds 30.
Chi-square
Objectives
• Learn to identify situations for which Chi square is the appropriate
test
• Calculate the Chi Square and determine the significance of the test
statistic
• Interpret the Chi Square statistic
• Relate results of your analysis to your null hypothesis: very important
Research Questions
Types of questions which can be answered using the Chi Square:
• Are the observed results different from the expected ones?
• Does a relationship exist between two variables?
• Does group A have a different outcome from group B?
• Is there an association between group A and group B?
Different Scales, Different Measures of Association

Scale of Both Variables                       Measure of Association
Nominal scale                                 Pearson chi-square: χ²
Ordinal scale                                 Spearman’s rho
Interval or ratio scale                       Pearson r
Continuous variable and a binary variable     Point-biserial correlation
Why used?
• Chi-square analysis is primarily used to deal with categorical
(frequency) data
• The data that we analyze consists of frequencies; that is, the number
of individuals falling into categories. In other words, the variables are
measured on a nominal scale.
• The test statistic for frequency data is Pearson Chi-Square. The
magnitude of Pearson Chi-Square reflects the amount of discrepancy
between observed frequencies and expected frequencies.
The Chi Square Statistic
The chi-square (χ²) statistic:

χ² = Σ [(observed − expected)² / expected]

χ² = Σ (O − E)² / E

• If observed and expected are very similar, χ² is small.
• If observed and expected are very different, χ² is large. Perhaps the expectation is wrong or something else is happening.
• If you also wanted to know the magnitude of the difference, you could calculate an odds ratio.
Steps in Test of Hypothesis
1. Determine the appropriate test
2. Establish the level of significance: α
3. Formulate the statistical hypothesis
4. Calculate the test statistic
5. Determine the degree of freedom
6. Compare computed test statistic against a tabled/critical value
STEPS IN HYPOTHESIS TESTING PROCEDURE
1. Enumerate data
2. Review assumptions
3. State hypotheses
4. Select test statistic
5. Determine distribution of test statistic
6. Calculate test statistic
7. State decision rule
8. Make statistical decision
9. Conclude Ho may be true or HA is true
10.Determine p value
1. Determine Appropriate Test
• Chi Square is used when both variables are measured on a nominal
scale.
• It can be applied to interval or ratio data that have been categorized
into a small number of groups.
• It assumes that the observations are randomly sampled from the
population.
• All observations are independent (an individual can appear only
once in a table and there are no overlapping categories).
• It does not make any assumptions about the shape of the
distribution nor about the homogeneity of variances.

2. Establish Level of Significance
• α is a predetermined value
• The convention
• α = .05
• α = .01
• α = .001

3. Determine The Hypothesis: Whether There is an Association or Not

• Ho : The two variables are independent
• Ha : The two variables are associated
• The null hypothesis states
• Ho: that there is NO statistically significant difference between the observed values and the expected values.
• In other words, any differences that do exist between observed and expected are totally random and occurred by chance alone.
• The alternative hypothesis states
• Ha: that there is a statistically significant difference between the observed values and the expected values.
4. Calculating Test Statistics
• Contrasts observed frequencies in each cell of a contingency table with expected frequencies.
• The expected frequencies represent the number of cases that would be found in each cell if the null hypothesis were true (i.e. the nominal variables are unrelated).
• The expected frequency for a cell is the product of its row and column totals divided by the number of cases:

Fe = (Fr × Fc) / N
4. Calculating Test Statistics

χ² = Σ (O − E)² / E

or, in terms of observed (Fo) and expected (Fe) frequencies:

χ² = Σ [(Fo − Fe)² / Fe]

5. Determine Degrees of Freedom
df = (R − 1)(C − 1)
6. Compare computed test statistic against a tabled/critical value

• The computed value of the Pearson chi-square statistic is compared with the critical value to determine if the computed value is improbable
• The critical tabled values are based on sampling distributions of the Pearson chi-square statistic
• If the calculated χ² is greater than the χ² table value, reject Ho
Example
• Suppose a researcher is interested in examining the relationship
between heart dx and COVID-19 infection
• A questionnaire was developed and administered to a random sample
of 90 elderly patients.
• The researcher also collects information about heart dx (present or
absent) and tests for COVID-19 infection status (negative,
indeterminate, positive) for the 90 respondents.

Bivariate Frequency Table or Contingency Table

                   Negative   Not sure   Positive   f row
Heart Dx Present      10         10         30        50
Heart Dx Absent       15         15         10        40
f column              25         25         40      n = 90
The row totals (50, 40) are the row frequencies (f row); the column totals (25, 25, 40) are the column frequencies (f column).
1. Determine Appropriate Test

1. Heart Disease (2 levels: present or absent), a Nominal variable
2. COVID-19 infection (3 levels: negative, indeterminate, positive), a Nominal variable
2. Establish Level of Significance

Alpha of .05
3. Determine The Hypothesis

• Ho: There is NO relationship between heart disease and COVID-19
infection among elderly patients
• Ha: There is a relationship between heart disease and COVID-19
infection among elderly patients
4. Calculating Test Statistics (Recall study findings)

                   Negative   Not sure   Positive   f row
Heart Dx Present      10         10         30        50
Heart Dx Absent       15         15         10        40
f column              25         25         40      n = 90
4. Calculating Test Statistics

Remember: Fe = Fr Fc / N
(e.g., Present/Negative: 50 × 25 / 90 = 13.9; Absent/Negative: 40 × 25 / 90 = 11.1)

                   Negative     Not sure     Positive     f row
Heart Dx Present   fo = 10      fo = 10      fo = 30       50
                   fe = 13.9    fe = 13.9    fe = 22.2
Heart Dx Absent    fo = 15      fo = 15      fo = 10       40
                   fe = 11.1    fe = 11.1    fe = 17.8
f column              25           25           40        n = 90

4. Calculating Test Statistics

χ² = Σ (O - E)² / E

χ² = (10 - 13.89)²/13.89 + (10 - 13.89)²/13.89 + (30 - 22.2)²/22.2
   + (15 - 11.11)²/11.11 + (15 - 11.11)²/11.11 + (10 - 17.8)²/17.8

   = 11.03
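The whole calculation can be cross-checked with a short Python sketch (standard library only; note the closed-form p-value exp(-χ²/2) is valid only because df = 2 in this example):

```python
import math

# Observed counts: rows = heart disease (present, absent),
# columns = COVID-19 status (negative, not sure, positive)
observed = [[10, 10, 30],
            [15, 15, 10]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected frequency per cell under independence: Fe = Fr * Fc / N
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Pearson chi-square: sum over all cells of (O - E)^2 / E
chi2 = sum((o - e) ** 2 / e
           for obs_row, exp_row in zip(observed, expected)
           for o, e in zip(obs_row, exp_row))

df = (len(observed) - 1) * (len(observed[0]) - 1)

# For df = 2 the chi-square survival function simplifies to exp(-x / 2)
p_value = math.exp(-chi2 / 2)

print(f"chi2 = {chi2:.3f}, df = {df}, p = {p_value:.3f}")
```

Using the exact expected counts rather than the rounded 13.9/22.2 values gives χ² = 11.025, which matches the SPSS output rather than the hand-rounded 11.03.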

5. Determine Degrees of Freedom

df = (R-1)(C-1) = (2-1)(3-1) = 2

• Check the critical value in the Table with the following parameters
• α = 0.05
• df = 2
6. Compare computed test statistic against a tabled/critical value

• α = 0.05
• df = 2
• Remember: if the calculated χ² is greater than the χ² table value, reject Ho
• Critical tabled value = 5.991
• Test statistic, 11.03, exceeds the critical value
• The null hypothesis is rejected
• There is a relationship between heart disease and COVID-19 infection
among elderly patients
The Hypothesis

If the calculated χ² is greater than the χ² table value, reject Ho:
11.03 > 5.991

• Ho: There is NO relationship between heart disease and COVID-19
infection among elderly patients
• Ha: There is a relationship between heart disease and COVID-19
infection among elderly patients
SPSS Output for COVID-19 infection Example

Chi-Square Tests
                               Value     df   Asymp. Sig. (2-sided)
Pearson Chi-Square             11.025a    2   .004
Likelihood Ratio               11.365     2   .003
Linear-by-Linear Association    8.722     1   .003
N of Valid Cases               90
a. 0 cells (.0%) have expected count less than 5. The minimum
expected count is 11.11.
How to report Chi Square
• A Chi-squared analysis showed a significant relationship between
heart disease and COVID-19 infection among elderly patients
(χ² = 11.03, df = 2, P = 0.004).
• If any expected count is ≤ 5 and the table is 2 × 2, use Fisher's Exact Test
p-values
• If expected counts are > 5 and the table is larger than 2 × 2, use Chi-squared
Test p-values
• If expected counts are > 5 and the table is 2 × 2, use continuity-corrected
p-values
• For example:
• "Amongst those that currently smoked, ___% had experienced
symptoms of asthma whereas ___% of non-smokers experienced
such symptoms. This was statistically significant/non-significant
at a 5% level using a two-tailed continuity-corrected chi-squared
test with p = ___"
• "Amongst those that currently smoked, ___% had experienced
symptoms of asthma whereas ___% of non-smokers experienced
such symptoms. This was statistically significant/non-significant
at a 5% level using a two-tailed Fisher's Exact test with p = ___"
One-dimensional
• Also known as the Chi Square Goodness of Fit test
• Suppose we want to know how people in a particular area will vote in
the general election, and we go around asking them:

          APC   PDP   Labor
Observed   20    30     10

• How will we see what's really going on?
• Research Question: Can PDP win the district?
• Ho: There is no association between how people will vote and PDP
winning in the district
• Ha: There is an association between how people will vote and PDP
winning in the district
• Solution: Chi-square analysis to determine if our outcome differs
from what would be expected if there were no preference

2  
2
(O E)
E
APC PDP Labor

Observed 20 30 10
Expected 20 20 20
• Plug in to formula
• (Expected observation is gotten from the mean of the observed
proportions)
 (20  20)2 (30  20)2 (10  20)2
2  
2
(O E)
 
E 20 20 20
 2 (2)  10
.05
2
 5.99
• Remember: if the calculated χ² is greater than the χ² table value, reject Ho
• 10 > 5.99, so reject H0
• The district will probably vote PDP
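The same goodness-of-fit arithmetic as a minimal Python sketch (standard library only; the equal-preference expected counts follow the no-preference null from the example above):

```python
observed = [20, 30, 10]          # APC, PDP, Labor

# Under Ho (no preference) the total is spread equally across the parties
n = sum(observed)
expected = [n / len(observed)] * len(observed)   # [20.0, 20.0, 20.0]

# Pearson goodness-of-fit statistic: sum of (O - E)^2 / E
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

df = len(observed) - 1           # k - 1 categories

print(f"chi2 = {chi2:.2f}, df = {df}")  # compare against the 5.99 critical value
```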
More Examples
• What do Doctors do with their free time?

          Watch TV   Sleep   Plan Strike   Play
Males         30       40         20        10
Females       20       30         40        10

• Question: Is there a relationship between gender and what Doctors
do with their free time?
• Can you state the null and the alternative hypotheses?

          Watch TV   Sleep   Plan Strike   Play   Total
Males         30       40         20        10     100
Females       20       30         40        10     100
Total         50       70         60        20     200

• Expected = (Ri * Cj) / N
• Example for Males / Watch TV: (100 * 50) / 200 = 25
              Watch TV   Sleep     Plan Strike   Play      Total
Males (E)     30 (25)    40 (35)     20 (30)     10 (10)    100
Females (E)   20 (25)    30 (35)     40 (30)     10 (10)    100
Total            50         70          60          20      200

• df = (R-1)(C-1) = (2-1)(4-1) = 3
• R = number of rows
• C = number of columns
Interpretation

χ²(3) = 10.10
Critical χ² at α = 0.05 with df = 3 is 7.82

• Remember: if the calculated χ² is greater than the χ² table value, reject Ho
• 10.10 > 7.82, so reject H0: there is a statistically significant relationship
between gender and what Resident Doctors do with their free time
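Since this table is computed exactly like the earlier heart-disease example, the cell-by-cell work can be wrapped in a small reusable helper (a sketch; the function name pearson_chi2 is my own, not from the slides):

```python
def pearson_chi2(observed):
    """Pearson chi-square statistic and df for a two-way table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n   # Fe = Fr * Fc / N
            chi2 += (o - e) ** 2 / e
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, df

# Doctors' free-time table: rows = Males, Females;
# columns = Watch TV, Sleep, Plan Strike, Play
chi2, df = pearson_chi2([[30, 40, 20, 10],
                         [20, 30, 40, 10]])
print(f"chi2 = {chi2:.2f}, df = {df}")  # 10.10 exceeds the 7.82 critical value
```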
Interpretation of the test will consider the frequencies
• What do Doctors do with their free time?

          Watch TV   Sleep   Plan Strike   Play
Males         30       40         20        10
Females       20       30         40        10
Assumptions
• Normality
  • Rule of thumb: we need expected frequencies of at least 5 in each cell
• Inclusion of non-occurrences
  • Must include all responses, not just the positive ones
• Independence
  • This does not refer to whether the variables are independent or related
(that is what the test assesses); rather, as with t-tests, the observations
(data points) must not have any bearing on one another
• To help with the last two, make sure that your N equals the total number of
people who responded
Exercise
• A study was conducted to assess the effectiveness of motorcycle safety
helmets in preventing head injury. The data consist of a random sample of
793 people involved in motorcycle accidents during a one-year period. Use
the data below to test whether there is an association between having a
head injury and wearing a helmet.

                     Head injury
Wearing helmet       Yes      No
Yes                   17     130
No                   218     428
Total                235     558
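One way to check your answer to the exercise is the same cell-by-cell computation (a sketch; note that, per the reporting rules above, a 2 × 2 table with large expected counts would normally get a continuity correction, which this plain Pearson statistic omits):

```python
# Helmet exercise: rows = helmet worn (yes, no), columns = head injury (yes, no)
observed = [[17, 130],
            [218, 428]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)          # 793 accident victims

# Uncorrected Pearson chi-square over the four cells
chi2 = sum((o - row_totals[i] * col_totals[j] / n) ** 2
           / (row_totals[i] * col_totals[j] / n)
           for i, row in enumerate(observed)
           for j, o in enumerate(row))

df = (len(observed) - 1) * (len(observed[0]) - 1)   # (2-1)(2-1) = 1
print(f"chi2 = {chi2:.2f}, df = {df}")  # compare against the df = 1 critical value
```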
