Statistical method of categorical variable

The document discusses statistical methods for analyzing categorical variables, focusing on contingency tables and the chi-square test. It provides examples of how to calculate observed and expected frequencies, as well as the assumptions and applications of the chi-square test, including Fisher's exact test for small sample sizes. Additionally, it covers McNemar's test for matched pairs and the use of SPSS for conducting chi-square tests.

Uploaded by

Berhanu Yelea
Copyright
© All Rights Reserved

Statistical methods for Categorical Variables

03/22/2025 1
Contingency table
• A contingency table is a table in matrix format that displays the frequency distribution of the variables.
• It provides a basic picture of the interrelation between two variables.
• Each “box” containing a frequency is called a cell.

Contingency table

• In general, a 2 x 2 table of observed frequencies for a sample of size n can be represented as follows:

• The corresponding table of expected counts is:

Example
The following table shows the relation between the number of accidents in 1 year and the age of the driver in a random sample of 500 drivers between 18 and 50.

Number of          Age of driver
accidents     18-25    26-40    >40    Total
0               75      115     110     300
1               50       65      35     150
≥ 2             25       20       5      50
Total          150      200     150     500

This is a 3 x 3 contingency table of observed frequencies. Each cell is identified by its row and column: cell 12, for example, is the cell in row 1 and column 2 (observed frequency 115).

• Calculation of expected frequencies:
There are a total of 150 drivers aged 18-25, and 300/500 = 3/5 of all drivers have had no accidents. If there is no relation between driver age and number of accidents, we expect that (3/5)(150) = 90 drivers aged 18-25 would have no accidents, i.e.,
e11 = 300 × 150 / 500 = 90
e12 (row 1 and column 2) = 300 × 200 / 500 = 120
In general, the expected frequency of a cell is (row total × column total) / overall total.

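The rule used for e11 and e12 extends to every cell of the table. The following is a minimal Python sketch of that calculation; the function name `expected_frequencies` is an illustrative choice, not something defined in the slides.

```python
def expected_frequencies(observed):
    """Return the table of expected counts under independence.

    Each expected count is (row total * column total) / overall total.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


# Observed counts from the driver example: rows are 0, 1 and 2 or more
# accidents; columns are ages 18-25, 26-40 and >40.
observed = [
    [75, 115, 110],
    [50, 65, 35],
    [25, 20, 5],
]

expected = expected_frequencies(observed)
for row in expected:
    print(row)
```

The first row comes out as 90, 120 and 90, matching the hand calculation.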
Expected frequencies

Number of                    Age of driver
accidents                18-25    26-40    >40    Total
0        Observed          75      115     110     300
         Expected          90      120      90
1        Observed          50       65      35     150
         Expected          45       60      45
≥ 2      Observed          25       20       5      50
         Expected          15       20      15
Total                     150      200     150     500

The chi-square test (as a test of association)

• The chi-square test is useful for making statistical inferences about categorical data with two or more categories.
• Definition: a statistic that measures the discrepancy between k observed frequencies O1, O2, …, Ok and the corresponding expected frequencies e1, e2, …, ek:
χ² = Σ (Oi − ei)² / ei, summed over i = 1, …, k

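As a small illustration of the definition, the sketch below sums (O − e)² / e over paired lists of observed and expected frequencies. The function name is hypothetical; the numbers are the observed and expected counts from the driver table.

```python
def chi_square_statistic(observed, expected):
    """Sum of (O - e)^2 / e over all cells, given two flat lists."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Perfect agreement between observed and expected gives zero.
no_discrepancy = chi_square_statistic([90, 120, 90], [90, 120, 90])

# The nine cells of the driver table, read row by row.
observed = [75, 115, 110, 50, 65, 35, 25, 20, 5]
expected = [90, 120, 90, 45, 60, 45, 15, 20, 15]
stat = chi_square_statistic(observed, expected)
print(no_discrepancy, round(stat, 2))
```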
• For a 2 x 2 table, the following shortcut formula can be used:

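The shortcut formula itself is not reproduced in the text. A commonly used form, assuming the four cells are labelled a, b (first row) and c, d (second row) with n = a + b + c + d, is χ² = n(ad − bc)² / [(a + b)(c + d)(a + c)(b + d)]. The sketch below implements that form and cross-checks it against the general definition; the cell labels, the function names and the counts are assumptions made for illustration.

```python
def chi_square_2x2(a, b, c, d):
    """Shortcut chi-square for the 2 x 2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


def chi_square_general(table):
    """General sum of (O - e)^2 / e, used here only as a cross-check."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    total = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n
            total += (o - e) ** 2 / e
    return total


# Made-up counts, for illustration only.
a, b, c, d = 30, 20, 15, 35
shortcut = chi_square_2x2(a, b, c, d)
general = chi_square_general([[a, b], [c, d]])
print(shortcut, general)
```

Both routes give the same value, which is the point of the shortcut: it avoids computing the expected counts explicitly.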
The chi-square test (cont.)

• A chi-square (χ²) distribution is a probability distribution.
• There is a different χ² distribution for each value of the degrees of freedom (df), but all of them share the following characteristics:
1. Every χ² distribution extends indefinitely to the right from 0.
2. Every χ² distribution has only one (right) tail.
3. As df increases, the χ² curves become more bell shaped and approach the normal curve in appearance (but remember that a chi-square curve starts at 0, not at −∞).

The chi-square test (cont.)

• If χ² = 0, there is perfect agreement between the observed and the expected frequencies.
• The larger the value of χ², the greater the discrepancy between the observed and expected frequencies.

Assumptions of chi square

• Every observation in the sample should be independent of all other observations.
• The chi-square test should not be used when the overall total n of the table is less than 40 and any expected value is less than 5.
• The conventional criterion for the χ² test to be valid is that at least 80 percent of the expected frequencies should exceed 5.
• No observed cell should be zero.
• For both 2 x 2 and R x C contingency tables, all the expected frequencies should exceed 1.

Fisher’s exact test
• If the criterion is not satisfied, we can usually combine or delete rows and columns to give bigger expected values.
• However, this procedure cannot be applied to 2 x 2 tables.
• If more than 20% of the expected values are less than 5, use Fisher’s exact test.
• When to use Fisher’s exact test for a 2 x 2 table:
– When the overall sample size is less than 20
– When the sample size is between 20 and 40 and the smallest of the four expected values is less than 5
Read about Fisher’s exact test.

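The slides leave the details of Fisher’s exact test as reading. As a rough sketch of the idea for a 2 x 2 table: with the margins held fixed, the top-left cell follows a hypergeometric distribution, and one common two-sided p-value adds up the probabilities of all tables that are no more likely than the observed one. The function name, the two-sided convention and the counts below are assumptions chosen for illustration.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more probable than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Probability that the top-left cell equals x, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # The small relative tolerance guards against floating-point ties.
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )


# Made-up small table (total of 8), far below the sample sizes where the
# chi-square approximation is considered safe.
p_value = fisher_exact_two_sided(3, 1, 1, 3)
print(p_value)
```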
Example
The following table shows the relation between the number of accidents in 1 year and the age of the driver in a random sample of 500 drivers between 18 and 50. Test, at a 0.01 level of significance, the hypothesis that the number of accidents is independent of the driver's age.

Number of          Age of driver
accidents     18-25    26-40    >40    Total
0               75      115     110     300
1               50       65      35     150
≥ 2             25       20       5      50
Total          150      200     150     500

• Hypothesis:
– H0: There is no relation between age of driver and number of accidents
– HA: The variables are dependent (related)
• The degrees of freedom (df) in a contingency table with R rows and C columns is:
df = (R – 1)(C – 1) = (3 – 1)(3 – 1) = 4
Hence, χ²tab with df = 4, at the 0.01 level of significance, is 13.277.

Expected frequencies

Number of                    Age of driver
accidents                18-25    26-40    >40    Total
0        Observed          75      115     110     300
         Expected          90      120      90
1        Observed          50       65      35     150
         Expected          45       60      45
≥ 2      Observed          25       20       5      50
         Expected          15       20      15
Total                     150      200     150     500

• χ²calc = (75 – 90)²/90 + (115 – 120)²/120 + (110 – 90)²/90 + … + (5 – 15)²/15
= 2.5 + 0.208 + 4.444 + 0.556 + 0.417 + 2.222 + 6.667 + 0 + 6.667
= 23.7 (this corresponds to a P-value of less than 0.001)
Therefore, there is a relationship between the number of accidents and the age of the driver.

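The arithmetic of this example can be re-checked in a few lines. The sketch below recomputes the statistic from the observed table and obtains the P-value from the closed form of the chi-square upper tail that holds for even degrees of freedom. The helper names are illustrative, and 13.277 is the usual tabulated critical value for df = 4 at the 0.01 level.

```python
from math import exp, factorial


def chi_square_sf_even_df(x, df):
    """Upper-tail probability P(X >= x) for a chi-square with even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only covers positive even df")
    half = x / 2
    return exp(-half) * sum(half ** k / factorial(k) for k in range(df // 2))


observed = [
    [75, 115, 110],
    [50, 65, 35],
    [25, 20, 5],
]
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Sum (O - e)^2 / e over all nine cells.
stat = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / n
        stat += (o - e) ** 2 / e

df = (len(observed) - 1) * (len(observed[0]) - 1)
p_value = chi_square_sf_even_df(stat, df)
critical = 13.277  # tabulated chi-square value for df = 4, alpha = 0.01
reject_h0 = stat > critical
print(round(stat, 2), df, p_value, reject_h0)
```

The first cell contributes (75 − 90)²/90 = 2.5 to the total.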
Exercise
A study was conducted and the following data were found. We wish to know if we may conclude that there is a relationship between Human Papilloma Virus (HPV) status and stage of HIV infection (α = 0.05).

                              HIV
HPV         Seropositive,   Seropositive,   Seronegative   Total
            symptomatic     asymptomatic
Positive         23               4              10          37
Negative         10              14              35          59
Total            33              18              45          96

Solution
• Hypothesis:
– H0: HPV status and stage of HIV infection are independent
– HA: The two variables are not independent (related)
• The degrees of freedom (df) is:
df = (R – 1)(C – 1) = (2 – 1)(3 – 1) = 2
α = 0.05
Hence, χ²tab with df = 2, at the 0.05 level of significance, = 5.991

• Observed and expected frequencies (expected values in parentheses).

                              HIV
HPV         Seropositive,   Seropositive,   Seronegative   Total
            symptomatic     asymptomatic
Positive      23 (12.72)       4 (6.94)      10 (17.34)      37
Negative      10 (20.28)      14 (11.06)     35 (27.66)      59
Total            33              18              45          96

χ²calc = (23 – 12.72)²/12.72 + (4 – 6.94)²/6.94 + … + (35 – 27.66)²/27.66
= 8.30805 + 1.24548 + … + 1.94778 = 20.601
Conclusion: since 20.601 > 5.991, reject H0.
We conclude that H0 is false and that there is a relationship between HPV status and stage of HIV infection.

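The whole exercise, from expected counts to decision, can be sketched in one small function. The name `chi_square_test` and its return format are assumptions for illustration. Because the code keeps unrounded expected counts, its statistic differs from the hand calculation in the third decimal place.

```python
from math import exp


def chi_square_test(table):
    """Return (statistic, df) for a test of independence on an R x C table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected count
            stat += (o - e) ** 2 / e
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


# Rows: HPV positive, HPV negative.
# Columns: seropositive symptomatic, seropositive asymptomatic, seronegative.
hpv_table = [
    [23, 4, 10],
    [10, 14, 35],
]
stat, df = chi_square_test(hpv_table)

critical = 5.991           # tabulated value for df = 2, alpha = 0.05
p_value = exp(-stat / 2)   # exact upper-tail probability when df = 2
reject_h0 = stat > critical
print(round(stat, 3), df, p_value, reject_h0)
```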
Yates’ corrected chi-square test
• With 2 x 2 tables, the chi-square distribution can be better approximated using Yates’ continuity correction, particularly when the sample sizes are small.
• For large samples the effect of the correction is small.
• Using the continuous chi-square distribution to approximate discrete frequencies tends to overestimate the test statistic, and we use a continuity correction to remove this bias.
• A correction is applied to the original chi-square value to improve the fit.
• Yates’ correction is not applied to contingency tables larger than 2 x 2; there, low cell frequencies are resolved by pooling (collapsing) adjacent cells.
• For extremely small samples, the chi-square test, even with Yates’ correction, is not recommended. In this case use Fisher’s exact test.
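The slides do not spell out the corrected statistic. In its usual form, Yates’ correction subtracts 0.5 from each |O − e| before squaring. The sketch below applies that form to a 2 x 2 table; the function name, the `correction` flag and the counts are made up for illustration.

```python
def chi_square_2x2_cells(a, b, c, d, correction=True):
    """Chi-square for [[a, b], [c, d]], optionally with Yates' correction."""
    observed = ((a, b), (c, d))
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    n = a + b + c + d
    total = 0.0
    for i in range(2):
        for j in range(2):
            e = row_totals[i] * col_totals[j] / n
            diff = abs(observed[i][j] - e)
            if correction:
                # Shrink each |O - e| by 0.5, but never below zero.
                diff = max(diff - 0.5, 0.0)
            total += diff ** 2 / e
    return total


# Made-up small-sample counts, for illustration only.
uncorrected = chi_square_2x2_cells(12, 5, 6, 13, correction=False)
corrected = chi_square_2x2_cells(12, 5, 6, 13, correction=True)
print(round(uncorrected, 3), round(corrected, 3))
```

With these counts the correction lowers the statistic, which is the direction described above.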
Chi-square test using SPSS

• Analyze → Descriptive Statistics → Crosstabs
• For ROWS, select the independent variable
• For COLUMNS, select the dependent variable
• Under STATISTICS, click on CHI-SQUARE
• Under CELLS, click on Observed, Expected and Row percentages
χ² test for paired and matched data (McNemar’s test)
• Matched samples arise from: matched case-control studies, before-and-after studies, or crossover trials.
• The data can be summarized, retaining the matching, in a 2 x 2 table:

                  Control
Case          Exposed   Unexposed
Exposed         n++        n+-
Unexposed       n-+        n--

Example:

We are interested in investigating the association between retirement status and heart disease. 127 individuals who had experienced cardiac arrest were matched (on age, gender, and socioeconomic status) with 127 healthy control subjects. Is the proportion of retired individuals identical in the cardiac arrest patients and the healthy subjects?

                     Retired
Cardiac arrest     Yes     No    Total
Yes                 47     80     127
No                  39     88     127
Total               86    168     254

McNemar’s test (cont.)

• The assumption of the chi-square test is not fulfilled because the individuals in the two samples were matched and are not independent.
• We need to take into account the pairing and analyze the data in terms of matched pairs.
• The following table gives the status of the case-control pairs.

Example cont.

                  Cardiac arrest
Healthy         Retired   Not retired   Total
Retired            27          12         39
Not retired        20          68         88
Total              47          80        127

Note: Each entry in the table corresponds to a pair of subjects rather than an individual subject.

Example cont.

The same pairs, listed by the status of the case and of the control (+ = retired, - = not retired):

Case   Control   Frequency
+      +            27
+      -            20
-      +            12
-      -            68

+,+ and -,- pairs are concordant pairs; +,- and -,+ pairs are discordant pairs.

McNemar’s test (cont.)

• A concordant pair is a matched pair in which the outcome is the same for each member of the pair.
• A discordant pair is a matched pair in which the outcome differs between the members of the pair.

McNemar’s test (cont.)

• Is there an association between cardiac arrest and retirement status?
• The concordant pairs provide no information, so we focus on the discordant pairs.
• Represent the total number of discordant pairs by nD.

                  Cardiac arrest
Healthy         Retired         Not retired     Total
Retired            27           12 (nB = b)       39
Not retired     20 (nA = c)         68            88
Total              47               80           127

• The discordant pairs can be further subdivided into:
• Type A: those in which the person who experienced cardiac arrest is retired and the healthy individual is not, and
• Type B: those in which the healthy subject is retired and the one with heart disease is not
• nA = the number of +,- or type A discordant pairs
• nB = the number of -,+ or type B discordant pairs
• nD = nA + nB = total number of discordant pairs

• To perform McNemar’s test, calculate the test statistic:
χ² = (b − c)² / (b + c)

• An alternative formula is:
χ² = (2nA − nD)² / nD

• where nD = nA + nB is the total number of discordant pairs.

• Continuity correction

If H0 is true, b and c should be approximately equal.
We do not reject H0 if the difference between b and c is small.
We reject H0 if the difference is large.

• So for this case the test statistic is

χ² = (2(20) − 32)² / 32 = 2.0

The critical value from the chi-square table at df = 1 and α = 0.05 is 3.84.
Since 2.0 < 3.84, we are unable to reject H0.

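The retirement example uses only the two discordant counts, so it is short to reproduce. The sketch below computes the statistic as (nA − nB)² / nD and checks it against the (2nA − nD)² / nD form used in the hand calculation. The optional continuity-corrected version, (|nA − nB| − 1)² / nD, is the usual textbook form and is an assumption here, as is the function name.

```python
def mcnemar_statistic(n_a, n_b, continuity=False):
    """McNemar chi-square from the two discordant pair counts (df = 1)."""
    n_d = n_a + n_b
    if n_d == 0:
        raise ValueError("the test needs at least one discordant pair")
    diff = abs(n_a - n_b)
    if continuity:
        # Usual continuity correction: reduce |nA - nB| by 1, not below 0.
        diff = max(diff - 1, 0)
    return diff ** 2 / n_d


n_a, n_b = 20, 12  # type A and type B discordant pairs from the example
n_d = n_a + n_b

stat = mcnemar_statistic(n_a, n_b)
alternative = (2 * n_a - n_d) ** 2 / n_d
corrected = mcnemar_statistic(n_a, n_b, continuity=True)

critical = 3.84  # tabulated chi-square value for df = 1, alpha = 0.05
reject_h0 = stat > critical
print(stat, alternative, corrected, reject_h0)
```

The corrected statistic is smaller than the uncorrected one, so the decision not to reject H0 is unchanged.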
McNemar’s test using SPSS

• Analyze → Descriptive Statistics → Crosstabs
• For ROWS, select the outcome for condition/time 1
• For COLUMNS, select the outcome for condition/time 2
• Under STATISTICS, click on MCNEMAR
• Under CELLS, click on Observed and Row percentages
Chi square for trend
• Suppose that in a 2 x C table the column variable is ordinal.
• The question of interest is whether there is a trend in the proportions falling into the first (or second) row across the levels of the column variable.
• To detect such trends, assign scores to the column groups (1 for the first group, 2 for the second, and so on).
• H0: no trend among the proportions
• HA: the proportions are increasing or decreasing
• Calculate the test statistic
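The test statistic is not shown in the text. One widely used choice is the Cochran-Armitage form, which compares the score-weighted number of events with its expectation under H0 and refers the result to a chi-square distribution with 1 df. The sketch below assumes that form (with the variance computed using N rather than N − 1); the function name and the counts are made up for illustration.

```python
def chi_square_trend(events, totals, scores=None):
    """Chi-square for linear trend in proportions (1 df).

    events[i] is the number of subjects in the row of interest within
    column group i, and totals[i] is the size of that group.
    """
    if scores is None:
        scores = list(range(1, len(totals) + 1))  # 1, 2, 3, ...
    n = sum(totals)
    p = sum(events) / n  # overall proportion under H0
    # Score-weighted excess of observed over expected events.
    t = sum(x * (a - m * p) for x, a, m in zip(scores, events, totals))
    # Spread of the scores, weighted by group size.
    sum_mx = sum(m * x for x, m in zip(scores, totals))
    sum_mxx = sum(m * x * x for x, m in zip(scores, totals))
    variance = p * (1 - p) * (sum_mxx - sum_mx ** 2 / n)
    return t * t / variance


# Made-up counts: three ordered groups of 50 subjects each.
rising = chi_square_trend([10, 20, 30], [50, 50, 50])
flat = chi_square_trend([20, 20, 20], [50, 50, 50])
print(round(rising, 3), flat)
```

Proportions that rise steadily across the groups give a large statistic, while identical proportions give a statistic of zero.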
Sample Size Determination

Sample Size Estimation

• Sample size estimation means deciding how many people need to be studied in order to answer the study objectives.
• The eventual sample size is usually a compromise between what is desirable and what is feasible.
• If the study is too small, we may fail to detect important effects or may estimate effects too imprecisely.
• If the study is too large, we will waste resources.
• The feasible sample size is determined by the availability of resources (to collect the information and also to analyze it).
• In general, beyond a certain point it is much better to increase the accuracy of data collection (by improving the training of data collectors and the data collection tools) than to increase the sample size.
1. Single population proportion
• Let p denote the proportion of successes; then

In order to calculate the required sample size, you need to know the following:
a) A reasonable estimate of the key proportion to be studied (p). If you cannot guess the proportion, take it as 50%.
b) The degree of accuracy required (d), that is, the allowed deviation from the true proportion in the population as a whole. It can be within 1% or 5%, etc.
c) The confidence level required (z), usually specified as 95%.

d) The size of the population that the sample is to represent (N). If it is more than 10,000, the precise magnitude is not likely to be very important; but if the population is less than 10,000, then a smaller sample size may be required.
n = Z² p(1 − p) / d²

Example 1: (Prevalence of diarrhea)

a) p = 0.26, d = 0.03, Z = 1.96 (i.e., for a 95% C.I.)
n = (1.96)² (0.26 × 0.74) / (0.03)²
= 821.25 ≈ 822
Thus, the study should include at least 822 subjects.

b) If the above sample is to be taken from a relatively small population (say N = 3000), the required minimum sample is obtained from the above estimate by applying a correction:
nf = n / (1 + n/N) ……… correction formula
= 821.25 / (1 + 821.25/3000) = 644.7 ≈ 645 subjects

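Both steps of Example 1 can be reproduced with two small functions. This is a minimal sketch; the function names are illustrative, and the final sizes are rounded up because a fractional subject cannot be sampled.

```python
from math import ceil


def sample_size_proportion(p, d, z=1.96):
    """n = z^2 * p * (1 - p) / d^2 for estimating a single proportion."""
    return z ** 2 * p * (1 - p) / d ** 2


def finite_population_correction(n, population):
    """nf = n / (1 + n / N) for sampling from a finite population."""
    return n / (1 + n / population)


# Example 1: p = 0.26, d = 0.03, 95% confidence.
n = sample_size_proportion(0.26, 0.03)
nf = finite_population_correction(n, 3000)

print(round(n, 2), ceil(n))    # unadjusted size and the number to recruit
print(round(nf, 1), ceil(nf))  # size after correcting for N = 3000
```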
Exercise
• A hospital administrator wishes to know what proportion of discharged patients are unhappy with the care received during hospitalization. If a 95% confidence interval is desired to estimate the proportion within 5%, how large a sample should be drawn?

Points to be considered

1. If sampling is from a finite population of size N < 10,000, then
n = n0 / (1 + n0/N)
where n0 is the sample size for an infinite population.

2. The initial sample size approached in the study may need to be increased in accordance with the expected response rate, loss to follow-up, lack of compliance and any other predicted reasons for loss of subjects.
3. Design effect for complex cluster sampling: commonly, n is multiplied by 1.5, 2, 3, … 5.
2. Estimating a single population mean

• The same approach is used, but with SE = σ / √n.
• The required (minimum) sample size for a very large population is given by:
n = Z² σ² / d²

Estimating a single population mean (cont.)

1. Specify d (or w = 2d),
where d = margin of error = absolute precision = half of the width (w) of the CI
2. Use the known σ² or estimate it using s².
For unknown σ², it has to be estimated from:
• A pilot or preliminary sample:
– Select a pilot sample and estimate σ² with the sample variance, s²
• Previous or similar studies

Example:
• Health professionals wish to estimate the mean hemoglobin level in a defined community. From preliminary contact they think that this mean is about 150 mg/l with a standard deviation of 32 mg/l.
• If they are willing to tolerate a sampling error of up to 5 mg/l in their estimate, how many subjects should be included in their study? (α = 5%, two sided)
Solution:
- If the population size is assumed to be very large, the required sample size would be: n = (1.96)² (32)² / (5)² = 157.4 ≈ 158 persons

- If the population size is, say, 2000, the required sample size would be 157.4 / (1 + 157.4/2000) ≈ 146 persons.

NB: σ² can be estimated from previous similar studies or could be obtained by conducting a small pilot study.

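The hemoglobin example follows the same pattern as the proportion case, with σ² in place of p(1 − p). A minimal sketch, using hypothetical function names:

```python
from math import ceil


def sample_size_mean(sigma, d, z=1.96):
    """n = z^2 * sigma^2 / d^2 for estimating a single mean."""
    return z ** 2 * sigma ** 2 / d ** 2


def finite_population_correction(n, population):
    """nf = n / (1 + n / N) for sampling from a finite population."""
    return n / (1 + n / population)


# Hemoglobin example: sigma = 32 mg/l, d = 5 mg/l, 95% confidence.
n = sample_size_mean(32, 5)
nf = finite_population_correction(n, 2000)

print(round(n, 1), ceil(n))    # very large population
print(round(nf, 1), ceil(nf))  # population of 2000
```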
To test a hypothesis about the difference between two population means (common standard deviation)

r is the ratio of the size of sample 2 to sample 1.

To test a hypothesis about the difference between two population proportions
• Then the sample size can be calculated:

Where

This equation is quite general: it applies to comparative cross-sectional, cohort, and case-control studies.
• If the OR or RR and one of the proportions are known, we can compute the unknown proportion by: P1 = P2 × RR

Example
A case-control study is planned to compare the efficacy of a vaccine for the prevention of childhood tuberculosis with a placebo. Assume that 30% of the controls are not vaccinated. If the numbers of cases and controls are equal, what sample size is needed to detect, with 80% power and 5% type I error, an odds ratio of at least 2 in the target population?
• The probability of exposure (that is, no vaccine given) for a control is P2 = 0.3.
• To calculate P1:
P1 = 0.3 / (0.3 + 0.7/2) = 0.462
• To calculate P:
P = (0.3 + 0.462)/2 = 0.381
Then to calculate the sample size use:

n1 = 139.9
We conclude that 140 cases and 140 controls are required for the study.
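The sample size equation for this example is not reproduced in the text. One common formula for equal group sizes, n = [Zα/2 √(2P(1 − P)) + Zβ √(P1(1 − P1) + P2(1 − P2))]² / (P1 − P2)², does reproduce the reported 139.9 when Zβ ≈ 0.8416 (80% power) and P1 is rounded to 0.462. The sketch below assumes that formula; the function names are illustrative.

```python
from math import ceil, sqrt


def exposure_in_cases(p2, odds_ratio):
    """Exposure proportion among cases implied by p2 and the odds ratio."""
    return odds_ratio * p2 / (odds_ratio * p2 + (1 - p2))


def sample_size_two_proportions(p1, p2, z_alpha=1.96, z_beta=0.8416):
    """Per-group size for comparing two proportions (equal group sizes)."""
    p_bar = (p1 + p2) / 2
    null_part = z_alpha * sqrt(2 * p_bar * (1 - p_bar))
    alt_part = z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    return (null_part + alt_part) ** 2 / (p1 - p2) ** 2


p2 = 0.3
# Rounded to three decimals, as in the hand calculation.
p1 = round(exposure_in_cases(p2, 2), 3)
n1 = sample_size_two_proportions(p1, p2)
print(p1, round(n1, 1), ceil(n1))
```

Carrying the unrounded value of P1 through the same formula gives a slightly larger n1, so the rounding step matters when the result sits this close to a whole number.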
Sample size calculation using Epi Info

• Open the software
• Go to STATCALC and click it
• The STATCALC window appears
• Select the study design to calculate your sample size

Using Stata for sample size calculation

• Open Stata
• Open the Statistics drop-down menu
• Click on Power and sample size
• The Power and sample-size analysis window appears
• Select the appropriate method; methods are organized by population parameter, outcome, analysis type and sample
Thank you!!!
