Statistical methods for categorical variables
Categorical Variables
03/22/2025 1
Contingency table
• A contingency table is a type of table
in a matrix format that displays the
frequency distribution of the
variables.
Contingency table
• In general, a 2 x 2 table of
observed frequencies for a sample of
size p can be represented as follows
Example
The following table shows the relation
between the number of accidents in
1 year and the age of the driver in a
random sample of 500 drivers aged
between 18 and 50.
Number of accidents (rows) by age of driver (columns): 18-25, 26-40, >40, total
This is a 3 x 3 contingency table.
Observed frequencies
Number of accidents (rows) by age of driver (columns): 18-25, 26-40, >40, total
Cell 12 denotes row 1 and column 2.
• Calculation of expected frequencies:
There are a total of 150 drivers aged 18-25,
and 300/500 = 3/5 of all drivers have
had no accidents. If there is no
relation between driver age and
number of accidents, we expect that
3/5(150) = 90 drivers aged 18-25
would have no accidents, i.e.,
e11 = 300 × 150/500 = 90
e12 (row 1 and column 2) = 300 × 200/500 = 120
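The rule used above (expected count = row total × column total / sample size) can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, the rows are collapsed into "no accidents" (300) versus "at least one accident" (the remaining 200), and the third column total (150) is assumed to be what is left of the 500 drivers after the 18-25 and 26-40 groups.

```python
def expected_frequencies(row_totals, col_totals):
    """Expected cell counts under independence: row total * column total / n."""
    n = sum(row_totals)
    if n != sum(col_totals):
        raise ValueError("row and column totals must add up to the same n")
    return [[r * c / n for c in col_totals] for r in row_totals]

# Row totals: 300 drivers with no accidents, 200 with at least one (assumed split).
# Column totals: 150 aged 18-25, 200 aged 26-40, 150 aged >40 (last one assumed).
row_totals = [300, 200]
col_totals = [150, 200, 150]

expected = expected_frequencies(row_totals, col_totals)
print(expected[0][0])  # e11, the 90 drivers derived in the text
print(expected[0][1])  # e12, the 120 drivers derived in the text
```

Each row of the result adds up to its row total, which is a quick check that the expected table is consistent with the margins.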
Expected frequencies
The chi-square test (as a test of association)
The chi-square test (cont.)
• The value of χ2 reflects how far the observed
frequencies are from the expected ones:
– If χ2 = 0, there is perfect
agreement between the observed and
the expected frequencies.
– The larger χ2 is, the greater the discrepancy
between the observed and expected
frequencies.
Assumptions of chi square
• Hypothesis:
– HO : There is no relation between age of
driver and number of accidents
– HA : The variables are dependent
(related)
• The degrees of freedom (df) in a
contingency table with R rows and C
columns is:
df = ( R – 1) ( C – 1)
Hence, df = (3 – 1)(3 – 1) = 4, and
χ2tab with df = 4, at .01 level of
significance = 13.277
Expected frequencies
• χ2calc = (75 – 90)² /90 + (115 –
120)² /120 + (110 – 90)² /90 + … +
(5 – 15)² /15
= 1 + 0.208 + 4.444 + 0.556 +
0.417 + 2.222 + 6.667 + 0
+ 6.667
= 22.2 (This corresponds to a P-value
of less than .001)
Since χ2calc exceeds the tabulated value, Ho is rejected.
Therefore, there is a relationship
between number of accidents and
age of the driver.
Exercise
A study was conducted and the
following data were found. We wish to
know if we may conclude that there
is a relationship between Human
Papilloma Virus (HPV) status and
stage of HIV infection (α = 0.05)
HPV        Stage of HIV infection    Total
Positive   23    4     10            37
Negative   10    14    35            59
Total      33    18    45            96
Solution
• Hypothesis:
– HO : HPV status and stage of HIV
infection are independent
– HA : The two variables are not
independent(related)
• The degrees of freedom (df) is:
df = ( R – 1) ( C – 1) = (2 – 1)(3 – 1) = 2
α = 0.05
Hence, χ2tab with df = 2, at 0.05 level of
significance = 5.991
• Observed and expected frequencies.
HPV HIV
Total 33 18 45 96
χ2calc = (23 – 12.72)² /12.72 + (4 –
6.94)² /6.94 + … + (35 – 27.66)²
/27.66
= 8.30805 + 1.24548 + … +
1.94778 = 20.601
Conclusion: since
20.601 > 5.991, reject Ho.
We conclude that Ho is false and that
there is a relationship between HPV
status and stage of HIV infection.
Yates' corrected chi-square test
• With 2x2 tables, the chi-square distribution can be better
approximated using Yates' continuity correction, particularly
when the sample sizes are small.
• For extremely small samples, the chi-square test, even with Yates' correction,
is not recommended. In this case use Fisher's exact test.
Chi-square test using SPSS
McNemar's test (cont.)
Cardiac arrest   Retired: Yes   No     Total
Yes              47             80     127
No               39             88     127
Total            86             168    254

Healthy          Cardiac arrest: Retired   Not retired   Total
Retired          27                        12            39
Not retired      20                        68            88
Total            47                        80            127
Note: Each entry in the table corresponds to a pair of subjects
rather than an individual subject.
Example (cont.)
Case   Control   Frequency
+      +         27
+      -         20
-      +         12
-      -         68
+,+ and -,- pairs are concordant pairs.
McNemar's test (cont.)
• The discordant pairs can be further subdivided into
– Type A: those in which the person who experienced cardiac
arrest is retired and the healthy individual is not, and
– Type B: those in which the healthy subject is retired and
the one with heart disease is not
• nA = the number of +,- or type A discordant pairs = 20
• nB = the number of -,+ or type B discordant pairs = 12
• nD = nA + nB = total number of discordant pairs
• To perform McNemar’s test, calculate
the test statistic
• Where
• Continuity correction
X2 = (2(20) – 32)² / 32
= 2.0
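The arithmetic above can be checked with a small Python sketch built on the discordant pair counts nA = 20 and nB = 12. Note that (2nA – nD)² / nD is algebraically the same as (nA – nB)² / (nA + nB), which gives the 2.0 shown. The continuity-corrected variant in the code, (|nA – nB| – 1)² / (nA + nB), is the commonly used form and is an assumption here, since the formula itself is not shown in the text. The function name is hypothetical, and the P-value line relies on the identity P(χ2 with 1 df > x) = erfc(sqrt(x/2)).

```python
import math

def mcnemar_statistic(n_a, n_b, continuity=False):
    """McNemar chi-square statistic (1 df) from the two discordant pair counts."""
    n_d = n_a + n_b
    if n_d == 0:
        raise ValueError("no discordant pairs: the test is undefined")
    diff = abs(n_a - n_b)
    if continuity:
        # Commonly used continuity correction (assumed form, not from the text).
        diff = max(diff - 1, 0)
    return diff ** 2 / n_d

n_a, n_b = 20, 12  # type A and type B discordant pairs from the example

uncorrected = mcnemar_statistic(n_a, n_b)
corrected = mcnemar_statistic(n_a, n_b, continuity=True)

# Upper-tail probability for a chi-square variable with 1 degree of freedom.
p_uncorrected = math.erfc(math.sqrt(uncorrected / 2))

print(uncorrected)               # 2.0, as in the text
print(round(corrected, 3))       # 1.531
print(round(p_uncorrected, 3))   # about 0.157
```

Only the discordant pairs enter the statistic; the 27 and 68 concordant pairs carry no information about the difference between cases and controls.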
McNemar’s test using SPSS
Sample Size Estimation
If the study is too small we may fail to detect important effects or may
estimate effects too imprecisely.
If the study is too large then we will waste resources.
In order to calculate the required sample size, you
need to know the following facts:
a) The reasonable estimate of the key proportion to be
studied (P). If you cannot guess the proportion, take it
as 50%.
b) The degree of accuracy required (d). That is, the
allowed deviation from the true proportion in the
population as a whole.
It can be within 1% or 5%, etc.
c) The confidence level required, usually specified as
95% (z)
d) The size of the population that the sample is to represent
(N) . If it is more than 10,000 the precise magnitude is not
likely to be very important; but if the population is less
than 10,000 then a smaller sample size may be required.
n = Z² p(1 – p) / d²
b) If the sample in Example 1, part a) (n = 821.25) is to be taken from a
relatively small population (say N = 3000), the
required minimum sample will be obtained from
that estimate by making some adjustment.
nf = n / (1 + (n/N)) (correction formula)
nf = 821.25 / (1 + (821.25/3000)) ≈ 645 subjects
Exercise
• A hospital administrator wishes to know what
proportion of discharged patients are unhappy
with the care received during hospitalization.
If a 95% confidence interval is desired to
estimate the proportion within 5%, how large a
sample should be drawn?
Points to be considered
2. The initial sample size approached in the study may need to be increased
in accordance with the expected response rate, loss to follow up, lack of
compliance and any other predicted reasons for loss of subjects.
3. Design effect for complex cluster sampling: common
values multiply n by 1.5, 2, 3, … 5.
Example 1
a) p = 0.26, d = 0.03, Z = 1.96 (i.e., for a 95% C.I.)
n = (1.96)² (.26 × .74) / (.03)² = 821.25 ≈ 822
Thus, the study should include at least 822 subjects.
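The single-proportion formula n = Z² p(1 – p) / d² and the finite population correction nf = n / (1 + n/N) used in this section can be sketched in Python as follows. The function names are hypothetical, and rounding up to the next whole subject is an assumption that matches the 822 and 645 reported in the text.

```python
import math

def sample_size_proportion(p, d, z=1.96):
    """Sample size to estimate a proportion p within margin d (large population)."""
    return z ** 2 * p * (1 - p) / d ** 2

def finite_population_correction(n, population_size):
    """Adjust n when sampling from a relatively small population of size N."""
    return n / (1 + n / population_size)

# Example 1, part a): p = 0.26, d = 0.03, 95% confidence.
n = sample_size_proportion(p=0.26, d=0.03)
print(round(n, 2), math.ceil(n))   # 821.25 and 822

# Example 1, part b): the same sample drawn from a population of N = 3000.
nf = finite_population_correction(n, 3000)
print(math.ceil(nf))               # 645
```

Adjustments for non-response or a design effect would be applied to these values afterwards, by inflating or multiplying n.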
2. Estimating a single population mean
Estimating a single population mean (cont.)
Example:
Health professionals wish to estimate the mean hemoglobin level in a
defined community. From preliminary contact they think that this mean is
about 150 mg/l with a standard deviation of 32 mg/l.
If they are willing to tolerate a sampling error of up to 5 mg/l in their
estimate, how many subjects should be included in their study? (α = 5%,
two sided)
Solution:
- If the population size is assumed to be very large, the required sample size
would be: n = (1.96)² (32)² / (5)² = 157.4 ≈ 158 persons
- If the population size is, say, 2000,
the required sample size would be 146 persons.
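The hemoglobin example can be checked with a short Python sketch, assuming the usual formula for estimating a mean, n = Z² σ² / d², and the same finite population correction nf = n / (1 + n/N) used for proportions. The function names are hypothetical, and rounding up to whole persons is an assumption consistent with the 158 and 146 in the solution.

```python
import math

def sample_size_mean(sigma, d, z=1.96):
    """Sample size to estimate a mean within margin d, given standard deviation sigma."""
    return (z * sigma / d) ** 2

def finite_population_correction(n, population_size):
    """Adjust n when the population of size N is relatively small."""
    return n / (1 + n / population_size)

# Standard deviation 32 mg/l, tolerated sampling error 5 mg/l, alpha = 5% two sided.
n = sample_size_mean(sigma=32, d=5)
print(math.ceil(n))    # 158 persons for a very large population

# Same study in a population of 2000.
nf = finite_population_correction(n, 2000)
print(math.ceil(nf))   # 146 persons
```

The assumed mean of 150 mg/l does not enter the calculation; only the standard deviation and the tolerated error do.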
To test a hypothesis about the
difference between two population
means (common standard deviation)
To test a hypothesis about the difference between
two population proportions
• Then sample size can be calculated
Where
Example
A case‐control study is planned to compare the
efficacy of a vaccine for the prevention
of childhood tuberculosis with a
placebo. Assume that 30% of the
controls are not vaccinated. If the
numbers of cases and controls are
equal, what sample size is needed to
detect, with 80% power and 5% type I
error, an odds ratio of at least 2 in the
target population?
• The probability of exposure (that is,
no vaccine given) for a control is P2 =
0.3.
• To calculate p1
n1 =139.9
Conclude that 140 cases and 140 controls
are required for the study.
Sample size calculation using Epi Info
Using Stata for sample size calculation
• Open Stata
• Open the Statistics drop-down menu
• Click on Power and sample size
• The Power and sample size analysis window
appears
• Select the appropriate method; the methods are
organized by population parameter, outcome,
analysis type and sample
Thank you!!!