Types of Data, Descriptive Statistics, and Statistical Tests For Nominal Data
NONPARAMETRIC STATISTICS
I. DEFINITIONS
A. Parametric statistics
1. Variable of interest is a measured quantity.
2. Assumes that the data follow some distribution which can be described by specific parameters
a. Typically a normal distribution
3. Example: There are an infinite number of normal distributions, each of which is uniquely
defined by its mean and standard deviation (SD).
B. Nonparametric statistics
1. Variable of interest is not a measured quantity. Mean and SD have little meaning.
2. Does not make any assumptions about the distribution of the data
3. "Distribution-free" statistics
C. Dependent variable
1. The variable of interest, the outcome of which is dependent on something else
D. Independent variable
1. The variable that is being tested for an effect on the dependent variable
E. Example
1. Does high-dose ciprofloxacin lead to seizures?
a. Seizures = dependent variable
b. Dose = independent variable
II. NONPARAMETRIC STATISTICS
A. Developed primarily to deal with categorical data (non-continuous data)
1. Example: disease vs no disease; dead vs alive
B. Nonparametric statistical tests may be used on continuous data sets.
1. Removes the requirement to assume a normal distribution
2. However, it also throws out some information, as continuous data contains information in the
way that variables are related.
Nonparametric test                              Purpose of test
Mann-Whitney U test; Wilcoxon rank sum test     Compares two independent samples
Wilcoxon matched pairs signed-rank test         Examines a set of differences
Spearman rank correlation coefficient           Assesses the linear association between two variables
Kruskal-Wallis analysis of variance by ranks    Compares three or more groups
Friedman two-way analysis of variance           Compares groups classified by two different factors
III.
A. Nonparametric pros
1. Nonparametric tests make less stringent demands of the data.
a. For a parametric test to be valid, certain underlying assumptions must be met.
i. example: For an unpaired t test, assume that: data are drawn from a normal distribution;
every observation is independent of the others; and the SDs of the two populations are
equal. Data are continuous.
b. Nonparametric tests do not require these assumptions.
i. can be used to evaluate data that are not continuous
ii. no assumption that the data follow a particular distribution (e.g., normal with equal SDs)
B. Nonparametric cons
1. If used on a continuous data set, nonparametric tests throw away information inherent in
continuous data.
2. Reduces power to detect a statistical difference
a. A more conservative approach
3. Example: For data from a normally distributed population, if the Wilcoxon signed-rank test
requires 1000 observations to demonstrate statistical significance, a t test will require only
955.
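The 1000-versus-955 figure is consistent with the asymptotic relative efficiency of the Wilcoxon signed-rank test relative to the t test for normally distributed data, which is 3/pi (about 0.955). A minimal sketch of that arithmetic follows; the function name is illustrative and not from the handout.

```python
import math

# Asymptotic relative efficiency (ARE) of the Wilcoxon signed-rank test
# relative to the t test when the data are truly normal: 3 / pi.
ARE_WILCOXON_VS_T = 3 / math.pi


def t_test_equivalent_n(n_wilcoxon: int) -> int:
    """Approximate number of observations a t test needs to match the power
    of a Wilcoxon signed-rank test run on n_wilcoxon observations."""
    return round(n_wilcoxon * ARE_WILCOXON_VS_T)


print(round(ARE_WILCOXON_VS_T, 3))   # 0.955
print(t_test_equivalent_n(1000))     # 955
```

The loss of power is small when the data really are normal, which is why the nonparametric test is often described as only slightly conservative in that setting.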
IV. CONTINGENCY TABLES
A. Contingency tables are used to examine the relationship between subjects' scores on two qualitative
or categorical variables.
B. One variable determines the row categories; the other variable defines the column categories.
C. Example: In studying the association between smoking and disease, the row categories in the
figure below denote the categories of smoking status while the columns denote the presence or
absence of disease.
(A)
                  Disease
                  Yes     No
Smoke    Yes       13     37
         No         6    144

(B)
                  Disease
                  Yes     No     Total
Smoke    Yes      26%    74%     100%
         No        4%    96%     100%
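Panel B follows from panel A by dividing each count by its row total. A small sketch of that conversion, using the smoking counts above (the dictionary layout and function name are illustrative choices, not part of the handout):

```python
# Counts from the smoking example: rows = smoking status, columns = disease status.
counts = {
    "Smoke: Yes": {"Disease: Yes": 13, "Disease: No": 37},
    "Smoke: No": {"Disease: Yes": 6, "Disease: No": 144},
}


def row_percentages(table):
    """Convert each row of counts into percentages of that row's total."""
    result = {}
    for row_label, row in table.items():
        row_total = sum(row.values())
        result[row_label] = {col: 100 * n / row_total for col, n in row.items()}
    return result


percents = row_percentages(counts)
for row_label, row in percents.items():
    print(row_label, {col: f"{p:.0f}%" for col, p in row.items()})
```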
V. CHI-SQUARED TEST
A. Commonly used procedure, uses contingency tables
B. Used to evaluate unpaired samples (unrelated groups)
C. Often used to evaluate proportions
D. Is there a difference in the proportion of viral infections in patients administered a
vaccine? (12/100 vs. 2/100)
E. Assumes nominal data (no ordering between variable groups)
F. Limited when the number of subjects in any "cell" is low (rule of thumb, <5)
G. General logic
1. Given two groups (vaccine vs control), the EXPECTED infection rate if the vaccine has no
effect would be equal among the two groups. This is the null hypothesis. The chi-squared test
compares the EXPECTED frequency of a particular event to the OBSERVED frequency in the
population of interest.
H. Formulas
$$ \chi^2 = \sum \frac{(O - E)^2}{E} $$
$$ E_{ij} = \frac{T_i \times T_j}{N} $$
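The two formulas can be sketched in a few lines, taking O as the observed count in a cell, E as its expected count, T_i and T_j as the row and column totals, and N as the grand total. The example applies them to the 12/100 vs. 2/100 infection counts mentioned above; the function names are illustrative, and the handout does not say which of the two groups received the vaccine.

```python
def expected_counts(observed):
    """E_ij = (row total i) * (column total j) / N for a table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_squared(observed):
    """Pearson chi-squared statistic: sum over all cells of (O - E)^2 / E."""
    expected = expected_counts(observed)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )


# Rows: the two groups; columns: infected, not infected (12/100 vs. 2/100).
vaccine_table = [[12, 88], [2, 98]]

expected = expected_counts(vaccine_table)   # every expected count is >= 5 here
statistic = chi_squared(vaccine_table)
df = (len(vaccine_table) - 1) * (len(vaccine_table[0]) - 1)
print(expected, round(statistic, 2), df)
```

With 1 degree of freedom, the statistic is then compared against a tabulated critical value (3.8415 at the 0.05 level).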
I. Distribution
[Figure: Chi-Square distribution]
VI.
   a          b          (a + b)
   c          d          (c + d)
   (a + c)    (b + d)    N

$$ p(\text{outcome}) = \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{N!\,a!\,b!\,c!\,d!} $$
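The probability of one particular 2 x 2 table with cells a, b, c, d and fixed margins can be computed directly from factorials. This has the form of the hypergeometric probability used in Fisher's exact test; the sketch below computes the probability of a single table only, and the example counts are hypothetical values chosen to keep the arithmetic easy to verify.

```python
from math import comb, factorial


def table_probability(a, b, c, d):
    """Probability of one specific 2x2 table given fixed row and column totals."""
    n = a + b + c + d
    numerator = (factorial(a + b) * factorial(c + d)
                 * factorial(a + c) * factorial(b + d))
    denominator = (factorial(n) * factorial(a) * factorial(b)
                   * factorial(c) * factorial(d))
    return numerator / denominator


# Hypothetical small table, chosen only to keep the arithmetic easy to check.
p = table_probability(3, 1, 1, 3)
print(round(p, 4))   # 0.2286

# Equivalent hypergeometric form, as a cross-check on the factorial formula.
print(round(comb(4, 3) * comb(4, 1) / comb(8, 4), 4))   # 0.2286
```

A full exact test would sum this probability over the observed table and every table at least as extreme with the same margins.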
IX.
KRUSKAL-WALLIS TEST
$$ r_s = 1 - \frac{6 \sum d^2}{n^3 - n} $$
B.
Height    Rank    Weight    Rank      d
  31       1       7.7       2       -1
  32       2       8.3       3       -1
  33       3       7.6       1        2
  34       4       9.1       4        0
  35       5.5     9.6       5        0.5
  35       5.5     9.9       6       -0.5
$$ r_s = 1 - \frac{6(6.5)}{6^3 - 6} = 0.81 $$
For statistical significance, look up critical values in a table or obtain them from a software
package.
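The height and weight example can be reproduced with a short sketch of the rank-difference formula, r_s = 1 - 6(sum of d^2)/(n^3 - n), which is the Spearman rank correlation coefficient. The helper names are illustrative, and tied values are given the average of the ranks they span, as in the table.

```python
def average_ranks(values):
    """Rank values from smallest (1) to largest, giving ties their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1   # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman_rs(x, y):
    """r_s = 1 - 6 * sum(d^2) / (n^3 - n), with d the difference in ranks."""
    n = len(x)
    d = [rx - ry for rx, ry in zip(average_ranks(x), average_ranks(y))]
    return 1 - 6 * sum(di ** 2 for di in d) / (n ** 3 - n)


height = [31, 32, 33, 34, 35, 35]
weight = [7.7, 8.3, 7.6, 9.1, 9.6, 9.9]

print(average_ranks(height))                  # [1.0, 2.0, 3.0, 4.0, 5.5, 5.5]
print(round(spearman_rs(height, weight), 3))  # 0.814
```

The rank-difference formula is exact only when there are no ties; with ties, as in the two heights of 35, it is an approximation to the correlation of the ranks.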
Example Problem 1: Association between tryptophan dietary supplements and eosinophilia-myalgia
syndrome (EMS). A number of subjects from a particular area are evaluated; 80 patients with EMS
were identified, along with 200 matched controls. Is there a statistically significant
association between tryptophan use and EMS?
Observed results:
                         EMS    No EMS    Total
Tryptophan use   Yes      42       34       76
                 No       38      166      204
Total                     80      200      280
(42 of 76 patients taking tryptophan had EMS, compared to 38 of 204 not taking tryptophan)
Expected values if no association exists (null hypothesis):
                         EMS    No EMS    Total
Tryptophan use   Yes     21.7     54.3      76
                 No      58.3    145.7     204
Total                    80      200       280
The rate of EMS in the overall population, assuming no effect, would be 80/280 (28.6%)
(0.286 x 76 = 21.7; 0.286 x 204 = 58.3). The No EMS cells can then be calculated by subtracting
from the row total (ex: 76 - 21.7 = 54.3).
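$$ E_{11} = \frac{76 \times 80}{280} \qquad E_{12} = \frac{76 \times 200}{280} $$
$$ E_{21} = \frac{204 \times 80}{280} \qquad E_{22} = \frac{204 \times 200}{280} $$
$$ \chi^2 = \sum \frac{(O - E)^2}{E} $$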
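Carrying the sum through for the tryptophan data gives the following worked value. This is a sketch of the arithmetic using the unrounded expected counts (for example, 76 x 80 / 280 = 21.71), so the result differs slightly from what the one-decimal table entries would give.

```latex
\chi^2 = \frac{(42 - 21.71)^2}{21.71} + \frac{(34 - 54.29)^2}{54.29}
       + \frac{(38 - 58.29)^2}{58.29} + \frac{(166 - 145.71)^2}{145.71}
       \approx 36.4,
\qquad df = (2 - 1)(2 - 1) = 1
```

With 1 degree of freedom the critical value at the 0.05 level is 3.8415, so a statistic of about 36.4 leads to rejecting the null hypothesis of no association.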
Chi-Square Distribution

        Significance Level
df      0.10       0.05       0.025      0.01       0.005
1       2.7055     3.8415     5.0239     6.6349     7.8794
2       4.6052     5.9915     7.3778     9.2104     10.5965
3       6.2514     7.8147     9.3484     11.3449    12.8381
4       7.7794     9.4877     11.1433    13.2767    14.8602
5       9.2363     11.0705    12.8325    15.0863    16.7496
6       10.6446    12.5916    14.4494    16.8119    18.5475
7       12.017     14.0671    16.0128    18.4753    20.2777
8       13.3616    15.5073    17.5345    20.0902    21.9549
9       14.6837    16.919     19.0228    21.666     23.5893
10      15.9872    18.307     20.4832    23.2093    25.1881
11      17.275     19.6752    21.92      24.725     26.7569
12      18.5493    21.0261    23.3367    26.217     28.2997
13      19.8119    22.362     24.7356    27.6882    29.8193
14      21.0641    23.6848    26.1189    29.1412    31.3194
15      22.3071    24.9958    27.4884    30.578     32.8015
16      23.5418    26.2962    28.8453    31.9999    34.2671
17      24.769     27.5871    30.191     33.4087    35.7184
18      25.9894    28.8693    31.5264    34.8052    37.1564
19      27.2036    30.1435    32.8523    36.1908    38.5821
20      28.412     31.4104    34.1696    37.5663    39.9969
21      29.6151    32.6706    35.4789    38.9322    41.4009
22      30.8133    33.9245    36.7807    40.2894    42.7957
23      32.0069    35.1725    38.0756    41.6383    44.1814
24      33.1962    36.415     39.3641    42.9798    45.5584
25      34.3816    37.6525    40.6465    44.314     46.928
26      35.5632    38.8851    41.9231    45.6416    48.2898
27      36.7412    40.1133    43.1945    46.9628    49.645
28      37.9159    41.3372    44.4608    48.2782    50.9936
29      39.0875    42.5569    45.7223    49.5878    52.3355
30      40.256     43.773     46.9792    50.8922    53.6719
31      41.4217    44.9853    48.2319    52.1914    55.0025
32      42.5847    46.1942    49.4804    53.4857    56.328
33      43.7452    47.3999    50.7251    54.7754    57.6483
34      44.9032    48.6024    51.966     56.0609    58.9637
35      46.0588    49.8018    53.2033    57.342     60.2746
36      47.2122    50.9985    54.4373    58.6192    61.5811
37      48.3634    52.1923    55.668     59.8926    62.8832
38      49.5126    53.3835    56.8955    61.162     64.1812
39      50.6598    54.5722    58.1201    62.4281    65.4753
40      51.805     55.7585    59.3417    63.6908    66.766
41      52.9485    56.9424    60.5606    64.95      68.0526
42      54.0902    58.124     61.7767    66.2063    69.336
43      55.2302    59.3035    62.9903    67.4593    70.6157
44      56.3685    60.4809    64.2014    68.7096    71.8923
45      57.5053    61.6562    65.4101    69.9569    73.166
46      58.6405    62.8296    66.6165    71.2015    74.4367
47      59.7743    64.0011    67.8206    72.4432    75.7039
48      60.9066    65.1708    69.0226    73.6826    76.9689
49      62.0375    66.3387    70.2224    74.9194    78.2306
50      63.1671    67.5048    71.4202    76.1538    79.4898
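Entries in a chi-square table can be spot-checked for 1 and 2 degrees of freedom, where the upper-tail probability has a simple closed form. The sketch below uses only the standard library; the function name is illustrative, and other degrees of freedom would need a statistics package.

```python
import math


def chi2_upper_tail(x, df):
    """P(chi-squared with df degrees of freedom > x), closed forms for df = 1, 2."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2))
    if df == 2:
        return math.exp(-x / 2)
    raise ValueError("this sketch only covers df = 1 and df = 2")


# Critical values for df = 1 and df = 2 at the stated significance levels.
for df, critical, alpha in [(1, 3.8415, 0.05), (1, 6.6349, 0.01), (2, 5.9915, 0.05)]:
    print(df, critical, alpha, round(chi2_upper_tail(critical, df), 4))
```

Each computed tail probability should come back equal to the significance level heading the column, to within rounding of the tabulated value.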
Example Problem 2:
A sociological study evaluated the characteristics of marriage by religion; 256 people were
surveyed for religion and marital status. The results were as follows:
              Never   Married   Divorced   Separated   Total
Jewish           8       11         3          1         23
Catholic        29       75        21          8        133
Protestant      16       21         6          3         46
None            20       19        13          0         52
Other            0        1         0          1          2
Total           73      127        43         13        256
Chi-squared output:
WARNING: [...] of fitted cells are sparse (frequency < 5).
[...] tests computed on this table are suspect.

Test statistic           Value       df      Prob
Pearson chi-squared     22.718    12.000    0.030
What happened??
              Never   Married   Divorced   Total
Catholic        29       75        21       125
Protestant      16       21         6        43
Jewish           8       11         3        22
None            20       19        13        52
Total           73      126        43       242

Test statistic           Value       df      Prob
Pearson chi-squared     10.368     6.000    0.110
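The change between the two outputs can be reproduced by computing the Pearson chi-squared statistic on the full table and again after dropping the "Other" row and "Separated" column, counting in each case how many expected frequencies fall below 5. This is a sketch under the assumption that the software's "fitted cells" are the expected counts; the function name is illustrative.

```python
def chi_squared_summary(observed):
    """Return (statistic, df, number of cells with expected count < 5)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    statistic = 0.0
    sparse_cells = 0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n
            statistic += (o - e) ** 2 / e
            sparse_cells += e < 5
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, df, sparse_cells


# Rows: Jewish, Catholic, Protestant, None, Other.
# Columns: Never, Married, Divorced, Separated.
full_table = [
    [8, 11, 3, 1],
    [29, 75, 21, 8],
    [16, 21, 6, 3],
    [20, 19, 13, 0],
    [0, 1, 0, 1],
]
# Same data with the "Other" row and "Separated" column dropped.
reduced_table = [row[:3] for row in full_table[:4]]

full_stat, full_df, full_sparse = chi_squared_summary(full_table)
reduced_stat, reduced_df, reduced_sparse = chi_squared_summary(reduced_table)
print(round(full_stat, 3), full_df, full_sparse)
print(round(reduced_stat, 3), reduced_df, reduced_sparse)
```

Much of the statistic in the full table comes from a handful of cells with very small expected counts, which is why the apparent significance disappears once those categories are removed.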
BEFORE$ (rows) by AFTER$ (columns)
              for    unsure   against    Total
for            51      22       28        101
unsure         46      18       27         91
against        52      49       57        158
Total         149      89      112        350

Percents of total count
BEFORE$ (rows) by AFTER$ (columns)
              for      unsure    against     Total       N
for         14.571     6.286      8.000     28.857     101
unsure      13.143     5.143      7.714     26.000      91
against     14.857    14.000     16.286     45.143     158
Total       42.571    25.429     32.000    100.000
N              149        89        112                350

Test statistic                    Value       df      Prob
Pearson chi-squared              11.473     4.000    0.022
McNemar Symmetry chi-squared     22.039     3.000    0.000
The McNemar test of symmetry focuses on the counts in the off-diagonal cells (those along the
diagonal are not used in the computations). We are investigating the direction of change in
opinion. First, how many respondents became more negative about NAFTA?
Among those who initially responded For, 22 (6.29%) are now Unsure and 28 (8%) are now
Against. Among those who were Unsure before the debate, 27 (7.71%) answered Against
afterwards. The three cells in the upper right contain counts for those who became more
unfavorable and comprise 22% (6.29 + 8.00 + 7.71) of the sample. The three cells in the lower
left contain counts for people who became more positive about NAFTA (46, 52, and 49), or 42%
of the sample.
The null hypothesis for the McNemar test is that the changes in opinion in each direction are
equal. The chi-squared statistic for this test is 22.039 with 3 df and p<0.0005. You reject the
null hypothesis.
The pro-NAFTA shift in opinion is significantly greater than the anti-NAFTA shift.
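The handout does not state the formula behind the symmetry statistic. A common generalization of McNemar's test to larger square tables (often called Bowker's test of symmetry) compares each off-diagonal cell with its mirror image, and it reproduces the reported 22.039 with 3 df, so the sketch below assumes that formula. The function and variable names are illustrative.

```python
def symmetry_chi_squared(table):
    """Test of symmetry for a square table: sum of (n_ij - n_ji)^2 / (n_ij + n_ji)
    over every pair of off-diagonal cells; df = number of such pairs."""
    k = len(table)
    statistic = 0.0
    df = 0
    for i in range(k):
        for j in range(i + 1, k):
            pair_total = table[i][j] + table[j][i]
            if pair_total > 0:   # pairs with no observations contribute nothing
                statistic += (table[i][j] - table[j][i]) ** 2 / pair_total
                df += 1
    return statistic, df


# Rows: opinion before (for, unsure, against); columns: opinion after.
before_after = [
    [51, 22, 28],
    [46, 18, 27],
    [52, 49, 57],
]

statistic, df = symmetry_chi_squared(before_after)
more_negative = before_after[0][1] + before_after[0][2] + before_after[1][2]
more_positive = before_after[1][0] + before_after[2][0] + before_after[2][1]
print(round(statistic, 3), df, more_negative, more_positive)   # 22.039 3 77 147
```

The diagonal cells (51, 18, 57) never enter the calculation, which matches the description of the test as using only respondents whose opinion changed.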
           Daily UOP                           Rank of        Signed rank
Subject    No drug    + Drug    Difference    difference     of difference
1           1600       1490       -110            5              -5
2           1850       1300       -550            6              -6
3           1300       1400       +100            4              +4
4           1500       1410        -90            3              -3
5           1400       1350        -50            2              -2
6           1010       1000        -10            1              -1
n     Critical Value     P
5          15          .062
6          21          .032
           19          .062
7          28          .016
           24          .046
8          32          .024
           28          .054
9          39          .020
           33          .054
10         45          .020
           39          .048
11         52          .018
           44          .054
12         58          .020
           50          .052
13         65          .022
           57          .048
14         73          .020
           63          .050
15         80          .022
           70          .048
*Because W can take only discrete values, p values that fall exactly on traditional breakpoints
(ex.: p=0.05) are usually not possible.
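The signed-rank calculation for the urine output example, and the reason the tabulated p values land on odd-looking numbers, can be sketched as follows. The code assumes no zero differences and no ties among the absolute differences (true for these six subjects), and computes the exact two-sided p value by enumerating all 2^n equally likely sign patterns. Function names are illustrative.

```python
from itertools import product


def signed_rank_sum(before, after):
    """W = sum of the ranks of |difference|, each carrying the sign of its
    difference. Assumes no zero differences and no tied |differences|."""
    diffs = [a - b for b, a in zip(before, after)]
    by_size = sorted(diffs, key=abs)
    return sum((rank if d > 0 else -rank) for rank, d in enumerate(by_size, start=1))


def exact_two_sided_p(w, n):
    """P(|W| >= |w|) when every one of the 2^n sign patterns is equally likely."""
    hits = 0
    for signs in product((-1, 1), repeat=n):
        total = sum(sign * rank for sign, rank in zip(signs, range(1, n + 1)))
        hits += abs(total) >= abs(w)
    return hits / 2 ** n


no_drug = [1600, 1850, 1300, 1500, 1400, 1010]
with_drug = [1490, 1300, 1400, 1410, 1350, 1000]

w = signed_rank_sum(no_drug, with_drug)
print(w, exact_two_sided_p(w, len(no_drug)))   # -13 0.21875
print(exact_two_sided_p(19, 6))                # 0.0625
```

For six subjects, |W| = 13 is well short of the critical value of 19, so the example does not show a statistically significant change in urine output. With only 64 possible sign patterns, attainable p values move in steps of 1/32, which is why exactly 0.05 cannot be reached.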