
Types of Data, Descriptive Statistics, and Statistical Tests For Nominal Data

The document discusses nonparametric statistics and statistical tests for nominal data. It defines key terms like parametric vs nonparametric statistics and dependent and independent variables. It then covers various nonparametric tests like the Mann-Whitney U test, Wilcoxon signed-rank test, Kruskal-Wallis test, chi-squared test, Fisher's exact test and McNemar's test. It provides examples of when to use each test and how to interpret the results. In particular, it demonstrates how to use the chi-squared test to analyze a contingency table examining the association between tryptophan supplements and eosinophilia-myalgia syndrome.

Uploaded by

Wong Wei Hong
Copyright
© © All Rights Reserved

Types of Data, Descriptive Statistics, and Statistical Tests for Nominal Data

Patrick F. Smith, Pharm.D.


University at Buffalo
Buffalo, New York

NONPARAMETRIC STATISTICS

I. DEFINITIONS
A. Parametric statistics
1. Variable of interest is a measured quantity.
2. Assumes that the data follow some distribution which can be described by specific parameters
a. Typically a normal distribution
3. Example: There are an infinite number of normal distributions, all of which can be uniquely
defined by a mean and standard deviation (SD).
B. Nonparametric statistics
1. Variable of interest is not a measured quantity. Mean and SD have little meaning.
2. Does not make any assumptions about the distribution of the data
3. "Distribution-free" statistics
C. Dependent variable
1. The variable of interest, the outcome of which is dependent on something else
D. Independent variable
1. The variable that is being tested for an effect on the dependent variable
E. Example
1. Does high-dose ciprofloxacin lead to seizures?
a. Seizures = dependent variable
b. Dose = independent variable

II. NONPARAMETRIC STATISTICS
A. Developed primarily to deal with categorical data (non-continuous data)
1. Example: disease vs no disease; dead vs alive
B. Nonparametric statistical tests may be used on continuous data sets.
1. Removes the requirement to assume a normal distribution
2. However, it also throws out some information, as continuous data contains information in the
way that variables are related.

Some Commonly Used Statistical Tests

Normal theory-based test                 Corresponding nonparametric test               Purpose of test
t test for independent samples           Mann-Whitney U test; Wilcoxon rank sum test    Compares two independent samples
Paired t test                            Wilcoxon matched pairs signed-rank test        Examines a set of differences
Pearson correlation coefficient          Spearman rank correlation coefficient          Assesses the linear association between two variables
One-way analysis of variance (F test)    Kruskal-Wallis analysis of variance by ranks   Compares three or more groups
Two-way analysis of variance             Friedman two-way analysis of variance          Compares groups classified by two different factors

III. NONPARAMETRIC PROS AND CONS

A. Nonparametric pros
1. Nonparametric tests make less stringent demands of the data.
a. For a parametric test to be valid, certain underlying assumptions must be met.
i. example: For a paired t test, assume that: data are drawn from a normal distribution;
observations are independent of each other, and the SDs of the two populations are
equal. Data are continuous.
b. Nonparametric tests do not require these assumptions.
i. can be used to evaluate data that are not continuous
ii. no assumptions about distributions, independence, etc.
B. Nonparametric cons
1. If used on a continuous data set, nonparametric tests throw away information inherent in
continuous data.
2. Reduces power to detect a statistical difference
a. A more conservative approach
3. Example: For data from a normally distributed population, if the Wilcoxon signed-rank test
requires 1000 observations to demonstrate statistical significance, a t test will only
require 955.
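The 955 figure is consistent with the standard asymptotic relative efficiency of the Wilcoxon signed-rank test relative to the t test under normality, 3/pi. The handout does not state that formula, so the sketch below treats it as an assumption about where the number comes from.

```python
import math

# Asymptotic relative efficiency (ARE) of the Wilcoxon signed-rank test
# relative to the t test when the data are truly normal: 3 / pi.
# (Assumed source of the handout's figure; the handout gives only 1000 vs 955.)
are_wilcoxon_vs_t = 3 / math.pi

# Observations a t test would need to match a Wilcoxon test run on 1000.
n_wilcoxon = 1000
n_t_test = n_wilcoxon * are_wilcoxon_vs_t

print(round(are_wilcoxon_vs_t, 3))  # 0.955
print(round(n_t_test))              # 955
```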
IV. CONTINGENCY TABLES

A. Contingency tables are used to examine the relationship between subjects' scores on two qualitative
or categorical variables.
B. One variable determines the row categories; the other variable defines the column categories.
C. Example: In studying the association between smoking and disease, the row categories in the
figure below denote the categories of smoking status while the columns denote the presence or
absence of disease.

A. Counts

              Disease
Smoke       Yes      No
Yes          13      37
No            6     144

B. Row percentages

              Disease
Smoke       Yes      No     Total
Yes         26%     74%     100%
No           4%     96%     100%

V. CHI-SQUARED TEST
A. Commonly used procedure, uses contingency tables
B. Used to evaluate unpaired samples (unrelated groups)
C. Often used to evaluate proportions
D. Is there a difference in the proportion of viral infections in patients administered a
vaccine? (12/100 vs. 2/100)
E. Assumes nominal data (no ordering between variable groups)

F. Limited when the numbers of subjects in any "cell" is low (rule of thumb, <5)
G. General logic
1. Given two groups (vaccine vs control), the EXPECTED infection rate if the vaccine has no
effect would be equal among the two groups. This is the null hypothesis. The chi-squared test
compares the EXPECTED frequency of a particular event to the OBSERVED frequency in the
population of interest.
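H. Formulas

$$\chi^2 = \sum \frac{(O - E)^2}{E}, \qquad df = (r - 1)(c - 1)$$

Expected frequencies (E) for each cell:

$$E_{ij} = \frac{T_i \times T_j}{N}$$

where O is the observed frequency, T_i is the total for row i, T_j is the total for column j, N is the grand total, and r and c are the numbers of rows and columns.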
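As a minimal sketch of these formulas, the code below computes the expected frequencies and the chi-squared statistic for the vaccine example (12/100 vs. 2/100). The function names and the table layout (one row per group, columns for infected and not infected) are illustrative assumptions, not part of the handout.

```python
def expected_counts(table):
    """Expected count per cell: row total x column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_squared(table):
    """Return (chi-squared statistic, degrees of freedom) for an r x c table."""
    expected = expected_counts(table)
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


# Rows: the two groups; columns: infected, not infected.
vaccine_table = [[12, 88], [2, 98]]
stat, df = chi_squared(vaccine_table)
print(round(stat, 2), df)  # 7.68 1
```

With 1 degree of freedom the statistic exceeds the 0.05 critical value of 3.84.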

I. Distribution

[Figure: Chi-square distribution]

Chi-squared, by strict definition, is not a true nonparametric test. It assumes a
distribution that can be described by a single parameter, degrees of freedom.
J. Chi-squared example problems (refer to Example Problem handout)


VI. FISHER'S EXACT TEST

A. Alternative to chi-squared for 2 x 2 contingency tables
1. Improves accuracy when expected frequencies are small (< 5) or sample size is small (n = 20)
2. Calculates exact probabilities

       a          b        (a + b)
       c          d        (c + d)
    (a + c)    (b + d)        N

$$p(\text{outcome}) = \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{N!\,a!\,b!\,c!\,d!}$$
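The factorial formula can be sketched in a few lines. The 2 x 2 counts used here are hypothetical, and the two-sided p value (summing every table with the same margins that is no more probable than the observed one) is one common convention that the handout does not spell out.

```python
from math import factorial


def table_probability(a, b, c, d):
    """Exact probability of one 2 x 2 table with fixed margins."""
    n = a + b + c + d
    numerator = (factorial(a + b) * factorial(c + d)
                 * factorial(a + c) * factorial(b + d))
    denominator = (factorial(n) * factorial(a) * factorial(b)
                   * factorial(c) * factorial(d))
    return numerator / denominator


def fisher_two_sided(a, b, c, d):
    """Sum the probabilities of all tables no more likely than the observed."""
    row1, row2, col1 = a + b, c + d, a + c
    p_observed = table_probability(a, b, c, d)
    p_value = 0.0
    for x in range(max(0, col1 - row2), min(row1, col1) + 1):
        p = table_probability(x, row1 - x, col1 - x, row2 - (col1 - x))
        if p <= p_observed * (1 + 1e-9):  # small tolerance for float ties
            p_value += p
    return p_value


# Hypothetical counts: a=3, b=1, c=1, d=3.
print(round(table_probability(3, 1, 1, 3), 4))  # 0.2286
print(round(fisher_two_sided(3, 1, 1, 3), 4))   # 0.4857
```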

VII. MCNEMAR'S TEST OF SYMMETRY

A. Chi-squared test requires samples to be independent of each other.


B. McNemar's test is used when samples are related (similar to paired t test).
C. There are often times when measures may be repeated.

D. Example. Does drug X cause insomnia?


1. Patients may be questioned about insomnia before and after starting the drug.
2. The researcher asks the question, "Do more patients have insomnia since starting the drug?"
E. Refer to Example Problems handout
VIII. KRUSKAL-WALLIS TEST

A. Compares three or more independent samples


B. Values of a variable are transformed to ranks.
1. Tests that there is no shift in the center of the groups (that is, the centers do not differ)
C. If there are only two groups, the procedure reduces to the Mann-Whitney test, the analogue of the
unpaired t test.
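The rank transformation can be illustrated with a short sketch. The handout gives no formula or data for this test, so both are assumptions here: the statistic is the standard Kruskal-Wallis H without a tie correction, and the three groups are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """H = 12 / (N (N + 1)) * sum(R_i^2 / n_i) - 3 (N + 1), no tie correction."""
    pooled = [x for group in groups for x in group]
    ranks = average_ranks(pooled)
    n = len(pooled)
    total, start = 0.0, 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        total += rank_sum ** 2 / len(group)
        start += len(group)
    return 12 / (n * (n + 1)) * total - 3 * (n + 1)


# Hypothetical data: three groups whose centers are clearly shifted.
groups = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(round(kruskal_wallis_h(groups), 2))  # 7.2
```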

IX. WILCOXON SIGNED-RANK TEST


A. Nonparametric analogue of the paired t test
B. Compares the rank values of variables pair-by-pair
1. The sum of the ranks associated with positive and negative differences is computed.
2. The test statistic is the lesser of the two sums of ranks.
C. Refer to Example Problems handout


X. SPEARMAN RANK CORRELATION COEFFICIENT

A. Nonparametric analogue of linear regression and the correlation coefficient (r)

$$r_s = 1 - \frac{6 \sum d^2}{n^3 - n}$$

d = difference of ranks at each point

B. Example

Height    Rank    Weight    Rank      d
31        1       7.7       2        -1
32        2       8.3       3        -1
33        3       7.6       1         2
34        4       9.1       4         0
35        5.5     9.6       5         0.5
35        5.5     9.9       6        -0.5

$$r_s = 1 - \frac{6\left[(-1)^2 + (-1)^2 + 2^2 + 0^2 + 0.5^2 + (-0.5)^2\right]}{6^3 - 6} = 0.81$$

For statistical significance, can look up critical values from table or obtain from software
package.
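The height and weight example can be checked with a short sketch that ranks each variable (tied values share the mean rank) and applies the formula for r_s. The function names are illustrative; note that with ties the shortcut formula is only approximate.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rs(x, y):
    """r_s = 1 - 6 * sum(d^2) / (n^3 - n), with d the rank difference."""
    rank_x, rank_y = average_ranks(x), average_ranks(y)
    n = len(x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n ** 3 - n)


height = [31, 32, 33, 34, 35, 35]
weight = [7.7, 8.3, 7.6, 9.1, 9.6, 9.9]
print(average_ranks(height))                  # [1.0, 2.0, 3.0, 4.0, 5.5, 5.5]
print(round(spearman_rs(height, weight), 2))  # 0.81
```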

Example Problem 1: Association between tryptophan dietary supplements and eosinophilia-myalgia
syndrome (EMS). A number of subjects from a particular area are evaluated; 80
patients with EMS were identified, along with 200 matched controls. Is there a statistically
significant association between tryptophan use and EMS?

Unrelated groups, categorical (yes/no) data - chi-squared is appropriate

Observed results:

Tryptophan use     EMS    No EMS    Total
Yes                 42        34       76
No                  38       166      204
Total               80       200      280

(42 of 76 patients taking tryptophan had EMS, compared to 38 of 204 not taking tryptophan)
Expected values if no association exists (null hypothesis):

Tryptophan use     EMS    No EMS    Total
Yes               21.7      54.3       76
No                58.3     145.7      204
Total               80       200      280

The rate of EMS in the overall population, assuming no effect, would be 80/280 (28.6%), so the
expected EMS counts are 0.286 x 76 = 21.7 and 0.286 x 204 = 58.3. The No EMS cells can then be
calculated by subtracting each expected EMS count from its row total (ex: 76 - 21.7 = 54.3).
$$E_{11} = \frac{76 \times 80}{280} \qquad E_{21} = \frac{204 \times 80}{280} \qquad E_{12} = \frac{76 \times 200}{280} \qquad E_{22} = \frac{204 \times 200}{280}$$

To evaluate significance, one needs a mean and a measure of dispersion (ex.: standard deviation,
standard error, variance, etc.). The chi-squared test is based on a Poisson distribution (where
mean = variance); therefore, the chi-squared test assumes that the variance is equal to the expected
mean value.

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

Therefore, in this example:

$$\chi^2 = \frac{(42 - 21.7)^2}{21.7} + \frac{(34 - 54.3)^2}{54.3} + \frac{(38 - 58.3)^2}{58.3} + \frac{(166 - 145.7)^2}{145.7} = 36.4$$

Look up the result in a chi-squared table (a 2 x 2 contingency table has 1 degree of
freedom). To be significant at the 0.05 level, $\chi^2$ must be > 3.84. Since 36.4 > 3.84, the
result is highly significant.
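The arithmetic for this example can be reproduced cell by cell. This is a sketch that mirrors the hand calculation rather than a general routine; the variable names are illustrative.

```python
# Observed counts: rows = tryptophan use (yes, no); columns = (EMS, no EMS).
observed = [[42, 34], [38, 166]]

row_totals = [sum(row) for row in observed]        # [76, 204]
col_totals = [sum(col) for col in zip(*observed)]  # [80, 200]
n = sum(row_totals)                                # 280

# Expected count for each cell under the null hypothesis of no association.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Chi-squared statistic: sum of (O - E)^2 / E over the four cells.
chi2 = sum(
    (observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
    for i in range(2)
    for j in range(2)
)

print([[round(e, 1) for e in row] for row in expected])
# [[21.7, 54.3], [58.3, 145.7]]
print(round(chi2, 1))  # 36.4
print(chi2 > 3.84)     # True: significant at the 0.05 level with 1 df
```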


Critical Values for the Chi-Squared Distribution

                       Significance Level
df      0.10       0.05       0.025      0.01       0.005
1       2.7055     3.8415     5.0239     6.6349     7.8794
2       4.6052     5.9915     7.3778     9.2104     10.5965
3       6.2514     7.8147     9.3484     11.3449    12.8381
4       7.7794     9.4877     11.1433    13.2767    14.8602
5       9.2363     11.0705    12.8325    15.0863    16.7496
6       10.6446    12.5916    14.4494    16.8119    18.5475
7       12.017     14.0671    16.0128    18.4753    20.2777
8       13.3616    15.5073    17.5345    20.0902    21.9549
9       14.6837    16.919     19.0228    21.666     23.5893
10      15.9872    18.307     20.4832    23.2093    25.1881
11      17.275     19.6752    21.92      24.725     26.7569
12      18.5493    21.0261    23.3367    26.217     28.2997
13      19.8119    22.362     24.7356    27.6882    29.8193
14      21.0641    23.6848    26.1189    29.1412    31.3194
15      22.3071    24.9958    27.4884    30.578     32.8015
16      23.5418    26.2962    28.8453    31.9999    34.2671
17      24.769     27.5871    30.191     33.4087    35.7184
18      25.9894    28.8693    31.5264    34.8052    37.1564
19      27.2036    30.1435    32.8523    36.1908    38.5821
20      28.412     31.4104    34.1696    37.5663    39.9969
21      29.6151    32.6706    35.4789    38.9322    41.4009
22      30.8133    33.9245    36.7807    40.2894    42.7957
23      32.0069    35.1725    38.0756    41.6383    44.1814
24      33.1962    36.415     39.3641    42.9798    45.5584
25      34.3816    37.6525    40.6465    44.314     46.928
26      35.5632    38.8851    41.9231    45.6416    48.2898
27      36.7412    40.1133    43.1945    46.9628    49.645
28      37.9159    41.3372    44.4608    48.2782    50.9936
29      39.0875    42.5569    45.7223    49.5878    52.3355
30      40.256     43.773     46.9792    50.8922    53.6719
31      41.4217    44.9853    48.2319    52.1914    55.0025
32      42.5847    46.1942    49.4804    53.4857    56.328
33      43.7452    47.3999    50.7251    54.7754    57.6483
34      44.9032    48.6024    51.966     56.0609    58.9637
35      46.0588    49.8018    53.2033    57.342     60.2746
36      47.2122    50.9985    54.4373    58.6192    61.5811
37      48.3634    52.1923    55.668     59.8926    62.8832
38      49.5126    53.3835    56.8955    61.162     64.1812
39      50.6598    54.5722    58.1201    62.4281    65.4753
40      51.805     55.7585    59.3417    63.6908    66.766
41      52.9485    56.9424    60.5606    64.95      68.0526
42      54.0902    58.124     61.7767    66.2063    69.336
43      55.2302    59.3035    62.9903    67.4593    70.6157
44      56.3685    60.4809    64.2014    68.7096    71.8923
45      57.5053    61.6562    65.4101    69.9569    73.166
46      58.6405    62.8296    66.6165    71.2015    74.4367
47      59.7743    64.0011    67.8206    72.4432    75.7039
48      60.9066    65.1708    69.0226    73.6826    76.9689
49      62.0375    66.3387    70.2224    74.9194    78.2306
50      63.1671    67.5048    71.4202    76.1538    79.4898
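The critical values for 1 and 2 degrees of freedom can be spot-checked with the Python standard library alone. This sketch relies on two standard facts that the handout does not state: a chi-squared variable with 1 df is the square of a standard normal, and with 2 df its upper-tail probability is exp(-x/2). Higher df need a statistics package.

```python
import math
from statistics import NormalDist


def chi2_critical_df1(alpha):
    """Critical value for 1 df: square of the two-sided normal quantile."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z ** 2


def chi2_critical_df2(alpha):
    """Critical value for 2 df: solve exp(-x / 2) = alpha for x."""
    return -2 * math.log(alpha)


for alpha in (0.10, 0.05, 0.025, 0.01, 0.005):
    print(alpha,
          round(chi2_critical_df1(alpha), 3),
          round(chi2_critical_df2(alpha), 3))
```

The printed values should agree with the tabulated entries for df = 1 and df = 2 to about three decimal places.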

Example Problem 2:
A sociological study evaluated the characteristics of marriage by religion; 256 people were
surveyed for religion and marital status. The results were as follows:

             Protestant   Catholic   Jewish   None   Other   Total
Never                29         16        8     20       0      73
Married              75         21       11     19       1     127
Divorced             21          6        3     13       0      43
Separated             8          3        1      0       1      13
Total               133         46       23     52       2     256

Is there a relationship between marital status and religion?


SYSTAT chi-squared output

WARNING: More than one-fifth of fitted cells are sparse (frequency < 5).
Significance tests computed on this table are suspect.

Test statistic           Value       df     Prob
Pearson chi-squared     22.718   12.000    0.030

What happened??

Omitting sparse cells: Leave out 'other' and 'separated':


             Protestant   Catholic   Jewish   None   Total
Never                29         16        8     20      73
Married              75         21       11     19     126
Divorced             21          6        3     13      43
Total               125         43       22     52     242

Test statistic           Value       df     Prob
Pearson chi-squared     10.368    6.000    0.110

There is no statistically significant difference between the groups (p = 0.11).
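The sparse-cell warning and the effect of dropping the sparse categories can be sketched as follows. The function name and the assignment of the two unlabeled count columns to Protestant and Catholic are assumptions; the statistic does not depend on which label each column carries.

```python
def chi_squared_report(table):
    """Return (statistic, df, number of cells with expected count < 5)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    sparse = sum(e < 5 for exp_row in expected for e in exp_row)
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df, sparse


# Rows: Never, Married, Divorced, Separated.
# Columns: Protestant, Catholic, Jewish, None, Other.
full = [
    [29, 16, 8, 20, 0],
    [75, 21, 11, 19, 1],
    [21, 6, 3, 13, 0],
    [8, 3, 1, 0, 1],
]
# Same table without the "Separated" row and the "Other" column.
reduced = [row[:4] for row in full[:3]]

for name, table in (("full", full), ("reduced", reduced)):
    stat, df, sparse = chi_squared_report(table)
    cells = len(table) * len(table[0])
    print(name, round(stat, 2), df, f"{sparse}/{cells} sparse cells")
```

In the full table 8 of the 20 expected counts fall below 5, well above one-fifth, which is what triggers the SYSTAT warning.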

Example Problem 3: McNemar Test of Symmetry


In November of 1993, the U.S. Congress approved the North American Free Trade Agreement
(NAFTA). Let's say that two months before the approval and before the televised debate
between Vice President Al Gore and businessman Ross Perot, political pollsters queried a sample
of 350 people, asking "Are you for, unsure, or against NAFTA?" Immediately after the debate,
the pollsters contacted the same people and asked the question a second time. Here are the
results:

BEFORE$ (rows) by AFTER$ (columns)

             for    unsure   against    Total
for           51        22        28      101
unsure        46        18        27       91
against       52        49        57      158
Total        149        89       112      350

Percents of total count
BEFORE$ (rows) by AFTER$ (columns)

             for     unsure    against      Total      N
for        14.571     6.286      8.000     28.857    101
unsure     13.143     5.143      7.714     26.000     91
against    14.857    14.000     16.286     45.143    158
Total      42.571    25.429     32.000    100.000
N             149        89        112                350

Test statistic                   Value      df     Prob
Pearson chi-squared             11.473   4.000    0.022
McNemar Symmetry chi-squared    22.039   3.000    0.000

The McNemar test of symmetry focuses on the counts in the off-diagonal cells (those along the
diagonal are not used in the computations). We are investigating the direction of change in
opinion. First, how many respondents became more negative about NAFTA?
Among those who initially responded For, 22 (6.29%) are now Unsure and 28 (8%) are now
Against. Among those who were Unsure before the debate, 27 (7.71%) answered Against
afterwards. The three cells in the upper right contain counts for those who became more
unfavorable and comprise 22% (6.29 + 8.00 + 7.71) of the sample. The three cells in the lower
left contain counts for people who became more positive about NAFTA (46, 52, and 49) or 42%
of the sample.
The null hypothesis for the McNemar test is that the changes in opinion are equal. The chi-squared
statistic for this test is 22.039 with 3 df and p<0.0005. You reject the null hypothesis.
The pro-NAFTA shift in opinion is significantly greater than the anti-NAFTA shift.
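The symmetry statistic can be reproduced from the off-diagonal counts. The handout reports only the value, so the formula here is an assumption: the standard generalized McNemar (Bowker) statistic, which sums (n_ij - n_ji)^2 / (n_ij + n_ji) over each pair of mirror-image cells.

```python
def symmetry_chi_squared(table):
    """Return (statistic, df) comparing each cell above the diagonal
    with its mirror image below the diagonal."""
    k = len(table)
    stat, df = 0.0, 0
    for i in range(k):
        for j in range(i + 1, k):
            pair_total = table[i][j] + table[j][i]
            if pair_total > 0:  # skip pairs with no changes at all
                stat += (table[i][j] - table[j][i]) ** 2 / pair_total
                df += 1
    return stat, df


# Rows: opinion before (for, unsure, against); columns: opinion after.
nafta = [
    [51, 22, 28],
    [46, 18, 27],
    [52, 49, 57],
]
stat, df = symmetry_chi_squared(nafta)
print(round(stat, 3), df)  # 22.039 3
```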


Example Problem 4: Wilcoxon Signed-Rank Test


Evaluate the effect of a diuretic in healthy volunteers:

               Daily UOP
Subject    No drug    + Drug    Difference    Rank of difference    Signed rank of difference
1             1600      1490          -110                     5                           -5
2             1850      1300          -550                     6                           -6
3             1300      1400          +100                     4                           +4
4             1500      1410           -90                     3                           -3
5             1400      1350           -50                     2                           -2
6             1010      1000           -10                     1                           -1

W = sum of signed ranks = -13
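The signed-rank sum can be reproduced with a few lines. This sketch assumes no zero differences and no tied absolute differences, which holds for these six subjects; the variable names are illustrative.

```python
no_drug = [1600, 1850, 1300, 1500, 1400, 1010]
with_drug = [1490, 1300, 1400, 1410, 1350, 1000]

# Difference for each subject: daily UOP on drug minus daily UOP off drug.
diffs = [after - before for before, after in zip(no_drug, with_drug)]

# Rank the differences by absolute size (1 = smallest), then restore the sign.
order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
signed_ranks = [0] * len(diffs)
for rank, i in enumerate(order, start=1):
    signed_ranks[i] = rank if diffs[i] > 0 else -rank

w = sum(signed_ranks)
print(diffs)         # [-110, -550, 100, -90, -50, -10]
print(signed_ranks)  # [-5, -6, 4, -3, -2, -1]
print(w)             # -13
```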


If the drug has no effect, the ranks associated with a positive change should be similar to the
ranks associated with a negative change; hence, the sum (W) should equal 0.
How large must W be to call this a statistically significant difference? Refer to Critical Values
table:
N     Critical Value    P
5     15                .062
6     21                .032
      19                .062
7     28                .016
      24                .046
8     32                .024
      28                .054
9     39                .020
      33                .054
10    45                .02
      39                .048
11    52                .018
      44                .054
12    58                .02
      50                .052
13    65                .022
      57                .048
14    73                .02
      63                .05
15    80                .022
      70                .048

*Due to the nature of discrete possible values of W, p values at traditional breakpoints are usually
not possible (ex.: p=0.05).
