Module3 Cse3190 FDA-1
Module3
• Proportion tests
• Chi-squared test
• Fisher's exact test
• Correlation t-test
• Wilcoxon rank-sum test
• Wilcoxon signed-rank test
• One-way ANOVA test
• Kruskal-Wallis test
Proportion tests
• Definition: A proportion test is used to
determine whether a sample proportion differs
from a hypothesized population proportion.
• Applications: Quality control, hypothesis
testing, surveys, and reliability testing in
engineering.
• Binary outcomes: When the variable has two
outcomes (e.g., success/failure).
• Hypothesis: Testing a claim about population
proportions.
• Examples:
• Quality assurance: Proportion of defective items in a
batch.
• Survey data: Proportion of users preferring a new
technology.
• Key Terms Used
• Sample Proportion (p): The proportion of a specific
outcome in your sample.
• Population Proportion (P): The proportion you are
testing against (hypothesized value).
• Null Hypothesis (H₀): Assumes no difference
between sample proportion and population
proportion.
• Alternative Hypothesis (H₁): Assumes there is a
difference.
• Types of Proportion Tests
• One-proportion Z-test: Testing a single sample
against a known proportion.
• Two-proportion Z-test: Comparing the proportions
of two independent samples.
• Steps in Conducting a Proportion Test
1. Formulate Hypotheses:
• H₀: p = P
• H₁: p ≠ P (two-tailed) or p > P, p < P (one-tailed)
2. Calculate the Test Statistic:
• Formula: Z = (p − P) / √(P(1 − P) / n)
• Where:
• p = sample proportion
• P = hypothesized proportion
• n = sample size
3. Determine Critical Value or P-value
4. Make Decision:
• Reject H₀ if the absolute value of the test statistic exceeds the critical value, or if the P-value is less than the significance level (α).
5. Conclusion: State the result in the context of the problem.
Example: One-Proportion Z-Test
• Problem: A manufacturer claims that only 5% of
its products are defective. A quality control
engineer inspects 200 products and finds that 12
are defective. Is the manufacturer's claim true?
• Hypotheses:
• H₀: P = 0.05
• H₁: P ≠ 0.05
• Solution:
• Use the Z-test formula to calculate the Z value and compare it with the critical value.
• P = 0.05 # population proportion (hypothesized proportion)
• n = 200 # sample size
• x = 12 # number of defective items found in the sample
• Sample proportion p = x / n = 12 / 200 = 0.06
• Z = (p − P) / √(P(1 − P) / n)
= (0.06 − 0.05) / √(0.05 × 0.95 / 200)
= 0.01 / 0.01541
≈ 0.65
• Assuming a significance level of 0.05 for a two-
tailed test
• alpha = 0.05
• critical_value = 1.96 # Critical value for a 95%
confidence interval in a two-tailed test
• The calculated Z value is 0.65. For a two-tailed
test at a 95% confidence level, the critical value is
1.96.
• Since the Z value (0.65) is less than the critical value (1.96), we fail to reject the null hypothesis. Therefore, there is not enough evidence to conclude that the proportion of defective products differs from the claimed 5%.
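The calculation above can be sketched in Python. This is a minimal illustration using only the standard library; the helper name one_proportion_ztest is my own, and the two-tailed p-value is computed from the normal distribution via math.erfc.

```python
import math

# One-proportion Z-test for the defective-products example from the slides.
# The helper name below is illustrative, not from a library.
def one_proportion_ztest(x, n, P):
    """Return (z, two_sided_p) for H0: population proportion = P."""
    p = x / n                                   # sample proportion
    se = math.sqrt(P * (1 - P) / n)             # standard error under H0
    z = (p - P) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-tailed normal p-value
    return z, p_value

z, p_value = one_proportion_ztest(x=12, n=200, P=0.05)
print(round(z, 2), round(p_value, 2))  # 0.65 0.52 -> fail to reject H0
```

Since the p-value is far above 0.05, the code reaches the same conclusion as the hand calculation.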
Suppose a company claims that 10% of its
products are defective. A quality control engineer
checks a random sample of 150 products and finds
that 20 of them are defective. The engineer wants
to test whether the actual proportion of defective
products differs from the company's claim at a 5%
significance level.
• Hypotheses:
• H₀: P = 0.10
• H₁: P ≠ 0.10
• Solution:
• Use the Z-test formula to calculate the Z value and compare it with the critical value.
• P = 0.10 # population proportion (hypothesized proportion)
• n = 150 # sample size
• x = 20 # number of defective items found in the sample
• Sample proportion p = x / n = 20 / 150 ≈ 0.1333
• Z = (p − P) / √(P(1 − P) / n)
= (0.1333 − 0.10) / √(0.10 × 0.90 / 150)
= 0.0333 / 0.02449
≈ 1.36
• For a two-tailed test at a 5% significance level
(α = 0.05):
The critical value for a 95% confidence interval is
±1.96.
• Since the calculated Z-value (1.36) is less than
the critical value (1.96), we fail to reject the
null hypothesis. This means there is not
enough evidence to conclude that the
proportion of defective products differs from the
company's claim of 10%.
• The test does not provide sufficient evidence to suggest that the actual proportion of defective products is different from 10%.
• Two-Proportion Z-Test step by step.
• Suppose two different assembly lines, Line A and
Line B, are producing the same product. A quality
control manager wants to determine whether
there is a significant difference between the
proportion of defective products produced by
each line.
• From Line A, a sample of 200 products is inspected,
and 30 defective products are found.
• From Line B, a sample of 250 products is inspected,
and 20 defective products are found.
• Step 1: State the Hypotheses
• Null Hypothesis (H₀): The proportion of defective products is the same for both lines.
H₀: PA = PB
• Alternative Hypothesis (H₁): The proportion of defective products differs between the two lines.
H₁: PA ≠ PB
• Step 2: Collect and Define the Data
• nA = 200 # sample size from Line A
• xA = 30 # number of defective products from Line A
• nB = 250 # sample size from Line B
• xB = 20 # number of defective products from Line B
• Sample proportion from Line A (pA):
pA = xA / nA = 30 / 200 = 0.15
• Sample proportion from Line B (pB):
pB = xB / nB = 20 / 250 = 0.08
• Step 3: Calculate the Test Statistic (Z-Value)
• The formula for the Z-test for two proportions is:
• Z = (pA − pB) / √(p̄(1 − p̄)(1/nA + 1/nB))
• Where:
• p̄ is the pooled sample proportion:
p̄ = (xA + xB) / (nA + nB) = (30 + 20) / (200 + 250) = 50 / 450 ≈ 0.1111
Z = (0.15 − 0.08) / √(0.1111 × 0.8889 × (1/200 + 1/250)) ≈ 2.35
• Step 4: Determine the Critical Value or P-Value
• For a two-tailed test at a 5% significance level (α = 0.05):
• The critical value for a 95% confidence interval is ±1.96.
• Step 5: Decision
• Since the calculated Z value (2.35) exceeds the critical value (1.96), we reject the null hypothesis: the two lines differ significantly in their proportion of defective products.
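The two-proportion test can be sketched the same way. Again a minimal standard-library illustration; the helper name two_proportion_ztest is my own.

```python
import math

# Two-proportion Z-test for Line A vs Line B (counts from the slides).
# The helper name below is illustrative, not from a library.
def two_proportion_ztest(xA, nA, xB, nB):
    """Return (z, two_sided_p) for H0: PA = PB."""
    pA, pB = xA / nA, xB / nB
    pooled = (xA + xB) / (nA + nB)              # pooled proportion under H0
    se = math.sqrt(pooled * (1 - pooled) * (1 / nA + 1 / nB))
    z = (pA - pB) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-tailed normal p-value
    return z, p_value

z, p_value = two_proportion_ztest(xA=30, nA=200, xB=20, nB=250)
print(round(z, 2), round(p_value, 3))  # 2.35 0.019 -> reject H0
```

The p-value of about 0.019 is below 0.05, matching the rejection of H₀ at the 5% level.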
Chi-Square Test
• A Chi-Squared Test is a statistical test used to
determine whether there is a significant
association between categorical variables. There
are two main types of Chi-Squared Tests:
• Chi-Squared Test of Independence: Used to
determine if two categorical variables are
independent.
• Chi-Squared Goodness-of-Fit Test: Used to
determine if a sample fits a population distribution.
Example: Chi-Squared Test of Independence
Problem:
A researcher wants to find out if there is an association
between gender (Male/Female) and preference for a
product (Like/Dislike). They conduct a survey with the
following results:
         Like   Dislike   Total
Male      30      20        50
Female    20      30        50
Total     50      50       100
• Step 1: State the Hypotheses
• Null Hypothesis (H₀): Gender and product
preference are independent.
• Alternative Hypothesis (H₁): Gender and product
preference are not independent (there is an
association).
• Step 2: Observed Values (O)
• The observed values are the counts from the survey:
• Step 3: Calculate Expected Values (E)
• The expected values are calculated using the formula:
• E = (Row Total × Column Total) / Grand Total
• For every cell in this table, E = (50 × 50) / 100 = 25.
• Step 4: Calculate the Test Statistic (Chi-Squared Value)
• The Chi-Squared test statistic is calculated using the formula:
• χ² = Σ (O − E)² / E
• χ² = (30−25)²/25 + (20−25)²/25 + (20−25)²/25 + (30−25)²/25 = 4.0
• The calculated Chi-Squared test statistic is 4.0.
• Step 5: Determine the Critical Value
• Degrees of freedom (df) = (number of rows − 1) × (number of columns − 1)
df = (2 − 1)(2 − 1) = 1
• At a 5% significance level (α = 0.05), the critical value for 1 degree of freedom is 3.841.
• Step 6: Decision
• Since the Chi-Squared test statistic (4.0) is greater
than the critical value (3.841), we reject the null
hypothesis.
• Conclusion:
• There is enough evidence to conclude that there is a
significant association between gender and product
preference.
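The same test can be run with SciPy (assuming SciPy is available). Passing correction=False disables the Yates continuity correction so the statistic matches the hand calculation of 4.0.

```python
from scipy.stats import chi2_contingency

# Chi-squared test of independence for the gender x preference table.
# correction=False matches the hand calculation (no Yates correction).
observed = [[30, 20],
            [20, 30]]
chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(chi2, round(p_value, 4), dof)  # 4.0 0.0455 1 -> reject H0 at alpha = 0.05
```

The returned expected table is all 25s, matching E = (50 × 50) / 100 for every cell.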
Fisher exact test
• The Fisher Exact Test is used to determine if
there are nonrandom associations between two
categorical variables, especially in small sample
sizes.
• It’s often used as an alternative to the Chi-
Squared test when sample sizes are small or
when expected frequencies in any of the cells are
below 5.
Example: Fisher's Exact Test
Problem:
Suppose a study is conducted to determine if a new
drug is effective in curing a certain disease. The results
are summarized as follows:
          Cured   Not Cured   Total
Drug        8         2         10
Placebo     3         7         10
Total      11         9         20
• Step 1: State the Hypotheses
• Null Hypothesis (H₀): There is no association
between the treatment type and the outcome (they
are independent).
• Alternative Hypothesis (H₁): There is an association
between the treatment type and the outcome (they
are not independent).
• Step 2: Perform Fisher's Exact Test
• Formula for Fisher's Exact Test (probability of the observed table):
          Cured   Not Cured   Total
Drug        a         b        (a+b)
Placebo     c         d        (c+d)
Total     (a+c)     (b+d)        n
P = [(a+b)! (c+d)! (a+c)! (b+d)!] / [a! b! c! d! n!] ≈ 0.032
• The two-sided p-value sums the probabilities of all tables at least as extreme as the observed one, giving p ≈ 0.0698.
Step 3: Decision
• At a 5% significance level (α = 0.05), the p-value
(0.0698) is greater than 0.05, so we fail to reject
the null hypothesis.
Conclusion:
• There is not enough evidence to conclude that there is
a significant association between the treatment type
(Drug vs. Placebo) and the outcome (Cured vs. Not
Cured). The data does not provide strong evidence
that the drug is more effective than the placebo,
though the odds ratio suggests a higher chance of
being cured with the drug.
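SciPy's fisher_exact reproduces this result directly (assuming SciPy is available); it returns both the odds ratio and the two-sided p-value.

```python
from scipy.stats import fisher_exact

# Fisher's exact test for the drug vs placebo table from the slides.
table = [[8, 2],   # Drug:    cured, not cured
         [3, 7]]   # Placebo: cured, not cured
odds_ratio, p_value = fisher_exact(table, alternative='two-sided')
print(round(odds_ratio, 2), round(p_value, 4))  # 9.33 0.0698 -> fail to reject H0
```

The odds ratio of about 9.33 is what the conclusion alludes to: the drug group's odds of being cured look higher, but the two-sided p-value (0.0698) is not below 0.05.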
Correlation-T test
• The Correlation T-Test is used to determine if
the correlation coefficient (r) between two
variables is significantly different from zero.
• In other words, it tests whether there is a
significant linear relationship between two
continuous variables.
• Formula for Correlation T-Test Calculation
• The test statistic t for the correlation t-test is calculated using the following formula:
t = r √(n − 2) / √(1 − r²)
Where:
• r is the sample correlation coefficient.
• n is the number of paired data points (sample size).
• Example:
• Let's say we have a sample of 10 paired data points
and the calculated correlation coefficient r between
two variables is 0.65. We want to test whether this
correlation is significantly different from zero at a 5%
significance level.
• Step 1: Hypotheses
• H0: ρ=0 (There is no linear relationship between the
variables).
• H1:ρ≠0 (There is a significant linear relationship
between the variables).
• Step 2: Given Data
• Correlation coefficient (r) = 0.65
• Sample size (n) = 10
• Step 3: Apply the correlation t-test formula
t = 0.65 √(10 − 2) / √(1 − 0.65²) = (0.65 × 2.828) / 0.760 ≈ 2.42
• The calculated t-value is approximately 2.42.
• Step 4: Determine the Critical Value or P-value
• Degrees of freedom df = n − 2 = 10 − 2 = 8.
• Using a t-distribution table or calculator, the critical t-value for a two-tailed test at a 5% significance level (α = 0.05) and 8 degrees of freedom is approximately ±2.306.
• Step 5: Decision
• Since the calculated t-value (2.42) is greater than
the critical value (2.306), we reject the null
hypothesis.
• Conclusion:
• There is enough evidence to suggest that the
correlation between the two variables is significantly
different from zero, indicating a significant linear
relationship between them.
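The formula and the table lookup can be sketched together in Python (assuming SciPy is available for the t-distribution critical value).

```python
import math
from scipy.stats import t as t_dist

# Correlation t-test: is r = 0.65 from n = 10 pairs significantly non-zero?
r, n = 0.65, 10
t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)  # t = r sqrt(n-2)/sqrt(1-r^2)
df = n - 2
t_crit = t_dist.ppf(0.975, df)   # two-tailed critical value at alpha = 0.05
print(round(t_stat, 2), round(t_crit, 3))  # 2.42 2.306 -> reject H0
```

Since 2.42 > 2.306, the code confirms the rejection of H₀ at the 5% level.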
Wilcoxon Rank sum tests
• The Wilcoxon Rank-Sum Test (also called the
Mann-Whitney U Test) is a non-parametric test
used to compare whether two independent
samples come from the same distribution.
• It's used when the data do not necessarily follow
a normal distribution, making it a non-
parametric alternative to the two-sample t-test.
• Key Points:
• It tests whether the distributions of two independent
samples are the same.
• It is used for ordinal data or continuous data that is
not normally distributed.
• Unlike the t-test, it does not assume normality.
• Test Procedure:
1. Rank the data: Combine the two samples and rank
the values from lowest to highest.
2. Assign ranks: Assign ranks to each observation in
the combined dataset. If there are ties, assign the
average rank to the tied values.
3. Sum the ranks for each sample: Compute the sum
of ranks for each sample.
4. Calculate the test statistic: The test statistic is
based on the rank sums, and it compares how the
ranks are distributed between the two samples.
• Example:
• Suppose we have two independent samples:
• Sample 1: 3, 1, 7, 5
• Sample 2: 4, 6, 8, 2
• Step 1: Combine and Rank the Data
• First, combine the two samples and assign ranks.
Value Rank
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
• Step 2: Assign Ranks to Each Sample
• Sample 1 (3, 1, 7, 5): Ranks = 3, 1, 7, 5 → Sum of
Ranks for Sample 1 = 16 = R1
• Sample 2 (4, 6, 8, 2): Ranks = 4, 6, 8, 2 → Sum of
Ranks for Sample 2 = 20 = R2
• Step 3: Test Statistic
• For the Wilcoxon Rank-Sum Test, we choose the
smaller of the two rank sums:
• W=min(R1,R2)=min(16,20)=16
• Step 4: Compare to Critical Value or Use Normal Approximation
• For small sample sizes, you would typically refer to a table of critical values for the Wilcoxon Rank-Sum Test. However, since both samples have 4 observations, we can use a normal approximation.
• To calculate the Z-score using the Mann-Whitney U statistic:
• Step 4.1: Calculate the U-statistic for Sample 1:
U1 = n1·n2 + n1(n1 + 1)/2 − R1 = (4 × 4) + (4 × 5)/2 − 16 = 10
Where:
• n1 and n2 are the sizes of the two samples.
• R1 is the rank sum for sample 1.
• The Mann-Whitney U-statistic for Sample 1 is 10.
• Using the normal approximation, Z = (U1 − n1·n2/2) / √(n1·n2(n1 + n2 + 1)/12) = (10 − 8) / √12 ≈ 0.577.
• Step 5: Decision
• At a typical significance level (α = 0.05), the corresponding critical Z-value for a two-tailed test is ±1.96.
• Since the calculated Z-score (0.577) is less than 1.96, we fail to reject the null hypothesis.
• Conclusion:
• There is no significant evidence to suggest that the two
samples come from different distributions. Thus, we conclude
that there is no significant difference between the two groups
based on the Wilcoxon Rank-Sum Test.
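The ranking steps above can be sketched in Python. This follows the slide's convention for U1 (U1 + U2 = n1·n2, so other tools may report the complementary value 6 instead); scipy.stats.rankdata is assumed available for the ranking.

```python
import math
from scipy.stats import rankdata

# Wilcoxon rank-sum / Mann-Whitney U for the two samples from the slides:
# rank the combined data, sum ranks, form U, then the normal approximation.
sample1 = [3, 1, 7, 5]
sample2 = [4, 6, 8, 2]
n1, n2 = len(sample1), len(sample2)
ranks = rankdata(sample1 + sample2)              # ranks of the combined data
R1 = ranks[:n1].sum()                            # rank sum of sample 1
U1 = n1 * n2 + n1 * (n1 + 1) / 2 - R1            # U-statistic (slide convention)
mu = n1 * n2 / 2                                 # mean of U under H0
sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # std. dev. of U under H0
z = (U1 - mu) / sigma
print(R1, U1, round(z, 3))  # 16.0 10.0 0.577 -> fail to reject H0
```

Since |Z| = 0.577 < 1.96, the code reaches the same "fail to reject" conclusion.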
Wilcoxon signed rank test
• The Wilcoxon Signed-Rank Test is a non-
parametric test used to compare two related
samples or paired observations to assess
whether their population mean ranks differ.
• It is the non-parametric equivalent of the paired
t-test and is used when the data does not follow a
normal distribution.
• When to Use:
• Paired samples: When you have two sets of related
data, such as pre-test and post-test measurements
from the same subjects.
• The differences between the paired observations are
used for ranking.
• Test Procedure:
1. Calculate the differences between each pair of observations.
2. Rank the absolute values of the differences from smallest to largest. Assign the average rank to ties.
3. Assign the signs (+ or −) of the differences to the ranks.
4. Sum the ranks for the positive differences (R+) and the negative differences (R−).
5. The test statistic W is the smaller of R+ and R−.
• Example
• Suppose we have the following paired data
representing the weights of individuals before and
after a fitness program:
Person Before(kg) After(kg)
1 70 68
2 82 80
3 76 75
4 85 85
5 74 72
6 90 88
• Step 1: Calculate the difference
Person Before(kg) After(kg) Difference
(After-Before)
1 70 68 -2
2 82 80 -2
3 76 75 -1
4 85 85 0
5 74 72 -2
6 90 88 -2
• Step 2: Rank the Absolute Differences:
• Ignore differences of zero (as for person 4) and rank the absolute differences.
• If there are ties (i.e., two or more differences with the same value), assign the average rank to those tied values. Here the single difference of 1 gets rank 1, and the four differences of 2 share the average of ranks 2, 3, 4, and 5: (2+3+4+5)/4 = 14/4 = 3.5.
Person   Difference (After−Before)   Absolute Difference   Rank
1              −2                          2               3.5
2              −2                          2               3.5
3              −1                          1               1
4               0                          0               Excluded
5              −2                          2               3.5
6              −2                          2               3.5
• Step 3: Assign Signs to the Ranks
• Assign the sign of the original difference to the corresponding rank.
• If the original difference was negative, assign a negative sign to the rank.
• If the original difference was positive, assign a positive sign to the rank.
Person   Difference (After−Before)   Rank       Signed Rank
1              −2                    3.5          −3.5
2              −2                    3.5          −3.5
3              −1                    1            −1
4               0                    Excluded     Excluded
5              −2                    3.5          −3.5
6              −2                    3.5          −3.5
• Step 4: Calculate the Sum of the Positive and Negative Ranks
• Sum of Positive Ranks (R+):
• Since all differences are negative, there are no positive ranks, so R+ = 0.
• Sum of Negative Ranks (R−):
• Add up the absolute values of the negative ranks:
• R− = 3.5 + 3.5 + 1 + 3.5 + 3.5 = 15
• The test statistic W is the smaller of R+ and R−, which is:
• W = min(0, 15) = 0
• Step 5: Decision
• At a typical significance level (α = 0.05):
• Since the p-value (0.034, from the normal approximation with a tie correction) is less than 0.05, we reject the null hypothesis.
• Conclusion:
• There is significant evidence to suggest that the median difference between the "Before" and "After" measurements is not zero. Therefore, the fitness program appears to have a significant effect on reducing weight.
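The full procedure can be sketched step by step in Python. This reproduces the slide's steps (drop zero differences, rank the absolute differences with average ranks for ties) and then uses the tie-corrected normal approximation, which yields the quoted p ≈ 0.034; scipy.stats.rankdata is assumed available.

```python
import math
from collections import Counter
from scipy.stats import rankdata

# Wilcoxon signed-rank test for the before/after weights from the slides.
before = [70, 82, 76, 85, 74, 90]
after  = [68, 80, 75, 85, 72, 88]
diffs = [a - b for a, b in zip(after, before) if a != b]  # drop zero differences
ranks = rankdata([abs(d) for d in diffs])                 # average ranks for ties
r_plus  = sum(r for d, r in zip(diffs, ranks) if d > 0)
r_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
W = min(r_plus, r_minus)                                  # test statistic

n = len(diffs)
mu = n * (n + 1) / 4                                      # mean of W under H0
var = n * (n + 1) * (2 * n + 1) / 24                      # variance of W under H0
# tie correction: subtract sum(t^3 - t)/48 over each group of t tied ranks
var -= sum(t ** 3 - t for t in Counter(ranks).values()) / 48
z = (W - mu) / math.sqrt(var)
p_value = math.erfc(abs(z) / math.sqrt(2))                # two-tailed
print(W, round(p_value, 3))  # 0.0 0.034 -> reject H0
```

scipy.stats.wilcoxon gives the same result here, since with ties present it also falls back to this normal approximation.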
one-way ANOVA test
• The One-Way ANOVA test is used to determine
whether there are statistically significant
differences between the means of three or more
independent (unrelated) groups.
• It is a parametric test used when the
assumptions of normality and homogeneity of
variance are met.
• When to Use:
• You have one independent variable with three or
more levels (groups).
• You want to test if the means of these groups are
significantly different.
• How One-Way ANOVA Works:
1. Between-group variability: The variability between the means of the groups.
2. Within-group variability: The variability of observations within each group.
• One-Way ANOVA compares the variance between the groups (due to differences in group means) with the variance within the groups (due to individual differences).
• Formula for F-statistic:
F = MSB / MSW
Where:
• MSB (Mean Square Between Groups) is the variance between the group means.
• MSW (Mean Square Within Groups) is the variance within the groups.
• Steps in a One-Way ANOVA:
• Step 1: Calculate Group Means and Grand Mean
• Calculate the mean for each group.
• Calculate the overall mean (grand mean) across all
observations.
• Step 2: Calculate the Sum of Squares
• Sum of Squares Between (SSB): This measures the variability between group means.
SSB = Σ ni (x̄i − x̄grand)²
Where:
• ni is the number of observations in group i,
• x̄i is the mean of group i, and
• x̄grand is the grand mean.
• Sum of Squares Within (SSW): This measures the variability within each group.
SSW = Σi Σj (Xij − x̄i)²
Where:
• Xij is the j-th individual observation in group i, and
• x̄i is the mean of group i.
• Total Sum of Squares (SST): The total variability in the data.
SST = SSB + SSW
• Step 3: Calculate the Mean Squares
• Mean Square Between (MSB): Divide the sum of squares between by the degrees of freedom between groups (k − 1, where k is the number of groups):
MSB = SSB / (k − 1)
• Mean Square Within (MSW): Divide the sum of squares within by the degrees of freedom within groups (N − k, where N is the total number of observations):
MSW = SSW / (N − k)
• Step 4: Calculate the F-statistic
F = MSB / MSW
• Example:
Suppose we have three different teaching
methods, and we want to test if the average scores
of students taught using these methods are
significantly different. The data are:
• Group 1 (Method A): [85, 90, 88]
• Group 2 (Method B): [78, 82, 84]
• Group 3 (Method C): [92, 94, 89]
• Step 1: Calculate Group Means and the Grand Mean
• Mean of Group 1: (85 + 90 + 88) / 3 ≈ 87.67
• Mean of Group 2: (78 + 82 + 84) / 3 ≈ 81.33
• Mean of Group 3: (92 + 94 + 89) / 3 ≈ 91.67
• Grand Mean = (87.67 + 81.33 + 91.67) / 3 ≈ 86.89
• Step 2: Calculate the Sum of Squares
• 2.1: Sum of Squares Between (SSB)
• For Group 1:
n1 = 3,
(x̄1 − x̄grand)² = (87.667 − 86.889)² ≈ 0.605
Contribution to SSB: 3 × 0.605 = 1.815
• For Group 2:
n2 = 3,
(x̄2 − x̄grand)² = (81.333 − 86.889)² ≈ 30.864
Contribution to SSB: 3 × 30.864 = 92.593
• For Group 3:
n3 = 3,
(x̄3 − x̄grand)² = (91.667 − 86.889)² ≈ 22.827
Contribution to SSB: 3 × 22.827 = 68.481
• Now, sum them up to get the total SSB:
• SSB = 1.815 + 92.593 + 68.481 ≈ 162.89
• 2.2: Sum of Squares Within (SSW)
• For Group 1 (Method A):
(85 − 87.667)² + (90 − 87.667)² + (88 − 87.667)² ≈ 7.11 + 5.44 + 0.11 = 12.67
• For Group 2 (Method B):
(78 − 81.333)² + (82 − 81.333)² + (84 − 81.333)² ≈ 11.11 + 0.44 + 7.11 = 18.67
• For Group 3 (Method C):
(92 − 91.667)² + (94 − 91.667)² + (89 − 91.667)² ≈ 0.11 + 5.44 + 7.11 = 12.67
• Now, sum them up to get the total SSW:
SSW = 12.67 + 18.67 + 12.67 ≈ 44.00
• Step 3: Calculate the Mean Squares
• 3.1: Mean Square Between (MSB)
MSB = SSB / (k − 1) = 162.89 / 2 ≈ 81.44
• 3.2: Mean Square Within (MSW)
MSW = SSW / (N − k) = 44.00 / 6 ≈ 7.33
• Step 4: Calculate the F-statistic
F = MSB / MSW = 81.44 / 7.33 ≈ 11.11
• At a 5% significance level, the critical value of F with (2, 6) degrees of freedom is about 5.14. Since 11.11 > 5.14, we reject the null hypothesis: the mean scores of the three teaching methods differ significantly.
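As a cross-check, SciPy's f_oneway runs the whole ANOVA directly from the raw scores (assuming SciPy is available).

```python
from scipy.stats import f_oneway

# One-way ANOVA on the three teaching methods from the slides.
method_a = [85, 90, 88]
method_b = [78, 82, 84]
method_c = [92, 94, 89]
F, p_value = f_oneway(method_a, method_b, method_c)
print(round(F, 2), round(p_value, 4))  # 11.11 0.0096 -> reject H0
```

The p-value of about 0.01 is well below 0.05, confirming the rejection of H₀.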
Kruskal Wallis test
• The Kruskal-Wallis Test is a non-parametric
alternative to the One-Way ANOVA.
• It is used to determine if there are statistically
significant differences between the distributions
of three or more independent groups.
• Unlike ANOVA, it does not assume a normal
distribution and is based on ranked data.
• When to Use:
• You have three or more independent groups.
• You want to test whether the distributions of the
groups differ.
• You do not assume the data to be normally
distributed.
• Test Statistic:
• The Kruskal-Wallis test uses ranks rather than raw data to compute a test statistic. The formula for the test statistic H is:
H = [12 / (N(N + 1))] Σ (Ri² / ni) − 3(N + 1)
Where:
• N is the total number of observations across all groups.
• k is the number of groups.
• Ri is the sum of ranks for group i.
• ni is the number of observations in group i.
• If the null hypothesis is true, H approximately follows a chi-squared distribution with k − 1 degrees of freedom.
• Steps for Performing the Kruskal-Wallis Test:
• Step 1: Rank the Data
• Combine all the data from all groups and assign ranks, with
the smallest value getting a rank of 1. In the case of ties,
assign average ranks.
• Step 2: Calculate the Test Statistic H
• Use the formula in the previous slide to calculate the
Kruskal-Wallis test statistic.
• Step 3: Determine the p-value
• Compare the calculated H statistic to the critical value from
the chi-squared distribution with k−1 degrees of freedom,
or calculate the p-value.
• Example:
Let’s use the same data as the previous One-Way
ANOVA example, but this time apply the Kruskal-Wallis
test:
• Group 1 (Method A): [85, 90, 88]
• Group 2 (Method B): [78, 82, 84]
• Group 3 (Method C): [92, 94, 89]
Calculate the Kruskal-Wallis test statistic and p-value
for this example.
• Step 1: Combine and Rank the Data
• First, we combine all the data from the three groups
and rank the values, assigning the smallest value the
rank of 1. In the case of ties, we assign the average
rank to the tied values.
• Combined Data (Unranked):
• Group 1 (Method A): 85, 90, 88
• Group 2 (Method B): 78, 82, 84
• Group 3 (Method C): 92, 94, 89
• Combine and Rank:
Score Rank
78 1
82 2
84 3
85 4
88 5
89 6
90 7
92 8
94 9
• Step 2: Calculate the Rank Sums for Each
Group
• Now, assign the ranks back to the original groups and
sum them for each group:
• Group 1 (Method A): Scores = 85, 90, 88 → Ranks = 4, 7, 5
R1=4+7+5=16
• Group 2 (Method B): Scores = 78, 82, 84 → Ranks = 1, 2, 3
R2=1+2+3=6
• Group 3 (Method C): Scores = 92, 94, 89 → Ranks = 8, 9, 6
R3=8+9+6=23
• Step 3: Calculate the Kruskal-Wallis Test Statistic H
• Step 3.1: Calculate Total Sample Size N
• N = 9 (since there are 3 scores in each group and 3 groups)
• Step 3.2: Calculate Ri² / ni for Each Group
• For Group 1 (Method A): R1² / n1 = 16² / 3 = 256 / 3 ≈ 85.33
• For Group 2 (Method B): R2² / n2 = 6² / 3 = 36 / 3 = 12
• For Group 3 (Method C): R3² / n3 = 23² / 3 = 529 / 3 ≈ 176.33
• Step 3.3: Sum the Terms
• Σ (Ri² / ni) = 85.33 + 12 + 176.33 = 273.67
H = [12 / (9 × 10)] × 273.67 − 3 × (9 + 1) = 36.49 − 30 ≈ 6.49
• Step 4: Determine the p-value
• The Kruskal-Wallis test statistic H = 6.49 approximately follows a chi-squared distribution with k − 1 = 3 − 1 = 2 degrees of freedom.
• Using statistical software or a chi-squared table, we find the p-value for H = 6.49 and 2 degrees of freedom is approximately 0.039.
• Conclusion:
• Since the p-value (0.039) is less than 0.05, we reject the null hypothesis and conclude that there is a significant difference in the distributions of the three groups.
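SciPy's kruskal reproduces the hand calculation in one call (assuming SciPy is available).

```python
from scipy.stats import kruskal

# Kruskal-Wallis test on the same three groups as the ANOVA example.
method_a = [85, 90, 88]
method_b = [78, 82, 84]
method_c = [92, 94, 89]
H, p_value = kruskal(method_a, method_b, method_c)
print(round(H, 2), round(p_value, 3))  # 6.49 0.039 -> reject H0
```

Both the parametric ANOVA and this rank-based test reject H₀ here, which is reassuring when the normality assumption is in doubt.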