CSCI946 w3_DataPrep
CSCI446/946
Content
• Brief Recap
– Big Data Analytics Lifecycle
– An Example
• Tools for Data Preparation
– Exploratory Data Analysis
• Visualization before analysis (in lab)
• Visualizing single and multiple variables (in lab)
– Statistical Methods for Evaluation
• Hypothesis, Hypothesis Testing, t-test, ANOVA
Exploratory Data Analysis
• Consider two situations:
1. A company collects data about customer satisfaction
levels, and wishes to investigate whether a change in
the product design would improve customer
satisfaction. How can the company ascertain whether
the change had the desired effect?
2. A data scientist produces a set of results by deploying
a machine learning model. The data scientist wishes to
investigate whether a variation of the model
architecture would improve results. How can the data
scientist be certain that the modified model yields an
improvement in results?
Exploratory Data Analysis
• For each of the two situations we could compute
the average (mean) value of the data prior to the
change, and compute the mean value of the data
after the change.
– Analyse the difference of Means
• Perform a t-test:
– Is |T| greater than or equal to T*?
– Let's check: |-1.7828| >= 2.048407
– Answer is “No”
• Insufficient evidence to reject H0
• H0 is retained (we fail to reject it).
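The decision rule above can be sketched in Python using only the standard library. This is a minimal sketch, not a definitive implementation: it assumes the pooled (equal-variance) form of Student's t-test, the function names `pooled_t_statistic` and `reject_h0` are hypothetical, and the before/after samples are made-up values for illustration. The critical value T* is taken as given rather than computed.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t_statistic(x, y):
    """Two-sample Student's t statistic, assuming equal population variances."""
    n1, n2 = len(x), len(y)
    # Pooled variance: sample variances weighted by their degrees of freedom.
    sp2 = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    return (mean(x) - mean(y)) / sqrt(sp2 * (1 / n1 + 1 / n2))


def reject_h0(t_stat, t_crit):
    """Two-sided decision rule: reject H0 when |T| >= T*."""
    return abs(t_stat) >= t_crit


# Values quoted on the slide: T = -1.7828, T* = 2.048407.
print(reject_h0(-1.7828, 2.048407))  # False: insufficient evidence to reject H0

# Hypothetical before/after scores, for illustration only.
before = [72, 75, 78, 71, 74]
after = [76, 79, 77, 80, 78]
t = pooled_t_statistic(before, after)
print(round(t, 4))  # -2.8284
```

The statistic is negative whenever the first sample has the smaller mean, which is why the rule compares the absolute value |T| against T*.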
Statistical Methods for Evaluation
• Student’s t-test (an example)
– What does the “p-value” mean? (two-sided test)
p-value: the probability of observing a test statistic (for the rank-sum test, rank-sums) of at least this magnitude, assuming that
the population distributions are identical
Statistical Methods for Evaluation
• Wilcoxon Rank-Sum Test – Suppose we have the following data:
– Group A: [85, 80, 78, 90, 95]; Group B: [88, 82, 85, 87, 92]
• Step 1: Combine and Rank the Data
– Combine: [85, 80, 78, 90, 95, 88, 82, 85, 87, 92]
– Rank: [4.5, 2, 1, 8, 10, 7, 3, 4.5, 6, 9] (the two tied values of 85 share ranks 4 and 5, so each gets 4.5)
• Step 2: Sum the Ranks for Each Group
– Group A: [4.5, 2, 1, 8, 10]; Sum of ranks 𝑊1 = 4.5+2+1+8+10 = 25.5
– Group B: [7, 3, 4.5, 6, 9]; Sum of ranks 𝑊2 = 7+3+4.5+6+9 = 29.5
• Step 3: Choose the Test Statistic
– W can be either 25.5 or 29.5 depending on the test design, but usually
the smaller sum is used if conducting a one-sided test.
• Step 4: Determine Significance
– Compare the test statistic 𝑊 to a critical value from the Wilcoxon rank-
sum distribution or use a p-value from statistical software.
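Steps 1 and 2 can be reproduced with a short Python sketch using only the standard library. The helper names `average_ranks` and `rank_sums` are hypothetical; tied values receive the mean of the ranks they occupy, as in the example above. The sketch stops at the rank sums and does not compute a critical value or p-value.

```python
def average_ranks(values):
    """Rank values from 1..N, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def rank_sums(group_a, group_b):
    """Return (W1, W2): the rank sums of each group in the combined sample."""
    ranks = average_ranks(list(group_a) + list(group_b))
    w1 = sum(ranks[:len(group_a)])
    w2 = sum(ranks[len(group_a):])
    return w1, w2


group_a = [85, 80, 78, 90, 95]
group_b = [88, 82, 85, 87, 92]
w1, w2 = rank_sums(group_a, group_b)
n = len(group_a) + len(group_b)
# Sanity check: the ranks 1..N always sum to N(N+1)/2 (here 55).
assert w1 + w2 == n * (n + 1) / 2
print(w1, w2)  # 25.5 29.5
```

The N(N+1)/2 check is a quick way to catch ranking mistakes in a hand calculation. For Step 4, a library routine such as SciPy's `scipy.stats.mannwhitneyu` can supply a p-value.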
Statistical Methods for Evaluation
• Type I and Type II Errors
– Type I error: the rejection of the null hypothesis when
the null hypothesis is TRUE
• The probability of type I error is denoted by α
• Controlled by selecting an appropriate significance level
– Type II error: the acceptance of the null hypothesis
when the null hypothesis is FALSE
• The probability of type II error is denoted by β
• Reduced by increasing the sample size
• Power (statistical power):
– The probability of correctly rejecting the null
hypothesis (1 − β)
– Used to determine the necessary sample size
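The link between α, β, power, and sample size can be illustrated numerically. The sketch below is a simplification and not the t-test itself: it assumes a two-sided one-sample z-test with a known population standard deviation, so that power has a closed form. The function name `z_test_power` and the effect size and σ in the loop are hypothetical values chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def z_test_power(effect, sigma, n, alpha=0.05):
    """Power (1 - beta) of a two-sided one-sample z-test.

    effect: true difference between the population mean and the H0 mean
    sigma:  population standard deviation (assumed known)
    n:      sample size
    alpha:  significance level, i.e. the Type I error probability
    """
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = effect * sqrt(n) / sigma
    # H0 is rejected when the standardized sample mean falls in either tail.
    return STD_NORMAL.cdf(-z_crit - shift) + 1 - STD_NORMAL.cdf(z_crit - shift)


# Hypothetical effect size and sigma: power grows as the sample size grows.
for n in (10, 20, 50, 100):
    power = z_test_power(effect=2.0, sigma=5.0, n=n)
    print(f"n={n:3d}  power={power:.3f}  beta={1 - power:.3f}")
```

When the true effect is zero, the function returns α, which is the Type I error rate. For a fixed non-zero effect, β shrinks as n increases, which is why power is used to choose a sample size.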
Statistical Methods for Evaluation
• ANOVA (Analysis of Variance)
– What if there are more than two populations?
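– Multiple t-tests may not perform well here: each additional pairwise test raises the chance of a Type I error
• A generalization of hypothesis testing for the difference of two population means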
– ANOVA tests if any of the population means differ
from the other population means
– Each population is assumed to be normal and
have the same variance
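A one-way ANOVA compares the variation between group means with the variation within groups. The sketch below computes the resulting F statistic with the standard library only. The function name `one_way_anova_f` and the three groups of scores are hypothetical, and the sketch assumes the normality and equal-variance conditions stated above without checking them.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean([x for g in groups for x in g])
    # Between-group sum of squares: spread of group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of observations around their group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical scores for three populations, for illustration only.
groups = [[82, 85, 88], [78, 80, 79], [90, 92, 94]]
f_stat, df_b, df_w = one_way_anova_f(groups)
print(f"F = {f_stat:.3f} with ({df_b}, {df_w}) degrees of freedom")
```

A large F suggests that at least one population mean differs from the others. To reach a decision, F is compared with a critical value from the F distribution with (k − 1, N − k) degrees of freedom; SciPy's `scipy.stats.f_oneway` returns both the statistic and a p-value.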
Statistical Methods for Evaluation
• ANOVA (Analysis of Variance)