08-Data Science-S25-Comparing Two Samples

Uploaded by

mohussein529

CMPS 360: Data Science Fundamentals, Spring 2025
Session 8: Comparing Two Samples (Ch. 12)

Many slides are created by John DeNero and Ani Adhikari
Comparing Two Samples
● Compare values of sampled individuals in Group A with
values of sampled individuals in Group B.

● Question: Do the two sets of values come from the
same underlying distribution?
● Answering this question by performing a statistical test
is called A/B testing.
A/B Testing
The Groups and the Question
● Random sample of mothers of newborns. Compare:
○ (A) Birth weights of babies of mothers who didn’t
smoke
○ (B) Birth weights of babies of mothers who smoked
during pregnancy

● Question: Could the difference be due to chance alone?

(Demo)
Hypotheses
● Null:
○ In the population, the distributions of the birth
weights of the babies in the two groups are the
same. (They are different in the sample just due to
chance.)
● Alternative:
○ In the population, the babies of the mothers who
smoked weigh less, on average, than the babies of
the non-smokers.
Test Statistic
● Group A: non-smokers
● Group B: smokers

● Statistic: Difference between average weights

Group B average - Group A average

● Negative values of this statistic favor the alternative

(Demo)
Simulating Under the Null
● If the null is true, all rearrangements of labels are
equally likely
● Plan:
○ Shuffle all group labels
○ Assign each shuffled label to a birth weight
○ Find the difference between the averages of the two
shuffled groups
○ Repeat
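The plan above can be sketched in a few lines of NumPy. This is a minimal illustration and not the course demo: the function names `difference_of_means` and `permutation_test` are hypothetical, and the weights below are made-up numbers chosen only so the code runs.

```python
import numpy as np

def difference_of_means(weights, labels):
    """Group B (smoker) average minus Group A (non-smoker) average."""
    weights = np.asarray(weights, dtype=float)
    labels = np.asarray(labels)
    return weights[labels == "Smoker"].mean() - weights[labels == "Non-smoker"].mean()

def permutation_test(weights, labels, repetitions=5000, seed=0):
    """Shuffle the labels repeatedly and record the statistic each time."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    observed = difference_of_means(weights, labels)
    simulated = np.empty(repetitions)
    for i in range(repetitions):
        shuffled_labels = rng.permutation(labels)  # random rearrangement of labels
        simulated[i] = difference_of_means(weights, shuffled_labels)
    # Negative values favor the alternative, so look at the left tail.
    p_value = np.mean(simulated <= observed)
    return observed, simulated, p_value

# Made-up birth weights in ounces (illustrative only, not the course data).
weights = np.array([120, 113, 128, 123, 117, 125, 119, 130,
                    108, 111, 115, 104, 118, 109])
labels = np.array(["Non-smoker"] * 8 + ["Smoker"] * 6)

observed, simulated, p_value = permutation_test(weights, labels)
print("observed difference:", observed)
print("empirical P-value:", p_value)
```

Shuffling the labels rather than the weights keeps both group sizes fixed, which matches the idea that under the null every rearrangement of labels is equally likely.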
The Data

...  Non-smoker  Non-smoker  Smoker  Smoker  Non-smoker  ...
     120 oz      113 oz      128 oz  108 oz  117 oz
Shuffling Labels Under the Null

...  Smoker  Non-smoker  Non-smoker  Smoker  Smoker  ...
     120 oz  113 oz      128 oz      108 oz  117 oz
Shuffling Rows: Random Permutation
● tbl.sample(n)
○ Table of n rows picked randomly with replacement
● tbl.sample()
○ Table with same number of rows as original tbl,
picked randomly with replacement
● tbl.sample(n, with_replacement = False)
○ Table of n rows picked randomly without replacement
● tbl.sample(with_replacement = False)
○ All rows of tbl, in random order
(Demo)
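For readers without the course's Table library at hand, the same four sampling behaviors can be mimicked on a plain NumPy array. This is an analogy under that assumption, not the `tbl.sample` implementation; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)
weights = np.array([120, 113, 128, 108, 117])  # the five weights shown on the slide

# n values picked randomly with replacement (repeats are possible)
with_replacement = rng.choice(weights, size=3, replace=True)

# same number of values as the original, picked with replacement
same_size = rng.choice(weights, size=len(weights), replace=True)

# n values picked randomly without replacement (no repeats of a position)
without_replacement = rng.choice(weights, size=3, replace=False)

# all values in random order: a random permutation
shuffled = rng.permutation(weights)

print(with_replacement, same_size, without_replacement, shuffled)
```

Only the last form is a shuffle: it keeps every value exactly once and changes nothing but the order.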
How We’ve Tested Thus Far
Hypothesis Testing Review
● 1 Sample: One Category (e.g. percent of flowers that are purple)
○ Test Statistic: observed_proportion, abs(observed_proportion - null_proportion)
○ How to Simulate: sample_proportions(n, null_dist)

● 1 Sample: More Than 2 Categories (e.g. ethnicity distribution of jury panel)

○ Test Statistic: tvd(observed_dist, null_dist)
○ How to Simulate: sample_proportions(n, null_dist)

● 1 Sample: Numerical Data (e.g. scores in a lab section)

○ Test Statistic: observed_mean, abs(observed_mean - null_mean)
○ How to Simulate: population_data.sample(n, with_replacement=False)

● 2 Samples: Underlying Values (e.g. birth weights of smokers vs. non-smokers)

○ Test Statistic: group_a_mean - group_b_mean, group_b_mean - group_a_mean,
abs(group_a_mean - group_b_mean)
○ How to Simulate: observed_data.sample(with_replacement=False)
Smoking Caused Lower Birth Weight?
Importance of Random Assignment
We’ve concluded that, in the population, babies whose
mothers smoke weigh less on average than babies whose
mothers do not
● Is lower birth weight caused by maternal smoking?
● Can’t Tell:
○ Moms aren’t randomly assigned whether to smoke
○ Other factors contribute to their decision to smoke (e.g.
income, geography, diet)
Causality
Randomized Controlled Experiment
● Sample A: control group
● Sample B: treatment group
● If the treatment and control groups are selected at
random, then you can make causal conclusions.
● Any difference in outcomes between the two groups
could be due to
○ chance
○ the treatment
(Demo)
Before the Randomization
● In the population there is one imaginary ticket for each
of the 31 participants in the experiment.
● Each participant’s ticket looks like this:

[ Potential Outcome: outcome if assigned to treatment group | Potential Outcome: outcome if assigned to control group ]
The Data
16 randomly picked tickets show: Outcome if assigned to control group
The remaining 15 tickets show: Outcome if assigned to treatment group
The Hypotheses
● Null:
○ In the population, the distribution of all potential
control scores is the same as the distribution of all
potential treatment scores.
○ tl;dr the treatment has no effect
● Alternative:
○ In the population, more of the potential treatment
scores are 1 (pain improves) than the potential
control scores. (Demo)
Random Assignment & Shuffling
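Data Generation → Sample Data → Hypothesis Testing → Conclusions
● Observational Sample → Our Two-Sample Numerical Data → Shuffle Labels to Simulate from Null (Difference of Means Permutation Test) → Association
● Randomized Control Experiment → Our Two-Sample Numerical Data → Shuffle Labels to Simulate from Null (Difference of Means Permutation Test) → Causation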
P-Values and Error Probabilities
Discussion Question
There are 2000 students in a course. Each has a coin to test:
Null: The coin is fair
Alternative: The coin is unfair
Each student's test is based on:
● 1,000 tosses of the coin,
● the statistic = | number of heads - 500 |,
● and the 5% cutoff for the P-value.

Suppose all 2000 coins are fair. About how many students
will conclude that their coins are unfair?
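One way to explore this question is to simulate the whole class. The sketch below assumes 2000 students, each tossing a fair coin 1,000 times, and computes each P-value exactly from the binomial distribution instead of by simulation; the variable names are illustrative.

```python
import numpy as np
from math import comb

tosses = 1000
students = 2000
cutoff = 0.05

# Exact probabilities of 0, 1, ..., 1000 heads for a fair coin.
pmf = np.array([comb(tosses, k) / 2**tosses for k in range(tosses + 1)])
distances = np.abs(np.arange(tosses + 1) - 500)

# tail[d] = chance, under the null, that |heads - 500| is at least d
tail = np.array([pmf[distances >= d].sum() for d in range(501)])

rng = np.random.default_rng(0)
heads = rng.binomial(tosses, 0.5, size=students)  # one result per student
p_values = tail[np.abs(heads - 500)]

# Students whose P-value is at or below the cutoff reject the null.
rejections = int(np.sum(p_values <= cutoff))
print("students concluding 'unfair':", rejections, "out of", students)
```

Because every coin here is fair, each rejection is an error, and the share of students who reject should be close to (and, since the statistic is discrete, slightly below) the 5% cutoff.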
Statistic Simulated Under the Null

About 5% of the
area is to the right
of the gold line
Can the Conclusion be Wrong?
Yes.
                              Null is true    Alternative is true
Test favors the null              ✅                 ❌
Test favors the alternative       ❌                 ✅
An Error Probability
● The cutoff for the P-value is an error probability.

● If:
○ your cutoff is 5%
○ and the null hypothesis happens to be true

● then there is about a 5% chance that your test will
reject the null hypothesis.
P-value cutoff vs P-value
● P-value cutoff
○ Does not depend on observed data or simulation
○ Decide on it before seeing the results
○ Conventional values are 5% and 1%
○ Probability that the test rejects the null hypothesis when the null is actually true
● P-value
○ Depends on the observed data and simulation
○ Probability under the null hypothesis that the test statistic
is the observed value or further towards the alternative
More on Hypothesis Tests
Discussion Question
● Manufacturers of Super Soda run a taste test.
● 91 out of 200 tasters prefer Super Soda over its rival.
Question: Do fewer people prefer Super Soda than its rival, or is this
just chance?
Null hypothesis:
Alternative hypothesis:
Test statistic (a way to summarize the whole sample as a single
number):
p-value: Start at the observed statistic and look which way?
Discussion Question
● Manufacturers of Super Soda run a taste test.
● 91 out of 200 tasters prefer Super Soda over its rival.
Question: Do fewer people prefer Super Soda than its rival, or is this
just chance?
Null hypothesis: Half the people in the population prefer Super Soda.
Alternative hypothesis: fewer people in the population prefer Super
Soda than its rival
Test statistic (a way to summarize the whole sample as a single
number): The number of people in the sample who prefer Super Soda
p-value: Start at the observed statistic and look which way? left
(Demo)
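The test described above can be sketched with NumPy. Here `rng.binomial` stands in for drawing 200 tasters from a population in which half prefer Super Soda; it is an assumption of this sketch and not the function used in the course demo.

```python
import numpy as np

rng = np.random.default_rng(0)

sample_size = 200
observed_count = 91      # tasters in the sample who prefer Super Soda
null_proportion = 0.5    # null: half the population prefers Super Soda
repetitions = 10000

# Simulate the test statistic (count preferring Super Soda) under the null.
simulated_counts = rng.binomial(sample_size, null_proportion, size=repetitions)

# The alternative says "fewer", so start at the observed count and look left.
p_value = np.mean(simulated_counts <= observed_count)
print("empirical P-value:", p_value)
```

The direction of the comparison (`<=`) is what encodes the alternative hypothesis; a two-sided alternative would compare distances from 100 instead.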
Hypothesis Test Concerns
The outcome of a hypothesis test can be affected by:
● The hypotheses you investigate:
How do you define your null distribution?
● The test statistic you choose:
How do you measure a difference between samples?
● The empirical distribution of the statistic under the null:
How many times do you simulate under the null distribution?
● The data you collected:
Did you happen to collect a sample that is similar to the population?
● The truth:
If the alternative hypothesis is true, how extreme is the difference?

(Demo)
Changing number of simulations
Difference from the null
Hypothesis Test Effects
Number of simulations: Make it as large as is practical so
that the empirical distribution of the test statistic closely
approximates its true distribution under the null. No new
data needs to be collected.
Number of observations: A larger sample will lead you to
reject the null more reliably if the alternative is in fact true.
Difference from the null: If the null hypothesis is false, but
the truth is similar to the null hypothesis, then even a large
sample may not provide enough evidence to reject the null.
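The effect of the number of observations can be illustrated with a small simulation. The sketch below assumes a taste-test setup with a null proportion of 0.5 and a hypothetical true proportion of 0.45; the function `rejection_rate` and all numbers are illustrative.

```python
import numpy as np

def rejection_rate(sample_size, true_proportion, null_proportion=0.5,
                   cutoff=0.05, simulations=20000, experiments=2000, seed=0):
    """Fraction of repeated experiments in which the null is rejected."""
    rng = np.random.default_rng(seed)
    # Empirical distribution of the count under the null, sorted for lookup.
    null_counts = np.sort(rng.binomial(sample_size, null_proportion, size=simulations))
    # Counts actually observed when the truth differs from the null.
    observed_counts = rng.binomial(sample_size, true_proportion, size=experiments)
    # Left-tail empirical P-value: share of null counts <= observed count.
    p_values = np.searchsorted(null_counts, observed_counts, side="right") / simulations
    return np.mean(p_values <= cutoff)

small = rejection_rate(sample_size=200, true_proportion=0.45)
large = rejection_rate(sample_size=2000, true_proportion=0.45)
print("rejection rate with 200 observations:", small)
print("rejection rate with 2000 observations:", large)
```

With the truth this close to the null, the smaller sample rejects much less often than the larger one, which is the pattern the last two points describe.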
