Week 4 Modified
● If our test statistic is at least as extreme as the critical value, falling in the rejection region where the data are more “in favor” of the alternative, then we reject the null
Hypothesis testing example
● It is believed that a bag of Lays is actually only 40% full of chips, and the remaining 60% of the bag is just air. A factory worker states that this is no longer true and that Lays does not make bags that are 60% air. What are the null and alternative hypotheses?
○ H0: µ = 0.6 The mean air in a bag of Lays is equal to 60%.
○ HA: µ ≠ 0.6 The mean air in a bag of Lays is not equal to 60%.
● You can likely imagine that the probability of this event being
observed is EXTREMELY small
● P(T ≤ t5) ≈ 0, so we REJECT the null hypothesis in favor of the alternative hypothesis.
● Alternatively, once we have our test statistic, we can determine the critical value it must exceed for us to reject.
○ t0.02,5 = -2.7565. Since our test statistic is lower (more extreme), REJECT.
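A minimal sketch of both calculations in Python, assuming a hypothetical observed t statistic (the actual value would come from the sample data):

```python
from scipy import stats

df = 5            # degrees of freedom from the slide's t5 notation
t_stat = -12.0    # hypothetical observed test statistic (illustration only)

# Lower-tail probability under the null, matching the slide's P(T <= t5)
p_value = stats.t.cdf(t_stat, df)
print(f"p-value: {p_value:.6f}")        # effectively 0 for an extreme statistic

# Lower-tail critical value matching t_{0.02,5} on the slide
t_crit = stats.t.ppf(0.02, df)
print(f"critical value: {t_crit:.4f}")  # about -2.7565

# Reject when the observed statistic falls beyond the critical value
print("REJECT" if t_stat < t_crit else "FAIL TO REJECT")
```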
BREAK!
Other measures to assess linear regression
● We left off fitting a linear regression model, finding the Least Squares estimates and assessing our model using R².
● What other measures can we use to assess linear regression
models?
● The first is mean squared error (MSE): literally the average of the squared residuals (or errors).
MSE = (1/n) Σ (yᵢ − ŷᵢ)²
● The second is mean absolute error (MAE): literally the average of the absolute values of the residuals (or errors). A short numeric sketch of both follows after this list.
MAE = (1/n) Σ |yᵢ − ŷᵢ|
● We’ll look at this top down, starting with the F-test…
○ Top-down meaning we start at the broadest test and then narrow down our methods.
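As referenced above, here is a minimal numeric sketch of MSE and MAE; the observed and predicted values are made up purely for illustration:

```python
import numpy as np

y     = np.array([3.0, 5.0, 7.5, 9.0])   # observed values (illustrative)
y_hat = np.array([2.8, 5.4, 7.0, 9.6])   # model predictions (illustrative)

residuals = y - y_hat
mse = np.mean(residuals ** 2)     # average squared residual
mae = np.mean(np.abs(residuals))  # average absolute residual

print(f"MSE = {mse:.4f}, MAE = {mae:.4f}")
```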
F-tests
● Without getting into the statistical theory, all we need to know is that the F-test is much like a z-test or a t-test - it’s another statistic from a distribution.
● However, this test checks whether all the coefficients of the linear regression model are 0. It’s a hypothesis test!
○ H0: β1 = β2 = β3 = β4 = … = βp = 0
○ HA: βi ≠ 0
● In English, the null hypothesis states that your model is better off having no variables (just use the average value of your dependent variable to predict). The alternative is that at least one coefficient is not equal to 0.
● The p-value for the F-test is computed assuming the null is true. If it’s less than 5%, we REJECT the null hypothesis and have evidence that at least one of the coefficients is beneficial to our model!
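A hedged sketch of reading the overall F-test off a fitted model, using statsmodels on simulated data (the variable setup and numbers are illustrative assumptions, not the course’s dataset):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 100
X = rng.normal(size=(n, 3))                   # three candidate predictors
y = 2.0 + 1.5 * X[:, 0] + rng.normal(size=n)  # only the first predictor matters

model = sm.OLS(y, sm.add_constant(X)).fit()

# H0: all slope coefficients are 0; a small p-value means we reject
print(f"F-statistic:    {model.fvalue:.2f}")
print(f"F-test p-value: {model.f_pvalue:.4g}")
```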
Individual t-tests for coefficients!
● Some of you may have spotted that our coefficients had t-value statistics associated with them. What does this mean?...
● Since our model is based on our data, we ultimately come up
with a test statistic for our coefficients in the process!
● Our hypothesis testing for a simple linear regression model for
β1 would be:
○ H0: β1 = 0
○ HA: β1 ≠ 0
○ This means our null hypothesis is that the population coefficient for the first independent variable equals 0. A p-value less than 5% indicates we reject our null hypothesis.
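A small sketch of where those per-coefficient t-statistics and p-values come from, again with statsmodels on simulated data (the intercept and slope below are assumptions for illustration):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 0.5 + 2.0 * x + rng.normal(size=200)   # true slope is 2, so expect a rejection

results = sm.OLS(y, sm.add_constant(x)).fit()

print(results.tvalues)   # t-statistics for the intercept and the slope
print(results.pvalues)   # slope p-value < 0.05 => reject H0: β1 = 0
```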
Variable / Feature selection for linear models
● We have now established that the p-values for the coefficients are a good way of deciding whether a coefficient should be removed from our model. But what do we do now?
● Backward Selection is a process where we start with every variable in the model and remove one variable at a time until a stopping rule (like a p-value threshold, adjusted R², etc.) is hit, you cannot remove any more, or all are removed. (A rough sketch of this follows after this list.)
● Forward Selection is a process where we start with no variables in the model and add one variable at a time based on a selection criterion (greatest improvement to R², greatest reduction in MSE, etc.) until you can no longer add more or all are added.
● Stepwise Selection is a process where you add a variable,
evaluate, then try backward selection, and then repeat by
adding another variable, trying backward selection, etc.
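As referenced above, a rough sketch of backward selection using coefficient p-values as the stopping rule; the 0.05 threshold, the variable names, and the simulated data are all illustrative assumptions rather than the course’s exact procedure:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 200
X = pd.DataFrame(rng.normal(size=(n, 5)), columns=[f"Var{i+1}" for i in range(5)])
y = 1.0 + 2.0 * X["Var1"] - 1.5 * X["Var3"] + rng.normal(size=n)

threshold = 0.05
features = list(X.columns)
while features:
    fit = sm.OLS(y, sm.add_constant(X[features])).fit()
    pvals = fit.pvalues.drop("const")   # ignore the intercept
    worst = pvals.idxmax()              # least significant remaining variable
    if pvals[worst] <= threshold:       # stopping rule: everything left is significant
        break
    features.remove(worst)              # drop it and refit

print("Selected variables:", features)
```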
Challenges with step-selection methods
● Doing any of the previously described methods comes with potential challenges…
● If you have a lot of variables and a lot of data, refitting all these models can be a very computationally expensive task, so you must be less strict with your stopping rule … which leads to poorer model performance.
● A variable that was dropped, or never considered for addition, might be better to keep at a later step.
○ Say you did forward selection with 10 variables, and currently your selection stops after using Var1, Var4, and Var6. Maybe Var8 is worth adding and does a good job of helping your model, but only after you add a less significant variable like Var5.
Recommended Strategy