ML Unit2 SimpleLinearRegression pdf-60-97
Some Concepts
Sampling distribution of a statistic
The sampling distribution of a statistic is the probability distribution of the possible values of the statistic that results when random
samples of size 𝑛 are repeatedly drawn from a population
Suppose you randomly sampled 10 people from the population of women in Houston, Texas, between the ages of 21 and 35
years and computed the mean height of your sample.
This sample mean would not be equal to the mean of all the women in Houston
It might be somewhat lower or it might be somewhat higher, but it would not equal the population mean exactly.
Similarly, if a second sample of 10 people is taken from the same population,
the mean of the second sample will not necessarily equal the mean of the first sample
A critical part of inferential statistics involves determining how far sample statistics are likely to vary from each other and from
the population parameter
Sample statistics could be:
Mean, Mean absolute value of the deviation from the mean, Standard Deviation of the sample, Variance of the sample
In the above example, the statistic is the sample mean; the corresponding population parameter is the population mean
The numerical descriptive measures calculated from the samples are known as statistics
Ref-
David Scott, Mikki Hebl, Rudy Guerra, Dan Osherson, and Heidi Zimmer, Introduction to Statistics, Online edition
William Mendenhall, Robert Beaver, Barbara Beaver, Introduction to Probability and Statistics, Cengage, 14th edition
Sampling Distributions and Inferential
Statistics
We collect sample data
From this data we estimate parameters of the sampling distribution
This knowledge of the sampling distribution is useful for knowing the degree to which means from different samples would
differ from each other and from the population mean
It would give you a sense of how close your particular sample mean is likely to be to the population mean
This information is directly available from a sampling distribution
The most common measure of how much sample means differ from each other is the standard deviation of the sampling
distribution of the mean
This standard deviation is called the standard error of the mean
If all the sample means were very close to the population mean, then the standard error of the mean would be small
On the other hand, if the sample means varied considerably, then the standard error of the mean would be large
Example
Assume the sample mean is 125 and the estimated standard error of the mean is 5
If the sampling distribution is normal, the sample mean is likely to be within 10 units of the population mean, since most of a
normal distribution lies within two standard deviations of its mean
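The "within two standard errors" claim above can be checked directly from the normal CDF. A minimal sketch (the numbers 125 and 5 come from the slide's example):

```python
import math

def prob_within(delta, se):
    """P(|sample mean - population mean| < delta), assuming the sampling
    distribution of the mean is normal with standard error `se`."""
    z = delta / se
    # For a standard normal Z, P(|Z| < z) = erf(z / sqrt(2))
    return math.erf(z / math.sqrt(2))

# Slide example: sample mean 125, estimated standard error 5.
# "Within 10 units" means within two standard errors.
p = prob_within(10, 5)
print(round(p, 4))  # ~0.9545, i.e. about 95%
```

This recovers the familiar rule of thumb that roughly 95% of a normal distribution lies within two standard deviations of its mean.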
Sampling distribution of the mean
Mean
The mean of the sampling distribution of the mean is the
mean of the population from which the scores were sampled
If a population has a mean μ, then the mean of the
sampling distribution of the mean is also μ.
The symbol μ_M is used to refer to the mean of the sampling
distribution of the mean
The formula for the mean of the sampling distribution of the
mean can be written as: μ_M = μ
Sampling distribution of the mean
Variance
The variance of the sampling distribution of the mean is
computed as follows: σ²_M = σ²/n
That is, the variance of the sampling distribution of the mean
is the population variance divided by 𝑛, the sample size (the
number of observations used to compute each mean).
Thus, the larger the sample size, the smaller the variance of
the sampling distribution of the mean
Sampling distribution of the mean
Standard Error
The standard error of the mean is the standard deviation of
the sampling distribution of the mean. It is therefore the
square root of the variance of the sampling distribution of
the mean and can be written as: 𝜎𝑀 = 𝜎/ 𝑁
The standard error is represented by a σ because it is a
standard deviation
The subscript (M) indicates that the standard error in
question is the standard error of the mean
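The two results above (μ_M = μ and σ_M = σ/√n) can be verified by simulation: draw many samples, record each sample mean, and inspect the resulting distribution. A sketch with illustrative population parameters (μ = 100, σ = 15 are arbitrary choices, not from the slides):

```python
import random
import statistics

random.seed(0)
mu, sigma, n = 100.0, 15.0, 25   # illustrative population parameters
num_samples = 20000              # number of repeated samples

# Draw many samples of size n and record each sample mean.
means = [statistics.fmean(random.gauss(mu, sigma) for _ in range(n))
         for _ in range(num_samples)]

mean_of_means = statistics.fmean(means)   # should be close to mu
se_empirical = statistics.pstdev(means)   # should be close to sigma / sqrt(n)
se_theoretical = sigma / n ** 0.5         # 15 / 5 = 3.0
print(round(mean_of_means, 2), round(se_empirical, 2), se_theoretical)
```

With 20,000 repeated samples the empirical standard deviation of the sample means lands very close to the theoretical standard error of 3.0.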
Conditions for inference
A good sample must have the following characteristics
Representative of entire population
Big enough to draw conclusions from (n ≥ 30)
Randomly picked
Sampling distribution of the sample mean needs to be approximately normal
This is true if our parent population is normal
or if sample size is reasonably large (n ≥30)
Independent
Individual observations need to be independent
If sampling is done without replacement, the sample size should not be more than 10% of
the population
Need to know
If the sampled population is normal, then the sampling
distribution will also be normal
When the sampled population is approximately
symmetric, the sampling distribution becomes
approximately normal
When the sampled population is skewed, a sample size of
𝑛 ≥ 30 should be taken so that the sampling
distribution becomes approximately normal
Central limit theorem
The central limit theorem states that: Given a
population with a finite mean μ and a finite non-zero
variance σ², the sampling distribution of the mean
approaches a normal distribution with a mean of μ
and a variance of σ²/N as N, the sample size,
increases.
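The striking part of the theorem is that it holds even when the population is far from normal. A sketch using a heavily right-skewed exponential population (the Exp(1) choice is illustrative, not from the slides):

```python
import random
import statistics

random.seed(1)
lam = 1.0      # Exp(1) population: heavily right-skewed
n = 30         # sample size suggested by the slides for skewed populations
reps = 20000   # number of repeated samples

means = [statistics.fmean(random.expovariate(lam) for _ in range(n))
         for _ in range(reps)]

# The mean and variance of Exp(1) are both 1, so the CLT predicts the
# sample mean is approximately Normal(1, 1/30), despite the skewed parent.
print(round(statistics.fmean(means), 3),
      round(statistics.pvariance(means), 4))
```

The distribution of the 20,000 sample means is close to symmetric with mean near 1 and variance near 1/30 ≈ 0.0333, as the theorem predicts.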
Analysis of variance ANOVA
Analysis of Variance (ANOVA) is a statistical
method used to test differences between two or
more means
ANOVA is used to test general rather than specific
differences among means
Analysis of variance ANOVA for linear
regression
Divide total variation in y ("total sum of squares") into two components:
• due to the change in x ("regression sum of squares")
• due to random error ("error sum of squares")
• Data= Fit + Error
• SST= SSR+SSE
• Σ_{i=1}^{n} (y_i − ȳ)² = Σ_{i=1}^{n} (ŷ_i − ȳ)² + Σ_{i=1}^{n} (y_i − ŷ_i)²
If the regression sum of squares is a "large" component of the total sum of
squares
it suggests that there is a linear association between the predictor x and the
response y
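The decomposition SST = SSR + SSE can be computed directly from a fitted least-squares line. A minimal sketch (the x, y values below are hypothetical data chosen for illustration):

```python
# Hypothetical data for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

xbar = sum(x) / n
ybar = sum(y) / n

# Least-squares slope and intercept.
b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
      / sum((xi - xbar) ** 2 for xi in x))
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * xi for xi in x]   # fitted values

sst = sum((yi - ybar) ** 2 for yi in y)               # total sum of squares
ssr = sum((yh - ybar) ** 2 for yh in yhat)            # regression sum of squares
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))  # error sum of squares
print(round(sst, 4), round(ssr + sse, 4))  # the two values agree: SST = SSR + SSE
```

The identity holds exactly (up to floating-point rounding) for any least-squares fit, and the ratio SSR/SST is the familiar R-squared.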
ANOVA for linear regression
SST= SSR+SSE
The degrees of freedom associated with each of
these sums of squares follow a similar decomposition
That is
𝑑𝑓 𝑜𝑓 𝑆𝑆𝑇 = 𝑑𝑓 𝑜𝑓 𝑆𝑆𝑅 + 𝑑𝑓 𝑜𝑓 𝑆𝑆𝐸
For simple linear regression: (n − 1) = 1 + (n − 2)
Parameters of ANOVA
Verify whether the regression model provides a better fit to the data than a model that
contains no independent variables
Using the F-distribution table for α = 0.05, with numerator degrees of
freedom 2 (df for regression) and denominator degrees of freedom 9
Conclusion from the above example
We find that the F critical value is 4.2565
Since our F statistic (5.09) is greater than the F
critical value (4.2565),
We can conclude that the regression model as a
whole is statistically significant.
Example
Find the F statistic and t statistic for simple linear regression.
Do this example in your notebooks.

Math score x_i:             39  43  21  64  75  34  52
Final calculus grade y_i:   65  78  52  82  98  56  75
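The F and t statistics for this example can be computed from the standard simple-linear-regression formulas (the x, y pairs below are read off the slide's table):

```python
import math

# Data from the slide: math score x and final calculus grade y.
x = [39, 43, 21, 64, 75, 34, 52]
y = [65, 78, 52, 82, 98, 56, 75]
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b1 = sxy / sxx              # least-squares slope
b0 = ybar - b1 * xbar       # intercept

sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
ssr = b1 * sxy              # regression sum of squares
mse = sse / (n - 2)         # df of SSE = n - 2

f_stat = ssr / mse                    # ANOVA F statistic (df 1 and n - 2)
t_stat = b1 / math.sqrt(mse / sxx)    # t statistic for H0: slope = 0
print(round(f_stat, 2), round(t_stat, 2))
```

Note that F = t² here, which is exactly the equivalence of the ANOVA F-test and the t-test for the slope in simple linear regression.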
Hypothesis Using ANOVA
[Worked ANOVA computation shown as a figure in the original slides; only the fragment (−13.33)² = 177.7 is recoverable.]
Equivalence of ANOVA F-test and t-test
For simple linear regression, the two tests are equivalent: the ANOVA F statistic equals the square of the t statistic for the slope, F = t².
[Figure: the actual relationship between the variables vs. an overfit model]
Over-fitting
Overfitting is a modeling error that occurs when a function or model fits the
training set too closely and therefore performs drastically worse on the test set
A statistical model begins to describe
the random error in the data rather than the
relationships between variables
R-squared is a popular measure of quality of fit in regression
However it does not offer significant information about how well a given
regression model can predict future values
Overfitting leads to misleading R-squared values, regression coefficients,
and p-values
Overfitting a regression model reduces its generalizability outside the original
dataset.
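The train-vs-test gap that defines overfitting can be seen with a deliberately extreme model. A sketch (the data, the noise level, and the 1-nearest-neighbour "memorizer" are all illustrative choices, not from the slides): a least-squares line captures the real trend, while a model that memorizes every training point achieves zero training error yet generalizes poorly.

```python
import random

random.seed(2)

def true_fn(x):
    # The actual relationship between the variables (illustrative).
    return 2.0 * x + 1.0

# Hypothetical dataset: linear trend plus Gaussian noise.
train = [(x, true_fn(x) + random.gauss(0, 2)) for x in range(20)]
test = [(x + 0.5, true_fn(x + 0.5) + random.gauss(0, 2)) for x in range(20)]

# Model A: least-squares line (captures the real relationship).
n = len(train)
xbar = sum(x for x, _ in train) / n
ybar = sum(y for _, y in train) / n
b1 = (sum((x - xbar) * (y - ybar) for x, y in train)
      / sum((x - xbar) ** 2 for x, _ in train))
b0 = ybar - b1 * xbar

def line(x):
    return b0 + b1 * x

# Model B: 1-nearest-neighbour lookup (memorizes the training noise).
def nearest(x):
    return min(train, key=lambda p: abs(p[0] - x))[1]

def mse(model, data):
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

print(mse(line, train), mse(nearest, train))  # memorizer: zero training error
print(mse(line, test), mse(nearest, test))    # memorizer: typically worse on new data
```

The memorizer "describes the random error in the data rather than the relationship between the variables", which is exactly the failure mode described above.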
Detecting over-fit models: Cross validation
• We can detect overfitting by determining whether your model fits new data as
well as it fits the data used to estimate the model
• Used to estimate the behaviour of a large data set based on a small part of
the data set
• Evaluate machine learning models on a limited data sample
• Use k number of groups to split the dataset
• Called k-fold cross validation
• Randomly split the dataset into k fold/groups of equal size
• The first fold is held out as the validation set and the model is trained on the
remaining k-1 folds; the process then repeats with each fold in turn
K-fold cross validation
Choosing the right value of k is quite complex
Behaviour of the model is dependent on the dataset
Some ways of choosing the value of k are:
Each train/test group of data should be large enough to be statistically
representative
K=10, which has been found empirically to work well
K=n, where n is the size of the data set, so that each sample is
used for validation exactly once
This is called Leave One Out Cross Validation (LOOCV)
Ex: k-fold cross validation
Data samples: [1, 2, 3, 4, 5, 6]
K=3
Fold1 = [5,2], Fold2 = [1,3], Fold3 = [4,6]
Model1: trained on fold1 + fold2, tested on fold3
Model2: trained on fold2 + fold3, tested on fold1
Model3: trained on fold1 + fold3, tested on fold2
In the extreme case each fold contains a single sample (k = n, i.e., LOOCV)
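The 3-fold example above can be reproduced with a small helper that shuffles the data and partitions it into folds (the function name `kfold_splits` is a hypothetical helper for illustration; exact fold contents depend on the random shuffle):

```python
import random

def kfold_splits(data, k, seed=0):
    """Randomly partition `data` into k folds and yield
    (train, validation) pairs, one pair per fold."""
    items = list(data)
    random.Random(seed).shuffle(items)
    folds = [items[i::k] for i in range(k)]   # k roughly equal groups
    for i in range(k):
        validation = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, validation

# Slide example: 6 samples, k = 3 -> three models, each fold
# used exactly once as the validation set.
for train, val in kfold_splits([1, 2, 3, 4, 5, 6], k=3):
    print(sorted(train), sorted(val))
```

Each of the three iterations trains on 4 samples and validates on the remaining 2, so every sample is validated exactly once across the k models.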
Cross validation: The ideal procedure
• Divide data into three sets, training, validation and test sets
• Parameters of the regression model are estimated from the training data
• Accuracy is then measured on new (held-out) data
• The validation error gives an unbiased estimate of the predictive power of a
model.
K- fold Cross validation
Split the data into 5 folds
Fit a model to the training folds
Use the held-out fold to compute the cross-validation metric
Repeat the process for each remaining fold
References
Machine Learning, IBM
David Scott, Mikki Hebl, Rudy Guerra, Dan Osherson, and
Heidi Zimmer, Introduction to Statistics, Online edition
William Mendenhall, Robert Beaver, Barbara Beaver,
Introduction to probability and statistics, Cengage, 14th edition
Ref: https://online.stat.psu.edu/stat462/node/91/
https://openstax.org/details/books/introductory-business-statistics