
Statistical Sampling & Parameter Estimation: Prof M.Shashi


Population and Sampling
• A population includes all entities of interest for a decision-making scenario.
• In most cases the population is vast and cannot be reached in full within reasonable constraints on time and effort.
• Sample is a subset of the population. Whether the subset includes
appropriate data to represent the population or not depends on the
purpose of taking the sample.
• Business decisions are generally made based on samples. Hence, unless mentioned otherwise, data considered for business analytics is treated as a sample.
Statistical Sampling
• Sampling is used for:
• estimating the central tendency and spread of the population using the mean, mode, median, variance, 5-point summary, etc.
• providing inputs for building decision models to understand trends and make predictions
• A sampling plan is a description of the approach used to obtain samples from the population.
Components of a Sampling Plan
• Objective of the sampling activity
• Target population
• Population frame
• Method of sampling
• Operational procedures to collect data
• Statistical tools to be used for data analysis
Sampling Methods
• Sampling methods can be either Subjective or Probabilistic.
• Subjective sampling methods include:
• Judgemental sampling, wherein an expert decides whom to sample (e.g. the best customers), and
• Convenience sampling, wherein samples are taken based on ease and feasibility (e.g. recent customers).
• Probabilistic sampling involves selecting items randomly from the
whole population.
• Probabilistic sampling is necessary for drawing valid statistical
conclusions.
Probabilistic Sampling Methods in Excel

• Simple Random Sampling
• Systematic or Periodic Sampling
• Stratified Sampling
• Clustered Sampling
• Sampling from a Continuous Process
In Excel, click Data Analysis in the Analysis group of the Data tab and select Sampling to open the Sampling dialog box.
This tool requires a numeric input range.
With simple random sampling, every subset of a given size (n) has an equal chance of being selected.
In periodic sampling of n items from a population of size p, the first item is selected at random from the first block of p/n items, and the remaining (n-1) items of the sample are then taken from the population at intervals of p/n items.
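The two selection rules above can be sketched in Python. This is a minimal illustration, not the Excel tool's implementation; the function names and the population of 100 numbered records are hypothetical.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n items without replacement; every subset of size n is equally likely."""
    rng = random.Random(seed)
    return rng.sample(population, n)

def systematic_sample(population, n, seed=None):
    """Random start inside the first block of k = p // n items, then every k-th item."""
    rng = random.Random(seed)
    k = len(population) // n      # block size (period); assumes p >= n
    start = rng.randrange(k)      # random position inside the first block
    return [population[start + i * k] for i in range(n)]

population = list(range(1, 101))  # hypothetical population of 100 numbered records
srs = simple_random_sample(population, 10, seed=1)
sys_sample = systematic_sample(population, 10, seed=1)
print(srs)
print(sys_sample)
```

In the systematic sample, consecutive items are always exactly p/n = 10 positions apart, so only the starting point is random.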
Probabilistic Sampling Methods contd…
• Stratified sampling applies to populations containing natural partitions (strata) and selects a proportionate number of items from each stratum. Disadvantage: proportionate selection can leave minority groups with negligible representation in the sample.
• Cluster Sampling involves sampling a set of clusters rather than individual
items so that all the items within a selected cluster are included in the
sample. It is easier and costs less compared to selecting individual items for
sampling large datasets.
• Sampling from a continuous process, such as a manufacturing process, is done by selecting a random time and including in the sample the chunk of n items arriving after that time, OR by selecting n time stamps at random and including the next item after each of these time stamps.
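Stratified and cluster sampling as described above can be sketched as follows. The function names, the regional strata and the store clusters are hypothetical, and rounding the proportionate quotas is a simplifying assumption (the total may differ slightly from n when the quotas do not divide evenly).

```python
import random
from collections import defaultdict

def stratified_sample(records, stratum_of, n, seed=None):
    """Proportionate allocation: each stratum gives about n * (stratum size / p) items."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[stratum_of(record)].append(record)
    sample = []
    for members in strata.values():
        quota = round(n * len(members) / len(records))   # rounded share of the sample
        sample.extend(rng.sample(members, min(quota, len(members))))
    return sample

def cluster_sample(clusters, m, seed=None):
    """Pick m whole clusters at random and keep every item inside them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), m)
    return [item for name in chosen for item in clusters[name]]

# Hypothetical population: 60 north, 30 south and 10 east records.
records = ([("north", i) for i in range(60)]
           + [("south", i) for i in range(30)]
           + [("east", i) for i in range(10)])
strat = stratified_sample(records, lambda r: r[0], n=10, seed=1)

# Hypothetical clusters: customers grouped by store.
clusters = {"store_a": [1, 2, 3], "store_b": [4, 5], "store_c": [6, 7, 8, 9]}
clus = cluster_sample(clusters, m=2, seed=1)
```

With a sample of 10, the smallest stratum contributes a single record, which illustrates how thinly a minority group is represented under proportionate selection.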
Estimating Population Parameters
Single point Estimates using Excel Functions
1. Mean is obtained using =AVERAGE(B2:B95)
The sum of the deviations from the mean over all observations is equal to 0.
2. Median is the middle value of an ordered list of observations. It is obtained using =MEDIAN(B2:B95), or by sorting the range and taking the middle observation (or the average of the two middle values if the list has an even number of observations).
3. Mode is the value with the highest frequency of occurrence in the given range. It is obtained using =MODE.SNGL(range) for a single mode or =MODE.MULT(range) for multiple modes.
For frequency distributions, the mode is the group / interval having the highest frequency.
4. Midrange is the average of the MIN and MAX values.
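For readers working outside Excel, the same point estimates can be computed with Python's standard `statistics` module. The data values below are hypothetical, and the pairing with the Excel functions in the comments is a rough correspondence rather than a guarantee of identical behaviour in every edge case.

```python
import statistics

data = [12, 15, 15, 18, 20, 22, 22, 30]     # hypothetical sample

mean = statistics.mean(data)                # counterpart of =AVERAGE(range)
median = statistics.median(data)            # counterpart of =MEDIAN(range)
all_modes = statistics.multimode(data)      # all modes, similar in spirit to =MODE.MULT(range)
first_mode = statistics.mode(data)          # one mode (the first encountered, Python 3.8+)
midrange = (min(data) + max(data)) / 2      # average of MIN and MAX

# The deviations from the mean cancel out.
deviation_sum = sum(x - mean for x in data)

print(mean, median, all_modes, first_mode, midrange, deviation_sum)
```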
Errors in Point Estimation
• The drawback of point estimates is that they give no indication of the magnitude of the potential error.
• Sampling error (the variation of estimates among samples) is inherent in any sampling process. It can be minimized but cannot be avoided altogether.
• Non-sampling error occurs when the sample does not represent the target population adequately. It is due to:
• poor sample design, such as convenience sampling where random sampling is appropriate, OR
• selection of the wrong population frame, OR
• less reliable data.
• The data analyst should eliminate non-sampling error.
Effect of size of the sample on sampling error
• Sampling error depends on the size of the sample (n) relative to the population size (p).
• Larger samples provide more accurate estimates of population parameters.
• The figures illustrate the variation of sampling error when the sample mean is used for estimating the population mean with different sample sizes.
• The population is uniformly distributed between 0 and 10, and hence the population mean is 5.
• It is estimated by the sample mean with varying sample sizes, and comparative results are shown in the next slide:
Experiment to observe Variation of Sampling Error
with Relative Size of the Sample (on 25 samples of each size)
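An experiment of this kind can be re-created with a short simulation. This is a sketch under stated assumptions: the sample sizes 10 and 100 are illustrative additions to the 25 and 500 mentioned in these slides, the seed is arbitrary, and the numbers produced will not match the original figures exactly.

```python
import random
import statistics

def sample_mean_spread(sample_size, num_samples=25, seed=0):
    """Draw num_samples samples from Uniform(0, 10) and summarise their means."""
    rng = random.Random(seed)
    means = []
    for _ in range(num_samples):
        sample = [rng.uniform(0, 10) for _ in range(sample_size)]
        means.append(statistics.mean(sample))
    # Average of the sample means, and their standard deviation (the observed standard error).
    return statistics.mean(means), statistics.stdev(means)

results = {size: sample_mean_spread(size) for size in (10, 25, 100, 500)}
for size, (avg, spread) in results.items():
    print(f"n={size:4d}  average of sample means={avg:.3f}  std dev of sample means={spread:.3f}")
```

The standard deviation of the 25 sample means shrinks as the sample size grows, while their average stays close to the population mean of 5.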
Application of Standard Error
Central Limit Theorem
• This theorem is one of the most important foundations for making systematic inferences in real-world scenarios.
• The Central Limit Theorem (CLT) states that if the sample size is large enough, the sampling distribution of the mean is approximately normally distributed regardless of the distribution of the population, and that the mean of the sampling distribution is the same as that of the population.
• In the experiment discussed in the previous slides:
• The population is uniformly distributed, yet the sampling distribution of the mean converges to the normal distribution as the size of the sample increases.
• In addition, if the population is normally distributed, the sampling distribution of the mean is also normally distributed for any sample size.
• Hence CLT allows us to apply the concepts and formulae derived for calculating
probabilities for normal distributions to draw conclusions about sample means.
Estimating Sampling Error using Empirical Rules
• According to the empirical rule, when an estimate follows a normal distribution, the true value of the parameter almost always (about 99.7% of the time) falls within three standard deviations of the estimated value.
• As per the central limit theorem, for large samples the sampling distribution of the mean is approximately normal and its mean is the same as the population mean (μ).
• Hence practically all sample means lie between m-3*s and m+3*s, where m is the centre of the distribution of sample means and s is its standard deviation (the standard error).
• From the given table, for sample sizes 25 and 500, we can empirically estimate the intervals as [3.65,6.35] and [4.76,5.24] respectively.
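The arithmetic behind these two intervals can be checked with a few lines of Python. The table itself is not reproduced in this text, so the standard errors 0.45 and 0.08 used below are assumptions back-calculated from the quoted intervals (one sixth of each interval's width); the centre 5 is the population mean stated earlier.

```python
def three_sigma_interval(center, std_error):
    """Interval center ± 3 * std_error, rounded to two decimals."""
    half_width = 3 * std_error
    return round(center - half_width, 2), round(center + half_width, 2)

# Standard errors are back-calculated assumptions, not values read from the table.
print(three_sigma_interval(5, 0.45))   # sample size 25  -> (3.65, 6.35)
print(three_sigma_interval(5, 0.08))   # sample size 500 -> (4.76, 5.24)
```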
Interval estimates
• Interval estimates provide a range of plausible values for a population parameter / characteristic based on a sample.
• Probability intervals are often centered on the mean or median.
• A 100(1-α)% probability interval is any interval [A,B] such that the
probability of falling between A and B is (1-α).
• Eg: In a normal distribution with mean (μ) and standard deviation (σ), μ ± σ is approximately a 68% probability interval around the mean. Here, the margin of error is σ.
• The 5th and 95th percentiles bound a 90% probability interval; the 2.5th and 97.5th percentiles bound a 95% probability interval.
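A probability interval for a normal distribution can be computed with `statistics.NormalDist` from the Python standard library. The sketch below uses a hypothetical mean of 50 and standard deviation of 4, and the helper name `central_probability_interval` is illustrative.

```python
from statistics import NormalDist

def central_probability_interval(dist, alpha):
    """Return (A, B) with probability alpha/2 in each tail, so P(A <= X <= B) = 1 - alpha."""
    return dist.inv_cdf(alpha / 2), dist.inv_cdf(1 - alpha / 2)

dist = NormalDist(mu=50, sigma=4)                 # hypothetical mean and standard deviation

# Probability of falling within one standard deviation of the mean (roughly 68%).
within_one_sigma = dist.cdf(54) - dist.cdf(46)

# Central interval that contains 95% of the probability.
low, high = central_probability_interval(dist, alpha=0.05)

print(round(within_one_sigma, 4), round(low, 2), round(high, 2))
```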
Confidence Intervals

Estimation of Confidence Interval for Mean

• Finding the z_{α/2} value in the statistical tables:
Eg: For a 95% confidence level, α=0.05, α/2=0.025 and 1-α/2=0.975. Search for 0.975 in the cells of the Z table and note the row and column values. The z_{α/2} value for 95% = row + column = 1.9+0.06 = 1.96.
• As the level of confidence, 1-α, decreases, z_{α/2} decreases and the confidence interval becomes narrower. A 99% confidence interval is wider than a 95% confidence interval for a given sample.
• We must trade off accuracy (a narrow interval with a low margin of error) against risk: a lower level of confidence gives a narrower interval but a higher chance that the interval does not contain the true mean.
• For a fixed level of risk or level of confidence, as the sample size increases, the standard error decreases and the confidence interval becomes narrower, which corresponds to a more accurate estimate.
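The table lookup and the effect of the confidence level can be sketched in Python. The sketch assumes the usual interval for a known population standard deviation, m ± z_{α/2}·(σ/√n), since the slide's own formula is not reproduced in this text. The sample mean 796 comes from the bottling example later in these slides; σ = 15 and n = 25 are assumed values that do not appear in the text.

```python
from math import sqrt
from statistics import NormalDist

def z_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Confidence interval for the mean when the population SD (sigma) is known."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)    # same value a Z table gives for 1 - alpha/2
    margin = z * sigma / sqrt(n)               # margin of error = z * standard error
    return sample_mean - margin, sample_mean + margin

z_95 = NormalDist().inv_cdf(0.975)             # about 1.96

# sigma=15 and n=25 are assumptions made for illustration only.
ci_95 = z_confidence_interval(796, sigma=15, n=25, confidence=0.95)
ci_99 = z_confidence_interval(796, sigma=15, n=25, confidence=0.99)
ci_95_big = z_confidence_interval(796, sigma=15, n=100, confidence=0.95)

print(round(z_95, 2), ci_95, ci_99, ci_95_big)
```

The 99% interval is wider than the 95% interval for the same sample, and quadrupling the sample size halves the width at a fixed confidence level.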
T-Distribution
• It is a probability distribution with a shape similar to the normal distribution but with larger variance, which leads to wider confidence intervals.
• It is used to model the uncertainty about the true standard deviation when it is unknown.
• t-distribution has a parameter, degrees of freedom (df) and as the degrees of freedom increases,
the t-distribution converges to standard normal distribution.
• As the sample size increases, df increases, and we can use z-values as in the previous formulae to estimate the confidence interval even if the population standard deviation is not known; the standard deviation estimated from the large sample is accepted as the value of σ.
• When there is doubt, or when only small or moderate-sized samples are available, it is better to use the t-distribution.
• Degrees of freedom is defined as the number of sample values that are free to vary. In general, df is the number of sample values minus the number of estimated parameters.
• Eg: Since the sample variance (s²) is computed using only one estimated parameter (the sample mean), it has df=(n-1), and so does the t-distribution based on it.
Confidence interval for Mean with unknown
population standard deviation
• The formula for a 100(1-α)% confidence interval for the population mean (μ) when the population variance is unknown is: m ± t_{α/2, n-1}·(s/√n), where the value t_{α/2, n-1} is found from t-distribution tables with (n-1) df, given the upper tail probability of α/2.
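The formula m ± t_{α/2, n-1}·(s/√n) can be sketched using only the Python standard library. The standard library has no t-distribution, so the critical value is found here by numerically integrating the t density and bisecting. That is an illustrative choice: a statistics package such as SciPy provides this value directly (`scipy.stats.t.ppf`). The fill-volume data and the search bound of 50 are hypothetical assumptions.

```python
from math import exp, lgamma, log, pi, sqrt
from statistics import mean, stdev

def t_cdf(x, df, steps=2000):
    """P(T <= x) for x >= 0, using Simpson's rule on the t density."""
    const = exp(lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi))

    def pdf(u):
        return const * (1 + u * u / df) ** (-(df + 1) / 2)

    h = x / steps
    total = pdf(0) + pdf(x)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return 0.5 + total * h / 3

def t_critical(alpha, df):
    """Critical value t_{alpha/2, df} by bisection; assumes it lies below 50."""
    target = 1 - alpha / 2
    lo, hi = 0.0, 50.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def t_confidence_interval(sample, confidence=0.95):
    """m ± t_{alpha/2, n-1} * s / sqrt(n) for an unknown population SD."""
    n = len(sample)
    m, s = mean(sample), stdev(sample)
    margin = t_critical(1 - confidence, n - 1) * s / sqrt(n)
    return m - margin, m + margin

volumes = [796, 801, 789, 805, 794, 799, 792, 803, 797, 790]   # hypothetical fill volumes (ml)
print(t_confidence_interval(volumes))
print(t_critical(0.05, 24), t_critical(0.05, 1000))
```

For 24 degrees of freedom the critical value is noticeably larger than the z-value of 1.96, while for 1000 degrees of freedom the two nearly coincide, which is the convergence described for the t-distribution.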
Confidence interval for Mean with unknown
population standard deviation contd…
Confidence Interval for Proportion
of a Categorical Variable
• For categorical variables like gender, the proportion of records having a specific value among all possibilities in the sample is of interest.
• An unbiased estimator of a population proportion π is the statistic / metric called the sample proportion, p̂ = x/n, where x records have the specific value in the sample of size n.
• A 100(1-α)% confidence interval for the proportion is p̂ ± z_{α/2}·√(p̂(1-p̂)/n).
Using Confidence intervals for Decision Making
• Drawing conclusions based on the confidence interval for the population mean
In Example 6.8, while we require a volume of 800 ml to be filled in a bottle, we found a sample average of 796 ml, and accordingly the 95% confidence interval is computed as [790.12, 801.88].
Although the sample mean is less than 800 ml, we rely on the confidence interval: it contains the desired value, so 800 ml remains a plausible value of the population mean at the 95% confidence level. However, if the sample mean were found to be 792, we would get the 95% confidence interval as [786.12, 797.88]. Then the manufacturer should check and adjust the equipment to meet the standard.
• Predicting election results based on the confidence interval of proportions
Suppose an exit poll of 1300 voters found that 692 voted for candidate A in a contest between A and B. Just because 692/1300=0.53 of the sample voters are in favour of A, we can't predict that A would win. Instead, the 95% confidence interval for the proportion is estimated as [0.505,0.559], and since the lower bound is also greater than 0.5, it is safe to predict that A would win.
However, if it was found that A has only 670 voters, the sample proportion (p̂) is 0.515 and the 95% confidence interval shifts down to [0.488, 0.543]. Then it is not wise to predict that A would win: due to possible sampling error, the population proportion could be less than 0.5 even though the sample proportion is greater than 0.5.
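The election example can be reproduced in a few lines of Python. The sketch assumes the normal-approximation interval p̂ ± z_{α/2}·√(p̂(1-p̂)/n), which yields the intervals quoted above; the function name and the decision messages are illustrative.

```python
from math import sqrt
from statistics import NormalDist

def proportion_confidence_interval(x, n, confidence=0.95):
    """Normal-approximation confidence interval for a population proportion."""
    p_hat = x / n                                   # sample proportion
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

for votes in (692, 670):
    low, high = proportion_confidence_interval(votes, 1300)
    # Predict a win only if the whole interval lies above 0.5.
    verdict = "predict A wins" if low > 0.5 else "too close to call"
    print(votes, round(low, 3), round(high, 3), verdict)
```

With 692 votes the lower bound stays above 0.5; with 670 votes it drops below 0.5, so no safe prediction can be made.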
Prediction Intervals
• A prediction interval is one that provides a range for predicting the value of a new observation from the same population.
• A prediction interval is associated with the distribution of a random variable, while a confidence interval is associated with the sampling distribution of a summary statistic.
• Prediction intervals are wider than confidence intervals.
• When the population SD is unknown, a 100(1-α)% prediction interval for a new observation is m ± t_{α/2, n-1}·s·√(1 + 1/n), where m and s are the sample mean and sample standard deviation.
