Sample size
The sample size formula tells us how large a sample must be to represent the population accurately. Since it is not possible to survey the whole population, we take a sample from the population and then conduct the survey or research. The sample size is usually denoted by "n" or "N"; in the formula below it is written as "S". The sample size is determined in two steps: first, we calculate the sample size for an infinite population, and second, we adjust that sample size to the actual population. The sample size formula can be given as:
S = Z² × P × (1 − P) / M²
where:
Z = the z-score for the chosen confidence level
P = the estimated population proportion (0.5 is used as the worst case)
M = the margin of error
To calculate the required sample size, we need to find several other values and then substitute them into the appropriate formula. Let's look at the steps to follow when calculating the sample size.
When you are working with a smaller group, precision matters more because each individual carries greater statistical weight. For example, if you are surveying the employees of a very small business, you need to know the population size accurately, to within a dozen or so people.
When you are working on larger surveys, the figure can deviate from the actual population. For instance, if the chosen demographic includes everyone living in Canada, the size can be estimated at roughly 38 million people, even though the actual figure could vary by some hundreds of thousands.
The margin of error is the amount of error that can be tolerated in the study. It is a percentage that shows how close the sample results are expected to be to the true value for the overall population being studied.
A smaller margin of error gives more accurate answers, but choosing a small margin of error may require a larger sample.
The margin of error is usually presented as a plus-or-minus percentage alongside the survey results.
For instance, "35% of people choose option B, with a margin of error of +/- 5%." In this example, the margin of error indicates that, if the question were asked of the entire population, you can be confident that between 30% (35 − 5) and 40% (35 + 5) of people would choose option B.
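To make the trade-off between margin of error and sample size concrete, here is a minimal Python sketch; it assumes a 95% confidence level and the worst-case proportion p = 0.5 discussed later in this section:

```python
import math

# A sketch of how the margin of error shrinks as the sample grows.
# Assumes a 95% confidence level (z = 1.96) and the worst-case
# proportion p = 0.5 discussed later in this section.
z = 1.96
p = 0.5

for n in (100, 400, 1000):
    moe = z * math.sqrt(p * (1 - p) / n)
    print(f"n = {n:4d}  ->  margin of error = +/-{moe:.1%}")
```

Quadrupling the sample from 100 to 400 roughly halves the margin of error, which is why tighter margins get expensive quickly.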
When the confidence level is chosen as 95%, this means you can be 95% certain that the true result falls within the chosen margin of error. A higher confidence level gives a greater degree of certainty, but it requires a larger sample size. The most common confidence levels used in studies are 90%, 95%, and 99%.
Setting the confidence level to 95% in the example above means you are 95% confident that 30% to 40% of the total population would choose option B in the survey.
The standard deviation shows how much variation can be expected in the responses to the study.
Extreme results are more reliable than moderate ones. If 1% of survey responses are "No" and 99% are "Yes," the sample is very likely to represent the overall population accurately. If instead 55% answer "No" and 45% answer "Yes," there is a greater chance of error, because responses near a 50/50 split vary more from sample to sample.
Since this value is difficult to know before the survey is actually run, most people use 0.5 (50%), which is the worst-case percentage. Using this value guarantees that the calculated sample size is large enough to represent the overall population accurately within the chosen confidence level and confidence interval.
The z-score is a constant value determined by the chosen confidence level. It expresses the number of standard deviations between the population mean and a selected value on the standard normal distribution. The z-score is easy enough to calculate by hand, or you can use an online calculator. Because confidence levels are standardized, most researchers simply memorize the z-scores for the most commonly used confidence levels:
Confidence level    Z-score
80%                 1.28
85%                 1.44
90%                 1.65
95%                 1.96
99%                 2.58
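Putting the formula, the worst-case proportion, and the z-score table together, here is a minimal sketch of the two-step calculation in Python; the 10,000-person population used in the second call is an assumed figure for illustration:

```python
import math

def sample_size(z, p, m, population=None):
    """Two-step sample size: S = Z^2 * P * (1 - P) / M^2, then an
    optional adjustment for a finite population."""
    # Step 1: sample size for an (effectively) infinite population
    s = (z ** 2) * p * (1 - p) / m ** 2
    # Step 2: adjust the sample size to the required population
    if population is not None:
        s = s / (1 + (s - 1) / population)
    return math.ceil(s)

# 95% confidence (z = 1.96), worst-case p = 0.5, 5% margin of error
print(sample_size(1.96, 0.5, 0.05))                     # 385
print(sample_size(1.96, 0.5, 0.05, population=10_000))  # 370
```

The adjustment in step 2 is the standard finite-population correction; for very large populations it changes the answer only slightly, which is why the infinite-population figure of roughly 385 is so widely quoted.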
Power analysis
A power analysis is a calculation that helps you determine a minimum sample size for your
study. It’s made up of four main components. If you know or have estimates for any three of
these, you can calculate the fourth component.
Statistical power: the likelihood that a test will detect an effect of a certain size if there is one, usually set at 80% or higher.
Significance level (alpha): the maximum risk of rejecting a true null hypothesis that you are willing to take, usually set at 5%.
Expected effect size: a standardized way of expressing the magnitude of the expected result of your study, usually based on similar studies or a pilot study.
Sample size: the minimum number of observations needed to detect the expected effect; this is usually the component you solve for.
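As a sketch of how the four components fit together, the statsmodels library can solve for any one of them given the other three; the effect size of 0.5 below is an assumed medium effect, standing in for an estimate from a pilot study:

```python
from statsmodels.stats.power import TTestIndPower

# Solve for the minimum sample size per group, given the other
# three components of the power analysis.
analysis = TTestIndPower()
n = analysis.solve_power(
    effect_size=0.5,  # assumed medium effect (e.g. from a pilot study)
    power=0.80,       # 80% chance of detecting the effect if it exists
    alpha=0.05,       # 5% significance level
)
print(f"Minimum sample size per group: {n:.0f}")  # roughly 64
```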
Confidence interval
If you use an alpha value of 0.05 for statistical significance, then your confidence level is 1 − 0.05 = 0.95, or 95%.
Confidence intervals can be calculated for many kinds of statistical estimates, including:
Proportions
Population means
Differences between population means or proportions
Estimates of variation among groups
Point estimate
The point estimate of your confidence interval will be whatever statistical estimate you are making (e.g., a population mean, the difference between population means, a proportion, or variation among groups).
The alpha value is the probability threshold for statistical significance. The most common alpha
value is p = 0.05, but 0.1, 0.01, and even 0.001 are sometimes used. It’s best to look at
the research papers published in your field to decide which alpha value to use.
You will most likely use a two-tailed interval unless you are doing a one-tailed t test. For a two-
tailed interval, divide your alpha by two to get the alpha value for the upper and lower tails.
Look up the critical value that corresponds with the alpha value.
If your data follows a normal distribution, or if you have a large sample size (n > 30) that is approximately normally distributed, you can use the z distribution to find your critical values. For a z statistic, the most common values are those listed in the confidence-level table above.
If you are using a small dataset (n ≤ 30) that is approximately normally distributed, use the t distribution instead. The t distribution has the same general shape as the z distribution but corrects for small sample sizes. For the t distribution, you also need to know your degrees of freedom (sample size minus 1). Because both distributions are symmetric, the critical value is the same on either side of the mean.
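A short sketch of the lookup with scipy; the sample size of 15 in the t-distribution line is an arbitrary illustration:

```python
from scipy import stats

alpha = 0.05  # corresponds to a 95% confidence level

# Two-tailed interval: split alpha across the upper and lower tails.
z_crit = stats.norm.ppf(1 - alpha / 2)          # large samples: 1.96
t_crit = stats.t.ppf(1 - alpha / 2, df=15 - 1)  # small sample (n = 15): ~2.14

print(f"z critical value: {z_crit:.2f}")
print(f"t critical value (df = 14): {t_crit:.2f}")
```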
The confidence interval for data which follows a standard normal distribution is:
CI = X̄ ± Z* × (σ / √n)
Where:
X̄ = the population mean
Z* = the critical value of the z distribution
σ = the population standard deviation
n = the sample size
The confidence interval for the t distribution follows the same formula, but replaces Z* with t*.
In real life, you never know the true values for the population (unless you can run a complete census). Instead, we replace the population values with values from our sample data. For a proportion, the formula becomes:
CI = p̂ ± Z* × √( p̂(1 − p̂) / n )
Where:
p̂ = the proportion in your sample (e.g. the proportion of respondents who said they watched any television at all)
Z*= the critical value of the z distribution
n = the sample size
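Here is a minimal sketch of that proportion formula in Python; the figures (700 of 1,000 respondents saying they watched any television) are assumed for illustration:

```python
import math
from scipy import stats

# Assumed illustrative data: 700 of 1,000 respondents said they
# watched any television at all.
successes, n = 700, 1_000
p_hat = successes / n

alpha = 0.05
z_star = stats.norm.ppf(1 - alpha / 2)  # 1.96 for a 95% interval

margin = z_star * math.sqrt(p_hat * (1 - p_hat) / n)
print(f"95% CI: {p_hat - margin:.3f} to {p_hat + margin:.3f}")  # 0.672 to 0.728
```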
If your data does not follow a normal distribution, you have two options:
1. You can find a distribution that matches the shape of your data and use that distribution to calculate the confidence interval.
2. You can perform a transformation on your data to make it fit a normal distribution, and then find the confidence interval for the transformed data.
Hypothesis Testing
Hypothesis testing is a type of statistical analysis in which you put your assumptions about a population parameter to the test. It is used to estimate the relationship between two statistical variables. For example:
A teacher assumes that 60% of his college's students come from lower-middle-class families.
A doctor believes that 3D (Diet, Dose, and Discipline) is 90% effective for diabetic patients.
Hypothesis Testing Formula
Z = ( x̅ − μ0 ) / (σ / √n)
where x̅ is the sample mean, μ0 is the hypothesized population mean, σ is the population standard deviation, and n is the sample size.
The Null Hypothesis is the assumption that the event will not occur. A null hypothesis has no bearing on the study's outcome unless it is rejected. It is denoted H0 and pronounced "H-naught." The Alternative Hypothesis is the logical opposite of the null hypothesis: it is accepted when the null hypothesis is rejected. It is denoted H1.
A sanitizer manufacturer claims that its product kills 95 percent of germs on average. To put this claim to the test, the hypotheses would be H0: the product kills 95 percent of germs on average, and H1: the product does not kill 95 percent of germs on average.
Another straightforward example is determining whether a coin is fair and balanced. The null hypothesis states that the probability of heads is equal to the probability of tails; the alternative hypothesis states that the two probabilities are different.
Let's consider a hypothesis test for the average height of women in the United States. Suppose our null hypothesis is that the average height is 5'4" (64 inches). We gather a sample of 100 women and find their average height is 5'5" (65 inches). The population standard deviation is 2 inches.
z = ( x̅ − μ0 ) / (σ / √n)
z = (65 − 64) / (2 / √100)
z = 1 / 0.2
z = 5
We reject the null hypothesis, since a z-score of 5 is far beyond the 1.96 critical value for a 95% confidence level, and conclude that there is evidence to suggest that the average height of women in the US is greater than 5'4".
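The same calculation, sketched in Python with the figures from the example (heights in inches):

```python
import math
from scipy import stats

# The worked example above, with heights expressed in inches.
x_bar = 65   # sample mean: 5'5"
mu_0 = 64    # hypothesized population mean: 5'4"
sigma = 2    # population standard deviation
n = 100      # sample size

z = (x_bar - mu_0) / (sigma / math.sqrt(n))  # (65 - 64) / 0.2 = 5.0
p_value = 2 * stats.norm.sf(abs(z))          # two-tailed p-value

print(f"z = {z:.2f}, p = {p_value:.2g}")  # z = 5.00, p ~ 5.7e-07
```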
Null Hypothesis (H0): This hypothesis states that there is no effect or difference, and it is the
hypothesis you attempt to reject with your test.
Alternative Hypothesis (H1 or Ha): This hypothesis is what you might believe to be true or
hope to prove true. It is usually considered the opposite of the null hypothesis.
The significance level, often denoted by alpha (α), is the probability of rejecting the null
hypothesis when it is true. Common choices for α are 0.05 (5%), 0.01 (1%), and 0.10 (10%).
Choose a statistical test based on the type of data and the hypothesis. Common tests include t-
tests, chi-square tests, ANOVA, and regression analysis. The selection depends on data type,
distribution, sample size, and whether the hypothesis is one-tailed or two-tailed.
Collect Data
Gather the data that will be analyzed in the test. To infer conclusions accurately, this data should
be representative of the population.
Calculate the Test Statistic
Based on the collected data and the chosen test, calculate a test statistic that reflects how much the observed data deviates from the null hypothesis.
Determine the p-value
The p-value is the probability of observing test results at least as extreme as the results observed, assuming the null hypothesis is correct. It helps determine the strength of the evidence against the null hypothesis.
If the p-value ≤ α: Reject the null hypothesis, suggesting that the data provide sufficient evidence to support the alternative hypothesis.
If the p-value > α: Do not reject the null hypothesis, suggesting insufficient evidence to support the alternative hypothesis.
Report the Results
Present the findings from the hypothesis test, including the test statistic, p-value, and the conclusion about the hypotheses.
Perform Post-hoc Analysis (if necessary)
Depending on the results and the study design, further analysis may be needed to explore the data more deeply or to address multiple comparisons if several hypotheses were tested simultaneously.
1. Z Test
A z-test compares a sample mean to a population mean when the population standard deviation is known and the sample size is large (typically n > 30), as in the height example above.
2. T Test
A statistical test called a t-test is employed to compare the means of two groups. It is frequently used in hypothesis testing to determine whether two groups differ, or whether a procedure or treatment affects the population of interest.
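A sketch of a two-sample t-test with scipy; the scores for the two groups are made-up numbers, purely to show the call:

```python
from scipy import stats

# Made-up scores for two independent groups, purely illustrative.
group_a = [23, 25, 28, 30, 27, 26]
group_b = [31, 29, 33, 35, 32, 30]

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```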
3. Chi-Square
A chi-square test is used for hypothesis testing about whether your data is as predicted. To determine whether the expected and observed results fit well, the chi-square test analyzes the differences between categorical variables from a random sample. The test's fundamental premise is that the observed values in your data are compared to the values that would be expected if the null hypothesis were true.
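For instance, a goodness-of-fit sketch with scipy, comparing observed category counts against the counts expected under the null hypothesis (the counts are invented for illustration):

```python
from scipy import stats

# Invented counts: observed category frequencies vs. the frequencies
# expected if the null hypothesis were true (totals must match).
observed = [48, 35, 17]
expected = [50, 30, 20]

chi2, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
print(f"chi-square = {chi2:.2f}, p = {p_value:.4f}")
```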
4. ANOVA
ANOVA, or Analysis of Variance, is a statistical method used to compare the means of three or
more groups. It’s particularly useful when you want to see if there are significant differences
between multiple groups. For instance, in business, a company might use ANOVA to analyze
whether three different stores are performing differently in terms of sales. It’s also widely used in
fields like medical research and social sciences, where comparing group differences can provide
valuable insights.
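Continuing the store example, a one-way ANOVA sketch with scipy; the weekly sales figures are invented for illustration:

```python
from scipy import stats

# Invented weekly sales for three stores, echoing the example above.
store_1 = [102, 98, 110, 105, 99]
store_2 = [120, 115, 118, 122, 117]
store_3 = [101, 103, 97, 100, 104]

f_stat, p_value = stats.f_oneway(store_1, store_2, store_3)
print(f"F = {f_stat:.2f}, p = {p_value:.4g}")
```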
The one-tailed test, also called a directional test, considers a critical region of data that results in the null hypothesis being rejected if the test sample falls into it, which in turn means accepting the alternative hypothesis. In a one-tailed test, the critical distribution area is one-sided: the test checks whether the sample statistic is greater than, or less than, a specific value.
In a two-tailed test, the critical distribution area is two-sided, and the sample statistic is checked against both ends of a range of values. If the sample falls outside this range, in either tail, the null hypothesis is rejected and the alternative hypothesis accepted.
The crucial point is that the alternative hypothesis (H1), not the null hypothesis, decides whether the test is one-tailed and, if so, in which direction.
Example:
Suppose H0: mean = 50 and H1: mean ≠ 50. According to H1, the mean can be either greater than or less than 50, so this is a two-tailed test. Similarly, if H0: mean ≥ 50, then H1: mean < 50. Here H1 only allows values below 50, so this is a one-tailed (left-tailed) test.
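The practical difference shows up in how the p-value is computed from the same test statistic, as in this sketch (the z value of 1.8 is an arbitrary example):

```python
from scipy import stats

z = 1.8  # an arbitrary example test statistic

# Two-tailed (H1: mean != 50): extreme values in either direction count.
p_two_tailed = 2 * stats.norm.sf(abs(z))  # ~0.072

# One-tailed (e.g. H1: mean < 50): only one direction counts.
p_one_tailed = stats.norm.sf(abs(z))      # ~0.036

print(f"two-tailed p = {p_two_tailed:.3f}, one-tailed p = {p_one_tailed:.3f}")
```

At α = 0.05, the same statistic would be significant one-tailed but not two-tailed, which is why the direction of H1 must be fixed before the data are collected.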
Type I Error: A Type I error occurs when the sample results lead to rejecting the null hypothesis even though it is true.
Type II Error: A Type II error occurs when the null hypothesis is not rejected even though it is false.
Example:
Suppose a teacher evaluates the examination paper to decide whether a student passes or fails.
H0: The student has passed. H1: The student has failed. A Type I error would be the teacher failing the student [rejecting H0] although the student scored passing marks [H0 was true]. A Type II error would be the teacher passing the student [not rejecting H0] although the student did not score passing marks [H1 is true].
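A quick simulation makes the Type I error rate concrete: when the null hypothesis is true, a test at α = 0.05 should reject it in roughly 5% of repeated samples. This sketch assumes normally distributed data and uses a one-sample t-test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=42)
alpha = 0.05
trials = 10_000
false_rejections = 0

for _ in range(trials):
    # H0 is true by construction: every sample comes from a mean-0 population.
    sample = rng.normal(loc=0.0, scale=1.0, size=30)
    _, p_value = stats.ttest_1samp(sample, popmean=0.0)
    if p_value < alpha:  # a Type I error: rejecting a true H0
        false_rejections += 1

print(f"Observed Type I error rate: {false_rejections / trials:.3f}")  # ~0.05
```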