
Sample Size and Power

Sample size

It is the number of observations or individuals included in a study or experiment: the number of
individuals, items, or data points selected from a larger population to represent it statistically.
Choosing a sample size that is too small may lead to inconclusive results, whereas one that is too
large can waste resources and complicate data management.

A larger sample size can enhance the precision of estimates, leading to a narrower margin of
error. A larger sample size can also increase the power of a statistical test, meaning that with a
larger sample you are more likely to detect an effect that actually exists. A good study finds the
most accurate results with the fewest subjects.

Sample Size Formula

The sample size formula helps us find an accurate sample size by accounting for the difference
between the population and the sample. Since it is not possible to survey the whole population,
we take a sample from the population and then conduct a survey or research. The sample size is
usually denoted by "n" or "N"; here, it is written as "S". The sample size formula is determined
in two steps: first, we calculate the sample size for an infinite population, and second, we adjust
that sample size to the required population. The sample size formula can be given as:

Formula 1: Sample size for infinite population

S = Z² × P × (1 − P) / M²

Formula 2: Adjusted sample size

Adjusted Sample Size = S / (1 + (S − 1) / Population)

where,

 S = sample size for infinite population
 Z = Z score
 P = population proportion (assumed as 50% or 0.5)
 M = margin of error

Note: Z score is determined based on the confidence level.
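
To make the two-step calculation concrete, here is a minimal Python sketch; the function name
and the example numbers (95% confidence, 5% margin of error, a population of 2,000) are
illustrative assumptions, not part of the original formulas:

import math

def sample_size(z, p, margin, population=None):
    """Sample size via S = Z^2 * P * (1 - P) / M^2, optionally
    adjusted for a finite population of the given size."""
    s = (z ** 2) * p * (1 - p) / (margin ** 2)    # Formula 1: infinite population
    if population is not None:
        s = s / (1 + (s - 1) / population)        # Formula 2: finite-population adjustment
    return math.ceil(s)                           # round up to a whole respondent

# 95% confidence (Z = 1.96), worst-case proportion 0.5, 5% margin of error
print(sample_size(1.96, 0.5, 0.05))                    # ~385 for an infinite population
print(sample_size(1.96, 0.5, 0.05, population=2000))   # ~323 after adjustment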

How to Apply Sample Size Formula?

In order to calculate the required sample size, we need to find several other sets of values and
then substitute them into an appropriate formula. Let's look at the steps to be followed to
calculate the sample size.

Step 1: Determining Key Values


One of the key values to be determined is the population size, which refers to the total number of
people within the required demographic. For much larger studies, you can consider using an
approximated value instead of a precise number.

 When you are working with a smaller group, precision plays a major role in having a greater
statistical impact. For example, if you are performing a survey among the employees of a
very small business, then you need to make sure that the population size is accurate to within
a dozen or so people.
 When you are working on larger surveys, the estimate may deviate from the actual
population. For instance, if the chosen demographic includes everyone living in Canada,
then the size can be estimated at roughly 30 million people, although the actual size could
vary by some hundreds of thousands.

Step 2: Determining the margin of error or confidence interval

The margin of error is considered to be the amount of error that can be allowed in the study. The
margin of error is actually a percentage that shows how close the sample results will be with
respect to the true value of the overall population that is considered in the study.

 Usually, you can obtain more accurate answers with a smaller margin of error, but if a small
margin of error is chosen, then you may require a larger sample.
 The margin of error is usually represented as a plus-or-minus percentage when the results
of a survey are presented.

For instance, "35% of people choose option B, with a margin of error of +/- 5%". In this
example, the margin of error indicates that, if the question were asked of the entire population,
you are confident that between 30% (35 − 5) and 40% (35 + 5) of the people would agree with
option B.

Step 3: Setting the confidence level.


The confidence level is pretty closely related to the margin of error or confidence interval. This
value is used to measure the degree of certainty about how well a sample actually represents the
entire population within the margin of error chosen for the study.

 When the confidence level is chosen as 95%, then this means that you can be 95% certain
that the results will accurately fall within the margin of error chosen by you.
 Choosing a higher confidence level gives a greater degree of certainty, but it generally
requires a larger sample size. Some of the most common confidence levels used in studies
are 90%, 95%, and 99%.
 When the confidence level is set to 95%, then it shows that you are 95% confident that 30%
to 40% of the total chosen population would agree with option B of the survey.

Step 4: Specifying the standard deviation.

The standard deviation shows how much variation can be expected in the responses of the
study.

 Extreme answers are more likely to be accurate than moderate results.
 For example, if 1% of the survey responses say "No" and 99% answer "Yes", then the
sample likely represents the overall population in an accurate manner.
 In another case, if 55% answer "No" and 45% answer "Yes", then there is a greater chance
of error.

Since this value is difficult to calculate before running an actual survey, most people use 0.5
(50%), which is the worst-case scenario percentage. Using this value guarantees that the
calculated sample size is large enough to represent the overall population within the chosen
confidence level and confidence interval.

Step 5: Finding the Z-score.

The Z-score can be considered as a constant value that is set automatically depending on the
confidence level. Z-score shows the number of standard deviations or the standard normal score
between the average/mean of the population and any selected value.

The Z-score is easy enough to calculate by hand, or you can find an online calculator.

Due to the fact that the confidence levels are all standardized, most researchers actually
memorize the required z-score for most of the commonly used confidence levels:

Confidence Level    Z-score

80%                 1.28
85%                 1.44
90%                 1.65
95%                 1.96
99%                 2.58
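
Rather than memorizing the table, these z-scores can be reproduced with scipy's standard
normal quantile function (a sketch; note that the 90% value is 1.6449, which different sources
round to 1.64 or 1.65):

from scipy.stats import norm

# For a two-sided confidence level, the z-score is the quantile at 1 - alpha/2.
for level in (0.80, 0.85, 0.90, 0.95, 0.99):
    alpha = 1 - level
    z = norm.ppf(1 - alpha / 2)   # inverse of the standard normal CDF
    print(f"{level:.0%}: z = {z:.2f}")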

Power analysis

A power analysis is a calculation that helps you determine a minimum sample size for your
study. It is made up of four main components, listed below. If you know or have estimates for
any three of these, you can calculate the fourth component (see the sketch after the list).

 Statistical power: the likelihood that a test will detect an effect of a certain size if there
is one, usually set at 80% or higher.

 Sample size: the minimum number of observations needed to observe an effect of a
certain size with a given power level.

 Significance level (alpha): the maximum risk of rejecting a true null hypothesis that you
are willing to take, usually set at 5%.

 Expected effect size: a standardized way of expressing the magnitude of the expected
result of your study, usually based on similar studies or a pilot study.
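
As a rough illustration of solving for one component from the other three, here is a sketch
using statsmodels' power tools for a two-sample t test; the effect size of 0.5 (Cohen's d) is an
assumed value for illustration:

from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n = analysis.solve_power(effect_size=0.5,  # expected effect size (Cohen's d)
                         power=0.8,        # statistical power
                         alpha=0.05)       # significance level
print(f"Minimum sample size per group: {n:.0f}")  # about 64 per group

Passing any three of effect_size, nobs1, alpha, and power while leaving the fourth unset tells
solve_power which component to solve for.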

Confidence interval

The confidence interval is the range of values that you expect your estimate to fall between a
certain percentage of the time if you run your experiment again or re-sample the population in
the same way. The confidence level is the percentage of times you expect to reproduce an
estimate between the upper and lower bounds of the confidence interval, and is set by the alpha
value.

Confidence, in statistics, is another way to describe probability. For example, if you construct a
confidence interval with a 95% confidence level, you are confident that 95 out of 100 times the
estimate will fall between the upper and lower values specified by the confidence interval. Your
desired confidence level is usually one minus the alpha (α) value you used in your statistical
test:

Confidence level = 1 − α

So if you use an alpha value of 0.05 for statistical significance, then your confidence level
would be 1 − 0.05 = 0.95, or 95%.

When do you use confidence intervals?


You can calculate confidence intervals for many kinds of statistical estimates, including:

 Proportions
 Population means
 Differences between population means or proportions
 Estimates of variation among groups

Calculating a confidence interval:


Most statistical programs will include the confidence interval of the estimate when you run a
statistical test. If you want to calculate a confidence interval on your own, you need to know:

 The point estimate you are constructing
 The critical values for the test statistic
 The standard deviation of the sample
 The sample size

Point estimate

The point estimate of your confidence interval will be whatever statistical estimate you are
making (e.g., population mean, the difference between population means, proportions, variation
among groups).

Finding the critical value


Critical values tell you how many standard deviations away from the mean you need to go in
order to reach the desired confidence level for your confidence interval.

There are three steps to find the critical value.

1. Choose your alpha (α) value.

The alpha value is the probability threshold for statistical significance. The most common alpha
value is α = 0.05, but 0.1, 0.01, and even 0.001 are sometimes used. It's best to look at
the research papers published in your field to decide which alpha value to use.

2. Decide if you need a one-tailed interval or a two-tailed interval.

You will most likely use a two-tailed interval unless you are doing a one-tailed t test. For a two-
tailed interval, divide your alpha by two to get the alpha value for the upper and lower tails.

3. Look up the critical value that corresponds with the alpha value.
If your data follows a normal distribution, or if you have a large sample size (n > 30) that is
approximately normally distributed, you can use the z distribution to find your critical values.

For a z statistic, some of the most common values are shown in this table:

Confidence level          90%     95%     99%
alpha for one-tailed CI   0.1     0.05    0.01
alpha for two-tailed CI   0.05    0.025   0.005
z statistic               1.64    1.96    2.57

If you are using a small dataset (n ≤ 30) that is approximately normally distributed, use
the t distribution instead. The t distribution follows the same shape as the z distribution, but
corrects for small sample sizes. For the t distribution, you need to know your degrees of
freedom (sample size minus 1). For symmetric distributions like the t distribution and
the z distribution, the critical value is the same on either side of the mean.
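
A small sketch of looking up both kinds of critical values with scipy (the sample size of 15 for
the t example is an assumption for illustration):

from scipy.stats import norm, t

alpha = 0.05                          # for a 95% two-tailed interval
z_crit = norm.ppf(1 - alpha / 2)      # large samples (n > 30): z distribution
t_crit = t.ppf(1 - alpha / 2, df=14)  # small sample, e.g. n = 15 -> df = 14

print(f"z* = {z_crit:.3f}")   # 1.960
print(f"t* = {t_crit:.3f}")   # 2.145, wider to correct for the small sample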

Confidence interval for the mean of normally-distributed data


Normally-distributed data forms a bell shape when plotted on a graph, with the sample mean in
the middle and the rest of the data distributed fairly evenly on either side of the mean.

The confidence interval for data which follows a standard normal distribution is:

CI = X̄ ± Z* × (σ / √n)

Where:

 CI = the confidence interval
 X̄ = the population mean
 Z* = the critical value of the z distribution
 σ = the population standard deviation
 √n = the square root of the sample size

The confidence interval for the t distribution follows the same formula, but replaces the Z* with
the t*.

In real life, you never know the true values for the population (unless you can do a complete
census). Instead, we replace the population values with the values from our sample data, so the
formula becomes:

CI = x̂ ± Z* × (s / √n)

Where:

 x̂ = the sample mean
 s = the sample standard deviation
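
Here is a minimal sketch of this calculation in Python, using made-up data and the t
distribution since the sample is small (n ≤ 30):

import numpy as np
from scipy import stats

# Hypothetical sample data for illustration
data = np.array([4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0, 5.1, 4.9])
n = len(data)
x_bar = data.mean()                    # sample mean
s = data.std(ddof=1)                   # sample standard deviation
t_star = stats.t.ppf(0.975, df=n - 1)  # 95% critical value for a small sample

margin = t_star * s / np.sqrt(n)
print(f"95% CI: {x_bar - margin:.3f} to {x_bar + margin:.3f}")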

Confidence interval for proportions


The confidence interval for a proportion follows the same pattern as the confidence interval for
means, but in place of the standard deviation you use the sample proportion times one minus the
proportion:

CI = p̂ ± Z* × √( p̂ × (1 − p̂) / n )

Where:

 p̂ = the proportion in your sample (e.g. the proportion of respondents who said they
watched any television at all)
 Z* = the critical value of the z distribution
 n = the sample size
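
A short sketch of the proportion interval in Python; the proportion of 0.35 and the sample size
of 500 are assumed values echoing the earlier option B example:

import math
from scipy.stats import norm

p_hat = 0.35               # hypothetical sample proportion (35% chose option B)
n = 500                    # hypothetical sample size
z_star = norm.ppf(0.975)   # 95% confidence -> 1.96

margin = z_star * math.sqrt(p_hat * (1 - p_hat) / n)
print(f"95% CI: {p_hat - margin:.3f} to {p_hat + margin:.3f}")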

Confidence interval for non-normally distributed data


To calculate a confidence interval around the mean of data that is not normally distributed, you
have two choices:

1. You can find a distribution that matches the shape of your data and use that distribution
to calculate the confidence interval.
2. You can perform a transformation on your data to make it fit a normal distribution, and
then find the confidence interval for the transformed data (a sketch of this approach follows).
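
For example, here is a sketch of the second choice, assuming right-skewed (log-normal) data;
note that back-transforming the interval gives a confidence interval for the geometric mean
rather than the arithmetic mean:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=1.0, sigma=0.8, size=40)  # right-skewed data

# Option 2: log-transform to approximate normality, find the CI, transform back.
logged = np.log(skewed)
n = len(logged)
t_star = stats.t.ppf(0.975, df=n - 1)
margin = t_star * logged.std(ddof=1) / np.sqrt(n)
lo, hi = logged.mean() - margin, logged.mean() + margin
print(f"95% CI for the (geometric) mean: {np.exp(lo):.2f} to {np.exp(hi):.2f}")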

Hypothesis Testing

Hypothesis testing is a type of statistical analysis in which you put your assumptions about a
population parameter to the test. It is used to assess the relationship between two statistical
variables. For example:

 A teacher assumes that 60% of his college's students come from lower-middle-class families.
 A doctor believes that 3D (Diet, Dose, and Discipline) is 90% effective for diabetic patients.

Hypothesis Testing Formula

Z = ( x̅ – μ0 ) / (σ /√n)

 Here, x̅ is the sample mean,
 μ0 is the population mean,
 σ is the standard deviation,
 n is the sample size.

Null and alternate hypothesis

The Null Hypothesis is the assumption that the event will not occur. A null hypothesis has no
bearing on the study's outcome unless it is rejected. H0 is the symbol for it, and it is pronounced
H-naught. The Alternate Hypothesis is the logical opposite of the null hypothesis. The
acceptance of the alternative hypothesis follows the rejection of the null hypothesis. H1 is the
symbol for it.

A sanitizer manufacturer claims that its product kills 95 percent of germs on average. To put this
company's claim to the test, create a null and alternate hypothesis.

H0 (Null Hypothesis): Average = 95%.

Alternative Hypothesis (H1): The average is less than 95%.

Another straightforward example of this concept is determining whether or not a coin
is fair and balanced. The null hypothesis states that the probability of heads is equal to
the probability of tails. In contrast, the alternate hypothesis states that the probabilities of
heads and tails would be very different.

Hypothesis Testing Calculation With Examples

Let's consider a hypothesis test for the average height of women in the United States. Suppose
our null hypothesis is that the average height is 5'4" (64 inches). We gather a sample of 100
women and determine their average height is 5'5" (65 inches). The population standard
deviation is 2 inches.

To calculate the z-score, we would use the following formula:

z = ( x̅ – μ0 ) / (σ /√n)

z = (65 − 64) / (2 / √100)

z = 1 / 0.2

z = 5

We will reject the null hypothesis, as a z-score of 5 is very large, and conclude that there is
evidence to suggest that the average height of women in the US is greater than 5'4".
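
The same calculation as a Python sketch; the one-tailed p-value is added here for context and is
not part of the original worked example:

import math
from scipy.stats import norm

x_bar, mu0, sigma, n = 65, 64, 2, 100   # heights in inches, from the example above
z = (x_bar - mu0) / (sigma / math.sqrt(n))
p_value = 1 - norm.cdf(z)               # one-tailed: is the mean greater than 64"?

print(f"z = {z:.2f}, p = {p_value:.2e}")  # z = 5.00, p is about 2.9e-07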

Steps in Hypothesis Testing

Hypothesis testing is a statistical method to determine if there is enough evidence in a sample of
data to infer that a certain condition is true for the entire population. Here's a breakdown of the
typical steps involved in hypothesis testing:

Formulate Hypotheses

 Null Hypothesis (H0): This hypothesis states that there is no effect or difference, and it is the
hypothesis you attempt to reject with your test.

 Alternative Hypothesis (H1 or Ha): This hypothesis is what you might believe to be true or
hope to prove true. It is usually considered the opposite of the null hypothesis.

Choose the Significance Level (α)

The significance level, often denoted by alpha (α), is the probability of rejecting the null
hypothesis when it is true. Common choices for α are 0.05 (5%), 0.01 (1%), and 0.10 (10%).

Select the Appropriate Test

Choose a statistical test based on the type of data and the hypothesis. Common tests include t-
tests, chi-square tests, ANOVA, and regression analysis. The selection depends on data type,
distribution, sample size, and whether the hypothesis is one-tailed or two-tailed.

Collect Data

Gather the data that will be analyzed in the test. To infer conclusions accurately, this data should
be representative of the population.

Calculate the Test Statistic

Based on the collected data and the chosen test, calculate a test statistic that reflects how much
the observed data deviates from the null hypothesis.

Determine the p-value

The p-value is the probability of observing test results at least as extreme as the results
observed, assuming the null hypothesis is correct. It helps determine the strength of the
evidence against the null hypothesis.

Make a Decision

Compare the p-value to the chosen significance level:


 If the p-value ≤ α: Reject the null hypothesis, suggesting sufficient evidence in the data
supports the alternative hypothesis.

 If the p-value > α: Do not reject the null hypothesis, suggesting insufficient evidence to
support the alternative hypothesis.

Report the Results

Present the findings from the hypothesis test, including the test statistic, p-value, and the
conclusion about the hypotheses.

Perform Post-hoc Analysis (if necessary)

Depending on the results and the study design, further analysis may be needed to explore the
data more deeply or to address multiple comparisons if several hypotheses were tested
simultaneously.

Types of Hypothesis Testing

1. Z Test

To determine whether a discovery or relationship is statistically significant, hypothesis testing
uses a z-test. It usually checks whether two means are the same (the null hypothesis). A z-test
can be applied only when the population standard deviation is known and the sample size is 30
data points or more.

2. T Test

A statistical test called a t-test is employed to compare the means of two groups. To determine
whether two groups differ or if a procedure or treatment affects the population of interest, it is
frequently used in hypothesis testing.
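
A minimal sketch of a two-sample t test with scipy; the group measurements are made-up
values for illustration:

from scipy import stats

# Hypothetical measurements for two groups (e.g. treatment vs. control)
group_a = [5.1, 4.9, 5.6, 5.2, 4.8, 5.4, 5.0, 5.3]
group_b = [4.5, 4.7, 4.4, 4.9, 4.6, 4.8, 4.3, 4.6]

t_stat, p_value = stats.ttest_ind(group_a, group_b)  # independent two-sample t test
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# If p <= 0.05, reject H0 that the two group means are equal.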

3. Chi-Square

You use a Chi-square test for hypothesis testing concerning whether your data is as expected.
To determine if the expected and observed results are well-fitted, the Chi-square test analyzes the
differences between categorical variables from a random sample. The test's fundamental premise
is that the observed values in your data should be compared to the predicted values that would be
present if the null hypothesis were true.
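
A short sketch of a chi-square test of independence with scipy, using an assumed 2x2 table of
observed counts:

from scipy.stats import chi2_contingency

# Hypothetical 2x2 contingency table: rows = group, columns = outcome
observed = [[30, 20],
            [15, 35]]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}, dof = {dof}")
# A small p-value means the observed counts differ from those expected under H0.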

4. ANOVA

ANOVA, or Analysis of Variance, is a statistical method used to compare the means of three or
more groups. It’s particularly useful when you want to see if there are significant differences
between multiple groups. For instance, in business, a company might use ANOVA to analyze
whether three different stores are performing differently in terms of sales. It’s also widely used in
fields like medical research and social sciences, where comparing group differences can provide
valuable insights.
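
Echoing the three-stores example, here is a sketch of a one-way ANOVA with scipy; the sales
figures are invented for illustration:

from scipy import stats

# Hypothetical monthly sales from three stores
store_1 = [120, 135, 128, 140, 132]
store_2 = [115, 118, 122, 119, 121]
store_3 = [140, 145, 138, 150, 142]

f_stat, p_value = stats.f_oneway(store_1, store_2, store_3)  # one-way ANOVA
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
# If p <= 0.05, at least one store's mean sales differ from the others.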

Simple and Composite Hypothesis Testing


Depending on the population distribution, you can classify the statistical hypothesis into two
types.
Simple Hypothesis: A simple hypothesis specifies an exact value for the parameter.
Composite Hypothesis: A composite hypothesis specifies a range of values.
Example:
A company claims that its average sales for this quarter are 1000 units. This is an
example of a simple hypothesis. Suppose the company claims that the sales are in the range
of 900 to 1000 units. Then this is a case of a composite hypothesis.

One-Tailed and Two-Tailed Hypothesis Testing

The One-Tailed test, also called a directional test, considers a critical region of data that would
result in the null hypothesis being rejected if the test sample falls into it, inevitably meaning the
acceptance of the alternate hypothesis. In a one-tailed test, the critical distribution area is one-
sided, meaning the test sample is either greater or lesser than a specific value.
In a Two-Tailed test, the test sample is checked for being greater or less than a range of values,
meaning that the critical distribution area is two-sided. If the sample falls within this critical
region, the alternate hypothesis will be accepted, and the null hypothesis will be rejected.

Right Tailed Hypothesis Testing


If the greater than (>) sign appears in your hypothesis statement, you are using a right-tailed
test, also known as an upper test. Or, to put it another way, the disparity is to the right. For
instance, you can contrast the battery life before and after a change in production. Your
hypothesis statements can be the following if you want to know if the battery life is longer than
the original (let's say 90 hours):

 The null hypothesis is H0: mean <= 90, i.e., the battery life has not increased.

 The alternative hypothesis is H1: mean > 90, i.e., the battery life has risen.

The crucial point in this situation is that the alternate hypothesis (H1), not the null hypothesis,
decides whether you get a right-tailed test.

Left Tailed Hypothesis Testing


Alternative hypotheses that assert the true value of a parameter is lower than the null hypothesis
are tested with a left-tailed test; they are indicated by the less-than sign "<".

Example:
Suppose H0: mean = 50 and H1: mean ≠ 50. According to H1, the mean can be greater than or
less than 50. This is an example of a two-tailed test. Similarly, if H0: mean >= 50, then H1:
mean < 50. Here the alternative states that the mean is less than 50, so this is a one-tailed
(left-tailed) test.
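
To see the practical difference between the two, here is a small sketch; the z value of 1.8 is
chosen only to show that the same statistic can be significant one-tailed but not two-tailed:

from scipy.stats import norm

z = 1.8   # an illustrative test statistic

p_one_tailed = 1 - norm.cdf(z)        # right-tailed: P(Z >= z)
p_two_tailed = 2 * (1 - norm.cdf(z))  # two-tailed: both directions count

print(f"one-tailed p = {p_one_tailed:.4f}")  # 0.0359 -> significant at alpha = 0.05
print(f"two-tailed p = {p_two_tailed:.4f}")  # 0.0719 -> not significant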

Type 1 and Type 2 Error

A hypothesis test can result in two types of errors.

Type 1 Error: A Type-I error occurs when the sample results lead to rejecting the null
hypothesis even though it is true.

Type 2 Error: A Type-II error occurs when the null hypothesis is not rejected even though it is
false.

Example:

Suppose a teacher evaluates the examination paper to decide whether a student passes or fails.
H0: Student has passed. H1: Student has failed. A Type I error occurs if the teacher fails the
student [rejects H0] although the student scored the passing marks [H0 was true]. A Type II
error occurs if the teacher passes the student [does not reject H0] although the student did not
score the passing marks [H1 is true].
