Bio Stats
Biostatistics
i. Clinical Trials: Biostatistics underpins the design, conduct, and interpretation of clinical trials, which are essential for evaluating the safety and efficacy of new treatments.
ii. Epidemiology: Biostatistical methods are used to study disease occurrence and transmission. They also help estimate disease prevalence and incidence.
iii. Public Health Research: Biostatistics is crucial for public health research, including designing and analyzing surveys that collect data on health-related behaviors, risk factors, and health outcomes.
iv. Genetics and Genomics: Biostatistics is used to analyze genetic data, including data from genetic association studies.
v. Health Data Analysis: Biostatistics is applied to clinical and health-care data, such as patient outcomes, medical imaging data, and laboratory test results.
vi. Pharmacokinetics: Biostatistics supports modeling and analyzing drug concentration data to understand how drugs are absorbed, distributed, metabolized, and excreted in the body and how they affect health outcomes.
Sample size
Sample size refers to the number of individuals, items, or units selected from a larger population for inclusion in a study. Determining sample size is a critical aspect of study design and statistical analysis because the size of the sample can significantly impact the validity and reliability of the study's findings. The goal of determining an appropriate sample size is to strike a balance between collecting enough data to draw meaningful conclusions and avoiding the unnecessary collection of data, which can be resource-intensive.
In statistical terms, the sample size is denoted by the letter "n" and represents the number of observations or data points collected from the population. The size of the sample is influenced by several considerations. A larger sample size generally provides more precise and reliable estimates of population parameters and increases the likelihood of detecting true effects or differences. However, larger sample sizes may also be more costly and time-consuming to obtain. Conversely, a smaller sample size may be more manageable in terms of resources and logistics, but it may yield less precise estimates and have a higher risk of producing results that are not statistically significant or generalizable to the larger population.
Several factors influence the determination of sample size:
i. Research Objectives: The specific research questions and hypotheses guide the determination of the sample size. The sample size should be adequate to address these questions reliably.
ii. Population Variability: The level of variability within the population of interest affects the required sample size. Greater variability often necessitates a larger sample.
iii. Level of Confidence: The desired level of confidence (e.g., 95%, 99%) in the study
results influences the sample size. Higher confidence levels typically require
larger samples.
iv. Margin of Error: The acceptable margin of error or precision for estimates
impacts the sample size. Smaller margins of error demand larger sample sizes.
v. Statistical Power: The statistical power of the study (the ability to detect true
effects) is determined by the sample size. Higher power requires a larger sample.
vi. Effect Size: The size of the effect or difference that you expect to detect in your
study affects the sample size calculation. Smaller effect sizes may require larger
samples to detect.
vii. Study Design: The chosen study design (e.g., cross-sectional, longitudinal,
experimental) and the statistical methods to be used influence the sample size
calculation.
viii. Practical Constraints: Practical considerations, such as available time, budget, and the feasibility of data collection, play a role in determining the sample size.
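The trade-off between precision and sample size can be sketched with a small calculation. The sketch below uses the common formula n = (z·σ/E)² for estimating a population mean with a chosen confidence level and margin of error; the numbers in the example are assumed values, not from the text.

```python
import math

def sample_size_for_mean(sigma, margin_of_error, z=1.96):
    """Sample size needed to estimate a population mean.

    sigma: assumed population standard deviation
    margin_of_error: acceptable margin of error (same units as sigma)
    z: z-value for the desired confidence level (1.96 for 95%)
    """
    n = (z * sigma / margin_of_error) ** 2
    return math.ceil(n)  # round up: a fraction of a participant is not possible

# Hypothetical example: sigma = 12 mmHg, acceptable margin of error = 3 mmHg
print(sample_size_for_mean(12, 3))  # -> 62
```

Note how tightening the margin of error or raising the confidence level (a larger z) immediately inflates n, which is exactly the trade-off described in items iii and iv above.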
Biostats vocabulary
Research hypothesis- Ha
The research hypothesis states there is a relationship between the independent and dependent variables. The research hypothesis is also called the alternative hypothesis. It is the opposite of the null hypothesis. When the null hypothesis is rejected, based on the statistical evidence, the research hypothesis is accepted.
Null hypothesis- H0
The null hypothesis states there is no relationship between the independent and dependent variables in the study. Typically, this is not the anticipated outcome of an experiment. Usually the investigator conducts an experiment because he/she has reason to believe manipulation of the independent variable will influence the dependent variable. So, rejection of the null hypothesis supports the research hypothesis.
Variable- The variable is the fundamental entity studied in scientific research. A variable
is an attribute or thing which is free to vary (can take on more than one value).
Independent variable
In an experimental setting, the independent variable is the variable that is directly manipulated by the investigator. More generally, independent variables are the causes or presumed influences on the dependent variable.
Dependent variable
In an experimental setting, the dependent variable refers to the variable that is observed or measured; its values depend upon the values of the independent variables. Ex.- blood sugar level in an anti-diabetic test.
Variance- Variance provides a way to understand how much the data values "vary" or "spread out"
from the central tendency, which is typically represented by the mean. It quantifies how
much individual data points in a dataset deviate from the mean (average) of the dataset.
ANOVA (Analysis of Variance)- a statistical method used to determine whether there are statistically significant differences among multiple group means. ANOVA is an extension of the t-test, which is used to compare the means of two groups.
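As an illustrative sketch, the variance of a small hypothetical dataset can be computed with Python's standard statistics module; the population version divides by n, the sample version by n − 1:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical example dataset
mean = statistics.mean(data)             # central tendency the deviations are measured from
pop_var = statistics.pvariance(data)     # average squared deviation from the mean (divide by n)
samp_var = statistics.variance(data)     # sample variance (divide by n - 1)
print(mean, pop_var, samp_var)
```

For this dataset the mean is 5, the squared deviations sum to 32, so the population variance is 32/8 = 4 and the sample variance is 32/7.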
Degree of freedom- Degrees of freedom represent the number of values that are free to vary in the calculation of a statistic, given the constraints imposed by the data or the model being used. It varies as per the statistical test used, i.e.- Student's t-test, ANOVA etc.
Example- degrees of freedom in an ANOVA comparing five groups, each group having six samples-
1. Degrees of Freedom (Between Groups): This represents the variation among the group means and is calculated as (k - 1), where "k" is the number of groups. In this case, (5 - 1) = 4.
2. Degrees of Freedom (Within Groups): This represents the variation within each group and is calculated as (N - k), where "N" is the total sample size and "k" is the number of groups. In this case, there are 5 groups of 6 samples each, so N = 30 and the degrees of freedom within groups = (30 - 5) = 25.
Degrees of freedom for a paired t-test = (n - 1), where "n" is the number of pairs. For example, if there are 20 individuals and they are being compared for their performance before and after a treatment, there are 20 pairs of observations. In this case, the degrees of freedom for the paired t-test would be (20 - 1) = 19.
Degrees of freedom for an unpaired t-test = (n1 + n2 - 2), where "n1" and "n2" are the sizes of the two groups. For example, if an unpaired t-test is conducted to compare the means of two groups, and the first group has 30 observations (n1 = 30) and the second group has 25 observations (n2 = 25), the degrees of freedom for the t-test would be (30 + 25 - 2) = 53.
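The degrees-of-freedom rules above can be captured in a few lines of Python (an illustrative sketch; the function names are ours):

```python
def df_anova(k, n_per_group):
    """One-way ANOVA: (between-groups df, within-groups df).

    Assumes equal group sizes, as in the five-groups-of-six example.
    """
    total_n = k * n_per_group
    return k - 1, total_n - k

def df_paired_t(n_pairs):
    """Paired t-test: number of pairs minus one."""
    return n_pairs - 1

def df_unpaired_t(n1, n2):
    """Unpaired t-test: n1 + n2 - 2."""
    return n1 + n2 - 2

print(df_anova(5, 6))         # 5 groups of 6 samples -> (4, 25)
print(df_paired_t(20))        # 20 before/after pairs -> 19
print(df_unpaired_t(30, 25))  # groups of 30 and 25 -> 53
```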
Mean: The mean, also known as the average, is calculated by adding up all the values in
a dataset and then dividing the sum by the number of data points.
Median: The median is the middle value in a dataset when the data is ordered from smallest to largest. If there is an even number of data points, the median is the average of the two middle values.
Mode: The mode is the value that appears most frequently in a dataset. A dataset can
have one mode (unimodal), multiple modes (multimodal), or no mode at all (no distinct
value appears more often than others). The mode is particularly useful for categorical or
nominal data, where you are interested in finding the most common category or group.
Example- Consider the dataset: 2, 3, 4, 5, 3, 8, 9, 10, 11, 3.
Mean:
Mean = (2 + 3 + 4 + 5 + 3 + 8 + 9 + 10 + 11 + 3) / 10
Mean = 58 / 10
Mean = 5.8
Median:
Sorted data: 2, 3, 3, 3, 4, 5, 8, 9, 10, 11. Since there are 10 data points (an even number), the median is the average of the two middle values, which are the 5th and 6th values in the sorted list.
Median = (4 + 5) / 2
Median = 9 / 2
Median = 4.5
Mode:
In this dataset, the number 3 appears three times, which is more frequent than any other value, so the mode is 3.
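These three measures can be checked with Python's statistics module (note that the middle values of the sorted data are 4 and 5, so the median is 4.5):

```python
import statistics

data = [2, 3, 4, 5, 3, 8, 9, 10, 11, 3]  # dataset from the example above

print(statistics.mean(data))    # sum 58 / 10 data points -> 5.8
print(statistics.median(data))  # average of the 5th and 6th sorted values -> 4.5
print(statistics.mode(data))    # 3 appears three times -> 3
```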
Standard deviation-
Standard deviation is a measure of the amount of variation or dispersion in a set of values. It quantifies how much individual data points differ from the mean (average) of the data set. In other words, it provides a measure of the spread or the extent to which data points are dispersed around the mean. The standard deviation is often denoted by the Greek letter σ (sigma) for a population and 's' for a sample.
Standard deviation provides a way to quantify the degree of variation or dispersion in a
data set. A high standard deviation indicates that data points are widely spread out from
the mean, while a low standard deviation suggests that data points are close to the mean.
The formula for calculating the standard deviation for a sample and population is as follows:
Sample: s = √[ Σ(xi - x̄)² / (n - 1) ]
Population: σ = √[ Σ(xi - μ)² / N ]
Where: xi = each individual data point, x̄ (or μ) = the mean, and n (or N) = the number of data points.
Example- Calculate the standard deviation for the data set 20, 25, 31, 23, 27, 29.
To calculate the standard deviation for the given data set, follow these steps:
➢ Calculate the mean of the data set.
➢ Calculate the squared difference between each data point and the mean.
➢ Sum the squared differences and divide by the number of data points.
➢ Take the square root of the result from step 3 to get the standard deviation.
Step 1: Calculate the mean: (20 + 25 + 31 + 23 + 27 + 29) / 6 = 155 / 6 ≈ 25.83
Step 2: Calculate the squared difference between each data point and the mean: 34.03, 0.69, 26.69, 8.03, 1.36, 10.03
Step 3: Sum the squared differences and divide by the number of data points: 80.83 / 6 ≈ 13.47
Step 4: Take the square root of the result from step 3 to get the standard deviation: √13.47 ≈ 3.67
So, the standard deviation of the given data set is approximately 3.67 (using the population formula; dividing by n - 1 = 5 instead gives the sample standard deviation, approximately 4.02).
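The worked example can be verified in Python; the sketch below shows both the population formula (divide by n) and the sample formula (divide by n − 1):

```python
import math

data = [20, 25, 31, 23, 27, 29]
mean = sum(data) / len(data)                    # 155 / 6 ≈ 25.83
squared_diffs = [(x - mean) ** 2 for x in data]

pop_sd = math.sqrt(sum(squared_diffs) / len(data))         # population: divide by n
samp_sd = math.sqrt(sum(squared_diffs) / (len(data) - 1))  # sample: divide by n - 1
print(round(pop_sd, 2), round(samp_sd, 2))  # -> 3.67 4.02
```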
To assess the spread of data relative to the mean, it's common to use the coefficient of variation (CV), which expresses the standard deviation as a proportion (or percentage) of the mean.
The standard error of the mean (SEM) is a statistical measure that quantifies the
variability or uncertainty in the sample mean when estimating the population mean. It is
a measure of how much the sample mean is likely to vary from one sample to another. In
essence, the SEM provides information about the precision of the sample mean as an estimate of the population mean. It is calculated as:
SEM = σ / √n
Where: σ = the standard deviation (in practice, the sample standard deviation 's' is used) and n = the sample size.
SEM is related to the variability in sample means when samples are repeatedly drawn from the same population. It gives an idea of how much the sample mean is likely to differ from the true population mean. The SEM decreases as the sample size (n) increases. In other words, larger sample sizes yield more precise estimates of the population mean.
In practice, the SEM is useful for interpreting and comparing sample means from different
studies or experiments. A smaller SEM indicates that the sample means are more
consistent and, therefore, provide more reliable estimates of the population mean.
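A minimal sketch of the SEM calculation, reusing the standard-deviation example dataset:

```python
import math
import statistics

data = [20, 25, 31, 23, 27, 29]   # example dataset from the standard deviation section
s = statistics.stdev(data)        # sample standard deviation, ≈ 4.02
sem = s / math.sqrt(len(data))    # SEM = s / sqrt(n)
print(round(sem, 2))              # -> 1.64
```

Quadrupling the sample size would halve the SEM, which is why larger samples give more precise estimates of the population mean.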
Parametric data- Data that follows a specific probability distribution, usually the normal distribution, and has specific characteristics that allow for the use of parametric statistical tests and models. Examples of parametric data: height of adult humans, weight of laboratory mice, etc.
Nonparametric data- Data that do not follow a specific probability distribution and do not have the same characteristics as parametric data. They are often associated with nominal or ordinal scales of measurement and do not assume a normal distribution. Examples:
Rankings: Ranking data, such as movie ratings from 1 to 5 stars or preferences for different food items from most preferred to least preferred, are nonparametric ordinal data. Other examples include categorical data such as blood type, and survey responses: many survey questions that ask respondents to choose from predefined categories or options generate nonparametric data.
Nonparametric data often require different statistical methods and tests than parametric
data. Nonparametric tests, such as the chi-squared test, Wilcoxon signed-rank test, and
Mann-Whitney U test, are used to analyze nonparametric data because they do not make
assumptions about normality or equal variance. These tests are valuable when dealing
with data that cannot be treated as normally distributed or when the scale of measurement is ordinal or nominal.
(Ordinal data- Ordinal data represents categories with a meaningful order or ranking. However, the intervals between categories are not necessarily equal. Examples include education levels (e.g., "high school," "Intermediate," "bachelor's degree") and customer satisfaction ratings.
Nominal data- It represents categories or groups that are distinct and separate from each other but don't have any inherent order or ranking. Nominal data is typically used to classify items into discrete categories or to name and identify different attributes or characteristics, e.g., blood type.)
Parametric tests assume that the data follows a normal distribution (also known as the Gaussian distribution). In a normal distribution, data points are symmetrically distributed around the mean. Parametric tests also commonly assume a linear relationship between the dependent variable and the independent variables. This means that the change in the dependent variable is proportional to changes in the independent variables.
Common examples of parametric tests and the types of data they are used for include:
i. Independent (unpaired) t-Test: Used to compare the means of two independent groups when the data follows a normal distribution. For example, comparing the test scores of two groups of students who received different teaching methods.
ii. Paired t-Test: Used to compare the means of two related groups (paired data) when the data is normally distributed. For example, comparing the pre- and post-treatment measurements of the same group of patients.
iii. Analysis of Variance (ANOVA): Used to compare means among three or more groups when the data is normally distributed. One-way ANOVA is used for one independent variable, while two-way ANOVA is used for two independent variables.
iv. Linear Regression: Used to model the relationship between a continuous dependent variable and one or more independent variables when the assumptions of normality and linearity are met.
Nonparametric tests are also called distribution-free procedures. In general, these procedures can be used with nominal or ordinal data, or with interval data whose distribution takes certain shapes (in contrast to parametric procedures, which invariably require normally distributed data). Common examples include the Chi-square Tests and the Spearman Rank Correlation Coefficient.
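As a sketch of how one such nonparametric statistic is computed, the chi-squared goodness-of-fit statistic sums (O − E)²/E over the categories; the observed and expected counts below are hypothetical:

```python
# Chi-squared goodness-of-fit statistic: sum of (O - E)^2 / E over categories.
# Observed and expected counts below are hypothetical example values.
observed = [25, 30, 20, 25]
expected = [25, 25, 25, 25]   # e.g., equal frequencies expected under H0

chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1        # degrees of freedom = number of categories - 1
print(chi_sq, df)             # -> 2.0 3
```

The statistic is then compared with the critical chi-squared value for the given degrees of freedom; no assumption of normality is needed, which is what makes the test nonparametric.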
p Value- In statistics, the p-value (short for probability value) is a measure used in hypothesis testing that helps researchers determine whether the results of their study are statistically significant. It is compared against a pre-chosen significance level, α (commonly 0.05).
If the p-value is less than α, the null hypothesis is rejected and the alternative hypothesis is accepted. This means the results are statistically significant, suggesting that there is evidence of a real effect or relationship. If the p-value is greater than or equal to α, the null hypothesis is not rejected, indicating that the data does not provide strong enough evidence to support the alternative hypothesis.
Correlation
Correlation is a measure of the degree of relationship among variables. There are many correlation coefficients. Two of the most important measures are the Pearson Product Moment Correlation Coefficient (by far the most frequently used) and the Spearman Rank Correlation Coefficient. A correlation coefficient ranges from -1 to +1, indicating the strength and direction of the relationship.
Regression
Regression is a statistical method that models the relationship between a dependent variable and one or more independent variables. A regression model is able to show whether changes observed in the dependent variable are associated with changes in one or more of the independent variables.
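A minimal sketch of the Pearson correlation coefficient computed from its definition (the example data are hypothetical):

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient, computed from its definition."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # numerator: sum of cross-products of deviations from the means
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    # denominator: product of the square-rooted sums of squared deviations
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# A perfect positive linear relationship gives r ≈ 1
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))
```

Values near +1 or -1 indicate a strong linear relationship; values near 0 indicate little or no linear relationship.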
Statistical tests of significance
Statistical tests of significance, also known as hypothesis tests, are a fundamental part
of statistical analysis. These tests help researchers determine whether the observed
differences or associations in their data are statistically significant or if they could have
occurred by random chance. Here are some common statistical tests of significance and
Student's t-test: It is used for comparing the means of two groups. It is of two types- the independent/unpaired samples t-test (for comparing two independent groups) and the paired samples t-test (for comparing two related groups).
Analysis of Variance (ANOVA): It is used for comparing means among more than two groups. It may be one-way ANOVA (for one categorical independent variable), two-way ANOVA (for two independent variables), or repeated measures ANOVA (for repeated measurements on the same subjects). ANOVA determines whether there are statistically significant differences among multiple group means. ANOVA is an extension of the t-test, which is used to compare the means of two groups. There are different types of ANOVA depending on the number of factors and levels involved:
One-Way ANOVA: This is used when there is one independent variable (factor) and more than two levels or groups. It assesses whether there are statistically significant differences among the group means.
Two-Way ANOVA: This is used when there are two independent variables, and one has to assess the main effects of each variable as well as any interaction between them. It's useful for studying how two factors affect the dependent variable.
Example- A pharmaceutical company conducts a clinical trial to test the efficacy of a new drug in treating a specific medical condition. They want to investigate how two factors, dosage (Low, Medium, High) and gender (Male, Female), impact the response variable, which is the reduction in symptoms after taking the drug.
Factor 1: Dosage
Low: 1 mg
Medium: 5 mg
High: 10 mg
Factor 2: Gender
Male
Female
For this study, the company collects data from a large number of participants, randomly assigning them to different dosage levels and recording their gender. After the trial, they analyze the symptom-reduction data using a two-way ANOVA.
Steps of ANOVA:
a) Null Hypothesis (H0): The null hypothesis in ANOVA states that there are no
significant differences among the group means. In other words, all group means
are equal.
b) Alternative Hypothesis (Ha): The alternative hypothesis suggests that at least one group mean is different from the others.
c) Test Statistic: ANOVA calculates a test statistic called the F-statistic, the ratio of the variation between group means to the variation within the groups.
d) Significance Level: A significance level (e.g., α = 0.05) is selected.
e) Decision: If the calculated F-statistic is greater than the critical F-value from the F-distribution (or the p-value is less than α), the null hypothesis is rejected. This indicates that at least one group mean is significantly different from the others.
f) Post hoc Tests: If ANOVA indicates significant differences, post hoc tests (e.g., Tukey's HSD) can be performed to identify which specific group means differ.
ANOVA is widely used in fields such as medicine, agriculture, business, and many others, to compare means across different categories or levels and to determine whether the variations observed in the data are likely due to real differences between the groups or to random chance.
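The F-statistic at the heart of these steps can be sketched for the one-way case; the dosage-group scores below are hypothetical illustration values, not trial data:

```python
import statistics

def one_way_anova_f(*groups):
    """F-statistic for one-way ANOVA: between-group vs within-group variance."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total

    # sum of squares between groups: weighted squared deviations of group means
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    # sum of squares within groups: squared deviations from each group's own mean
    ss_within = sum(sum((x - statistics.mean(g)) ** 2 for x in g) for g in groups)

    ms_between = ss_between / (k - 1)       # between-groups df = k - 1
    ms_within = ss_within / (n_total - k)   # within-groups df = N - k
    return ms_between / ms_within

# Hypothetical symptom-reduction scores for three dosage groups
low, medium, high = [4, 5, 6], [6, 7, 8], [9, 10, 11]
print(one_way_anova_f(low, medium, high))
```

A large F means the group means differ far more than the within-group scatter would explain; the value is then compared with the critical F for (k − 1, N − k) degrees of freedom.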
Chi-squared test (χ²): It is used for analyzing categorical data and testing for associations between categorical variables.
(Categorical data- data on which arithmetic operations (+, -, ×, ÷, average etc.) cannot meaningfully be performed. This data is represented by pie charts, bar diagrams etc.)
It is of two types- the chi-squared test of independence and the chi-squared goodness-of-fit test.
Pearson Correlation: Used for assessing the strength and direction of a linear relationship between two continuous variables.
Spearman Rank Correlation: Used when the relationship between two variables is not necessarily linear or when the data is ordinal. It calculates a correlation coefficient based on the ranks of the data values rather than the raw values. (Ordinal data represents categories with a meaningful order or ranking. However, the intervals between categories are not necessarily equal. Examples include education levels (e.g., "high school," "Intermediate," "bachelor's degree") and customer satisfaction ratings.)
Kruskal-Wallis Test: Used when comparing three or more independent groups, but the data is not normally distributed. It is a non-parametric alternative to one-way ANOVA.
Mann-Whitney U Test: It is used for comparing two independent groups when the data is not normally distributed. It is a non-parametric alternative to the independent samples t-test.
Wilcoxon Signed-Rank Test: It is used for comparing two related groups (paired data)
when the data is not normally distributed. It is a non-parametric alternative to the paired
samples t-test.
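A simple sketch of the Mann-Whitney U statistic for two small groups (hypothetical data, assuming no tied values):

```python
def mann_whitney_u(group1, group2):
    """Mann-Whitney U statistic (simple version, assumes no tied values)."""
    combined = sorted(group1 + group2)
    ranks = {v: i + 1 for i, v in enumerate(combined)}  # rank values 1..N ascending
    r1 = sum(ranks[v] for v in group1)                  # rank sum of group 1
    n1, n2 = len(group1), len(group2)
    u1 = n1 * n2 + n1 * (n1 + 1) // 2 - r1
    u2 = n1 * n2 - u1
    return min(u1, u2)  # the smaller U is compared with a critical value

# Hypothetical scores from two independent groups (no ties)
print(mann_whitney_u([3, 5, 8], [10, 12, 14]))  # complete separation -> 0
```

Because the statistic uses only ranks, it makes no assumption about the shape of the underlying distributions, which is why it substitutes for the t-test on non-normal data.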
Regression analysis: It determines the relationship between the dependent variable and one or more independent variables. Linear regression models a linear relationship between a continuous dependent variable and one or more independent variables. It assesses how well the independent variables explain variation in the dependent variable.
Post-hoc analysis
Post hoc analysis, also known as post hoc tests or multiple comparison tests, are statistical procedures performed after an overall test (such as ANOVA) has found a significant result, in order to determine which specific group means differ from each other. There are several different types of post hoc tests, and the choice of which one to use depends on the nature of the data and the research hypothesis. Here are some common post hoc tests:
Newman-Keuls Method: This is a stepwise procedure that compares all possible pairs of group means.
Bonferroni-Dunn Test: Used in cases where there is a control group and multiple treatment groups, this test helps identify which treatment groups are significantly different from the control group.
Tukey's Honestly Significant Difference (HSD): Tukey's HSD test is a popular post hoc test that compares all possible pairs of group means while controlling the overall (family-wise) error rate.
Scheffé Test: The Scheffé test is a conservative and versatile post hoc test that can be used with any kind of design. It is useful when there are unequal sample sizes and when complex comparisons among groups are of interest.
Fisher's Least Significant Difference (LSD): The LSD test is a straightforward post hoc test that can be used when the variances across groups are roughly equal and sample sizes are similar. It is the least conservative of the common post hoc tests.