
Statistics

Statistics is the science of collecting, analyzing, interpreting, and presenting data.

Biostatistics

Biostatistics is the application of statistical techniques to scientific research in biology and the life sciences, including medicine and public health.

Application of biostatistics in health science

i. Clinical Trials: Biostatistics is extensively used in the design, analysis, and

interpretation of clinical trials, which are essential for evaluating the safety and

efficacy of new drugs, medical treatments, and interventions. Biostatisticians help

determine sample sizes, randomization procedures, and the statistical methods to

analyze trial data.

ii. Epidemiology: Biostatistics is fundamental in epidemiology, the study of

patterns, causes, and effects of diseases in populations. Biostatisticians analyze

epidemiological data to identify risk factors, trends, and patterns in disease

occurrence and transmission. They also help estimate disease prevalence and

incidence.

iii. Public Health Research: Biostatistics is crucial for public health research,

including the surveillance and monitoring of diseases, assessment of health

interventions, and the evaluation of health policies and programs.

iv. Genetics and Genomics: Biostatistics is used to analyze genetic data, including

genome-wide association studies (GWAS) to identify genetic factors associated


with diseases. It also plays a role in genetic linkage analysis, gene expression

studies, and population genetics.

v. Medical Research: Biostatistics aids in analyzing various types of medical data,

such as patient outcomes, medical imaging data, and laboratory test results. It

helps researchers draw meaningful conclusions from complex datasets.

vi. Epidemiological Surveys: Biostatistical methods are applied in designing and

analyzing surveys that collect data on health-related behaviors, risk factors, and

health outcomes in populations.

vii. Healthcare Quality Improvement: Biostatistics is used to assess and improve

the quality of healthcare services through the analysis of patient outcomes,

hospital performance, and healthcare utilization data.

viii. Pharmacokinetics and Pharmacodynamics: Biostatistics plays a role in

modeling and analyzing drug concentration data to understand how drugs are

absorbed, distributed, metabolized, and excreted in the body and how they affect

health outcomes.

Sample size

Sample size refers to the number of individuals, items, or units selected from a larger

population for inclusion in a research study or survey. It is a critical aspect of research

design and statistical analysis because the size of the sample can significantly impact

the validity and reliability of the study's findings. The goal of determining an

appropriate sample size is to strike a balance between collecting enough data to draw

meaningful conclusions and avoiding the unnecessary collection of data, which can be

resource-intensive.
In statistical terms, the sample size is denoted by the letter "n" and represents the number

of observations or data points collected from the population. The size of the sample is

typically determined by statistical considerations, research objectives, available

resources, and practical constraints.

A larger sample size generally provides more precise and reliable estimates of

population parameters (such as means, proportions, or correlations) and increases the

likelihood of detecting true effects or differences. However, larger sample sizes may also

be more costly and time-consuming to collect and analyze.

Conversely, a smaller sample size may be more manageable in terms of resources and

logistics, but it may yield less precise estimates and have a higher risk of producing

results that are not statistically significant or generalizable to the larger population.

Factors influencing sample size

i. Research Objectives: The specific research questions and hypotheses guide the

determination of the sample size. The sample size should be adequate to address

these research objectives.

ii. Population Variability: The level of variability within the population of interest

affects the required sample size. Greater variability often necessitates a larger

sample to obtain meaningful results.

iii. Level of Confidence: The desired level of confidence (e.g., 95%, 99%) in the study

results influences the sample size. Higher confidence levels typically require

larger samples.
iv. Margin of Error: The acceptable margin of error or precision for estimates

impacts the sample size. Smaller margins of error demand larger sample sizes.

v. Statistical Power: The statistical power of the study (the ability to detect true

effects) is determined by the sample size. Higher power requires a larger sample.

vi. Effect Size: The size of the effect or difference that you expect to detect in your

study affects the sample size calculation. Smaller effect sizes may require larger

samples to detect.

vii. Study Design: The chosen study design (e.g., cross-sectional, longitudinal,

experimental) and the statistical methods to be used influence the sample size

calculation.

viii. Resources and Constraints: Practical considerations, such as available time,

budget, and the feasibility of data collection, play a role in determining the sample

size.
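
As an illustration of how these considerations translate into an actual number, the following sketch (not part of the original notes; it assumes Python with scipy and statsmodels installed, and uses illustrative values for the proportion, margin of error, and effect size) computes a sample size for estimating a proportion and a power-based sample size for comparing two means:

from math import ceil
from scipy.stats import norm
from statsmodels.stats.power import TTestIndPower

def n_for_proportion(p=0.5, margin_of_error=0.05, confidence=0.95):
    # n = z^2 * p * (1 - p) / e^2 for a simple random sample from a large population
    z = norm.ppf(1 - (1 - confidence) / 2)   # e.g. 1.96 for 95% confidence
    return ceil(z ** 2 * p * (1 - p) / margin_of_error ** 2)

print(n_for_proportion())   # about 385 respondents

# Sample size per group to detect a medium effect (Cohen's d = 0.5)
# with 80% power at alpha = 0.05 in an unpaired t-test
n_per_group = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(ceil(n_per_group))    # about 64 per group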

Biostats vocabulary

Research hypothesis- Ha/H1

The research hypothesis states there is a relationship between the independent and

dependent variables. The research hypothesis is also called the alternative hypothesis.

It is the opposite of the null hypothesis. When the null hypothesis is rejected, based on

research data, it implies acceptance of the research hypothesis.

Null hypothesis- H0

A null hypothesis is a type of statistical hypothesis that proposes that no statistical

significance exists in a set of given observations. The null hypothesis is a statement

asserting that there is no difference between population parameters. That is, there is no


relationship between independent and dependent variables in the population under

study. Typically, this is not the anticipated outcome of an experiment. Usually the

investigator conducts an experiment because he/she has reason to believe manipulation of the

independent variable will influence the dependent variable. So, rejection of the null

hypothesis is interpreted as a significant finding.

Variable- The variable is the fundamental entity studied in scientific research. A variable

is an attribute or thing which is free to vary (can take on more than one value).

Independent variable

In an experimental setting, independent variables are the variables that are manipulated by the investigator. More generally, independent variables are the causes or causal factors in medical research studies. Example: dose of a drug.

Dependent variable

In an experimental setting, dependent variables are the variables that are observed by the experimenter; they are the outcome of the experiment. More generally, the values of dependent variables depend upon the values of the independent variables. Example: blood sugar level in an antidiabetic drug test.

Variance- Variance provides a way to understand how much the data values "vary" or "spread out"

from the central tendency, which is typically represented by the mean. It quantifies how

much individual data points in a dataset deviate from the mean (average) of the dataset.

ANOVA- Analysis of Variance, is a statistical technique used to analyze the variation in a

dataset by comparing the means of different groups or categories. It is used to determine

whether there are statistically significant differences among multiple group means.

ANOVA is an extension of the t-test, which is used to compare the means of two groups.
Degree of freedom- Degrees of freedom represent the number of values that are free to

vary in a statistical calculation without violating any constraints or relationships imposed

by the data or the model being used. It varies with the statistical test used, e.g., Student's t-test, ANOVA, chi-square test.

Example: degrees of freedom in a one-way ANOVA comparing five groups, each group having six samples:

1. Degrees of Freedom (Between Groups): This represents the variation between

the group means and is calculated as (k - 1), where "k" is the number of groups. In

this case, k = 5, so degrees of freedom between groups is (5 - 1) = 4.

2. Degrees of Freedom (Within Groups): This represents the variation within each

group. It is calculated as (N - k), where "N" is the total number of observations

(total sample size) and "k" is the number of groups. In this case, there are 5 groups

with 6 samples each, so N = 5 * 6 = 30, and k = 5. Therefore, degrees of freedom

within groups is (30 - 5) = 25.

Degrees of freedom in a paired t-test = (Number of pairs of data) - 1

For example, if there are 20 individuals and they are being compared for their

performance before and after a treatment, there are 20 pairs of observations. In this

case, the degrees of freedom for the paired t-test would be (20 - 1) = 19.

Degrees of freedom in an unpaired (independent samples) t-test = (n1 + n2 - 2)

Where-

"n1" is the sample size of the first group.

"n2" is the sample size of the second group.


The "- 2" accounts for the fact that two samples (two groups) are being compared.

For example, if an unpaired t-test is conducted to compare the means of two groups,

and the first group has 30 observations (n1 = 30) and the second group has 25

observations (n2 = 25), the degrees of freedom for the t-test would be (30 + 25 - 2) =

53.
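
The degrees-of-freedom rules worked out above can also be written as small helper functions; the Python sketch below is illustrative only and simply restates the formulas:

def anova_df(k, n_per_group):
    # returns (between-groups df, within-groups df) for a one-way ANOVA
    N = k * n_per_group          # total number of observations
    return k - 1, N - k

def paired_t_df(n_pairs):
    return n_pairs - 1

def unpaired_t_df(n1, n2):
    return n1 + n2 - 2

print(anova_df(5, 6))           # (4, 25)
print(paired_t_df(20))          # 19
print(unpaired_t_df(30, 25))    # 53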

Measure of central tendency in data

Mean: The mean, also known as the average, is calculated by adding up all the values in

a dataset and then dividing the sum by the number of data points.

Formula: Mean = (Sum of all values) / (Number of values)

Median: The median is the middle value in a dataset when the data is ordered from

smallest to largest. If there is an even number of data points, the median is the average of

the two middle values.

Mode: The mode is the value that appears most frequently in a dataset. A dataset can

have one mode (unimodal), multiple modes (multimodal), or no mode at all (no distinct

value appears more often than others). The mode is particularly useful for categorical or

nominal data, where you are interested in finding the most common category or group.

Example-

Calculate the mean, median, and mode for the data set 2, 3, 4, 5, 3, 8, 9, 10, 11, 3.

Mean = (2 + 3 + 4 + 5 + 3 + 8 + 9 + 10 + 11 + 3) / 10
Mean = 58 / 10

Mean = 5.8

Median:

First, sort the dataset in ascending order: [2, 3, 3, 3, 4, 5, 8, 9, 10, 11].

Since there are 10 data points (an even number), the median is the average of the two middle values, which are the 5th and 6th values in the sorted list.

Median = (4 + 5) / 2

Median = 9 / 2

Median = 4.5

Mode:

In this dataset, the number 3 appears three times, which is more frequent than any other

number, hence Mode = 3
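
The same worked example can be checked with Python's standard-library statistics module (an illustrative sketch, not part of the original notes):

import statistics

data = [2, 3, 4, 5, 3, 8, 9, 10, 11, 3]

print(statistics.mean(data))     # 5.8
print(statistics.median(data))   # 4.5 (average of the 5th and 6th sorted values)
print(statistics.mode(data))     # 3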

Measure of dispersion of data

Standard deviation-

Standard deviation is a statistical measure of the amount of variation or dispersion in a

set of values. It quantifies how much individual data points differ from the mean

(average) of the data set. In other words, it provides a measure of the spread or the extent

to which data points are dispersed around the mean. The standard deviation is often

denoted by the Greek letter σ (sigma) for a population and 's' for a sample.
Standard deviation provides a way to quantify the degree of variation or dispersion in a

data set. A high standard deviation indicates that data points are widely spread out from

the mean, while a low standard deviation suggests that data points are close to the mean.

The formula for calculating the standard deviation for a sample and population is as

follows:

Sample Standard Deviation (s) = √[Σ(xi - x̄)² / (n - 1)]

Population Standard Deviation (σ) = √[Σ(xi - μ)² / n]

Where:

xi represents each individual data point.

x̄ is the sample mean (average).

μ is the population mean.

n is the number of data points in the sample.

Σ denotes the sum of the values.

Example: Calculate the standard deviation for the data set 20, 25, 31, 23, 27, 29.

To calculate the standard deviation for the given data set, follow these steps:

➢ Find the mean (average) of the data set.

➢ Calculate the squared difference between each data point and the mean.

➢ Find the average of those squared differences.

➢ Take the square root of the result from step 3 to get the standard deviation.

Step 1: Find the mean (average):


(20 + 25 + 31 + 23 + 27 + 29) / 6 = 155 / 6 ≈ 25.83

Step 2: Calculate the squared difference between each data point and the mean:

For 20: (20 - 25.83)² ≈ 33.99

For 25: (25 - 25.83)² ≈ 0.69

For 31: (31 - 25.83)² ≈ 26.73

For 23: (23 - 25.83)² ≈ 8.01

For 27: (27 - 25.83)² ≈ 1.37

For 29: (29 - 25.83)² ≈ 10.05

Step 3: Find the average of those squared differences:

(33.99 + 0.69 + 26.73 + 8.01 + 1.37 + 10.05) / 6 ≈ 13.47

Step 4: Take the square root of the result from step 3 to get the standard deviation:

√13.47 ≈ 3.67 (rounded to two decimal places)

So, the (population) standard deviation of the given data set is approximately 3.67. Dividing by (n - 1) = 5 instead of n in step 3 gives the sample standard deviation, approximately 4.02.

To assess the spread of data relative to the mean, it's common to use the coefficient of variation (CV), which is defined as: CV = (Standard Deviation / Mean) × 100
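
For reference, the same calculation can be reproduced with Python's statistics module (an illustrative sketch; pstdev uses the population formula, dividing by n, while stdev divides by n - 1):

import statistics

data = [20, 25, 31, 23, 27, 29]
mean = statistics.mean(data)          # about 25.83

pop_sd = statistics.pstdev(data)      # about 3.67, matching the worked example
sample_sd = statistics.stdev(data)    # about 4.02

cv = (pop_sd / mean) * 100            # coefficient of variation, in percent
print(round(pop_sd, 2), round(sample_sd, 2), round(cv, 2))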

Standard error of the mean (SEM)

The standard error of the mean (SEM) is a statistical measure that quantifies the

variability or uncertainty in the sample mean when estimating the population mean. It is

a measure of how much the sample mean is likely to vary from one sample to another. In

essence, the SEM provides information about the precision of the sample mean as an

estimate of the population mean.


The formula for calculating the standard error of the mean is:

SEM = σ / √n

Where:

SEM is the standard error of the mean.

σ is the population standard deviation.

n is the sample size.

SEM is related to the variability in sample means when samples are repeatedly drawn

from the same population. It gives an idea of how much the sample mean is likely to differ

from the true population mean.

The SEM decreases as the sample size (n) increases. In other words, larger sample sizes

result in more precise estimates of the population mean.

In practice, the SEM is useful for interpreting and comparing sample means from different

studies or experiments. A smaller SEM indicates that the sample means are more

consistent and, therefore, provide more reliable estimates of the population mean.
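
A small illustrative sketch (not from the original notes): because the population σ is rarely known in practice, the sample standard deviation s is usually substituted in the formula, which is what scipy.stats.sem does.

import math
import statistics
from scipy import stats

data = [20, 25, 31, 23, 27, 29]
n = len(data)

sem_manual = statistics.stdev(data) / math.sqrt(n)   # s / sqrt(n), about 1.64
sem_scipy = stats.sem(data)                          # same value
print(round(sem_manual, 2), round(sem_scipy, 2))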

Parametric data – It refers to data that follows a particular probability distribution (typically the normal distribution) or has specific characteristics that allow for the use of parametric statistical tests and

models. Examples of parametric data: height of adult humans, weight of laboratory mice,

temperature readings, blood pressure, income levels.


Nonparametric data- also known as categorical or ordinal data, do not follow a specific probability distribution and do not have the same characteristics as parametric data. They are often associated with nominal or ordinal scales of measurement and do not assume a normal distribution or equal variance. Examples of nonparametric data include gender, marital status, education level, blood type, preference data and rankings (such as movie ratings from 1 to 5 stars, or food items ranked from most preferred to least preferred, which are nonparametric ordinal data), and survey responses in which respondents choose from predefined categories or options, for example "yes" or "no" answers to a binary question.

Nonparametric data often require different statistical methods and tests than parametric

data. Nonparametric tests, such as the chi-squared test, Wilcoxon signed-rank test, and

Mann-Whitney U test, are used to analyze nonparametric data because they do not make

assumptions about normality or equal variance. These tests are valuable when dealing

with data that cannot be treated as normally distributed or when the scale of

measurement is nominal or ordinal (Categorical data).

(Ordinal data- Ordinal data represents categories with a meaningful order or ranking.

However, the intervals between categories are not necessarily equal. Examples include

education levels e.g., "high school," "Intermediate," "bachelor's degree" and customer

satisfaction ratings e.g., "very dissatisfied," "neutral," "very satisfied".

Nominal data- It represents categories or groups that are distinct and separate from

each other but don't have any inherent order or ranking. Nominal data is typically used

to classify items into discrete categories or to name and identify different attributes or

characteristics. Ex- colour, gender, marital status, geographical regions.)


Parametric tests- Parametric tests assume that the data follows a normal distribution

(also known as the Gaussian distribution). In a normal distribution, data points are

symmetrically distributed around a central mean, forming a bell-shaped curve.

Many parametric statistical models (such as linear regression) also assume a linear relationship between the dependent variable and the independent variables, meaning that the change in the dependent variable is proportional to the change in the independent variables.

Common examples of parametric tests and the types of data they are used for include:

i. Independent t-Test or unpaired t-test: Used to compare the means of two

independent groups when the data follows a normal distribution. For example,

comparing the test scores of two groups of students who received different

teaching methods.
ii. Paired t-Test: Used to compare the means of two related groups (paired data)

when the data is normally distributed. For example, comparing the pre- and post-

treatment scores of the same group of patients.

iii. Analysis of Variance (ANOVA): Used to compare means among three or more

independent groups when the data meets parametric assumptions. One-way

ANOVA is used for one independent variable, while two-way ANOVA is used for

two independent variables.

iv. Linear Regression: Used to model the relationship between a continuous

dependent variable and one or more independent variables when the assumptions

of linearity and normality are met.

v. Analysis of Covariance (ANCOVA): Combines aspects of ANOVA and regression

to analyze data with both categorical and continuous independent variables,

assuming parametric assumptions are met.

Nonparametric Tests- Nonparametric tests are sometimes referred to as distribution-

free procedures. In general, these procedures can be used with nominal or ordinal

measures and do not have assumptions requiring that distributions of variables be of

certain shapes (in contrast to parametric procedures, which generally require normal

distributions and interval or ratio measures). Examples of nonparametric procedures

include the Chi-square Tests, and the Spearman Rank Correlation Coefficient.

p Value- In statistics, the p-value (short for probability value) is a measure that helps

assess the strength of evidence against a null hypothesis. It is an important tool in

hypothesis testing that helps researchers determine whether the results of their study

are statistically significant or can be attributed to random chance.


The p-value is compared to a significance level (often denoted as α), typically set at 0.05. If the p-value is less than α, the null hypothesis is rejected and the alternative hypothesis is accepted. This means the results are statistically significant, suggesting that there is evidence of an effect or difference. If the p-value is greater than α, the null hypothesis is not rejected (we "fail to reject" it), indicating that the data does not provide strong enough evidence to support the alternative hypothesis.
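
As a concrete illustration (a sketch with made-up numbers, not from the original notes), an unpaired t-test in scipy returns a p-value that can be compared with α directly:

from scipy import stats

group_a = [68, 72, 75, 70, 74, 69, 73]   # hypothetical measurements
group_b = [78, 80, 76, 82, 79, 81, 77]

t_stat, p_value = stats.ttest_ind(group_a, group_b)

alpha = 0.05
if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: reject H0 (statistically significant)")
else:
    print(f"p = {p_value:.4f} >= {alpha}: fail to reject H0")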

Correlation

Correlation refers to the degree of relationship among variables. A correlation coefficient

is a measure of the degree of relationship among variables. There are many correlation

coefficients. Two of the most important measures are the Pearson Product Moment

Correlation Coefficient (by far the most frequently used) and the Spearman Rank

Correlation Coefficient (often used with ordinal measures and/or non-Gaussian

variables). The Pearson is parametric and the Spearman is a nonparametric measure of

relationship.

Regression- A regression is a statistical technique that relates a dependent variable to

one or more independent variables. A regression model is able to show whether changes

observed in the dependent variable are associated with changes in one or more of the

independent variables.
Statistical tests of significance

Statistical tests of significance, also known as hypothesis tests, are a fundamental part

of statistical analysis. These tests help researchers determine whether the observed

differences or associations in their data are statistically significant or if they could have

occurred by random chance. Here are some common statistical tests of significance and

when they are typically used:

Student's t-test: It is used for comparing the means of two groups. It is of two types-

independent/unpaired samples t-test (for comparing two independent groups) and the

paired samples t-test (for comparing two related groups).

Analysis of Variance (ANOVA): It is used for comparing means among more than two

groups. It may be one-way ANOVA (for one categorical independent variable), two-way

ANOVA (for two independent variables), and repeated measures ANOVA (for repeated

measurements on the same subjects).

There are different types of ANOVA depending on the number of factors and levels

(groups) in data. The two most common types are:

One-Way ANOVA: This is used when there is one independent variable (factor) and more

than two levels or groups. It assesses whether there are statistically significant

differences among the means of the groups.

Example: analyzing the effect of different doses of an antidiabetic drug in comparison to an untreated control group.

Two-Way ANOVA: This is used when there are two independent variables, and one has to

assess the main effects of each variable as well as any interaction between them. It's

useful for studying how two factors affect the dependent variable.

Example: A clinical trial. Suppose a pharmaceutical company is conducting a

clinical trial to test the efficacy of a new drug in treating a specific medical condition. They

want to investigate how two factors, dosage (Low, Medium, High) and gender (Male,

Female), impact the response variable, which is the reduction in symptoms after taking

the drug.

Factor 1: Dosage

Low: 1 mg

Medium: 5 mg

High: 10 mg

Factor 2: Gender

Male
Female

For this study, the company collects data from a large number of participants, randomly

assigning them to different dosage levels and recording their gender. After the trial, they

measure the reduction in symptoms for each participant.
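
A design like this can be analysed with a two-way ANOVA in statsmodels. The sketch below is illustrative only: the column names and the simulated symptom-reduction values are assumptions made for the example, not data from the trial described above.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(0)
rows = []
for dose_index, dose in enumerate(["Low", "Medium", "High"]):
    for gender in ["Male", "Female"]:
        # simulate 10 participants per cell; higher dose -> larger reduction
        for r in rng.normal(loc=5 + 3 * dose_index, scale=2, size=10):
            rows.append({"dosage": dose, "gender": gender, "reduction": r})

df = pd.DataFrame(rows)

# main effects of dosage and gender, plus their interaction
model = smf.ols("reduction ~ C(dosage) * C(gender)", data=df).fit()
print(anova_lm(model, typ=2))   # F and p-values for each effect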

Steps of ANOVA:

a) Null Hypothesis (H0): The null hypothesis in ANOVA states that there are no

significant differences among the group means. In other words, all group means

are equal.

b) Alternative Hypothesis (Ha): The alternative hypothesis suggests that at least one

group mean is different from the others.

c) Test Statistic: ANOVA calculates a test statistic called the F-statistic. The F-statistic

is a ratio of the variance between groups to the variance within groups.

d) Significance Level (α): A significance level, often denoted by α (e.g., α = 0.05), is

selected.

e) Decision: If the calculated F-statistic is greater than the critical F-value from the

F-distribution table (based on α and degrees of freedom), null hypothesis is

rejected. This indicates that at least one group mean is significantly different from

the others.

f) Post hoc Tests: If ANOVA indicates significant differences, post hoc tests (e.g.,

Tukey's HSD, Bonferroni, Scheffé) can be performed to identify which specific

groups differ from each other.
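
Steps (c)-(e) can be carried out directly with scipy.stats.f_oneway, which returns the F-statistic and its p-value. The sketch below uses made-up data for three groups and is illustrative only:

from scipy import stats

group1 = [23, 25, 27, 22, 26, 24]
group2 = [30, 31, 29, 33, 32, 28]
group3 = [26, 27, 25, 28, 24, 29]

f_stat, p_value = stats.f_oneway(group1, group2, group3)

alpha = 0.05
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
if p_value < alpha:
    print("Reject H0: at least one group mean differs (proceed to post hoc tests)")
else:
    print("Fail to reject H0: no evidence of a difference among the group means")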


ANOVA is widely used in various fields, including experimental research, social sciences,

business, and many others, to compare means across different categories or levels and

make statistical inferences about population differences. It helps researchers determine

whether the variations observed in the data are likely due to real differences between

groups or simply the result of random chance.

Chi-squared test (χ²): It is used for analyzing categorical data and testing for

associations or independence between two or more categorical variables.

(Categorical data- data on which arithmetic operations (+, -, ×, /, averaging, etc.) are not meaningful. This data is represented by pie charts, bar diagrams, etc. Examples: the percentage of people using different transport modes in a city, dates of birth in a population, consumption of different food items, etc.)

It is of two types: the chi-squared test of independence and the chi-squared goodness-of-fit test.
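
For illustration (a sketch with made-up counts, not from the original notes), a chi-squared test of independence on a 2 × 2 contingency table can be run with scipy:

from scipy.stats import chi2_contingency

#              disease   no disease
table = [[30, 70],    # exposed group
         [15, 85]]    # unexposed group

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.4f}")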

Pearson Correlation Coefficient (Pearson's r):

Used for assessing the strength and direction of a linear relationship between two

continuous variables. It provides a measure of the linear association, ranging from -1

(perfect negative correlation) to 1 (perfect positive correlation).

Spearman Rank Correlation: Used when the relationship between two variables is not

necessarily linear or when the data is ordinal. Calculates a correlation coefficient based

on the ranks of the data points.


(Ordinal data- Ordinal data represents categories with a meaningful order or ranking. However, the intervals between categories are not necessarily equal. Examples include education levels (e.g., "high school," "Intermediate," "bachelor's degree") and customer satisfaction ratings (e.g., "very dissatisfied," "neutral," "very satisfied.")
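
Both coefficients are available in scipy; the sketch below uses made-up height and weight values purely for illustration:

from scipy import stats

height_cm = [150, 155, 160, 165, 170, 175, 180]
weight_kg = [50, 54, 59, 64, 68, 74, 80]

pearson_r, pearson_p = stats.pearsonr(height_cm, weight_kg)
spearman_rho, spearman_p = stats.spearmanr(height_cm, weight_kg)

print(f"Pearson r = {pearson_r:.3f} (p = {pearson_p:.4f})")
print(f"Spearman rho = {spearman_rho:.3f} (p = {spearman_p:.4f})")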

Kruskal-Wallis Test: Used when comparing three or more independent groups, but the

assumption of normality is violated. It is a non-parametric alternative to one-way ANOVA.

Mann-Whitney U Test: It is used for comparing two independent groups when the data

is not normally distributed. It is a non-parametric alternative to the independent samples

t-test.

Wilcoxon Signed-Rank Test: It is used for comparing two related groups (paired data)

when the data is not normally distributed. It is a non-parametric alternative to the paired

samples t-test.
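
The three nonparametric tests above are also available in scipy; the sketch below uses made-up data and is illustrative only:

from scipy import stats

g1, g2, g3 = [3, 5, 4, 6, 2], [7, 8, 6, 9, 7], [5, 4, 6, 5, 8]

# Kruskal-Wallis: three or more independent groups
print(stats.kruskal(g1, g2, g3))

# Mann-Whitney U: two independent groups
print(stats.mannwhitneyu(g1, g2))

# Wilcoxon signed-rank: two related (paired) samples, e.g. before vs after
before = [12.1, 11.4, 13.2, 10.8, 12.9, 11.7, 13.5, 12.3]
after = [10.2, 11.0, 12.1, 10.5, 11.3, 11.9, 12.0, 11.1]
print(stats.wilcoxon(before, after))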

Logistic Regression: Used when the dependent variable is binary or categorical. It

determines the relationship between the dependent variable and one or more

independent variables.

Linear Regression: It is used for modeling the relationship between a continuous

dependent variable and one or more independent variables. It assesses how well the

independent variables predict the dependent variable.
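
A simple linear regression with one predictor can be fitted with scipy.stats.linregress; the dose and blood-pressure values below are made up for illustration:

from scipy import stats

dose_mg = [1, 2, 3, 4, 5, 6, 7, 8]
drop_in_bp = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

result = stats.linregress(dose_mg, drop_in_bp)
print(f"slope = {result.slope:.2f}, intercept = {result.intercept:.2f}")
print(f"r = {result.rvalue:.3f}, p = {result.pvalue:.4g}")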

Post-hoc analysis

Post hoc analysis, also known as post hoc tests or multiple comparison tests, are

statistical procedures used to make detailed comparisons between groups after an

overall analysis of variance or other statistical test has been conducted.


These tests help identify specific group differences when you have more than two groups. There are several different types of post hoc tests, and the choice of which one to use

depends on the nature of data and research hypothesis. Here are some common post hoc

tests:

Newman-Keuls Method: This is a stepwise procedure that compares all possible pairs

of group means and is often used when variances are unequal.

Bonferroni-Dunn Test: Used in cases where there is a control group and multiple

treatment groups, this test helps identify which treatment groups are significantly

different from the control.

Tukey's Honestly Significant Difference (HSD): Tukey's HSD test is a popular post hoc

test for comparing all possible pairs of group means.

Scheffé Test: The Scheffé test is a conservative and versatile post hoc test that can be

used with any kind of design. It is useful when there are unequal sample sizes and

variances across groups.

Fisher's Least Significant Difference (LSD): The LSD test is a straightforward post hoc

test that can be used when the variances across groups are roughly equal, and sample

sizes are equal. It can be less conservative than other tests.
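
As an illustration of a post hoc comparison in practice, Tukey's HSD is available in statsmodels; the group labels and values below are made up for the example:

from statsmodels.stats.multicomp import pairwise_tukeyhsd

values = [23, 25, 27, 22, 26, 24,     # group A
          30, 31, 29, 33, 32, 28,     # group B
          26, 27, 25, 28, 24, 29]     # group C
groups = ["A"] * 6 + ["B"] * 6 + ["C"] * 6

result = pairwise_tukeyhsd(endog=values, groups=groups, alpha=0.05)
print(result)   # table of pairwise comparisons with adjusted p-values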
