Basic Biostatistics

The document discusses basic biostatistics and its role in epidemiology, emphasizing the importance of summarizing and analyzing data through various methods such as tables, graphs, and statistical tests. It covers different types of data, measures of central tendency and variability, and introduces key statistical methods like t-tests, chi-squared tests, correlation, and regression. Additionally, it highlights the significance of confidence intervals, hypothesis testing, and meta-analysis in drawing conclusions from data.

BASIC BIOSTATISTICS:

CONCEPTS AND TOOLS


Basic Epidemiology, chapter 4
Summarizing data
• In epidemiology, biostatistics helps to:
• Summarize and analyze data
• Test and verify hypotheses
• Data can be either:
• Numeric
• e.g. age, measurements of height, weight, etc.
• Categorical
• Categorical data are obtained through classification
• e.g. classification of individuals by blood group: A, B, AB, or O
• Ordinal (data expressed as ranks, e.g. socioeconomic status: low, middle, high)
• With large amounts of data, summarization is needed to draw appropriate conclusions
• Data can be summarized with tables and graphs
Summarizing data: tables and graphs
• Tables and graphs are made to display data in a way that can be easily understood
• Every table and graph must have enough information to be interpreted without reference to the text
• Titles should clearly describe the contents of the tables or graphs being presented
Table 1: Advantages of tables and graphs
Advantages of tables | Advantages of graphs
1. Help to display more complex data with precision and flexibility | Simple and easy to understand
2. Require less technical skill to prepare | Make use of vivid images that aid memory
3. Take less space for a given amount of information | Can show complex relationships between variables

• There are different types of graphs:
• Pie charts and component band charts
• Spot maps and rate maps
• Bar charts and line graphs
Pie charts and component band charts
• Pie charts and component band charts show how an entity is divided into its
constituent parts.
• A pie chart represents information in a circle, while a component band chart
uses bands
• When there are two or more whole entities to be divided into components, it
is better to use a component band chart

Fig. 1 Pie chart showing number of deaths by cause among 25–34 and 35–44 year olds, United States, 2003
Fig. 2 Component band chart showing number of deaths by cause among 25–44 year olds, United States, 1997 and
Spot and rate maps
• Spot and rate maps show the geographical distribution of disease or other epidemiological events
• Rate maps show:
• Differences in the value of a measure across geographical locations
• Prevalence
• Incidence
• Mortality
• Spot and rate maps can show data in both static and interactive forms, e.g. the World Health Chart
Table 2: Functions of spot and rate maps
Spot maps | Rate maps
Show the locations of individual cases | Show the distribution of rates (e.g. disease rates) across different areas
Each point represents a single case | Use colors or shading to show differences in rates across regions
Spot and rate maps

Fig. 3. Spot map of bacterial meningitis cases, Upper West Region, 2018–2020.
Fig. 4. Rate maps of the world showing incidence rates and mortality rates of thyroid cancer among women
Bar charts and line graphs

• Bar charts compare two or more categories of data
• Bar charts convey data through the varying lengths of the bars
• Line graphs display differences in the value of continuous variables
Frequency distributions and histograms

• A frequency distribution shows how often each value in a dataset occurs
• Frequency distributions are often presented in tabular form
• A histogram is a visual representation of a frequency distribution
• A frequency polygon is a line that connects the midpoints of the tops of the bars of the histogram (e.g. the bell-shaped curve of a normal distribution)
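As a rough illustration of a frequency distribution and the grouped counts behind a histogram, the sketch below uses a small set of made-up ages. The data, the 10-year class width, and the helper name `class_interval` are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical ages of 12 study participants (illustrative data only)
ages = [23, 25, 25, 31, 34, 34, 34, 41, 45, 47, 52, 58]

# Frequency distribution: how often each individual value occurs
value_counts = Counter(ages)


def class_interval(age, width=10):
    """Return the (lower, upper) class interval that contains `age`."""
    lower = (age // width) * width
    return (lower, lower + width - 1)


# Grouped frequency distribution: the counts a histogram would draw as bars
grouped_counts = Counter(class_interval(a) for a in ages)

# Text version of the histogram, one row per class interval
for (lower, upper), count in sorted(grouped_counts.items()):
    print(f"{lower}-{upper}: {'#' * count} ({count})")
```

Each row of the printed output corresponds to one bar of the histogram; joining the midpoints of the bar tops would give the frequency polygon.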
Normal distribution

• The normal distribution is a bell-shaped probability distribution
• The mean, median, and mode are all equal
• Most of the data cluster around the mean
• ~68% of the data fall within 1 SD of the mean
• ~95% of the data fall within 2 SD of the mean
• ~99.7% of the data fall within 3 SD of the mean
• Normal distributions are important because:
• They model many kinds of real-world data
• They allow the use of key statistical methods
• Statistical methods like linear regression, t-tests, and ANOVA assume that the data (or the model errors) follow a normal distribution
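The 68%, 95% and 99.7% figures can be checked numerically. The sketch below uses the standard identity P(|Z| < k) = erf(k/√2) for a standard normal variable Z; the function name `prob_within` is only illustrative.

```python
import math


def prob_within(k):
    """Probability that a normal variable lies within k SDs of its mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} SD: {prob_within(k):.4f}")
```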
Summary numbers-measures of central
tendency
• Measures of central tendency
• Mean
• The sample average
• For a sample with n values of a variable x, the sample mean is:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
• Median
• The middle value after all the measurements have been put in order
• Mode
• The value of the measurement in a sample that occurs most frequently
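A minimal sketch of the three measures, using Python's standard `statistics` module on a made-up sample of body weights (the data are hypothetical):

```python
import statistics

# Hypothetical body weights in kg (illustrative data only)
weights = [60, 62, 62, 65, 70, 71, 79]

mean_weight = statistics.mean(weights)      # sum of values divided by n
median_weight = statistics.median(weights)  # middle value of the sorted data
mode_weight = statistics.mode(weights)      # most frequently occurring value

print(mean_weight, median_weight, mode_weight)
```

In this sample the mean (67) is pulled above the median (65) by the single large value of 79, which is why the median is often preferred for skewed data.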
Measures of variability

• Measures how different the individual data points in a sample are
• Useful for :
• generalizing about a population
• Identifying outliers
• Comparing different data sets
• The most useful measures of variability are:
• Variance
• Standard deviation
• Standard error of mean
Measures of variability
• Standard error
• Measure of potential error
• Estimates efficiency, accuracy and consistency of a sample
• The higher the SE, the lower the reliability
$$\mathrm{SE}(\bar{x}) = \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$
• Standard deviation
• Measures how far apart values are from the mean. Low SD indicates that the values are closer to the mean
• Square root of the variance
$$\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
• Variance
• Measures the average degree to which each value is different from the mean
• Square of the standard deviation
$$\text{variance} = \sigma^2$$
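The relationship between the three measures can be sketched as follows, using the n − 1 denominator for the sample variance. The data are made up for illustration.

```python
import math
import statistics

# Hypothetical measurements (illustrative data only)
values = [4, 8, 6, 5, 3, 7, 9]
n = len(values)

variance = statistics.variance(values)  # sum of squared deviations / (n - 1)
sd = statistics.stdev(values)           # square root of the variance
se = sd / math.sqrt(n)                  # standard error of the mean

print(f"variance={variance:.3f}, SD={sd:.3f}, SE={se:.3f}")
```

Note that the standard error shrinks as n grows even when the standard deviation stays the same, which is why larger samples give more reliable estimates of the mean.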
Basic concepts of statistical inference

• Data from random samples are used to draw conclusions about a population
• These conclusions are made in terms of summary numbers
• Summary numbers for a population (parameters) are represented by Greek letters, e.g. μ, σ and β
• Estimates of these parameters obtained from a sample are represented by x̄, s and b
Using samples to understand populations
• To make statistical inferences, selecting a random sample is important
• Every member of the population has an equal chance of being included in a random sample
• If, on average over repeated samples, the sample mean equals the population mean, it is an unbiased estimate of the population mean
• Sample sizes must be large enough for a study to have statistical power
• Sample sizes are calculated based on the following:
• Prevalence
• Acceptable error
• Detectable difference
Confidence intervals

• A range of values, between a lower and an upper bound, that is likely to contain the true population value
• Represents how much uncertainty there is in a statistic
• Shows how a statistic would vary if a study were repeated several times
• Can be used to test a hypothesis
• Confidence levels of 95% or higher are preferred
• With a 95% CI, about 95% of the intervals obtained from repeated studies would contain the true value
Calculating confidence intervals
• To calculate the upper and lower bounds of a confidence interval, the following are needed:
• Sample size (n)
• Mean ($\bar{x}$)
• Standard deviation (s)
• A constant (1.96 for a 95% confidence interval)
• When n = 10, $\bar{x}$ = 67.9 and SD = 10.2:
Lower bound: $\text{LB} = \bar{x} - 1.96\, s/\sqrt{n}$
Upper bound: $\text{UB} = \bar{x} + 1.96\, s/\sqrt{n}$
The resulting confidence interval is: $\text{LB} < \mu < \text{UB}$
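The worked example above (n = 10, mean 67.9, SD 10.2) can be evaluated with a short sketch. The function name `confidence_interval` is illustrative. The second call is an added comparison, not part of the slide: for a sample this small, the t-distribution constant for 9 degrees of freedom (about 2.262) is often used instead of 1.96 and gives a slightly wider interval.

```python
import math


def confidence_interval(mean, sd, n, constant=1.96):
    """Return (lower bound, upper bound) as mean -/+ constant * sd / sqrt(n)."""
    margin = constant * sd / math.sqrt(n)
    return mean - margin, mean + margin


# Values from the slide, with the normal-based constant 1.96
lb, ub = confidence_interval(67.9, 10.2, 10)
print(f"95% CI (z = 1.96): {lb:.2f} to {ub:.2f}")

# Assumption: same data with the t-based constant for n - 1 = 9 degrees of freedom
lb_t, ub_t = confidence_interval(67.9, 10.2, 10, constant=2.262)
print(f"95% CI (t = 2.262): {lb_t:.2f} to {ub_t:.2f}")
```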
Hypothesis tests, p-values, statistical power

• In biostatistics, hypothesis testing involves putting assumptions about a population parameter to the test
• In testing a hypothesis, it is important to:
• Make a careful statement concerning the hypothesis to be tested
• Know the p-value associated with the test
• Know the statistical power of the test
Fig. showing possible results of a hypothesis test
P-value
• A measure that helps determine how strong the evidence against a null hypothesis is
• The p-value helps to determine whether the data support or contradict the null hypothesis
• If the p-value is smaller than a pre-defined threshold, it suggests that:
• The null hypothesis should be rejected
• The observed result is unlikely to have been caused by chance alone
• There is evidence to support an alternative hypothesis
Statistical power
• The ability of a statistical test to detect an effect when it truly exists
• It measures the likelihood of correctly rejecting the null hypothesis when the alternative is true
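To make the idea of power concrete, the sketch below computes the power of a one-sided z-test for a mean with known standard deviation. This particular test, the function name `z_test_power` and the example numbers are assumptions chosen for simplicity; they are not taken from the slides.

```python
import math
from statistics import NormalDist


def z_test_power(effect, sd, n, alpha=0.05):
    """Power of a one-sided z-test to detect a true mean shift of `effect`."""
    z = NormalDist()                        # standard normal distribution
    z_crit = z.inv_cdf(1 - alpha)           # rejection threshold under H0
    shift = effect * math.sqrt(n) / sd      # true effect in standard-error units
    return 1 - z.cdf(z_crit - shift)        # P(reject H0 | alternative is true)


# Hypothetical scenario: true difference of 5 units, SD of 10
for n in (10, 25, 50):
    print(f"n={n}: power={z_test_power(5, 10, n):.3f}")
```

The output shows power rising with sample size, which is the reason sample size calculations are done before a study starts.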
Basic biostatistics methods in epidemiology

• t-tests
• chi-squared tests
• correlation
• regression
T-tests
• Test whether two means differ significantly; under the null hypothesis, there is no difference between them
• Help to understand whether an observed difference is due to chance
• There are different types of t-tests:
• Independent samples t-tests: compare the means of two separate and unrelated groups (e.g. comparing mean ages between two different populations)
• Paired samples t-tests: used for two sets of measurements that are paired. They take the dependency between the measurements into consideration (e.g. comparing the mean blood pressure in the same population before and after medication)
• One-sample t-tests: compare the mean of a group to a known or hypothesized value (e.g. comparing the level of pesticides in a population to the government-approved limit)
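The three t statistics can be sketched with the standard library alone. The function names and the blood pressure readings are hypothetical, and the independent-samples version below is the pooled-variance form, which assumes equal variances in the two groups. Turning a t statistic into a p-value needs the t distribution, so the example compares against a tabulated critical value instead.

```python
import math
import statistics


def one_sample_t(sample, mu0):
    """t statistic comparing the sample mean with a hypothesized value mu0."""
    n = len(sample)
    se = statistics.stdev(sample) / math.sqrt(n)
    return (statistics.mean(sample) - mu0) / se


def paired_t(before, after):
    """t statistic for paired data: a one-sample test on the differences."""
    diffs = [a - b for b, a in zip(before, after)]
    return one_sample_t(diffs, 0.0)


def independent_t(x, y):
    """Pooled-variance t statistic for two unrelated groups."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * statistics.variance(x)
                  + (ny - 1) * statistics.variance(y)) / (nx + ny - 2)
    se = math.sqrt(pooled_var * (1 / nx + 1 / ny))
    return (statistics.mean(x) - statistics.mean(y)) / se


# Hypothetical systolic blood pressure before and after medication
before = [150, 142, 138, 160, 155]
after = [140, 138, 135, 150, 148]

t_paired = paired_t(before, after)
# Two-sided 5% critical value of t with n - 1 = 4 degrees of freedom
T_CRITICAL_DF4 = 2.776
print(f"t = {t_paired:.3f}, reject H0: {abs(t_paired) > T_CRITICAL_DF4}")
```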
Chi-squared tests for cross tabulations
• The chi-squared test is a statistical analysis used to determine whether there is a significant association between two categorical variables
• To perform this test, a cross-tabulation or contingency table is created
• χ² = Σ [(Observed frequency - Expected frequency)² / Expected frequency]
• After calculating the chi-squared statistic, it is compared to the critical value from the chi-squared distribution
• If the calculated chi-squared statistic is greater than the critical value, the null hypothesis is rejected
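The calculation can be sketched for a 2 × 2 table as below. The counts are invented, and the expected frequencies are taken as row total × column total / grand total, the usual choice under the null hypothesis of no association. The critical value 3.841 is the tabulated 5% value for 1 degree of freedom.

```python
def chi_squared(observed):
    """Chi-squared statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Expected count if the row and column variables were independent
            expected = row_totals[i] * col_totals[j] / total
            statistic += (obs - expected) ** 2 / expected
    return statistic


# Hypothetical cross-tabulation: rows = exposed / unexposed,
# columns = diseased / not diseased
table = [[30, 70],
         [10, 90]]

chi2 = chi_squared(table)
CRITICAL_VALUE_DF1 = 3.841  # 5% significance level, 1 degree of freedom
print(f"chi-squared = {chi2:.2f}, reject H0: {chi2 > CRITICAL_VALUE_DF1}")
```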
Correlation
• Quantifies the degree to which two variables vary together
• Results relating to correlation can be:
• Positive
• Negative
• No correlation
• Correlation is typically measured with a correlation coefficient
• The most common is the Pearson correlation coefficient, r
• r ranges from -1 to +1
• If r is close to +1, positive correlation
• If r is close to -1, negative correlation
• If r is close to 0, zero or weak correlation
• To visualize correlation, scatter plots are the best
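A minimal sketch of the Pearson correlation coefficient, written out from its definition (covariation of x and y divided by the square root of the product of their separate variations). The function name and the example data are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: hours of exercise per week and resting heart rate
exercise_hours = [0, 1, 2, 3, 4, 5]
resting_hr = [82, 80, 77, 75, 72, 70]

r = pearson_r(exercise_hours, resting_hr)
print(f"r = {r:.3f}")  # a value near -1 indicates a strong negative correlation
```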
Regression
• Statistical method used to examine the relationship between a dependent variable and one or more independent variables
• Helps to understand how a change in the independent variable is related to a change in the dependent variable
• Regression models help to estimate values of a dependent variable based on the independent variable
• Common types of regression:
• Linear regression
• Logistic regression
• Cox proportional hazards regression
Linear regression
• Linear regression assumes a straight line relationship between
the dependent and independent variable.
• Mathematically, this model is expressed as:
Y = b0 + b1*X + ε
where:
Y = the dependent variable
X = the independent variable
b0 = the y-intercept, which represents the predicted value of Y when X is zero
b1 = the slope of the line, indicating how much Y is expected to change for a one-unit increase in X
ε = the error term, the part of Y not explained by the straight line
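The slides do not say how b0 and b1 are estimated. A common choice is ordinary least squares, sketched below under that assumption with made-up data on age and systolic blood pressure; the function name `fit_line` is illustrative.

```python
def fit_line(x, y):
    """Ordinary least-squares estimates (b0, b1) for Y = b0 + b1*X."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    b1 = sxy / sxx               # slope
    b0 = mean_y - b1 * mean_x    # intercept
    return b0, b1


# Hypothetical data: age in years and systolic blood pressure in mmHg
age = [30, 40, 50, 60, 70]
sbp = [118, 124, 131, 135, 142]

b0, b1 = fit_line(age, sbp)
predicted_at_55 = b0 + b1 * 55   # estimate Y for a new value of X
print(f"b0={b0:.2f}, b1={b1:.2f}, predicted SBP at age 55: {predicted_at_55:.1f}")
```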
Logistic regression
• Analyses the relationship between a categorical dependent variable and one or more independent variables
• Used for situations where the dependent variable can take on binary values
• In logistic regression, the relationship between the independent variables and the dependent variable is modeled using the logistic or sigmoid function
• Logistic regression estimates odds ratios. The odds are the ratio of the probability of success to the probability of failure, and an odds ratio compares the odds between two groups
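The link between the logistic function, odds and odds ratios can be sketched as follows. The coefficients `b0` and `b1` are hypothetical values for a single binary exposure, not estimates from any real data set.

```python
import math


def logistic(z):
    """Logistic (sigmoid) function: maps any real number to a probability."""
    return 1 / (1 + math.exp(-z))


def odds(p):
    """Odds: probability of success divided by probability of failure."""
    return p / (1 - p)


# Hypothetical coefficients: intercept and effect of a binary exposure
b0, b1 = -2.0, 0.7

p_unexposed = logistic(b0)        # predicted probability when X = 0
p_exposed = logistic(b0 + b1)     # predicted probability when X = 1

# The odds ratio for the exposure equals exp(b1)
odds_ratio = odds(p_exposed) / odds(p_unexposed)
print(f"p0={p_unexposed:.3f}, p1={p_exposed:.3f}, OR={odds_ratio:.3f}")
```

This is why logistic regression coefficients are usually reported after exponentiation: exp(b1) is directly interpretable as an odds ratio.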
Survival analyses and Cox proportional hazards models
• Used to investigate the relationship between the survival time of patients and predictor variables (covariates)
• It is a multivariate statistical model
• h(t) = h0(t)*exp(b1x1 + b2x2 + ... + bpxp)
• In this model,
• t represents survival time
• h(t) is the hazard function, which is determined by the covariates (x1, x2, ..., xp)
• b1, b2, ..., bp are coefficients that measure the impact of the covariates
• h0(t) is the baseline hazard
• Censoring affects the computation of Cox proportional hazards models
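The formula h(t) = h0(t)*exp(b1x1 + ... + bpxp) can be evaluated directly once coefficients are available. The sketch below does not fit a model; the coefficients, covariates and baseline hazard value are hypothetical. It shows that the hazard ratio between two people differing in one covariate is exp(b) for that covariate, whatever the baseline hazard.

```python
import math


def hazard(baseline_hazard, coefficients, covariates):
    """Hazard h(t) = h0(t) * exp(b1*x1 + ... + bp*xp) at a single time t."""
    linear_predictor = sum(b * x for b, x in zip(coefficients, covariates))
    return baseline_hazard * math.exp(linear_predictor)


# Hypothetical coefficients for (smoker: 0 or 1, age in years)
coefficients = [0.5, 0.03]
h0_t = 0.01  # hypothetical baseline hazard at some time t

h_smoker = hazard(h0_t, coefficients, [1, 60])
h_nonsmoker = hazard(h0_t, coefficients, [0, 60])

# The baseline hazard cancels, leaving exp(0.5) for smoking
hazard_ratio = h_smoker / h_nonsmoker
print(f"hazard ratio (smoker vs non-smoker) = {hazard_ratio:.3f}")
```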
Kaplan-Meier survival curves
• Used to display time-to-event data, especially survival data
• The proportion surviving ranges from 1.0 (or 100%) to 0.0 (or 0%)
• Handles censored observations
• Used in the medical field to analyze:
• Effectiveness of treatments
• Survival rate of participants
• How to create a Kaplan-Meier survival curve:
• Identify the starting point
• Observe the event
• Calculate the probability of survival
• Plot the curve
Kaplan-Meier survival curve
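The survival probabilities behind such a curve can be sketched with the product-limit calculation: at each time with at least one event, the running survival probability is multiplied by (1 − events / number at risk), and censored participants leave the risk set without counting as events. The follow-up data and the function name are hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at time 0 and at each event time.

    times:  follow-up time for each participant
    events: 1 if the event was observed, 0 if the participant was censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = [(0, 1.0)]          # starting point: everyone is event-free
    i = 0
    while i < len(data):
        t = data[i][0]
        at_this_time = [e for tt, e in data if tt == t]
        n_events = sum(at_this_time)
        if n_events > 0:
            survival *= 1 - n_events / n_at_risk
            curve.append((t, survival))
        # Both events and censored observations leave the risk set
        n_at_risk -= len(at_this_time)
        i += len(at_this_time)
    return curve


# Hypothetical follow-up times in months; 0 marks a censored observation
times = [2, 3, 3, 5, 8, 9]
events = [1, 1, 0, 1, 0, 1]

curve = kaplan_meier(times, events)
for t, s in curve:
    print(f"t={t}: S(t)={s:.3f}")
```

Plotting these points as a step function gives the survival curve.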
Meta-analysis
• Statistical analysis combining the results of separate but comparable studies
• Used in order to identify an overall trend
• Different from other studies: no new data are collected
• Steps for a successful meta-analysis:
• formulating the problem and study design;
• identifying relevant studies;
• excluding poorly conducted studies or those with major methodological flaws;
• measuring, combining and interpreting the results.
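The slides do not say how results are combined. One widely used approach is fixed-effect inverse-variance weighting, sketched below as an assumption: each study's estimate is weighted by 1/SE², so more precise studies count for more. The three study estimates and standard errors are invented.

```python
import math


def pooled_fixed_effect(estimates, standard_errors):
    """Inverse-variance weighted pooled estimate and its standard error."""
    weights = [1 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1 / total_weight)
    return pooled, pooled_se


# Hypothetical effect estimates and standard errors from three studies
estimates = [0.30, 0.10, 0.25]
standard_errors = [0.10, 0.20, 0.05]

pooled, pooled_se = pooled_fixed_effect(estimates, standard_errors)
print(f"pooled estimate = {pooled:.3f} (SE {pooled_se:.3f})")
```

The pooled standard error is smaller than that of any single study, which illustrates how combining studies helps when individual sample sizes are too small.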
Reasons for the surge in meta-analysis
• Ethical reasons
• Cost issues
• The need to have an overall idea of effects in different populations
• To make conclusive judgements from aggregated studies when the sample size of a single study is too small
