BIO STATISTICS of First Semester
Biostatistics
Definition of Biostatistics:
Biostatistics is the branch of statistics responsible for the proper interpretation of scientific data generated in biology, public health, medicine and other allied health sciences. Biostatisticians are specialists in the evaluation of data as scientific evidence.
VARIABLES AND
THEIR TYPES:
A variable is a measurable characteristic of a person, object or phenomenon that can take on different values.
A simple example of a variable is a person’s weight. The variable weight can take on different values because a person can weigh 55 kg, 35 kg, and so on.
Types of variables
Variables can be classified as either dependent or independent.
DEPENDENT OR
INDEPENDENT
VARIABLES:
In health systems research you often look for causal explanations, so it is important to distinguish between dependent and independent variables.
The variable that is used to describe or measure the problem under study is called the dependent variable. It represents the output or effect, or is tested to see whether there is an effect.
A dependent variable is also known as a ‘response variable’ or ‘outcome variable’.
The variables that are used to describe or explain differences in the dependent variable, or that cause changes in the dependent variable, are called the independent variables. They represent the inputs or causes, or are tested to see whether they are the cause. An independent variable is also known as a ‘predictor variable’, ‘explanatory variable’, ‘risk factor’ or ‘exposure variable’.
For example, in a study of the relationship between smoking and lung cancer, ‘lung cancer’ (with the values yes or no) would be the dependent variable and ‘smoking’ (varying from not smoking to smoking more than three packets a day) would be the independent variable.
Whether a variable is dependent or independent is determined by the statement of the problem and the objectives of the study. It is therefore important, when designing an analytical study, to clearly indicate which variable is the dependent one and which are the independent ones.
If a researcher investigates why people smoke, ‘smoking’ is the dependent variable and ‘pressure from peers to smoke’ could be an independent variable. In the lung cancer study, ‘smoking’ was the independent variable and lung cancer was the outcome.
DATA, TYPES AND
ITS
CLASSIFICATION
Description of Data:
1. Qualitative or Categorical Data:
Qualitative data comprise characteristics which cannot be expressed numerically, like gender (male, female), ethnicity (White, Black, Asian, American), healing (yes or no), etc. They are divided into three types:
Binary
Nominal
Ordinal
BINARY DATA:
In binary data, the values of the variable fall into two mutually exclusive categories.
Nominal Data:
In nominal data, the values of the variable fall into more than two mutually exclusive categories. These categories, however, cannot be ordered one above another (as they are not greater or lesser than each other).
Example :
Nominal data
CATEGORIES:
Marital Status:
Single, Married,
Widowed, Separated
and Divorced.
Employment Status:
Unemployed, Self-
employed,
Government
employed.
Ordinal Data:
In ordinal data, the values of the variable also fall into more than two mutually exclusive categories, but the categories can be ordered one above another, from lowest to highest or vice versa.
2. Quantitative or Numerical Data:
DISCRETE DATA:
Discrete data are data whose values can only be whole numbers.
Examples:
Number of children (e.g. 2, 3 or any whole number)
Number of meals taken in a day (1, 2, 3 or 4)
Length of stay (5 days, 6 days, etc.)
Continuous Data:
Continuous data are data whose values can take a decimal or fractional form.
Examples:
Weight of the study participants (e.g. 55.0 kg, 59.5 kg, 76.8 kg)
Haemoglobin of the participants enrolled in the study (e.g. 13.1 g/dl, 16.0 g/dl, 8.9 g/dl)
Tabulation and
graphical
presentation of data
Once the data are collected, the researcher needs to present the information to the audience. Graphs and tables are excellent means of presenting this information. The style of presentation depends, of course, on the type of data. Data can be presented as frequency tables, charts, graphs, etc. As a rule of thumb, categorical data should be presented as a bar chart or pie chart, and continuous data as a histogram.
Frequency Tables
The most common way of presenting data is to arrange them in the form of a table. A frequency table gives the frequency with which (or the number of times) a particular value appears in the data. The basic principles of tabulation of data are:
1. The information should be presented in a simple and orderly manner.
2. The table should have a title, which must be brief and comprehensive.
3. Rows and columns must have their own captions.
4. The titles of the rows must be entered on the left side of the table, while the titles of the columns are at the top. The rest of the table, constituting the body, contains the numerical values as actual numbers, as percentages or in both forms.
5. Standard codes or symbols, if used, should be explained in a footnote.
In a frequency table (see Tables 1 and 2), data are represented in tabular form.
Example
The marital status of the respondents (200 in total) who participated in a knowledge, attitude and practice survey regarding malaria was as follows: single 80 (40%), married 100 (50%) and divorced 20 (10%).
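The percentages in this example can be reproduced with a short sketch. The counts are taken from the example above; the function name `frequency_table` and the print layout are illustrative choices, not part of the original notes.

```python
# Counts from the survey example above (200 respondents in total).
counts = {"Single": 80, "Married": 100, "Divorced": 20}

def frequency_table(counts):
    """Return (category, frequency, percentage) rows for a dict of counts."""
    total = sum(counts.values())
    return [(category, f, 100 * f / total) for category, f in counts.items()]

rows = frequency_table(counts)

# Print a simple frequency table with a caption for every column.
print(f"{'Marital status':<16}{'Frequency':>10}{'Percent':>10}")
for category, f, pct in rows:
    print(f"{category:<16}{f:>10}{pct:>9.0f}%")
print(f"{'Total':<16}{sum(counts.values()):>10}{100:>9}%")
```

Each percentage is the category frequency divided by the total number of respondents, multiplied by 100.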
Bar graph: X-axis = marital status of respondents; Y-axis = percentage of respondents.
Pie Chart:
A pie chart can also be used to display binary, nominal and ordinal (categorical) data. A pie chart consists of a circular region partitioned into sections, where each section represents a part or a percentage of the whole.
Example
The data regarding
knowledge of
research ethics was
collected from 300
postgraduate
trainees.
Fig. Sex of
respondents
OTHER EXAMPLE OF PIE CHART REPRESENTATION
Percentage of non-communicable diseases (sections shown: CVD, mental disorders, injuries, respiratory diseases, cancers, diabetes).
Fig. Main causes of death worldwide presented in a pie chart
Histogram:
A histogram depicts a frequency distribution for quantitative data; it comprises a series of adjacent bars. A histogram is constructed to represent continuous (quantitative) data.
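Many statistical methods assume that a quantitative variable is approximately normally distributed (bell-shaped curve); a histogram is a convenient way to check the shape of the distribution.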
Frequency Polygon
This is a frequency
distribution
obtained by joining
the mid-points of the
histogram blocks.
Its advantage over the histogram is that two or more variables may be drawn and compared at the same time.
Cumulative Frequency Polygon (Ogive)
The horizontal scale is the same as that used for the histogram; the vertical scale indicates cumulative frequency or cumulative relative frequency.
Line Graph
A line graph (also
called time series
plot) is appropriate
for representing
data that vary
continuously. It
shows a trend of
variables over time.
To construct a time
series plot, time is
placed on a
horizontal axis and
the variable being
measured on a
vertical axis, with
points being
connected using line
segments.
Example
An open-label, single-arm trial was designed to evaluate the impact of parenteral iron. The participants’ mean baseline haemoglobin (g/dl) was measured, and the mean haemoglobin values at 3, 12 and 24 weeks were also recorded. The information is presented below in tabular and graphical form.
ii- Decreasing or
negative pattern
e.g. vaccination status
and prevalence of
infection
iii- No relationship
e.g. income level
and total fertility
rate
Measure of
Central
Tendency
Measure of central
tendency refers to the
summary measure
used to describe the
most ‘typical’ value in
a set of values.
The three most
common measures of
central tendency are
mean, median and
mode.
Mean:
The most popular
measure of central
tendency for a
quantitative dataset is
the arithmetic mean
or simply the mean of
the dataset. It is also
known as the
AVERAGE.
It is calculated by adding all the observations and dividing by the total number of observations. The sample mean is denoted by x̄ (pronounced ‘x bar’) and the population mean is denoted by μ (the Greek letter mu). Note that the mean can only be calculated for quantitative data.
Median:
The median is an important measure of central tendency. It is the value that divides a distribution into two equal halves. We arrange the observations in order from the smallest to the largest value, or vice versa. If there is an odd number of observations, the median is the middle value. If there is an even number of observations, the median is the average of the two middle values.
Position of the median = (n+1)/2
If there are 9 values in total, then (9+1)/2 = 5, so the value at the 5th position is the median.
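The position rule can be sketched in a few lines. The two small data sets below are made up for illustration and do not come from the notes; the function simply sorts the values and picks the middle one (odd n) or averages the two middle ones (even n).

```python
def median(values):
    """Median via the position rule: sort, then take the middle value(s)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty list is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the value at position (n + 1) / 2, which is index n // 2.
        return ordered[mid]
    # Even n: average of the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2

nine_values = [12, 7, 3, 9, 15, 1, 8, 10, 5]  # illustrative data, n = 9
even_values = [4, 8, 6, 2]                    # illustrative data, n = 4

print(median(nine_values))  # value at the 5th position of the sorted list
print(median(even_values))  # average of the 2nd and 3rd sorted values
```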
MODE
It is the most frequent item, or most commonly occurring value, in a series of observations.
For example:
The diastolic blood pressures of 20 individuals were 85, 75, 81, 79, 71, 95, 75, 77, 75, 90, 71, 75, 79, 95, 75, 77, 84, 75, 81, 75. The mode, or most frequently occurring value, is 75.
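As a quick check of this example, the sketch below counts how often each reading occurs and also reports the mean and median of the same 20 values, using only the Python standard library. The variable names are illustrative.

```python
from collections import Counter
import statistics

# Diastolic blood pressure readings from the example above.
dbp = [85, 75, 81, 79, 71, 95, 75, 77, 75, 90,
       71, 75, 79, 95, 75, 77, 84, 75, 81, 75]

# Mode: the value with the highest frequency.
counts = Counter(dbp)
mode_value, mode_count = counts.most_common(1)[0]
print("mode:", mode_value, "occurring", mode_count, "times")

# The other two measures of central tendency for the same data.
mean_value = statistics.mean(dbp)
median_value = statistics.median(dbp)
print("mean:", mean_value, "median:", median_value)
```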
Sources of DATA
on Community
Health
There are numerous
sources of data on
morbidity and
mortality in the
community. Each
source has advantages
and limitations.
Complete and
unbiased
ascertainment is
carried out by special
surveillance systems.
The sources may be:
. Census
. Reports of Notifiable Diseases
. Hospital Records
. National Health
Survey, etc.
. CENSUS:
The census is an important source of health information. It is taken in most countries of the world at regular intervals, usually of ten years.
A census is defined by the United Nations as ‘the total process of collecting, compiling and publishing demographic, economic, and social data pertaining, at a specific time or times, to all persons in a country or delimited territory’.
A census is a massive undertaking: every member of the population must be contacted within a given time and a variety of information collected. It needs considerable organization, vast preparation and several years to analyze the results. This is the main drawback of the census as a data source, i.e. the full results are usually not available quickly.
Although the primary function of the census is to provide demographic information, such as the total count of the population and its breakdown into groups and sub-groups (for example, age and sex distribution), this represents only a small part of the total information collected. The census also covers the economic and social characteristics of the people, the conditions under which they live, how they live, how they work, their income and other basic information. These data provide a basic frame of reference and baseline for planning action, and a guideline for further research.
A census consists of ‘periodic counts or enumerations of a population’, carried out every 10 years. Data are collected on many characteristics of the population, such as name, address, age, sex, race, marital status and relationship to the head of the household, as well as some characteristics of housing. Information on nativity, migration, education, parity, employment status, income, etc., is also obtained. There are two principal methods for enumeration of the population:
De facto, which allocates persons according to their location at the time of enumeration; and
De jure, which assigns them according to their usual place of residence.
VITAL
STATISTICS:
Vital statistics are the data obtained by collecting or registering all vital events, i.e. births, marriages, divorces, separations, and deaths (and fetal deaths).
PRESENTATION
OF
STATISTICAL
DATA
Statistical data,
once collected,
must be arranged
purposively, in
order to bring out
the important
points clearly and
strikingly.
Therefore the
manner in which
statistical data is
presented is of
utmost importance.
There are several
ways of presenting
data (depending
upon type of data),
viz: tables, charts,
graphs, diagrams,
pictures and
special curves. A
brief description of
these methods is
given below:
. TABULATION
Tabulation is the
first step before the
data is used for
analysis or
interpretation. A
table can be simple
or complex,
depending upon
the number or
measurement of a
single set or
multiple sets of
items. Whether
simple or complex,
there are certain
general principles
which should be
borne in mind in
designing tables:
(a) The tables
should be
numbered e.g.,
Table 1, Table 2,
etc.
(b) A title must
be given to each
table. The title
must be brief and
self-explanatory
(c) The heading
of columns or
rows should be
clear and concise
(d) The data must
be presented
according to size
or importance;
chronologically,
alphabetically or
geographically
(e) If percentages
or averages are to
be compared,
they should be
placed as close as
possible
(f) No table
should be too
large
(g) Most people find a vertical arrangement better than a horizontal one because it is easier to scan the data from top to bottom than from left to right
(h) Footnotes may be given, where necessary, providing explanatory notes or additional information.
Some examples of
tabulation are
given below:
1. SIMPLE
TABLE:
This is a tabulated
record of the fixed
characteristic under
study.
Table 1. Total Fertility Rate (TFR) in Selected Countries, 1950 & 1955.
Country        TFR (per woman), 1950    TFR (per woman), 1955
Germany        2.1                      1.3
Japan          2.7                      1.5
Sweden         2.2                      1.7
U.K            2.2                      1.7
South Korea    5.2                      1.7
China          6.1                      1.9
Sri Lanka      5.7                      2.3
Indonesia      5.5                      2.9
India          6.0                      3.4
Bangladesh     6.7                      3.4
Pakistan       6.5                      5.2
Kenya          7.5                      5.4
Yemen          7.7                      7.7
2. FREQUENCY
DISTRIBUTION
TABLE:
In a frequency
distribution table,
the data is first split
up into convenient
groups (class
intervals) and the
number of items
(frequency, f )
which occur in each
group is shown in
the adjacent column.
Class Limits are the smallest and largest numbers in a class interval (CI); e.g. the first CI in the table is 20-29, where 20 and 29 are the class limits.
Class Boundaries, or true limits, are the points that separate the upper limit of one class from the lower limit of the next; e.g. the class boundary between classes 80-89 and 90-99 in the table is 89.5; it is the upper boundary of the former and the lower boundary of the latter.
Class Marks: mid-points of the class intervals; e.g. the table has class marks of 24.5, 34.5, ..., 84.5 and 94.5.
Class Width: difference between two consecutive lower class limits; e.g. the table has a class width of 10.
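These four definitions can be checked against the class intervals 20-29, 30-39, ..., 90-99 with a small sketch. It assumes integer class limits with a gap of 1 between consecutive classes, so each boundary lies 0.5 beyond the limit; the function name `describe_classes` is illustrative.

```python
# Class limits of the high school score table: 20-29, 30-39, ..., 90-99.
class_limits = [(lower, lower + 9) for lower in range(20, 100, 10)]

def describe_classes(limits):
    """Return class boundaries, class marks and class width.

    Assumes integer limits with a gap of 1 between consecutive classes.
    """
    # A boundary lies halfway between one class's upper limit and the
    # next class's lower limit, i.e. 0.5 beyond each integer limit.
    boundaries = [(lo - 0.5, hi + 0.5) for lo, hi in limits]
    # Class mark: mid-point of the class interval.
    marks = [(lo + hi) / 2 for lo, hi in limits]
    # Class width: difference between two consecutive lower class limits.
    width = limits[1][0] - limits[0][0]
    return boundaries, marks, width

boundaries, marks, width = describe_classes(class_limits)
for (lo, hi), (b_lo, b_hi), mark in zip(class_limits, boundaries, marks):
    print(f"{lo}-{hi}: boundaries {b_lo} to {b_hi}, class mark {mark}")
print("class width:", width)
```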
To develop a frequency table:
. First take the upper and lower limits (the maximum and minimum values) of the data.
. Then decide the number of class intervals.
. The width of each class interval is the range divided by the number of class intervals, where the range is the upper limit minus the lower limit.
. The lower limit of the first interval is the minimum value minus half the width of the class interval (width / 2).
. For example: we have data on 50 cases in which the lower limit is 20 and the upper limit is 45, and we want to develop 9 class intervals.
. The calculation is as follows:
. First get the range: 45 - 20 = 25.
. Divide it by the number of class intervals to get the gap: 25/9 ≈ 2.8, rounded to 3.
. For the starting point, the formula is minimum - (calculated gap / 2), i.e. 20 - (3/2) = 20 - 1.5 = 18.5, rounded to 19, so the starting point will be 19 and not 20.
. The class intervals are then 19-21, 22-24, ...
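The procedure above can be sketched as follows, assuming integer data and rounding halves upward (so that 18.5 becomes 19, as in the worked example). The helper `round_half_up` is introduced here because Python's built-in `round` rounds halves to the nearest even number and would give 18; both function names are illustrative.

```python
import math

def round_half_up(x):
    """Round to the nearest integer, with halves rounded upward."""
    return math.floor(x + 0.5)

def class_intervals(minimum, maximum, n_classes):
    """Build integer class intervals following the steps in the notes."""
    data_range = maximum - minimum                  # 45 - 20 = 25
    width = round_half_up(data_range / n_classes)   # 25 / 9 = 2.78 -> 3
    start = round_half_up(minimum - width / 2)      # 20 - 1.5 = 18.5 -> 19
    # Each interval covers `width` consecutive integers.
    return [(start + i * width, start + (i + 1) * width - 1)
            for i in range(n_classes)]

intervals = class_intervals(20, 45, 9)
for lower, upper in intervals:
    print(f"{lower}-{upper}")
```

With these inputs the nine intervals run from 19-21 up to 43-45, so both the minimum (20) and the maximum (45) fall inside the table.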
. Simple Frequency
Distribution (f ):
This lists each
possible or actual
score and its
frequency; can be
used for qualitative
data.
Table 2. Simple
Frequency
Distribution of
Student’s Score in
High School
. Relative Frequency Distribution (RF/P):
The percentages (RF) or proportions (P) relative to the total number of cases are given for each class interval. The percentage is obtained by dividing the number of cases in the CI by the total number of cases and multiplying by 100.
RF = (class frequency / sum of all frequencies) x 100
Table. Relative Frequency Distribution for High School Scores
CI       f     RF (%)   P
20-29    2     4.8      .048
30-39    6     14.3     .143
40-49    8     19.0     .190
50-59    6     14.3     .143
60-69    8     19.0     .190
70-79    3     7.1      .071
80-89    6     14.3     .143
90-99    3     7.1      .071
Total    42    100      1
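The RF and P columns can be recomputed from the frequencies with a short sketch. The frequencies are those of the table above; the dictionary layout and rounding (one decimal for percentages, three for proportions) are illustrative choices.

```python
# Class frequencies (f) from the high school score table.
frequencies = {
    "20-29": 2, "30-39": 6, "40-49": 8, "50-59": 6,
    "60-69": 8, "70-79": 3, "80-89": 6, "90-99": 3,
}

total = sum(frequencies.values())

# For each class interval: (f, RF as a percentage, P as a proportion).
table = {
    ci: (f, round(100 * f / total, 1), round(f / total, 3))
    for ci, f in frequencies.items()
}

print(f"{'CI':<8}{'f':>4}{'RF (%)':>9}{'P':>8}")
for ci, (f, rf, p) in table.items():
    print(f"{ci:<8}{f:>4}{rf:>9.1f}{p:>8.3f}")
print(f"{'Total':<8}{total:>4}")
```

Because each percentage is rounded to one decimal place, the printed RF column may add up to slightly less or more than 100.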
. Cumulative Relative Frequency Distribution:
This lists the percentages (or proportions) of cases below the upper limit of each class interval.
CRF = cumulative frequency / total number of observations
An Example of Organization of Quantitative Data:
The students of a medical university conducted a baseline sample survey in a community and also asked about the number of living children per woman (15-49 years). The following data were collected from a random sample of n = 30 women.
2, 2, 5, 3, 0, 1, 3, 2, 3, 4, 1, 3, 4, 5, 7, 3, 2, 4, 1, 0, 5, 8, 6, 5, 4, 2, 4, 4, 7, 6
No. of children   F    RF             CF    CRF
0                 2    2/30 = 0.067   2     2/30 = 0.067
1                 3    3/30 = 0.10    5     5/30 = 0.167
2                 5    5/30 = 0.167   10    10/30 = 0.333
3                 5    5/30 = 0.167   15    15/30 = 0.5
4                 6    6/30 = 0.20    21    21/30 = 0.70
5                 4    4/30 = 0.133   25    25/30 = 0.833
6                 2    2/30 = 0.067   27    27/30 = 0.90
7                 2    2/30 = 0.067   29    29/30 = 0.967
8                 1    1/30 = 0.033   30    30/30 = 1.00
(F = frequency, RF = relative frequency, CF = cumulative frequency, CRF = cumulative relative frequency)
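The table can be rebuilt from the raw data with a brief sketch using the Python standard library. The 30 observations are those listed above; variable names are illustrative.

```python
from collections import Counter
from itertools import accumulate

# Number of living children reported by the 30 sampled women.
children = [2, 2, 5, 3, 0, 1, 3, 2, 3, 4, 1, 3,
            4, 5, 7, 3, 2, 4, 1, 0, 5, 8, 6, 5,
            4, 2, 4, 4, 7, 6]

n = len(children)
counts = Counter(children)
values = sorted(counts)                  # distinct values: 0, 1, ..., 8
freqs = [counts[v] for v in values]      # F: frequency of each value
cum_freqs = list(accumulate(freqs))      # CF: running total of F

print(f"{'Children':<10}{'F':>4}{'RF':>8}{'CF':>5}{'CRF':>8}")
for v, f, cf in zip(values, freqs, cum_freqs):
    # RF = F / n and CRF = CF / n
    print(f"{v:<10}{f:>4}{f / n:>8.3f}{cf:>5}{cf / n:>8.3f}")
```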
. Stem-and-Leaf
Plots:
This shows stems (the leftmost digits, i.e. all but the last digit of each score) in one column and leaves (the rightmost or last digits) in another column.
Stem-and-Leaf Display of School Scores
Stem Leaves
2 23
3 244778
4 45677899
5 344667
6 34555579
7 456
8 345577
9 888
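A display like this can be generated with a short sketch. The 42 scores below are read back from the stems and leaves shown above, and the code assumes non-negative whole-number scores; the function name `stem_and_leaf` is illustrative.

```python
from collections import defaultdict

# Scores read back from the stem-and-leaf display above (42 values).
scores = [22, 23, 32, 34, 34, 37, 37, 38, 44, 45, 46, 47, 47, 48, 49, 49,
          53, 54, 54, 56, 56, 57, 63, 64, 65, 65, 65, 65, 67, 69,
          74, 75, 76, 83, 84, 85, 85, 87, 87, 98, 98, 98]

def stem_and_leaf(values):
    """Map each stem (all but the last digit) to its string of leaves."""
    plot = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)   # e.g. 47 -> stem 4, leaf 7
        plot[stem].append(str(leaf))
    return {stem: "".join(leaves) for stem, leaves in plot.items()}

plot = stem_and_leaf(scores)
print("Stem Leaves")
for stem, leaves in plot.items():
    print(f"{stem:<4} {leaves}")
```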
To save space, a condensed stem-and-leaf plot can also be developed. The leaves belonging to each of the stems in a row are separated by an asterisk; every row in the condensed plot contains an asterisk.
A Condensed Stem-and-Leaf Plot for High School Scores
Stem Leaves
2-3 23*244778
4-5 45677899*344667
6-7 34555579*456
8-9 345577*888
. CHARTS AND
DIAGRAMS
This is a very useful method of presenting statistical data, and charts are better retained in the memory than statistical tables. Graphs should be carefully titled, and both axes should be clearly labeled. Every graph, called a figure in most journal articles, should ideally be readable and understandable on its own, without resorting to the accompanying text. The data to be presented by diagrams must be simple. However, simplicity may be obtained only at the expense of detail and accuracy; that is, a lot of detail is lost in charts and diagrams. If we want the full picture, we have to go back to the original data.
For Graphical Presentation of Qualitative Data:
Bar Charts (Simple, Multiple, Component), Pareto Chart, Sliding Bar Chart (e.g. Population Pyramid), and Pie Chart
. BAR CHARTS
The various categories are presented along the horizontal or x-axis; their frequencies (number or percent) are on the vertical or y-axis. The length of each bar is proportional to the magnitude to be represented. The scale on the y-axis begins at zero.
Bar charts are used for displaying qualitative (nominal or ordinal) data.
Bar charts may be:
(A) SIMPLE BAR CHART: Vertical or horizontal bars are separated by appropriate spaces. A suitable scale is chosen to present the length of the bars.