Chapter 1: Biostatistics - Descriptive Statistics
Descriptive Statistics
04/17/25 1
Learning Objectives
At the end of this chapter, the students will be able to
Ω Define and Identify the different types of data and
understand why we need to classify variables
Ω Identify the different methods of data collection and the criteria
that we use to select a method of data collection.
Ω Define a questionnaire, identify the different parts of a
questionnaire and indicate the procedures to prepare a
questionnaire.
Ω Identify the different methods of data organization and
presentation.
Ω Understand the criteria for the selection of a method to
organize and present data.
Ω Identify the different methods of data summarization
Descriptive Statistics
• Techniques used to organize and
summarize a set of data in a concise way.
– Organization of data
– Summarization of data
– Presentation of data
• Numbers that have not been summarized
and organized are called raw data.
Organization of data
• arranging data in a way that makes it
easy to find and use.
• This can be done by creating a filing
system, using spreadsheets or
databases, or by using a data
visualization tool
• Summarization of data: the
process of reducing the amount of data
to its most important points.
• This can be done by creating a table of
contents, an executive summary,
or by using a data reduction tool.
Presentation of data
• the process of communicating data to
others in a way that is easy to
understand.
• This can be done by creating a report,
a presentation, or by using a data
visualization tool.
Definition of Terms
• Biostatistics is the application of
statistical techniques to scientific
research in health-related fields,
including medicine, biology, and public
health.
• It involves the collection, analysis, and
interpretation of biological data,
especially data relating to human
biology, health, and medicine
Descriptive statistics
• refers to the analysis, summary, and
communication of findings that
describe a data set.
• It involves measures of central
tendency (mean, median, mode) and
measures of variability (range,
variance, standard deviation)
• They are used to summarize and present
data concisely and meaningfully,
aiding in the understanding of the
central tendency, dispersion, and
shape of the distribution of a dataset.
Statistics
• in the context of Biostatistics refers to the
application of statistical methods to biological,
medical, and health-related data.
• It involves the collection, analysis, presentation,
and interpretation of data specifically within
these fields.
Descriptive statistics include:
• Tables
• Graphs
• Measures of variability/dispersion
• Before summarization and organization, we
need to know the types of variables and
measurement scales of our data.
Variable
• Variable: A characteristic which takes
different values in different persons, places,
or things.
• Any aspect of an individual or object that is
measured (e.g., BP) or recorded (e.g., age,
sex) and takes any value.
• There may be one variable in a study or
many.
• E.g., A study of treatment outcome of TB
• Variables can be broadly classified
into:
– Categorical (or Qualitative) or
– Quantitative (or numerical variables).
• Categorical variable: A variable or
characteristic that cannot be measured in
quantitative form but can only be sorted by
name or category.
• Quantitative variable: A variable that can
be measured (or counted) and expressed
numerically.
Quantitative variable is divided into two:
1. Discrete: It can only have a limited number of
discrete values (usually whole numbers).
– E.g., the number of episodes of diarrhoea a child has
had in a year. You can’t have 12.5 episodes of diarrhoea
• Characterized by gaps or interruptions in the
values (integers).
• Both the order and magnitude of the values matter.
• The values aren’t just labels, but are actual
measurable quantities.
2. Continuous variable: It can have an
infinite number of possible values in any
given interval.
• Both the magnitude and the order of the
values matter
• Does not possess the gaps or interruptions
• Weight is continuous since it can take on
any number of values (e.g., 34.575 kg).
SUMMARY
Types of variables:
• Qualitative (or categorical)
• Quantitative (measurement)
1. Nominal scale:
• Data that represent categories or names. There is
no implied order to the categories of nominal
data.
• The simplest type of data, in which the values fall
into unordered categories or classes
• Consists of “naming” observations or classifying
them into various mutually exclusive and
collectively exhaustive categories
• Uses names, labels, or symbols to assign each
measurement.
• Each item must fit into exactly one category.
– Examples: Blood type, sex, race, marital status, etc.
Example of nominal scale:
Race/Ethnicity:
1. Black
2. White
3. Latino
4. Other
• The numbers have NO meaning
• They are labels only
• If nominal data can take on only two
possible values, they are called
dichotomous or binary.
• So sex is not just nominal, it is
dichotomous (male or female).
• Yes/no questions
– E.g., cured from TB at 6 months of Rx
2. Ordinal scale:
• Assigns each measurement to one of a limited
number of categories that are ranked in terms of
order.
• The spaces or intervals between the categories are
not necessarily equal.
• Although non-numerical, can be considered to
have a natural ordering
– Examples: Patient status, cancer stages,
social class, etc.
Example of ordinal scale:
3. Interval scale:
- In interval data the intervals between values are
the same.
- Measured on a continuum and differences
between any two numbers on a scale are of known
size.
Example: Temperature in °F on 4 consecutive days
Days: A B C D
Temp. (°F): 50 55 60 65
For these data, not only is day A at 50°F cooler
than day D at 65°F, but it is 15°F cooler.
- It has no true zero point. “0” is arbitrarily chosen
and doesn’t reflect the absence of temperature.
4. Ratio scale:
- Measurement begins at a true zero point and
the scale has equal intervals.
- Examples: Height, age, weight, BP, etc.
– The absence of negative numbers and
the presence of a true zero point are
key characteristics of a ratio scale.
– Ratio scales are the highest level of measurement.
Note on meaningfulness of “ratio”:
– Someone who weighs 80 kg is two times as
heavy as someone else who weighs 40 kg. This
is true even if weight had been measured in
other units.
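The difference between an interval and a ratio scale can be checked numerically. The sketch below is only an illustration: it assumes the standard conversions 1 kg = 2.20462 lb and °C = (°F - 32) × 5/9, and the function names are made up for this example.

```python
def kg_to_lb(kg):
    # Ratio scale: conversion is a pure multiplication, so zero stays zero.
    return kg * 2.20462

def f_to_c(f):
    # Interval scale: conversion also shifts the (arbitrary) zero point.
    return (f - 32) * 5 / 9

# Weight: the ratio 80 kg : 40 kg is the same in any unit.
weight_ratio_kg = 80 / 40
weight_ratio_lb = kg_to_lb(80) / kg_to_lb(40)

# Temperature: the ratio of day D (65 °F) to day A (50 °F) depends on the unit,
# so statements such as "1.3 times as warm" are not meaningful.
temp_ratio_f = 65 / 50
temp_ratio_c = f_to_c(65) / f_to_c(50)

# Differences, however, stay proportional on an interval scale.
diff_f = 65 - 50
diff_c = f_to_c(65) - f_to_c(50)

print("weight ratio (kg, lb):", weight_ratio_kg, round(weight_ratio_lb, 4))
print("temperature ratio (F, C):", temp_ratio_f, round(temp_ratio_c, 4))
print("temperature difference (F, C):", diff_f, round(diff_c, 4))
```

The weight ratio is unchanged by the change of unit, while the temperature ratio is not; only differences carry meaning on the interval scale.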
Degree of precision in measuring
Nominal
Ordinal
Interval
Ratio
Methods of Data Collection
Introduction
Before any statistical work can be done data must
be collected. Depending on the type of variable
and the objective of the study, different data
collection methods can be employed.
Data Collection Methods
Data collection techniques allow us to
systematically collect data about our objects of
study (people, objects, and phenomena) and
about the setting in which they occur.
In the collection of data we have to be
systematic. If data are collected haphazardly, it
will be difficult to answer our research questions
in a conclusive way.
Various data collection techniques can be used
such as:
Observation
Face-to-face interview
self-administered interviews
Postal or mail method and telephone interviews
Using available information
Focus group discussions (FGD)
In-depth interview
Other data collection techniques – Rapid
appraisal techniques, Nominal group techniques,
Delphi techniques, life histories, case studies, etc.
Problems in gathering data
→Language barriers
→Lack of adequate time
→Expense
→Inadequately trained and experienced staff
→ Invasion of privacy
→Suspicion/doubt
→Bias (spatial(r/ship), project, person, season,
diplomatic, professional)
→Cultural norms (e.g. which may preclude
men interviewing women)
Choosing a Method of Data Collection
◊ Decision-makers need information that is
relevant, timely, accurate and usable.
◊ The cost of obtaining, processing and
analyzing these data is high.
◊ The challenge is to find ways that lead to
information that is cost-effective, relevant,
timely and important for immediate use.
◊ The statistical data may be classified under
two categories, depending upon the sources.
◊ 1) Primary data 2) Secondary data
Primary Data
• Are those data, which are collected by the
investigator himself for the purpose of a specific
inquiry or study.
•Such data are original in character and are mostly
generated by surveys conducted by individuals or
research institutions.
•High response rates might be obtained since the
answers to various questions are obtained on the
spot.
• It permits explanation of questions concerning
difficult subject matter.
Secondary Data: When an investigator
uses data, which have already been
collected by others, such data are called
"Secondary Data".
Such data are primary data for the agency
that collected them, and become secondary
for someone else who uses these data for
his own purposes.
The secondary data can be obtained from
journals, reports, government publications,
publications of professionals and research
organizations.
Secondary data are less expensive to collect
both in money and time.
These data can also be better utilized and
sometimes the quality of such data may be
better because these might have been
collected by persons who were specially
trained for that purpose.
On the other hand, such data must be used
with great care, because such data may also
be full of errors due to the fact that the
purpose of the collection of the data by the
primary agency may have been different from
the purpose of the user of these secondary
data.
Secondly, there may have been bias
introduced, the size of the sample may have
been inadequate, or there may have been
arithmetic or definition errors, hence, it is
necessary to critically investigate the validity
of the secondary data.
In general, the choice of methods of data
collection is largely based on the accuracy
of the information they yield.
In this context, ‘accuracy’ refers not only to
correspondence between the information and
objective reality, although this certainly
enters into the concept, but also to the
information’s relevance.
The selection of the method of data collection
is also based on practical considerations,
such as:
• The need for personnel, skills, equipment,
etc.
• The acceptability of the procedures to the
subjects
• The probability that the method will provide
good coverage
• The investigator’s familiarity with a study
procedure, which may be a valid consideration
Types of Questions
Depending on how questions are asked and
recorded, we can distinguish two major
possibilities: open-ended questions and
closed questions.
Open-ended questions
Open-ended questions permit free responses
that should be recorded in the respondent’s own
words.
The respondent is not given any possible
answers to choose from.
Such questions are useful to obtain information
on: Facts with which the researcher is not very
familiar, Opinions, attitudes, and suggestions of
informants, or Sensitive issues.
For example
•“Can you describe exactly what the traditional birth
attendant did when your labor started?”
•“What do you think are the reasons for a high drop-
out rate of village health committee members?”
•“What would you do if you noticed that your daughter
(school girl) had a relationship with a teacher?”
Closed-ended Questions
Closed questions offer a list of possible
options or answers from which the
respondents must choose.
When designing closed questions one should
try to:
• Offer a list of options that are exhaustive
and mutually exclusive
• Keep the number of options as few as possible
Closed questions are useful if the range of
possible responses is known.
For example
1. “What is your marital status?”
a) Single
b) Married/living together
c) Separated/divorced/widowed
2. “Have you ever gone to the local village
health worker for treatment?”
a) Yes
b) No
• Closed questions may also be used if one is
only interested in certain aspects of an issue
and does not want to waste the time of the
respondent and interviewer by obtaining
more information than one needs.
Requirements of questions
• Must have face validity – that is, the question
that we design should be one that gives an
obviously valid and relevant measurement for
the variable.
• Must be clear and unambiguous – the way in
which questions are worded can ‘make or
break’ a questionnaire.
– They must be phrased in language that it is believed
the respondent will understand, and that all
respondents will understand in the same way.
– To ensure clarity, each question should contain
only one idea; ‘double- barreled’ questions like ‘Do
you take your child to a doctor when he has a cold or
has diarrhea?’ are difficult to answer, and the
answers are difficult to interpret.
• Must not be offensive – whenever possible
it is wise to avoid questions that may offend
the respondent, for example those that deal
with intimate matters, those which may seem
to expose the respondent’s ignorance, and
those requiring him to give a socially
unacceptable answer.
• The questions should be fair - They should
not be phrased in a way that suggests a
specific answer, and should not be loaded.
– Short questions are generally regarded as
preferable to long ones.
• Sensitive questions - It may not be
possible to avoid asking ‘sensitive’
questions that may offend respondents,
e.g. those that seem to expose the
respondent’s ignorance. In such situations
the interviewer (questioner) should do it
very carefully and wisely
Methods of Data Organization and
Presentation
Generally, summarizing and organizing data
can be achieved through:
1. Frequency Distributions
2. Graphical Representations
3. Measures of Central Tendency
4. Measures of variability
Frequency Distributions
o For data to be more easily appreciated and to draw
quick comparisons, it is often useful to arrange the
data in the form of a table, or in one of a number of
different graphical forms.
o When analyzing voluminous data collected from
say, a health center's records, it is quite useful to
put them into compact tables.
o Quite often, the presentation of data in a
meaningful way is done by preparing a frequency
distribution.
o If this is not done the raw data will not present any
meaning and any pattern in them (if any) may not
be detected.
Array
Array (ordered array) is a serial arrangement
of numerical data in an ascending or
descending order.
This enables us to know the range over
which the items are spread and also gives
an idea of their general distribution.
An array is very difficult to prepare and read
with a large sample size. Hence it is an
appropriate way of presentation only when the
data are small in size (usually less than 20).
Ordered Array
12 19 27 36 42 59
15 22 31 39 43 61
17 23 31 41 44 65
18 26 34 41 54 67
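As a minimal sketch, an ordered array and the range it reveals can be produced as follows. The values are those of the ordered array shown here; the shuffled order of `raw` is an assumption made only to mimic unsorted raw data.

```python
# Values from the ordered array, deliberately shuffled to mimic raw data.
raw = [41, 12, 59, 27, 19, 36, 42, 15, 61, 22, 31, 39,
       43, 17, 65, 23, 31, 41, 44, 18, 67, 26, 34, 54]

ordered = sorted(raw)                    # ascending ordered array
data_range = ordered[-1] - ordered[0]    # largest value minus smallest value

print(ordered)
print("smallest:", ordered[0], "largest:", ordered[-1], "range:", data_range)
```

Passing `reverse=True` to `sorted` would give the descending arrangement instead.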
• The actual summarization and organization
of data starts from frequency distribution.
• For nominal and ordinal data, frequency
distributions are often used as a summary.
• Example:
a) Qualitative variable: Count the number of
cases in each category.
ICU Type    Frequency      Relative Frequency
            (How often)    (Proportionately often)
Medical     12             0.48
Surgical    6              0.24
Cardiac     5              0.20
Other       2              0.08
Total       25             1.00
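The ICU table can be reproduced with a short sketch. The list `icu_types` is rebuilt from the counts in the table (one entry per patient); it stands in for the raw records, which are not given in the source.

```python
from collections import Counter

# One ICU type per patient, rebuilt from the counts in the table.
icu_types = ["Medical"] * 12 + ["Surgical"] * 6 + ["Cardiac"] * 5 + ["Other"] * 2

freq = Counter(icu_types)                        # frequency: how often each category occurs
n = sum(freq.values())                           # total number of patients
rel_freq = {k: v / n for k, v in freq.items()}   # relative frequency = frequency / n

for category, count in freq.items():
    print(f"{category:<10}{count:>4}{rel_freq[category]:>8.2f}")
print(f"{'Total':<10}{n:>4}{sum(rel_freq.values()):>8.2f}")
```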
Example 2:
A study was conducted to assess the
characteristics of a group of 234 smokers by
collecting data on gender and other variables.
Gender, 1 = male, 2 = female
b) Quantitative variable:
- Select a set of continuous, non-overlapping
intervals such that each value can be placed
in one, and only one, of the intervals.
- The first consideration is how many intervals
to include
For a continuous variable
(e.g. – age), the frequency
distribution of the individual
ages is not so interesting.
• We “see more” in frequencies of age values in
“groupings”. Here, 10-year groupings make sense.
• Grouped data frequency distribution
To determine the number of class intervals and the
corresponding width, we may use:
Sturges’ rule:
$$K = 1 + 3.322 \log_{10} n$$
$$W = \frac{L - S}{K}$$
where
K = number of class intervals
n = number of observations
W = width of the class interval
L = the largest value
S = the smallest value
Example:
– Leisure time (hours) per week for 40 college
students:
23 24 18 14 20 36 24 26 23 21 16 15 19 20
22 14 13 10 19 27 29 22 38 28 34 32 23 19
21 31 16 28 19 18 12 27 15 21 25 16
K = 1 + 3.322 (log 40) = 6.32 ≈ 6
Maximum value = 38, Minimum value = 10
Width = (38 - 10)/6 = 4.67 ≈ 5
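The same calculation can be sketched in Python. Rounding K to the nearest whole number and rounding the width up are choices made here so that the classes cover the whole range; they match the example but are assumptions, not a fixed rule.

```python
import math

n = 40                        # number of observations
largest, smallest = 38, 10    # L and S from the example

k_exact = 1 + 3.322 * math.log10(n)    # Sturges' rule
k = round(k_exact)                      # number of class intervals
w_exact = (largest - smallest) / k      # W = (L - S) / K
w = math.ceil(w_exact)                  # rounded up so that K classes cover the range

print("K =", round(k_exact, 2), "->", k)
print("W =", round(w_exact, 2), "->", w)
```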
Time       Frequency   Relative    Cumulative Relative
(Hours)                Frequency   Frequency
10-14      5           0.125       0.125
15-19      11          0.275       0.400
20-24      12          0.300       0.700
25-29      7           0.175       0.875
30-34      3           0.075       0.950
35-39      2           0.050       1.000
Total      40          1.000
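As a minimal sketch, the grouped frequency table for the leisure-time data can be rebuilt by counting the observations in each class. The class width of 5 and the starting limit of 10 are taken from the example; the variable names are illustrative only.

```python
hours = [23, 24, 18, 14, 20, 36, 24, 26, 23, 21, 16, 15, 19, 20,
         22, 14, 13, 10, 19, 27, 29, 22, 38, 28, 34, 32, 23, 19,
         21, 31, 16, 28, 19, 18, 12, 27, 15, 21, 25, 16]

width = 5
lower_limits = range(10, 40, width)    # 10, 15, ..., 35

n = len(hours)
rows = []
cum = 0
for lo in lower_limits:
    hi = lo + width - 1                        # upper class limit, e.g. 14 for 10-14
    f = sum(lo <= h <= hi for h in hours)      # class frequency
    cum += f                                   # running (cumulative) frequency
    rows.append((f"{lo}-{hi}", f, f / n, cum / n))

for label, f, rel, cum_rel in rows:
    print(f"{label:<8}{f:>4}{rel:>8.3f}{cum_rel:>8.3f}")
```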
• Cumulative frequency: the running total obtained by adding the
frequency of a class to the frequencies of all the classes before it.
• True limits: Are those limits that make an
interval of a continuous variable continuous
in both directions
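A short sketch of how true limits and mid-points might be computed for the leisure-time classes. It assumes the common convention that each stated limit is extended by half the gap between adjacent classes (0.5 hour here, since the data are recorded in whole hours).

```python
# Stated class limits from the leisure-time example (whole hours).
classes = [(10, 14), (15, 19), (20, 24), (25, 29), (30, 34), (35, 39)]

gap = classes[1][0] - classes[0][1]    # gap between adjacent classes (1 hour here)

table = []
for lo, hi in classes:
    true_lo = lo - gap / 2             # lower true limit
    true_hi = hi + gap / 2             # upper true limit
    midpoint = (lo + hi) / 2           # class mid-point
    table.append((true_lo, true_hi, midpoint))

for (lo, hi), (tl, th, mid) in zip(classes, table):
    print(f"{lo}-{hi}: true limits {tl}-{th}, mid-point {mid}")
```

With true limits, the upper limit of one class coincides with the lower limit of the next, so the intervals leave no gaps.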
Time (Hours)   True limit   Mid-point   Frequency
Total                                   40
Simple Frequency Distribution
• Primary and secondary cases of syphilis
morbidity by age, 1989
Age group Cases
(years) Number Percent
Importance of diagrammatic representation:
data.
3. They have greater memorizing value than
mere figures.
4. They facilitate comparison.
5. They are used to understand patterns and trends.
• Well designed graphs can be powerful
means of communicating a great deal of
information
Limitations of Diagrammatic Representation
1. The technique of diagrammatic representation
is used only for purposes of comparison.
It is not to be used when comparison is either
not possible or not necessary.
2. Diagrammatic representation is not an
alternative to tabulation. It only strengthens the
textual exposition of a subject, and cannot
serve as a complete substitute for statistical
data.
3. It can give only an approximate idea and as
such where greater accuracy is needed
diagrams will not be suitable.
4. They fail to bring to light small differences
Construction of graphs
The choice of the particular form among
the different possibilities will depend on
personal choices and/or the type of the
data.
Bar charts and pie charts are commonly
used for qualitative or quantitative discrete
data.
Histograms and frequency polygons are used
for quantitative continuous data.
There are, however, general rules that are
commonly accepted about construction of graphs:
1. Every graph should be self-explanatory and as simple as
possible.
2. Titles are usually placed below the graph and should
answer the questions: What? Where? When? How classified?
3. Legends or keys should be used to differentiate variables
if more than one is shown.
4. The axis labels should be placed to read from the left side
and from the bottom.
5. The units into which the scale is divided should be
clearly indicated.
6. The numerical scale representing frequency must start at
zero, or a break in the line should be shown.
Method of constructing bar chart
• All the bars must have equal width
• The bars are not joined together (leave
space between bars)
• The different bars should be separated by
equal distances
• All the bars should rest on the same line
called the base
• Label both axes clearly
Specific types of graphs include:
• Bar graph and pie chart: nominal, ordinal data
• Histogram, stem-and-leaf plot, box plot, scatter plot,
line graph: quantitative data
• Others
1. Bar Chart
Bar diagrams are used to represent and compare
the frequency distribution of discrete variables
and attributes or categorical series
A. Simple bar chart:
• It is a one-dimensional diagram in which the
bar represents the whole of the magnitude.
[Figure: simple bar chart. Vertical axis: number of children (0 to 90). Horizontal axis: immunization status (not immunized, partially immunized, fully immunized).]
B. Multiple bar chart
In this type of chart the component figures
are shown as separate bars adjoining each
other.
The height of each bar represents the
actual value of the component figure.
It depicts distributional pattern of more
than one variable
– Example of multiple bar diagrams: consider
that data on immunization status of women by
marital status.
Fig. 2 TT Immunization status by marital status of women
15-49 years, Asendabo town, 1996
There’s no reason why the bar chart can’t be
plotted horizontally instead of vertically.
[Figure: horizontal bar chart. Vertical axis: type of source (CHA, HC, Reading, Training, Campaign, Anti FGMC, CAT), with separate bars for female and male. Horizontal axis: percent (0 to 50).]
Distribution of patients in hospital X by source of referral, 1999
[Figure: bar chart. Vertical axis: number of patients (0 to 800). Horizontal axis: source of referral (other hospital, GP, OPD, casualty, other). Bar labels shown: 769, 623, 256, 161, 97.]
C. Component (sub-divided) Bar Diagram
Bars are sub-divided into component parts of the
figure.
These sorts of diagrams are constructed when each
total is built up from two or more component figures.
They can be of two kinds:
I) Actual Component Bar Diagrams: When the
overall height of the bars and the individual
component lengths represent actual figures.
Example of actual component bar diagram: The
above data can also be presented as below.
II) Percentage Component Bar Diagrams: Where the
individual component lengths represent the
percentage each component forms of the overall total.
Note that a series of such bars will all be
the same total height, i.e., 100 percent.
o Example of percentage component bar
diagram
2. Pie chart
• Shows the relative frequency for each category by
dividing a circle into sectors, the angles of which
are proportional to the relative frequency.
• Used for a single categorical variable
• Use percentage distributions
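Because the sector angles are proportional to the relative frequencies, each angle is the relative frequency multiplied by 360 degrees. The sketch below illustrates this with the ICU type counts from the frequency distribution example; the dictionary layout is an illustrative choice.

```python
# ICU type counts from the frequency distribution example.
freq = {"Medical": 12, "Surgical": 6, "Cardiac": 5, "Other": 2}
total = sum(freq.values())

angles = {}
for category, count in freq.items():
    rel = count / total               # relative frequency
    angles[category] = rel * 360      # sector angle in degrees
    print(f"{category:<10}{rel:>7.0%}{angles[category]:>8.1f} degrees")

print("sum of angles:", round(sum(angles.values()), 1))
```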
Steps to construct a pie-chart
• Construct a frequency table
Distribution of cause of death for females in England and Wales, 1989
[Figure: pie chart. Sectors: circulatory system 42%, neoplasms 30%, respiratory system 13%, others 8%, digestive system 4%, injury and poisoning 3%.]
3. Histogram
• Histograms are frequency distributions with
continuous class intervals that have been turned
into graphs.
• Bars are drawn over the intervals in such a
way that the areas of the bars are all
proportional in the same way to their
interval frequencies.
Example: Distribution of the age of women at the time of marriage
Age group   15-19   20-24   25-29   30-34   35-39   40-44   45-49
Number      11      36      28      13      7       3       2
[Figure: histogram, "Age of women at the time of marriage". Vertical axis: number of women (0 to 40). Horizontal axis: age group in true class limits (14.5-19.5, 19.5-24.5, 24.5-29.5, 29.5-34.5, 34.5-39.5, 39.5-44.5, 44.5-49.5).]
Histogram for the ages of 2087 mothers with <5
children, Adami Tulu, 2003
[Figure: histogram. Vertical axis: frequency (up to 700). Horizontal axis: age of mother (variable N1AGEMOTH).]
Two problems with histograms
1. They are somewhat difficult to construct
2. The actual values within the respective
groups are lost and difficult to reconstruct
4. Stem-and-Leaf Plot
• A quick way to organize data to give visual
impression similar to a histogram while retaining
much more detail on the data.
• Similar to histogram and serves the same purpose
and reveals the presence or absence of symmetry
• Are most effective with relatively small data sets
• Are not suitable for reports and other
communications, but
• Help researchers to understand the nature of their
data
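A stem-and-leaf plot can be sketched in a few lines. The example below uses the leisure-time data from the frequency distribution example, with the tens digit as the stem and the units digit as the leaf; this split is an assumption that suits two-digit values.

```python
from collections import defaultdict

hours = [23, 24, 18, 14, 20, 36, 24, 26, 23, 21, 16, 15, 19, 20,
         22, 14, 13, 10, 19, 27, 29, 22, 38, 28, 34, 32, 23, 19,
         21, 31, 16, 28, 19, 18, 12, 27, 15, 21, 25, 16]

stems = defaultdict(list)
for value in sorted(hours):
    stem, leaf = divmod(value, 10)    # tens digit is the stem, units digit the leaf
    stems[stem].append(leaf)

lines = [f"{stem} | {' '.join(str(leaf) for leaf in leaves)}"
         for stem, leaves in sorted(stems.items())]
print("\n".join(lines))
```

Each row reads like a bar of a histogram laid on its side, but the individual values can still be recovered from the leaves.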
5. Frequency polygon
• A frequency distribution can be portrayed
graphically in yet another way by means of a
frequency polygon.
• To draw a frequency polygon we connect the mid-points
of the tops of the cells of the histogram by
straight lines.
• The total area under the frequency polygon is
equal to the area under the histogram
• Useful when comparing two or more frequency
distributions by drawing them on the same
diagram
Frequency polygon for the ages of 2087 mothers with <5
children, Adami Tulu, 2003
[Figure: frequency polygon. Vertical axis: frequency (up to 700). Horizontal axis: age of mother (variable N1AGEMOTH).]
It can be also drawn without erecting rectangles by joining
the top midpoints of the intervals representing the frequency
of the classes as follows:
[Figure: frequency polygon. Vertical axis: number of women (0 to 40). Horizontal axis: age, plotted at the class mid-points 12, 17, 22, 27, 32, 37, 42, 47.]
6. Ogive Curve (The Cumulative
Frequency Polygon)
• Sometimes it may be necessary to know the
number of items whose values are more or less
than a certain amount.
• We may, for example, be interested to know the
no. of patients whose weight is <50 kg or >60 kg.
• To get this information it is necessary to change
the form of the frequency distribution from a
‘simple’ to a ‘cumulative’ distribution.
• An ogive curve turns a cumulative frequency
distribution into a graph.
• Ogives are much more common than frequency polygons
Cumulative Frequency and Cum. Rel. Freq. of Age
of 25 ICU Patients
Example: Heart rate of patients admitted to hospital Y, 1998
[Figure: ogive. Vertical axis: cumulative frequency (0 to 60). Horizontal axis: heart rate, plotted at the true class limits 54.5, 59.5, 64.5, 69.5, 74.5, 79.5, 84.5, 89.5, 94.5, 99.5, 104.5.]
Percentiles (Quartiles)
• Suppose that 50% of a cohort survived at least 4
years.
• This also means that 50% survived at most 4
years.
• We say 4 years is the median.
• The median is also called the 50th percentile
• We write: P50 = 4 years.
• Similarly we could speak of other percentiles:
– P0: The minimum
– P25: 25% of the sample values are less than or
equal to this value. P25 means the 25th percentile,
also called the 1st quartile.
It is possible to estimate the values of percentiles from
a cumulative frequency polygon.
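Reading a percentile off an ogive amounts to linear interpolation between the plotted points. The sketch below is one way to do this, using the true class limits and cumulative frequencies of the leisure-time example; the function name `percentile_from_ogive` is hypothetical, and the code assumes no class has zero frequency.

```python
# True class limits and cumulative frequencies from the leisure-time example.
limits = [9.5, 14.5, 19.5, 24.5, 29.5, 34.5, 39.5]
cum_freq = [0, 5, 16, 28, 35, 38, 40]

def percentile_from_ogive(p, limits, cum_freq):
    """Estimate the p-th percentile by linear interpolation along the ogive."""
    target = p / 100 * cum_freq[-1]        # cumulative count to be reached
    for i in range(1, len(limits)):
        if cum_freq[i] >= target:          # first class that reaches the target
            below, above = cum_freq[i - 1], cum_freq[i]
            fraction = (target - below) / (above - below)
            return limits[i - 1] + fraction * (limits[i] - limits[i - 1])
    return limits[-1]

p25 = percentile_from_ogive(25, limits, cum_freq)
p50 = percentile_from_ogive(50, limits, cum_freq)
p75 = percentile_from_ogive(75, limits, cum_freq)
print(round(p25, 2), round(p50, 2), round(p75, 2))
```

These are estimates from grouped data, so they can differ slightly from percentiles computed on the individual values.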
7. Scatter plot
• Most studies in medicine involve measuring
more than one characteristic, and graphs
displaying the relationship between two
characteristics are common in literature.
• When both the variables are qualitative then
we can use a multiple bar graph.
• When one of the characteristics is qualitative
and the other is quantitative, the data can be
displayed in box and whisker plots.
• For two quantitative variables we use
bivariate plots (also called scatter plots
or scatter diagrams).
• A scatter diagram is constructed by drawing X- and Y-axes.
• Each point or dot represents a pair of
values measured for a single study subject.
[Figure: scatter plot. Vertical axis: saturation of bile (0 to 140). Horizontal axis: age (0 to 80).]
• The graph suggests the possibility of a
positive relationship between age and
percentage saturation of bile in women.
8. Line graph
• Useful for assessing the trend of a particular situation
over time.
• Helps in monitoring the trend of epidemics.
• The time, in weeks, months or years, is marked along the
horizontal axis, and
• Values of the quantity being studied are marked on the
vertical axis.
• Values for each category are connected by a continuous
line.
• Sometimes two or more lines are drawn on the same
graph using the same scale so that the plotted lines
are comparable.
No. of microscopically confirmed malaria cases by species
and month at Zeway malaria control unit, 2003
[Figure: line graph. Vertical axis: number of confirmed malaria cases (0 to 2100). Horizontal axis: months (Jan to Dec). Lines: Positive, P. falciparum, P. vivax.]