Chapter One
2/20/2025 1
Course Content
• Introduction
• Elementary probability and probability
distribution
• Statistical Inference
• Sampling Distribution
• Introduction to STATA
• Categorical data analysis
• Continuous data analysis
• Longitudinal data analysis
• Nonparametric tests
INTRODUCTION
Learning objectives
At the end of this chapter the student will be able
to:
• Define what is meant by statistics.
• Explain what is meant by descriptive statistics and inferential statistics.
• Distinguish between a qualitative variable and a quantitative variable.
• Distinguish among the nominal, ordinal, interval, and ratio levels
of measurement.
• Identify sources of data.
• Describe methods of data collection.
• Describe methods of data presentation.
• Compute numerical summary measures.
Statistical data: refers to numerical
descriptions of things. These descriptions may
take the form of counts or measurements.
E.g. statistics of malaria cases include fever
cases, number of positives obtained, sex and
age distribution of positive cases, etc.
Statistical methods: refers to the methods used for
collecting, organizing, analyzing and interpreting
numerical data in order to understand a phenomenon or
make wise decisions. In this sense it is a branch of
scientific method and helps us to know the object
under study in a better way.
Introduction…
• What is statistics?
- we use statistics every day, often without
realising it.
• Statistics: A field of study concerned with:
– collection, organization, analysis,
summarization and interpretation of
numerical data, &
– the drawing of inferences about a body of data
when only a small part of the data is
observed.
Biostatistics
- The application of statistical methods to the
biological and medical sciences.
- Makes it possible to distinguish methodically between
true differences among observations and random
variation caused by chance alone.
- Concerned with the interpretation of biological data and the
communication of information derived from these data.
- Has a central role in medical investigations.
- Because almost every experiment involves uncertainty,
statistics is the scientific method for quantitative data
analysis.
Uses of biostatistics
• Provide methods of organizing information
• Assessment of health status: vital statistics
• Health program evaluation
• Resource allocation: census
• Knowledge of biostatistics permits one to draw valid
conclusions from data sets.
• Magnitude of association
– Strong vs weak association between exposure and
outcome
Uses of biostatistics
• Assessing risk factors
– Cause & effect relationship
• Evaluation of a new vaccine or drug
– What can be concluded if the proportion of people
free from the disease is greater among the
vaccinated than the unvaccinated?
– How effective is the vaccine (drug)?
– Is the effect due to chance or some bias?
• Drawing of inferences
– Information from sample to population
Limitation of statistics
It deals mainly with quantitative data rather than
qualitative data.
Famous quote in statistics
Types of biostatistics
Types of Statistics
1. Descriptive statistics:
• Descriptive statistics are methods for organizing and
summarizing data.
• They help to identify the general features and trends in a
set of data and to extract useful information.
• For example, tables or graphs are used to organize
data, and descriptive values such as the average score
are used to summarize data.
Descriptive biostatistics
• Some statistical summaries which are especially
common in descriptive analyses are:
• Measures of central tendency
• Measures of dispersion
• Cross-tabulation /contingency table
• Histogram
• Quantile, Q-Q plot
• Scatter plot
• Box plot
Types of Statistics
2. Inferential statistics:
• Inferential statistics are methods for using sample data to
make general conclusions (inferences) about populations.
Inferential statistics:
Types of variable
Depending on the characteristic being measured, a
variable can be:
• Qualitative (categorical) variable: a variable or
characteristic that cannot be measured in quantitative
form but can only be identified by name or category;
that is, a variable whose values can be placed into
distinct categories according to some characteristic or
attribute.
For instance: place of birth, ethnic group, type of
drug, stage of breast cancer (I, II, III, or IV),
degree of pain (low, moderate, severe).
• Examples: Patient status, cancer stages, social
class, Likert scales etc.
Example of ordinal scale:
3. Interval scale:
- Measured on a continuum and differences between any
two numbers on a scale are of known size.
Example: Temperature in °F on 4 consecutive days
Day: A B C D
Temp. (°F): 50 55 60 65
For these data, day A (50 °F) is not only cooler than
day D (65 °F); it is exactly 15 °F cooler.
- It has no true zero point. “0” is arbitrarily chosen and
does not reflect the absence of temperature.
4. Ratio scale:
- Measurement begins at a true zero point and the
scale has equal intervals.
- Examples: Height, age, weight, BP, etc.
• Note on the meaningfulness of “ratio”:
– Someone who weighs 80 kg is twice as heavy as
someone who weighs 40 kg. This remains true even if
weight is measured in other units.
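The difference between the interval and ratio scales can be checked numerically. The sketch below is only an illustration: the helper `fahrenheit_to_celsius`, the constant `KG_TO_LB`, and the 100 °F reading are assumptions introduced here, not part of the slides.

```python
# Hypothetical helper and constant, introduced only for this illustration.
def fahrenheit_to_celsius(f):
    return (f - 32) * 5 / 9

KG_TO_LB = 2.20462  # approximate conversion factor

# Ratio scale: the ratio of two weights is the same in any unit.
ratio_kg = 80 / 40
ratio_lb = (80 * KG_TO_LB) / (40 * KG_TO_LB)

# Interval scale: the ratio of two temperatures changes with the unit,
# because the zero point is arbitrary.
ratio_f = 100 / 50
ratio_c = fahrenheit_to_celsius(100) / fahrenheit_to_celsius(50)

print("weight ratio (kg, lb):", ratio_kg, ratio_lb)
print("temperature ratio (F, C):", ratio_f, round(ratio_c, 2))
```

The weight ratio stays at 2 in both units, while the temperature ratio differs between °F and °C, which is why "twice as hot" is not a meaningful statement on an interval scale.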
Scales of Measurement
• Nominal = Naming
• Ordinal = Naming + Order
• Interval = Naming + Order + Equal Intervals
• Ratio = Naming + Order + Equal Intervals +
True Zero
[Figure: degree of precision in measuring, increasing from nominal to ordinal to interval to ratio scale]
Basic Terms in statistics
Why Sample?
Census of a population may be:
Impossible
Impractical
Too costly
Basic terms cont . . .
o Census
A census is the collection of data from every member
of the population.
o Parameter
A parameter is a numerical measurement describing
some characteristic of a population.
o Statistic
A statistic is a numerical measurement describing
some characteristic of a sample.
Basic terms cont . . .
• Data are observations (such as measurements,
genders, survey responses) that have been
collected.
• It is the raw material for statistics.
• Can be obtained from:
– Routinely kept records, literature
– Surveys
– Counting
– Experiments
– Reports
– Observation
– Etc
Population and Sample
• Population:
– Refers to any collection of objects
• Target population:
– A collection of items that have something in
common for which we wish to draw conclusions at
a particular time.
• E.g., All hospitals in Ethiopia
– The whole group of interest
Population and Sample cont…
Study (Sampled) Population:
• The subset of the target population that has at
least some chance of being sampled
• The specific population group from which samples
are drawn and data are collected
Sample:
• A subset of a study population, about which
information is actually obtained.
• The individuals who are actually measured and
comprise the actual data.
Population & Sample
Population:
• Includes all possible observations
• Described with Greek letters
Sample:
• A set of observations from a population
• Described with Roman letters
• Role of statistics: using information from a sample
to make inferences about the population
[Figure: diagram of Population, Information, and Sample]
E.g.: In a study of the prevalence
of HIV among adolescents in
Ethiopia, a random sample of
adolescents in Lideta Kifle
Ketema of Addis Ababa (AA) was included.
Parameter and Statistic
• Parameter: A descriptive measure computed
from the data of a population.
Exercise:- Consider the following Scales of
measurement (types of data) and answer questions
1. Blood group
2. Temperature (Celsius)
3. Sex
4. Job satisfaction index (1-5)
5. Number of heart attacks
6. Calendar year
7. Serum uric acid (mg/100ml)
8. Number of cases of each reportable disease reported by a
health worker
9. The average weight gain of six 1-year-old dogs on a special
diet supplement was 950 grams last month.
10. Injury severity (a score between 1 and 3 is allocated
depending on the severity); scores 1 and 3 indicate mild and
very severe injury respectively.
Source of data and methods of data collection
Discuss for 5 minutes
Source of Data
• Internal source
• External source
– Primary source
– Secondary source
Internal and External Source of Data
Internal Sources of Data
o Many institutions and departments have information
about their regular functions, kept for their own internal
purposes.
o When that information is used in a survey, it is called
an internal source of data.
o E.g., public health institutes and nursing association
members.
External Sources of Data
o When information is collected from outside agencies,
it is called an external source of data.
o Such data are either primary or secondary.
o This type of information can be collected by census
or sampling methods, by conducting surveys.
Primary Data
• Primary data are those which are collected for
the first time.
• They are real-time data collected by the
researcher directly.
• This involves both collecting and making
use of the data.
• These data are originated by the researcher
specifically to address the research problem.
Method of Collecting Primary Data
1. Direct personal investigation (i.e. interview
method)
2. Indirect oral investigation (i.e. through
enumerators)
3. Investigation through Local reporters
Questionnaire
4. Investigation through mailed Questionnaire
5. Investigation through Observation
Secondary Data
• Secondary data are those that have already been
collected by others.
• These are usually in journals, periodicals, research
publications, official records etc.
• Secondary data may be available in the published
or unpublished form.
• When it is not possible to collect the data by a
primary method, the investigator uses a
secondary method.
• These data were collected for some purpose other than
the problem at hand.
Method of Collecting Secondary Data
1. Published Sources
a) International Publication
b) Government Publications
Difference between Primary and Secondary Data
Primary Data
• Real-time data.
• Sure about sources of data.
• Helps to give results/findings.
• Costly and time-consuming process.
• Avoids bias in response data.
• More flexible.
Secondary Data
• Past data.
• Not sure about sources of data.
• Helps in refining the problem.
• Cheap and not time-consuming.
• Cannot know whether the data are biased or not.
• Less flexible.
Data collection techniques VS Data collection tools
• Data collection methods?
Types of Questions
Depending on how questions are asked and recorded, we
can distinguish two major possibilities: open-ended
questions and closed questions.
Open-ended questions
Open-ended questions permit free responses that should
be recorded in the respondent’s own words. The
respondent is not given any possible answers to choose
from.
Such questions are useful to obtain information on:
• Facts with which the researcher is not very familiar,
• Opinions, attitudes, and suggestions of informants, or
• Sensitive issues.
Open-ended questions…
For example
Can you describe exactly what the traditional birth
attendant did when your labor started?
What do you think are the reasons for a high drop-out
rate of village health committee members?
What would you do if you noticed that your daughter
(school girl) had a relationship with a teacher?
Closed Questions
Closed questions offer a list of possible options or
answers from which the respondents must choose.
When designing closed questions one should try to:
Offer a list of options that are exhaustive and
mutually exclusive
Closed Questions…
For example
What is your marital status?
1. Single
2. Married/living together
3. Separated
4. Divorced
5. Widowed
Have you ever gone to the local village health worker for
treatment?
1. Yes
2. No
Closed questions may also be used if one does not want
to waste the time of the respondent and interviewer by
obtaining more information than one needs.
Methods of data collection summary
[Table: types of data, data type by source, and methods of data collection; the only surviving row pairs primary data with observation]
Methods of data presentation
Definitions
• Frequency distribution: the organization of
raw data in table form, using classes and
frequencies.
• Frequency: for a particular class, the number
of original values that fall into that class (the
number of values in a specific class of the
distribution).
• Raw data: recorded information in its original
collected form, whether counts or
measurements.
Cont…
• Tabulation: the process of summarizing and
classifying data in the form of a table.
• There are three basic types of frequency
distributions.
– Categorical frequency distribution.
– Ungrouped frequency distribution.
– Grouped frequency distribution.
Categorical frequency distribution
• Used to present data that can be placed in specific
categories, such as nominal or ordinal scale data.
• Used for qualitative data, such as marital status,
religion, blood type, etc.
• Example: A social worker collected the following data
on marital status for 25 people. (M=married, S=single,
W=widowed, D=divorced)
• M W D S S M M M W D S M D
S W D D S S S W W D D M
• The following demonstrates how to construct the
frequency distribution for these data.
2. Ungrouped frequency distribution
Grouped frequency distribution
• When the range of the data is large, the data must be grouped
into classes that are more than one unit in width.
Definitions:
• Grouped frequency distribution: a frequency distribution
in which several values are grouped into one class.
• Class limits (CL): separate one class in a grouped frequency
distribution from another.
• The limits could actually appear in the data, and there are gaps
between the upper limit of one class and the lower limit of the
next.
• Unit of measurement (U): the distance between two possible
consecutive measures.
• It is usually taken as 1, 0.1, 0.01, 0.001, etc.
• Data: 1, 5, 8, 10; U = 1
• Data: 1, 5.2, 7, 8; U = 0.1
• Data: 2, 9.12, 4, 8, 3; U = 0.01
• Data: 2, 3.33, 1.2, 8, 7; U = 0.01
• Class boundaries (CB) are the numbers used to
separate classes, but without the gaps created by
class limits.
• They are obtained as follows:
– Find the size of the gap between the upper class
limit of one class and the lower class limit of the
next class.
– Add half of that amount to each upper class limit
to find the upper class boundaries, and subtract
half of that amount from each lower class limit to
find the lower class boundaries.
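The two steps above can be sketched in a few lines. The function name `class_boundaries` and the three example classes are illustrative assumptions; the sketch also assumes every pair of adjacent classes has the same gap.

```python
def class_boundaries(limits):
    """Convert (lower, upper) class limits to class boundaries."""
    # Gap between the upper limit of one class and the lower limit of the next.
    gap = limits[1][0] - limits[0][1]
    half = gap / 2
    # Subtract half the gap from each lower limit, add it to each upper limit.
    return [(lower - half, upper + half) for lower, upper in limits]

limits = [(10, 14), (15, 19), (20, 24)]
boundaries = class_boundaries(limits)
print(boundaries)
```

With a gap of 1 between classes, each limit moves by 0.5, so the class 10-14 has boundaries 9.5 and 14.5 and adjacent classes now share a boundary.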
Guidelines for classes
k = 1 + 3.32 log10(n), where k is the number of classes and n is the number of observations
Example:
– Leisure time (hours) per week for 40 college
students:
23 24 18 14 20 36 24 26 23 21 16 15 19 20 22
14 13 10 19 27 29 22 38 28 34 32 23 19 21 31
16 28 19 18 12 27 15 21 25 16
Range = 38 - 10 = 28
k = 1 + 3.32 log10(40) = 6.32 ≈ 6
Maximum value = 38, Minimum value = 10
Width = (38 - 10)/6 = 4.67 ≈ 5
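The arithmetic for the number of classes and the class width can be reproduced as follows. This is a minimal sketch: rounding k to the nearest integer and rounding the width up are assumptions that match the values used in this example.

```python
import math

n = 40                      # number of observations
minimum, maximum = 10, 38   # smallest and largest leisure times

k_exact = 1 + 3.32 * math.log10(n)        # number of classes before rounding
k = round(k_exact)                        # rounded number of classes
width = math.ceil((maximum - minimum) / k)  # class width, rounded up

print(round(k_exact, 2), k, width)
```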
Time (hours)   Frequency   Relative frequency   Cumulative relative frequency
10-14           5          0.125                0.125
15-19          11          0.275                0.400
20-24          12          0.300                0.700
25-29           7          0.175                0.875
30-34           3          0.075                0.950
35-39           2          0.050                1.000
Total          40          1.000
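The table can be rebuilt from the raw leisure-time data. The sketch below assumes the class layout chosen in the example (first lower limit 10, width 5, six classes); the variable names are illustrative.

```python
# Leisure time (hours) per week for the 40 students in the example.
data = [
    23, 24, 18, 14, 20, 36, 24, 26, 23, 21, 16, 15, 19, 20, 22,
    14, 13, 10, 19, 27, 29, 22, 38, 28, 34, 32, 23, 19, 21, 31,
    16, 28, 19, 18, 12, 27, 15, 21, 25, 16,
]
first_lower, width, k = 10, 5, 6  # lowest class limit, class width, number of classes

# Count how many values fall into each class (10-14, 15-19, ...).
frequencies = [0] * k
for x in data:
    frequencies[(x - first_lower) // width] += 1

n = len(data)
rows = []
running = 0  # cumulative count so far
for i, f in enumerate(frequencies):
    lower = first_lower + i * width
    upper = lower + width - 1
    running += f
    rows.append((f"{lower}-{upper}", f, f / n, running / n))

for label, f, rel, cum_rel in rows:
    print(f"{label:<8}{f:>4}{rel:>8.3f}{cum_rel:>8.3f}")
```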
Diagrammatic and Graphic presentation of
data
• After the data have been organized into a frequency
distribution, they can be presented in charts and graphs.
• These are techniques for presenting data in visual
displays using geometric shapes and pictures.
• The purposes of presenting data in graphs and charts
are:
• They have greater attraction.
• They facilitate comparison.
• They are easily understandable.
Diagrammatic presentation of data
Pie chart
[Figure: pie chart of frequency by group (Men, Women, Boys)]
1. Bar charts (or graphs)
Histogram
• Histograms are frequency distributions with continuous
class intervals that have been turned into graphs.
• Bars are drawn over the intervals in such a way that the
area of each bar is proportional to the frequency of its
interval.
5. Frequency polygon
• A frequency distribution can be portrayed
graphically in yet another way by means of a
frequency polygon.
Measure of Central Tendency(MCT)
• Is a single value that attempts to describe a set of data
by identifying the central position within that set of
data.
MCT…
Characteristics of a good MCT
MCT…
Arithmetic Mean
Median
Mode
MCT…
Notations: Σ = summation
x̄ = mean of the sample
μ = mean of the population
σ = standard deviation of the population
MCT…
Example :
Suppose n values of a variable are denoted x1, x2, x3, …, xn;
then Σxi = x1 + x2 + x3 + … + xn.
MEAN
• The mean (or average) is the most popular and
well known measure of central tendency.
Types of Means
• Arithmetic mean
• Weighted mean
• Geometric mean
• Harmonic mean
Ungrouped Data: arithmetic mean...
Mean….
Example:
=> 167, 120, 150, 125, 150, 140, 40, 136, 120, 150
Solution
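The worked solution is not preserved in the extracted text. A minimal sketch of the computation, assuming the ten values listed in the example are transcribed correctly, is:

```python
# Values from the example.
values = [167, 120, 150, 125, 150, 140, 40, 136, 120, 150]

total = sum(values)   # sum of all observations
n = len(values)       # number of observations
mean = total / n      # arithmetic mean = sum / n

print(total, n, mean)
```

The sum is 1298 over 10 observations, so the arithmetic mean is 129.8.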
Mean…
Grouped data:
• It is calculated as follows: mean = Σ(mi fi) / Σ fi, where mi is
the mid-point and fi the frequency of the i-th class interval.
Mean for grouped data…
Example. Compute the mean age of 169 subjects from the
grouped data.
Solution
Class interval   Mid-point (mi)   Frequency (fi)   mifi
10-19            14.5               4                58.0
20-29            24.5              66              1617.0
30-39            34.5              47              1621.5
40-49            44.5              36              1602.0
50-59            54.5              12               654.0
60-69            64.5               4               258.0
Total            __               169              5810.5
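The grouped mean for the 169 subjects follows from this table by dividing the sum of the mifi column by the total frequency. The sketch below recomputes both totals; the tuple layout (lower limit, upper limit, frequency) is an illustrative choice.

```python
# (lower limit, upper limit, frequency) for each class in the table.
classes = [
    (10, 19, 4), (20, 29, 66), (30, 39, 47),
    (40, 49, 36), (50, 59, 12), (60, 69, 4),
]

total_f = sum(f for _, _, f in classes)
# Mid-point of each class times its frequency, summed over classes.
total_mf = sum(((lo + hi) / 2) * f for lo, hi, f in classes)
grouped_mean = total_mf / total_f

print(total_f, total_mf, round(grouped_mean, 2))
```

The mean age comes out at about 34.38 years.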
Solution
Class interval   Mid-point (mi)   Frequency (fi)   mifi
5-14              9.5               5                47.5
15-24            19.5              10               195.0
25-34            29.5             120              3540.0
35-44            39.5              22               869.0
45-54            49.5              13               643.5
55-64            59.5               5               297.5
Total            __               175              5592.5
WM…
Grade   Score   Weight (credit hours)
A       4       3
B       3       2
C       2       3
D       1       1
F       0       2
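A weighted mean divides the sum of score times weight by the sum of the weights. The sketch below applies this to the grade table, assuming the credit hours are the weights; the variable names are illustrative.

```python
# (grade, score, weight in credit hours) from the table.
courses = [("A", 4, 3), ("B", 3, 2), ("C", 2, 3), ("D", 1, 1), ("F", 0, 2)]

weighted_sum = sum(score * weight for _, score, weight in courses)
total_weight = sum(weight for _, _, weight in courses)
weighted_mean = weighted_sum / total_weight

print(weighted_sum, total_weight, round(weighted_mean, 2))
```

Here the weighted sum is 25 over 11 credit hours, or about 2.27, whereas the unweighted mean of the five scores would be 2.0.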
Take-home reading assignment
• Geometric mean (GM)
• Harmonic Mean (HM)
2. Median
Ungrouped data:
• The median is the value which divides the data set into
two equal parts.
• If the number of values is odd, the median will be the
middle value when all values are arranged in order of
magnitude.
• When the number of observations is even, there is no
single middle value but two middle observations.
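Suppose there are n observations in a sample.
If these observations are ordered from smallest to
largest, then the median is defined as follows:
• If n is odd, the median is the [(n+1)/2]th observation.
• If n is even, the median is the average of the (n/2)th and
[(n/2)+1]th observations.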
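The odd and even cases can be sketched as a small function. The two short lists are hypothetical illustration data, not taken from the slides.

```python
def median(values):
    """Median of ungrouped data: middle value, or mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the single middle observation.
        return ordered[mid]
    # Even n: average of the two middle observations.
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_case = median([5, 3, 9])       # hypothetical data, n = 3
even_case = median([5, 3, 9, 7])   # hypothetical data, n = 4
print(odd_case, even_case)
```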
Example 1
Example 2
Median…
Grouped data:
• The first step is to locate the class interval in which
the median is located, using the following
procedure.
Example
Class interval   Mid-point (mi)   Frequency (fi)   Cum. freq
10-19            14.5               4                4
20-29            24.5              66               70
30-39            34.5              47              117
40-49            44.5              36              153
50-59            54.5              12              165
60-69            64.5               4              169
Total                             169
n/2 = 84.5, which falls in the 3rd class interval (30-39)
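Once the median class is located, a single value is usually obtained by interpolating within that class. The interpolation formula is not preserved in the extracted text, so the sketch below assumes the common form median = L + ((n/2 - F) / f) × W, where L is the lower boundary of the median class, F the cumulative frequency before it, f its frequency, and W the class width. It also assumes a unit of measurement of 1, so each boundary lies 0.5 below the lower limit.

```python
# (lower limit, upper limit, frequency) for each class in the example.
classes = [
    (10, 19, 4), (20, 29, 66), (30, 39, 47),
    (40, 49, 36), (50, 59, 12), (60, 69, 4),
]

n = sum(f for _, _, f in classes)
half = n / 2

cum_before = 0  # cumulative frequency of the classes before the current one
for lower, upper, f in classes:
    if cum_before + f >= half:
        # First class whose cumulative frequency reaches n/2 is the median class.
        median_class = (lower, upper)
        boundary = lower - 0.5        # lower class boundary (unit assumed to be 1)
        width = upper - lower + 1     # class width
        grouped_median = boundary + (half - cum_before) / f * width
        break
    cum_before += f

print(median_class, round(grouped_median, 2))
```

Under these assumptions the median class is 30-39 and the interpolated median is about 32.59.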
Properties of the median
Advantage:
3. Mode
• The mode is the most frequently occurring value among
all the observations in a set of data.
Ungrouped data:
Mode…
Grouped data:
• To find the mode of grouped data, we usually refer to the modal class,
where the modal class is the class interval with the highest frequency.
• If a single value for the mode of grouped data must be specified,
Mode = Lm + [Δ1 / (Δ1 + Δ2)] × W
where:
Lm = lower boundary of the modal class (the class with the highest frequency)
W = class width
Δ1 = difference in frequency between the modal class and the class before it
Δ2 = difference in frequency between the modal class and the class after it
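As an illustration, the formula can be applied to the age distribution of 169 subjects used in the mean and median examples. The sketch assumes a unit of measurement of 1 (so the lower boundary is the lower limit minus 0.5) and treats a missing neighbouring class as having frequency 0; both are assumptions of this example.

```python
# (lower limit, upper limit, frequency) for each class.
classes = [
    (10, 19, 4), (20, 29, 66), (30, 39, 47),
    (40, 49, 36), (50, 59, 12), (60, 69, 4),
]

freqs = [f for _, _, f in classes]
i = freqs.index(max(freqs))            # position of the modal class
lower, upper, f_modal = classes[i]

f_before = freqs[i - 1] if i > 0 else 0
f_after = freqs[i + 1] if i < len(freqs) - 1 else 0

delta1 = f_modal - f_before            # difference with the class before
delta2 = f_modal - f_after             # difference with the class after
Lm = lower - 0.5                       # lower boundary of the modal class
W = upper - lower + 1                  # class width

mode = Lm + delta1 / (delta1 + delta2) * W
print((lower, upper), round(mode, 2))
```

The modal class is 20-29, and the formula gives a mode of about 27.15.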
Mode…
Properties of mode
Advantage:
Disadvantage:
QUIZ I (5%):
Calculate the mean, median and mode for the following data:
Measures of Dispersion
2. Measures of Dispersion/Variation
Moreover, two or more data sets may have the same mean
and/or median and yet be quite different in their variability.
• Simple to compute
• Uniquely defined
Dispersion…
• The most common and well known measures of
dispersion include the following:
Range
Variance
Standard deviation
Coefficient of variation
Range
• The range is defined as the difference between the largest
and smallest observations in the data.
• It is the crudest measure of dispersion.
• E.g., Range = 42 - 5 = 37
• A data set with a larger range exhibits more variability.
Properties of range
• It is the simplest crude measure
2. Interquartile range (IQR)
• Quartiles divide the data into four equal parts.
IQR…
Q1 = [(n+1)/4]th observation
Q2 = [2(n+1)/4]th observation
Q3 = [3(n+1)/4]th observation
• The interquartile range is the difference between the third and the
first quartiles.
IQR = Q3 - Q1
IQR…
• To compute it, we first sort the data, in ascending
order.
Solution
Sorted data: 18 21 23 24 24 32 42 59 (n = 8)
1st quartile = the [(n+1)/4]th observation = the 2.25th
observation
= 21 + (23 - 21) × 0.25 = 21.5
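The same position rule can be used for the third quartile and the IQR. The function below is a sketch of the (n+1)/4 rule with linear interpolation between neighbouring observations; the name `quartile` is illustrative, and the function assumes the computed position lies within the data (true for this example).

```python
def quartile(sorted_values, q):
    """q-th quartile (q = 1, 2, 3) by the q(n+1)/4 position rule."""
    n = len(sorted_values)
    position = q * (n + 1) / 4        # 1-based position, possibly fractional
    whole = int(position)             # observation just below the position
    frac = position - whole           # fractional part used for interpolation
    lower = sorted_values[whole - 1]
    if frac == 0:
        return lower
    upper = sorted_values[whole]
    return lower + frac * (upper - lower)

data = [18, 21, 23, 24, 24, 32, 42, 59]   # already sorted
q1 = quartile(data, 1)
q3 = quartile(data, 3)
iqr = q3 - q1
print(q1, q3, iqr)
```

This reproduces Q1 = 21.5 and gives Q3 = 32 + (42 - 32) × 0.75 = 39.5, so IQR = 39.5 - 21.5 = 18.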
Variance (σ², s²)
• Variance is used to measure the dispersion of values
relative to the mean.
o Population variance = σ²
o Sample variance = s²
Variance…
Parameter:
Statistic:
• A sample variance is calculated for a sample of
individual values (X1, X2, … Xn) and uses the
sample mean rather than the population mean µ.
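The formulas themselves are not preserved in the extracted text. The sketch below assumes the usual definitions: the sample variance divides the sum of squared deviations from the sample mean by n - 1, and the population variance divides by N. The eight values are hypothetical illustration data.

```python
import statistics

sample = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical data

n = len(sample)
x_bar = sum(sample) / n
# Sum of squared deviations from the sample mean.
ss = sum((x - x_bar) ** 2 for x in sample)

s2 = ss / (n - 1)        # sample variance (divide by n - 1)
sigma2 = ss / n          # population variance (divide by n)
s = s2 ** 0.5            # sample standard deviation

# The standard library gives the same results.
print(s2, statistics.variance(sample))
print(sigma2, statistics.pvariance(sample))
```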
Characteristics of Variance
Example
• Following are the survival times of n=11 patients after
heart transplant surgery.
Solution
Exercise
Class interval   Frequency (fi)
10-19 4
20-29 66
30-39 47
40-49 36
50-59 12
60-69 4
Total 169
Properties of SD
SD Vs Standard Error (SE)
Coefficient of variation (CV)
Example
• The following data are the marks obtained by two
students taking the same course. Find out which student was
more consistent.
A: 58 59 60 65 66 52 75 31 46 48
B: 56 87 89 46 93 65 44 54 78 67
Solution
• Step one: calculate the mean for each student using the
formula for ungrouped data.
• Mean(A) = 56
• Mean(B) = 67.9 ≈ 68
Solution…
• Step two: calculate the variance and standard
deviation for ungrouped data as follows:
• SD(A) = 11.73
• SD(B) = 17.14
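The comparison can be completed with the coefficient of variation, CV = (SD / mean) × 100%. The sketch below recomputes both steps from the marks. The standard deviation of 11.73 quoted for student A corresponds to dividing by n rather than n - 1, so `statistics.pstdev` is used here as an assumption about how the slide values were obtained.

```python
import statistics

marks = {
    "A": [58, 59, 60, 65, 66, 52, 75, 31, 46, 48],
    "B": [56, 87, 89, 46, 93, 65, 44, 54, 78, 67],
}

results = {}
for student, values in marks.items():
    mean = sum(values) / len(values)
    sd = statistics.pstdev(values)      # divides by n (assumption, see lead-in)
    cv = 100 * sd / mean                # coefficient of variation in percent
    results[student] = (mean, sd, cv)
    print(student, round(mean, 1), round(sd, 2), round(cv, 1))

# The student with the smaller CV is the more consistent one.
most_consistent = min(results, key=lambda s: results[s][2])
print(most_consistent)
```

Student A has the smaller CV (about 20.9% against about 25.2% for B), so A is the more consistent student under these assumptions.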
145
Thank you very much