Biostat 1st Part
1
Contents
Definition of terms
Introduction to statistics
Scale of measurement
Methods of data collection
Presenting and summarizing data
2
Data
In general, the term data refers to factual material
used as a basis for discussion and decision making
Is a raw material for statistics (in biostatistics it
refers to the material available for analysis and
interpretation)
Numerical description of things
Results of observation, counting, or measurement
Can be primary or secondary data
3
Variable
Any aspect of an individual that is measured and can take
different values for different individuals or cases.
Any characteristic (property) of the observational unit
with outcomes (data) that vary from one observation
to the other.
Any attribute, phenomenon, or event that can have
different values.
Example: blood pressure, age, sex…
Value (in statistics): the magnitude of a
measurement, statistic, or parameter.
4
Difference between variable and data?
5
Introduction to statistics
Definition
Statistics: A field of study concerned with the
collecting, organizing or presenting, analyzing and
interpreting numerical data for understanding a
phenomenon or making wise decisions.
6
Statistics…
Statistical data: numerical descriptions of
a certain phenomenon or condition.
7
Characteristics of statistical data
8
statistical data...
9
Statistics
Statistical methods refer to a body of methods that are
used for collecting, organizing, analyzing and interpreting
numerical data for understanding a phenomenon or making
wise decisions.
Statistics can be divided into two main branches:
1. Descriptive statistics:-
- is concerned with the organization, presentation, and
summarization of data.
2. Inferential statistics:-
- is a set of procedures used to draw conclusions about a large
body of data, called a population, based on a smaller set of
data, a sample, taken from a population.
10
Difference Between Descriptive and
Inferential Statistics
11
What is biostatistics?
When the different statistical methods are applied to
biological, medical and public health data, they
constitute the discipline of biostatistics.
12
13
Scales of Measurement
14
Variable
Any aspect of an individual that is measured and can take
different values for different individuals or cases is called a
variable.
Variables are mainly divided into:
Qualitative (or categorical),
Quantitative (or numerical variables).
15
Variable …
Qualitative variable: a variable or characteristic which
cannot be measured in quantitative form but can only be
identified by name or category.
Ex. place of birth, ethnic group, type of drug, sex
17
Scale of measurement
Measurement: is the assignment of numbers to objects
or events according to a set of rules
There are four basic types of data (scales of
measurement);
1. Nominal scale
2. Ordinal scale
3. Interval scale
4. Ratio scale
18
1. Nominal data/scale
Data that represent names or categories
20
Ordinal scale …
Examples;
Patient satisfaction:-
Very satisfied
Satisfied
Unsatisfied
Very unsatisfied
Injuries (according to their level of severity): -
Fatal
Severe
Moderate
Minor
21
3. Interval scale of measurement
In interval data the intervals between values are the
same
Scales begin with an arbitrary ‘0’ point, only fixed by
convention
23
24
Summary of variable and measurement scale
[Figure: the nominal, ordinal, interval and ratio scales arranged by degree of precision in measuring]
25
Class exercise
Identify the type of data (nominal, ordinal, interval and ratio) represented by
each of the following. Confirm your answers by giving your own examples.
1. Color preference
2. Rh factor
3. pH (power of hydrogen)
4. IQ (intelligence quotient)
5. Job satisfaction index (1-5)
6. Serum glucose level (mg/dl)
7. Number of cases of each reportable disease reported by a health worker
8. The average weight gain of 4 pregnant women
9. Temperature (Celsius)
26
Methods of data collection
27
Data collection methods
Data collection methods: are techniques allowing us to
systematically collect information about our objects of
study (people, objects, phenomena) and about the settings
in which they occur.
Data collection methods can generally be classified into :
A) OBSERVATION
Is a technique that involves systematically selecting,
watching and recording behavior and characteristics of
living beings, objects or phenomena
It ranges from simple visual observation to using
machines
29
Observation
Advantage:
It gives relatively more accurate data on behaviors and
activities.
Disadvantage:
The investigator's own biases, desires, etc. can affect data
quality.
It needs more resources and skilled manpower.
31
Primary data collection
33
Interview…
B.2) FACE-TO-FACE INTERVIEW (PERSONAL
INTERVIEW)
It differs from an in-depth interview in that it
uses a structured questionnaire whose questions are asked in a fixed order.
Advantages:
Serious approach by respondent resulting in accurate
information.
Good response rate.
Completed and immediate.
Interviewer in control & can give help for a problem.
Can investigate motives and feeling
34
Face to face interview
Disadvantages:
Need to set up interviews.
Time consuming.
Geographic limitations.
Can be expensive.
Respondent bias – tendency to please or impress, create
false personal image, or end interview quickly.
35
Primary data…
B.3) FOCUS GROUP DISCUSSION (FGD):
A focus group is typically composed of eight to twelve
participants who are unfamiliar with each other and
conducted by a trained interviewer
The group should be homogeneous
36
FGD…
Advantage:
Useful when details of the thoughts/attitudes of a certain target
population are needed.
Used in applications of the social and behavioral sciences.
Details of practices, beliefs, etc. on the topic are obtained.
Disadvantage:
Requires a skilled facilitator to raise topics for ideas &
probe.
37
Primary data
B.4) TELEPHONE INTERVIEW
This is an alternative to the personal (face-to-face)
interview.
Advantages:
Quick.
Can cover reasonably large numbers of people or
organizations.
Wide geographic coverage.
High response rate – keep going till the required number.
Spontaneous response.
38
Telephone interview
Disadvantages:
Often connected with selling.
Questionnaire required.
Not everyone has a telephone.
Repeat calls are inevitable – average 2.5 calls to get someone.
Time is wasted.
Straightforward questions are required.
Respondent has little time to think.
39
C. Administering written questionnaires (self-administered
questionnaire)
41
Mail interview
Advantages:
Can cover a large number of people or
organizations.
Wide geographic coverage.
Relatively cheap
Disadvantages:
Design problems: questions have to be relatively
simple and carefully designed.
Historically low response rate
Time delay whilst waiting for responses to be returned.
42
Other types of primary data collection
A) Self reported checklist
B) Expert judgments
C) Citizen report/score/cards
D) Delphi technique
E) Mapping and scaling
F) Case-studies
G) Diaries
H) Critical incidents
I) Portfolios
J) Multi method (combination)
43
2. Secondary data collection
It is also called data mining.
44
Secondary data collection
Advantage:
Data collection is inexpensive.
Less time consuming
Disadvantages:
It is sometimes difficult to gain access to the records or reports
required, and the data may not always be complete and precise
enough, or too disorganized.
There could be differences in objectives between the primary
author of the data and the user.
45
Common problems might include
Language barriers
Lack of adequate time
Expense
Inadequately trained and experienced staff
Invasion of privacy
Suspicion
Bias (spatial, project, person, season, professional)
Cultural norms
46
Choosing a Method of Data Collection
47
Questionnaire
48
Types of Questions
Open-ended questions: are questions that permit
free responses that should be recorded in the
respondent’s own words.
Ex: “What would you do if you noticed that your
daughter (school girl) had a relationship with a
teacher?”
Closed questions: offer a list of possible options or
answers from which the respondents must choose.
Ex: “Have you ever gone to the local village HW?”
1. Yes 2. No
49
Question forms cont…
50
Question forms cont…
Semi-structured interviews include fixed questions
but with no, or few, response codes, and are used
flexibly, often in no fixed order, to enable
respondents to raise other relevant issues not covered
by the interview schedule
51
Question forms cont…
Unstructured interviews are comprised of a checklist
of topics, rather than fixed questions, and there are no
pre-codes
52
Steps in Designing a Questionnaire
Step1: CONTENT
Take your objectives and variables as your starting
point
Step 2: FORMULATING QUESTIONS
Formulate one or more questions that will provide the
information needed for each variable
Take care that questions are specific and precise
Check whether each question measures one thing at a
time
Avoid leading questions
Use simple everyday language.
53
Steps in Designing a Questionnaire…
Step 3: SEQUENCING OF QUESTIONS
Design to be “consumer friendly”
Sequence of questions must be logical
At the beginning put questions concerning
“background variables”
Start with interesting but non-controversial question
Pose more sensitive questions as late as possible
54
Steps in Designing a Questionnaire…
Step 4: FORMATTING THE QUESTIONNAIRE
When you finalize your questionnaire, be sure that:
Each questionnaire has a heading and space to insert
the number or response
Layout is such that questions belonging together
appear together visually
55
Steps in Designing a Questionnaire…
Step 5: TRANSLATION
If interview will be conducted in one or more local
languages, the questionnaire has to be translated
Then retranslated into the original language to check
for consistency
56
Questionnaire layout
The questionnaire should be visually easy to read and
comprehend
Lower case letters should be used for text
Questions should be understandable / avoid vague
questions
Begin with easy questions and put sensitive
questions at the end
Questions should be in logical order
57
The covering letter
The covering letter should:
be written on the organization’s headed paper,
include the name and address of the sample
member and the identification (serial) number, and
address the recipient by name
58
The covering letter cont…
The letter should:
explain how the person’s name was obtained/
selected
outline the study aims, benefits and risks
guarantee confidentiality
59
Piloting
60
Methods of data organization and
presentation
61
Data organizing and presenting
62
Methods of data organization and presentation
Organization:
Ordered array
Frequency distribution
- Simple frequency distribution
- Categorical frequency distribution
- Grouped frequency distribution
Presentation:
Statistical tables
- Simple/one-way tables
- Two-way tables
- Higher order tables
Graphical presentation
- Bar chart
- Pie chart
- Histogram
- Frequency polygon
- Line graph
63
Ordered array
Is a serial arrangement of numerical data in an
ascending or descending order
It is appropriate when the data set is small in size
Ex. Table: Ordered Array of Ages of Subjects
64
Frequency Distributions
Frequency: number of occurrences of statistical result
The number of times a particular result occurs in a
statistical survey (absolute frequency), or
the ratio of that number to the total results obtained in
the survey (relative frequency).
Frequency Distribution: list or table to summarize data
Consists of set of classes or categories along with their
respective numerical counts/percentage.
A table which contains the values of a variable and the
corresponding frequencies with which each value occurs
(Frequencies with which data falls within each range )
65
Frequency Distribution…
The actual summarization and organization of data starts
from frequency distribution.
The distribution condenses the raw data into a more useful
form and allows for a quick visual interpretation of the data.
Frequency distribution is used to display categorical data or
grouped quantitative data.
A relative frequency distribution: shows the proportion of
counts that fall into each class or category (the value for
any category is obtained by dividing the number of
observations in that category by the total number of
observations)
This can be reported as percentage
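The definitions above can be sketched in a few lines of Python. The responses below are hypothetical values made up for illustration (they are not data from this lecture); the sketch counts each category and divides by the total to get the relative frequency.

```python
from collections import Counter

# Hypothetical patient-satisfaction responses (illustrative only).
responses = ["Satisfied", "Very satisfied", "Satisfied", "Unsatisfied",
             "Satisfied", "Very unsatisfied", "Very satisfied", "Satisfied"]

counts = Counter(responses)            # absolute frequency of each category
total = sum(counts.values())           # total number of observations
relative = {cat: n / total for cat, n in counts.items()}  # relative frequency

for cat, n in counts.items():
    # Report the count and the relative frequency as a percentage.
    print(f"{cat:<17} {n:>3} {100 * relative[cat]:6.1f}%")
```

The relative frequencies always add up to 1 (or 100%), which is a quick check on any hand-built table.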
66
Frequency distribution for nominal data
67
Example 2 –Ordinal Data
68
Frequency Distribution for
Quantitative/numerical
A frequency distribution can also show the number
of observations at different values or within certain
ranges
For a discrete variable, the frequencies may be
tabulated either for each value of the variable or for
groups of values
With continuous variables, groups (class intervals)
have to be formed
For both discrete and continuous data, the values are
grouped into distinct non-overlapping intervals,
usually of equal width
69
Frequency Distribution for
Quantitative/numerical
Example 2: Age for 25 individuals
70
Grouped frequency distribution
Can be condensed as follows using Grouped
frequency distribution
71
Example 2- grouped quantitative data
A study was conducted to assess the prevalence of
nutritional status among women of child bearing age
in urban Ethiopia-evidence from EDHS 2011.
72
Grouped frequency distribution
Steps to construct grouped frequency distribution
1. Deciding on the number of classes
The number of classes is usually 6 – 20; 15 is suggested
A guide on the determination of the number of classes
(k) is Sturges' formula: k = 1 + 3.322 log10(n), where n is the number of observations
73
Steps to construct grouped frequency..
2. Determination of class limits
Classes should be mutually exclusive
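As a rough sketch of step 1, Sturges' formula k = 1 + 3.322 log10(n) can be evaluated as below. Rounding k up, and taking the class width as (largest value − smallest value) / k rounded up, are assumptions made here for illustration; the data are the 15-day surgery counts used in the percentile example of this lecture.

```python
import math

def sturges_classes(n):
    """Suggested number of classes, k = 1 + 3.322 * log10(n), rounded up (assumed rounding)."""
    return math.ceil(1 + 3.322 * math.log10(n))

def class_width(values, k):
    """Class width = range / k, rounded up to a whole unit (assumed convention)."""
    return math.ceil((max(values) - min(values)) / k)

# Number of patients scheduled for surgery on 15 consecutive days.
patients = [30, 26, 26, 36, 48, 50, 16, 31, 22, 27, 23, 35, 52, 28, 37]

k = sturges_classes(len(patients))
w = class_width(patients, k)
print(k, w)
```

The formula is only a guide; the final number of classes should still give intervals that are mutually exclusive and easy to read.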
76
Grouped frequency distribution…
77
Grouped frequency distribution…
Cumulative frequency: the sum of the frequencies of a class
and of all the classes below it.
Cumulative relative frequency: the proportion of the
total number of observations that have a value less than or
equal to the upper limit of the interval.
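A minimal sketch of both quantities, using the OR waiting-days table that appears in the exercises of this lecture (frequencies 27, 34, 36, 25):

```python
from itertools import accumulate

classes = ["2-6", "7-11", "12-16", "17-21"]   # waiting days
freq = [27, 34, 36, 25]                       # frequency of each class

cum_freq = list(accumulate(freq))             # running total of the frequencies
n = cum_freq[-1]                              # total number of observations
cum_rel = [c / n for c in cum_freq]           # cumulative relative frequency

for label, f, cf, crf in zip(classes, freq, cum_freq, cum_rel):
    print(f"{label:<6} {f:>3} {cf:>4} {crf:6.3f}")
```

The last cumulative frequency equals the total number of observations, and the last cumulative relative frequency equals 1.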
78
Mid point & True class limits
79
Data presentation
Tabular
Diagrammatic
80
Data presentation
1. Statistical Tables
A statistical table is an orderly and systematic
presentation of numerical data in rows and columns.
Rows (stubs) are horizontal and columns (captions)
are vertical arrangements.
The use of tables for organizing data involves
grouping the data into mutually exclusive categories
of the variables and counting the number of
occurrences (frequency) to each category.
81
Construction of tables
No hard and fast rules but general principles:
1. Tables should be as simple as possible.
2. Tables should be self-explanatory. For that purpose
Title should be clear & to the point( good title answers: what?
when? where? how classified ?) & should be placed above the
table.
Each row and column should be labelled.
Numerical entries of zero should be explicitly written rather
than indicated by a dash.
Dashes are reserved for missing or unobserved data.
Totals should be shown either in the top row and the first
column or in the last row and the last column.
3. If data are not original, source should be given in a footnote
82
A. Simple or one-way table: The simple frequency table is used when
the individual observations involve only a single variable
Table1: Overall immunization status of children in Adami Tullu
Woreda, Feb. 1995
Immunization status Number percent
Not immunized 75 35.7
Source: Fikru T et al. EPI Coverage in Adami Tulu. Eth J Health Dev
1997;11(2): 109-113
83
B. Two-way table: This table shows two characteristics & is formed
when either the row or the column is divided into two or more parts.
Table 2: TT immunization by marital status of the women of childbearing age,
Assendabo town, Jimma Zone, 1996
Characteristics
Immunization status
84
C. Higher order table: Used when it is desired to represent three or
more characteristics in a single table.
85
Data presentation ...
2. Diagrammatic presentation of Data
It is pictorial or graphical presentation of data
87
General rules for construction of graphs
Graph should be self-explanatory & as simple as
possible.
Titles are usually placed below the graph and should
again answer the questions: What? Where? When? How classified?
Legends or keys should be used to differentiate variables
if more than one is shown.
The axes label should be placed to read from the left
side and from the bottom.
The units into which the scale is divided should be
clearly indicated.
The numerical scale representing frequency must start at
zero or a break in the line should be shown.
88
Types of graphical data presentation methods
I. Bar chart
II. Pie chart
III. Histograms
IV. Frequency Polygon
V. Ogive curve or Cumulative frequency curve
VI. Line graph
89
I. Bar Charts
Used to represent & compare frequency distribution of
discrete variables & categorical series
Types of bar chart:
A. Simple bar chart,
The bars are not joined together (leave equal space b/n bars)
All bars should rest on the same line called the base
92
Simple Bar Chart…
[Figure: simple bar chart of frequency of referred cases by source of referral]
95
Example: Use the data given below to construct a sub-divided bar chart
96
Stacked bar chart
Figure x: Average amount (g) bread consumed per person per
week by year in London.
97
100% Component Bar chart
99
Multiple bar chart…
101
It is also possible to plot bars horizontally
102
II. Pie-chart
Shows the relative frequency for each category by dividing
a circle in to sectors, the angles of which are proportional to
the relative frequency
Used for single categorical variable & percentage
distribution
Steps to construct Pie chart
Construct a frequency table
Change the frequency into percentage
Change the percentage into degrees, where: degrees =
(percentage / 100) × 360°
Draw a circle and divide it into sectors accordingly.
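The steps above can be sketched as follows. The referral counts are hypothetical values chosen for illustration (they are not from this lecture); each frequency is turned into a percentage and then into a sector angle, using the fact that the whole circle is 360°.

```python
# Hypothetical frequency table (illustrative only).
referrals = {"Hospital": 30, "Health center": 45, "Private clinic": 25}

total = sum(referrals.values())

# Step 2: frequency -> percentage.
percent = {src: 100 * n / total for src, n in referrals.items()}

# Step 3: percentage -> degrees (100% corresponds to 360 degrees).
degrees = {src: p / 100 * 360 for src, p in percent.items()}

for src in referrals:
    print(f"{src:<15} {percent[src]:5.1f}% {degrees[src]:6.1f} degrees")
```

The sector angles should sum to 360°, which is a useful check before drawing.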
103
Pie chart …
104
Pie chart …
105
III. Histograms
106
Histograms…
For each class in the distribution a vertical rectangle is
drawn with
its base on the horizontal axis extending from one class
boundary of the class to the other class boundary,
there will never be any gap between the rectangles
the bases of all rectangles will be determined by the width
of the class intervals.
If the distribution has unequal class intervals, it is necessary to
make an adjustment for the varying widths of the class intervals.
The area of each bar/rectangle is proportional to the frequency of
observations in the interval
107
Histograms…
108
Histograms…
109
IV. FREQUENCY POLYGON
110
Frequency polygon …
111
Frequency polygon …
Note that it is not essential to draw histogram in order to
obtain frequency polygon.
It can be drawn without histogram rectangles as follows:
The scale should be marked in the numerical values of
the mid- points of intervals.
Erect ordinates on the midpoints of the interval - the
length or altitude of an ordinate representing the
frequency of the class on whose mid-point it is erected.
Join the tops of the ordinates and extend the connecting
lines to the scale of sizes.
112
Example: Consider the above data on time spent on leisure
activities
The histogram under the frequency polygon can be omitted.
[Figure: frequency polygon of number of students (vertical axis, 0 to 30) against mid points of class intervals (horizontal axis, 7 to 42)]
113
V. Ogive curve or Cumulative frequency curve
115
Ogive curve…
116
The less than and more than frequency method
117
118
VI. Line graph
Useful to study some variables according to passage of time.
The time, in weeks, months or years is marked along the
horizontal axis; and the value of the quantity that is being
studied is marked on the vertical axis.
The distance of each plotted point above the base-line
indicates its numerical value.
The line graph is suitable for depicting a consecutive trend of
a series over a long period.
119
Example: Malaria parasite rates as obtained from malaria
seasonal blood survey results, Ethiopia (1967-79 E.C)
[Figure: line graph of rate (%) (vertical axis, 0.0 to 5.5) against year (horizontal axis, 1967 to 1979)]
120
Summarizing Data
121
Summarizing Data
At the end of this topic, the student will be able to:
Identify the different methods of data summarization
Compute appropriate summary values for a set of
data
Appreciate the properties and limitations of summary
values
122
Data summarizing methods
1. Measure of central tendency (location):
The tendency of statistical data to get concentrated at
certain values is called the “Central Tendency”.
The various methods of determining the actual value
at which the data tend to concentrate are called
measures of central Tendency.
The mean, median, mode and weighted mean are
measures used to describe central tendency.
123
Data summarizing methods ...
2. Measure of dispersion
We also need to know how “spread out” the numbers
are about the center.
The deviation of each data value from the center
Types:
Range
Interquartile range
Variance
Standard deviation and
Coefficient of variation.
124
1. THE ARITHMETIC MEAN/ SIMPLE MEAN
($\bar{X}$)
is the sum of the values of all observations divided by
the number of observations.
It is written in statistical terms as:
$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$ (sample mean)
$\mu = \frac{\sum_{i=1}^{N} X_i}{N}$ (population mean)
125
Arithmetic Mean…
Example: Suppose the following data consists of birth
weights (in grams) of all live born infants born at
HUCSH, in the last 2-weeks period.
126
Arithmetic Mean…
Calculate the mean for the above data
$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{3265 + 3260 + \dots + 2834}{20} = \frac{63338}{20} = 3166.9$ g
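The same calculation can be checked with a short Python sketch. The 20 birth weights are taken from the ordered listing given in the median example of this lecture; the variable names are illustrative.

```python
# Birth weights (g) of the 20 infants, as listed in the median example.
birth_weights = [2069, 2581, 2759, 2834, 2838, 2841, 3031, 3101, 3200, 3245,
                 3248, 3260, 3265, 3314, 3323, 3484, 3541, 3609, 3649, 4146]

total = sum(birth_weights)        # sum of all observations
n = len(birth_weights)            # number of observations
mean_bw = total / n               # arithmetic mean

print(f"sum = {total}, n = {n}, mean = {mean_bw:.1f} g")
```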
127
Arithmetic Mean…
Exercise: Assume there were 15 patients scheduled
for a particular OR procedure on Monday. The
following data shows the time (in minutes) each
patient will stay on the OR table for the procedure.
Calculate the average time required to undertake the
procedure.
30, 26, 26, 36, 48, 50, 41, 31, 29, 27, 33, 35, 52, 28, 37
128
Arithmetic Mean…
Answer for the exercise
$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$, where $X$ = time in minutes and $i = 1, 2, \dots, n$
$= \frac{30 + 26 + \dots + 37}{15} = \frac{529}{15} \approx 35$ minutes
129
Mean for Grouped data
130
Example: Calculate the mean for the following data
(60 + 187 + 264 + 189 + 96 + 74)/40 = 870/40 = 21.75 hr
131
Mean for grouped data…
Exercise: Assume different scheduled OR procedures
were done for 122 patients in 2007. The following
data shows the days each patient will wait before the
procedure. Calculate the average waiting time for OR
procedure in this Hospital.
Interval of waiting days   Frequency (f)
2-6                        27
7-11                       34
12-16                      36
17-21                      25
Total                      122
132
Mean for grouped data…
Answer for the exercise
Interval of waiting days   Frequency (f)   Mid point (m)        f*m
2-6                        27              (1.5+6.5)/2 = 4      27*4 = 108
7-11                       34              (6.5+11.5)/2 = 9     34*9 = 306
12-16                      36              (11.5+16.5)/2 = 14   36*14 = 504
17-21                      25              (16.5+21.5)/2 = 19   25*19 = 475
Total                      122                                  1393
Mean = 1393/122 = 11.4 days
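A minimal sketch of the grouped-mean calculation for this table: each class is represented by its mid point, each mid point is weighted by the class frequency, and the total is divided by the number of observations. The variable names are illustrative.

```python
# (lower limit, upper limit, frequency) for each class of waiting days.
classes = [(2, 6, 27), (7, 11, 34), (12, 16, 36), (17, 21, 25)]

midpoints = [(lo + hi) / 2 for lo, hi, _ in classes]          # 4, 9, 14, 19
freqs = [f for _, _, f in classes]

sum_fm = sum(f * m for f, m in zip(freqs, midpoints))         # sum of f*m
n = sum(freqs)                                                # total frequency
grouped_mean = sum_fm / n

print(f"sum(f*m) = {sum_fm}, n = {n}, mean = {grouped_mean:.1f} days")
```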
137
Geometric Mean…
138
Geometric Mean…
139
Geometric mean…
Example: the following data shows the minimum
inhibitory concentration of penicillin in urine for N.
gonorrhoeae in 74 patients. Calculate the geometric
mean.
140
Geometric mean…
GM = antilog of [ (Σ f_i log x_i)/n ]
= antilog[ (21 log 0.03125 + 6 log 0.0625 + … + 3 log 1.0) / 74 ]
= 0.143
141
Weighted Mean
142
Weighted mean…
Example:
143
Weighted mean…
Exercise: Suppose you are selecting one nurse for an educational opportunity,
and the human resource office has given weights for the selection criteria.
Based on this, Abebe and Kebede apply for the opportunity, and their
profiles are as follows.
Abebe: 8 years work experience, GPA 3.2 and 97.5 boss evaluation result.
Kebede: 18 years work experience, GPA 3.8 and 74.0 boss evaluation result.
145
Example: Compute median for the birth weight data
Solution: First arrange the sample in ascending order
2069, 2581, 2759, 2834, 2838, 2841, 3031, 3101,
3200, 3245, 3248, 3260, 3265, 3314, 3323, 3484,
3541, 3609, 3649, 4146
Since n=20 is even,
Median = average of the 10th and 11th ordered
observations = (3245 + 3248)/2 = 3246.5 g
Exercise: Omit the last BWT (4146 g) and calculate the sample
median
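As a quick check, the median for both cases (all 20 birth weights, and the 19 that remain after omitting 4146 g) can be computed with Python's standard `statistics` module. This is only a sketch; the variable names are illustrative.

```python
import statistics

# Ordered birth weights (g) from the example above.
birth_weights = [2069, 2581, 2759, 2834, 2838, 2841, 3031, 3101, 3200, 3245,
                 3248, 3260, 3265, 3314, 3323, 3484, 3541, 3609, 3649, 4146]

# n = 20 is even: the median is the average of the 10th and 11th values.
median_all = statistics.median(birth_weights)

# Omitting the largest value leaves n = 19 (odd): the median is the 10th value.
median_without_last = statistics.median(birth_weights[:-1])

print(median_all, median_without_last)
```

Dropping the extreme value changes the median only slightly, which illustrates why the median is not much disturbed by extreme values.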
146
Median for grouped data
In calculating median for grouped data we assume that the
values within a class-interval are evenly distributed
through the interval
The first step is to locate the class interval in which the
median is located, using the following procedure
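Find n/2 and identify the first class interval whose cumulative
frequency is at least n/2
Then use the following formula:
Median = LTL + [(n/2 − Fc)/fm] × w, where LTL is the lower true limit of the median class, Fc is the cumulative frequency before the median class, fm is the frequency of the median class and w is the class width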
147
Median For Grouped data
148
Compute the median age for
the following grouped data
149
Median for grouped data…
First, we need to find the median class
The median class is the first class with cumulative
frequency of at least 169/2 = 84.5, which is the third class
interval, 30-39
LTL = 29.5, Fc = 70, frequency of the median class fm = 47, class width w = 10
Median = 29.5 + [(84.5 − 70)/47] × 10
= 29.5 + (14.5/47) × 10 = 32.58 ≈ 33
150
Median for grouped data…
Exercise: Assume different scheduled OR procedures
were done for 122 patients in 2007. The following
data shows the days each patient will wait before the
procedure. Calculate the median waiting time for OR
procedure in this Hospital.
Interval of waiting days   Frequency (f)
2-6                        27
7-11                       34
12-16                      36
17-21                      25
Total                      122
151
Median for grouped data…
Answer for the exercise
Interval of waiting days   Frequency (f)   Mid point (m)        Cumulative frequency
2-6                        27              (1.5+6.5)/2 = 4      27
7-11                       34              (6.5+11.5)/2 = 9     61
12-16                      36              (11.5+16.5)/2 = 14   97
17-21                      25              (16.5+21.5)/2 = 19   122
Total                      122
n/2 = 122/2 = 61, so the median class is 7-11 (cumulative frequency 61)
Median = 6.5 + [(61 − 27)/34] × 5
= 6.5 + (34/34) × 5 = 11.5 days
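The procedure for the grouped median can be sketched as a small function: locate the first class whose cumulative frequency reaches n/2, then interpolate inside that class, assuming the values are evenly spread through the interval. The function name and argument layout are illustrative choices, not part of the lecture.

```python
def grouped_median(classes, width):
    """Median for grouped data.

    classes: list of (lower_true_limit, frequency) pairs in ascending order.
    width:   common class width.
    """
    n = sum(f for _, f in classes)
    half = n / 2
    cum_before = 0                      # cumulative frequency before the current class
    for lower, f in classes:
        if cum_before + f >= half:      # first class that contains n/2
            return lower + (half - cum_before) / f * width
        cum_before += f
    raise ValueError("no classes supplied")

# Waiting days: true lower limits 1.5, 6.5, 11.5, 16.5 and class width 5.
waiting = [(1.5, 27), (6.5, 34), (11.5, 36), (16.5, 25)]
median_days = grouped_median(waiting, width=5)
print(median_days)   # 6.5 + (61 - 27) / 34 * 5
```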
152
Median…
Advantages
It is easily calculated and is not much disturbed by
extreme values
It is more typical of the series
The median may be located even when the data are
incomplete, e.g, when the class intervals are irregular
and the final classes have open ends.
Disadvantages
The median is not so well suited to algebraic treatment
as the arithmetic, geometric and harmonic means.
It is not so generally familiar as the arithmetic mean
153
Percentile
Certain percentile or functions of percentiles have specific
names:
154
Percentile…
A percentile has an intuitively simple meaning—for
example, the 25th percentile is that value of a variable
such that 25% of the observations are less than that
value and 75% of the observations are greater.
The Pth percentile of a sample of n observations is
that value of the variable with rank (P /100)(1+n).
If this rank is not an integer, it is rounded to the nearest half
rank
155
Percentile…
Example: The following data deal with the number of
patients scheduled for surgery for 15 consecutive
days in the last month. Calculate 50 th, 25th, 10th, and
90th percentile.
30,26,26,36,48,50,16,31,22,27,23,35,52,28,37
Ranked data: 16,22,23,26,26,27,28,30,31,35,36,37,48,50,52
156
Percentile
The 50th percentile is that value with rank(50/100)(1+15)
=8. The eighth largest (or smallest) observation is 30.
The 25th percentile is the observation with rank(25/100)
(1+15)=4, and this is 26.
Similarly, the 75th percentile is 37.
The 10th percentile (or first decile) is that value with
rank(10/100)(1+15) =1.6, so we take the value halfway
between the smallest and second-smallest observation,
which is (1/2)(16+22) =19.
The 90th percentile is the value with rank(90/100)
(1+15)=14.4; this is rounded to the nearest half rank of 14.5.
The value with this half rank is(1/2)(50+52)=51.
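The rank rule used in these calculations, rank = (P/100)(n + 1) rounded to the nearest half rank, can be sketched as below. The function name is illustrative, and clamping the rank to the range 1..n is an assumption added so that very small or very large percentiles do not fall outside the data.

```python
import math

def percentile_half_rank(values, p):
    """P-th percentile using rank (P/100)(n+1), rounded to the nearest half rank."""
    data = sorted(values)
    n = len(data)
    rank = (p / 100) * (n + 1)
    rank = round(rank * 2) / 2          # nearest half rank (Python rounds exact ties to even)
    rank = min(max(rank, 1), n)         # assumption: keep the rank inside 1..n
    lower = data[math.floor(rank) - 1]  # observation just below (or at) the rank
    upper = data[math.ceil(rank) - 1]   # observation just above (or at) the rank
    return (lower + upper) / 2          # a half rank is the average of its two neighbours

patients = [30, 26, 26, 36, 48, 50, 16, 31, 22, 27, 23, 35, 52, 28, 37]
for p in (10, 25, 50, 75, 90):
    print(p, percentile_half_rank(patients, p))
```

Statistical packages use several slightly different percentile conventions, so results from software may differ a little from this hand rule.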
157
Mode
is the value which occurs with the greatest
frequency.
If all the values are different there is no mode; on
the other hand, a set of values may have more than
one mode (bimodal, trimodal…).
A distribution with one mode is referred to as
unimodal.
Characteristics of mode:
It is a positional average, and there can be more than one mode.
It is not affected by extreme values
158
Mode…
Example: What is the mode of the above data on the
number of patients scheduled for surgery for 15
consecutive days in the last month.
30,26,26,36,48,50,16,31,22,27,23,35,52,28,37
Mode = 26
159
Mode for grouped data
160
Mode…
Advantages
Since the mode is usually an “actual value”, it
indicates the precise value of an important part of the
series.
Disadvantages
Unless the number of items is fairly large and the
distribution reveals a distinct central tendency, the
mode has no significance
It is not capable of mathematical treatment
In a small number of items the mode may not exist.
161
Exercise
Suppose the following data show the maximal static
inspiratory pressure (PI max in cmH2O) of patients
with cystic fibrosis admitted in a certain hospital
during one month duration.
80 100 85 110 75 85 45 70 125 110
110 95 80 75 150 95 130 100 100 75
90 75 120 40 95
Compute the arithmetic mean, median, and mode?
162
Answer
Mean = ∑ Xi / n
= (80 + 85 + … + 95)/25 = 2315/25 = 92.6 cm H2O
Ordered data:
40 45 70 75 75 75 75 80 80 85 85
90 95 95
95 100 100 100 110 110 110 120 125 130 150
Since n = 25 is odd,
Median = ((n+1)/2)th observation = ((25+1)/2)th = 13th observation
= 95 cm H2O
163
Answer …
If we discard the last observation, n = 24 is even;
therefore, the median is the average of the
(n/2)th and (n/2 + 1)th observations:
(90 + 95)/2 = 92.5 cm H2O.
C) Mode
The modal value of the above PI max data is 75cmH2O
(this value occurred with the greatest frequency as
compared to the other values).
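The three answers for the PI max exercise can be reproduced with Python's standard `statistics` module. This is a sketch for checking hand calculations; `multimode` is used because a data set may have more than one mode.

```python
import statistics

# PI max (cm H2O) of the 25 patients in the exercise.
pi_max = [80, 100, 85, 110, 75, 85, 45, 70, 125, 110,
          110, 95, 80, 75, 150, 95, 130, 100, 100, 75,
          90, 75, 120, 40, 95]

mean_pi = statistics.mean(pi_max)          # sum / n
median_pi = statistics.median(pi_max)      # middle value of the ordered data
modes_pi = statistics.multimode(pi_max)    # most frequent value(s)

# Median after discarding the largest observation (n = 24, even).
median_24 = statistics.median(sorted(pi_max)[:-1])

print(mean_pi, median_pi, modes_pi, median_24)
```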
164
Measure of Dispersion
165
Measures of Dispersion/ Variation
Consider the following data sets:
Set 1: 60 40 30 50 60 40 70 50
Set 2: 50 49 49 51 48 50 53 50
The two data sets given above have a mean of 50
Which data set is more scattered?- set 1 is more
“spread out” than set 2.
How do we express this numerically? Using measures
of dispersion.
The commonly used measures of scatter are: range,
standard deviation, variance and coefficient of
variation
166
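A short sketch comparing the two data sets: both have the same mean, but the range and the sample standard deviation show that Set 1 is far more spread out. The variable names are illustrative.

```python
import statistics

set1 = [60, 40, 30, 50, 60, 40, 70, 50]
set2 = [50, 49, 49, 51, 48, 50, 53, 50]

for name, data in (("Set 1", set1), ("Set 2", set2)):
    mean = statistics.mean(data)
    data_range = max(data) - min(data)      # highest minus lowest value
    sd = statistics.stdev(data)             # sample standard deviation (n - 1)
    print(f"{name}: mean = {mean}, range = {data_range}, SD = {sd:.2f}")
```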
Measure of dispersion…
167
1. Range
Is the difference between the highest and lowest value of the
observation in the data.
It is the crudest measure of dispersion.
Range=Xmax – Xmin
Ex: Ranked data for number of patients scheduled for surgery for
15 consecutive days in the last month on the above example
16,22,23,26,26,27,28,30,31,35,36,37,48,50,52
Range= 52-16= 36
168
2. Interquartile range
16,22,23,26,26,27,28,30,31,35,36,37,48,50,52
25th percentile: the observation with rank (25/100)(1+15) = 4, which is 26
75th percentile: the observation with rank (75/100)(1+15) = 12, which is 37
Interquartile range = 37 − 26 = 11
169
Interquartile range…
Exercise: The following data shows 10 days FBS
measurement of a patient admitted in the surgical
ward in mg/dl. Calculate the interquartile range.
222, 300, 188, 89, 155, 306, 121, 414, 600, 326
170
Interquartile range…
Answer for the exercise
Raw data: 222, 300, 188, 89, 155, 306, 121, 414, 600, 326
Ranked data: 89,121,155,188,222,300,306,326,414,600
25th percentile: the (25/100)(1+10) = 2.75th observation, or
1st quartile = ((n+1)/4)th observation = 2.75th observation
= 121 + (155 − 121) × 0.75 = 146.5
75th percentile: the (75/100)(1+10) = 8.25th observation, or
3rd quartile = (3(n+1)/4)th observation = 8.25th observation
= 326 + (414 − 326) × 0.25 = 348
Interquartile range = 348 − 146.5 = 201.5 mg/dl
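The interpolation used in this answer (position p(n + 1) in the ranked data, with linear interpolation between neighbours) matches the "exclusive" method of `statistics.quantiles` in Python 3.8 and later, so the result can be checked with the sketch below.

```python
import statistics

# 10 days of FBS measurements (mg/dl).
fbs = [222, 300, 188, 89, 155, 306, 121, 414, 600, 326]

# method="exclusive" places the i-th quartile at position i*(n+1)/4 in the ranked data.
q1, q2, q3 = statistics.quantiles(fbs, n=4, method="exclusive")
iqr = q3 - q1

print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr} mg/dl")
```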
171
Measure of dispersion…
Some Measures of Central tendency and dispersion for number
of patients scheduled for surgery for 15 consecutive days in
the last month from the above example
16,22,23,26,26,27,28,30,31,35,36,37,48,50,52
172
3. Mean Deviation
Each value in a data set differs from the sample mean by some
specific amount called deviation
Mean deviation is the average of the absolute deviations
from a central value, generally the mean or median
173
Computation of the Mean Deviation
1. Calculate the mean from the data
2. Calculate the deviations from the mean
3. Sum up all deviations, treating each as positive (take the absolute
value)
4. Divide the sum of the deviations by the total number of
observations
174
Mean Deviation…
175
Mean Deviation…
Exercise: Calculate the MD for 10 days FBS
measurement of a patient admitted in the surgical
ward in mg/dl in the above exercise.
222, 300, 188, 89, 155, 306, 121, 414, 600, 326
176
Mean Deviation…
Answer for the exercise
Raw data: 222, 300, 188, 89, 155, 306, 121, 414, 600, 326
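The four computation steps for the mean deviation can be sketched directly in Python for the FBS data; the variable names are illustrative.

```python
# 10 days of FBS measurements (mg/dl).
fbs = [222, 300, 188, 89, 155, 306, 121, 414, 600, 326]

n = len(fbs)
mean_fbs = sum(fbs) / n                          # step 1: the mean
abs_devs = [abs(x - mean_fbs) for x in fbs]      # steps 2-3: absolute deviations
mean_deviation = sum(abs_devs) / n               # step 4: average absolute deviation

print(f"mean = {mean_fbs:.1f} mg/dl, MD = {mean_deviation:.1f} mg/dl")
```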
177
Properties of MD
It is based on all observations in the data set
It is not affected much by extreme values
However, ignoring the negative signs is not
mathematically sound.
178
4. Variance
In MD, ignoring the negative signs is not mathematically sound.
To tackle this limitation, the square of each deviation is taken instead.
The variance is the average of the squares of the deviations taken
from the mean
This measure of variation is universally used to show the scatter of
the individual values around the mean of a given distribution
(Population variance = $\sigma^2$ and sample variance = $S^2$)
Let X1, X2, ..., Xn be the measurements on n sample units, then:
$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$
179
Variance…
Exercise: Calculate the variance for 10 days FBS
measurement of a patient admitted in the surgical
ward in mg/dl in the above exercise.
222, 300, 188, 89, 155, 306, 121, 414, 600, 326
180
Variance…
Answer for the exercise
Raw data: 222, 300, 188, 89, 155, 306, 121, 414, 600, 326
Mean = 2721/10 = 272.1 mg/dl
$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} = \frac{212138.9}{9} \approx 23571$ (mg/dl)$^2$
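A sketch of the sample variance for the FBS data, computed step by step (sum of squared deviations divided by n − 1) and checked against `statistics.variance`, which uses the same n − 1 divisor.

```python
import statistics

# 10 days of FBS measurements (mg/dl).
fbs = [222, 300, 188, 89, 155, 306, 121, 414, 600, 326]

n = len(fbs)
mean_fbs = sum(fbs) / n
sum_sq_dev = sum((x - mean_fbs) ** 2 for x in fbs)   # sum of squared deviations
sample_variance = sum_sq_dev / (n - 1)               # divide by n - 1 for a sample

print(f"sum of squares = {sum_sq_dev:.1f}, S^2 = {sample_variance:.0f} (mg/dl)^2")
print(statistics.variance(fbs))                      # library check
```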
181
5. Sample Standard Deviation
The main disadvantage of variance is that the units of
variance are the square of the units of the original
observations.
The easiest way around this difficulty is to use the square
root of the variance (termed as standard deviation),
which is the widely used measure of dispersion.
It is the positive square root of the variance.
182
Example
The following are the survival times of 11 patients
after cardiac transplant surgery.
Patients are identified numerically from 1 to 11, and
the survival time for the “ith” patient is represented as
Xi for i= 1, …, 11.
183
Example…
184
SD…
Exercise: Calculate the SD for 10 days FBS
measurement of a patient admitted in the surgical
ward in mg/dl in the above exercise.
222, 300, 188, 89, 155, 306, 121, 414, 600, 326
185
SD…
Answer for the exercise
Raw data: 222, 300, 188, 89, 155, 306, 121, 414, 600, 326
Mean= 2721/10= 272.1mg/dl
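The standard deviation is the positive square root of the sample variance. A minimal sketch for the FBS data, using `statistics.stdev` (n − 1 divisor) and a manual square root for comparison:

```python
import math
import statistics

# 10 days of FBS measurements (mg/dl).
fbs = [222, 300, 188, 89, 155, 306, 121, 414, 600, 326]

sd_fbs = statistics.stdev(fbs)                       # sample standard deviation
sd_manual = math.sqrt(statistics.variance(fbs))      # square root of the sample variance

print(f"SD = {sd_fbs:.1f} mg/dl")
```

Unlike the variance, the result is in the same units as the original measurements (mg/dl).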
187
SD for grouped frequency…
188
6. COEFFICIENT OF VARIATION (CV)
189
Example
Consider the following two samples that represent
cholesterol measurements (mg/100ml), each on the
same person, but using different measurement
techniques.
$S = \sqrt{340} \approx 18.4$ mg/100 ml
191
Solution…
For ME method:
$S = \sqrt{\frac{(192 - 200)^2 + (197 - 200)^2 + \dots + (200 - 200)^2}{5 - 1}} = \sqrt{39.5} \approx 6.3$ mg/100 ml
192
Example
Compute the Coefficient of variation CV for the age and
weight of groups of students.
Variable Mean Standard deviation
193
Solution…
CV (age) = (3.15 years / 20.63 years) × 100
= 15.3%
CV (weight) = (6.10 kg / 58.89 kg) × 100
= 10.4%
Thus, the age of the students is relatively more spread
out than their weights.
But if one considered only the respective standard
deviations, he would say that the weight of the students
is more spread out than their age.
Read about properties of mean and standard deviation.
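The coefficient of variation is the standard deviation expressed as a percentage of the mean, which makes it unit-free. The sketch below applies it to the means and standard deviations quoted in the solution; the function name is illustrative.

```python
def coefficient_of_variation(sd, mean):
    """CV (%) = standard deviation / mean * 100."""
    return sd / mean * 100

cv_age = coefficient_of_variation(3.15, 20.63)       # age in years
cv_weight = coefficient_of_variation(6.10, 58.89)    # weight in kg

print(f"CV(age) = {cv_age:.1f}%, CV(weight) = {cv_weight:.1f}%")
```

Because the units cancel, the two percentages can be compared directly even though age and weight are measured on different scales.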
194
Which measures to use?
When the distribution of the data is symmetric and unimodal
(i.e. the data are approximately normally distributed), it is
usual to summarize the data using means and standard
deviations.
However when the data are skewed, it is preferable to use the
median and inter quartile range as summary statistics.
Median and quartiles are not easily influenced by extreme
values in a skewed distribution unlike means and standard
deviations.
Remark:
The mean and median of symmetric distribution coincide.
When the distribution is skewed to the right, its mean is
larger than its median.
When the distribution is skewed to the left, its mean is
smaller than its median.
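The remark about skewness can be illustrated with the FBS data from the earlier exercises, which contain one large value (600 mg/dl) that pulls the mean upward. This is only an illustration on a small sample.

```python
import statistics

# 10 days of FBS measurements (mg/dl); 600 is an extreme high value.
fbs = [222, 300, 188, 89, 155, 306, 121, 414, 600, 326]

mean_fbs = statistics.mean(fbs)
median_fbs = statistics.median(fbs)

# With a long right tail the mean exceeds the median.
print(f"mean = {mean_fbs:.1f}, median = {median_fbs:.1f}")
```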