Chapter One

The document outlines a course on Advanced Biostatistics and Data Management, detailing its content and learning objectives. It covers key statistical concepts, methods of data collection, and the distinction between primary and secondary data, emphasizing the importance of biostatistics in medical research. Additionally, it discusses various types of statistics, variables, and scales of measurement relevant to data analysis.

Uploaded by

naseemahmed5599
Copyright
© © All Rights Reserved

ADVANCED BIOSTATISTICS

AND DATA MANAGEMENT

Ahmed M (Assistant Professor, PhD in Epidemiology candidate)

2/20/2025 1
Course Content
• Introduction
• Elementary probability and probability distributions
• Statistical inference
• Sampling distribution
• Introduction to STATA
• Categorical data analysis
• Continuous data analysis
• Longitudinal data analysis
• Nonparametric tests
INTRODUCTION
Learning objectives
At the end of this chapter the student will be able to:
• Define what is meant by statistics.
• Explain what is meant by descriptive statistics and inferential statistics.
• Distinguish between a qualitative variable and a quantitative variable.
• Distinguish among the nominal, ordinal, interval, and ratio levels of measurement.
• Identify sources of data.
• Describe methods of data collection.
• Describe methods of data presentation.
• Compute numerical summary measures.
Statistical data: refers to numerical
descriptions of things. These descriptions may
take the form of counts or measurements.
E.g. statistics of malaria cases include fever
cases, number of positives obtained, sex and
age distribution of positive cases, etc.

NB: Even though statistical data always denote figures (numerical descriptions), it must be remembered that not all 'numerical descriptions' are statistical data.
Why?

4
Statistical methods: refers to methods that are used for collecting, organizing, analyzing and interpreting numerical data in order to understand a phenomenon or make wise decisions. In this sense it is a branch of the scientific method and helps us to know the object under study in a better way.

5
Introduction…
• What is statistics?
- we use statistics every day, often without
realising it.
• Statistics: A field of study concerned with:
– collection, organization, analysis,
summarization and interpretation of
numerical data, &
– the drawing of inferences about a body of data
when only a small part of the data is
observed.

6
Biostatistics?
• The application of statistical methods to the biological and medical sciences. These methods make it possible to distinguish methodically between true differences among observations and random variation caused by chance alone.
• Concerned with the interpretation of biological data and the communication of information derived from these data.
• Has a central role in medical investigations.
• Because almost every experiment involves uncertainty, statistics is the scientific method for quantitative data analysis.

7
Uses of biostatistics
• Provide methods of organizing information
• Assessment of health status: Vital statistics ?
• Health program evaluation
• Resource allocation: Census
• Knowledge of biostatistics permits one to draw valid conclusions from data sets.
• Magnitude of association
– Strong vs weak association between exposure and
outcome
8
Uses of biostatistics
• Assessing risk factors
– Cause & effect relationship
• Evaluation of a new vaccine or drug
– What can be concluded if the proportion of people
free from the disease is greater among the
vaccinated than the unvaccinated?
– How effective is the vaccine (drug)?
– Is the effect due to chance or some bias?
• Drawing of inferences
– Information from sample to population

Limitations of statistics
• It deals mainly with quantitative data rather than qualitative data.
• It deals with aggregates of facts: it gives no importance to individual items.
• Statistical results are only approximately correct, not mathematically exact.

11
Famous quote in statistics

• Statisticians applying valid statistical methods will reach consistent conclusions. The data doesn't lie. It is the people that manipulate the data that lie.

12
Types of biostatistics

13
Types of Statistics
1. Descriptive statistics:
• Descriptive statistics are methods for organizing and
summarizing data.
• Help to identify the general features and trends in a set of data and to extract useful information.
• For example, tables or graphs are used to organize
data, and descriptive values such as the average score
are used to summarize data.

14
Descriptive biostatistics
• Some statistical summaries which are especially
common in descriptive analyses are:
• Measures of central tendency
• Measures of dispersion
• Cross-tabulation /contingency table
• Histogram
• Quantile, Q-Q plot
• Scatter plot
• Box plot

15
Types of Statistics
2. Inferential statistics:
• Inferential statistics are methods for using sample data to
make general conclusions (inferences) about populations.

• Because a sample is typically only a part of the whole population, sample data provide only limited information about the population.

• As a result, sample statistics are generally imperfect representatives of the corresponding population parameters.

16
Inferential statistics:

• Statistical summaries which are common in inferential statistics: principles of probability, estimation, confidence intervals, comparison of two or more means or proportions, hypothesis testing, etc.
Types of variable
Depending on the characteristic of the measurement, a variable can be:
• Qualitative (categorical) variable: a variable or characteristic that cannot be measured in quantitative form but can only be identified by name or category, i.e. a variable that can be placed into distinct categories according to some characteristic or attribute.
For instance: place of birth, ethnic group, type of drug, stages of breast cancer (I, II, III, or IV), degree of pain (low, moderate, severe).

19
• Examples: Patient status, cancer stages, social
class, Likert scales etc.

22
Example of ordinal scale:

• Pain level:
1. None
2. Mild
3. Moderate
4. Severe
• The numbers have LIMITED meaning: 4 > 3 > 2 > 1 is all we know, apart from their utility as labels.

23
3. Interval scale:
- Measured on a continuum and differences between any
two numbers on a scale are of known size.
Example: Temperature in °F on 4 consecutive days
Days:       A   B   C   D
Temp. (°F): 50  55  60  65
For these data, not only is day A at 50° cooler than day D at 65°, but it is 15° cooler.
- It has no true zero point. “0” is arbitrarily chosen and doesn’t reflect the absence of temperature.

24
4. Ratio scale:
- Measurement begins at a true zero point and the scale has equal intervals.
- Examples: Height, age, weight, BP, etc.
• Note on meaningfulness of “ratio”:
– Someone who weighs 80 kg is two times as heavy as someone else who weighs 40 kg. This is true even if weight had been measured in other units.
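The contrast between the interval and ratio scales can be checked numerically. The following is a minimal Python sketch, not part of the original slides; the helper functions and the kg-to-lb conversion factor are illustrative assumptions.

```python
def fahrenheit_to_celsius(temp_f):
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32) * 5 / 9


def kg_to_lb(weight_kg):
    """Convert a weight from kilograms to pounds (approximate factor)."""
    return weight_kg * 2.20462


# Ratio scale: "twice as heavy" holds in any unit of weight.
ratio_kg = 80 / 40
ratio_lb = kg_to_lb(80) / kg_to_lb(40)

# Interval scale: the ratio of two temperatures depends on the unit,
# because the zero point of each temperature scale is arbitrary.
ratio_f = 65 / 50
ratio_c = fahrenheit_to_celsius(65) / fahrenheit_to_celsius(50)

print(f"Weight ratio in kg: {ratio_kg:.2f}, in lb: {ratio_lb:.2f}")
print(f"Temperature ratio in F: {ratio_f:.2f}, in C: {ratio_c:.2f}")
```

The weight ratio is unchanged by the change of unit, while the temperature ratio is not, which is why statements such as "twice as hot" are not meaningful on an interval scale.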

25
Scales of Measurement

• Nominal = Naming
• Ordinal = Naming + Order
• Interval = Naming + Order + Equal Intervals
• Ratio = Naming + Order + Equal Intervals +
True Zero

26
[Figure: the nominal, ordinal, interval, and ratio scales arranged by degree of precision in measuring]
Basic Terms in statistics

28
Why Sample?
Census of a population may be:
• Impossible
• Impractical
• Too costly

29
Basic terms cont . . .
• Census
– A census is the collection of data from every member of the population.
• Parameter
– A parameter is a numerical measurement describing some characteristic of a population.
• Statistic
– A statistic is a numerical measurement describing some characteristic of a sample.

30
Basic terms cont . . .
• Data are observations (such as measurements,
genders, survey responses) that have been
collected.
• It is the raw material for statistics.
• Can be obtained from:
– Routinely kept records, literature
– Surveys
– Counting
– Experiments
– Reports
– Observation
– Etc
31
Population and Sample
• Population:
– Refers to any collection of objects
• Target population:
– A collection of items that have something in
common for which we wish to draw conclusions at
a particular time.
• E.g., All hospitals in Ethiopia
– The whole group of interest

32
Population and Sample cont…
Study (Sampled) Population:
• The subset of the target population that has at least some chance of being sampled.
• The specific population group from which samples are drawn and data are collected.
Sample:
• A subset of a study population, about which information is actually obtained.
• The individuals who are actually measured and comprise the actual data.
33
Population & Sample

Population                            Sample
• Includes ALL POSSIBLE OBSERVATIONS  • A set of observations from a population
• Greek letters                       • Roman letters

34
• Role of statistics: using information from a sample to make inferences about the population.
[Figure: Population and Sample, linked by Information]

35
E.g.: In a study of the prevalence of HIV among adolescents in Ethiopia, a random sample of adolescents in Lideta Kifle Ketema of Addis Ababa (AA) was included.
• Target population: All adolescents in Ethiopia
• Study population: All adolescents in Addis Ababa
• Sample: Adolescents in Lideta Kifle Ketema who were included in the study
[Figure: nested diagram with the Sample inside the Study Population inside the Target Population]

36
Parameter and Statistic
• Parameter: A descriptive measure computed from the data of a population.
– E.g., the mean (µ) age of the target population
• Statistic: A descriptive measure computed from the data of a sample.
– E.g., sample mean age
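The distinction can be illustrated with a small simulation. This is a minimal sketch, assuming a hypothetical population of 10,000 adolescent ages generated at random; none of these numbers come from the slides.

```python
import random
import statistics

random.seed(1)  # fixed seed so the illustration is reproducible

# Hypothetical population: ages (10-19 years) of 10,000 adolescents.
population_ages = [random.randint(10, 19) for _ in range(10_000)]

# Parameter: computed from every member of the population.
mu = statistics.mean(population_ages)

# Statistic: computed from a random sample of 100 members only.
sample_ages = random.sample(population_ages, 100)
x_bar = statistics.mean(sample_ages)

print(f"Population mean (parameter): {mu:.2f}")
print(f"Sample mean (statistic):     {x_bar:.2f}")
```

The sample mean is usually close to, but not exactly equal to, the population mean, and it would change if a different sample were drawn.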

37
Exercise: Identify the scale of measurement (type of data) for each of the following:
1. Blood group
2. Temperature (Celsius)
3. Sex
4. Job satisfaction index (1-5)
5. Number of heart attacks
6. Calendar year
7. Serum uric acid (mg/100ml)
8. Number of cases of each reportable disease reported by a health worker
9. The average weight gain of six 1-year-old dogs with a special diet supplement was 950 grams last month.
10. Injury severity (a score between 1 and 3 is allocated depending on the severity; scores of 1 and 3 indicate mild and very severe, respectively).

38
Source of data and methods of data collection
Discuss 5 minutes

39
Source of Data

• Internal source
• External source
– Primary source
– Secondary source

40
Internal and External Source of Data
Internal Sources of Data
• Many institutions and departments have information about their regular functions, kept for their own internal purposes.
• When that information is used in a survey, it is called an internal source of data.
• E.g., public health institutes, nursing association members, etc.
External Sources of Data
• When information is collected from outside agencies, it is called an external source of data.
• Such data are either primary or secondary.
• This type of information can be collected by census or by sampling, by conducting surveys.

41
Primary Data
• Primary data are those which are collected for
the first time.
• They are real-time data collected by the researchers themselves.
• This is the process of collecting and making use of the data.
• These data are originated by the researcher specifically to address the research problem.

42
Methods of Collecting Primary Data
1. Direct personal investigation (i.e. interview method)
2. Indirect oral investigation (i.e. through enumerators)
3. Investigation through local reporters
4. Investigation through mailed questionnaire
5. Investigation through observation
43
Secondary Data
• Secondary data are those that have already been
collected by others.
• These are usually in journals, periodicals, research
publications, official records etc.
• Secondary data may be available in the published
or unpublished form.
• When it is not possible to collect the data by a primary method, the investigator resorts to a secondary method.
• These data were collected for some purpose other than the problem at hand.
44
Method of Collecting Secondary Data
1. Published sources
a) International publications
b) Government publications
c) Commercial research, educational institutes, unions, organizations, etc.
2. Unpublished sources

45
Difference between Primary and Secondary Data
Primary Data                            Secondary Data
• Real-time data.                       • Past data.
• Sources of the data are known.        • Sources of the data may be uncertain.
• Help to give results/findings.        • Help in refining the problem.
• Costly and time-consuming process.    • Cheap and quick process.
• Response bias can be avoided.         • Cannot know whether the data are biased.
• More flexible.                        • Less flexible.

46
Data collection techniques VS Data collection tools

Data collection technique               Data collection tools
Available document review               Checklist; data compilation forms
Observing                               Eyes and other senses, pen/paper, watch, questionnaire, microscope
Interviewing                            Interview guide, tape recorder, etc.
Focus group discussion (FGD)
Administering written questionnaires    Questionnaires

47
• Data collection methods?

48
Data collection methods…

• Accuracy and “practicability” are often inversely correlated. A method providing more satisfactory information will often be a more expensive or inconvenient one.

• Therefore, accuracy must be balanced against practical considerations (resources and other practical limitations).

49
Data collection methods…

• For quantitative data, we usually use questionnaires (standard or structured).
- The questionnaire could be self-administered or interviewer-administered (face-to-face, by telephone, or through other electronic media such as the internet).

50
Data collection methods…

- A self-administered questionnaire is filled in by the study subjects themselves, on the spot or by mail.
- Self-administered questionnaires are suitable for literate study subjects, simple questions that don't need further clarification, and sensitive matters (e.g. sexual issues, criminal activities, substance abuse).

51
Data collection methods…

- Interviewer-administered questionnaires are suited for illiterate study subjects, complex questions that need further clarification, non-private or non-sensitive issues, and situations in which information on the emotional reactions of study subjects is to be recorded.

52
Data collection methods…

• For qualitative data, the common methods of collection are focus group discussion, in-depth interview (unstructured/semi-structured), observation (participant/non-participant), and case studies.

53
Types of Questions
• Depending on how questions are asked and recorded, we can distinguish two major possibilities: open-ended questions and closed questions.
Open-ended questions
Open-ended questions permit free responses that should be recorded in the respondent’s own words. The respondent is not given any possible answers to choose from.
Such questions are useful to obtain information on:
• Facts with which the researcher is not very familiar,
• Opinions, attitudes, and suggestions of informants, or
• Sensitive issues.
Open-ended questions…

For example
Can you describe exactly what the traditional birth
attendant did when your labor started?
What do you think are the reasons for a high drop-out
rate of village health committee members?
What would you do if you noticed that your daughter
(school girl) had a relationship with a teacher?

55
Closed Questions
• Closed questions offer a list of possible options or answers from which the respondents must choose.
• When designing closed questions one should try to:
– Offer a list of options that are exhaustive and mutually exclusive.
– Keep the number of options as few as possible.
• Closed questions are useful if the range of possible responses is known.

56
Closed Questions…

For example
• What is your marital status?
1. Single
2. Married/living together
3. Separated
4. Divorced
5. Widowed
• Have you ever gone to the local village health worker for treatment?
1. Yes
2. No
Closed questions may also be used if one does not want to waste the time of the respondent and interviewer by obtaining more information than one needs.
Methods of data collection summary
Types of data   Data type by source   Methods of data collection
Qualitative     Primary               FGD
                Primary               In-depth interview
                Primary               Observation
Quantitative    Primary/secondary     Questionnaires: open/closed, structured, self/interviewer administered
                Primary/secondary     Observation; use of documentary sources
Method of data presentation

59
Method of data presentation
Definitions
• Frequency distribution: the organization of raw data in the form of a table, using classes and frequencies.
• Frequency: for a particular class, the number of original values that fall into that class (the number of values in a specific class of the distribution).
• Raw data: recorded information in its original collected form, whether counts or measurements.
60
Cont…
• Tabulation: is the process of summarizing,
classifying the data in the form of table.
• There are three basic types of frequency
distributions.
– Categorical frequency distribution.
– Ungrouped frequency distribution.
– Grouped frequency distribution.

61
Categorical frequency distribution
• Used to present data that can be placed in specific categories, such as nominal or ordinal scale data.
• Used for qualitative data, such as marital status, religion, blood type, etc.
• Example: A social worker collected the following data on marital status for 25 people (M = married, S = single, W = widowed, D = divorced):
M W D S S M M M W D S M D
S W D D S S S W W D D M
• The following demonstrates how to construct the frequency distribution for these data.
62
Cont…

Class   Tally      Frequency   Percent
M       ///// /    6           24
S       ///// //   7           28
D       ///// //   7           28
W       /////      5           20
Total              25          100
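The tallying step can also be done programmatically. The following is a minimal Python sketch using the 25 marital status codes listed in the example; the variable names are illustrative.

```python
from collections import Counter

# The 25 responses from the example (M=married, S=single, W=widowed, D=divorced).
responses = "M W D S S M M M W D S M D S W D D S S S W W D D M".split()

counts = Counter(responses)  # frequency of each category
n = len(responses)           # total number of observations

print("Class  Frequency  Percent")
for status in ["M", "S", "D", "W"]:
    percent = 100 * counts[status] / n
    print(f"{status:<6} {counts[status]:<10} {percent:.0f}")
print(f"Total  {n:<10} 100")
```

`Counter` simply counts how many times each code occurs, which is exactly what the tally column records by hand.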

63
2. Ungrouped frequency distribution

• It is often constructed for small data sets or for data on a discrete variable. Each individual value is presented separately.
• Example: the following data represent the marks of 20 students.
80 76 90 85 80 70 60 62 70 85 65
60 63 74 75 76 70 70 80 85
• Construct the ungrouped frequency distribution for this data set.
64
Mark   Tally   Frequency
60     //      2
62     /       1
63     /       1
65     /       1
70     ////    4
74     /       1
75     /       1
76     //      2
80     ///     3
85     ///     3
90     /       1
Total          20

65
Grouped frequency distribution
• When the range of the data is large, the data must be grouped into classes that are more than one unit in width.
Definitions:
• Grouped frequency distribution: a frequency distribution in which several values are grouped into one class.
• Class limits (CL): the values that separate one class in a grouped frequency distribution from another.
• The limits can actually appear in the data, and there are gaps between the upper limit of one class and the lower limit of the next.
• Unit of measurement (U): the distance between two possible consecutive measures.
• It is usually taken as 1, 0.1, 0.01, 0.001, etc.

66
• Data: 1, 5, 8, 10 → U = 1
• Data: 1, 5.2, 7, 8 → U = 0.1
• Data: 2, 9.12, 4, 8, 3 → U = 0.01
• Data: 2, 3.33, 1.2, 8, 7 → U = 0.01

67
• Class boundaries (CB) are the numbers used to
separate classes, but without the gaps created by
class limits.
• They are obtained as follows:
– Find the size of the gap between the upper class
limit of one class and the lower class limit of the
next class.
– Add half of that amount to each upper class limits
to find the upper class boundaries; and subtract
half of that amount from each lower class limit to
find the lower class boundaries.

68
Guidelines for classes

• There should be between 5 and 20 classes.
• The classes must be mutually exclusive. This means that no data value can fall into two different classes.
• The classes must be all inclusive or exhaustive. This means that all data values must be included.
• The classes must be continuous. There are no gaps in a frequency distribution.
• The classes must be equal in width.
69
Steps for constructing a grouped frequency distribution

Number of classes (Sturges' rule): $k = 1 + 3.32\log_{10} n$

70
Example:
– Leisure time (hours) per week for 40 college students:
23 24 18 14 20 36 24 26 23 21 16 15 19 20 22
14 13 10 19 27 29 22 38 28 34 32 23 19 21 31
16 28 19 18 12 27 15 21 25 16
Maximum value = 38, Minimum value = 10
Range = 38 − 10 = 28
$k = 1 + 3.32\log_{10}(40) = 6.32 \approx 6$
Width = (38 − 10)/6 = 4.67 ≈ 5
71
Time (hours)   Frequency   Relative frequency   Cumulative relative frequency
10-14          5           0.125                0.125
15-19          11          0.275                0.400
20-24          12          0.300                0.700
25-29          7           0.175                0.875
30-34          3           0.075                0.950
35-39          2           0.050                1.000
Total          40          1.000
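The construction above can be sketched in Python as follows. This is a minimal illustration using the 40 leisure-time values from the example; rounding the number of classes to the nearest integer and rounding the width up are assumptions that match the worked example.

```python
import math

hours = [23, 24, 18, 14, 20, 36, 24, 26, 23, 21, 16, 15, 19, 20, 22,
         14, 13, 10, 19, 27, 29, 22, 38, 28, 34, 32, 23, 19, 21, 31,
         16, 28, 19, 18, 12, 27, 15, 21, 25, 16]

n = len(hours)
k = round(1 + 3.32 * math.log10(n))                 # number of classes (Sturges' rule)
width = math.ceil((max(hours) - min(hours)) / k)    # class width, rounded up

# Class limits: each class covers `width` consecutive integer values.
lowest = min(hours)
classes = [(lowest + i * width, lowest + (i + 1) * width - 1) for i in range(k)]

# Count how many observations fall inside each pair of class limits.
freqs = [sum(lo <= h <= hi for h in hours) for lo, hi in classes]

cumulative = 0.0
for (lo, hi), f in zip(classes, freqs):
    relative = f / n
    cumulative += relative
    print(f"{lo}-{hi}: frequency={f}, relative={relative:.3f}, cumulative={cumulative:.3f}")
```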
72
Diagrammatic and Graphic presentation of
data
• After the data have been organized into a frequency distribution, they can be presented in charts and graphs.
• These are techniques for presenting data in visual displays using geometric figures and pictures.
• The purposes of presenting data in graphs and charts are:
• They have greater attraction.
• They facilitate comparison.
• They are easily understandable.

73
Diagrammatic presentation of data

• Diagrams are appropriate for presenting discrete data.
• The most commonly used diagrammatic presentations for discrete as well as qualitative data include:
– Pie charts
– Bar charts

74
Pie chart

75
The pie chart

[Figure: pie chart of frequency by group (Men, Women, Girls, Boys), with slices of 15%, 25%, 40%, and 20%]

76
1. Bar charts (or graphs)

• A bar graph is especially suitable for nominal and ordinal data.
• Categories are listed on the horizontal axis (X-axis).
• Frequencies or relative frequencies are represented on the Y-axis (ordinate).
• The heights of the bars represent the value of the frequency (actual number or percentage) for each category.
77
2.1. Simple bar chart: It is a one-dimensional
diagram
E.g. Bar chart for the type of ICU for 25
patients

78
Histogram
• Histograms are frequency distributions with continuous class intervals that have been turned into graphs.
• To construct a histogram, we draw the class boundaries on a horizontal line and the frequencies on a vertical line.
• Non-overlapping intervals that cover all of the data values must be used.
• Bars are drawn over the intervals so that the area of each bar is proportional to the frequency of observations in that interval.
Box-and-whisker-plot

5. Frequency polygon
• A frequency distribution can also be portrayed graphically by means of a frequency polygon.
• To draw a frequency polygon, we connect the mid-points of the tops of the bars of the histogram with straight lines (i.e. we plot the frequencies at the class mid-points and join them).
• The total area under the frequency polygon is equal to the area under the histogram.
• Useful when comparing two or more frequency distributions by drawing them on the same diagram.
Measures of central tendency

86
Measure of Central Tendency(MCT)
• Is a single value that attempts to describe a set of data
by identifying the central position within that set of
data.

• As such, measures of central tendency are sometimes called measures of central location.

• Measures of central tendency convey information regarding the average value.

87
MCT…

• The objective of calculating an MCT is to determine a single figure which may be used to represent the whole data set.

• In that sense it is an even more compact description of the statistical data than the frequency distribution.

88
Characteristics of a good MCT

1. It should be based on all the observations

2. It should not be affected by the extreme values

3. It should have a definite value

4. It should not be subjected to complicated and tedious calculations

5. It should be capable of further algebraic treatment

89
MCT…

• The most common measures of central tendency include:
– Arithmetic mean
– Median
– Mode

90
MCT…

• Before attempting the measures of central tendency and dispersion, let’s see some of the notations that are used frequently.

Notations: $\sum$ = summation
$\bar{X}$ = mean of the sample
μ = the mean of the population
σ = standard deviation of the population

91
MCT…
Example:
Suppose n values of a variable are denoted as $x_1, x_2, x_3, \dots, x_n$; then

$\sum_{i=1}^{n} x_i = x_1 + x_2 + x_3 + \dots + x_n$, where the subscript i ranges from 1 up to n.

Let $x_1 = 2$, $x_2 = 5$, $x_3 = 1$, $x_4 = 4$, $x_5 = 10$, $x_6 = 5$, $x_7 = 8$.

Since there are 7 observations, i ranges from 1 up to 7.

92
MEAN
• The mean (or average) is the most popular and
well known measure of central tendency.

• It is the descriptive measure most people have in mind when they speak of the average.

• It can be used with both discrete and continuous data, although its use is most often with continuous data.

93
Types of Means

• Arithmetic mean

• Weighted mean

• Geometric mean

• Harmonic mean

94
Ungrouped Data: arithmetic mean...

• It is the sum of all observations divided by the number of observations.

95
Mean….
Example:

• The heart rates for n=10 patients were as follows (beats per minute):

=> 167, 120, 150, 125, 150, 140, 40, 136, 120, 150

• What is the arithmetic mean for the heart rate of these patients?

96
Solution
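A minimal Python sketch of this calculation, using the ten heart rates given in the example (the variable names are illustrative, not from the slides):

```python
# Heart rates (beats per minute) of the n = 10 patients in the example.
heart_rates = [167, 120, 150, 125, 150, 140, 40, 136, 120, 150]

n = len(heart_rates)
total = sum(heart_rates)   # sum of all observations
mean_hr = total / n        # arithmetic mean = sum / number of observations

print(f"Sum = {total}, n = {n}, mean = {mean_hr:.1f} beats per minute")
```

The sum of the observations is 1298, so the arithmetic mean is 1298/10 = 129.8 beats per minute.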

97
Mean…
Grouped data:

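• In calculating the mean value from grouped data we assume that all the values falling into a particular class interval are located at the mid-point of the interval.

• It is calculated as follows: $\bar{X} = \dfrac{\sum m_i f_i}{\sum f_i}$, where $m_i$ is the mid-point and $f_i$ the frequency of the ith class.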

98
Mean for grouped data…

99
Example. Compute the mean age of 169 subjects from the
grouped data.

Class interval Mid-point (mi) Frequency (fi) mifi


10-19 14.5 4 58.0
20-29 24.5 66 1617.0
30-39 34.5 47 1621.5
40-49 44.5 36 1602.0
50-59 54.5 12 654.0
60-69 64.5 4 258.0

Total __ 169 5810.5

100
Solution
Class interval Mid-point (mi) Frequency (fi) mifi
10-19 14.5 4 58.0
20-29 24.5 66 1617.0
30-39 34.5 47 1621.5
40-49 44.5 36 1602.0
50-59 54.5 12 654.0
60-69 64.5 4 258.0

Total __ 169 5810.5

Mean = 5810.5/169 = 34.38 years
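The same grouped-data calculation can be sketched in Python. This is a minimal illustration using the class limits and frequencies from the table; the tuple layout is an assumption made for the example.

```python
# (lower class limit, upper class limit, frequency) for each age class.
age_classes = [
    (10, 19, 4),
    (20, 29, 66),
    (30, 39, 47),
    (40, 49, 36),
    (50, 59, 12),
    (60, 69, 4),
]

total_f = 0      # sum of the frequencies
total_mf = 0.0   # sum of mid-point times frequency
for lower, upper, freq in age_classes:
    midpoint = (lower + upper) / 2   # every value is assumed to sit at the mid-point
    total_f += freq
    total_mf += midpoint * freq

mean_age = total_mf / total_f
print(f"Sum of f = {total_f}, sum of m*f = {total_mf}, mean = {mean_age:.2f} years")
```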


101
Exercise

102
Solution
Class interval Mid-point (mi) Frequency (fi) mifi
5-14 9.5 5 47.5
15-24 19.5 10 195.0
25-34 29.5 120 3540.0
35-44 39.5 22 869.0
45-54 49.5 13 643.5
55-64 59.5 5 297.5
Total __ 175 5592.5

Mean = 5592.5/175 = 31.96


103
Properties of the Arithmetic Mean.
Advantage:
• uniqueness: For a given set of data there is one and
only one arithmetic mean
• Simplicity: Easy to calculate and understand
• It is based on all values given in the distribution.
• It is most amenable to algebraic treatment.
Limitation:
• It may be greatly affected by extreme values.
104
Example

105
WM…
Grade   Score   Weight (Cr Hrs)
A       4       3
B       3       2
C       2       3
D       1       1
F       0       2

$\text{WM} = \dfrac{12+6+6+1+0}{11} = \dfrac{25}{11} \approx 2.27 = \text{GPA}$
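As a quick check of the weighted mean, here is a minimal Python sketch using the scores and credit hours from the table (the variable names are illustrative):

```python
scores = [4, 3, 2, 1, 0]    # grade points for A, B, C, D, F
weights = [3, 2, 3, 1, 2]   # credit hours attached to each grade

# Weighted mean = sum(weight * score) / sum(weights)
weighted_sum = sum(w * s for w, s in zip(weights, scores))
weighted_mean = weighted_sum / sum(weights)

print(f"GPA (weighted mean) = {weighted_sum}/{sum(weights)} = {weighted_mean:.2f}")
```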

106
Home take reading assignment
• Geometric mean (GM)
• Harmonic Mean (HM)

107
2. Median
Ungrouped data:
• The median is the value which divides the data set into
two equal parts.
• If the number of values is odd, the median will be the
middle value when all values are arranged in order of
magnitude.
• When the number of observations is even, there is no
single middle value but two middle observations.
108
Suppose there are n observations in a sample.
If these observations are ordered from smallest to
largest, then the median is defined as follows:
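A standard statement of this definition, given here as an assumption because the formula itself does not appear in the slide text, writes $x_{(k)}$ for the kth ordered observation:

```latex
\text{Median} =
\begin{cases}
x_{\left(\frac{n+1}{2}\right)} & \text{if } n \text{ is odd} \\[6pt]
\dfrac{x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}}{2} & \text{if } n \text{ is even}
\end{cases}
```

For an even number of observations the median is therefore the average of the two middle values.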

109
Example 1

110
Example 2

111
Median…
Grouped data:
• The first step is to locate the class interval in which the median lies, using the following procedure.

• Find n/2 and identify the first class interval whose cumulative frequency is at least n/2.

112
Example
Class interval   Mid-point (mi)   Frequency (fi)   Cum. freq
10-19            14.5             4                4
20-29            24.5             66               70
30-39            34.5             47               117
40-49            44.5             36               153
50-59            54.5             12               165
60-69            64.5             4                169
Total                             169

n/2 = 84.5, which falls in the 3rd class interval (30-39).
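Once the median class is located, a single median value is commonly obtained by linear interpolation within that class. The slide text does not show its formula, so the following Python sketch assumes the widely used form: median = L + ((n/2 − F)/f) × W, where L is the lower boundary of the median class, F the cumulative frequency before it, f its frequency, and W the class width.

```python
# (lower class limit, upper class limit, frequency) for each age class.
age_classes = [
    (10, 19, 4),
    (20, 29, 66),
    (30, 39, 47),
    (40, 49, 36),
    (50, 59, 12),
    (60, 69, 4),
]

n = sum(freq for _, _, freq in age_classes)
half = n / 2

cum_before = 0        # cumulative frequency of the classes before the current one
median_class = None
median = None
for lower, upper, freq in age_classes:
    if cum_before + freq >= half:
        # First class whose cumulative frequency reaches n/2: the median class.
        lower_boundary = lower - 0.5    # assumes a unit of measurement of 1
        class_width = upper - lower + 1
        median = lower_boundary + (half - cum_before) / freq * class_width
        median_class = (lower, upper)
        break
    cum_before += freq

print(f"Median class: {median_class}, estimated median = {median:.2f} years")
```

With these data the interpolation gives 29.5 + (84.5 − 70)/47 × 10, or roughly 32.6 years.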

113
Properties of the median

Advantage:

• There is only one median for a given set of data (uniqueness)

• The median is easy to calculate

• It is insensitive to very large or very small values

114
3. Mode
• The mode is the most frequently occurring value among
all the observations in a set of data.

• It is not influenced by extreme values.

• It is possible to have more than one mode (bimodal/two-peak distribution) or no mode.

• The mode is not often used in biological or medical data.
115
[Figure: bar chart of a frequency distribution with the mode marked. Source: T. Ancelle, D. Coulombie]
Mode….

Ungrouped data:

• The mode is the value which occurs most frequently in a set of values.

• If all the values are different there is no mode; on the other hand, a set of values may have more than one mode.
117
Mode…
Example:
• Data are: 1, 2, 3, 4, 4, 4, 4, 5, 5, 6
• Mode is 4 =>“Unimodal”
Example:
• Data are: 1, 2, 2, 2, 3, 4, 5, 5, 5, 6, 6, 8
• There are two modes => 2 & 5
• This distribution is said to be “bi-modal”
Example:
• Data are: 2.62, 2.75, 2.76, 2.86, 3.05, 3.12
• No mode, since all the values are different

118
Mode…
Grouped data:
• To find the mode of grouped data, we usually refer to the modal class, where the modal class is the class interval with the highest frequency.
• If a single value for the mode of grouped data must be specified:

$\text{Mode} = L_m + \dfrac{\Delta_1}{\Delta_1 + \Delta_2} \times W$

Where:
$L_m$ = lower boundary of the modal class (the class with the highest frequency)
$W$ = class width
$\Delta_1$ = difference in frequency between the modal class and the class before it
$\Delta_2$ = difference in frequency between the modal class and the class after it

119
Mode…

The specific mode for the data is:

$L_m = 71$; $\Delta_1 = 25 - 15 = 10$; $\Delta_2 = 25 - 20 = 5$; $W = 10$

$\text{Mode} = L_m + \dfrac{\Delta_1}{\Delta_1 + \Delta_2} \times W = 71 + \dfrac{10 \times 10}{10 + 5} = 71 + 6.67 = 77.67$

120
Properties of mode

Advantage:
• It is not affected by extreme values
• It can be calculated for distributions with open-ended classes

Disadvantage:
• Often its value is not unique
• The main drawback of the mode is that often it does not exist

121
QUIZ I (5%): Calculate the mean, median and mode for the following data:

122
Measures of Dispersion

123
2. Measures of Dispersion/Variation

• Measures of dispersion or variability give us information about the spread of the scores, that is, how closely the rest of the data fall about the central value of the distribution.

• Moreover, two or more data sets may have the same mean and/or median but be quite different in their spread.

• Thus, to have a clear picture of the data, one needs a measure of dispersion or variability (scatteredness) among the observations in the set.
Properties of a good measure of variation
• Easy to understand

• Simple to compute

• Based on all observation

• Uniquely defined

• Capable of further algebraic treatment

125
Dispersion…
• The most common and well known measures of dispersion include the following:
– Range
– Inter-quartile range (IQR)
– Variance
– Standard deviation
– Coefficient of variation
126
Range
• The range is defined as the difference between the largest and smallest observations in the data.
• It is the crudest measure of dispersion.

Range = $x_{max} - x_{min}$

Example:
• Data values: 5, 9, 12, 16, 23, 34, 37, 42
• Range = 42 − 5 = 37
• A data set with a higher range exhibits more variability.
127
Properties of range
• It is the simplest crude measure

• It takes into account only two values, which causes it to be a poor measure of dispersion

• Very sensitive to extreme observations

128
2. Inter quartile range (IQR)
• Quartiles divide the data into four equal parts.

• The first quartile (q1), or 25th percentile, is located such that 25 percent of the data lie below q1 and 75 percent of the data lie above q1.
• The second quartile (q2), or 50th percentile or median, is located such that half (50 percent) of the data lie below q2 and the other half (50 percent) of the data lie above q2.
• The third quartile (q3), or 75th percentile, is located such that 75 percent of the data lie below q3 and 25 percent of the data lie above q3.

129
.... (IQR)

• The quartiles are the values which divide the ordered distribution into four parts.
Q1 = the [(n+1)/4]th ordered observation
Q2 = the [2(n+1)/4]th ordered observation
Q3 = the [3(n+1)/4]th ordered observation

• The inter-quartile range is the difference between the third and the first quartiles.

IQR = Q3 − Q1

130
IQR…
• To compute it, we first sort the data, in ascending
order.

• Then find the data values corresponding to the first and third quartiles.

Example: Given the following data set (age of patients):

18, 59, 24, 42, 21, 23, 24, 32

Find the IQR of these data.

131
Solution
Sorting: => 18 21 23 24 24 32 42 59
=> 1st quartile = the [(n+1)/4]th observation = (2.25)th observation
= 21 + (23 − 21) × 0.25 = 21.5

=> 3rd quartile = the [3(n+1)/4]th observation = (6.75)th observation
= 32 + (42 − 32) × 0.75 = 39.5

Hence, IQR = 39.5 − 21.5 = 18
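The positional method used in this solution can be sketched in Python as follows. The helper `value_at_position` is a hypothetical name introduced for illustration; it assumes the fractional position lies between 1 and n.

```python
def value_at_position(sorted_values, position):
    """Return the value at a 1-based, possibly fractional, position
    by linear interpolation between neighbouring ordered observations."""
    whole = int(position)          # integer part of the position
    fraction = position - whole    # fractional part of the position
    lower_value = sorted_values[whole - 1]
    if fraction == 0:
        return lower_value
    upper_value = sorted_values[whole]
    return lower_value + fraction * (upper_value - lower_value)


ages = sorted([18, 59, 24, 42, 21, 23, 24, 32])
n = len(ages)

q1 = value_at_position(ages, (n + 1) / 4)       # the 2.25th observation
q3 = value_at_position(ages, 3 * (n + 1) / 4)   # the 6.75th observation
iqr = q3 - q1

print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
```

Statistical software may use slightly different quartile conventions, so results from packages can differ a little from this hand method.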

132
Variance (σ², s²)
• Variance is used to measure the dispersion of values relative to the mean.

• When values are close to their mean the dispersion is less than when there is scattering over a wide range.
– Population variance = σ²
– Sample variance = s²

133
Variance…

Parameter:

• A numerical measure such as the mean, median, mode, range, variance, or standard deviation calculated for a population data set is called a parameter.

Statistic:

• A summary measure calculated for a sample data set is called a sample statistic.

134
• A sample variance is calculated for a sample of individual values ($X_1, X_2, \dots, X_n$) and uses the sample mean $\bar{X}$ rather than the population mean µ.
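In standard notation (assumed here, since the formulas themselves do not appear in the slide text), the population variance, the sample variance, and the sample standard deviation can be written as:

```latex
\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N},
\qquad
s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1},
\qquad
s = \sqrt{s^2}
```

Dividing by $n - 1$ rather than $n$ is the usual convention for the sample variance.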

135
Characteristics of Variance

• Its unit is the square of the unit of the original measurement values.

• It gives more weight to the extreme values.

136
Example
• Following are the survival times of n=11 patients after
heart transplant surgery.

• The survival time for the “ith” patient is represented as $X_i$ for i = 1, …, 11.

• Calculate the sample variance and SD.

137
Solution

138
Exercise
Class interval (fi)
10-19 4
20-29 66
30-39 47
40-49 36
50-59 12
60-69 4
Total 169

• Based on these data:


=> Compute the variance and SD of the age of 169 subjects
from the grouped data.

139
Properties of SD

• The SD has the advantage of being expressed in the same units of measurement as the mean.

• However, if the units of measurement of the variables in two data sets are not the same, then their variability can’t be compared by comparing the values of the SD.

140
SD Vs Standard Error (SE)

• SD describes the variability among individual values in a given data set.

• SE is used to describe the variability in the means of repeated samples taken from the same population.

• We interpret the SE of the mean to mean that another similarly conducted study may give a mean that lies within ± SE of the observed mean.

141
Coefficient of variation (CV)

• When two data sets have different units of measurement, or their means differ sufficiently in size, the CV should be used as a measure of dispersion.

• Data with a lower coefficient of variation are considered more consistent.

• CV is the ratio of the SD to the mean, multiplied by 100.

142
Example
• The following data are the marks obtained by two students taking the same course. Find out who was the more consistent student.

A: 58 59 60 65 66 52 75 31 46 48

B: 56 87 89 46 93 65 44 54 78 67

143
Solution
• Step one: calculate the mean for each using the
formula for ungrouped data.

• Mean(A)= 56

• Mean(B)= 68

144
Solution…
• Step two: Calculating the variance and standard
deviation for ungrouped data as follow:

• SD(A)= 11.73
• SD(B)= 17.13
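The remaining step, comparing the coefficients of variation, can be sketched in Python. The SD values quoted above are consistent with dividing by n (the population formula), so this sketch assumes `statistics.pstdev`; using the sample formula (`statistics.stdev`) would give slightly larger SDs but the same conclusion.

```python
import statistics

marks_a = [58, 59, 60, 65, 66, 52, 75, 31, 46, 48]
marks_b = [56, 87, 89, 46, 93, 65, 44, 54, 78, 67]


def coefficient_of_variation(values):
    """CV = (SD / mean) * 100, with the SD computed by dividing by n."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return sd / mean * 100


cv_a = coefficient_of_variation(marks_a)
cv_b = coefficient_of_variation(marks_b)

print(f"CV(A) = {cv_a:.1f}%, CV(B) = {cv_b:.1f}%")
more_consistent = "A" if cv_a < cv_b else "B"
print(f"More consistent student: {more_consistent}")
```

Student A has the smaller coefficient of variation (about 21% against about 25%), so A is the more consistent student.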

145
Thank you very much
