
Biostat 1st Part

Nice Docu

Uploaded by

Tadesse Fenta
Copyright
© © All Rights Reserved

Biostatistics

1
Contents

 Definition of terms
 Introduction to statistics
 Scale of measurement
 Methods of data collection
 Presenting and summarizing data

2
Data
 In general, the term data refers to factual material
used as a basis for discussion and decision making
 Is a raw material for statistics (in biostatistics it
refers to the material available for analysis and
interpretation)
 Numerical description of things
 Results of observation, counting, or measurement
 Can be primary or secondary data

3
Variable
 Any aspect of an individual that is measured and can take
a different value for different individuals or cases.
 Any characteristic (property) of the observational unit
with outcomes (data) that vary from one observation
to the other.
 Any attribute, phenomenon, or event that can have
different values.
 Example: blood pressure, age, sex…
 Value (in statistics): the magnitude of a
measurement, statistic, or parameter.

4
Difference between variable and data?

 A variable is “anything” that we are interested in
measuring, whereas data are the measured values/results
of that variable.
Example: age is a variable; 12 years old is a datum.

5
Introduction to statistics
Definition
 Statistics: A field of study concerned with the
collecting, organizing or presenting, analyzing and
interpreting numerical data for understanding a
phenomenon or making wise decisions.

 The term statistics is used to mean either


Statistical data or
Statistical methods

6
Statistics…
 Statistical data: refers to numerical descriptions of
a certain phenomenon or condition.

 These descriptions may take the form of counts or
measurements.

e.g. number of fever cases, number of positives
obtained, sex and age distribution of positive cases, etc.

7
Characteristics of statistical data

Not all figures (numerical descriptions) are statistical data.

I. They must be in aggregates – this means that statistics
are a 'number of facts.'
A single fact, even though numerically stated, cannot be
called statistics.
II. They must be affected to a marked extent by a
multiplicity of causes.

8
statistical data...

III. They must be enumerated or estimated according to
a reasonable standard of accuracy.
IV. They must have been collected in a systematic
manner for a predetermined purpose.
V. They must be comparable to each other.
Example: number of malaria cases per year/season/woreda/type of
place of residence.

9
Statistics
 Statistical methods refer to a body of methods that are
used for collecting, organizing, analyzing and interpreting
numerical data for understanding a phenomenon or making
wise decisions
 Statistical methods can be divided into two main branches:
1. Descriptive statistics:-
- is concerned with the organization, presentation, and
summarization of data.
2. Inferential statistics:-
- is a set of procedures used to draw conclusions about a large
body of data, called a population, based on a smaller set of
data, a sample, taken from a population.
10
Difference Between Descriptive and
Inferential Statistics

11
What is biostatistics?
 When the different statistical methods are applied in
biological, medical and public health data they
constitute the discipline of Biostatistics.

12
13
Scales of Measurement

14
Variable
 Any aspect of an individual that is measured and can take
a different value for different individuals or cases is called a
variable.
 Variables are mainly divided into:
Qualitative (or categorical),
Quantitative (or numerical variables).

15
Variable …
 Qualitative variable: a variable or characteristic which
cannot be measured in quantitative form but can only be
identified by name or categories,
Ex. place of birth, ethnic group, type of drug, sex

 Quantitative variable: a variable that can be measured
and expressed numerically.
 Quantitative variables can be discrete or continuous
16
Variables…
 Numerical discrete: the observations are integers that
correspond to a count of some sort.
Ex. Number of malaria cases

 Numerical continuous: each observation
theoretically falls somewhere along a continuum.
 One is not restricted, in principle, to particular values
such as the integers of the discrete scale.
Ex. Weight, height

17
Scale of measurement
 Measurement: is the assignment of numbers to objects
or events according to a set of rules
 There are four basic types of data (scales of
measurement);
1. Nominal scale
2. Ordinal scale
3. Interval scale
4. Ratio scale
18
1. Nominal data/scale
 Data that represent names or categories
 Involves the classification of subjects according to
specified categories or groups.
 There is no necessary relationship among the categories.
 The lowest measurement scale

Examples: Blood type (A, B, AB, O), marital status
(single, married, divorced, separated, widowed), sex (M,
F)
19
2. Ordinal scale of measurement

 Sometimes called ranked data.


 This scale differs from the nominal scale in that it
ranks the different categories specified in the scale
in terms of a graded order (greater than, less than,
equal to).
 The spaces or intervals between the categories are not
necessarily equal.

20
Ordinal scale …
Examples;
Patient satisfaction:-
Very satisfied
Satisfied
Unsatisfied
Very unsatisfied
Injuries (according to their level of severity): -
Fatal
Severe
Moderate
Minor
21
3. Interval scale of measurement
 In interval data the intervals between values are the
same
 Scales begin with an arbitrary ‘0’ point, only fixed by
convention

For example, in the Fahrenheit temperature scale, the
difference between 70 degrees and 71 degrees is the
same as the difference between 32 and 33 degrees.
But the scale is not a RATIO scale: 40 degrees
Fahrenheit is not twice as much as 20 degrees
Fahrenheit.
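The point about ratios can be checked numerically. The sketch below is an illustration only: it converts the two Fahrenheit readings to Kelvin (a temperature scale with a true zero) and compares the ratios. The function name `fahrenheit_to_kelvin` is a helper introduced here, not part of the original slides.

```python
def fahrenheit_to_kelvin(deg_f: float) -> float:
    """Convert degrees Fahrenheit to Kelvin, a scale with a true zero."""
    return (deg_f - 32.0) * 5.0 / 9.0 + 273.15

# Differences are meaningful on an interval scale.
diff_a = 71 - 70
diff_b = 33 - 32

# Ratios are not: the same two temperatures give a different ratio
# once the arbitrary zero point is replaced by a true zero.
ratio_fahrenheit = 40 / 20
ratio_kelvin = fahrenheit_to_kelvin(40) / fahrenheit_to_kelvin(20)

print("equal differences:", diff_a == diff_b)
print("ratio on the Fahrenheit scale:", ratio_fahrenheit)
print("ratio on the Kelvin scale:", round(ratio_kelvin, 3))
```

The Fahrenheit ratio is 2 only because of where the zero happens to sit; on the Kelvin scale the two temperatures differ by only a few percent.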
22
4. Ratio data/scale

 The data values in ratio data do have meaningful
ratios.
Example: Age in years, weight in kilograms, height in
meters; someone who is 40 is twice as old as someone
who is 20.
 Both interval and ratio data involve measurement

23
24
[Figure: Summary of variable and measurement scale, showing the nominal, ordinal, interval and ratio scales arranged by degree of precision in measuring]

25
Class exercise
Identify the type of data (nominal, ordinal, interval and ratio) represented by
each of the following. Confirm your answers by giving your own examples.

1. Color preference
2. Rh factor
3. pH (power of hydrogen)
4. IQ (Intelligence Quotient)
5. Job satisfaction index (1-5)
6. Serum glucose level (mg/dl)
7. Number of cases of each reportable disease reported by a health worker
8. The average weight gain of 4 pregnant women
9. Temperature (Celsius)

26
Methods of data collection

27
Data collection methods
 Data collection methods: are techniques allowing us to
systematically collect information about our objects of
study (people, objects, phenomena) and about the settings
in which they occur.
 Data collection methods can generally be classified into:
1. Primary data collection methods
2. Secondary data collection methods (data mining)
28
1. Primary data collection
 In primary data collection, you collect the data yourself
using methods such as interviews and questionnaires

A) OBSERVATION
 Is a technique that involves systematically selecting,
watching and recording behavior and characteristics of
living beings, objects or phenomena
 It ranges from simple visual observation to using
machines

29
Observation

Can be two types;


 Participant observation: The observer takes part in the
situation he or she observes. (For example, a doctor
hospitalized with a broken hip, who now observes
hospital procedures ‘from within’.)
 Non-participant observation: The observer watches
the situation, openly or concealed, but does not
participate.
30
Observation

Advantage:
 It gives relatively more accurate data on behaviors and
activities.

Disadvantage:
 The investigator's own bias, desires, etc. can affect data
quality.
 It needs more resources and skilled manpower

31
Primary data collection

B) INTERVIEW AND QUESTIONNAIRE METHOD:


 Is a way of gathering information through
communication between the interviewer and the
interviewees.

B.1) IN-DEPTH INTERVIEW:
 It is an interviewing technique that helps to probe deeper
into an individual's attitudes and concerns when
individual responses are needed.
32
In-depth interview …
Advantage:
 It is a good method for discussing sensitive or emotion-laden topics
 It is useful when the subjects are hard to reach (audiences are sparsely
distributed)
Disadvantage:
 Time consuming.
 Broad generalization is impossible.

33
Interview…
B.2) FACE TO FACE INTERVIEW (PERSONAL
INTERVIEW)
 It differs from the in-depth interview in that it
uses a structured questionnaire whose questions are asked in order.
Advantages:
 Serious approach by respondent resulting in accurate
information.
 Good response rate.
 Completed and immediate.
 Interviewer in control & can give help for a problem.
 Can investigate motives and feeling
34
Face to face interview
Disadvantages:
 Need to set up interviews.
 Time consuming.
 Geographic limitations.
 Can be expensive.
 Respondent bias – tendency to please or impress, create
false personal image, or end interview quickly.

35
Primary data…
B.3) FOCUS GROUP DISCUSSION (FGD):
 A focus group is typically composed of eight to twelve
participants who are unfamiliar with each other and
is conducted by a trained interviewer
 The group should be homogeneous

36
FGD…
Advantage:
 Useful when details of thoughts/attitudes in a certain target
population are wanted.
 Used in applications of social and behavioral sciences
 Details of practices, beliefs, etc. on the topic are obtained.

Disadvantage:
 Requires a skilled facilitator to raise topics for ideas &
probe.
37
Primary data
B. 4) TELEPHONE INTERVIEW
 This is an alternative to the personal
(face-to-face) interview.
Advantages:
 Quick.
 Can cover reasonably large numbers of people or
organizations.
 Wide geographic coverage.
 High response rate – keep going till the required number.
 Spontaneous response.

38
Telephone interview
Disadvantages:
 Often connected with selling.
 Questionnaire required.
 Not everyone has a telephone.
 Repeat calls are inevitable – average 2.5 calls to get someone.
 Time is wasted.
 Straightforward questions are required.
 Respondent has little time to think.

39
C. Administering written questionnaires (self-administered
questionnaire)

 Written questions are presented that are to be answered by the
respondents in written form
 A written questionnaire can be administered in different ways, such
as by:
 Sending questionnaires by mail
 Gathering all or part of the respondents in one place at one time,
giving oral or written instructions, and letting them fill out the
questionnaires
 Hand-delivering questionnaires to respondents and collecting
them later
40
Primary data
C.1) MAIL/POSTAL/INTERNET INTERVIEW
 It is an interview conducted through postal services.
 Questionnaires are sent by post to the informants together
with a polite covering letter explaining the details, aims
and objectives of collecting the information and
requesting the respondent to cooperate by returning the
questionnaire duly filled in

41
Mail interview
Advantages:
 Can cover a large number of people or
organizations.
 Wide geographic coverage.
 Relatively cheap
Disadvantages:
 Design problems: questions have to be relatively
simple and carefully designed.
 Historically low response rate
 Time delay whilst waiting for responses to be returned.

42
Other types of primary data collection
A) Self reported checklist
B) Expert judgments
C) Citizen report/score/cards
D) Delphi technique
E) Mapping and scaling
F) Case-studies
G) Diaries
H) Critical incidents
I) Portfolios
J) Multi method (combination)
43
2. Secondary data collection
 It is also called data mining.
 It uses records and available information already
collected by others.
 It uses information such as clinical and other
personal records, death certificates, published mortality
statistics, census publications, etc.

44
Secondary data collection
Advantage:
 Data collection is inexpensive.
 Less time consuming
Disadvantages:
 It is sometimes difficult to gain access to the records or reports
required, and the data may not always be complete and precise
enough, or may be too disorganized.
 There could be differences in objectives between the primary
author of the data and the user.

45
Common problems might include
 Language barriers
 Lack of adequate time
 Expense
 Inadequately trained and experienced staff
 Invasion of privacy
 Suspicion
 Bias (spatial, project, person, season, professional)
 Cultural norms

46
Choosing a Method of Data Collection

 The data collection method selected should provide
information that is:
Relevant,
Timely,
Accurate and
Usable

47
Questionnaire

48
Types of Questions
 Open-ended questions: are questions that permit
free responses that should be recorded in the
respondent’s own words.
Ex: “What would you do if you noticed that your
daughter (school girl) had a relationship with a
teacher?”
 Closed questions: offer a list of possible options or
answers from which the respondents must choose.
Ex: “Have you ever gone to the local village HW?”
1. Yes 2. No

49
Question forms cont…

 Structured questionnaires involve the use of fixed
questions, batteries of questions and/or scales which
are presented to respondents in the same way, with no
variation in question wording and with closed or open
questions (pre-coded response choices)

 Some structured questionnaires will also include
open-ended questions, to enable respondents to reply
in their own words

50
Question forms cont…
 Semi-structured interviews include fixed questions
but with no, or few, response codes, and are used
flexibly, often in no fixed order, to enable
respondents to raise other relevant issues not covered
by the interview schedule

51
Question forms cont…
 Unstructured interviews are comprised of a checklist
of topics, rather than fixed questions, and there are no
pre-codes

 The more structured approach is only suitable for
topics where sufficient knowledge exists for largely
pre-coded response formats to be developed

52
Steps in Designing a Questionnaire
Step1: CONTENT
 Take your objectives and variables as your starting
point
Step 2: FORMULATING QUESTIONS
 Formulate one or more questions that will provide the
information needed for each variable
 Take care that questions are specific and precise
 Check whether each question measures one thing at a
time
 Avoid leading questions
 Use simple everyday language.
53
Steps in Designing a Questionnaire…
Step 3: SEQUENCING OF QUESTIONS
 Design to be “consumer friendly”
 Sequence of questions must be logical
 At the beginning put questions concerning
“background variables”
 Start with interesting but non-controversial question
 Pose more sensitive questions as late as possible

54
Steps in Designing a Questionnaire…
Step 4: FORMATTING THE QUESTIONNAIRE
When you finalize your questionnaire, be sure that:
 Each questionnaire has a heading and space to insert
the number or response
 Layout is such that questions belonging together
appear together visually

55
Steps in Designing a Questionnaire…
Step 5: TRANSLATION
 If interviews will be conducted in one or more local
languages, the questionnaire has to be translated
 It is then translated back into the original language to check
for consistency

56
Questionnaire layout
 The questionnaire should be visually easy to read and
comprehend
 Lower case letters should be used for text
 Questions should be understandable; avoid vague
questions
 Begin with easy questions and put sensitive
questions at the end
 Questions should be in logical order

57
The covering letter
The covering letter should:
 be written on the organization’s headed paper,
include the name and address of the sample
member and the identification (serial) number, and
address the recipient by name

58
The covering letter cont…
The letter should:
 explain how the person’s name was obtained/
selected
 outline the study aims, benefits and risks
 guarantee confidentiality

59
Piloting

 The questionnaire should be developed and pretested
 Conduct the pretest on a sample of about 5-10% of the
sample size, depending on the complexity of the
items

60
Methods of data organization and
presentation

61
Data organizing and presenting

 The data collected in a survey are called raw data.
 In most cases, useful information is not immediately
evident from the mass of unsorted data
 So collected data need to be organized
 For this purpose, different techniques
of data organization and presentation, such as the ordered
array, frequency distribution, tables and diagrams, are
used

62
Methods of data organization and presentation

Organization
 Ordered array
 Frequency distribution
 Simple frequency distribution
 Categorical frequency distribution
 Grouped frequency distribution

Presentation
 Statistical tables
 Simple/one-way tables
 Two-way tables
 Higher-order tables
 Graphical presentation
 Bar chart
 Pie chart
 Histogram
 Frequency polygon
 Line graph

63
Ordered array
 Is a serial arrangement of numerical data in an
ascending or descending order
 Is appropriate when the data set is small
Ex. Table: Ordered Array of Ages of Subjects

64
Frequency Distributions
Frequency: number of occurrences of statistical result
 The number of times a particular result occurs in a
statistical survey (absolute frequency), or
 the ratio of that number to the total results obtained in
the survey (relative frequency).
Frequency Distribution: list or table to summarize data
 Consists of set of classes or categories along with their
respective numerical counts/percentage.
 A table which contains the values of a variable and the
corresponding frequencies with which each value occurs
(Frequencies with which data falls within each range )
65
Frequency Distribution…
 The actual summarization and organization of data starts
from frequency distribution.
 The distribution condenses the raw data into a more useful
form and allows for a quick visual interpretation of the data.
 Frequency distribution is used to display categorical data or
grouped quantitative data.
 A relative frequency distribution: shows the proportion of
counts that fall into each class or category (the value for
any category is obtained by dividing the number of
observations in that category by the total number of
observations)
 This can be reported as percentage
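As a rough illustration of the definitions above, the sketch below builds a frequency table with absolute frequency, relative frequency and percentage for a categorical variable. The blood-type observations and the helper name `frequency_table` are hypothetical, introduced only for this example.

```python
from collections import Counter

# Hypothetical raw data: blood types of 10 subjects.
observations = ["A", "B", "O", "O", "AB", "A", "O", "B", "O", "A"]

def frequency_table(values):
    """Return {category: (frequency, relative frequency, percentage)}."""
    counts = Counter(values)
    total = sum(counts.values())
    return {cat: (n, n / total, 100 * n / total) for cat, n in counts.items()}

table = frequency_table(observations)
for category, (freq, rel, pct) in sorted(table.items()):
    print(f"{category:<3} {freq:>3} {rel:>6.2f} {pct:>6.1f}%")
```

The relative frequencies always sum to 1 (the percentages to 100), which is a quick check on any hand-built table.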
66
Frequency distribution for nominal data

67
Example 2 –Ordinal Data

68
Frequency Distribution for
Quantitative/numerical
 A frequency distribution can also show the number
of observations at different values or within certain
ranges
 For a discrete variable, the frequencies may be
tabulated either for each value of the variable or for
groups of values
 With continuous variables, groups (class intervals)
have to be formed
 For both discrete and continuous data, the values are
grouped into distinct non-overlapping intervals,
usually of equal width
69
Frequency Distribution for
Quantitative/numerical
 Example 2: Age for 25 individuals

70
Grouped frequency distribution
 Can be condensed as follows using Grouped
frequency distribution

71
Example 2- grouped quantitative data
 A study was conducted to assess the prevalence of
nutritional status among women of child bearing age
in urban Ethiopia-evidence from EDHS 2011.

72
Grouped frequency distribution
Steps to construct grouped frequency distribution
1. Deciding on the number of classes
The number of classes is usually 6 – 20; 15 is suggested
 A guide to determining the number of classes
(k) is Sturges' formula
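The formula itself did not survive on this slide. Sturges' rule is commonly stated as k = 1 + 3.322 log10(n), where n is the number of observations; the sketch below assumes that form. The class-width rule (range divided by k, rounded up) and the extreme values 5 and 44 are assumptions made for illustration.

```python
import math

def sturges_classes(n: int) -> int:
    """Suggested number of classes k for n observations (Sturges' rule)."""
    return round(1 + 3.322 * math.log10(n))

def class_width(minimum: float, maximum: float, k: int) -> int:
    """Class width: the range divided by k, rounded up to a whole unit."""
    return math.ceil((maximum - minimum) / k)

n_students = 40                    # e.g. a sample of 40 observations
k = sturges_classes(n_students)    # 1 + 3.322 * log10(40) is about 6.3
width = class_width(5, 44, k)      # 5 and 44 are hypothetical extremes
print(k, width)
```

The result is a guide only; the analyst can still adjust k so that the class limits fall on convenient values.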

73
Steps to construct grouped frequency..
2. Determination of class limits
 Classes should be mutually exclusive

 Classes should not overlap, i.e. successive classes
have no values in common.
 Class limits should be definite and clearly stated, open-
end classes should be avoided since they make it difficult
to calculate certain further descriptions / calculation
e.g. less than 10, greater than 65
74
Steps to construct grouped frequency…
 True limits (or class boundaries) are limits determined
mathematically by subtracting 0.5 from the lower
limit and adding 0.5 to the upper limit, so that the intervals of a
continuous variable are continuous in both directions and no gap
exists between classes.
 The true limits are what the tabulated limits would correspond
to if one could measure exactly.
 Mid-point or class mark (Xc) of an interval is the value of
the interval which lies mid-way between the lower true limit
(LTL) and the upper true limit (UTL) of a class.
 It is calculated as:
\[ X_c = \frac{\text{Upper class limit} + \text{Lower class limit}}{2} \]
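A minimal sketch of these two calculations follows, assuming data recorded to the nearest whole unit (so the adjustment is 0.5) and a hypothetical class 10 – 14. The helper names `true_limits` and `class_mark` are introduced here for illustration.

```python
def true_limits(lower: float, upper: float, unit: float = 1.0):
    """Class boundaries: extend each tabulated limit by half a measurement unit."""
    half = unit / 2
    return lower - half, upper + half

def class_mark(lower: float, upper: float) -> float:
    """Mid-point of a class: the average of its lower and upper limits."""
    return (lower + upper) / 2

ltl, utl = true_limits(10, 14)   # hypothetical class 10 - 14
print(ltl, utl)                  # lower and upper true limits
print(class_mark(10, 14))        # mid-point from the tabulated limits
print(class_mark(ltl, utl))      # same mid-point from the true limits
```

Because the same amount is subtracted from one limit and added to the other, the mid-point is identical whether it is computed from the tabulated limits or from the true limits.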
75
Grouped frequency distribution…
Example: Construct a grouped frequency distribution of the
following data on the amount of time (in hours) that 40
college students devoted to leisure activities during a
typical school week:

76
Grouped frequency distribution…

77
Grouped frequency distribution…
 Cumulative frequency: the number of observations with a value
less than or equal to the upper limit of a class, obtained by
adding the frequency of that class to those of all preceding classes
 Cumulative relative frequency: the proportion of the
total number of observations that have a value less than or
equal to the upper limit of the interval.
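The two definitions can be sketched as running totals. The class frequencies below are hypothetical and are assumed to be listed in class order, lowest class first.

```python
from itertools import accumulate

# Hypothetical class frequencies, in class order (lowest class first).
frequencies = [2, 5, 8, 4, 1]

# Running total of the frequencies up to and including each class.
cumulative = list(accumulate(frequencies))

# Divide by the grand total to get cumulative relative frequencies.
total = cumulative[-1]
cumulative_relative = [c / total for c in cumulative]

for f, cf, crf in zip(frequencies, cumulative, cumulative_relative):
    print(f"{f:>3} {cf:>3} {crf:>5.2f}")
```

The last cumulative frequency equals the total number of observations, and the last cumulative relative frequency equals 1.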

78
Mid point & True class limits

79
Data presentation
Tabular

Diagrammatic

80
Data presentation
1. Statistical Tables
 A statistical table is an orderly and systematic
presentation of numerical data in rows and columns.
 Rows (stubs) are horizontal and columns (captions)
are vertical arrangements.
 The use of tables for organizing data involves
grouping the data into mutually exclusive categories
of the variables and counting the number of
occurrences (frequency) to each category.

81
Construction of tables
 No hard and fast rules but general principles:
1. Tables should be as simple as possible.
2. Tables should be self-explanatory. For that purpose
 Title should be clear & to the point (a good title answers: what?
when? where? how classified?) & should be placed above the
table.
 Each row and column should be labelled.
 Numerical entries of zero should be explicitly written rather
than indicated by a dash.
 Dashes are reserved for missing or unobserved data.
 Totals should be shown either in the top row and the first
column or in the last row and the last column.
3. If data are not original, source should be given in a footnote
82
A. Simple or one-way table: The simple frequency table is used when
the individual observations involve only a single variable
Table 1: Overall immunization status of children in Adami Tulu
Woreda, Feb. 1995

Immunization status     Number   Percent
Not immunized               75      35.7
Partially immunized         57      27.1
Fully immunized             78      37.2
Total                      210     100.0

Source: Fikru T et al. EPI Coverage in Adami Tulu. Eth J Health Dev
1997;11(2): 109-113
83
B. Two-way table: This table shows two characteristics & is formed
when either the row or the column is divided into two or more parts.
Table 2: TT immunization by marital status of the women of childbearing age,
Assendabo town, Jimma Zone, 1996

                  Immunization status
Marital status    Immunized       Not immunized    Total
                  No.     %       No.     %        No.
Single             58    24.7     177    75.3      235
Married           156    34.7     294    65.3      450
Divorced           10    35.7      18    64.3       28
Widowed             7    50.0       7    50.0       14
Total             231    31.8     496    68.2      727

Source: Mikael A. et al TT immunization coverage among women of
child bearing age in Assendabo town; JIHS, 1996, 7(1): 13-20
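The percentages in Table 2 are row percentages: each count is divided by its row total. The sketch below recomputes them from the counts shown in the table; the dictionary layout and the helper name `row_percentages` are choices made here for illustration.

```python
# Counts from Table 2: marital status -> (immunized, not immunized).
counts = {
    "Single": (58, 177),
    "Married": (156, 294),
    "Divorced": (10, 18),
    "Widowed": (7, 7),
}

def row_percentages(immunized: int, not_immunized: int):
    """Return (row total, % immunized, % not immunized) for one row."""
    total = immunized + not_immunized
    return total, 100 * immunized / total, 100 * not_immunized / total

for status, (yes, no) in counts.items():
    total, pct_yes, pct_no = row_percentages(yes, no)
    print(f"{status:<9} {yes:>4} {pct_yes:>5.1f} {no:>4} {pct_no:>5.1f} {total:>4}")

# Column totals give the bottom row of the table.
all_yes = sum(yes for yes, _ in counts.values())
all_no = sum(no for _, no in counts.values())
overall = row_percentages(all_yes, all_no)
print(f"{'Total':<9} {all_yes:>4} {overall[1]:>5.1f} {all_no:>4} {overall[2]:>5.1f} {overall[0]:>4}")
```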

84
C. Higher order table: Used when it is desired to represent three or
more characteristics in a single table.

Table 3: Degree of job satisfaction among doctors and nurses in
rural and urban areas of SNNPR.

Characteristics       Residence
Profession  Sex       Urban         Rural          Total
Doctors     Male       8 (10%)      35 (21%)       43 (17.7%)
            Female     3 (3%)       16 (10.0%)     18 (7.4%)
Nurses      Male      46 (58.0%)    36 (22.0%)     82 (33.7%)
            Female    23 (29.0%)    77 (47.0%)    100 (41.2%)
Total                 79 (100.0%)  164 (100.0%)   243 (100.0%)

85
Data presentation ...
2. Diagrammatic presentation of Data
 It is pictorial or graphical presentation of data

Importance of Diagrammatic presentation


 They have greater attraction than mere figures.

 They give quick overall impression of the data.

 They facilitate comparison

 Used to understand patterns and trends

 They have greater memorizing value than mere figures.


86
Construction of graphs

 The choice of the particular form among the different
possibilities will depend on personal choices and/or
the type of the data.
 Bar charts and pie charts are commonly used for
qualitative or quantitative discrete data.
 Histograms and frequency polygons are used for
quantitative continuous data.

87
General rules for construction of graphs
 Graph should be self-explanatory & as simple as
possible.
 Titles are usually placed below the graph and should
again answer the questions: What? Where? When? How classified?
 Legends or keys should be used to differentiate variables
if more than one is shown.
 The axis labels should be placed to read from the left
side and from the bottom.
 The units into which the scale is divided should be
clearly indicated.
 The numerical scale representing frequency must start at
zero or a break in the line should be shown.
88
Types of graphical data presentation methods

I. Bar chart
II. Pie chart
III. Histograms
IV. Frequency Polygon
V. Ogive curve or Cumulative frequency curve
VI. Line graph

89
I. Bar Charts
 Used to represent & compare frequency distribution of
discrete variables & categorical series
Types of bar chart:
A. Simple bar chart,

B. Sub-divided bar chart


 Stacked bar chart

 100% Component bar chart

C. Multiple bar chart


90
Methods of Constructing Bar chart
 All bars must be equal in width

 The bars are not joined together (leave equal space b/n bars)

 All bars should rest on the same line called the base

 Label both axes clearly

 Categories are listed on the horizontal (X-axis)

 Frequencies/relative frequencies are represented on y axis

 The height of the bar is proportional to the frequency or
relative frequency of observations in that category
91
A. Simple Bar Chart

92
Simple Bar Chart…
[Bar chart. Vertical axis: Frequency of referred cases; horizontal axis: Source of referral]

Figure 1: Distribution of patients by source of referral


93
Simple bar chart example…

 Figure x: Immunization status of children in Adami Tulu
Woreda, 1996
94
B. Sub-divided Bar chart
 If there are different quantities forming the subdivisions of the
totals, simple bars may be subdivided in the ratio of the various
subdivisions to exhibit the relationship of the part to whole.
 The order in which the components are shown in a “bar” is
followed in all bars used in the diagram
Examples :
 Stacked/Actual Component Bar chart: The overall height of the
bars & the individual component lengths represent actual figures.
 100% Component Bar chart: The individual component
lengths represent the percentage each component forms of the overall
total. A series of such bars will all have the same total height (100%)

95
Example: Use the data given below to construct a sub-divided bar chart

96
Stacked bar chart

Figure x: Average amount (g) bread consumed per person per
week by year in London.

97
100% Component Bar chart

Figure x: Average amount (g) bread consumed per person per
week by year in London.
98
C. Multiple bar chart
 Bar charts can be used to represent the relationship
among three or more variables
 In this type of chart the component figures are shown
as separate bars adjoining each other.
 The height of each bar represents the actual value of
the component figure.

99
Multiple bar chart…

 Figure x: Distribution of OI, initiation of ART and HBsAg status in the
four WHO stages of HIV/AIDS among the study participants attending
ART clinic at HUTRH, Hawassa, south Ethiopia, 2014
100
Multiple Bar chart

101
It is also possible to plot bars horizontally

102
II. Pie-chart
 Shows the relative frequency for each category by dividing
a circle into sectors, the angles of which are proportional to
the relative frequency
 Used for a single categorical variable & percentage
distribution
Steps to construct Pie chart
 Construct a frequency table
 Change the frequencies into percentages
 Change the percentages into degrees, where: degrees =
(percentage / 100) × 360°
 Draw a circle and divide it accordingly. E.g.
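The steps above can be sketched as follows, using the immunization counts from Table 1 (75, 57 and 78 children). The helper name `pie_angles` is introduced here for illustration; it goes from frequencies to degrees in one step, which is equivalent to converting to percentages first.

```python
def pie_angles(frequencies: dict) -> dict:
    """Sector angle in degrees for each category: 360 * frequency / total."""
    total = sum(frequencies.values())
    return {cat: 360 * n / total for cat, n in frequencies.items()}

# Counts from Table 1 (immunization status of children).
status = {"Not immunized": 75, "Partially immunized": 57, "Fully immunized": 78}
angles = pie_angles(status)

for category, angle in angles.items():
    print(f"{category:<20} {angle:6.1f} degrees")
print("sum of angles:", round(sum(angles.values()), 1))
```

The sector angles must add up to 360 degrees, which is a convenient check before drawing.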

103
Pie chart …

104
Pie chart …

Figure x: Distribution of Hepatitis B related knowledge status of study
participants attending ART clinic at HTRH, Hawassa, South Ethiopia,
2014

105
III. Histograms

 A histogram is the graph of the frequency distribution of
continuous measurement variables.
 It is constructed on the basis of the following principles:
 The horizontal axis is a continuous scale running from
one extreme end of the distribution to the other.
 It should be labelled with the name of the variable and
the units of measurement.

106
Histograms…
 For each class in the distribution a vertical rectangle is
drawn with
 its base on the horizontal axis, extending from one class
boundary of the class to the other class boundary;
 no gap between adjacent rectangles;
 a base whose width is determined by the width
of the class interval.
 If the distribution has unequal class intervals, it is necessary to
adjust for the varying widths of the class intervals.
 The area of each bar/rectangle is proportional to the frequency of
observations in the interval
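One common way to make the adjustment for unequal class intervals is to plot the frequency density (frequency divided by class width) as the bar height, so that the area of each bar still equals its frequency. The sketch below assumes that approach; the class boundaries and frequencies are hypothetical.

```python
# Hypothetical classes: (lower boundary, upper boundary, frequency).
classes = [(0, 10, 20), (10, 20, 30), (20, 40, 30)]

def bar_heights(class_list):
    """Height of each histogram bar as frequency density (frequency / width)."""
    return [freq / (upper - lower) for lower, upper, freq in class_list]

heights = bar_heights(classes)
# Area = height * width, which recovers the class frequency.
areas = [h * (upper - lower) for h, (lower, upper, _) in zip(heights, classes)]

for (lower, upper, freq), h in zip(classes, heights):
    print(f"{lower:>3}-{upper:<3} frequency={freq:<3} height={h}")
```

The last class has the same frequency as the second one but is twice as wide, so its bar is drawn half as tall.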

107
Histograms…

108
Histograms…

109
IV. FREQUENCY POLYGON

 If we join the midpoints of the tops of the adjacent
rectangles of the histogram with line segments, a
frequency polygon is obtained.
 When the polygon is continued to the X-axis just
outside the range of the lengths, the total area under the
polygon will be equal to the total area under the
histogram.

110
Frequency polygon …

111
Frequency polygon …
 Note that it is not essential to draw a histogram in order to
obtain a frequency polygon.
 It can be drawn without the histogram rectangles as follows:
 The scale should be marked in the numerical values of
the mid- points of intervals.
 Erect ordinates on the midpoints of the interval - the
length or altitude of an ordinate representing the
frequency of the class on whose mid-point it is erected.
 Join the tops of the ordinates and extend the connecting
lines to the scale of sizes.

112
Example: Consider the above data on time spent on leisure
activities.
The histogram under the frequency polygon can be omitted.
[Figure: frequency polygon. Vertical axis: Number of students (0 to 30); horizontal axis: Mid points of class intervals (7, 12, 17, 22, 27, 32, 37, 42)]

113
V. Ogive curve or Cumulative frequency curve

 Sometimes it may be necessary to know the number of
items whose values are more or less than a certain
amount
Example: to know the patients whose weight is <50kg or >60kg
 To get this, the simple frequency distribution should be
changed into a cumulative distribution
 The Ogive curve then turns the cumulative frequency
distribution into a graph
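A minimal sketch of the points needed to draw "less than" and "more than" ogives is given below. The weight classes (upper class boundaries in kg) and frequencies are hypothetical; plotting is left out so that only the standard library is needed.

```python
from itertools import accumulate

# Hypothetical weight classes: upper class boundaries (kg) and frequencies.
upper_boundaries = [49.5, 59.5, 69.5, 79.5]
frequencies = [4, 10, 5, 1]

# "Less than" cumulative frequency: observations at or below each boundary.
less_than = list(accumulate(frequencies))
total = less_than[-1]

# "More than" cumulative frequency: observations above each boundary.
more_than = [total - c for c in less_than]

# Points (x, y) to plot for the "less than" ogive.
ogive_points = list(zip(upper_boundaries, less_than))
print(ogive_points)
print(more_than)
```

Reading the first point, 4 of the 20 hypothetical patients weigh less than 49.5 kg, and the second entry of `more_than` shows that 6 weigh more than 59.5 kg.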
114
Ogive curve…

115
Ogive curve…

116
The less than and more than frequency method

117
118
VI. Line graph
 Useful to study some variables according to passage of time.
 The time, in weeks, months or years is marked along the
horizontal axis; and the value of the quantity that is being
studied is marked on the vertical axis.
 The distance of each plotted point above the base-line
indicates its numerical value.
 The line graph is suitable for depicting a consecutive trend of
a series over a long period.

119
Example: Malaria parasite rates as obtained from malaria
seasonal blood survey results, Ethiopia (1967-79 E.C.)
[Figure: Malaria parasite prevalence rates in Ethiopia, 1967-1979 Eth. C. Line graph. Vertical axis: rate (%), 0.0 to 5.5. Horizontal axis: year, 1967 to 1979.]

120
Summarizing Data

121
Summarizing Data
At the end of this topic, the student will be able to:
 Identify the different methods of data summarization
 Compute appropriate summary values for a set of
data
 Appreciate the properties and limitations of summary
values

122
Data summarizing methods
1. Measures of central tendency (location):
 The tendency of statistical data to get concentrated at
certain values is called the “central tendency”.
 The various methods of determining the actual value
at which the data tend to concentrate are called
measures of central tendency.
 The mean, median, mode and weighted mean are the
statistics commonly used to measure central tendency.
123
Data summarizing methods ...
2. Measures of dispersion
 We also need to know how “spread out” the numbers
are about the center.
 These measures summarize the deviation of the data
values from the center.
Types:
Range
Interquartile range
Variance
Standard deviation and
Coefficient of variation.
124
1. THE ARITHMETIC MEAN / SIMPLE MEAN ($\bar{X}$)
 It is the sum of the values of all observations divided by
the number of observations.
 It is written in statistical terms as:
Sample mean: $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$, i = 1, 2, …, n
Population mean: $\mu = \frac{1}{N}\sum_{i=1}^{N} X_i$
125
Arithmetic Mean…
Example: Suppose the following data consists of birth
weights (in grams) of all live born infants born at
HUCSH, in the last 2-weeks period.

3265 3323 2581 2759 3260 3649 2841


3248 3245 3200 3609 3314 3484 3031
2838 3101 4146 2069 3541 2834

126
Arithmetic Mean…
Calculate the mean for the above data
$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$, i = 1, 2, …, n
= (3265 + 3323 + … + 2834)/20 = 63,338/20 = 3166.9 g

The average birth weight (BWT) of infants born in HUCSH in
the last 2 weeks is about 3.2 kg.
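As a quick check of the arithmetic, the mean of the 20 birth weights listed in the example can be computed directly. This is a minimal sketch; the variable names are illustrative.

```python
# Birth weights (grams) as listed in the example above.
birth_weights_g = [
    3265, 3323, 2581, 2759, 3260, 3649, 2841,
    3248, 3245, 3200, 3609, 3314, 3484, 3031,
    2838, 3101, 4146, 2069, 3541, 2834,
]

total = sum(birth_weights_g)   # sum of all observations
n = len(birth_weights_g)       # number of observations
mean_g = total / n             # arithmetic mean

print(f"sum = {total}, n = {n}, mean = {mean_g:.1f} g")
```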

127
Arithmetic Mean…
 Exercise: Assume there were 15 patients scheduled
for a particular OR procedure on Monday. The
following data shows the time (in minutes) each
patient will stay on the OR table for the procedure.
Calculate the average time required to undertake the
procedure.

30, 26, 26, 36, 48, 50, 41, 31, 29, 27, 33, 35, 52, 28, 37

128
Arithmetic Mean…
Answer for the exercise
$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$, where X = time in minutes and i = 1, 2, …, n
= (30 + 26 + … + 37)/15 = 529/15 ≈ 35 minutes

On average, 35 minutes are required to perform the
procedure for a single patient.

129
Mean for Grouped data

130
Example: Calculate the mean for the following data

 60+187+264+189+96+74=870/40=21.75 Hr
131
Mean for grouped data…
 Exercise: Assume different scheduled OR procedures
were done for 122 patients in 2007. The following
data show the number of days each patient waited before
the procedure. Calculate the average waiting time for an
OR procedure in this hospital.

Interval of waiting days   Frequency (f)
2-6                        27
7-11                       34
12-16                      36
17-21                      25
Total                      122

132
Mean for grouped data…
Answer for the exercise

Interval of waiting days   Frequency (f)   Midpoint (m)            f × m
2-6                        27              (1.5 + 6.5)/2 = 4       27 × 4 = 108
7-11                       34              (6.5 + 11.5)/2 = 9      34 × 9 = 306
12-16                      36              (11.5 + 16.5)/2 = 14    36 × 14 = 504
17-21                      25              (16.5 + 21.5)/2 = 19    25 × 19 = 475
Total                      122                                     1393

Mean = Σ(f × m)/Σf = 1393/122 = 11.4 days

On average, a patient waited about 11 days for an OR
procedure in 2007.
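The grouped-mean calculation for the waiting-days table can be sketched as below. The midpoint is taken as the average of the class limits, which gives the same value as averaging the class boundaries; the variable names are illustrative.

```python
# (lower limit, upper limit, frequency) for each class of waiting days.
classes = [(2, 6, 27), (7, 11, 34), (12, 16, 36), (17, 21, 25)]

# Midpoint of each class interval.
midpoints = [(low + high) / 2 for low, high, _ in classes]

# Sum of frequency x midpoint, and total frequency.
sum_fm = sum(f * m for (_, _, f), m in zip(classes, midpoints))
n = sum(f for _, _, f in classes)

grouped_mean = sum_fm / n
print(f"sum(f*m) = {sum_fm:.0f}, n = {n}, mean = {grouped_mean:.1f} days")
```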
133
Arithmetic Mean
Advantages of mean
 It is based on all values given in the distribution.
 It is most easily understood.
 It is most amenable to algebraic treatment.
Disadvantages of mean
 It may be greatly affected by extreme items and its
usefulness as a “Summary of the whole” may be
considerably reduced.
 When the distribution has open-ended classes, its
computation has to be based on an assumption about
those classes, and therefore may not be valid.
134
Skewness
If extremely low or extremely high observations are
present in a distribution, then the mean tends to shift
towards those scores. Based on the type of skewness,
distributions can be:
Negatively skewed distribution: occurs when the
majority of scores are at the right end of the curve and
a few extremely small scores are scattered at the left end.
Positively skewed distribution: Occurs when the
majority of scores are at the left end of the curve and a
few extreme large scores are scattered at the right
end.
Symmetrical distribution: It is neither positively nor
negatively skewed. A curve is symmetrical if one half
of the curve is the mirror image of the other half.
135
Geometric mean
 It is the nth root of the product of the observations in a data set.
 Mainly used for laboratory data, especially data in the form
of the concentration of one substance in another.
 The geometric mean is preferable to the arithmetic mean
if the series of observations contains one or more
unusually large values.
 Equivalently, it is the antilogarithm of the arithmetic mean
of the logarithms of the values.
$GM = \sqrt[n]{x_1 x_2 \cdots x_n} = \operatorname{antilog}\left[\frac{\sum \log x_i}{n}\right]$
For grouped data: $GM = \operatorname{antilog}\left[\frac{\sum f_i \log x_i}{n}\right]$, where $n = \sum f_i$
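The equivalence between the nth-root form and the antilog form can be checked numerically. This is a minimal sketch using hypothetical values (not data from these slides) and base-10 logarithms.

```python
import math

# Hypothetical observations, chosen so the answer is easy to verify by hand.
values = [2.0, 4.0, 8.0]
n = len(values)

# Definition 1: nth root of the product of the observations.
gm_root = math.prod(values) ** (1 / n)

# Definition 2: antilog of the arithmetic mean of the logarithms.
mean_log = sum(math.log10(x) for x in values) / n
gm_antilog = 10 ** mean_log

print(f"nth-root form: {gm_root:.4f}, antilog form: {gm_antilog:.4f}")
```

For grouped data, each logarithm would be multiplied by its class frequency before summing, and n would be the total frequency.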
136
Geometric mean…
 Example: the following data show the minimum
inhibitory concentration of penicillin in urine for N.
gonorrhoeae in 74 patients. Calculate the geometric
mean.

140
Geometric mean…
$GM = \operatorname{antilog}\left[\frac{\sum f_i \log x_i}{n}\right]$
= antilog{[(21 log 0.03125) + (6 log 0.0625) + … + (3 log 1.0)]/74}
= antilog(−0.846)
= 0.143

141
Weighted Mean

 Used when the values are grouped by frequency or
relative importance
 In a weighted mean, separate outcomes have
separate influence
 The influence attached to an outcome is the weight
 Familiar in the calculation of a course grade as a
weighted average of scores on separate outcomes

142
Weighted mean…
Example:

143
Weighted mean…
 Exercise: Suppose you are selecting one nurse for the educational
opportunity offered by the institution in the 2009 academic year. Based on
the criteria from the human resource office, the following weights were
given to the different criteria: years of work experience 30%, GPA 30%,
boss evaluation points 40%.
 Abebe and Kebede applied for the opportunity, and their profiles
are as follows.
Abebe: 8 years of work experience, GPA 3.2 and a boss evaluation result of 97.5.
Kebede: 18 years of work experience, GPA 3.8 and a boss evaluation result of 74.0.
Who should win the opportunity?
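One way to work the exercise is sketched below. It assumes the raw values are combined as given, without first rescaling years, GPA and evaluation points to a common scale; the slides do not say whether rescaling is intended, so treat this as one possible reading. The dictionary keys are illustrative names.

```python
# Weights from the exercise (they sum to 1.0).
weights = {"experience_years": 0.30, "gpa": 0.30, "boss_evaluation": 0.40}

# Candidate profiles from the exercise.
candidates = {
    "Abebe": {"experience_years": 8, "gpa": 3.2, "boss_evaluation": 97.5},
    "Kebede": {"experience_years": 18, "gpa": 3.8, "boss_evaluation": 74.0},
}

def weighted_mean(scores, weights):
    """Sum of weight x score divided by the sum of the weights."""
    total_weight = sum(weights.values())
    return sum(weights[k] * scores[k] for k in weights) / total_weight

results = {name: weighted_mean(profile, weights)
           for name, profile in candidates.items()}

for name, score in results.items():
    print(f"{name}: weighted score = {score:.2f}")
```

Under this assumption Abebe's weighted score is the higher of the two; rescaling the criteria first could change the comparison.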


144
2. Median
 It is the value that comes halfway when the data are ranked in
order.
Formula:
 The ((n+1)/2)th observation if n is odd.
 The average of the (n/2)th and (n/2 + 1)th observations if n is even.
 The rationale for these definitions is to ensure an equal number
of sample points on both sides of the sample median.

145
Example: Compute the median for the birth weight data
 Solution: First arrange the sample in ascending order
 2069, 2581, 2759, 2834, 2838, 2841, 3031, 3101,
3200, 3245, 3248, 3260, 3265, 3314, 3323, 3484,
3541, 3609, 3649, 4146
 Since n = 20 is even,
 Median = average of the 10th and 11th ordered
observations = (3245 + 3248)/2 = 3246.5 g
? Omit the last BWT (4146 g) and calculate the sample
median
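The odd/even rule can be sketched as a small function and applied to the birth weight data, first with all 20 values and then with the largest value omitted. The function name is illustrative.

```python
def median(values):
    """Middle value if n is odd; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                      # the ((n+1)/2)th observation
    return (ordered[mid - 1] + ordered[mid]) / 2  # (n/2)th and (n/2 + 1)th

birth_weights_g = [
    3265, 3323, 2581, 2759, 3260, 3649, 2841,
    3248, 3245, 3200, 3609, 3314, 3484, 3031,
    2838, 3101, 4146, 2069, 3541, 2834,
]

median_all = median(birth_weights_g)

# Omit the largest birth weight (4146 g), leaving n = 19 (odd).
without_largest = sorted(birth_weights_g)[:-1]
median_19 = median(without_largest)

print(median_all, median_19)
```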

146
Median for grouped data
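 In calculating the median for grouped data we assume that the
values within a class interval are evenly distributed
through the interval
 The first step is to locate the class interval in which the
median lies, using the following procedure
 Find n/2 and take the first class interval whose cumulative
frequency is at least n/2
 Then use the following formula:
Median = LTL + [(n/2 − Fc)/fm] × w
where LTL = lower true limit of the median class, Fc = cumulative
frequency before the median class, fm = frequency of the median
class and w = class width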
147
Median For Grouped data

148
Compute the median age for
the following grouped data

149
Median for grouped data…
First, we need to find the median class
The median class is the first class with a cumulative
frequency of at least n/2 = 84.5, which is the third class
interval, 30-39
LTL = 29.5, Fc = 70, frequency of the median class fm = 47, class width = 10
Median = 29.5 + [(84.5 − 70)/47] × 10
= 29.5 + (14.5/47) × 10 = 32.58 ≈ 33 years
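The worked example applies Median = LTL + [(n/2 − Fc)/fm] × class width. A minimal sketch of that calculation, using only the values quoted in the example (the function and argument names are illustrative):

```python
def grouped_median(lower_true_limit, half_n, cum_freq_before, class_freq, class_width):
    """Median of grouped data, assuming values are spread evenly within the class."""
    return lower_true_limit + (half_n - cum_freq_before) / class_freq * class_width

# Values from the example: LTL = 29.5, n/2 = 84.5, Fc = 70, fm = 47, width = 10.
median_age = grouped_median(29.5, 84.5, 70, 47, 10)
print(f"median age = {median_age:.2f} years")
```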

150
Median for grouped data…
 Exercise: Assume different scheduled OR procedures
were done for 122 patients in 2007. The following
data show the number of days each patient waited before
the procedure. Calculate the median waiting time for an
OR procedure in this hospital.

Interval of waiting days   Frequency (f)
2-6                        27
7-11                       34
12-16                      36
17-21                      25
Total                      122

151
Median for grouped data…
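Answer for the exercise

Interval of waiting days   Frequency (f)   Midpoint (m)            Cumulative frequency
2-6                        27              (1.5 + 6.5)/2 = 4       27
7-11                       34              (6.5 + 11.5)/2 = 9      61
12-16                      36              (11.5 + 16.5)/2 = 14    97
17-21                      25              (16.5 + 21.5)/2 = 19    122
Total                      122

n/2 = 122/2 = 61, so the median class is 7-11, the first class
whose cumulative frequency is at least 61.
Median = 6.5 + [(61 − 27)/34] × 5
= 6.5 + (34/34) × 5 = 11.5 days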

152
Median…
Advantages
 It is easily calculated and is not much disturbed by
extreme values
 It is more typical of the series
 The median may be located even when the data are
incomplete, e.g, when the class intervals are irregular
and the final classes have open ends.
Disadvantages
 The median is not so well suited to algebraic treatment
as the arithmetic, geometric and harmonic means.
 It is not so generally familiar as the arithmetic mean

153
Percentile
 Certain percentiles or functions of percentiles have specific
names:

 All these statistics tell something about the location of the
data.

154
Percentile…
 A percentile has an intuitively simple meaning—for
example, the 25th percentile is that value of a variable
such that 25% of the observations are less than that
value and 75% of the observations are greater.
 The Pth percentile of a sample of n observations is
that value of the variable with rank (P/100)(1+n).
 If this rank is not an integer, it is rounded to the nearest
half rank.

155
Percentile…
Example: The following data deal with the number of
patients scheduled for surgery on 15 consecutive
days in the last month. Calculate the 50th, 25th, 10th and
90th percentiles.

30,26,26,36,48,50,16,31,22,27,23,35,52,28,37

 Ranked data: 16,22,23,26,26,27,28,30,31,35,36,37,48,50,52

156
Percentile
 The 50th percentile is the value with rank (50/100)(1+15)
= 8. The eighth largest (or smallest) observation is 30.
 The 25th percentile is the observation with rank (25/100)
(1+15) = 4, and this is 26.
 Similarly, the 75th percentile has rank (75/100)(1+15) = 12,
and this is 37.
 The 10th percentile (the first decile) is the value with
rank (10/100)(1+15) = 1.6, which is rounded to the nearest
half rank, 1.5; so we take the value halfway between the
smallest and second-smallest observations, which is
(1/2)(16+22) = 19.
 The 90th percentile is the value with rank (90/100)
(1+15) = 14.4; this is rounded to the nearest half rank, 14.5.
 The value with this half rank is (1/2)(50+52) = 51.
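The rank rule used in this example can be sketched as follows. It reads "nearest half rank" as rounding the rank (P/100)(n+1) to the nearest multiple of 0.5, which reproduces the values in the example; other textbooks and software use slightly different percentile conventions. The function name is illustrative.

```python
import math

def percentile(values, p):
    """Pth percentile using rank (p/100)(n+1), rounded to the nearest half rank."""
    ordered = sorted(values)
    n = len(ordered)
    rank = p / 100 * (n + 1)
    # Round to the nearest multiple of 0.5 (e.g. 1.6 -> 1.5, 14.4 -> 14.5).
    half_rank = math.floor(rank * 2 + 0.5) / 2
    # Keep the rank inside the observed range 1..n.
    half_rank = min(max(half_rank, 1.0), float(n))
    lower = math.floor(half_rank)
    upper = math.ceil(half_rank)
    # For a whole rank lower == upper; for a half rank, average the neighbours.
    return (ordered[lower - 1] + ordered[upper - 1]) / 2

patients = [30, 26, 26, 36, 48, 50, 16, 31, 22, 27, 23, 35, 52, 28, 37]
for p in (10, 25, 50, 75, 90):
    print(f"{p}th percentile = {percentile(patients, p)}")
```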

157
Mode
 is the value which occurs with the greatest
frequency.
 If all the values are different there is no mode; on
the other hand, a set of values may have more than
one mode (bimodal, trimodal…).
 A distribution with one mode is referred to as
unimodal.
Characteristics of mode:
 It is a positional average, and there can be more than one mode.
 It is not affected by extreme values

158
Mode…
Example: What is the mode of the above data on the
number of patients scheduled for surgery for 15
consecutive days in the last month.

30,26,26,36,48,50,16,31,22,27,23,35,52,28,37

Mode = 26

159
Mode for grouped data

 In designating the mode of grouped data, we usually


refer to the modal class, where the modal class is the
class interval with the highest frequency.
 If a single value for the mode of grouped data must
be specified, it is taken as the mid point of the modal
class interval.

160
Mode…
Advantages
 Since the mode is usually an “actual value”, it
indicates the precise value of an important part of the
series.
Disadvantages
 Unless the number of items is fairly large and the
distribution reveals a distinct central tendency, the
mode has no significance
 It is not capable of mathematical treatment
 In a small number of items the mode may not exist.

161
Exercise
 Suppose the following data show the maximal static
inspiratory pressure (PI max in cmH2O) of patients
with cystic fibrosis admitted in a certain hospital
during one month duration.
80 100 85 110 75 85 45 70 125 110
110 95 80 75 150 95 130 100 100 75
90 75 120 40 95
 Compute the arithmetic mean, median, and mode?

162
Answer
Mean = $\frac{1}{n}\sum X_i$
= (80 + 100 + … + 95)/25 = 2315/25 = 92.6 cm H2O
Ranked data:
 40 45 70 75 75 75 75 80 80 85 85 90 95 95 95
100 100 100 110 110 110 120 125 130 150
Since n = 25 is odd,
Median = ((n+1)/2)th observation = ((25+1)/2)th = 13th observation
= 95 cm H2O

163
Answer …
If we discard the last observation, n = 24 is even;
therefore, the median is the average of the
(n/2)th and (n/2 + 1)th observations, i.e. the 12th and 13th:
(90 + 95)/2 = 92.5 cm H2O.
 Mode
The modal value of the above PI max data is 75 cm H2O
(this value occurred with the greatest frequency as
compared to the other values).
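These answers can be checked with Python's standard `statistics` module. This is a minimal sketch; `multimode` returns every value tied for the highest frequency, so it also covers bimodal data.

```python
import statistics

# PI max values (cm H2O) from the exercise.
pi_max = [
    80, 100, 85, 110, 75, 85, 45, 70, 125, 110,
    110, 95, 80, 75, 150, 95, 130, 100, 100, 75,
    90, 75, 120, 40, 95,
]

mean_value = statistics.mean(pi_max)
median_value = statistics.median(pi_max)
modes = statistics.multimode(pi_max)

# Median after discarding the largest observation (n = 24, even).
median_24 = statistics.median(sorted(pi_max)[:-1])

print(f"mean = {mean_value:.1f}, median = {median_value}, mode(s) = {modes}")
print(f"median without the largest value = {median_24}")
```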

164
Measure of Dispersion

165
Measures of Dispersion/ Variation
Consider the following data sets:

Set 1: 60 40 30 50 60 40 70 50
Set 2: 50 49 49 51 48 50 53 50
 The two data sets given above have a mean of 50
 Which data set is more scattered?- set 1 is more
“spread out” than set 2.
 How do we express this numerically? Using measures
of dispersion.
 The commonly used measures of scatter are: range,
standard deviation, variance and coefficient of
variation
166
Measure of dispersion…

 Measures of dispersion quantify the variation or
dispersion of a set of data from its central location
 Dispersion refers to the variety exhibited by the
values of the data
 The amount of dispersion is small when the values are
close together
 If all the values are the same, there is no dispersion

167
1. Range
 Is the difference between the highest and lowest value of the
observation in the data.
 It is the crudest measure of dispersion.

Range=Xmax – Xmin

Ex: Ranked data for number of patients scheduled for surgery for
15 consecutive days in the last month on the above example
16,22,23,26,26,27,28,30,31,35,36,37,48,50,52
Range= 52-16= 36

168
2. Interquartile range

 The interquartile range (IQR) is the difference
between the 75th and 25th percentiles.
 The interquartile range for the above example is computed as follows:

16,22,23,26,26,27,28,30,31,35,36,37,48,50,52
25th percentile: rank (25/100)(1+15) = 4, and this is 26
75th percentile: rank (75/100)(1+15) = 12, and this is 37
IQR = 37 − 26 = 11
169
Interquartile range…
 Exercise: The following data shows 10 days FBS
measurement of a patient admitted in the surgical
ward in mg/dl. Calculate the interquartile range.

222, 300, 188, 89, 155, 306, 121, 414, 600, 326

170
Interquartile range…
Answer for the exercise
Raw data: 222, 300, 188, 89, 155, 306, 121, 414, 600, 326
Ranked data: 89, 121, 155, 188, 222, 300, 306, 326, 414, 600
25th percentile (1st quartile): rank (25/100)(1+10) = (n+1)/4 = 2.75, i.e. the 2.75th observation
= 121 + (155 − 121) × 0.75 = 146.5
75th percentile (3rd quartile): rank (75/100)(1+10) = 3(n+1)/4 = 8.25, i.e. the 8.25th observation
= 326 + (414 − 326) × 0.25 = 348
IQR = 348 − 146.5 = 201.5 mg/dl
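This answer interpolates linearly between neighbouring observations when the rank is fractional. A minimal sketch of that calculation for the FBS data (the function name is illustrative, and the rank is assumed to lie between 1 and n):

```python
def value_at_rank(ordered, rank):
    """Value at a possibly fractional 1-based rank, by linear interpolation.

    Assumes 1 <= rank <= len(ordered).
    """
    lower = int(rank)
    fraction = rank - lower
    if fraction == 0:
        return float(ordered[lower - 1])
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])

fbs = [222, 300, 188, 89, 155, 306, 121, 414, 600, 326]
ordered = sorted(fbs)
n = len(ordered)

q1 = value_at_rank(ordered, 0.25 * (n + 1))  # rank 2.75
q3 = value_at_rank(ordered, 0.75 * (n + 1))  # rank 8.25
iqr = q3 - q1

print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr} mg/dl")
```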

171
Measure of dispersion…
Some Measures of Central tendency and dispersion for number
of patients scheduled for surgery for 15 consecutive days in
the last month from the above example
16,22,23,26,26,27,28,30,31,35,36,37,48,50,52

172
3. Mean Deviation
 Each value in a data set differs from the sample mean by some
specific amount called deviation
 Mean deviation is the average of the absolute deviations
from a central value, generally the mean or median

173
Computation of the Mean Deviation
1. Calculate the mean from the data
2. Calculate the deviations from the mean
3. Sum up the absolute values of all deviations (treat
every deviation as positive)
4. Divide the sum of the absolute deviations by the total
number of observations

174
Mean Deviation…

175
Mean Deviation…
 Exercise: Calculate the MD for 10 days FBS
measurement of a patient admitted in the surgical
ward in mg/dl in the above exercise.

222, 300, 188, 89, 155, 306, 121, 414, 600, 326

176
Mean Deviation…
Answer for the exercise
Raw data: 222, 300, 188, 89, 155, 306, 121, 414, 600, 326

Mean = 2721/10 = 272.1 mg/dl

MD = $\frac{\sum |X_i - \bar{X}|}{n}$ = 1171/10 = 117.1 mg/dl
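The four computation steps for the mean deviation can be sketched directly for the FBS data. Variable names are illustrative.

```python
fbs = [222, 300, 188, 89, 155, 306, 121, 414, 600, 326]
n = len(fbs)

# Step 1: mean of the data.
mean_fbs = sum(fbs) / n

# Steps 2 and 3: deviations from the mean, taken as absolute values and summed.
sum_abs_dev = sum(abs(x - mean_fbs) for x in fbs)

# Step 4: divide by the number of observations.
mean_deviation = sum_abs_dev / n

print(f"mean = {mean_fbs:.1f}, sum of |deviations| = {sum_abs_dev:.1f}, "
      f"MD = {mean_deviation:.1f} mg/dl")
```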

177
Properties of MD
 It is based on all observations in the data set
 It is not affected much by extreme values
 However, ignoring the negative signs is not
mathematically sound.

178
4. Variance
 In the MD, ignoring the negative signs is not mathematically sound.
 To overcome this limitation, each deviation is squared instead.
 The variance is the average of the squares of the deviations taken
from the mean (for a sample, the sum of squares is divided by n − 1).
This measure of variation is universally used to show the scatter of
the individual values around the mean of a given distribution
(Population variance = $\sigma^2$ and sample variance = $S^2$)
 Let X1, X2, ..., Xn be the measurements on n sample units; then:

$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$

179
Variance…
 Exercise: Calculate the variance for 10 days FBS
measurement of a patient admitted in the surgical
ward in mg/dl in the above exercise.

222, 300, 188, 89, 155, 306, 121, 414, 600, 326

180
Variance…
Answer for the exercise
Raw data: 222, 300, 188, 89, 155, 306, 121, 414, 600, 326
Mean= 2721/10= 272.1mg/dl

$S^2 = \frac{\sum (X_i - \bar{X})^2}{n - 1}$ = 212138.9/9 ≈ 23571 (mg/dl)²
181
5. Sample Standard Deviation
 The main disadvantage of variance is that the units of
variance are the square of the units of the original
observations.
 The easiest way around this difficulty is to use the square
root of the variance (termed as standard deviation),
which is the widely used measure of dispersion.
 It is the positive square root of the variance.

182
Example
 The following are the survival times of 11 patients
after cardiac transplant surgery.
 Patients are identified numerically from 1 to 11, and
the survival time for the “ith” patient is represented as
Xi for i= 1, …, 11.

 Calculate the sample variance and SD.

183
Example…

184
SD…
 Exercise: Calculate the SD for 10 days FBS
measurement of a patient admitted in the surgical
ward in mg/dl in the above exercise.

222, 300, 188, 89, 155, 306, 121, 414, 600, 326

185
SD…
Answer for the exercise
Raw data: 222, 300, 188, 89, 155, 306, 121, 414, 600, 326
Mean= 2721/10= 272.1mg/dl

SD = √23571 ≈ 153.5 mg/dl
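The sample variance and standard deviation for the FBS data can be reproduced as below. This is a minimal sketch that divides by n − 1, as in the sample variance used in these slides.

```python
import math

fbs = [222, 300, 188, 89, 155, 306, 121, 414, 600, 326]
n = len(fbs)
mean_fbs = sum(fbs) / n

# Sum of squared deviations from the mean.
sum_sq_dev = sum((x - mean_fbs) ** 2 for x in fbs)

# Sample variance (divide by n - 1) and its positive square root.
variance = sum_sq_dev / (n - 1)
sd = math.sqrt(variance)

print(f"sum of squares = {sum_sq_dev:.1f}")
print(f"variance = {variance:.1f} (mg/dl)^2, SD = {sd:.1f} mg/dl")
```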


186
Variance & SD for grouped data
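 For a grouped frequency distribution, the variance and
standard deviation are given by:
$S^2 = \frac{\sum f_i (m_i - \bar{X})^2}{n - 1}$, $S = \sqrt{S^2}$, where $m_i$ is the
midpoint of class i, $f_i$ its frequency and $n = \sum f_i$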

187
SD for grouped frequency…

188
6. COEFFICIENT OF VARIATION (CV)

 It expresses the standard deviation as a percentage of
the mean.
 This can be given as: CV = (S / $\bar{X}$) × 100
 The coefficient of variation is free from the unit of
measurement.

189
Example
 Consider the following two samples that represent
cholesterol measurements (mg/100 ml), each on the
same person, but using different measurement
techniques.

Method                   Measurements             Mean
Auto-analyzer (AA)       177 193 195 209 226      200
Micro-enzymatic (ME)     192 197 202 209 200      200

 Compute the range and standard deviation for both
methods.
190
Solution
A. Range(R)
R (AA) = 226-177 = 49mg/100ml
R (ME) = 209-192 = 17mg/100ml
Thus, the AA method clearly seems more variable.

B. Standard deviation (S)

For the AA method:
$S = \sqrt{\frac{\sum (X_i - \bar{X})^2}{n - 1}}$
$= \sqrt{\frac{(177 - 200)^2 + (193 - 200)^2 + \dots + (226 - 200)^2}{5 - 1}}$
$= \sqrt{340}$ = 18.4 mg/100 ml
191
Solution…
For the ME method:
$S = \sqrt{\frac{(192 - 200)^2 + (197 - 200)^2 + \dots + (200 - 200)^2}{5 - 1}}$
$= \sqrt{39.5}$ = 6.3 mg/100 ml

N.B. The AA method clearly gives the more variable
result.
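The comparison of the two methods can be reproduced as below. This is a minimal sketch: the helper names are illustrative, and the coefficient of variation is added only to show that, because both means are 200, it ranks the methods the same way as the standard deviation.

```python
import math

def sample_sd(values):
    """Sample standard deviation (divisor n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))

def coefficient_of_variation(values):
    """Standard deviation expressed as a percentage of the mean."""
    mean = sum(values) / len(values)
    return sample_sd(values) / mean * 100

# Cholesterol measurements (mg/100 ml) from the example.
aa = [177, 193, 195, 209, 226]   # auto-analyzer
me = [192, 197, 202, 209, 200]   # micro-enzymatic

for name, data in (("AA", aa), ("ME", me)):
    value_range = max(data) - min(data)
    print(f"{name}: range = {value_range}, SD = {sample_sd(data):.1f}, "
          f"CV = {coefficient_of_variation(data):.1f}%")
```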

192
Example
 Compute the coefficient of variation (CV) for the age and
weight of a group of students.

Variable   Mean          Standard deviation
Age        20.64 years   3.15 years
Weight     58.98 kg      8.1 kg

193
Solution…
 CV (age) = (3.15 years / 20.64 years) × 100
= 15.3%
 CV (weight) = (8.1 kg / 58.98 kg) × 100
= 13.7%
 Thus, the age of the students is relatively more spread
out than their weights.
 But if one considered only the respective standard
deviations, one would say that the weight of the students
is more spread out than their age.
 Read about properties of mean and standard deviation.

194
Which measures to use?
 When the distribution of the data is symmetric and unimodal
(i.e. the data are approximately normally distributed), it is
usual to summarize the data using means and standard
deviations.
 However, when the data are skewed, it is preferable to use the
median and interquartile range as summary statistics.
 Median and quartiles are not easily influenced by extreme
values in a skewed distribution unlike means and standard
deviations.
Remark:
The mean and median of symmetric distribution coincide.
When the distribution is skewed to the right, its mean is
larger than its median.
When the distribution is skewed to the left, its mean is
smaller than its median.
