Introduction To Biostatistics Student Lecture Notes
HEALTH STATISTICS
Table of Contents
CHAPTER ONE
1.0 INTRODUCTION TO STATISTICS
1.2 What is statistics?
CHAPTER TWO
2.1 Introduction
2.1.1 Mean (Arithmetic)
2.2 Median
2.3 Mode
2.4 Skewed Distributions and the Mean and Median
2.5 Summary of when to use the mean, median and mode
2.6 Measures of Dispersion
2.6.1 Introduction
2.6.2 Range
2.6.3 Standard Deviation
CHAPTER THREE
CHAPTER FOUR
4.0 Introduction to Probability
CHAPTER FIVE
5.0 Correlation and Regression
RANK CORRELATION FORMULA
5.13 RANK CORRELATION
Coefficient of correlation
CHAPTER SIX
6.0 The Chi square test
6.5 Hypothesis testing
REFERENCES
CHAPTER ONE
OBJECTIVES
By the end of this course, the participant will be able to:
1. Discuss terminologies, basic principles and concepts of Biostatistics
2. Discuss commonly used descriptive statistics
3. Discuss elements of inferential statistics
4. Discuss the application of statistical methods in data management
Statistical techniques are used in studies such as identifying the causes of diseases and injuries,
evaluating public health programs to determine what works best in solving health problems, and
designing mathematical models that describe the progression of diseases in populations.
Biostatisticians collaborate with practitioners and researchers in clinical and public health and
with local, state, and national health institutions. Biostatisticians also advise public health
officials at the local, regional and national levels.
Biostatisticians find employment in various types of organizations and settings, including local
and state health departments, with the federal government such as at the Centers for Disease
Control and Prevention or other divisions in the Department of Health and Human Services,
and in academic settings, industry such as pharmaceutical companies, and health care providers
including hospitals and managed care organizations.
• Discrete data
This is numerical data where both ordering and magnitude are important but only whole-number
values are possible (e.g., numbers of deaths caused by heart disease (765,156 in 1988) versus
suicide (40,368 in 1988; page 10 in the text)).
• Continuous data
Numerical data where any conceivable value is, in theory, attainable (e.g., height, weight,
etc.)
1.8 Descriptive Statistics
i. Deals with methods of describing large data sets (masses of numbers)
ii. Describes a collection of data
iii. Identifies patterns in the data
iv. Describes samples in summary
v. Guides the choice of statistical test
vi. Describes numerical data using the mean, median, mode, standard deviation, etc.
CHAPTER TWO
A measure of central tendency is a single value that attempts to describe a set of data by
identifying the central position within that set of data. As such, measures of central tendency are
sometimes called measures of central location. They are also classed as summary statistics. The
mean (often called the average) is most likely the measure of central tendency that you are most
familiar with, but there are others, such as the median and the mode.
The mean, median and mode are all valid measures of central tendency but, under different
conditions, some measures of central tendency become more appropriate to use than others. In
the following sections we will look at the mean, mode and median and learn how to calculate
them and under what conditions they are most appropriate to be used.
The mean (or average) is the most popular and well known measure of central tendency. It can
be used with both discrete and continuous data, although its use is most often with continuous
data (see our Types of Variable guide for data types). The mean is equal to the sum of all the
values in the data set divided by the number of values in the data set. So, if we have n values in
a data set and they have values x1, x2, ..., xn, then the sample mean, usually denoted
by x̄ (pronounced "x bar"), is:

x̄ = (x1 + x2 + ... + xn) / n

This formula is usually written in a slightly different manner using the Greek capital letter Σ,
pronounced "sigma", which means "sum of...":

x̄ = Σx / n
You may have noticed that the above formula refers to the sample mean. So, why have we
called it a sample mean? This is because, in statistics, samples and populations have very
different meanings and these differences are very important, even if, in the case of the mean,
they are calculated in the same way. To acknowledge that we are calculating the population
mean and not the sample mean, we use the Greek lower case letter "mu", denoted as µ:

µ = Σx / N

where N is the number of values in the population.
The mean is essentially a model of your data set. It is the value that is most common. You will
notice, however, that the mean is not often one of the actual values that you have observed in
your data set. However, one of its important properties is that it minimizes error in the prediction
of any one value in your data set. That is, it is the value that produces the lowest amount of error
from all other values in the data set.
An important property of the mean is that it includes every value in your data set as part of the
calculation. In addition, the mean is the only measure of central tendency where the sum of
the deviations of each value from the mean is always zero.
The mean has one main disadvantage: it is particularly susceptible to the influence of outliers.
These are values that are unusual compared to the rest of the data set by being especially small
or large in numerical value. For example, consider the wages of staff at a factory below:
Staff 1 2 3 4 5 6 7 8 9 10
Salary 15k 18k 16k 14k 15k 15k 12k 17k 90k 95k
The mean salary for these ten staff is $30.7k. However, inspecting the raw data suggests that this
mean value might not be the best way to accurately reflect the typical salary of a worker, as most
workers have salaries in the $12k to 18k range. The mean is being skewed by the two large
salaries. Therefore, in this situation we would like to have a better measure of central tendency.
As we will find out later, taking the median would be a better measure of central tendency in this
situation.
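The pull of the two large salaries can be checked directly with Python's standard library; a minimal sketch using the figures from the table above (values in $1000s):

```python
import statistics

# Salaries from the table above, in $1000s
salaries = [15, 18, 16, 14, 15, 15, 12, 17, 90, 95]

print(statistics.mean(salaries))    # 30.7 -> pulled up by the two outliers
print(statistics.median(salaries))  # 15.5 -> closer to a "typical" salary
```

The median (15.5) sits inside the $12k to $18k range where most workers fall, while the mean (30.7) does not.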
Another time when we usually prefer the median over the mean (or mode) is when our data is
skewed (i.e. the frequency distribution for our data is skewed). If we consider the normal
distribution - as this is the most frequently assessed in statistics - when the data is perfectly
normal then the mean, median and mode are identical. Moreover, they all represent the most
typical value in the data set. However, as the data becomes skewed the mean loses its ability to
provide the best central location for the data as the skewed data is dragging it away from the
typical value. However, the median best retains this position and is not as strongly influenced by
the skewed values. This is explained in more detail in the skewed distribution section later in this
guide.
2.2 Median
The median is the middle score for a set of data that has been arranged in order of magnitude.
The median is less affected by outliers and skewed data. In order to calculate the median,
suppose we have the data below:
65 55 89 56 35 14 56 55 87 45 92
We first need to rearrange that data into order of magnitude (smallest first):
14 35 45 55 55 56 56 65 87 89 92
Our median mark is the middle mark - in this case 56. It is the middle mark
because there are 5 scores before it and 5 scores after it. This works fine when you have an odd
number of scores but what happens when you have an even number of scores? What if you had
only 10 scores? Well, you simply have to take the middle two scores and average the result. So,
if we look at the example below:
65 55 89 56 35 14 56 55 87 45
We again rearrange the data into order of magnitude (smallest first):
14 35 45 55 55 56 56 65 87 89
Only now we have to take the 5th and 6th scores in our data set and average them to get a
median of 55.5.
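The procedure above can be sketched as a small Python function, using the same scores as the example:

```python
def median(scores):
    """Middle value of the sorted data; average of the two middle values when n is even."""
    ordered = sorted(scores)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92]))  # 56 (odd n)
print(median([65, 55, 89, 56, 35, 14, 56, 55, 87, 45]))      # 55.5 (even n)
```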
2.3 Mode
The mode is the most frequent score in our data set. On a bar chart or histogram it is
represented by the highest bar. You can, therefore, sometimes consider the mode as being the
most popular option.
Normally, the mode is used for categorical data where we wish to know which is the most
common category as illustrated below:
We can see above that the most common form of transport, in this particular data set, is the bus.
However, one of the problems with the mode is that it is not unique, so it leaves us with
problems when we have two or more values that share the highest frequency, such as below:
We are now stuck as to which mode best describes the central tendency of the data. This is
particularly problematic when we have continuous data, as we are more likely not to have any
one value that is more frequent than the other. For example, consider measuring 30 peoples'
weight (to the nearest 0.1 kg). How likely is it that we will find two or more people
with exactly the same weight, e.g. 67.4 kg? The answer, is probably very unlikely - many people
might be close but with such a small sample (30 people) and a large range of possible weights
you are unlikely to find two people with exactly the same weight, that is, to the nearest 0.1 kg.
This is why the mode is very rarely used with continuous data.
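Finding the mode, and spotting ties for the highest frequency, can be sketched with Python's `collections.Counter`; the transport categories below are hypothetical stand-ins for the data set described above:

```python
from collections import Counter

# Hypothetical categorical data (modes of transport)
transport = ["bus", "car", "bus", "walk", "bus", "car", "bike"]
counts = Counter(transport)

top = counts.most_common()
highest = top[0][1]
# All categories sharing the highest frequency (the mode is not always unique)
modes = [value for value, c in top if c == highest]
print(modes)  # ['bus']
```

With two or more categories tied at the top, `modes` would contain them all, which is exactly the ambiguity the text describes.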
Another problem with the mode is that it will not provide us with a very good measure of central
tendency when the most common mark is far away from the rest of the data in the data set, as
depicted in the diagram below:
In the above diagram the mode has a value of 2. We can clearly see, however, that the mode is
not representative of the data, which is mostly concentrated around the 20 to 30 value range. To
use the mode to describe the central tendency of this data set would be misleading.
We often test whether our data is normally distributed as this is a common assumption
underlying many statistical tests. An example of a normally distributed set of data is presented
below:
When you have a normally distributed sample you can legitimately use either the mean or the
median as your measure of central tendency. In fact, in any symmetrical distribution the mean,
median and mode are equal. However, in this situation, the mean is widely preferred as the best
measure of central tendency as it is the measure that includes all the values in the data set for its
calculation, and any change in any of the scores will affect the value of the mean. This is not the
case with the median or mode.
However, when our data is skewed, for example, as with the right-skewed data set below:
we find that the mean is being dragged in the direction of the skew. In these situations, the median is
generally considered to be the best representative of the central location of the data. The more
skewed the distribution the greater the difference between the median and mean, and the greater
emphasis should be placed on using the median as opposed to the mean. A classic example of the
above right-skewed distribution is income (salary), where higher-earners provide a false
representation of the typical income if expressed as a mean and not a median.
If tests of normality show that the data is non-normal, it is customary to use the median
instead of the mean. This is more a rule of thumb than a
strict guideline, however. Sometimes, researchers wish to report the mean of a skewed
distribution if the median and mean are not appreciably different (a subjective assessment) and if
it allows easier comparisons to previous research to be made.
Please use the following summary table to know what the best measure of central tendency
is with respect to the different types of variable.
Type of variable              Best measure of central tendency
Nominal                       Mode
Ordinal                       Median
Interval/Ratio (not skewed)   Mean
Interval/Ratio (skewed)       Median
2.6 Measures of Dispersion
2.6.1 Introduction
2.6.2 Range
i. Defined as the difference between the largest and smallest sample values.
ii. One of the simplest measures of variability to calculate.
iii. Depends only on extreme values and provides no information about how
the remaining data is distributed.
2.6.3 Standard Deviation
v. Not restricted to large-sample datasets, compared with the root mean square
anomaly discussed later in this section.
vi. Provides significant information about the distribution of data around the mean,
for data approximating normality:
a. The mean ± one standard deviation contains approximately 68% of the
measurements in the series.
b. The mean ± two standard deviations contain approximately 95% of
the measurements in the series.
c. The mean ± three standard deviations contain approximately 99.7% of
the measurements in the series.
vii. Climatologists often use standard deviations to help classify abnormal climatic
conditions. The chart below describes the abnormality of a data value by how many
standard deviations it is located away from the mean. The probabilities in the third
column assume the data is normally distributed.
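The 68/95/99.7 rule in items (a) to (c) can be checked empirically; a sketch using simulated (hypothetical) normally distributed data:

```python
import random
import statistics

random.seed(1)
# Simulated normally distributed sample: mean 20, standard deviation 5 (hypothetical)
data = [random.gauss(20, 5) for _ in range(10_000)]

mu = statistics.mean(data)
sigma = statistics.pstdev(data)  # population standard deviation

# Fraction of values within mean +/- k standard deviations
for k, expected in [(1, "~68%"), (2, "~95%"), (3, "~99.7%")]:
    within = sum(mu - k * sigma <= x <= mu + k * sigma for x in data) / len(data)
    print(f"within {k} sd: {within:.3f} ({expected} expected)")
```

The printed fractions land close to 0.68, 0.95 and 0.997, matching the rule for normal data.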
Review question
Daily Bed Return (DBR) is a document completed in a ward covering the ward bed state over 24 hours.
The actual time for completing the DBR is determined by the Hospital Administration;
however, it should be completed during the night, normally at 12 midnight, when patient
movement within the hospital is minimized. The DBR records:
1. Patient movement in and out of the ward that is admissions and discharges from
the ward.
2. Patient movement within the hospital that is ward inter-transfers.
3. Actual patient counts that is number of patients in the ward
4. Number of vacant beds and cots
5. A section for computation of figures by the Medical Records Officer
Example of Daily Ward Return
Form 1
Section 1
Hospital……………. Date……………..Ward……….
Admissions Discharges and Deaths
Hospital NO. Name Hospital NO. Name Discharged To
Section 3 Paroles
Admission from Parole Discharge to Parole
Hospital Number Name Hospital Number Name
Section 4 Abscondees
Admissions Absconded
Hospital Number Name Hospital Number Name
Section 5 Computation
Previously Daily Return Numbers Today’s Daily Return Numbers
Beds Cots Total Beds Cots Total
Patients Patients
Well/People Well/People
Vacant Vacant
Total Total
Signed……………. i/c ward
RECORDS USE ONLY Well people…….
Patients ……. Previous………
Previous……. Admissions…..
Admissions…. Discharges…..
Discharges ….. TOTAL……..
Total………… CHECKED BY…………..
Form II
Daily summary form for In-patient Statistics
Hospital……………………….. Ward……………… Month……….. Year…..
ADMISSIONS DISCHARGES
T/I T/O DEATH W/P OBD
DATE HOME PAROLE ABSC HOME PAROLE ABSC
1
2
3
4
5
TOTAL
The Daily summary form summarizes the days’ statistics and indicates the total
admissions and discharges of the ward. There is a column for OBD (Occupied Bed
Days) which is the total number of patients remaining in the ward each day.
2. Bed State. The number of patients occupying hospital beds at any given time.
3. Bed Turnover / Turnover per bed per year / Throughput per bed. The average
number of patients expected to be treated per bed.
4. Turnover Interval. The average number of days a bed remains vacant between
successive patients.
5. Percentage Bed Occupancy. The ratio of occupied bed days to the maximum available
patient days (available bed days) as determined by the bed capacity during any given period.
6. Occupied Bed Days. The total number of patients remaining in the hospital or
ward each day added together over the reference period.
9. In-patient. A person who has undergone the full admission procedure of the hospital
and is occupying a bed in the in-patient department.
10. Day case. Persons or patients attending hospital as non-resident patients for
investigation, therapeutic or operative procedures, or other treatment, and who
require some form of preparation, a period of recovery, or both, involving the
provision of accommodation and services.
Analytical Formulae
1) Available Bed Days or Staffed Bed Days = Available Beds x Days in Period
b. In the year 2004, Etihad hospital had 800 beds permanently allocated for
in-patient use. During the period the hospital percentage occupancy was 110%,
and there were 1500 patients discharged home alive, 80 patients went on parole
and there were 20 deaths. Use the information to calculate:
i. Occupied bed days
ii. Average length of stay
iii. Turnover per bed
iv. Excess/ vacant Bed days
v. Turnover interval
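A hedged worked sketch of the exercise above. The formulas used are the standard hospital-statistics definitions and are assumptions here (occupied bed days = available bed days × occupancy; average length of stay = occupied bed days ÷ total discharges including deaths; turnover per bed = discharges ÷ beds), as is the 365-day year (2004 is in fact a leap year):

```python
# Exercise inputs
beds = 800
days = 365                      # assumed 365-day year
occupancy_pct = 110             # 110% occupancy
discharges = 1500 + 80 + 20     # home + parole + deaths = 1600

available_bed_days = beds * days                             # 292000
occupied_bed_days = available_bed_days * occupancy_pct / 100  # i.   321200.0
avg_length_of_stay = occupied_bed_days / discharges           # ii.  200.75 days
turnover_per_bed = discharges / beds                          # iii. 2.0 patients per bed
excess_bed_days = occupied_bed_days - available_bed_days      # iv.  29200.0 over capacity
turnover_interval = -excess_bed_days / discharges             # v.   negative: beds never vacant

print(occupied_bed_days, avg_length_of_stay, turnover_per_bed)
```

An occupancy above 100% means the ward ran over capacity, so the "excess" bed days are positive and the turnover interval comes out negative.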
CHAPTER THREE
In a stem-and leaf plot, the greatest common place value of the data is used to form
stems.
The numbers in the next greatest place-value position are then used to form the leaves.
Stem: The digit or digits that remain when the leaf is dropped. For example, for the
number 28, the leaf is 8 and the stem is 2.
Example
Stem Leaves
15 0 1 0 4 2 0 1
14 5 7 2 4 4 7 3 4 8 4 4 1 7 3 0 9
13 8 9 9 8 9
Legend: 15 | 0 = 150
A stem and leaf plot is a display that organizes data to show its shape and distribution. In a stem-
and-leaf plot each data value is split into a "stem" and a "leaf". The "leaf" is usually the last
digit of the number and the other digits to the left of the "leaf" form the "stem". The number 123
would be split as: stem = 12, leaf = 3.
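The stem/leaf split can be sketched in Python; the sample values below are hypothetical:

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group each value's last digit (leaf) under its remaining digits (stem)."""
    plot = defaultdict(list)
    for v in sorted(values):
        plot[v // 10].append(v % 10)  # e.g. 123 -> stem 12, leaf 3
    return dict(plot)

print(stem_and_leaf([123, 125, 131, 138, 120]))
# {12: [0, 3, 5], 13: [1, 8]}
```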
A stem-and-leaf plot does resemble a histogram turned sideways. The stem values could
represent the intervals of a histogram, and the leaf values could represent the frequency for
each interval.
One advantage to the stem-and-leaf plot over the histogram is that the stem-and-leaf plot displays
not only the frequency for each interval, but also displays all of the individual values within that
interval.
Look at the stem and leaf plot above. Notice the following features of the graph:
1. title – all graphs need a title so that people analyzing the data can understand at a glance
what the graph is trying to portray
2. stem – on this particular plot, the stem column consists of the hundreds and tens digits
3. leaves – on this particular graph, the leaves consist of the ones digits
4. legend – helps the reader create numbers from the stem and leaf
If you are comparing two sets of data, you can use back-to-back stem and leaf plots. For
example:
[back-to-back stem-and-leaf plot not reproduced]
The data in the table below shows math test scores (out of 50) from a grade seven class. Using
the data from the table below and this graph as a guide, fill in the stems and leaves to complete
the plot:
35 48 36 40
42 45 50 38
45 49 47 50
Title:_____________________________________
Stem Leaves
Legend:__________________________________
1. a) The chart shows the age of 34 actresses when they won the Academy Award. Display
the data in a stem and leaf plot.
26 35 34 34 26 37 42 41 35 31
41 33 30 74 33 49 38 61 21 41
26 80 42 29
b) How old was the youngest actress when she won the award? the oldest actress?
d) Did you discover patterns more easily from the chart or from the stem and leaf plot?
Why?
2.a) Data similar to the data in problem 1 is shown for Best Actors. Display the data in a
stem and leaf plot.
35 47 31 46 39 56 41 44 42 43
62 43 40 48 48 56 38 60 32 41
42 37 76 39 55 45 35 61 33 51
31 42 62 62
b) How do the ages of the youngest and oldest actors compare with the actresses in
problem 1?
c) Create a stem and leaf plot to display both sets of data together to help make
comparisons.
Review question
CHAPTER FOUR
1. Define probability
2. Define sample space
3. Be able to calculate relative frequency
4. Apply the multiplication and addition rules
The set of all possible outcomes of a probability experiment is called a sample space.
The sample space is an exhaustive list of all the possible outcomes of an experiment. Each
possible result of such a study is represented by one and only one point in the sample
space, which is usually denoted by S.
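As a small illustration (the coin-tossing experiment is hypothetical), the sample space for tossing a coin twice can be enumerated:

```python
from itertools import product

# Sample space S for tossing a coin twice: every ordered pair of H/T outcomes
S = list(product(["H", "T"], repeat=2))
print(S)       # [('H', 'H'), ('H', 'T'), ('T', 'H'), ('T', 'T')]
print(len(S))  # 4 possible outcomes
```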
Examples
4.4 Event
Any event which consists of a single outcome in the sample space is called an elementary
or simple event. Events which consist of more than one outcome are called compound
events.
Set theory is used to represent relationships among events. In general, if A and B are two events
in the sample space S, then
Example
C = 'score is 7' = ∅ (the empty set: no roll of a single die gives a score of 7)
Relative frequency is another term for proportion; it is the value calculated by dividing the
number of times an event occurs by the total number of times an experiment is carried out. The
probability of an event can be thought of as its long-run relative frequency when the experiment
is carried out many times.
If an experiment is repeated n times, and event E occurs r times, then the relative frequency
of the event E is defined to be
rfn(E) = r/n
Example
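A simulation sketch of the repeated experiment (a fair coin here, with an assumed random seed) shows the relative frequency settling towards a limiting value:

```python
import random

random.seed(42)

def relative_frequency(n):
    # Repeat the coin-toss experiment n times and count heads
    heads = sum(random.random() < 0.5 for _ in range(n))
    return heads / n

for n in (10, 100, 10_000):
    print(n, relative_frequency(n))  # drifts towards 0.5 as n grows
```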
If an experiment is repeated many, many times without changing the experimental conditions,
the relative frequency of any particular event will settle down to some value. The probability of
the event can be defined as the limiting value of the relative frequency:
P(E) = lim (n → ∞) rfn(E)
For example, in the above experiment, the relative frequency of the event 'heads' will settle
down to a value of approximately 0.5 if the experiment is repeated many more times.
4.6 Probability
The probability of an event has been defined as its long-run relative frequency. It has also been
thought of as a personal degree of belief that a particular event will occur (subjective
probability).
In some experiments, all outcomes are equally likely. For example if you were to choose one
winner in a raffle from a hat, all raffle ticket holders are equally likely to win, that is, they have
the same probability of their ticket being chosen. This is the equally-likely outcomes model and
is defined to be:
P(E) = (number of outcomes in E) / (total number of outcomes in the sample space S)
Examples
2. When tossing a coin, we assume that the results 'heads' or 'tails' each have equal
probabilities of 0.5.
A person's subjective probability of an event describes his/her degree of belief in the event.
Example
A Rangers supporter might say, "I believe that Rangers have probability of 0.9 of winning the
Scottish Premier Division this year since they have been playing really well."
Two events are independent if the occurrence of one of the events gives us no information about
whether or not the other event will occur; that is, the events have no influence on each other.
In probability theory we say that two events, A and B, are independent if the probability that they
both occur is equal to the product of the probabilities of the two individual events, i.e.

P(A ∩ B) = P(A) × P(B)
The idea of independence can be extended to more than two events. For example, A, B and C are
independent if:
a. A and B are independent; A and C are independent and B and C are independent
(pairwise independence);
b. P(A ∩ B ∩ C) = P(A) × P(B) × P(C).
If two events (each with non-zero probability) are independent then they cannot be mutually exclusive (disjoint), and vice versa.
Example
Suppose that a man and a woman each have a pack of 52 playing cards. Each draws a card from
his/her pack. Find the probability that they each draw the ace of clubs. Since the two draws are
independent, P(both draw the ace of clubs) = 1/52 × 1/52 = 1/2704 ≈ 0.0004.
That is, there is a very small chance that the man and the woman will both draw the ace of clubs.
Two events are mutually exclusive (or disjoint) if it is impossible for them to occur together.
If two events are mutually exclusive, they cannot be independent (unless one has zero probability), and vice versa.
Examples
1. Experiment: Rolling a die once
Sample space S = {1,2,3,4,5,6}
Events A = 'observe an odd number' = {1,3,5} and B = 'observe an even number' = {2,4,6}.
A and B are mutually exclusive: they cannot occur together.
2. A subject in a study cannot be both male and female, nor can they be aged 20 and 30. A
subject could however be both male and 20, or both female and 30.
The addition rule is a result used to determine the probability that event A or event B occurs, or
both occur:

P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

where:
P(A ∪ B) = probability that event A or event B (or both) occurs
P(A ∩ B) = probability that event A and event B both occur
For mutually exclusive events, that is events which cannot occur together:

P(A ∩ B) = 0, so the rule reduces to P(A ∪ B) = P(A) + P(B)

For independent events, that is events which have no influence on each other:

P(A ∪ B) = P(A) + P(B) − P(A) × P(B)
Example
Suppose we wish to find the probability of drawing either a king or a spade in a single draw from
a pack of 52 playing cards.
Since there are 4 kings in the pack and 13 spades, but 1 card is both a king and a spade, we have:

P(king or spade) = 4/52 + 13/52 − 1/52 = 16/52 = 4/13
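The same answer can be checked by enumerating the 52 equally likely cards; a minimal sketch:

```python
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["clubs", "diamonds", "hearts", "spades"]
deck = list(product(ranks, suits))  # 52 equally likely cards

kings = {c for c in deck if c[0] == "K"}
spades = {c for c in deck if c[1] == "spades"}

# |kings ∪ spades| = 4 + 13 - 1 = 16, matching the addition rule
p_king_or_spade = len(kings | spades) / len(deck)
print(p_king_or_spade)  # 16/52 ≈ 0.3077
```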
The multiplication rule is a result used to determine the probability that two events, A and B,
both occur:

P(A ∩ B) = P(A | B) × P(B) = P(B | A) × P(A)

where:
P(A | B) = the conditional probability that event A occurs given that event B has occurred
already
P(B | A) = the conditional probability that event B occurs given that event A has occurred
already
For independent events, that is events which have no influence on one another, the rule
simplifies to:

P(A ∩ B) = P(A) × P(B)
That is, the probability of the joint events A and B is equal to the product of the individual
probabilities for the two events.
In many situations, once more information becomes available, we are able to revise our
estimates for the probability of further outcomes or events happening. For example, suppose you
go out for lunch at the same place and time every Friday and you are served lunch within 15
minutes with probability 0.9. However, given that you notice that the restaurant is exceptionally
busy, the probability of being served lunch within 15 minutes may reduce to 0.7. This is the
conditional probability of being served lunch within 15 minutes given that the restaurant is
exceptionally busy.
The usual notation for "event A occurs given that event B has occurred" is "A | B" (A given B).
The symbol | is a vertical line and does not imply division. P(A | B) denotes the probability that
event A will occur given that event B has occurred already.
A rule that can be used to determine a conditional probability from unconditional probabilities is:

P(A | B) = P(A ∩ B) / P(B)

where:
P(A | B) = the (conditional) probability that event A will occur given that event B has
occurred already
P(A ∩ B) = the probability that event A and event B both occur
P(B) = the probability that event B occurs
The unconditional probability of A can be written as P(A) = P(A ∩ B) + P(A ∩ B′)
where:
P(A ∩ B′) = probability that event A and event B′ both occur, i.e. A occurs and B
does not.
Bayes' Theorem
Using the multiplication rule gives Bayes' Theorem in its simplest form:

P(A | B) = P(B | A) × P(A) / P(B)

Expanding P(B) over A and A′ gives the fuller form:

P(A | B) = P(B | A) × P(A) / [P(B | A) × P(A) + P(B | A′) × P(A′)]

where:
P(A | B) = probability that event A occurs given that event B has occurred already
P(B | A) = probability that event B occurs given that event A has occurred already
P (B | A') = probability that event B occurs given that event A has not occurred already
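A common way to see the theorem at work is a screening-test calculation; all the probabilities below are hypothetical, chosen only for illustration:

```python
# Hypothetical screening-test illustration of Bayes' theorem:
#   A = person has the disease, B = test is positive
p_a = 0.01              # prevalence (assumed)
p_b_given_a = 0.95      # sensitivity (assumed)
p_b_given_not_a = 0.05  # false-positive rate (assumed)

# P(B) expanded over A and A' (law of total probability)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' theorem: P(A | B) = P(B | A) P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 3))  # ≈ 0.161
```

Even with a fairly accurate test, a positive result here implies only about a 16% chance of disease, because the disease is rare.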
Review question
1. What is probability?
2. State what is meant by sample space
3. Describe how relative frequency is calculated
4. When do we use the multiplication and addition rules?
CHAPTER FIVE
iv. Can we make use of that relationship for predictive purposes, i.e. forecasting?
5.2 Correlation
Correlation describes the strength of the relationship. It is not concerned with 'cause' and
'effect'.
5.3 Regression
Regression describes the relationship itself in the form of a straight line equation which best
fits the data.
Some initial insight into the relationship between two continuous variables can be obtained
by plotting a scatter diagram and simply looking at the resulting graph.
4. The 'goodness of fit' can be calculated to see how well the line fits the data.
5. Once defined by an equation, the relationship can be used for predictive purposes.
Example
'Ice cream Sales' for a particular firm of manufacturers and 'Average Monthly Temperature'.
We look for a linear relationship, with the plotted bivariate points being reasonably close to the,
as yet unknown, 'line of best fit'.
[Scatter diagram: Sales (50–140) plotted against Av. Temp. (5–15)]
On a Casio calculator: r = 0.9833
5.8 Conclusion: The test statistic exceeds the critical value so we reject the Null Hypothesis
and conclude that there is a significant association between ice-cream sales and average
monthly temperature.
This can be produced directly from a calculator in LR mode. (Shift 7 for a and shift 8 for b)
The regression line is described, in general, as the straight line of ‘best fit’ with the equation:
y = a + bx
where x and y are the independent and dependent variables respectively, a the intercept on the
y-axis, and b the slope of the line.
y = 45.5 + 5.45x
To draw this line on the scatter diagram, any three points are plotted and joined up. These may
be the point (0, a), the centroid (x̄, ȳ), and/or any other points calculated from the equation.
For any value of x the corresponding value of y can be found directly from the calculator in L.R.
mode from the key [ŷ].
[Scatter diagram with the fitted regression line: Sales (50–140) against Av. Temp. (5–15)]
The correlation coefficient r was 0.983, so we have (0.983)² × 100 = 96.6% fit.
This indicates the percentage of the variation in Ice-cream Sales which is accounted for by
the variation in Average monthly temperature.
r = Sxy / √(Sxx · Syy),   −1 ≤ r ≤ 1

where  Sxx = Σx² − (Σx)²/n
       Syy = Σy² − (Σy)²/n
       Sxy = Σxy − (Σx)(Σy)/n
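These sums can be computed directly from paired data. The sketch below uses hypothetical temperature/sales figures (the worked example's raw data table is not reproduced in these notes), purely to show the formulas in action:

```python
import math

# Hypothetical paired observations (x = average temperature, y = sales);
# not the actual figures from the ice-cream example.
x = [5, 7, 9, 11, 13, 15]
y = [62, 71, 85, 95, 110, 124]
n = len(x)

# Corrected sums of squares and products, exactly as defined above.
Sxx = sum(xi**2 for xi in x) - sum(x)**2 / n
Syy = sum(yi**2 for yi in y) - sum(y)**2 / n
Sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n

r = Sxy / math.sqrt(Sxx * Syy)
print(round(r, 4))  # close to +1 for this strongly increasing data
```

A value of r near +1 indicates a strong positive linear association, as in the ice-cream example.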
The following example gives ice-cream sales per month against mean monthly
temperature
(Fahrenheit)
Calculations
r = 2433 / √(792 × 7674) = 0.9869
b = Sxy / Sxx

where  Sxy = Σxy − (Σx)(Σy)/n
and    Sxx = Σx² − (Σx)²/n
Since the regression line passes through the centroid (x̄, ȳ), its equation can be used
to find the value of a, the intercept on the y-axis:

a = ȳ − b·x̄
The values of a and b are therefore 45.5 and 5.45 respectively giving the regression line:
y = 45.5 + 5.45x
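As a sketch, the slope and intercept can be computed together from the formulas above. The data below are hypothetical (the original sales/temperature table is not reproduced in these notes), so the coefficients differ from the 45.5 and 5.45 of the worked example:

```python
def regression_line(x, y):
    """Least-squares intercept a and slope b for y = a + bx."""
    n = len(x)
    Sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
    Sxx = sum(xi**2 for xi in x) - sum(x)**2 / n
    b = Sxy / Sxx                    # slope
    a = sum(y) / n - b * sum(x) / n  # intercept: a = ybar - b*xbar
    return a, b

# Hypothetical data for illustration only
x = [5, 7, 9, 11, 13, 15]
y = [62, 71, 85, 95, 110, 124]
a, b = regression_line(x, y)
print(f"y = {a:.2f} + {b:.2f}x")
```

Because a = ȳ − b·x̄, the fitted line always passes through the centroid (x̄, ȳ).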
The data are ranked in order of size, using the numbers 1, 2, …, n. If two variables x and
y are ranked in such a manner, Spearman's coefficient of rank correlation is given by

r_rank = 1 − 6Σd² / [n(n² − 1)]

where d = the difference between the ranks of corresponding values of x and y, and
n = the number of pairs of values (x and y) in the data.
e.g.
The following table shows how 10 students, arranged in alphabetical order, were
ranked according to their achievements in both the laboratory and lecture parts of a
biology course. Find the coefficient of rank correlation.
Laboratory  8  3   9  2  7  10  4  6  1  5
Lecture     9  5  10  1  8   7  3  4  2  6
Solution
The difference of ranks, d, in laboratory and lecture for each student is given in the
following table, together with d² and Σd²:

Rank difference d:  −1  −2  −1  1  −1  3  1  2  −1  −1
d²:                  1   4   1  1   1  9  1  4   1   1

Σd² = 24
r_rank = 1 − 6Σd² / [n(n² − 1)] = 1 − 6(24) / [10(10² − 1)] = 1 − 144/990 = 0.8545
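This calculation can be checked with a short sketch using the ranks from the table:

```python
lab     = [8, 3, 9, 2, 7, 10, 4, 6, 1, 5]
lecture = [9, 5, 10, 1, 8, 7, 3, 4, 2, 6]
n = len(lab)

# Squared rank differences for each student
d2 = [(a - b) ** 2 for a, b in zip(lab, lecture)]

# Spearman's formula: r = 1 - 6*sum(d^2) / (n(n^2 - 1))
r_rank = 1 - 6 * sum(d2) / (n * (n**2 - 1))
print(sum(d2), round(r_rank, 4))  # 24 0.8545
```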
This measures the degree of correlation between two sets of observation of paired
values when only the relative order of magnitude is available for each set of data.
An example will make this clearer.
Physics  French  x′  y′  d  d²
3        1       5   7   2  4
2        4       6   4   2  4
1        2       7   6   1  1
4        3       4   5   1  1
6        5       2   3   1  1
5        7       3   1   2  4
7        6       1   2   1  1
                    Σd² = 16
r_rank = 1 − 6Σd² / [n(n² − 1)] = 1 − (6 × 16) / [7(7² − 1)] = 1 − 96/336 = 0.714
E.g. suppose students sat two papers in an examination, Medical Terminology and
Anatomy and Physiology, and instead of the actual marks awarded on each paper they
were told only their ranking in order of merit. To establish whether the performances
on the two papers are correlated, the method of rank correlation is used.
r_rank = 1 − 6Σd² / [n(n² − 1)]
Candidate  Rank (Paper 1)  Rank (Paper 2)   d  d²
1          1               3               −2   4
2          2               2                0   0
3          3               1                2   4
4          4               6               −2   4
5          5               5                0   0
6          6               8               −2   4
7          7               4                3   9
8          8               10              −2   4
9          9               7                2   4
10         10              9                1   1
                                       Σd² = 34
Rank correlation:

r_rank = 1 − 6Σd² / [n(n² − 1)]
where d is the numerical difference between corresponding pairs of ranks and n is the
number of pairs. Applying this formula to the example:

r_rank = 1 − (6 × 34) / [10(10² − 1)]
       = 1 − 204/990
       = 1 − 0.20606
       = 0.79394
This suggests that there is quite a strong relationship between performances in the two papers.
As with all the techniques described so far, correlation analysis has no value for its own
sake. It is useful because, if properly used, it permits theories and hypotheses to be
verified or refuted on the basis of empirical evidence. It cannot be used on its own as
proof of cause and effect. At all times it must be remembered that such specialised
tools may easily be misapplied and give misleading results.
Another exercise
1  70  165   4   3  1   1
2  66  130   9  10  1   1
3  72  180   1   1  0   0
6  64  150  10   6  4  16
7  68  140   7   9  2   4
           Σd² = 39
r_rank = 1 − 6Σd² / [n(n² − 1)]
       = 1 − (6 × 39) / [10(10² − 1)]
       = 1 − 234/990
       = 1 − 0.23636
       = 0.76364
Assignment
Children  w    h
1         121  59.0
2         122  54.5
3         124  61.5
4         126  60.0
5         129  61.0
6         131  60.0
7         133  61.0
Compute:
(a) Rank correlation
Children  w    h     Rw  Rh  d  d²
1         121  59.0  7   6   1   1
2         122  54.5  6   7   1   1
3         124  61.0  5   1   4  16
                        Σd² = 27
r_rank = 1 − 6Σd² / [n(n² − 1)]
       = 1 − 6(27) / [7(49 − 1)]
       = 1 − 162/336
       = 1 − 0.482
       = 0.518
       ≈ 0.52
w    h     (w − w̄)  (h − h̄)  (w − w̄)²  (h − h̄)²  (w − w̄)(h − h̄)
121  59.0   −6        −1        36          1          6
126  60.0   −1         0         1          0          0
129  61.0    2         1         4          1          2
131  60.0    4         0        16          0          0
133  61.0    6         1        36          1          6
Mean of height, h̄ = 60

Regression line: h = a + bw, with slope

b = Σ(w − w̄)(h − h̄) / Σ(w − w̄)²
  = 46/127
  = 0.362
Formula:

r = Σ(w − w̄)(h − h̄) / √[Σ(w − w̄)² · Σ(h − h̄)²]
  = 46 / √(127 × 35.5)
  = 46/67.1
  = 0.686
Coefficient of correlation

Mean x̄ = Σx/n = 1260/10 = 126
Mean ȳ = Σy/n = 590.5/10 = 59.05
r = Σ(x − x̄)(y − ȳ) / √[Σ(x − x̄)² · Σ(y − ȳ)²]
  = 34.5 / √(142 × 103.25)
  = 34.5 / (11.9 × 10.16)
  = 34.5/120.92
  = 0.285
  ≈ 0.29
or
r = Σ(x − x̄)(y − ȳ) / [n · SDx · SDy]
  = 34.5 / [10(3.8 × 3.21)]
  = 34.5 / (10 × 12.2)
  = 34.5/122
  = 0.283
  ≈ 0.29
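The two routes above are algebraically equivalent. A sketch using the seven (w, h) pairs from the assignment data confirms that they give the same r:

```python
import math

# The (w, h) pairs from the assignment table
w = [121, 122, 124, 126, 129, 131, 133]
h = [59.0, 54.5, 61.5, 60.0, 61.0, 60.0, 61.0]
n = len(w)
wbar, hbar = sum(w) / n, sum(h) / n

sp  = sum((wi - wbar) * (hi - hbar) for wi, hi in zip(w, h))  # sum of products
ssw = sum((wi - wbar) ** 2 for wi in w)
ssh = sum((hi - hbar) ** 2 for hi in h)

# Route 1: deviation sums of squares
r1 = sp / math.sqrt(ssw * ssh)

# Route 2: standard deviations, r = sp / (n * SDw * SDh)
sdw, sdh = math.sqrt(ssw / n), math.sqrt(ssh / n)
r2 = sp / (n * sdw * sdh)

print(abs(r1 - r2) < 1e-12)  # True: both formulas agree
```

The agreement follows because n·SDw·SDh = n·√(ssw/n)·√(ssh/n) = √(ssw·ssh).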
Review questions
1. Define correlation
2. Define regression
3. How do you interpret a scatter graph?
4. How do you interpret rank correlation?
CHAPTER SIX
THE CHI SQUARE TEST
Chi is a letter of the Greek alphabet; the symbol is χ and it's pronounced like KYE, the
sound in "kite." The chi square test uses the statistic chi squared, written χ2. The "test"
that uses this statistic helps an investigator determine whether an observed set of
results matches an expected outcome. In some types of research (genetics provides
many examples) there may be a theoretical basis for expecting a particular result- not a
guess, but a predicted outcome based on a sound theoretical foundation.
A familiar example will help to illustrate this. In a single toss of a coin (called a "trial"),
there are two possible outcomes: head and tail. Further, both outcomes are equally
probable. That is, neither one is more likely to occur than the other. We can express this
in several ways; for example, the probability of "head" is 1/2 (= 0.5), and the probability
of "tail" is the same. Then if we tossed a single coin 100 times, we would expect to see
50 heads and 50 tails. That distribution (50:50 or 1:1) is an expected result, and you see
the sound basis for making such a prediction about the outcome. Suppose that you do the
100 tosses and get 48 heads and 52 tails. That is an observed result, a real set of data.
However, 48 heads: 52 tails is not exactly what you expected. Do you suspect
something's wrong because of this difference? No? But why aren't you suspicious? Is
the observed 48 heads: 52 tails distribution close enough to the expected 50 heads: 50
tails (= 1:1) distribution for you to accept it as legitimate?
We need to consider for a moment what might cause the observed outcome to differ from
the expected outcome. You know what all the possible outcomes are (only two: head and
tail), and you know what the probability of each is. However, in any single trial (toss) you
can't say what the outcome will be. Why? Because of the element of chance, which is a
random factor. Saying that chance is a random factor just means that you can't control it.
But it's there every time you flip that coin. Chance is a factor that must always be
considered; it's often present but not recognized. Since it may affect experimental work, it
must be taken into account when results are interpreted.
What else might cause an observed outcome to differ from the expected? Suppose that at
your last physical exam, your doctor told you that your resting pulse rate was 60 (per
minute) and that that's good, that's normal for you. When you measure it yourself later
you find it's 58 at one moment, 63 ten minutes later, 57 ten minutes later. Why isn't it the
same every time, and why isn't it 60 every time? When measurements involving living
organisms are under study, there will always be the element of inherent variability. Your
resting pulse rate may vary a bit, but it's consistently about 60, and those slightly different
values are still normal.
In addition to these factors, there's the element of error. You've done enough lab work
already to realize that people introduce error into experimental work in performing
steps of procedures and in making measurements. Instruments, tools, implements
themselves may have built-in limitations that contribute to error.
Putting all of these factors together, it's not hard to see how an observed result may differ
a bit from an expected result. But these small departures from expectation are not
significant departures. That is, we don't regard the small differences observed as being
important.
What if the observed outcome in your coin toss experiment of 100 tosses (trials) had been
20 heads and 80 tails? Would you attribute to chance that much difference between the
expected and observed distributions? We expect chance to affect results, but not that
much. Such a large departure from expectation makes one suspect that the assumption
about equal probabilities of heads and tails is not valid. Suppose that the coin had been
tampered with, had been weighted, so as to favor the tail side coming up more often?
How do we know where to draw the line between an amount of difference that can be
explained by chance (not significant) and an amount that must be due to something other
than chance (significant)? That is what the chi square test is for, to tell us where to draw
that line.
In performing the χ2 test, you have an expected distribution (like 50 heads: 50 tails) and
an observed distribution (like 40 heads: 60 tails, the results of doing the experiment).
From these data you calculate a χ2 value and then compare that with a predetermined χ2
value that reflects how much difference can be accepted as insignificant, caused by
random chance. The predetermined values of χ2 are found in a table of "critical values."
Such a table is shown on the last page here.
The formula for calculating χ2 is: χ2 = Σ [(o - e) 2 / e], where "o" is observed and "e" is
expected.
The sigma symbol, Σ, means "sum of what follows."
For each category (type or group such as "heads") of outcome that is possible, we would
have an expected value and an observed value (for the number of heads and the number
of tails, e.g.) For each one of those categories (outcomes) we would calculate the quantity
(o - e) 2/e and then add them for all the categories, which was two in the coin toss
example (head category and tail category). It is convenient to organize the data in table
form, as shown below for two coin toss experiments.
              Experiment 1     Experiment 2
              heads   tails    heads   tails
o (observed)   47      53       61      39
e (expected)   50      50       50      50
o − e          −3       3       11     −11
χ²                0.36             4.84
NOTE: do not take square root of χ2. The statistic is χ2, not χ.
Note that in each experiment the total number of observed must equal the total number
of expected. In expt. 1, for example, 47 + 53 = 100 = 50 + 50.
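The table's calculation can be sketched as a small function (a hypothetical helper, not from the notes):

```python
def chi_square(observed, expected):
    """Chi-square statistic: the sum of (o - e)^2 / e over all categories."""
    assert sum(observed) == sum(expected), "totals must match"
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

print(round(chi_square([47, 53], [50, 50]), 2))  # 0.36 (experiment 1)
print(round(chi_square([61, 39], [50, 50]), 2))  # 4.84 (experiment 2)
```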
Having calculated a χ2 value for the data in experiment #1, we now need to evaluate that
χ2 value. To do so we must compare our calculated χ 2 with the appropriate critical value
of χ2 from the table shown on the last page here. [All of these critical values in the table
have been predetermined by statisticians.] To select a value from the table, we need to
know 2 things:
1. The number of degrees of freedom. That is one less than the number of categories
(groups) we have. For our coin toss experiment that is 2 groups - 1 = 1. So our critical
value of χ2 will be in the first row of the table.
2. The probability value, which reflects the degree of confidence we want to have in our
interpretation. The column headings 0.05 and 0.01 correspond to probabilities, or
confidence levels. 0.05 means that when we draw our conclusion, we may be 95%
confident that we have drawn the correct conclusion. That shows that we can't be
certain; there would still be a 5% probability of drawing the wrong conclusion. But 95%
is very good. 0.01 would give us 99% confidence, only a 1% likelihood of drawing the
wrong conclusion. We will now agree that, unless told otherwise, we will always use
the 0.05 probability column (95% confidence level).
For 1 degree of freedom, in our coin toss experiment, the table χ2 value is 3.84. We
compare the calculated χ2 (0.36) to that.
In every χ2-test the calculated χ2 value will either be (i) less than or equal to the critical χ2
value OR (ii) greater than the critical χ2 value.
• If calculated χ2 ≤ critical χ2, then we conclude that there is no statistically significant
difference between the two distributions; the difference observed can be attributed to chance.
• If calculated χ2 > critical χ2, then we conclude that there is a statistically significant
difference between the two distributions. That is, the observed results are significantly
different from the expected results, and the numerical difference between observed and
expected cannot be attributed to chance. That means that the difference found is due to
some other factor. This test won't identify that other factor, only that there is some factor
other than chance responsible for the difference between the two distributions.
For our expt. #1, 0.36 < 3.84. Therefore, we may be 95% confident that there is no
significant difference between the 47:53 observed distribution and the 50:50 expected
distribution. That small difference is due to random chance.
For expt. #2 shown in the table above, the calculated χ2 = 4.84. 4.84 > 3.84. Therefore the
61:39 observed distribution is significantly different from the 50:50 expected
distribution. That much difference cannot be attributed to chance. We may be 95%
confident that something else, some other factor, caused the difference. The χ2-test won't
identify that other factor, only that there is some factor other than chance responsible for
the difference between the two distributions.
Suppose in a coin toss experiment you got 143 heads and 175 tails; see the table setup
below. That's 318 tosses (trials) total. In setting up the table to calculate χ2, note that the
expected 1:1 distribution here means that you expect 159 heads: 159 tails, not 50 heads
and 50 tails as previously when the total was only 100. The point here: the sum of
observed values for all groups must equal the sum of expected values for all groups. In
this example 143 + 175 = 318 and 159 + 159 = 318.
              heads   tails
o (observed)   143     175
e (expected)   159     159
o − e          −16      16
χ² = 3.22
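A sketch of this setup, with the expected counts derived from the observed total rather than assumed to be 50:50:

```python
observed = [143, 175]              # heads, tails
total = sum(observed)              # 318 tosses in all
expected = [total / 2, total / 2]  # a 1:1 ratio means 159 heads : 159 tails

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi2, 2))  # 3.22
```

Since 3.22 is less than 3.84, the critical value for 1 degree of freedom at the 0.05 level, this difference is not significant.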
              tall red  tall white  short red  short white
o (observed)    40         21           9          10
e (expected)    45         15          15           5
o − e           −5          6          −6           5
(o − e)²        25         36          36          25
χ² = 10.36
The given information says that both parents have the genotype G//g R//r. Then the
expected distribution of progeny phenotypes would be 9/16 tall red: 3/16 tall white: 3/16
short red: 1/16 short white. [This is a cross like the first hybrid cross we did in lecture.]
The total number of observed progeny in this cross is 80. So the expected values are
based on that total: 9/16 of 80 = 45 tall red expected, and so forth. (Refer to the setup
table above.) The total number expected must equal the total number observed.
Entering the fractions 9/16, 3/16, etc. in the setup table for expected values is incorrect
and would give a wildly incorrect χ2 value. And that, in turn, would probably lead us to
the wrong conclusion in interpreting the results.
The observed distribution (which is given in the problem) is obviously "different" from
the expected. The numbers aren't the same, are they? But that difference may just be due
to chance, as discussed earlier here. The χ2 -test will help us decide whether the
difference is significant. Here the calculated χ2 value is 10.36. There are 3 degrees of
freedom here (4 categories - 1). So, the critical χ2 value for 0.05 probabilities (see table
on the last page) is 7.81. Since our calculated value exceeds the critical value, we
conclude that there is a significant difference between the observed distribution and the
expected distribution. The difference found here could not be attributed to chance alone.
We may be 95% confident of this conclusion. What does this mean? Perhaps the
inheritance of these traits is not as simple as 2 unlinked loci with dominant and recessive
alleles at each. Or maybe there is some environmental factor that influenced the
108 | P a g e
outcome. That is for additional investigation to determine. The χ2-test alerted us to the
fact that our results differed too much from expectation.
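The whole genetics example can be sketched end to end, with the expected counts built from the 9:3:3:1 fractions applied to the observed total of 80:

```python
# Observed progeny: tall red, tall white, short red, short white
observed = [40, 21, 9, 10]
total = sum(observed)  # 80

# Expected counts: the 9:3:3:1 fractions applied to the total (never the
# raw fractions themselves, which would give a wildly wrong chi-square)
expected = [f * total for f in (9/16, 3/16, 3/16, 1/16)]  # 45, 15, 15, 5

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1  # 3 degrees of freedom
print(round(chi2, 2), df)  # 10.36 3
```

Since 10.36 exceeds the 0.05 critical value of 7.81 for 3 degrees of freedom, the difference is significant.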
Statisticians formally describe what we've just done in terms of testing a hypothesis. This
process begins with stating the "null hypothesis." The null hypothesis says that the
difference found between observed distribution and expected distribution is not
significant, i.e. that the difference is just due to random chance. Then we use the χ2-test to
test the validity of that null hypothesis.
If calculated χ2 ≤ critical χ2, then we accept the null hypothesis. That means that the two
distributions are not significantly different, that the difference we see is due to chance,
not some other factor. On the other hand, if calculated χ2 > critical χ2, then we reject the
null hypothesis. That means that the two distributions are significantly different, that the
difference we see is not due to chance alone.
Note this well: In performing the chi squared test in this course, it is not sufficient in
your interpretation to say "accept null hypothesis" or "reject null hypothesis." You will
be expected to fully state whether the distributions being compared are significantly
different or not and whether the difference is due to chance alone or other factors.
Critical values of χ²

df    0.05    0.01
1     3.84    6.63
2 5.99 9.21
3 7.81 11.34
4 9.49 13.28
5 11.07 15.09
6 12.59 16.81
7 14.07 18.48
8 15.51 20.09
9 16.92 21.67
10 18.31 23.21
11 19.68 24.73
12 21.03 26.22
13 22.36 27.69
14 23.68 29.14
15 25.00 30.58
16 26.30 32.00
17 27.59 33.41
18 28.87 34.81
19 30.14 36.19
20 31.41 37.57
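The decision rule described in the text can be sketched using the 0.05 column of the table (first five rows shown; the helper name is my own):

```python
# 0.05-column critical values from the table above, keyed by degrees of freedom
CRITICAL_05 = {1: 3.84, 2: 5.99, 3: 7.81, 4: 9.49, 5: 11.07}

def evaluate(chi2, df):
    """Apply the 95%-confidence decision rule for the chi-square test."""
    if chi2 <= CRITICAL_05[df]:
        return "accept null hypothesis: difference due to chance alone"
    return "reject null hypothesis: distributions significantly different"

print(evaluate(0.36, 1))   # coin experiment 1: accept
print(evaluate(10.36, 3))  # genetics example: reject
```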
Review questions
1. What is hypothesis testing?
2. How do you calculate and interpret chi square?
REFERENCES
Instructions
Answer ALL questions in Section A and any 2 in Section B
SECTION A
Q1. An outbreak of Pediculosis Capitis is being investigated in a girls’ school
containing 291 pupils. Of 130 children who live in a nearby housing estate, 18 were
infected and of 161 who live elsewhere, 37 were infected.
Q3. a) Make a tree-diagram to illustrate the sample space for the event
b) What is the complement of the event ‘Mr. Quinn’s car is red’? If the probability
of Mr. Quinn having a red car is 44.3 %, what is the probability of the
complement?
c) A bag contains 5 red marbles and 9 blue ones. If I draw two marbles from
the bag without replacement, what is the probability that they are both blue?
d) Compute: 8C5 and 17P4
(20 Marks)
Q5. Illustrate, using diagrams, the following terms in correlation and regression
Q6. The following information refers to patients who attended Kiambu District Hospital over
a period of 2 months in 2001.
135, 132, 141, 131, 165, 142, 158, 171, 182, 164, 147, 161, 163, 158, 172,
140, 141, 150, 127, 166, 166, 172, 180, 158, 159, 149, 154, 161, 173, 182,
169, 159, 155, 148, 150, 157, 156, 141, 163, 168, 156, 169, 176, 175, 161,
176, 169, 154, 152, 144, 143, 159, 160, 135, 161, 152, 157, 185, 169, 170
Q7. A company doctor is investigating the possible effect of stress upon the health of the
company's management employees. She suspects that employees under stress will suffer from
high systolic blood pressure. She takes a random sample of ten employees, aged between 35
and 55 years, and records their age and blood pressure:
Employee  Age (years)  Systolic blood pressure
A 37.2 133
B 39.8 143
C 42.1 135
D 44.6 151
E 47.2 143
F 48.9 158
G 50.0 163
H 51.3 146
I 52.8 168
J 54.4 160
Q8. The following table shows the marks of eight pupils in Biology and Chemistry.
Biology 65 65 70 75 75 80 85 85
Chemistry 50 55 58 55 65 58 61 65
Instructions
Answer ALL questions in Section A and any 2 in Section B
SECTION A
Q1. The following data represent the number of children for 10 physicians on a particular
hospital staff: 5, 4, 2, 3, 6, 9, 8, 4, 7, 4. Using the above data, find the following
descriptive measures:
Q2. The following information refers to patients who attended Mbagathi District Hospital
over a period of 2 months in 2001.
135, 132, 141, 131, 165, 142, 158, 171, 182, 164, 147, 161, 163, 158, 172,
140, 141, 150, 127, 166, 166, 172, 180, 158, 159, 149, 154, 161, 173, 182,
169, 159, 155, 148, 150, 157, 156, 141, 163, 168, 156, 169, 176, 175, 161,
176, 169, 154, 152, 144, 143, 159, 160, 135, 161, 152, 157, 185, 169, 170
Malaria 96 78 72 65 77 71 67 73 53
T.B 13 10 9 10 10 9 8 7 6
a) Draw a component bar chart of the volume of diseases given in the above table from
April to August.
d) Use the chart to write a brief interpretation for the indicated period.
(20 Marks)
Q4. a) Describe how health data can be used in evidence based decision making in
hospital
(20 Marks)
PART ONE
SECTION A
SECTION B
SECTION C
1. When inspecting data after collection for correction of obvious mistakes and omissions,
the process is referred to as data……………..
2. Arrangement of data according to geographic location…………………
PART TWO