Study Manual Quantitative Methods
QUANTITATIVE METHODS
CODE:
COPYRIGHT
Published by the International University of Management
Windhoek, Namibia
© International University of Management 2010
No part of this publication may be reproduced, stored in retrieval system or transmitted in any form or by
any means, electronic, mechanical, photocopying, recording or otherwise, without prior permission of the
publishers.
International University of Management
59 Bahnhof Street
Private Bag 14005
Windhoek
Telephone (264 61) 245150/84
Fax (264 61) 248112
E-mail: [email protected]
Website: www.ium.edu.na
TABLE OF CONTENTS
Bibliography
Quantitative Methodology
Students will have a chance to play the roles of both clients (in defining the
research process and objectives) and researchers in defining the research
study, sample population, sample size, questionnaire and the methodology
for data analysis and interpretation.
Given the progressive nature of the course as per the course outline,
coverage of the various statistical tools cannot be extensive. Hence,
students who have a specific interest in this area are expected to pursue the
advanced modules in years 3 and 4.
Learning Objectives
Textbook
Required
1. Ticehurst, G. W. & Veal, T. R. (1999). Business Research Methods: A
Managerial Approach, Longman
Recommended
1. Hannagan T. (1997). Mastering Statistics, 3rd Ed., Macmillan
2. Salkind, N.J. (2000). Exploring Research, 4th Ed., Prentice Hall
3. Trochim, W. (2001). The Research Methods Knowledge Base, 2e
Available online at http://www.atomicdogpublishing.com/home.asp
Evaluation
There will be at least one exercise for each module, a midterm test, a
research project and a final examination based on the lectures, assigned
readings and class discussions.
Grading
UNIT 1: INTRODUCTION TO QUANTITATIVE METHODS AND
RESEARCH CONCEPTS
Quantitative methods
Research
Research is a systematic investigation or study to establish or clarify
principles, theories, facts and relationships. Such activities usually start with
identifying the research problem and end with a clearly defined objective.
The objective statement is often based on some basic assumption or
hypothesis. Scientific research includes the process of investigating and
proving a potential application of established scientific and/or engineering
knowledge.
Research methods
Research methods are the systematic application of one or more techniques
to investigate the research problem. A method can be clearly described, and
is capable of being repeated by different people on different occasions.
Research proposal
A research proposal is a clear specification of the research, aimed at
securing authority and funding for the research.
Accreditation
Formal procedures giving students certification after completion of studies,
e.g. a certificate, a diploma, or a degree.
Added value
Baseline Research
Bias
Any influence that distorts the results of a research study
Census
A count of every member of a population
Closed-Ended Question
Cluster Analysis
Confidence Interval
A range of values within which a population mean lies, for a given confidence
level
Confidence Level
The probability that a population mean lies within a given range of values
Convenience Sample
A sample selected merely on the participants’ availability or ability to
participate
Correlation
An established relationship between two variables
Database
A collection of related data organized for quick access
Demographic Data
Objective and descriptive population data that is easily identifiable, e.g. age,
income, gender, and education level
Dependent Variable
Descriptive statistics
Statistical methods used to describe or summarise data collected from a
specific sample (e.g. mean, median, mode, range, standard deviation).
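The descriptive statistics named above can be computed with Python's standard `statistics` module. This is a minimal sketch: the list `scores` is a hypothetical sample invented for illustration and does not come from this manual.

```python
import statistics

# Hypothetical sample of seven test scores (illustrative values only).
scores = [2, 4, 4, 5, 7, 9, 11]

mean = statistics.mean(scores)           # arithmetic average
median = statistics.median(scores)       # middle value of the sorted data
mode = statistics.mode(scores)           # most frequent value
value_range = max(scores) - min(scores)  # largest value minus smallest value
std_dev = statistics.stdev(scores)       # sample standard deviation (divides by n - 1)

print(mean, median, mode, value_range, round(std_dev, 2))
```

Note that `statistics.stdev` treats the data as a sample; `statistics.pstdev` would treat it as a whole population.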
Frequency Distribution
Variable
A variable in an experiment whose value is thought to affect the value of
another variable (the dependent variable).
Independent variable
The variable (or antecedent) that is assumed to cause or influence the
dependent variable(s) or outcome. The independent variable is manipulated
in experimental research to observe its effect on the dependent variable(s).
It is sometimes referred to as the treatment variable.
Informed consent
Interval Scale
A measurement scale in which all levels are equally spaced, such that an
increase of one unit on the scale is equal to the same increase at a different
point on the scale; it does not include a true zero point; e.g., the 10-point
rating scale (1 to 10) used in gymnastics is an interval scale. The categories
are ordered and there are equal intervals between points on the scale, but
the zero point on the scale is arbitrary so that a particular measure cannot
be said to be 'twice as' large as another measure on the same scale (e.g.
degrees Centigrade).
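The point about "twice as large" can be checked with a short sketch. The two temperatures are hypothetical, and the conversion to kelvin (a ratio scale) is added here only for contrast; it is not part of the definition above.

```python
# Two temperatures in degrees Celsius (an interval scale with an arbitrary zero).
cool_c, warm_c = 10.0, 20.0

def celsius_to_kelvin(c):
    """Convert Celsius to kelvin, a ratio scale whose zero is absolute zero."""
    return c + 273.15

# The ratio of the Celsius readings depends on where zero happens to be placed.
naive_ratio = warm_c / cool_c
# On the ratio scale the warmer temperature is only about 3.5% larger.
true_ratio = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cool_c)

# Differences, by contrast, are meaningful on an interval scale.
difference_c = warm_c - cool_c
difference_k = celsius_to_kelvin(warm_c) - celsius_to_kelvin(cool_c)

print(naive_ratio, round(true_ratio, 3), difference_c, round(difference_k, 2))
```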
Leading Question
Questions designed to direct the respondent to provide a response desired
by the questioner
Likert Scale
Likert-Type Scale
Any measurement scale wherein a respondent is asked to rate his or her
attitude regarding a given statement
Mean
The average value.
Mode
Median
The mid-point; exactly one-half of responses are less than the median and
one-half are greater than the median. A descriptive statistic used to measure
central tendency. The median is the score/value that is exactly in the middle
of a distribution (i.e. the value above which and below which 50% of the
scores lie).
Methodology
The process used and steps taken to collect data in a research effort
Nominal Scale
order (e.g. classification by gender or by the colour of a person's hair or
eyes)
Open-Ended Question
Ordinal Scale
A measurement scale in which the values are ordered, but not equally
spaced; for example, finishers in a race are measured on an ordinal scale
(1st, 2nd, 3rd, etc.). These categories can be used to rank order a variable,
but the intervals between categories are not equal or fixed (e.g. strongly
agree, agree, neither agree nor disagree, disagree, strongly disagree; social
class I professional, II semi-professional, IIIa non-manual, IIIb manual, IV
semi-skilled, and V unskilled).
Population
The total number of people in which a marketer is interested. A well-defined
group or set that has certain specified properties (e.g. all registered teachers
working full-time in Namibia).
Qualitative Research
Research such as focus groups and in-depth interviews, the results of which
cannot be statistically applied to a population.
Quantitative Research
Research design:
The strategic plan for a research project that sets out the outline
and key features of data collection methodology and analysis, including a
detailed consideration of the research parameters.
Random Sample
Random sampling
Rating Scale
Ratio Scale
An interval scale with a true zero point; for example, height and
distance are measured on ratio scales
Range
Regression Analysis
Sample
A subset of the population
Sample Error
The chance that the results of a study are due to “misrepresentation” of the
sample caused by random chance; for example, out of 1 million flips of a
penny, the penny will land heads-up 50% of the time and tails-up 50% of
the time. If we look at a sample of 100 of those 1 million flips, heads could
come up 70% of the time even though, over the long-term, it will only occur
50% of the time—this difference is due to sample error.
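Sample error can be illustrated with a small simulation. This is a sketch under stated assumptions: a fair coin, Python's pseudo-random generator, an arbitrary seed, and sample counts chosen only to keep the run short.

```python
import random

def heads_proportion(n_flips, rng):
    """Flip a fair coin n_flips times and return the share of heads."""
    heads = sum(rng.random() < 0.5 for _ in range(n_flips))
    return heads / n_flips

rng = random.Random(2010)  # fixed seed so the run is repeatable (arbitrary choice)

# Many small samples of 100 flips each, and one large sample.
small_samples = [heads_proportion(100, rng) for _ in range(1000)]
large_sample = heads_proportion(200_000, rng)

print("range over 1,000 samples of 100 flips:", min(small_samples), max(small_samples))
print("one sample of 200,000 flips:", large_sample)
```

The small samples scatter widely around 50%, while the large sample lands very close to it; the scatter in the small samples is the sample error.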
Sample Frame
A representative list from which the final sample for the study is drawn. This
means that the sample frame must be representative of the population, else
the sample will not be representative of the population; for example, the
telephone book can be used as a sample frame, but it does not include
people without a telephone, thereby biasing the sample drawn from that
frame against people without telephones
Sample Size
Secondary Research
Research gathered from sources other than directly from the population, e.g.
publications, associations, government research
Significance level
Established at the outset by a researcher when using statistical analysis to
test a hypothesis (e.g. 0.05 level or 0.01 significance level). A significance
level of 0.05 indicates that an observed difference or relationship would be
found by chance only 5 times out of every 100 (1 out of every 100 for a
significance level of 0.01). It indicates the risk of the
researcher making a Type I error (i.e. an error that occurs when a
researcher rejects the null hypothesis when it is true and concludes that a
statistically significant relationship/difference exists when it does not).
Standard deviation
Statistical Significance
A term used to indicate whether the results of an analysis of data drawn
from a sample are unlikely to have been caused by chance at a specified level
of probability (usually 0.05 or 0.01).
Stratified Sample
Theoretical framework
The conceptual underpinning of a research study which may be based on
theory or a specific conceptual model (in which case it may be referred to as
the conceptual framework).
T-Test
A statistical test to determine if two groups are significantly different from
one another.
Type I Error
A false positive; incorrectly accepting a positive result to be true
Type II Error
A false negative; incorrectly accepting a negative result to be true
Variable
A quantity or function that may assume a given value or set of values.
Variance
The total amount of variation in values for a given variable; for example, the
average daily temperature in Swakopmund, Namibia has a larger variance
over a period of a year (there is a greater amount of variation in
temperatures as the seasons change) than the average daily temperature at
the Cape (Cape Town, South Africa), which, although it changes slightly
day-to-day, does not change very much.
UNIT 2: Introduction to Linear Correlation and Regression
Correlation and regression refer to the relationship that exists between two
variables, X and Y, in the case where each particular value of Xi is paired
with one particular value of Yi.
For example:
• the measures of height of individuals paired with their corresponding
measures of weight;
• This means that the more you have of this variable, the more you
have of that one.
• Or conversely, the more you have of this variable, the less you have of
that one.
• Or, the more you have of height, the more you will tend to have of
weight;
• the more students study prior to a statistics exam, the more they will
tend to do well on the exam.
• Or conversely, the greater the amount of class time prior to the exam
that students spend snoozing and daydreaming, the less they will tend
to do well on the exam.
• In the first kind of case (the more of this, the more of that), you are
speaking of a positive correlation between the two variables; and in
the second kind (the more of this, the less of that), you are speaking
of a negative correlation between the two variables.
Correlation and regression are two sides of the same coin. This means
that you can begin with either one and end up with the other.
Correlation
Figure 3.1 below shows the relationship between two variables (the
percentage of high school seniors taking the SAT and the average state score
on the SAT) for all the 50 schools sampled in the regions. Within the context
of correlation and regression, a two-variable coordinate plot of this general
type is typically spoken of as a scatterplot or scattergram.
Figure 3.1 Percentage of High School Seniors Taking the SAT versus
Average Combined State SAT Scores: 2008
Schools in Region    Xi (percentage taking SAT)    Yi (average SAT score)
1                    X1                            Y1
2                    X2                            Y2
For the present example, designating the percentage of high school seniors
within a state taking the SAT as X, and the state's combined average SAT
score as Y, we would have a total of N=50 paired values of Xi and Yi. Thus
for schools in Khomas, Xi=5% would be paired with Yi=1103; for schools in
Omusati, Xi=81% would be paired with Yi=903; and so on for all the 50
schools in the dataset. The entire bivariate list would look like the following,
except that the abstract designations for Xi and Yi would of course be
replaced by particular numerical values.
A further convention in bivariate coordinate plotting applies only to those
cases where a causal relationship is known or hypothesized to exist between
the two variables. In examining the relationship between two causally
related variables, the independent variable is the one that is capable of
influencing the other, and the dependent variable is the one that is
capable of being influenced by the other. For example, growing taller will
tend to make you grow heavier, whereas growing heavier will have no
systematic effect on whether you grow taller.
In the present SAT example, the percentage of high school seniors within a
state who take the SAT can conceivably affect the state's average score on
the SAT, whereas the state's average score in any given year cannot
retroactively influence the percentage of high school seniors who took the
test. Thus, the percentage of high school seniors taking the test is the
independent variable, X, while the average state score is the dependent
variable, Y.
In cases of this type, the convention is to reserve the X axis for the
independent variable and the Y axis for the dependent variable. For cases
where the distinction between "independent" and "dependent" does not
apply, it makes no difference which variable is called X and which is called Y.
At any rate, the clear message of Figure 3.1 is that states with relatively low
percentages of high school seniors taking the SAT in 2006 tended to have
relatively high average SAT scores, while those that had relatively high
percentages of high school seniors taking the SAT tended to have relatively
low average SAT scores. The relationship is not a perfect one, though it is
nonetheless clearly visible to the naked eye. The following version of
Figure 3.1 will make it even more visible. It is the same as shown before,
except that now we include the straight line that forms the best "fit" of this
relationship. We will return to the meaning and derivation of this line a bit
later.
A relationship that can be described by a straight line is spoken of as linear
(short for 'rectilinear'), while one that can be described by a curved line is
spoken of as curvilinear. We will touch upon the subject of curvilinear
correlation in a later chapter. Our present coverage will be confined to linear
correlation.
Figures 3.2a-e illustrate the various forms that linear correlation is capable
of taking. The basic possibilities are: (i) positive correlation; (ii) negative
correlation; and (iii) zero correlation. In the case of zero correlation, the
coordinate plot will look something like the rather patternless jumble shown
in Figure 3.2a, reflecting the fact that there is no systematic tendency for X
and Y to be associated, either the one way or the other.
The plot for a positive correlation, on the other hand, will reflect the
tendency for high values of Xi to be associated with high values of Yi, and
vice versa; hence, the data points will tend to line up along an upward
slanting diagonal, as shown in Figure 3.2b. The plot for negative correlation
will reflect the opposite tendency for high values of Xi to be associated with
low values of Yi, and vice versa; hence, the data points will tend to line up
along a downward slanting diagonal, as shown in Figure 3.2d.
The limiting case of linear correlation, as illustrated in Figures 3.2c and 3.2e,
is when the data points line up along the diagonal like beads on a taut
string. This arrangement, typically spoken of as perfect correlation, would
represent the maximum degree of linear correlation, positive or negative,
that could possibly exist between two variables.
In the real world you will normally find perfect linear correlations only in the
realm of basic physical principles; for example, the relationship between
voltage and current in an electrical circuit with constant resistance. Among
the less tidy phenomena of the behavioral and biological sciences, positive
and negative linear correlations are much more likely to be of the
"imperfect" types illustrated in Figures 3.2b and 3.2d.
A closely related companion measure of linear correlation is the coefficient
of determination, symbolized as r², which is simply the square of the
correlation coefficient. The coefficient of determination can have only
non-negative values, ranging from r²=+1.0 for a perfect correlation (positive or
negative) down to r²=0.0 for a complete absence of correlation. The
advantage of the correlation coefficient, r, is that it can have either a
positive or a negative sign and thus provide an indication of the positive or
negative direction of the correlation.
Alternatively, you could say that the looser positive correlation of Example II
is only 44% as strong as the perfect one shown in Example I. The essential
meaning of "strength of correlation" in this context is that such-and-such
percentage of the variability of Y is associated with (tied to, linked with,
coupled with) variability in X, and vice versa. Thus, for Example I, 100% of
the variability in Y is coupled with variability in X; whereas, in Example II,
only 44% of the variability in Y is linked with variability in X.
The correlations shown in Examples III and IV are obviously mirror images
of the ones just described. For Example III the six values of Xi and Yi are
paired in such a way as to produce a perfect negative correlation, which
yields a correlation coefficient of r=-1.0 and a coefficient of determination
of r²=1.0.
You can also go further and say that the perfect positive and negative
correlations in Examples I and III are of equal strength (for both, r²=1.0)
but in opposite directions; and similarly, that the looser positive and
negative correlations in Examples II and IV are of equal strength (for both,
r²=0.44) but in opposite directions.
To illustrate the next point in closer detail, we will focus for a moment on the
particular pairing of Xi and Yi values that produced the positive correlation
shown in Example II of Figure 3.3.
Pair   Xi   Yi
a      1    6
b      2    2
c      3    4
d      4    10
e      5    12
f      6    8

When you perform the computational procedures for linear correlation and
regression, what you are essentially doing is defining the straight line that
best fits the bivariate distribution of data points, as shown in the following
version of the same graph. This line is spoken of as
the regression line, or line of regression, and the criterion for "best fit" is
that the sum of the squared vertical distances (the green lines ||||) between
the data points and the regression line must be as small as possible.
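The "best fit" criterion can be tried out on the six pairs a-f listed above. The sketch below uses the standard least-squares expressions for the slope and intercept; these are an assumption brought in for illustration, since the manual has not yet stated them, and the function name and the sizes of the nudges are arbitrary.

```python
# The six (Xi, Yi) pairs a-f from the table above.
x = [1, 2, 3, 4, 5, 6]
y = [6, 2, 4, 10, 12, 8]
n = len(x)

mean_x = sum(x) / n
mean_y = sum(y) / n

# Standard least-squares estimates (assumed here, not derived in this section).
slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
intercept = mean_y - slope * mean_x

def sum_squared_vertical_distances(a, b):
    """Sum of squared vertical distances from the points to the line y = a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

best = sum_squared_vertical_distances(intercept, slope)

# Nudging the intercept and/or slope away from the fitted values increases the sum.
nudged = [sum_squared_vertical_distances(intercept + da, slope + db)
          for da in (-0.5, 0.0, 0.5) for db in (-0.2, 0.0, 0.2)
          if (da, db) != (0.0, 0.0)]

print(round(intercept, 2), round(slope, 2), round(best, 2), round(min(nudged), 2))
```

Every nudged line gives a larger sum of squared vertical distances than the fitted line, which is what "best fit" means in this context.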
The details of this line—in particular, where it begins on the Y axis and the
rate at which it slants either upward or downward—will not be explicitly
drawn out until we consider the regression side of correlation and
regression. Nonetheless, they are implicitly present when you perform the
computational procedures for the correlation side of the coin. As indicated
above, the slant of the line upward or downward is what determines the sign
of the correlation coefficient (r), positive or negative; and the degree to
which the data points are lined up along the line, or scattered away from it,
determines the strength of the correlation (r²).
You have already encountered the general concept of variance for the case
where you are describing the variation that exists among the variate
instances of a single variable. The measurement of linear correlation
requires an extension of this concept to the case where you are describing
the co-variation that exists among the paired bivariate instances of two
variables, X and Y, together. We have already touched upon the general
concept. In positive correlation, high values of X tend to be associated with
high values of Y, and low values of X tend to be associated with low values
of Y.
In negative correlation, it is the opposite: high values of X tend to be
associated with low values of Y, and low values of X tend to be associated
with high values of Y. In both cases, the phrase "tend to be associated" is
another way of saying that the variability in X tends to be coupled with
variability in Y, and vice versa—or, in brief, that X and Y tend to vary
together. The raw measure of the tendency of two variables, X and Y, to co-
vary is a quantity known as the covariance. As it happens, you will not
need to be able to calculate the quantity of covariance in and of itself,
because the destination we are aiming for, the calculation of r and r², can be
reached by way of a simpler shortcut.
However, you will need to have at least the general concept of it; so keep in
mind as we proceed through the next few paragraphs that covariance is a
measure of the degree to which two variables, X and Y, co-vary.
\[ r = \frac{\text{observed covariance}}{\text{maximum possible positive covariance}} \]

The maximum possible positive covariance is the geometric mean of the two variances.
For any n numerical values, a, b, c, etc., the geometric mean is the nth root of the
product of those values. Thus, the geometric mean of a and b would be the square
root of a x b; the geometric mean of a, b and c would be the cube root of a x b x c; and
so on.

\[ r = \frac{\text{observed covariance}}{\sqrt{\text{variance}_X \times \text{variance}_Y}} \]
Although in principle this relationship involves two variances and a
covariance, in practice, through the magic of algebraic manipulation, it boils
down to something that is computationally much simpler. In the following
formulation you will immediately recognize the meaning of SSX, which is the
sum of squared deviates for X; by extension, you will also be able to
recognize SSY, which is the sum of squared deviates for Y.

Recall that "sqrt" means "the square root of." In order to get from the formula
above to the one below, you will need to recall that the variance (s²) of a set
of values is simply the average of the squared deviates: SS/N.

\[ r = \frac{SC_{XY}}{\sqrt{SS_X \times SS_Y}} \]
The third item, SCXY, denotes a quantity that we will speak of as the sum of
co-deviates; and as you can no doubt surmise from the name, it is
something very closely akin to a sum of squared deviates. SSX is the raw
measure of the variability among the values of Xi; SSY is the raw measure of
the variability among the values of Yi; and SCXY is the raw measure of the
co-variability of X and Y together.
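The relationship between the raw measures and the variance-based ratio can be checked numerically. This sketch builds SSX, SSY and SCXY directly from deviates for the Example II pairs; the variable names are illustrative choices, and the variances and covariance are taken as the raw measures divided by N, following the SS/N definition given above.

```python
x = [1, 2, 3, 4, 5, 6]
y = [6, 2, 4, 10, 12, 8]
n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Raw measures built directly from deviates (each value minus its mean).
ss_x = sum((xi - mean_x) ** 2 for xi in x)                          # squared deviates of X
ss_y = sum((yi - mean_y) ** 2 for yi in y)                          # squared deviates of Y
sc_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))  # co-deviates

# Averaging each raw measure over N gives the variances and the covariance.
var_x, var_y, cov_xy = ss_x / n, ss_y / n, sc_xy / n

r_from_variances = cov_xy / (var_x * var_y) ** 0.5
r_from_sums = sc_xy / (ss_x * ss_y) ** 0.5  # the N's cancel, so the two agree

print(ss_x, ss_y, sc_xy, round(r_from_variances, 4), round(r_from_sums, 4))
```

Because N appears once in the numerator and once (under the square root, squared) in the denominator, it cancels, which is why the simpler SS and SC form gives the same r.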
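For a value of Xi, the squared deviate is (deviateX) x (deviateX). For a value
of Yi it is (deviateY) x (deviateY). For a pair of Xi and Yi values, the
co-deviate is (deviateX) x (deviateY).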
This should give you a sense of the underlying concepts. Just keep in mind,
no matter what particular computational sequence you follow when you
calculate the correlation coefficient, that what you are fundamentally
calculating is the ratio

\[ r = \frac{\text{observed covariance}}{\text{maximum possible positive covariance}} \]

which, for computational purposes, comes down to:

\[ r = \frac{SC_{XY}}{\sqrt{SS_X \times SS_Y}} \]
Now for the nuts-and-bolts of it. Here, once again, is the particular pairing of
Xi and Yi values that produced the positive correlation shown in Example II
of Figure 3.3. But now we subject them to a bit of number-crunching,
calculating the square of each value of Xi and Yi, along with the cross-
product of each XiYi pair. These are the items that will be required for the
calculation of the three summary quantities in the above formula: SSX, SSY,
and SCXY.

Pair   Xi   Yi   Xi²   Yi²   XiYi
a      1    6    1     36    6
b      2    2    4     4     4
c      3    4    9     16    12
d      4    10   16    100   40
e      5    12   25    144   60
f      6    8    36    64    48
sums   21   42   91    364   170
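The sum of squared deviates for a set of Xi values can be calculated
according to the computational formula

\[ SS_X = \sum X_i^2 - \frac{(\sum X_i)^2}{N} \]

Thus:
SSX = 91 - (441/6) = 17.5

Similarly, for the set of Yi values:

\[ SS_Y = \sum Y_i^2 - \frac{(\sum Y_i)^2}{N} \]

Thus:
SSY = 364 - (1764/6) = 70.0

The sum of co-deviates is calculated in the same fashion from the cross-products:

\[ SC_{XY} = \sum X_i Y_i - \frac{(\sum X_i)(\sum Y_i)}{N} \]

ΣXi = 21
ΣYi = 42
(ΣXi)(ΣYi) = 21 x 42 = 882
Σ(XiYi) = 170
Thus:
SCXY = 170 - (882/6) = 23.0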
Once you have these preliminaries, SSX = 17.5, SSY = 70.0, and SCXY = 23.0,
you can then easily calculate the correlation coefficient as

\[ r = \frac{SC_{XY}}{\sqrt{SS_X \times SS_Y}} = \frac{23.0}{\sqrt{17.5 \times 70.0}} = +0.66 \]
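The hand calculation above can be reproduced step by step. This is a minimal sketch using the computational formulas (sums of squares and cross-products) with the Example II pairs; the variable names are illustrative.

```python
import math

x = [1, 2, 3, 4, 5, 6]
y = [6, 2, 4, 10, 12, 8]
n = len(x)

sum_x, sum_y = sum(x), sum(y)                  # 21 and 42
sum_x2 = sum(xi ** 2 for xi in x)              # 91
sum_y2 = sum(yi ** 2 for yi in y)              # 364
sum_xy = sum(xi * yi for xi, yi in zip(x, y))  # 170

ss_x = sum_x2 - sum_x ** 2 / n      # 91 - 441/6
ss_y = sum_y2 - sum_y ** 2 / n      # 364 - 1764/6
sc_xy = sum_xy - sum_x * sum_y / n  # 170 - 882/6

r = sc_xy / math.sqrt(ss_x * ss_y)
r_squared = r ** 2

print(ss_x, ss_y, sc_xy, round(r, 2), round(r_squared, 2))
```

One rounding detail is worth noting: the unrounded r is about 0.657, so r² computed from it is about 0.43, whereas squaring the rounded value 0.66 gives the 0.44 quoted in the text.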
Recall that each example starts out with the same values of Xi and Yi:
Xi = {1, 2, 3, 4, 5, 6} and Yi = {2, 4, 6, 8, 10, 12}
They differ only with respect to how these values are paired up with one
another.
Hence, the following values will remain the same from one example to
another
N = 6
ΣXi = 21    ΣXi² = 91    ΣYi = 42    ΣYi² = 364
SSX = 17.5    SSY = 70.0
The only thing that changes is the co-variation, as measured in its rawest
form by the sum of the XiYi cross-products (shown in the red cell), and then
by SCXY, the sum of co-deviates. Recall that the computational formulas for
the relevant SS and SC measures are
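\[ SS_X = \sum X_i^2 - \frac{(\sum X_i)^2}{N} \qquad SS_Y = \sum Y_i^2 - \frac{(\sum Y_i)^2}{N} \qquad SC_{XY} = \sum X_i Y_i - \frac{(\sum X_i)(\sum Y_i)}{N} \]

and that the correlation coefficient is

\[ r = \frac{SC_{XY}}{\sqrt{SS_X \times SS_Y}} \]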
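The claim that only the co-variation changes from one pairing to another can be checked with a short loop. In this sketch the pairings for Examples I and III are the in-order and reversed pairings of the two value sets (the only pairings that give a perfect correlation), and Example II is the pairing tabulated earlier. The pairing for Example IV is not listed in the text, so it is left out rather than guessed. The helper name `sums_for` is an illustrative choice.

```python
import math

x = [1, 2, 3, 4, 5, 6]
pairings = {
    "Example I (perfect positive)": [2, 4, 6, 8, 10, 12],
    "Example II (looser positive)": [6, 2, 4, 10, 12, 8],
    "Example III (perfect negative)": [12, 10, 8, 6, 4, 2],
}

def sums_for(x, y):
    """Return SSX, SSY and SCXY using the computational formulas."""
    n = len(x)
    ss_x = sum(v * v for v in x) - sum(x) ** 2 / n
    ss_y = sum(v * v for v in y) - sum(y) ** 2 / n
    sc_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    return ss_x, ss_y, sc_xy

results = {}
for label, y in pairings.items():
    ss_x, ss_y, sc_xy = sums_for(x, y)
    r = sc_xy / math.sqrt(ss_x * ss_y)
    results[label] = (ss_x, ss_y, sc_xy, r)
    print(label, ss_x, ss_y, sc_xy, round(r, 2))
```

SSX and SSY stay at 17.5 and 70.0 in every case; only SCXY, and therefore r, moves.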
UNIT 3: RANK-ORDER CORRELATION
Correlation applies to those cases where the values of X and of Y are both
measured on an equal-interval scale. It is also possible to apply the
apparatus of linear correlation to cases where X and Y are measured on a
merely ordinal scale. When applied to ordinal data, the measure of
correlation is spoken of as the Spearman rank-order correlation coefficient,
typically symbolized as rs.
Suppose, for example, that two experts, X and Y, were asked to rank N=8
items with respect to some dimension germane to their field of expertise
(rank#1=highest, rank#8=lowest). To make it specific, you can imagine two
physicians ranking 8 patients with respect to the severity of their disease;
two psychotherapists ranking 8 patients with respect to the likelihood of
improvement; two wine experts ranking 8 wines from best to worst; two
statisticians ranking 8 statistical concepts with respect to their fundamental
importance; or whatever else it might be that strikes your fancy.
wine X Y
a 1 2
b 2 1
c 3 5
d 4 3
e 5 4
f 6 7
g 7 8
h 8 6
As you can see from the rankings above, there is a substantial degree of
agreement between the two experts. If you plug the bivariate values of X and
Y into the formulaic structure for linear correlation, you get r=+.83 and
r²=.69.
As it happens, these are exactly the same values you will get when you
calculate the Spearman coefficient, rs. The simple reason for this is that r
and rs are algebraically equivalent in the case where the values of X and Y
consist of two sets of N rankings. The only advantage of rs is that the
calculations are easier if you are doing them by hand. [Note, however, that
rs is precisely equal to r only when the rankings within X and Y are the
consecutive integer values: 1, 2, 3, and so on, with no ties. With tied ranks,
rs will tend to be larger than r. If the proportion of tied ranks is fairly large,
you would be better advised to plug your rankings for X and Y into the
standard formula for r.]
Here is the same table you saw above, except now we also take the
difference between each pair of ranks (D=X-Y), and then the square of each
difference. All that is required for the calculation of the Spearman
coefficient are the values of N and ΣD², according to the formula

\[ r_s = 1 - \frac{6 \sum D^2}{N(N^2 - 1)} \]

wine   X   Y   D    D²
a      1   2   -1   1
b      2   1   1    1
c      3   5   -2   4
d      4   3   1    1
e      5   4   1    1
f      6   7   -1   1
g      7   8   -1   1
h      8   6   2    4

N = 8    ΣD² = 14
If this formula seems a bit odd to you, you are in good company.
Generations of statistics students have been presented with it, and
generations have puzzled over such mind-bending questions as: why do you
start out with "1" and subtract something from it?; where does that
N(N²-1) in the denominator come from?; and, above all, how does that peculiar
"6" get into the numerator?
o For any set of N paired bivariate ranks, the minimum possible value
of ΣD² occurs in the case of perfect positive correlation. In this case,
rank 1 for X is paired with rank 1 for Y, rank 2 for X with rank 2 for Y,
and so on. Each value of D will accordingly be equal to zero, and so
too will be the sum of the squared values of D.
o Conversely, the maximum possible value of ΣD² occurs in the case of
perfect negative correlation. This maximum possible value is in every
instance equal to

\[ \text{maximum } \sum D^2 = \frac{N(N^2 - 1)}{3} \]

Thus, for N=8 with perfect negative correlation:
Item   X   Y   D    D²
a      1   8   -7   49
b      2   7   -5   25
c      3   6   -3   9
d      4   5   -1   1
e      5   4   1    1
f      6   3   3    9
g      7   2   5    25
h      8   1   7    49

ΣD² = 168
8(8²-1)/3 = 168
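o The ratio of any observed value of ΣD² to this maximum possible value is
therefore

\[ \frac{\sum D^2}{N(N^2 - 1)/3} = \frac{3 \sum D^2}{N(N^2 - 1)} \]

o Double this ratio, subtract it from 1, and voila! you have a quantity
that will be equal to +1.0 in the case of perfect positive correlation, to
-1.0 in the case of perfect negative correlation, and to zero in the
case of zero correlation.

\[ r_s = 1 - \frac{6 \sum D^2}{N(N^2 - 1)} \]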
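The two limiting cases in this argument can be verified for several values of N. The sketch below assumes untied ranks 1 to N, as the formula does; the function names are illustrative choices.

```python
def sum_squared_rank_differences(x_ranks, y_ranks):
    """Sum of D^2, where D is the difference between paired ranks."""
    return sum((a - b) ** 2 for a, b in zip(x_ranks, y_ranks))

def spearman_rs(x_ranks, y_ranks):
    """rs = 1 - 6 * sum(D^2) / (N * (N^2 - 1)); assumes untied ranks 1..N."""
    n = len(x_ranks)
    return 1 - 6 * sum_squared_rank_differences(x_ranks, y_ranks) / (n * (n ** 2 - 1))

for n in range(2, 11):
    ranks = list(range(1, n + 1))
    reversed_ranks = ranks[::-1]
    # Identical rankings give the minimum (0); reversed rankings give the maximum.
    assert sum_squared_rank_differences(ranks, ranks) == 0
    assert sum_squared_rank_differences(ranks, reversed_ranks) == n * (n ** 2 - 1) // 3
    assert spearman_rs(ranks, ranks) == 1.0
    assert spearman_rs(ranks, reversed_ranks) == -1.0

print(sum_squared_rank_differences(list(range(1, 9)), list(range(8, 0, -1))))
```

Since the maximum of ΣD² is N(N²-1)/3, doubling the ratio of observed to maximum gives a value between 0 and 2, and subtracting that from 1 maps it onto the familiar range from +1 to -1. That is where the "6" comes from: 2 x 3.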
And here, finally, is the calculation of rs for the example with which we
began:
wine   X   Y   D    D²
a      1   2   -1   1
b      2   1   1    1
c      3   5   -2   4
d      4   3   1    1
e      5   4   1    1
f      6   7   -1   1
g      7   8   -1   1
h      8   6   2    4

\[ r_s = 1 - \frac{6 \sum D^2}{N(N^2 - 1)} = 1 - \frac{6 \times 14}{8(8^2 - 1)} = 1 - \frac{84}{504} = 1 - 0.167 = +.83 \]
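The wine example can be recomputed both ways: with the Spearman shortcut and with the ordinary correlation formula applied to the same ranks. This is a minimal sketch with illustrative variable names; it relies on the ranks being untied, as they are here.

```python
import math

x_ranks = [1, 2, 3, 4, 5, 6, 7, 8]  # expert X, wines a-h
y_ranks = [2, 1, 5, 3, 4, 7, 8, 6]  # expert Y, wines a-h
n = len(x_ranks)

# Spearman shortcut based on the squared rank differences.
sum_d2 = sum((a - b) ** 2 for a, b in zip(x_ranks, y_ranks))
rs = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# Ordinary correlation coefficient r computed on the same ranks.
ss_x = sum(a * a for a in x_ranks) - sum(x_ranks) ** 2 / n
ss_y = sum(b * b for b in y_ranks) - sum(y_ranks) ** 2 / n
sc_xy = sum(a * b for a, b in zip(x_ranks, y_ranks)) - sum(x_ranks) * sum(y_ranks) / n
r = sc_xy / math.sqrt(ss_x * ss_y)

print(sum_d2, round(rs, 2), round(r, 2), round(rs ** 2, 2))
```

Both routes give the same coefficient, which illustrates the algebraic equivalence of r and rs for untied ranks.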
The meanings of rs and rs² in a rank-order correlation are essentially the
same as those of r and r² in a correlation based on equal-interval data. For
the present example, rs²=.69 means that the covariance between the X
and Y rankings is 69% as strong as it possibly could be, and the positive sign
of rs=+.83 signals that this covariation occurs along the upward slant, with
higher values of X tending to be associated with higher values of Y, and vice
versa.
However, I would not recommend taking the parallels much farther than
this. In particular, I think it would not make much sense to subject bivariate
rankings to the predictive apparatus of linear regression.
UNIT 4: SAMPLING AND SURVEY DESIGN
Knowing what the client wants is the key factor to success in any type of
business. News media, government agencies and political candidates need to
know what the public thinks. Associations need to know what their members
want. Large companies need to measure the attitudes of their employees.
The best way to find this information is to conduct a survey.
This chapter is intended primarily for those who are new to survey research.
It discusses options and provides suggestions on how to design and conduct
a successful survey project. It does not provide instruction on using specific
parts of The Survey System, although it mentions parts of the program that
can help you with certain tasks.
Establishing Goals
The first step in any survey is deciding what you want to learn. The goals of
the project determine whom you will survey and what you will ask them. If
your goals are unclear, the results will probably be unclear. Some typical
goals include learning more about:
These sample goals represent general areas. The more specific you can
make your goals, the easier it will be to get usable answers.
Selecting Your Sample
There are two main components in determining whom you will interview.
The first is deciding what kind of people to interview. Researchers often call
this group the target population. If you conduct an employee attitude survey
or an association membership survey, the population is obvious. If you are
trying to determine the likely success of a product, the target population
may be less obvious. Correctly determining the target
population is critical. If you do not interview the right kinds of people, you
will not successfully meet your goals.
The next thing to decide is how many people you need to interview.
Statisticians know that a small, representative sample will reflect the group
from which it is drawn. The larger the sample, the more precisely it reflects
the target group. However, the rate of improvement in the precision
decreases as your sample size increases. For example, to increase a sample
from 250 to 1,000 only doubles the precision. You must make a decision
about your sample size based on factors such as: time available, budget and
necessary degree of precision.
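The diminishing return from larger samples can be seen in the usual margin-of-error approximation for a survey percentage. The sketch below rests on assumptions that are not stated in this chapter: a simple random sample from a large population, the worst-case proportion p = 0.5, and z = 1.96 for a 95% confidence level. The function name is illustrative.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion from a simple random sample."""
    return z * math.sqrt(p * (1 - p) / n)

for n in (250, 1000, 4000):
    print(n, round(100 * margin_of_error(n), 1), "percentage points")

# Quadrupling the sample from 250 to 1,000 halves the margin of error.
precision_gain = margin_of_error(250) / margin_of_error(1000)
print(round(precision_gain, 2))
```

Because the margin of error shrinks with the square root of the sample size, each further doubling of precision requires four times as many interviews.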
A biased sample will produce biased results. Totally excluding all bias is
almost impossible; however, if you recognize bias exists you can intuitively
discount some of the answers. The following list shows a few examples of
biased samples.
This can be a serious problem, unless you are only
interested in people who have Internet access.
Quotas
Interviewing Methods
Once you have decided on your sample you must decide on the method of
data collection. Each method has advantages and disadvantages.
Personal Interviews
An interview is called personal when the Interviewer asks the questions
face-to-face with the Interviewee. Personal interviews can take place in the
home, at a shopping mall, on the street, outside a movie theater or polling
place, and so on.
Advantages
• The ability to let the Interviewee see, feel and/or taste a product.
• The ability to find the target population. For example, you can find people who have
seen a film more easily outside a theater in which it is playing than by calling phone
numbers at random.
• Longer interviews are sometimes tolerated, particularly in-home interviews that
have been arranged in advance. People may be willing to talk longer face-to-face to a
person than to someone on the phone.
Disadvantages
• Personal interviews usually cost more per interview than other methods. This is
particularly true of in-home interviews, where travel time is a major factor.
• Each mall has its own characteristics. It draws its clientele from a specific geographic
area surrounding it, and its shop profile also influences the type of client. These
characteristics may differ from the target population and create a non-representative
sample.
Telephone Surveys
Surveying by telephone is the most popular interviewing method in the USA. This is
made possible by nearly universal coverage (96% of homes have a telephone).
Advantages
• People can usually be contacted faster over the telephone than with other methods. If
the Interviewers are using CATI (computer-assisted telephone interviewing), the
results can be available minutes after completing the last interview.
• You can dial random telephone numbers when you do not have the actual telephone
numbers of potential respondents.
• If you are using computer-assisted interviewing, The Survey System's optional
Interviewing Module (see Chapter 11 in the Main Manual) helps automatically ensure
that questions are skipped when they should be, can check the logical consistency of
answers and can present questions or answers in a random order (the last two are
sometimes important for reasons that are described later).
Disadvantages
Mail Surveys
Advantages
Disadvantages
• Time! Mail surveys take longer than other kinds. You will need to wait several
weeks after mailing out questionnaires before you can be sure that you have gotten
most of the responses.
• In populations of lower educational and literacy levels, response rates to mail surveys
are often too small to be useful. This, in effect, eliminates many immigrant
populations that form substantial markets in many areas. Even in well-educated
populations, response rates vary from as low as 3% up to 90%. As a rule of thumb,
the best response levels are achieved from highly-educated people and people with a
particular interest in the subject (which, depending on your target population, could
lead to a biased sample).
One way of improving response rates to mail surveys is to mail a postcard telling your
sample to watch out for a questionnaire in the next week or two. You can also follow up a
questionnaire mailing after a couple of weeks with another card asking them to return the
questionnaire. The downside is that this doubles or triples your mailing cost. If you have
purchased a mailing list from a supplier you may also have to pay a second (and third) use fee -
you often cannot buy the list once and re-use it.
Another way to increase responses to mail surveys is to use an incentive. One possibility is to
send a dollar bill along with the survey (or offer to donate the dollar to a charity specified by the
respondent.) Another is to include the people who return completed surveys in a drawing for a
prize. A third is to offer a copy of the (non-confidential) result highlights to those who complete
the questionnaire. Any of these techniques will increase the response rates.
Remember that if you want a sample of 1,000 people, and you estimate a 10% response
level, you need to mail 10,000 questionnaires. You may want to check with your local post office
about bulk mail rates - you can save on postage using this mailing method. However, many
researchers do not use bulk mail, because many people associate "bulk" with "junk" and will
throw it out without opening the envelope, lowering your response rate.
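The mailing arithmetic above can be sketched as follows. The function name and the choice to pass the response rate as a percentage are assumptions made for illustration:

```python
import math


def questionnaires_to_mail(target_responses, response_rate_percent):
    """Questionnaires to mail so that the expected returns reach target_responses."""
    if not 0 < response_rate_percent <= 100:
        raise ValueError("response_rate_percent must be between 0 and 100")
    # Round up: a fraction of a questionnaire cannot be mailed.
    return math.ceil(target_responses * 100 / response_rate_percent)


# A sample of 1,000 at an estimated 10% response level.
print(questionnaires_to_mail(1000, 10))
```

Because the response level is only an estimate, the result is a planning figure rather than a guarantee.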
Computer Direct Interviews
These are interviews in which the Interviewees enter their own answers directly into a
computer. They can be used at malls, trade shows, offices, and so on. The Survey
System's optional Interviewing Module and Interview Stations can easily create
computer-direct interviews.
Advantages
Disadvantages
• The Interviewees must have access to a computer or one must be provided for them.
• As with mail surveys, computer direct interviews may have serious response rate
problems in populations of lower educational and literacy levels. This method may
grow in importance as computer use increases.
E-mail Surveys
E-mail surveys are both very economical and very fast. More people have e-mail than
have full Internet access. This makes e-mail a better choice than a Web page survey
for some populations. On the other hand, e-mail surveys are limited to simple
questionnaires, whereas Web page surveys can include complex logic.
Advantages
• Speed. An e-mail questionnaire can gather several thousand responses within a day
or two.
• There is practically no cost involved once the set up has been completed.
• You can attach pictures and sound files.
• The novelty element of an e-mail survey often stimulates higher response levels than
ordinary “snail” mail surveys.
Disadvantages
• You must possess (or purchase) a list of e-mail addresses to mail to.
• Some people will respond several times or pass questionnaires along to friends to
answer. Many programs have no check to stop people from responding multiple
times and biasing the results.
• Many people dislike unsolicited e-mail even more than unsolicited regular mail. You
may want to send e-mail questionnaires only to people who expect to get mail from
you.
• You cannot use e-mail surveys to generalize findings to the whole population.
People who have e-mail are different from those who do not, even when matched on
demographic characteristics, such as age and gender.
Many e-mail programs are limited to plain ASCII text questionnaires and cannot show
pictures. E-mail questionnaires from The Survey System can attach graphic or sound files.
Although use of e-mail is growing very rapidly, it is not universal - and is even less so outside the
USA (three-quarters of the world's e-mail traffic takes place within the USA). Many “average”
citizens still do not possess e-mail facilities. So e-mail surveys do not reflect the population as a
whole. At this stage they are probably best used in a corporate environment where e-mail is
much more common or when most members of the target population are known to have e-mail.
Web Page Surveys
Web surveys are rapidly gaining popularity. They have major speed and cost advantages,
but also major sampling limitations. These limitations make software selection especially
important and restrict the groups you can study using this technique.
Advantages
• Web page surveys are extremely fast. A questionnaire posted on a popular Website
can gather several thousand responses within a few hours. Many people who will
respond to an e-mail invitation to take a Web survey will do so the first day; most will
do so within a few days.
• There is practically no cost involved once the set up has been completed. Large
samples do not cost more than smaller ones (except for any cost to acquire the
sample).
• You can show pictures and play sounds.
• Web page questionnaires can use complex question skipping logic, randomizations
and other features not possible with paper questionnaires or most e-mail surveys.
• Web page questionnaires can use colours, fonts and other formatting options not
possible in most e-mail surveys.
• On average, people give longer answers to open-ended questions on Web page
questionnaires than they do on other kinds of self-administered surveys.
Disadvantages
• Current use of the Internet is far from universal. Internet surveys do not reflect the
population as a whole. This is true even if a sample of Internet users is selected to
match the general population in terms of age, gender and other demographics.
• People can easily quit in the middle of a questionnaire. They are not as likely to
complete a long questionnaire on the Web as they would be if talking with an
interviewer.
• Depending on your software, you may have no control over who replies – anyone
from Afghanistan to Zanzibar cruising that web page may answer.
• There is often no control over people responding multiple times to bias the results.
At this stage we recommend using the Internet for surveys only when your target population
consists entirely of Internet users. Business-to-business research and employee attitude surveys
can often meet this requirement. Surveys of the general population usually will not.
In either case, be sure your survey software prevents people from completing more than
one questionnaire. You may also want to restrict access by requiring a password (The
Survey System’s Internet Module allows you to do this) or by putting the survey on a page that
can only be accessed directly (there are no links to it).
Scanning Questionnaires
Scanning questionnaires is a method of data collection that can be used with paper questionnaires
that have been administered in face-to-face interviews, mail surveys or surveys completed by an
Interviewer over the telephone.
Advantages
• Scanning can be the fastest method of data entry for paper questionnaires.
• Scanning is more accurate than a person in reading a properly completed
questionnaire.
Disadvantages
• Scanning is best-suited to "check the box" type surveys and bar codes. Scanning
programs have various methods to deal with text responses, but all require additional
data entry time.
• Scanning is less forgiving (accurate) than a person in reading a poorly marked
questionnaire.
• Scanning requires investment in additional hardware to do the actual scanning.
Summary of Survey Methods
Your choice of survey method will depend on several factors. These include:
• Speed: E-mail and Web page surveys are the fastest methods, followed
by telephone interviewing. Interviewing by mail is the slowest.
UNIT 5: MEASURES OF CENTRAL TENDENCY
These are statistics that describe a typical score, reflecting how
the data are similar. “Average” is a commonly used term; in statistics it
covers three different measures: the mean, the median and the mode.
The appropriate measure depends on the data type (see
below) and on how the result will be used:
                  Average
Data type         Mode   Median   Mean
Nominal           Yes    No       No
Ordinal           Yes    Yes      No
Interval/Ratio    Yes    Yes      Yes
The sample data yielded average waiting times:
3 months is the most common length of waiting time, but this does not take
into account that more than half the sample waited longer than this. The
mode is also vulnerable to data collection errors. The median gives a better
representation, but still does not take into account the longer waiting times
(i.e., the outliers). Is this important? The mean waiting time of 5.25 months
does reflect the several longer waiting times. The mean is more useful, in
that many further statistical analyses use it.
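The three averages can be computed with Python's standard statistics module. The waiting times below are hypothetical values chosen for illustration (they are not the manual's sample data), picked so that the mode is 3 months and the mean is 5.25 months as in the discussion above:

```python
import statistics

# Hypothetical waiting times in months (NOT the manual's sample data).
waiting_times = [1, 2, 3, 3, 3, 4, 4, 5, 6, 8, 10, 14]

mode = statistics.mode(waiting_times)      # most common value
median = statistics.median(waiting_times)  # middle value of the ordered data
mean = statistics.mean(waiting_times)      # arithmetic average

print(mode, median, mean)
```

In this illustration the few long waits pull the mean above both the mode and the median, which is the pattern the paragraph describes.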
Many basic statistics textbooks give more details, e.g. Campbell & Machin
(1999), Clegg (1982), Cobby & Gilchrist (2003).
Computer output
Types of data
Think about any collected data that you have experience of; for example, weight,
sex, ethnicity, job grade, and consider their different attributes. These variables can
be described as categorical or quantitative. The table summarises data types and
their associated measurement level, plus some examples. It is important to
appreciate that appropriate methods for summary and display depend on the type
of data being used. This is also true for ensuring the appropriate statistical test is
employed.
Type of data       Subtype      Description                               Examples
Categorical        Ordinal      Categories have inherent order            Job grade, age groups
                   Binary       2 categories – special case of above      Gender
Quantitative       Discrete     Usually whole numbers                     Size of household (ratio)
(Interval/Ratio)   Continuous   Can, in theory, take any value in a       Temperature °C/°F (no absolute zero) (interval)
(NB units of                    range, although necessarily recorded      Height, age (ratio)
measurement used)               to a predetermined degree of precision
Illustrative example
Student     Mark out of 100%   Mark relative to 40% pass mark   Position    Result
            (Ratio)            (Interval)                       (Ordinal)   (Nominal)
Ahmed       56                 16                               6           Pass
Ben         48                 8                                7           Pass
Ceri        65                 25                               3           Pass
Desmond     73                 33                               2           Pass
Esme        62                 22                               4           Pass
Francesca   35                 -5                               9           Fail
Gemma       20                 -20                              10          Fail
Hannah      38                 -2                               8           Fail
Ian         58                 18                               5           Pass
Julie       82                 42                               1           Pass
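The four measurement levels in the example can all be derived from the raw mark. A minimal sketch, using the marks listed above and the 40% pass mark; the variable names are illustrative only:

```python
marks = {"Ahmed": 56, "Ben": 48, "Ceri": 65, "Desmond": 73, "Esme": 62,
         "Francesca": 35, "Gemma": 20, "Hannah": 38, "Ian": 58, "Julie": 82}
PASS_MARK = 40

# Names ordered from highest to lowest mark (no ties in this data).
ranked = sorted(marks, key=marks.get, reverse=True)

table = {}
for name, mark in marks.items():
    table[name] = {
        "ratio": mark,                                      # mark out of 100
        "interval": mark - PASS_MARK,                       # relative to the pass mark
        "ordinal": ranked.index(name) + 1,                  # position in class
        "nominal": "Pass" if mark >= PASS_MARK else "Fail", # result category
    }

for name, row in table.items():
    print(name, row)
```

Each step discards information: the position no longer says how far apart two students are, and the result no longer says who came first.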
UNIT 5: MEASURES OF DISPERSION
These statistics describe how the data varies or is dispersed (spread out).
The two most commonly used measures of dispersion are the range and the
standard deviation. Rather than showing how data are similar, they show
how data differs (its variation, spread, or dispersion).
Standard Deviation
The standard deviation (SD) is usually quoted along with the mean. It is a
measure of the variability of the sample data around the mean.
Interpreting the SD
The figure quoted for the SD represents one standard deviation from the
mean. One standard deviation is added to and subtracted (+/-) from the
mean to give a range within which, for many sets of data, about two-thirds
of subjects may fall.
                     Experimental group   Control group
                     n = 30               n = 30
Mean age (SD)        70 (5) yrs           71 (10) yrs
Mean age +/- 1 SD    65 to 75 yrs         61 to 81 yrs
It can be observed that the age range is much greater in the control group.
If the study purported to have a sample of elderly people, the age range in
the control group might contradict this, as clearly some are under 65 (and it
is very likely some are even younger than 61!). Ideally it would be more
informative to have the minimum and maximum values given.
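The "mean +/- 1 SD" rows in the table are simple arithmetic on the quoted mean and SD. A small sketch, with a hypothetical helper name:

```python
def one_sd_range(mean, sd):
    """Return the (lower, upper) bounds of mean +/- 1 standard deviation."""
    return mean - sd, mean + sd


groups = {"Experimental": (70, 5), "Control": (71, 10)}
for name, (mean_age, sd_age) in groups.items():
    low, high = one_sd_range(mean_age, sd_age)
    print(f"{name}: {low} to {high} yrs")
```

These bounds only bracket roughly two-thirds of subjects for many data sets, so they say nothing definite about the youngest or oldest participant.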
xi = individual sample values
x "bar" = sample mean
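These symbols are the ones used in the usual formula for the sample standard deviation. Assuming that is the formula intended here, and writing n for the sample size and s for the sample standard deviation (both symbols are assumptions, not taken from the definitions above), it can be written as:

```latex
s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}
```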
Estimation
The precision of the estimate depends on the size of the sample. Clearly the larger
the sample the better the estimate will be. Precision is measured by calculating the
standard error of the estimate or a confidence interval (usually the 95% confidence
interval).
Worked example
Consider the following times (to the nearest hour) that 16 patients experience relief
from a migraine after taking a certain drug:
7 8 1 2 6 3 5 2 4 9 4 6 5 6 9 8
Thus the estimate of the mean time for a patient to experience relief is 5.312 hours.
The 95% confidence interval for the mean time to experience relief is
calculated to be 3.975 to 6.649 hours.
We can therefore be 95% confident that the population mean
lies between 3.975 hours and 6.649 hours. This provides a clear idea of how
precisely the population mean has been estimated by these data.
Although 95% confidence intervals are most often reported, you will
sometimes see 99% confidence intervals, in which case the confidence
interval contains the population parameter with probability 0.99 and will,
consequently, be wider than the corresponding 95% confidence interval. To
calculate a 99% confidence interval, the factor 2 is replaced by 2.6.
Calculation of a 95% confidence interval for the mean
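Assuming the usual approximate form implied by the discussion that follows (a factor of 2 applied to the standard error of the mean), and writing s for the sample standard deviation and n for the sample size, the interval is:

```latex
\bar{x} \pm 2 \times \frac{s}{\sqrt{n}}
```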
The factor 2 varies according to the sample size but only varies from 2.201
to 1.960 for sample sizes greater than 10, so that 2 is an adequate
approximation in most cases. If you use a computer package that calculates
the confidence interval the exact factor will be used.
Since the sample mean and standard deviation are estimates of fixed (albeit
unknown) quantities the only way of affecting the confidence interval is by
altering the sample size, n. Increasing n will reduce the standard error of
the mean and thus the width of the interval. But notice that to halve the
width of the interval we have to quadruple the sample size (because of the
square root in the formula).
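A minimal sketch of the approximate calculation, using the factor 2 and the migraine relief times from the worked example. The function names are hypothetical. With n = 16 the exact multiplier is a little above 2, so this interval comes out slightly narrower than the one quoted in the worked example.

```python
import math
import statistics


def approx_ci(data, factor=2.0):
    """Approximate confidence interval: mean +/- factor * SD / sqrt(n)."""
    n = len(data)
    mean = statistics.mean(data)
    se = statistics.stdev(data) / math.sqrt(n)  # standard error of the mean
    return mean - factor * se, mean + factor * se


def ci_width(sd, n, factor=2.0):
    """Full width of the approximate interval for a given SD and sample size."""
    return 2 * factor * sd / math.sqrt(n)


relief_hours = [7, 8, 1, 2, 6, 3, 5, 2, 4, 9, 4, 6, 5, 6, 9, 8]
low, high = approx_ci(relief_hours)
print(round(low, 2), round(high, 2))

# Quadrupling n (16 -> 64) halves the width when the SD is unchanged.
print(ci_width(2.5, 16), ci_width(2.5, 64))
```

The second print uses an SD of 2.5 purely as a round illustrative value.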
Computer Output
To obtain a confidence interval for a set of data in Minitab, click on Stat >
Basic Statistics > 1-Sample t…
Data from the above entered in column C1 gives the following output:
T Confidence Intervals
Changing the confidence interval level to 99.0 gives the following output
T Confidence Intervals
Confidence intervals for the mean in Excel
Not easy and best to be avoided! But if you have to, the following is an
example of how to calculate approximate lower and upper 95% confidence limits for a
set of data in cells A1 to A16 using Excel’s CONFIDENCE function.
Lower limit: =AVERAGE(A1:A16)-CONFIDENCE(1-0.95,STDEV(A1:A16),COUNT(A1:A16))
Upper limit: =AVERAGE(A1:A16)+CONFIDENCE(1-0.95,STDEV(A1:A16),COUNT(A1:A16))
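For comparison, the same approximate limits can be sketched in Python. This assumes, as in Excel, that CONFIDENCE returns a margin based on the normal distribution rather than the t distribution, which is why the limits are only approximate for a sample of 16. The function name confidence_margin is hypothetical.

```python
import math
import statistics


def confidence_margin(alpha, sd, size):
    """Normal-based margin of error: z * sd / sqrt(size), with z for 1 - alpha/2."""
    z = statistics.NormalDist().inv_cdf(1 - alpha / 2)
    return z * sd / math.sqrt(size)


relief_hours = [7, 8, 1, 2, 6, 3, 5, 2, 4, 9, 4, 6, 5, 6, 9, 8]
mean = statistics.mean(relief_hours)
margin = confidence_margin(1 - 0.95, statistics.stdev(relief_hours), len(relief_hours))

lower, upper = mean - margin, mean + margin
print(round(lower, 3), round(upper, 3))
```

A t-based interval, such as the one a statistics package reports, is somewhat wider for small samples.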
Median is the value in the middle of the dataset and the value that divides
the dataset into two halves.
Standard deviation
The standard deviation (SD) is usually quoted along with the mean. It is a
measure of the variability of the sample data around the mean.
Note: In this table (which is very typical of what might be seen in a
research paper) it can be observed that the mean age for each group is very
similar, but the standard deviation is quite different. This might suggest that
the groups are not as similar as the mean suggests, and the greater
variability in the control group might have an influence on the findings.
UNIT 5: CHOOSING SIMPLE RANDOM SAMPLE
In doing a sample survey, it is standard practice to denote the number of units in the population by N, and
the number of units in the sample by n. If n = N, this is called a census and the normal
requirements of sampling do not apply. To select a random sample of size n, from a population
of N, you will need to develop:
Let’s assume that we want a sample of 15 from a population of N = 100. We could use a
standard random number table or one generated by the computer. The example below shows a partial
random number table.
25 85 52 40 80 50 80 78 58 42 11 31 85 77 77 25 16 08 54 37
58 73 38 58 78 92 12 38 43 41 31 77 97 30 33 45 00 17 60 35
66 04 44 17 00 38 61 37 54 84 38 54 05 96 18 96 20 83 65 29
96 22 27 19 23 83 09 18 22 67 17 31 63 08 80 18 68 08 47 88
83 86 48 37 00 91 51 91 62 88 04 62 12 46 51 12 55 22 43 34
23 34 45 56 18 56 34 78 59 90 67 56 65 43 23 12 11 01 23 45
Since the population units can be numbered 00 to 99, we need to select two-digit numbers. Let us assume
that we want to start with the fifth column of the second row, that is, with the number 78. We
could equally read down the column; it would not matter. Accordingly, we select numbers along the row until we
reach the sample size of 15, to get:
78 92 12 38 43 41 31 77 97 30 33 45 00 17 60
If the population N was 1000 and we needed a sample of 50, we would choose
three-digit numbers using the same process. Starting again with the fifth column of the second row and
reading the digits in groups of three, we would select numbers until we reach the sample size of 50 to get:
789 212 384 341 317 797 303 345 001 760 035 660 444 170 038 613 ….
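A computer can stand in for the random number table. The sketch below uses Python's standard random module; the function name and the seed are assumptions for illustration. It samples without replacement, which corresponds to skipping any number that comes up a second time when reading a printed table.

```python
import random


def simple_random_sample(population_size, sample_size, seed=None):
    """Draw sample_size distinct unit numbers from 0 .. population_size - 1."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return rng.sample(range(population_size), sample_size)


# 15 units from a population of N = 100, shown as two-digit labels 00-99.
sample = simple_random_sample(100, 15, seed=1)
print([f"{unit:02d}" for unit in sample])

# 50 units from a population of N = 1000 (three-digit labels 000-999).
big_sample = simple_random_sample(1000, 50, seed=1)
print([f"{unit:03d}" for unit in big_sample][:5], "...")
```

Every unit has the same chance of selection, which is what makes the sample a simple random sample.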
Data representation
Data can be represented at several levels: primary, first, secondary or tertiary. Here,
primary level refers to the presentation of raw or unorganized data, that is, exactly as the
information was collected. First level data will have undergone some re-arrangement or ordering, but no
real transformation. Secondary level data will have undergone some transformation or
processing. And tertiary level data have undergone substantive transformation.
The first level data usually puts the raw data in some order or sequence (in the case of
quantitative data) or in summary, group, cluster, or category (in the case of non-quantitative
data). With the primary (quantitative) data, we can derive plots (scatter plots as they are called);
and from the first and secondary level (quantitative) data we can derive graphs, tables, charts and
regressions. Using secondary level data we can derive predictive models.
Tables
This is a very practical and popular method of presenting data. Tables can be of single-vector,
multi-vector or matrix format, depending on the number of variables, the links, and their complexity.
Tables may be arranged horizontally or vertically, and they can get quite sophisticated, with
multi-level rows and columns of information.
89,60,92,74,76,65,77,83,87,62,85,64,79,77,96,80,70,85,80,81,82,81,86,71,90,87,71,72

Double vector:
60 62 64 65 70 71 71 72 74 76 77 77 79 80
80 81 81 82 83 85 85 86 87 87 89 90 92 96

Student marks   Frequency
60-64           3
65-69           1
70-74           4
75-79           6
80-84           6
85-89           2
90-94           1
95-99           1
Total           24

         Male           Fem            Male
Age      Marks   Freq   Marks   Freq   Marks   Freq   Marks   Freq
18-25    60-79   3      60-79   2      80-99   1      80-99   2
26-33    60-79   2      60-79   3      80-99   2      80-99   3
>31      60-79   2      60-79   2      80-99   1      80-99   1
Total    60-79   7      60-79   7      80-99   4      80-99   6
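Grouping raw marks into class intervals can be sketched as below. The class width of 5 follows the frequency table; the function name is hypothetical. The sketch tallies the 28 raw marks listed above, so its counts need not reproduce the printed frequencies, which sum to 24.

```python
from collections import Counter

marks = [89, 60, 92, 74, 76, 65, 77, 83, 87, 62, 85, 64, 79, 77,
         96, 80, 70, 85, 80, 81, 82, 81, 86, 71, 90, 87, 71, 72]
CLASS_WIDTH = 5


def class_label(mark, width=CLASS_WIDTH):
    """Return the class interval a mark falls into, e.g. 62 -> '60-64'."""
    lower = (mark // width) * width
    return f"{lower}-{lower + width - 1}"


frequency = Counter(class_label(m) for m in marks)
for label in sorted(frequency):
    print(label, frequency[label])
print("Total", sum(frequency.values()))
```

Checking that the frequencies sum to the number of raw observations is a quick way to catch tallying errors.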
Graphs
There are many graphical ways to represent data. The common ones are:
• the scatter plot (see figure A below)
• the line or curve diagram (see figure B below)
• the bar diagram or chart (see figure C below)
• the pie diagram or chart (see figure D below)
[Figure A: scatter plot] [Figure B: line or curve diagram] [Figure C: bar chart] [Figure D: pie chart]
UNIT 7: REVIEW OF BASIC STATISTICS AND STATISTICAL TOOLS
What is statistics?
UNIT 9: SURVEY DESIGN
[Diagram: Conceptual Problem; List of all Information Required; Pilot Test]
of statements and asked to indicate, using a scale, the extent to which
they agree or disagree with them. Ex: 5 = agree strongly; 4 = agree, etc.
Semantic differentials: respondents are offered pairs of contrasting descriptors
and asked to indicate how the concept being studied relates to the
descriptors.
c. Ranking: Please rank the following items in terms of their importance to you.
Rank them from 1 for the most important to 5 for the least important.
Rank
A. Good reputation ____
B. Easy access ____
C. Curriculum ____
F. Management pays fees ____
G. Easy parking ____
d. Likert scales: How important was each of the following items in your decision to choose
this training course?
e. Attitude statement: Please read the statements below and indicate your level of agreement.
more important than the
qualification in education
Graduate courses fees are too high ____5 ____4 ____3 ____2 _____1
f. Semantic differential: Please look at the list and tick the line where you think this
training course falls in relation to each factor listed.
UNIT 10: Questionnaire Design
General Considerations
The first rule is to design the questionnaire to fit the medium. Phone
interviews cannot show pictures. Survey-by-mail respondents cannot ask,
“What exactly do you mean by that?” if they do not understand a question.
Intimate, personal questions are sometimes best handled by mail or
computer, where anonymity is most assured.
A mail survey will often not give the same answers as the same survey done
by phone or in person. If you used one method in the past and need to
compare results, stick to that method, unless there is a compelling reason to
change.
Respondents who feel they are being coerced into giving an answer they do
not want to give often do not complete the questionnaire. For the same
reason, include “Other” or “None” whenever either of these is a logically
possible answer. When the answer choices are a list of possible opinions,
preferences or behaviours you should usually allow these answers.
On paper, computer direct and Internet surveys these four choices should
appear as appropriate. You may want to combine two or more of them into
one choice, if you have no interest in distinguishing between them. You will
rarely want to include “Don't Know,” “Not Applicable,” “Other” or “None” in
a list of choices being read over the telephone or in person, but you should
allow the interviewer the ability to accept them when given by respondents.
Question Types
Multi Choice
__________________________________
Rating scales
5. On a scale where “10” means you have a great amount of interest in a
subject and “1” means you have none at all, how would you rate your interest in
each of the following topics?
Strongly Strongly
Agree Agree Disagree Disagree
Rating Scales and Agreement Scales are two common types of questions
that some researchers treat as multiple choice questions and others treat as
numeric open end questions. Examples of these kinds of questions are:
There are two broad issues to keep in mind when considering question and
answer choice order. One is how the question and answer choice order can
encourage people to complete your survey. The other issue is how the order
of questions or the order of answer choices could affect the results of your
survey.
Whenever possible leave difficult or sensitive questions until near the end of
your survey. Any rapport that has been built up will make it more likely
people will answer these questions. If people quit at that point anyway, at
least they will have answered most of your questions.
Answer choice order can make individual questions easier or more difficult to
answer. Whenever there is a logical or natural order to answer choices, use
it. Always present agree-disagree choices in that order. Presenting them in
disagree-agree order will seem odd. For the same reason, positive to
negative and excellent to poor scales should be presented in those orders.
When using numeric rating scales higher numbers should mean a more
positive or more agreeing answer.
Question order can affect the results in two ways. One is that mentioning
something (an idea, an issue, a brand) in one question can make people
think of it while they answer a later question, when they might not have
thought of it if it had not been previously mentioned.
The other way question order can affect results is habituation. This problem
applies to a series of questions that all have the same answer choices. It
means that some people will usually start giving the same answer, without
really considering it, after being asked a series of similar questions. People
tend to think more when asked the earlier questions in the series and so
give more accurate answers to them.
Another way to reduce this problem is to ask only a short series of similar
questions at a particular point in the questionnaire. Then ask one or more
different kinds of questions, and then another short series, if needed.
One negative aspect of this technique is that you will usually have to do
Data Transformations on some of the questions after the results are entered,
because having the higher levels of agreement always mean a positive (or
negative) answer makes the analysis much easier. However, the few
minutes extra work may be a worthwhile price to pay to get more accurate
data.
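The data transformation mentioned above is commonly a reverse-scoring step, so that a high number means the same thing on every item. A minimal sketch, assuming a 5-point agreement scale; the item names and the set of negatively worded items are hypothetical:

```python
def reverse_score(score, scale_min=1, scale_max=5):
    """Flip a rating so that scale_max becomes scale_min and vice versa."""
    if not scale_min <= score <= scale_max:
        raise ValueError(f"score {score} is outside {scale_min}-{scale_max}")
    return scale_max + scale_min - score


# Hypothetical answers from one respondent (5 = agree strongly).
responses = {"q1": 5, "q2_negative": 1, "q3": 4}
NEGATIVE_ITEMS = {"q2_negative"}  # statements worded in the opposite direction

recoded = {
    item: reverse_score(score) if item in NEGATIVE_ITEMS else score
    for item, score in responses.items()
}
print(recoded)
```

After recoding, strong disagreement with a negatively worded statement counts the same as strong agreement with a positively worded one.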
The order in which the answer choices are presented can also affect the
answers given. People tend to pick the choices nearest the start of a list
when they read the list themselves on paper or a computer screen. People
tend to pick the most recent answer when they hear a list of choices read to
them.
obvious to the person that is answering them (e.g., “What brand(s) of car
do you own?”).
Other Tips
Start with a Title (e.g., Leisure Activities Survey). Always include a short
introduction - who you are and why you are doing the survey. It is often a
good idea to give the name of the research company rather than the client
(e.g., XYZ Research Agency rather than the manufacturer of the product/
service being surveyed).
Reassure your respondent that his or her responses will not be revealed to
your client, but only combined with many others to learn about overall
attitudes.
Include a cover letter with all mail surveys. A good cover letter will increase
the response rate. A bad one, or none at all, will reduce the response rate.
Include the information in the preceding two paragraphs and mention the
incentive (if any). Describe how to return the questionnaire. Include the
name and telephone number of someone the respondent can call if they
have any questions. Include instructions on how to complete the survey
itself.
Mail questionnaires should be numbered on each page and include the return
address on the questionnaire itself, because pages and envelopes can be
separated from each other. Envelopes should have return postage prepaid.
Using a postage stamp often increases response rates, but is expensive,
since you must stamp every envelope - not just the
returned ones.
You may want to leave a space for the respondent to add their name and
title. Some people will put in their names, making it possible for you to re-
contact them for clarification or follow-up questions. Indicate that filling in
their name is optional. Do not have a space for a name, if the questions are
sensitive in nature. Some people would become suspicious and not complete
the survey.
The best way to ask security questions is in reverse (e.g., if you are
surveying for a pharmaceutical product, phrase the question as "We want to
interview people in certain industries - do you or any member of your
household work in the pharmaceutical industry?"). If the answer is "Yes",
thank the respondent and terminate the interview. Similarly, it is best to
eliminate people working in the advertising, market research or media
industries, since they may work with competing companies.
After the security question, start with general questions. If you want to limit
the survey to users of a particular product, you may want to disguise the
qualifying product. As a rule, start from general attitudes to the class of
products, through brand awareness, purchase patterns, specific product
usage to questions on specific problems (e.g., work from "What
types of coffee have you bought in the last three months" to "Do you recall
seeing a special offer on your last purchase of Brand X coffee?"). If possible
put the most important questions into the first half of the survey. If a person
gives up half way through, at least you have the most important
information.
Make sure you include all the relevant alternatives as answer choices.
Leaving out a choice can give misleading results. For example, a number of
recent polls that ask Americans if they support the death penalty, yes or no,
have found 70-75% of the respondents choosing “yes.” But polls that offer
the choice between the death penalty and life in prison without the
possibility of parole show support for the death penalty at about 50-60%,
while polls that offer the alternatives of the death penalty or life in prison
without the possibility of parole, with the inmates working in prison to pay
restitution to their victims’ families, have found support for the death penalty
closer to 30%.
So what is the true level of support for the death penalty? The lowest figure
is probably best, since it represents the percentage that favour that penalty
regardless of the alternative offered. The need to include all relevant
alternatives is not limited to political polls. You can get misleading data
anytime you leave out alternatives.
Do not put two questions into one. Avoid questions such as "Do you buy
frozen meat and frozen fish?" A "Yes" answer can mean the respondent
buys meat or fish or both. Similarly with a question such as "Have you ever
bought Product X and, if so, did you like it?" A "No" answer can mean "never
bought" or "bought and disliked." Be as specific as possible.
"Do you ever buy pasta?" can include someone who once bought some in
1990. It does not tell you whether the pasta was dried, frozen or canned and
may include someone who had pasta in a restaurant. It is better to say
"Have you bought pasta (other than in a restaurant) in the last three
months?" "If yes, was it frozen, canned or dried?" Few people can remember
what they bought more than three months ago unless it was a major
purchase such as an automobile or appliance.
If you are comparing different products to find preferences, give each one a
neutral name or reference. Do not call one "A" and the second one "B." This
immediately brings images of A grades and B grades to mind, with the
former being seen as superior to the latter. It is better to give each a
"neutral" reference such as "M" or "N" that does not have as strong a quality
difference image. If possible, just refer to the "first" product and the
"second" product.
Avoid technical terms and acronyms, unless you are absolutely sure that
respondents know what they mean. Acronyms such as MHETEC, ADB, GPA and UNDP,
and terms such as Adjusted Gross Income, Grade Point Average and Engineering
Information External Inquiries Officer, are all well known to people in those
particular fields, but very few people would understand all of them. If you
must use an acronym, spell it out the first time it is used.
Make sure your questions accept all the possible answers. A question like
"Do you use regular or premium gas in your car?" does not cover all possible
answers. The owner may alternate between both types. The question also
ignores the possibility of diesel or electric-powered cars. A better way of
asking this question would be "Which type(s) of fuel do you use in your
car?" The responses allowed might be:
o Regular gasoline
o Premium gasoline
o Diesel
o Other
o Do not have a car
If you want only one answer from each person, ensure that the options are
mutually exclusive. For example:
Score or Scale questions (e.g., "If "5" means very good and "1" means very
poor, how would you rate this product?") are a particular problem. Researchers
are very divided on this issue. Many surveys use a ten-point scale, but
there is considerable evidence to suggest that anything over a five-point
scale is irrelevant. This depends partially on education. Among university
graduates a ten-point scale will work well. Among people with less than a
high school education five points is sufficient. In third world countries, a
three-point scale (good/acceptable/bad) is often all a respondent can
understand. Another problem is that you are assuming that the difference in
the factors is within the scale limits: you may have a five-point scale, but in
a respondent's mind one factor may rate 10 points in comparison to the
others.
If you do use a rating scale be sure the labels are meaningful. For example:
A question phrased like the one above will force most answers into the
middle category, resulting in very little usable information.
If you have used a particular scale before and need to compare results, use
the same scale. Four on a five-point scale is not equivalent to eight on a
ten-point scale. Someone who rates an item "4" on a five-point scale might
rate that item anywhere between "6" and "9" on a ten-point scale.
Accordingly they often give "correct" answers rather than what they really
believe. Even when the questions are not overtly political and deal purely
with commercial products or services, the desire not to disappoint important
visitors with answers that may be considered negative may lead to
exaggerated scores. Always discount "favorable" answers by a significant
factor in all cases. The desire to please is not limited to the third world.
In personal interviews it is vital for the Interviewer to have empathy with the
Interviewee. In general, Interviewers should try to "blend" with respondents
in terms of race, language, sex, age, etc. Choose your Interviewers
according to the likely respondents.
Leave your demographic questions (age, sex, income, education, etc.) until
the end of the questionnaire. By then the Interviewer should have built a
rapport with the Interviewee that will allow honest responses to such
personal questions. Mail questionnaires should do the same, although the
rapport must be built by good question design, rather than personality.
Sometimes respondents offer casual remarks that are worth their weight in
gold and cover some area you did not think of, but which respondents
consider critical. Many products have a wide range of secondary uses that
the manufacturer knows nothing about but which could provide a valuable
source of extra sales if approached properly. In one third world market, a
major factor in the sale of candles was the ability to use the spent wax as
floor polish - but the manufacturer only discovered this by a chance remark.
Look at the following layouts and decide which you would prefer to use:
• A good vacation policy - agree/not sure/disagree.
• Good management feedback - agree/not sure/disagree.
• Good medical insurance - agree/not sure/disagree.
• High wages - agree/not sure/disagree.
The second example shows the answer choices in neat columns and has
more space between the lines. It is easier to read. The numbers in the
second example will also speed data entry.
Surveys are a mixture of science and art, and a good researcher will save
their cost many times over by knowing how to ask the correct questions.
Pre-test Questionnaire
If you change any questions after a pre-test, you should not combine the
results from the pre-test with the results of post-test interviews. The Survey
System will invariably provide you with mathematically correct answers to
your questions, but choosing sensible questions and administering surveys
with sensitivity and common sense will improve the quality of your results
dramatically.
UNIT 9: REVIEW OF RESEARCH METHODS AND ETHICS
• Qualitative
• Experimental
• Observation
• Questionnaire-based survey
• Other methods
Note: Quantitative methods are not categorized as research methods. They are regarded as
analytic tools.
Research ethics
UNIT 10: FORMAT OF A RESEARCH PROPOSAL/PROJECT OR THESIS
1. Introduction
A. The problem Statement
B. Rationale for the research
• Statement of the research objective
C. Hypothesis
D. Definition of terms
E. Summary, including restatement of the problem
Although each research project will have its own peculiarities, in general the headings below
should serve as a useful guide. Note that some headings may be combined or appear as
subheadings, depending on the importance that is attached to them.
1. Background to issues/problems
2. Identify Problem – problem statement clarity
3. Identify Purpose of research
4. Scope, limitations and risks associated with study
5. Identify Hypothesis
6. Identify Method/methodology/theoretic framework
7. Research strategy to obtain, express and process data and report results
8. Analyze and discuss results in relation to problem/hypothesis - confirmation/refutation –
limitations, risks
9. Conclusion and Recommendations
10. Review references & citations - primary and secondary sources, footnotes, – relevance
and dateline
11. Prepare General comments about the report
UNIT: REVIEW OF REGRESSION ANALYSIS
Regression
Study of the nature of the relationship between one variable (the dependent) and another or
others (the independent). The variable we are trying to predict is called the dependent variable
and is conventionally plotted on the y-axis. The independent variable is plotted on the x-axis. If
the relationship between the dependent and the independent variable is linear, then we can
represent its equation by:
$$y = a + bX$$
where N = 15 is the number of observations (written n in the formulas that follow) and
y is an estimate of the average value of Y corresponding to a given value of X
X is the actual value of the independent variable
a is a constant, an estimate of $\alpha$, the y-intercept of the regression line
b is also a constant, an estimate of $\beta$, the slope of the regression line
The statistics necessary to find a and b are the variance of X and what we call the covariance of
X and Y. The variance of X is:
$$s_x^2 = \frac{\sum X^2}{n} - \left(\frac{\sum X}{n}\right)^2$$
X    Y    XY    X²    Y²
$$s_{xy} = \frac{\sum XY}{n} - \frac{\sum X \times \sum Y}{n \times n} = \frac{7{,}072}{15} - \frac{1{,}322 \times 78}{15 \times 15} = 13.17$$
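The variance and covariance formulas can be checked with a short Python sketch. The function names are illustrative assumptions, not part of the manual. Only the covariance is evaluated, because the manual quotes the sums ∑XY = 7,072, ∑X = 1,322 and ∑Y = 78 with n = 15 but does not give ∑X² for this data set.

```python
def variance_from_sums(sum_x2, sum_x, n):
    """Variance of X from column totals: sum(X^2)/n - (sum(X)/n)^2."""
    return sum_x2 / n - (sum_x / n) ** 2


def covariance_from_sums(sum_xy, sum_x, sum_y, n):
    """Covariance of X and Y from column totals: sum(XY)/n - sum(X)*sum(Y)/(n*n)."""
    return sum_xy / n - (sum_x * sum_y) / (n * n)


# Column totals quoted in the covariance calculation (n = 15 observations).
s_xy = covariance_from_sums(sum_xy=7072, sum_x=1322, sum_y=78, n=15)
print(round(s_xy, 2))  # 13.17
```

Working from column totals mirrors the hand calculation: once the X, Y, XY and X² columns of the table have been summed, no further pass over the individual observations is needed.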
Data of this form usually have some dependent relationship on prior observations or previous
data, and as such it may be possible to establish this dependence and use it to form the basis for
short-term prediction of future behaviour or value. Observations in a time series can be
represented graphically in a line diagram where the observations are plotted against time
(horizontal axis).
Example 1
Consider the import table for sweet crude oil into Namibia between 1974 and 2001:
Year    Sweet crude    Year    Sweet crude    Year    Sweet crude    Year    Sweet crude
If the import manager asked you, as the research economist, to clarify the inherent pattern in
the country’s import of sweet crude oil over time, what would you do and why? You could do
this by using various systems of moving averages, whereby an observation is replaced by the
mean of a number of observations centred on the one in question. For example, if we take a 5-year
moving average, the 1976 figure is replaced by the mean volume of imports for the years 1974 to
1978, i.e. the sum of the five yearly figures for 1974 to 1978 divided by 5.
Likewise, the 1977 figure is replaced by the mean of the years 1975 to 1979, i.e. the sum of the
five yearly figures for 1975 to 1979 divided by 5,
and so on. The charts below show the 5-year moving average, yearly imports and trend.
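The centred moving average described above can be sketched in Python as follows. The import volumes are hypothetical numbers chosen for illustration, not the figures from the manual's table, and the function name is an assumption. Years that do not have two neighbours on each side are left as None, since no 5-year window can be centred on them.

```python
def centred_moving_average(values, window=5):
    """Replace each observation by the mean of the `window` observations
    centred on it; positions without a full window stay as None."""
    if window % 2 == 0:
        raise ValueError("window must be odd so that it can be centred")
    half = window // 2
    smoothed = [None] * len(values)
    for i in range(half, len(values) - half):
        block = values[i - half : i + half + 1]
        smoothed[i] = sum(block) / window
    return smoothed


# Hypothetical yearly import volumes (NOT the data from the manual).
years = list(range(1974, 1981))
imports = [10, 12, 11, 15, 14, 18, 17]

smoothed = centred_moving_average(imports, window=5)
for year, value in zip(years, smoothed):
    print(year, value)
```

With this sample the 1976 value is the mean of the 1974 to 1978 figures and the 1977 value is the mean of the 1975 to 1979 figures, exactly as in the description; the first two and last two years have no smoothed value.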
Y = a + bX + e
Y = the dependent variable
X = the independent variable
a = the y-intercept (i.e. the value of the dependent variable (Y)
when the independent variable (X) is equal to zero)
b = the regression coefficient; it is the slope of the regression line, which
measures how much the dependent variable (Y) changes per unit change in the
independent variable (X):
$$b = \frac{\partial Y}{\partial X}$$
e = an error term, or average residual, that exists because
regressions generally are not perfect or reliable models
The values of a and b that satisfy the least squares principle are determined from the following
equations:
$$b = \frac{N \sum XY - \sum X \sum Y}{N \sum X^2 - \left(\sum X\right)^2}$$
$$a = \bar{Y} - b\bar{X} = \frac{\sum Y}{N} - b\,\frac{\sum X}{N}$$
      X      Y     XY     X²    X − X̄    Y − Ȳ
      5     10     50     25
    -10    -15    150    100
     10     15    150    100
      0      5      0      0
    -10     -5     50    100
∑    -5     10    400    325
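$$\bar{X} = \frac{\sum X}{N} = \frac{-5}{5} = -1 \quad \text{and} \quad \bar{Y} = \frac{\sum Y}{N} = \frac{10}{5} = 2$$
$$b = \frac{N \sum XY - \sum X \sum Y}{N \sum X^2 - \left(\sum X\right)^2} = \frac{5(400) - (-5)(10)}{5(325) - (-5)^2} = \frac{2{,}050}{1{,}600} \approx 1.28$$
$$a = \bar{Y} - b\bar{X} = 2 - (1.28)(-1) \approx 3.28$$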
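As a check on the hand calculation, the least squares formulas for a and b can be applied to the five (X, Y) pairs in the table with a small Python sketch. The function name is an illustrative assumption; the arithmetic follows the two equations given for b and a.

```python
def least_squares(xs, ys):
    """Return (a, b) for the line Y = a + bX fitted by least squares."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    denominator = n * sum_x2 - sum_x ** 2
    if denominator == 0:
        raise ValueError("all X values are identical; the slope is undefined")
    b = (n * sum_xy - sum_x * sum_y) / denominator  # slope
    a = sum_y / n - b * sum_x / n                   # intercept: mean(Y) - b * mean(X)
    return a, b


# The five observations from the worked table.
xs = [5, -10, 10, 0, -10]
ys = [10, -15, 15, 5, -5]

a, b = least_squares(xs, ys)
print(a, b)  # 3.28125 1.28125
```

The fitted line is therefore approximately Y = 3.28 + 1.28X for this small data set.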
Trend
To isolate the trend, we need to use a moving average of such a period as to eliminate the seasonal
effects, or to reformulate the table in such a way that we can do a regression analysis.