Study Manual Quantitative Methods

Uploaded by

Venkat R
STUDY MANUAL

QUANTITATIVE METHODS

CODE:

COPYRIGHT
Published by the International University of Management
Windhoek, Namibia
© International University of Management 2010
No part of this publication may be reproduced, stored in a retrieval system or
transmitted in any form or by any means, electronic, mechanical, photocopying,
recording or otherwise, without prior permission of the publishers.
International University of Management
59 Bahnhof Street
Private Bag 14005
Windhoek
Telephone (264 61) 245150/84
Fax (264 61) 248112
E-mail: [email protected]
Website: www.ium.edu.na

TABLE OF CONTENTS

Course Description and Learning Objectives

UNIT 1: Introduction to Quantitative Methods and Research Concepts

UNIT 2: Introduction to Linear Correlation and Regression

UNIT 3: Rank Order Correlation

UNIT 4: Sampling And Survey Design

UNIT 5: Measures of Dispersion

UNIT 6: Choosing Simple Random Sample

UNIT 7: Review Of Basic Statistics And Statistical Tools

UNIT 8: Survey Design

UNIT 9: Questionnaire Design

UNIT 10: Review Of Research Methods And Ethics

UNIT 11: Format Of A Research Proposal/Project Or Thesis

UNIT 11: Review Of Regression Analysis

Bibliography

Quantitative Methodology

Course Description and Scope

The course is designed to help students develop an understanding of
quantitative techniques, the essential concepts, and the basic analytic tools
that can be used in business decisions, analysis and research, and to give
them the opportunity to apply this knowledge and creativity to “real life”
solutions. Students will be provided with the necessary tools to undertake
major primary research projects on their own, such as their final-year
projects or research related to their job environments.

Students will have a chance to play the roles of both clients (in defining the
research process and objectives) and researchers in defining the research
study, sample population, sample size, questionnaire and the methodology
for data analysis and interpretation.

Given the progressive nature of the course, as per the course outline,
coverage of the various statistical tools cannot be extensive. Hence,
students who have a specific interest in this area are expected to pursue
the advanced modules in years 3 and 4.

Learning Objectives

Upon completion of this course the student should be able to:

1. understand the rational basis for business decision-making, as well as
research concepts and terminologies, and the basic statistical tools
used in business decision analysis and research data analysis
2. explain the quantitative methods and processes and how to apply
them in real business situations
3. understand the different research approaches, techniques and analytic
tools, and where each is best applied
4. formulate research problem statements and hypotheses
5. compare and contrast the advantages and drawbacks of the different
approaches to research design, e.g. qualitative vs quantitative methods
used to analyze different kinds of business or research data
6. undertake research projects and make business decisions on their own

Textbook

Required
1. Ticehurst, G. W. & Veal, T. R. (1999). Business Research Methods: A
Managerial Approach, Longman

Recommended
1. Hannagan T. (1997). Mastering Statistics, 3rd Ed., Macmillan
2. Salkind, N.J. (2000). Exploring Research, 4th Ed., Prentice Hall
3. Trochim, W. (2001). The Research Methods Knowledge Base, 2e
Available online at http://www.atomicdogpublishing.com/home.asp

Evaluation

There will be at least one exercise for each module, a midterm test, a
research project and a final examination based on the lectures, assigned
readings and class discussions.
Grading

1. Completion of 8 class exercises    16 points
2. Midterm test                       24 points
3. Research Project                   10* points
4. Final Exam                         50 points (60 points w/o the project)
   Total                              100%

UNIT 1: INTRODUCTION TO QUANTITATIVE METHODS AND
RESEARCH CONCEPTS

Quantitative methods

Quantitative methods refer to the range of mathematical and statistical
tools, techniques and models used to acquire and analyse data, test
empirical theories and hypotheses and make rational business
recommendations and decisions. Besides business, quantitative methods
have been widely used in a variety of contexts, for example, the study of
arms races, of political stability, of political violence, and of the behaviour of
legislators, but by far their most prominent application has been in the area
of electoral attitudes and behaviour, where data are easily quantified.

Research
Research is a systematic investigation or study to establish or clarify
principles, theories, facts and relationships. Such activities usually start with
identifying the research problem and end with a clearly defined objective.
The objective statement is often based on some basic assumption or
hypothesis. Scientific research includes the process of investigating and
proving a potential application of established scientific and/or engineering
knowledge.

Research methods
Research methods are the systematic application of one or more techniques
to investigate the research problem. A method can be clearly described, and
is capable of being repeated by different people on different occasions.

Research proposal
A research proposal is a clear specification of the research, aimed at
securing authority and funding for the research.

Accreditation
Formal procedures giving students certification after completion of studies,
e.g. a certificate, a diploma, or a degree.

Added value

Any extra benefit given to output through production, research, teaching,
learning or administration, by an additional factor which was not included in
the process.

Baseline Research

Research designed to collect data to be used as a basis for comparison
against future data

Bias
Any influence that distorts the results of a research study

Census
A count of every member of a population

Closed-Ended Question

A question to which the respondent is limited in the type of responses he or
she may give, e.g. there are only three possible responses to the question,
"Will you buy this product?": "Yes," "No," or "I don’t know"

Cluster Analysis

A multivariate analysis technique that seeks to organize information about
variables so that relatively homogenous groups, or "clusters," can be formed

Confidence Interval
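A range of values, calculated from sample data, within which a population
parameter such as the mean is expected to lie at a given confidence level

Confidence Level
The probability, usually expressed as a percentage (e.g. 95%), that an
interval constructed in this way contains the population parameter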

Convenience Sample
A sample selected merely on the participants’ availability or ability to
participate

Correlation
A statistical relationship between two variables, such that the values of
one tend to vary systematically with the values of the other

Database
A collection of related data organized for quick access

Demographic Data
Objective and descriptive population data that is easily identifiable, e.g. age,
income, gender, and education level

Dependent Variable

The variable in an experiment that is thought to be affected by (to depend
on) another variable (the independent variable). In experimental research,
the dependent variable is the variable presumed within the research
hypothesis to depend on (be caused by) the independent variable; it is
sometimes referred to as the outcome variable.

Empirical Data
Verifiable information that is collected through scientific observation or
experiment. For example, results from surveys or focus groups are empirical
data.

Descriptive statistics
Statistical methods used to describe or summarise data collected from a
specific sample (e.g. mean, median, mode, range, standard deviation).
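
The measures named above can be computed with Python's standard `statistics` module. The sketch below uses a small set of hypothetical test scores, invented purely for illustration; it is not data from this manual.

```python
import statistics

# Hypothetical test scores for a sample of nine students (illustrative only).
scores = [52, 60, 60, 65, 70, 71, 75, 80, 88]

mean_score = statistics.mean(scores)      # arithmetic average
median_score = statistics.median(scores)  # middle value of the sorted scores
mode_score = statistics.mode(scores)      # most frequently occurring value
score_range = max(scores) - min(scores)   # highest value minus lowest value
std_dev = statistics.stdev(scores)        # sample standard deviation (divides by n - 1)

print(mean_score, median_score, mode_score, score_range, round(std_dev, 2))
```

Note that `statistics.stdev` treats the scores as a sample; `statistics.pstdev` would be the choice if the nine scores were the entire population of interest.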

Frequency Distribution

A report of the number and type of responses to a particular question, e.g.
the following is a frequency distribution for an income variable.
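
As a rough sketch of how such a distribution can be tabulated, the example below counts hypothetical answers to an income-bracket question. The bracket labels and responses are assumptions made up for illustration and do not reproduce any table from this manual.

```python
from collections import Counter

# Hypothetical answers to "Which income bracket do you fall into?" (illustrative only).
responses = ["low", "middle", "middle", "high", "low",
             "middle", "middle", "low", "high", "middle"]

frequency = Counter(responses)   # number of responses per bracket
total = len(responses)

# Print each bracket with its count and its share of all responses.
for bracket in ["low", "middle", "high"]:
    count = frequency[bracket]
    print(f"{bracket:<8}{count:>3}{count / total:>8.0%}")
```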

Independent variable
The variable (or antecedent) in an experiment whose value is assumed to
cause or influence the value of the dependent variable(s) or outcome. The
independent variable is manipulated in experimental research to observe its
effect on the dependent variable(s). It is sometimes referred to as the
treatment variable.

Informed consent

The process of obtaining voluntary participation of individuals in research
based on a full understanding of the possible benefits and risks.

Interval Scale

A measurement scale in which all levels are equally spaced, such that an
increase of one unit on the scale is equal to the same increase at a different
point on the scale; it does not include a true zero point; e.g., the 10-point
rating scale (1 to 10) used in gymnastics is an interval scale. The categories
are ordered and there are equal intervals between points on the scale, but
the zero point on the scale is arbitrary so that a particular measure cannot
be said to be 'twice as' large as another measure on the same scale (e.g.
degrees Centigrade).

Leading Question
A question designed to direct the respondent to provide a response desired
by the questioner

Likert Scale

A five-point scale measuring a respondent’s level of agreement with a given
statement. It is a method used to measure attitudes, in which respondents
indicate their degree of agreement or disagreement with a series of
statements. Scores are summed to give a composite measure of attitudes.

Likert-Type Scale
Any measurement scale wherein a respondent is asked to rate his or her
attitude regarding a given statement

Mean
The average value: the sum of all values divided by the number of values.

Mode

A descriptive statistic that is a measure of central tendency; it is the
score/value that occurs most frequently in a distribution of scores.

Median
The mid-point; exactly one-half of responses are less than the median and
one-half are greater than the median. A descriptive statistic used to measure
central tendency. The median is the score/value that is exactly in the middle
of a distribution (i.e. the value above which and below which 50% of the
scores lie).

Methodology
The process used and steps taken to collect data in a research effort

Nominal Scale

A measurement scale in which the levels can be assigned arbitrary numeric
values, but the values have no intrinsic order or mathematical properties;
for example, race and gender are both measured using a nominal scale. It
is the lowest level of measurement that involves assigning characteristics
into categories which are mutually exclusive, but which lack any intrinsic
order (e.g. classification by gender or by the colour of a person's hair or
eyes)

Open-Ended Question

A question to which the respondent is not limited in the type of answer he or
she can give, e.g. "Why do you prefer golf to tennis?" See also “Closed-Ended
Question”.

Ordinal Scale

A measurement scale in which the values are ordered, but not equally
spaced; for example, finishers in a race are measured on an ordinal scale
(1st, 2nd, 3rd, etc.). These categories can be used to rank order a variable,
but the intervals between categories are not equal or fixed (e.g. strongly
agree, agree, neither agree nor disagree, disagree, strongly disagree; social
class I professional, II semi-professional, IIIa non-manual, IIIb manual, IV
semi-skilled, and V unskilled).

Population
The total group of people in which a researcher or marketer is interested. A
well-defined group or set that has certain specified properties (e.g. all
registered teachers working full-time in Namibia).

Qualitative Research

Research such as focus groups and in-depth interviews, the results of which
cannot be statistically applied to a population.

Quantitative Research

Survey research using a sample of people drawn at random from a given
population. If the sample is drawn properly, the results of quantitative
research can be generalized to the population.

Research design:

The strategic plan for a research project that sets out the outline
and key features of data collection methodology and analysis, including a
detailed consideration of the research parameters.

Random Sample

A sample of persons selected using an independent random sampling
technique to participate in a research study.

Random sampling

A process of selecting a sample whereby each member of the population has
an equal chance of being included.
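
A minimal sketch of simple random sampling, assuming a hypothetical sampling frame of 500 numbered population members; the frame size, sample size and seed are arbitrary choices made for illustration.

```python
import random

# Hypothetical sampling frame: ID numbers for a population of 500 people.
population = list(range(1, 501))

# A fixed seed makes the draw repeatable; the value itself is arbitrary.
rng = random.Random(2010)

# Draw 25 members without replacement; every member has the same chance
# of being included in the sample.
sample = rng.sample(population, k=25)

print(sorted(sample))
```

Because `sample` draws without replacement, no member can appear twice, which matches the usual practice when selecting survey respondents.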

Rating Scale

A means of measuring interval or ratio data, wherein respondents
rate a target on a given attribute using scaled values; for example,
"Please rate your satisfaction with your current physician using a
10-point scale, where 10 means you are completely satisfied and
1 means you are not at all satisfied"

Ratio Scale

An interval scale with a true zero point; for example, height and
distance are measured on ratio scales

Range

A measure of variability indicating the difference between the highest and
lowest values in a distribution of scores.

Regression Analysis

A statistical technique used to measure the ability of one or more
independent variables to predict the value of a dependent variable; for
example, a regression analysis could be used to predict which website
characteristics are most influential in a person’s likelihood of returning to
the site.

Sample
A subset of the population

Sample Error

The chance that the results of a study are due to “misrepresentation” of the
sample caused by random chance; for example, out of 1 million flips of a
fair penny, the penny will land heads-up about 50% of the time and tails-up
about 50% of the time. If we look at a sample of 100 of those 1 million flips,
heads could come up 70% of the time even though, over the long term, it will
occur only about 50% of the time; this difference is due to sample error.
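
The coin example can be simulated. The sketch below is an illustration under assumed settings (a simulated fair coin, 100 repeated samples, an arbitrary seed); it compares how far the share of heads strays from 50% in samples of 100 flips and in samples of 10,000 flips.

```python
import random

rng = random.Random(1)  # fixed seed, chosen arbitrarily, so the run is repeatable

def share_of_heads(n_flips):
    """Flip a simulated fair coin n_flips times and return the share of heads."""
    heads = sum(rng.random() < 0.5 for _ in range(n_flips))
    return heads / n_flips

# Repeat each experiment 100 times to see how much the result varies.
small_samples = [share_of_heads(100) for _ in range(100)]
large_samples = [share_of_heads(10_000) for _ in range(100)]

# Spread = gap between the most and least heads-heavy sample.
spread_small = max(small_samples) - min(small_samples)
spread_large = max(large_samples) - min(large_samples)

print(f"spread with 100 flips:    {spread_small:.3f}")
print(f"spread with 10,000 flips: {spread_large:.3f}")
```

The spread for the larger samples should come out much narrower, which is the sense in which sample error shrinks as sample size grows.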

Sample Frame

A representative list from which the final sample for the study is drawn. This
means that the sample frame must be representative of the population, else
the sample will not be representative of the population; for example, the
telephone book can be used as a sample frame, but it does not include
people without a telephone, thereby biasing the sample drawn from that
frame against people without telephones

Sample Size

Refers to the number of people in the sample. Sample size is a key
determinant in sample error: as sample size increases, sample error
decreases

Secondary Research

Research gathered from sources other than directly from the population, e.g.
publications, associations, government research

Significance level
Established at the outset by a researcher when using statistical analysis to
test a hypothesis (e.g. 0.05 level or 0.01 significance level). A significance
level of 0.05 means that an observed difference or relationship of that size
would be expected to arise by chance only 5 times out of every 100 if the
null hypothesis were true (1 out of every 100 for a significance level of
0.01). It indicates the risk of the researcher making a Type I error (i.e. an
error that occurs when a researcher rejects the null hypothesis when it is
true and concludes that a statistically significant relationship/difference
exists when it does not).

Standard deviation

A descriptive statistic used to measure the degree of variability within a set
of scores

Statistical Significance

A measurement of the likelihood that an observed effect will be present in
the population, based on sample size, distribution of values, and size of the
effect; i.e., the difference between two values from a sample is considered
statistically significant if a statistical test indicates that the difference is
also likely to be present in the population. The term is used to indicate
whether the results of an analysis of data drawn from a sample are unlikely
to have been caused by chance at a specified level of probability (usually
0.05 or 0.01).

Stratified Sample

A sample drawn to match the distribution of a variable in the population; for
example, a stratified sample of Namibian residents would include the same
proportion of Windhoek residents as exists in the actual population of
Namibia.

Theoretical framework
The conceptual underpinning of a research study which may be based on
theory or a specific conceptual model (in which case it may be referred to as
the conceptual framework).

T-Test
A statistical test to determine if two groups are significantly different from
one another.
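
As an illustrative sketch, the function below computes the t statistic for two independent groups under the assumption of equal population variances (the pooled-variance form). The two groups of scores are hypothetical values chosen so that the arithmetic is easy to follow.

```python
import math
import statistics

def pooled_t_statistic(group_a, group_b):
    """Two-sample t statistic, assuming the groups share a common variance."""
    n_a, n_b = len(group_a), len(group_b)
    mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    # Weighted average of the two sample variances.
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    # Standard error of the difference between the two means.
    standard_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / standard_error

# Hypothetical scores for two groups (illustrative only).
group_a = [1, 2, 3, 4, 5]
group_b = [3, 4, 5, 6, 7]

t = pooled_t_statistic(group_a, group_b)
degrees_of_freedom = len(group_a) + len(group_b) - 2

print(t, degrees_of_freedom)
```

The resulting t is then compared with the critical value from a t table for the chosen significance level and degrees of freedom. With 8 degrees of freedom the two-tailed critical value at the 0.05 level is 2.306, so a t of this size would not count as significant at that level.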

Type I Error
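A false positive; incorrectly rejecting a null hypothesis that is true, and so
concluding that an effect exists when it does not

Type II Error
A false negative; incorrectly retaining a null hypothesis that is false, and so
concluding that no effect exists when one does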

Variable
A quantity or function that may assume a given value or set of values.

Variance
A measure of the total amount of variation in values for a given variable; for
example, the average daily temperature in Swakopmund, Namibia has a
larger variance over a period of a year (there is a greater amount of
variation in temperatures as the seasons change) than the average daily
temperature at the Cape (Cape Town, South Africa), which changes slightly
from day to day but does not change very much overall.

UNIT 2: Introduction to Linear Correlation and Regression

Correlation and regression refer to the relationship that exists between two
variables, X and Y, in the case where each particular value of Xi is paired
with one particular value of Yi.

For example:
• the measures of height of individuals paired with their corresponding
measures of weight;

• the number of hours that individual students in a statistics course
spend studying prior to an exam paired with their corresponding
measures of performance on the exam;

• the amount of class time that individual students in a statistics course
spend snoozing and daydreaming prior to an exam, paired with their
corresponding measures of performance on the exam; and so on.

Fundamentally, what we are studying is a variation on the theme of the
quantitative functional relationship.

• This means that the more you have of this variable, the more you
have of that one.

• Or conversely, the more you have of this variable, the less you have of
that one.

• For example, the more you have of height, the more you will tend to
have of weight;

• the more students study prior to a statistics exam, the more they will
tend to do well on the exam.

• Or conversely, the greater the amount of class time prior to the exam
that students spend snoozing and daydreaming, the less they will tend
to do well on the exam.

In the first kind of case (the more of this, the more of that), you are
speaking of a positive correlation between the two variables; and in
the second kind (the more of this, the less of that), you are speaking
of a negative correlation between the two variables.

Correlation and regression are two sides of the same coin. This means
that you can begin with either one and end up with the other.

Correlation

Here is an introductory example of correlation, taken from the realm of
education and public affairs. If you were a secondary or high school student
in Namibia, the chances are that you would be familiar with an instrument
known as the Scholastic Achievement Test (SAT). Assume that the SAT
results for eight regions are as follows:

Region      Percentage taking SAT    Average SAT score
Khomas               5                     1103
Erongo               6                     1101
Omaheke              6                     1060
Hardap               9                     1042
Karas               88                      904
Omasati             81                      903
Oshana              76                      892
Kavango             74                      887

The four regions near the top of the list had quite small percentages of high
school seniors taking the SAT, while the four regions near the bottom had
quite large percentages of high school seniors taking it.

I think you will agree that this observation raises some interesting
questions. For example: Could it be that the 5% of Khomas high school
seniors who took the SAT in 2008 were the top 5%? What might have been
the average SAT score for Karas if the test in that region had been taken only
by the top 5% of high school seniors, rather than by (presumably) the "top"
88%? You can no doubt imagine any number of variations on this theme.

Figure 3.1 below shows the relationship between these two variables,
percentage of high school seniors taking the SAT versus average state score
on the SAT, for all the 50 schools sampled in the regions. Within the context
of correlation and regression, a two-variable coordinate plot of this general
type is typically spoken of as a scatterplot or scattergram.

Either way, it is simply a variation on the theme of Cartesian coordinate
plotting that you have almost certainly already encountered in your prior
educational experience. It is a standard method for graphically representing
the relationship that exists between two variables, X and Y, in the case
where each particular value of Xi is paired with one particular value of Yi.

Figure 3.1 Percentage of High School Seniors Taking the SAT versus
Average Combined State SAT Scores: 2008

For the present example, designating the percentage of high school seniors
within a state taking the SAT as X, and the state's combined average SAT
score as Y, we would have a total of N=50 paired values of Xi and Yi. Thus
for schools in Khomas, Xi=5% would be paired with Yi=1103; for schools in
Omasati, Xi=81% would be paired with Yi=903; and so on for all the 50
schools in the dataset. The entire bivariate list would look like the following,
except that the abstract designations for Xi and Yi would of course be
replaced by particular numerical values.

School in    Xi (percentage    Yi (average
region       taking SAT)       SAT score)
1            X1                Y1
2            X2                Y2
:            :                 :
49           X49               Y49
50           X50               Y50

The next step in bivariate coordinate plotting is to lay out two axes at right
angles to each other. By convention, the horizontal axis is assigned to the X
variable and the vertical axis to the Y variable, with values of X increasing
from left to right and values of Y increasing from bottom to top.

A further convention in bivariate coordinate plotting applies only to those
cases where a causal relationship is known or hypothesized to exist between
the two variables. In examining the relationship between two causally
related variables, the independent variable is the one that is capable of
influencing the other, and the dependent variable is the one that is
capable of being influenced by the other. For example, growing taller will
tend to make you grow heavier, whereas growing heavier will have no
systematic effect on whether you grow taller.

In the relationship between human height and weight, therefore, height is
the independent variable and weight the dependent variable. The amount of
time you spend studying before an exam can affect your subsequent
performance on the exam, whereas your performance on the exam cannot
retroactively affect the prior amount of time you spent studying for it.
Hence, amount of study is the independent variable and performance on the
exam is the dependent variable.

In the present SAT example, the percentage of high school seniors within a
state who take the SAT can conceivably affect the state's average score on
the SAT, whereas the state's average score in any given year cannot
retroactively influence the percentage of high school seniors who took the
test. Thus, the percentage of high school seniors taking the test is the
independent variable, X, while the average state score is the dependent
variable, Y.

In cases of this type, the convention is to reserve the X axis for the
independent variable and the Y axis for the dependent variable. For cases
where the distinction between "independent" and "dependent" does not
apply, it makes no difference which variable is called X and which is called Y.

In designing a coordinate plot of this type, it is not generally necessary to
begin either the X or the Y axis at zero. The X axis can begin at or slightly
below the lowest observed value of Xi, and the Y axis can begin at or slightly
below the lowest observed value of Yi.

In Figure 3.1 the X axis does begin at zero, because any value much larger
than that would lop off the lower end of the distribution of Xi values;
whereas the Y axis begins at 800, because the lowest observed value of Yi is
838.

At any rate, the clear message of Figure 3.1 is that states with relatively low
percentages of high school seniors taking the SAT in 2008 tended to have
relatively high average SAT scores, while those that had relatively high
percentages of high school seniors taking the SAT tended to have relatively
low average SAT scores. The relationship is not a perfect one, though it is
nonetheless clearly visible to the naked eye. The following version of
Figure 3.1 will make it even more visible. It is the same as shown before,
except that now we include the straight line that forms the best "fit" of this
relationship. We will return to the meaning and derivation of this line a bit
later.

Actually, in this particular example there are two somewhat different
patterns that the 50 school data points could be construed as fitting. The
first is the pattern delineated by the solid downward slanting straight line,
and the second is the one marked out by the dotted and mostly downward
sloping curved line.

A relationship that can be described by a straight line is spoken of as linear
(short for 'rectilinear'), while one that can be described by a curved line is
spoken of as curvilinear. We will touch upon the subject of curvilinear
correlation in a later chapter. Our present coverage will be confined to linear
correlation.

Figures 3.2a to 3.2e illustrate the various forms that linear correlation is capable
of taking. The basic possibilities are: (i) positive correlation; (ii) negative
correlation; and (iii) zero correlation. In the case of zero correlation, the
coordinate plot will look something like the rather patternless jumble shown
in Figure 3.2a, reflecting the fact that there is no systematic tendency for X
and Y to be associated, either the one way or the other.

The plot for a positive correlation, on the other hand, will reflect the
tendency for high values of Xi to be associated with high values of Yi, and
vice versa; hence, the data points will tend to line up along an upward
slanting diagonal, as shown in Figure 3.2b. The plot for negative correlation
will reflect the opposite tendency for high values of Xi to be associated with
low values of Yi, and vice versa; hence, the data points will tend to line up
along a downward slanting diagonal, as shown in Figure 3.2d.

The limiting case of linear correlation, as illustrated in Figures 3.2c and 3.2e,
is when the data points line up along the diagonal like beads on a taut
string. This arrangement, typically spoken of as perfect correlation, would
represent the maximum degree of linear correlation, positive or negative,
that could possibly exist between two variables.

In the real world you will normally find perfect linear correlations only in the
realm of basic physical principles; for example, the relationship between
voltage and current in an electrical circuit with constant resistance. Among
the less tidy phenomena of the behavioral and biological sciences, positive
and negative linear correlations are much more likely to be of the
"imperfect" types illustrated in Figures 3.2b and 3.2d.

The Measurement of Linear Correlation

The primary measure of linear correlation is the Pearson product-moment
correlation coefficient, symbolized by the lower-case Roman letter r,
which ranges in value from r=+1.0 for a perfect positive correlation to
r=-1.0 for a perfect negative correlation. The midpoint of its range, r=0.0,
corresponds to a complete absence of correlation. Values falling between
r=0.0 and r=+1.0 represent varying degrees of positive correlation, while
those falling between r=0.0 and r=-1.0 represent varying degrees of
negative correlation.
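
As a rough illustration, the sketch below applies the standard textbook definition of Pearson's r (the sum of the products of paired deviations, divided by the square root of the product of the two sums of squared deviations) to the eight-region SAT table given earlier. The function name `pearson_r` is a label chosen here for illustration, and the calculation is offered ahead of the manual's own treatment of the computational details.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation for paired lists xs and ys."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of paired deviations from the two means.
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable.
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / math.sqrt(sum_xx * sum_yy)

# Eight regions from the table: Khomas, Erongo, Omaheke, Hardap,
# Karas, Omasati, Oshana, Kavango.
percent_taking = [5, 6, 6, 9, 88, 81, 76, 74]
average_score = [1103, 1101, 1060, 1042, 904, 903, 892, 887]

r = pearson_r(percent_taking, average_score)
print(round(r, 2))
```

For these eight regions r comes out strongly negative, which agrees with the pattern visible in the table: the higher the percentage taking the test, the lower the average score.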

A closely related companion measure of linear correlation is the coefficient
of determination, symbolized as r², which is simply the square of the
correlation coefficient. The coefficient of determination can have only
positive values ranging from r²=+1.0 for a perfect correlation (positive or
negative) down to r²=0.0 for a complete absence of correlation. The
advantage of the correlation coefficient, r, is that it can have either a
positive or a negative sign and thus provide an indication of the positive or
negative direction of the correlation.

The advantage of the coefficient of determination, r², is that it provides an
equal interval and ratio scale measure of the strength of the correlation.
In effect, the correlation coefficient, r, gives you the true direction of the
correlation (+ or -) but only the square root of the strength of the
correlation; while the coefficient of determination, r², gives you the true
strength of the correlation but without an indication of its direction. Both of
them together give you the whole works.

We will examine the details of calculation for these two measures in a
moment, but first a bit more by way of introducing the general concepts.
Figure 3.3 shows four specific examples of r and r², each produced by taking
two very simple sets of X and Y values, namely

Xi = {1, 2, 3, 4, 5, 6} and Yi = {2, 4, 6, 8, 10, 12}

and pairing them up in one or another of four different ways. In Example I
they are paired in such a way as to produce a perfect positive correlation,
resulting in a correlation coefficient of r=+1.0 and a coefficient of
determination of r²=1.0. In Example II the pairing produces a somewhat
looser positive correlation that yields a correlation coefficient of r=+0.66 and
a coefficient of determination of r²=0.44.

For purposes of interpretation, you can translate the coefficient of
determination into terms of percentages (i.e., percentage=r²x100), which
will then allow you to say such things as, for example, that the correlation in
Example I (r²=1.0) is 100% as strong as it possibly could be, given these
particular values of Xi and Yi, whereas the one in Example II (r²=0.44) is
only 44% as strong as it possibly could be.

Alternatively, you could say that the looser positive correlation of Example II
is only 44% as strong as the perfect one shown in Example I. The essential
meaning of "strength of correlation" in this context is that such-and-such
percentage of the variability of Y is associated with (tied to, linked with,
coupled with) variability in X, and vice versa. Thus, for Example I, 100% of
the variability in Y is coupled with variability in X; whereas, in Example II,
only 44% of the variability in Y is linked with variability in X.

Figure 3.3. Four Different Pairings of the Same Values of X and Y

The correlations shown in Examples III and IV are obviously mirror images
of the ones just described. For Example III the six values of Xi and Yi are
paired in such a way as to produce a perfect negative correlation, which
yields a correlation coefficient of r=-1.0 and a coefficient of determination
of r²=1.0.

In Example IV the pairing produces a looser negative correlation, resulting in
a correlation coefficient of r=-0.66 and a coefficient of determination of
r²=0.44. Here again you can say for Example III that 100% of the variability
in Y is coupled with variability in X; whereas for Example IV only 44% of the
variability in Y is linked with variability in X.

You can also go further and say that the perfect positive and negative
correlations in Examples I and III are of equal strength (for both, r²=1.0)
but in opposite directions; and similarly, that the looser positive and
negative correlations in Examples II and IV are of equal strength (for both,
r²=0.44) but in opposite directions.
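
The values quoted for Examples I to III can be checked with a short calculation. The sketch below assumes that Example I pairs both lists in ascending order and that Example III pairs ascending X with descending Y (the only pairings of these values that give perfect correlations), and it takes the Example II pairing from the Pair/Xi/Yi listing (pairs a to f) in the text. The exact pairing for Example IV is not listed in the text, so it is left out.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation for paired lists xs and ys."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / math.sqrt(sum_xx * sum_yy)

x = [1, 2, 3, 4, 5, 6]
pairings = {
    "I": [2, 4, 6, 8, 10, 12],    # ascending Y: perfect positive (assumed pairing)
    "II": [6, 2, 4, 10, 12, 8],   # pairs a to f as listed in the text
    "III": [12, 10, 8, 6, 4, 2],  # descending Y: perfect negative (assumed pairing)
}

# Correlation coefficient for each pairing of the same X and Y values.
results = {name: pearson_r(x, y) for name, y in pairings.items()}

for name, r in results.items():
    print(name, round(r, 2), round(r * r, 2))
```

For Example II the unrounded coefficient is 23/35, which rounds to 0.66. Squaring the unrounded value gives roughly 0.43, whereas squaring the rounded 0.66 gives the 0.44 quoted in the text, so a small difference in the last digit is a rounding effect rather than a disagreement.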

To illustrate the next point in closer detail, we will focus for a moment on the
particular pairing of Xi and Yi values that produced the positive correlation
shown in Example II of Figure 3.3.

Pair    Xi    Yi

a        1     6
b        2     2
c        3     4
d        4    10
e        5    12
f        6     8

When you perform the computational procedures for linear correlation and
regression, what you are essentially doing is defining the straight line that
best fits the bivariate distribution of data points, as shown in the following
version of the same graph. This line is spoken of as the regression line, or
line of regression, and the criterion for "best fit" is that the sum of the
squared vertical distances (the green lines ||||) between the data points and
the regression line must be as small as possible.

As it happens, this line of best fit will in every instance pass
through the point at which the mean of X and the mean of Y
intersect on the graph. In the present example, the mean of X is
3.5 and the mean of Y is 7.0. Their point of intersection occurs at
the convergence of the two dotted gray lines.
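The two properties just described can be checked numerically. The following is a minimal sketch for the Example II pairing; it assumes the standard textbook least-squares formulas for slope and intercept, which this unit has not yet introduced, and the function name is only illustrative.

```python
# Example II pairing of Figure 3.3
x = [1, 2, 3, 4, 5, 6]
y = [6, 2, 4, 10, 12, 8]

n = len(x)
mean_x = sum(x) / n  # 3.5
mean_y = sum(y) / n  # 7.0

# Textbook least-squares slope and intercept (assumed here; not derived in this unit)
slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
intercept = mean_y - slope * mean_x


def squared_vertical_distances(a, b):
    """Sum of squared vertical distances between the points and the line y = a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))


best = squared_vertical_distances(intercept, slope)

# The fitted line passes through (mean of X, mean of Y)
print("passes through the means:", abs(intercept + slope * mean_x - mean_y) < 1e-9)

# Nudging the intercept or slope in any direction makes the fit worse
nudges = [(0.1, 0), (-0.1, 0), (0, 0.1), (0, -0.1)]
print("nudged lines fit worse:",
      all(squared_vertical_distances(intercept + da, slope + db) > best
          for da, db in nudges))
```

Both checks should print True for this data set; the same checks can be repeated with any other pairing of the values.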

The details of this line—in particular, where it begins on the Y axis and the
rate at which it slants either upward or downward—will not be explicitly
drawn out until we consider the regression side of correlation and
regression. Nonetheless, they are implicitly present when you perform the
computational procedures for the correlation side of the coin. As indicated
above, the slant of the line upward or downward is what determines the sign
of the correlation coefficient (r), positive or negative; and the degree to
which the data points are lined up along the line, or scattered away from it,
determines the strength of the correlation (r²).

You have already encountered the general concept of variance for the case
where you are describing the variation that exists among the variate
instances of a single variable. The measurement of linear correlation
requires an extension of this concept to the case where you are describing
the co-variation that exists among the paired bivariate instances of two
variables, X and Y, together. We have already touched upon the general
concept. In positive correlation, high values of X tend to be associated with
high values of Y, and low values of X tend to be associated with low values
of Y.

In negative correlation, it is the opposite: high values of X tend to be
associated with low values of Y, and low values of X tend to be associated
with high values of Y. In both cases, the phrase "tend to be associated" is
another way of saying that the variability in X tends to be coupled with
variability in Y, and vice versa—or, in brief, that X and Y tend to vary
together. The raw measure of the tendency of two variables, X and Y, to
co-vary is a quantity known as the covariance. As it happens, you will not
need to be able to calculate the quantity of covariance in and of itself,
because the destination we are aiming for, the calculation of r and r², can be
reached by way of a simpler shortcut.

However, you will need to have at least the general concept of it; so keep in
mind as we proceed through the next few paragraphs that covariance is a
measure of the degree to which two variables, X and Y, co-vary.

In its underlying logic, the Pearson product-moment correlation coefficient
comes down to a simple ratio between (i) the amount of covariation between
X and Y that is actually observed, and (ii) the amount of covariation that
would exist if X and Y had a perfect (100%) positive correlation. Thus

$$r = \frac{\text{observed covariance}}{\text{maximum possible positive covariance}}$$

As it turns out, the quantity listed above as "maximum possible positive
covariance" is precisely determined by the two separate variances of X and
Y. This is for the simple reason that X and Y can co-vary, together, only in
the degree that they vary separately. If either of the variables had zero
variability (for example, if the values of Xi were all the same), then clearly
they could not co-vary. Specifically, the maximum possible positive
covariance that can exist between two variables is equal to the
geometric mean of the two separate variances.

For any n numerical values, a, b, c, etc., the geometric mean is the nth root of the
product of those values. Thus, the geometric mean of a and b would be the square
root of a x b; the geometric mean of a, b and c would be the cube root of a x b x c;
and so on.

So the structure of the relationship now comes down to

$$r = \frac{\text{observed covariance}}{\sqrt{(\text{variance}_X) \times (\text{variance}_Y)}}$$
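As an illustration of this ratio, here is a minimal sketch using the Example II pairing. It assumes the covariance is taken as the average co-deviate (dividing by N, to match the variance as SS/N); the helper names are hypothetical.

```python
import math

x = [1, 2, 3, 4, 5, 6]
y = [6, 2, 4, 10, 12, 8]   # Example II pairing


def mean(values):
    return sum(values) / len(values)


def covariance(a, b):
    """Average co-deviate (population form, dividing by N)."""
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)


def variance(a):
    """Average squared deviate, SS/N."""
    return covariance(a, a)


# Geometric mean of the two variances: the ceiling on the covariance
max_cov = math.sqrt(variance(x) * variance(y))
r = covariance(x, y) / max_cov

# Pairing the sorted values (Example I) reaches that ceiling
cov_perfect = covariance(sorted(x), sorted(y))

print(round(r, 2))
print(math.isclose(cov_perfect, max_cov))
```

For these particular values the ceiling is actually reached, because the sorted pairing is perfectly linear. With other data the geometric mean is still the upper limit, but no pairing of the values need attain it.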
Although in principle this relationship involves two variances and a
covariance, in practice, through the magic of algebraic manipulation, it boils
down to something that is computationally much simpler. In the following
formulation you will immediately recognize the meaning of SSX, which is the
sum of squared deviates for X; by extension, you will also be able to
recognize SSY, which is the sum of squared deviates for Y.

Recall that "sqrt" means "the square root of." In order to get from the
formula above to the one below, you will need to recall that the variance (s²)
of a set of values is simply the average of the squared deviates: SS/N.

$$r = \frac{SC_{XY}}{\sqrt{SS_X \times SS_Y}}$$

The third item, SCXY, denotes a quantity that we will speak of as the sum of
co-deviates; and as you can no doubt surmise from the name, it is
something very closely akin to a sum of squared deviates. SSX is the raw
measure of the variability among the values of Xi; SSY is the raw measure of
the variability among the values of Yi; and SCXY is the raw measure of the
co-variability of X and Y together.

To understand this kinship, recall from Chapter 2 precisely what is
meant by the term "deviate."

For any particular item in a set of measures of the variable X,

$$\text{deviate}_X = X_i - M_X$$

Similarly, for any particular item in a set of measures of the variable Y,

$$\text{deviate}_Y = Y_i - M_Y$$

As you have probably already guessed, a co-deviate pertaining to a
particular pair of XY values involves the deviateX of the Xi member of the
pair and the deviateY of the Yi member of the pair. The specific way in
which these two are joined to form the co-deviate is

$$\text{co-deviate}_{XY} = (\text{deviate}_X) \times (\text{deviate}_Y)$$

And finally, the analogy between a co-deviate and a squared deviate:

For a value of Xi, the squared deviate is (deviateX) x (deviateX)

For a value of Yi it is (deviateY) x (deviateY)

And for a pair of Xi and Yi values, the co-deviate is (deviateX) x (deviateY)

This should give you a sense of the underlying concepts. Just keep in mind,
no matter what particular computational sequence you follow when you
calculate the correlation coefficient, that what you are fundamentally
calculating is the ratio of the observed covariance to the maximum possible
positive covariance, which, for computational purposes, comes down to:

$$r = \frac{\text{observed covariance}}{\text{maximum possible positive covariance}} = \frac{SC_{XY}}{\sqrt{SS_X \times SS_Y}}$$
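The step from the covariance ratio to the computational form can be written out explicitly. This is only a sketch: it assumes, by analogy with the variance being SS/N, that the covariance is the average co-deviate, SC_XY/N, a definition the text implies but does not state outright. The N in the numerator then cancels against the N that comes out from under the square root:

$$
r = \frac{SC_{XY}/N}{\sqrt{(SS_X/N)\,(SS_Y/N)}}
  = \frac{SC_{XY}/N}{\sqrt{SS_X \times SS_Y}\,/\,N}
  = \frac{SC_{XY}}{\sqrt{SS_X \times SS_Y}}
$$

The same cancellation occurs if every average is taken over N - 1 instead of N, so that choice does not affect the value of r.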

Now for the nuts-and-bolts of it. Here, once again, is the particular pairing of
Xi and Yi values that produced the positive correlation shown in Example II
of Figure 3.3. But now we subject them to a bit of number-crunching,
calculating the square of each value of Xi and Yi, along with the cross-
product of each XiYi pair. These are the items that will be required for the
calculation of the three summary quantities in the above formula: SSX, SSY,
and SCXY.

Pair    Xi    Yi    Xi²    Yi²    XiYi

a        1     6      1     36       6
b        2     2      4      4       4
c        3     4      9     16      12
d        4    10     16    100      40
e        5    12     25    144      60
f        6     8     36     64      48

sums    21    42     91    364     170

SSX: sum of squared deviates for Xi values

The sum of squared deviates for a set of Xi values can be calculated
according to the computational formula

$$SS_X = \sum X_i^2 - \frac{(\sum X_i)^2}{N}$$

In the present example,

N = 6 [because there are 6 values of Xi]
ΣXi² = 91
ΣXi = 21
(ΣXi)² = (21)² = 441

Thus:
SSX = 91 - (441/6) = 17.5

SSY: sum of squared deviates for Yi values

Similarly, the sum of squared deviates for a set of Yi values can be
calculated according to the formula

$$SS_Y = \sum Y_i^2 - \frac{(\sum Y_i)^2}{N}$$

In the present example,

N = 6 [because there are 6 values of Yi]
ΣYi² = 364
ΣYi = 42
(ΣYi)² = (42)² = 1764

Thus:
SSY = 364 - (1764/6) = 70.0

SCXY: sum of co-deviates for paired values of Xi and Yi

A moment ago we observed that the sum of co-deviates for paired values of
Xi and Yi is analogous to the sum of squared deviates for either of those
variables separately. You will probably be able to see that this analogy also
extends to the computational formula for the sum of co-deviates:

$$SC_{XY} = \sum (X_i Y_i) - \frac{(\sum X_i)(\sum Y_i)}{N}$$

Again, for the present example,

N = 6 [because there are 6 XiYi pairs]
ΣXi = 21
ΣYi = 42
(ΣXi)(ΣYi) = 21 x 42 = 882
Σ(XiYi) = 170

Thus:
SCXY = 170 - (882/6) = 23.0

Once you have these preliminaries, SSX = 17.5, SSY = 70.0, and SCXY = 23.0,
you can then easily calculate the correlation coefficient as

$$r = \frac{SC_{XY}}{\sqrt{SS_X \times SS_Y}} = \frac{23.0}{\sqrt{17.5 \times 70.0}} = +0.66$$

and the coefficient of determination as r² = (+0.66)² = 0.44
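The whole computational sequence for Example II can be reproduced in a few lines. This is a minimal sketch using the computational formulas as reconstructed from the worked arithmetic above; the variable names are illustrative only.

```python
import math

# Example II pairing of Figure 3.3
pairs = [(1, 6), (2, 2), (3, 4), (4, 10), (5, 12), (6, 8)]
n = len(pairs)

sum_x = sum(x for x, _ in pairs)        # 21
sum_y = sum(y for _, y in pairs)        # 42
sum_x2 = sum(x * x for x, _ in pairs)   # 91
sum_y2 = sum(y * y for _, y in pairs)   # 364
sum_xy = sum(x * y for x, y in pairs)   # 170

ss_x = sum_x2 - sum_x ** 2 / n          # 17.5
ss_y = sum_y2 - sum_y ** 2 / n          # 70.0
sc_xy = sum_xy - sum_x * sum_y / n      # 23.0

r = sc_xy / math.sqrt(ss_x * ss_y)
print(ss_x, ss_y, sc_xy)
print(round(r, 2), round(r ** 2, 2))
```

Note that squaring the unrounded r (about 0.657) gives roughly 0.43, whereas squaring the rounded value 0.66, as the text does, gives 0.44. The difference is purely one of rounding.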

Recall that each example starts out with the same values of Xi and Yi:
Xi = {1, 2, 3, 4, 5, 6} and Yi = {2, 4, 6, 8, 10, 12}
They differ only with respect to how these values are paired up with one
another.

Hence, the following values will remain the same from one example to
another:

N = 6
ΣXi = 21    ΣXi² = 91    ΣYi = 42    ΣYi² = 364
SSX = 17.5    SSY = 70.0

The only thing that changes is the co-variation, as measured in its rawest
form by the sum of the XiYi cross-products, and then by SCXY, the sum of
co-deviates. Recall that the computational formulas for the relevant SS and
SC measures are

$$SS_X = \sum X_i^2 - \frac{(\sum X_i)^2}{N} \qquad SS_Y = \sum Y_i^2 - \frac{(\sum Y_i)^2}{N} \qquad SC_{XY} = \sum (X_i Y_i) - \frac{(\sum X_i)(\sum Y_i)}{N}$$

and that the formula for the correlation coefficient is

$$r = \frac{SC_{XY}}{\sqrt{SS_X \times SS_Y}}$$

For all four examples, SSX = 17.5, SSY = 70.0 and sqrt[SSX x SSY] = 35.0, so
that in each case r = SCXY/35.

Example      r        r²

I          +1.0     1.0
II         +0.66    0.44
III        -1.0     1.0
IV         -0.66    0.44
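To see that only the pairing matters, the four examples can be run through the same function. The pairings for Examples I and III follow from their being perfect correlations, and Example II is tabulated above; the pairing used for Example IV is an assumption (Example II's Y column in reverse order), since the figure that shows it is not reproduced here.

```python
import math

x = [1, 2, 3, 4, 5, 6]
pairings = {
    "I": [2, 4, 6, 8, 10, 12],
    "II": [6, 2, 4, 10, 12, 8],
    "III": [12, 10, 8, 6, 4, 2],
    "IV": [8, 12, 10, 4, 2, 6],   # assumed: Example II's Y values reversed
}


def sc_and_r(xs, ys):
    """Return (SCXY, r) using the computational formulas."""
    n = len(xs)
    ss_x = sum(v * v for v in xs) - sum(xs) ** 2 / n
    ss_y = sum(v * v for v in ys) - sum(ys) ** 2 / n
    sc_xy = sum(a * b for a, b in zip(xs, ys)) - sum(xs) * sum(ys) / n
    return sc_xy, sc_xy / math.sqrt(ss_x * ss_y)


results = {name: sc_and_r(x, ys) for name, ys in pairings.items()}
for name, (sc_xy, r) in results.items():
    print(f"Example {name}: SCXY = {sc_xy:+.1f}, r = {r:+.2f}")
```

SSX and SSY never change, so the sign and size of r are driven entirely by SCXY.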

UNIT 3: RANK-ORDER CORRELATION

Correlation, as described so far, applies to those cases where the values of
X and of Y are both measured on an equal-interval scale. It is also possible
to apply the apparatus of linear correlation to cases where X and Y are
measured on a merely ordinal scale. When applied to ordinal data, the measure
of correlation is spoken of as the Spearman rank-order correlation coefficient,
typically symbolized as rs.

Suppose, for example, that two experts, X and Y, were asked to rank N=8
items with respect to some dimension germane to their field of expertise
(rank#1=highest, rank#8=lowest). To make it specific, you can imagine two
physicians ranking 8 patients with respect to the severity of their disease;
two psychotherapists ranking 8 patients with respect to the likelihood of
improvement; two wine experts ranking 8 wines from best to worst; two
statisticians ranking 8 statistical concepts with respect to their fundamental
importance; or whatever else it might be that strikes your fancy.

As a token of my liberal-mindedness (for I am one of those benighted
souls who find all wines to taste suspiciously like vinegar), I will use the
image of the wine experts. The following table shows the rankings from 1
to 8, best to worst, of two experts, X and Y.

wine X Y

a 1 2
b 2 1
c 3 5
d 4 3
e 5 4
f 6 7
g 7 8
h 8 6

As you can see from the table above, there is a substantial degree of
agreement between the rankings of the two experts. Plug the bivariate
values of X and Y into the formulaic structure

$$r = \frac{SC_{XY}}{\sqrt{SS_X \times SS_Y}}$$

and you will find r = +.83 and r² = .69.

As it happens, these are exactly the same values you will get when you
calculate the Spearman coefficient, rs. The simple reason for this is that r
and rs are algebraically equivalent in the case where the values of X and Y
consist of two sets of N rankings. The only advantage of rs is that the
calculations are easier if you are doing them by hand. [Note, however, that
rs is precisely equal to r only when the rankings within X and Y are the
consecutive integer values: 1, 2, 3, and so on, with no ties. With tied ranks,
rs will tend to be larger than r. If the proportion of tied ranks is fairly large,
you would be better advised to plug your rankings for X and Y into the
standard formula for r.]

The Simple Formula for rs, for Rankings without Ties

Here is the same table you saw above, except now we also take the difference
between each pair of ranks (D = X - Y), and then the square of each
difference.

wine    X    Y     D    D²

a       1    2    -1    1
b       2    1     1    1
c       3    5    -2    4
d       4    3     1    1
e       5    4     1    1
f       6    7    -1    1
g       7    8    -1    1
h       8    6     2    4

N = 8        ΣD² = 14

All that is required for the calculation of the Spearman coefficient are the
values of N and ΣD², according to the formula

$$r_s = 1 - \frac{6 \sum D^2}{N(N^2 - 1)}$$

If this formula seems a bit odd to you, you are in good company.
Generations of statistics students have been presented with it, and
generations have puzzled over such mind-bending questions as: why do you
start out with "1" and subtract something from it?; where does that
N(N² - 1) in the denominator come from?; and, above all, how does that
peculiar "6" get into the numerator?

Here are the answers to these age-old questions in a nutshell.

o For any set of N paired bivariate ranks, the minimum possible value
  of ΣD² occurs in the case of perfect positive correlation. In this case,
  rank 1 for X is paired with rank 1 for Y, rank 2 for X with rank 2 for Y,
  and so on. Each value of D will accordingly be equal to zero, and so
  too will be the sum of the squared values of D.

o Conversely, the maximum possible value of ΣD² occurs in the case of
  perfect negative correlation. This maximum possible value is in every
  instance equal to

  $$\text{maximum } \sum D^2 = \frac{N(N^2 - 1)}{3}$$

o Thus, for N = 8 with perfect negative correlation:

  Item    X    Y     D    D²

  a       1    8    -7    49
  b       2    7    -5    25
  c       3    6    -3     9
  d       4    5    -1     1
  e       5    4     1     1
  f       6    3     3     9
  g       7    2     5    25
  h       8    1     7    49

  ΣD² = 168        8(8² - 1)/3 = 168

o The ratio of the observed ΣD² to its maximum possible value will
  therefore be equal to zero in the case of perfect positive correlation,
  to +1.0 in the case of perfect negative correlation, and to +.50 in the
  case of zero correlation.

  $$\frac{\sum D^2}{N(N^2 - 1)/3} = \frac{3 \sum D^2}{N(N^2 - 1)}$$

o Double this ratio, subtract it from 1, and voila! you have a quantity
  that will be equal to +1.0 in the case of perfect positive correlation, to
  -1.0 in the case of perfect negative correlation, and to zero in the
  case of zero correlation.

  $$r_s = 1 - \frac{6 \sum D^2}{N(N^2 - 1)}$$
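The claims in this derivation can be checked by brute force. The sketch below (function names are hypothetical) confirms that fully reversed rankings give ΣD² = N(N² - 1)/3 for a range of N, and that doubling the ratio and subtracting it from 1 maps the two extremes and the midpoint to +1, -1 and 0.

```python
def sum_squared_rank_differences(rank_x, rank_y):
    """Sum of D^2 over paired ranks, where D = X - Y."""
    return sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))


def max_sum_d2(n):
    """Closed form N(N^2 - 1)/3; always a whole number for whole N."""
    return n * (n * n - 1) // 3


def spearman_from_sum_d2(sum_d2, n):
    """Double the ratio of observed to maximum sum of D^2 and subtract it from 1."""
    ratio = sum_d2 / max_sum_d2(n)   # 0 for identical rankings, 1 for reversed rankings
    return 1 - 2 * ratio


# Perfect negative correlation: ranks paired with the same ranks in reverse
for n in range(2, 13):
    ranks = list(range(1, n + 1))
    assert sum_squared_rank_differences(ranks, ranks[::-1]) == max_sum_d2(n)

print(max_sum_d2(8))
print(spearman_from_sum_d2(0, 8), spearman_from_sum_d2(84, 8), spearman_from_sum_d2(168, 8))
```

Since 2 x 3ΣD² / [N(N² - 1)] is 6ΣD² / [N(N² - 1)], this is also where the "6" in the numerator comes from.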
And here, finally, is the calculation of rs for the example with which we
began:

wine    X    Y     D    D²

a       1    2    -1    1
b       2    1     1    1
c       3    5    -2    4
d       4    3     1    1
e       5    4     1    1
f       6    7    -1    1
g       7    8    -1    1
h       8    6     2    4

N = 8        ΣD² = 14

$$r_s = 1 - \frac{6 \sum D^2}{N(N^2 - 1)} = 1 - \frac{6 \times 14}{8(8^2 - 1)} = 1 - \frac{84}{504} = 1 - 0.167 = +.83$$

rs² = .69
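As a cross-check, the sketch below computes rs from the simple formula and the Pearson r from the SC/SS formula on the same rankings. With consecutive integer ranks and no ties, the two should agree, as the text states; the variable names are illustrative.

```python
import math

rank_x = [1, 2, 3, 4, 5, 6, 7, 8]
rank_y = [2, 1, 5, 3, 4, 7, 8, 6]
n = len(rank_x)

# Spearman coefficient from the simple formula (no tied ranks)
sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))   # 14
r_s = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# Pearson r on the same ranks via SCXY / sqrt(SSX x SSY)
ss_x = sum(a * a for a in rank_x) - sum(rank_x) ** 2 / n
ss_y = sum(b * b for b in rank_y) - sum(rank_y) ** 2 / n
sc_xy = sum(a * b for a, b in zip(rank_x, rank_y)) - sum(rank_x) * sum(rank_y) / n
r = sc_xy / math.sqrt(ss_x * ss_y)

print(round(r_s, 2), round(r, 2), round(r_s ** 2, 2))
```

With tied ranks the two results would no longer be guaranteed to match, which is why the text recommends the standard formula for r in that case.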

The meanings of rs and rs² in a rank-order correlation are essentially the
same as those of r and r² in a correlation based on equal-interval data. For
the present example, rs² = .69 means that the covariation between the X
and Y rankings is 69% as strong as it possibly could be, and the positive sign
of rs = +.83 signals that this covariation occurs along the upward slant, with
higher values of X tending to be associated with higher values of Y, and vice
versa.

However, I would not recommend taking the parallels much farther than
this. In particular, I think it would not make much sense to subject bivariate
rankings to the predictive apparatus of linear regression.

UNIT 4: SAMPLING AND SURVEY DESIGN

Knowing what the client wants is the key factor to success in any type of
business. News media, government agencies and political candidates need to
know what the public thinks. Associations need to know what their members
want. Large companies need to measure the attitudes of their employees.
The best way to find this information is to conduct a survey.

This chapter is intended primarily for those who are new to survey research.
It discusses options and provides suggestions on how to design and conduct
a successful survey project. It does not provide instruction on using specific
parts of The Survey System, although it mentions parts of the program that
can help you with certain tasks.

The Steps in a Survey Project

1. Establish the goals of the project - What do you want to learn?
2. Determine your sample - Whom will you ask?
3. Choose interviewing methodology - How will you ask?
4. Create your questionnaire - What will you ask?
5. Pre-test the questionnaire, if practical - Test the questions.
6. Conduct interviews and enter data - Ask the questions.
7. Analyze the data - Produce the reports.

Establishing Goals

The first step in any survey is deciding what you want to learn. The goals of
the project determine whom you will survey and what you will ask them. If
your goals are unclear, the results will probably be unclear. Some typical
goals include learning more about:

• The potential market for a new product or service
• Ratings of current products or services
• Employee attitudes
• Customer/patient satisfaction levels
• Reader/viewer/listener opinions
• Association member opinions
• Opinions about political candidates or issues
• Corporate images

These sample goals represent general areas. The more specific you can
make your goals, the easier it will be to get usable answers.

Selecting Your Sample

There are two main components in determining whom you will interview.
The first is deciding what kind of people to interview. Researchers often call
this group the target population. If you conduct an employee attitude survey
or an association membership survey, the population is obvious. If you are
trying to determine the likely success of a product, the target population
may be less obvious. Correctly determining the target
population is critical. If you do not interview the right kinds of people, you
will not successfully meet your goals.

The next thing to decide is how many people you need to interview.
Statisticians know that a small, representative sample will reflect the group
from which it is drawn. The larger the sample, the more precisely it reflects
the target group. However, the rate of improvement in the precision
decreases as your sample size increases. For example, to increase a sample
from 250 to 1,000 only doubles the precision. You must make a decision
about your sample size based on factors such as: time available, budget and
necessary degree of precision.
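The diminishing return on sample size can be illustrated with the usual margin-of-error approximation for a proportion. This is a sketch under stated assumptions that the text itself does not spell out: a simple random sample, a 95% confidence level (z = 1.96), the worst case p = 0.5, and a population much larger than the sample.

```python
import math


def margin_of_error(n, p=0.5, z=1.96):
    """Approximate margin of error for a proportion from a simple random sample of size n."""
    return z * math.sqrt(p * (1 - p) / n)


for n in (250, 1000, 4000):
    print(n, round(100 * margin_of_error(n), 1), "percentage points")

# Quadrupling the sample only halves the margin of error
print(margin_of_error(250) / margin_of_error(1000))
```

Because the margin of error shrinks with the square root of the sample size, each further doubling of precision costs four times as many interviews.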

Avoiding a Biased Sample

A biased sample will produce biased results. Totally excluding all bias is
almost impossible; however, if you recognize bias exists you can intuitively
discount some of the answers. The following list shows a few examples of
biased samples.

Sample               Probable Bias      Reason

Your customers       Favorable          They would not be your customers if they
                                        were unhappy, but it is important to know
                                        what keeps them happy.

Your ex-customers    Unfavorable        If they were happy they would not be
                                        ex-customers, but it is important to know
                                        why they left you.

"Phone in" polls     Extreme Views      Only people with a strong interest in a
                                        subject (either for or against) are likely
                                        to call in - and they may do so several
                                        times to load the vote.

Daytime interviews   Non-working        Most people who are at home during the day
                                        do not work. Their opinions may not reflect
                                        the working population.

Internet             Atypical People    Limited to people with Internet access.
                                        Internet users are not representative of
                                        the general population, even when matched
                                        on age, gender, etc. This can be a serious
                                        problem, unless you are only interested in
                                        people who have Internet access.

The consequences of a source of bias depend on the nature of the survey.
For example, a survey for a product aimed at retirees will not be as biased
by daytime interviews as will a general public opinion survey. A survey of
possible Internet products can safely ignore people who are not on the
Internet.

Quotas

A quota is a sample size for a sub-group. It is sometimes useful to establish
quotas to ensure that your sample accurately reflects relevant sub-groups
in your target population. For example, men and women have somewhat
different opinions in many areas. If you want your survey to accurately
reflect the general population's opinions, you will want to ensure that the
percentages of men and women in your sample reflect their percentages of
the general population.

If you are interviewing users of a particular type of product, you probably
want to ensure that users of the different current brands are represented in
proportions that approximate the current market share. Alternatively, you
may want to ensure that you have enough users of each brand to be able to
analyze the users of each brand as a separate group.
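Proportional quotas of this kind are simple to work out. The sketch below is one possible approach, not a prescribed method: the brand names and market shares are hypothetical, shares are given as whole-number percentages, and any interviews left over after rounding down go to the groups with the largest fractional parts.

```python
def allocate_quotas(total_interviews, shares_pct):
    """Split a total sample across sub-groups in proportion to whole-number percentage shares."""
    if sum(shares_pct.values()) != 100:
        raise ValueError("percentage shares must add up to 100")
    # Whole interviews each group is certainly owed
    quotas = {g: total_interviews * pct // 100 for g, pct in shares_pct.items()}
    # Hand any leftover interviews to the groups with the largest fractional parts
    leftover = total_interviews - sum(quotas.values())
    by_remainder = sorted(shares_pct,
                          key=lambda g: total_interviews * shares_pct[g] % 100,
                          reverse=True)
    for g in by_remainder[:leftover]:
        quotas[g] += 1
    return quotas


# Hypothetical market shares, for illustration only
market_share_pct = {"Brand A": 50, "Brand B": 30, "Brand C": 20}
quotas = allocate_quotas(400, market_share_pct)
print(quotas)
```

If the aim is instead to analyze each brand as a separate group, a fixed minimum per brand would replace the proportional split.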

Interviewing Methods

Once you have decided on your sample you must decide on the method of
data collection. Each method has advantages and disadvantages.

Personal Interviews
An interview is called personal when the Interviewer asks the questions
face-to-face with the Interviewee. Personal interviews can take place in the
home, at a shopping mall, on the street, outside a movie theater or polling
place, and so on.

Advantages

• The ability to let the Interviewee see, feel and/or taste a product.
• The ability to find the target population. For example, you can find people who have
seen a film more easily outside a theater in which it is playing than by calling phone
numbers at random.
• Longer interviews are sometimes tolerated, particularly with in-home interviews that
have been arranged in advance. People may be willing to talk longer face-to-face to a
person than to someone on the phone.

Disadvantages

• Personal interviews usually cost more per interview than other methods. This is
particularly true of in-home interviews, where travel time is a major factor.
• Each mall has its own characteristics. It draws its clientele from a specific geographic
area surrounding it, and its shop profile also influences the type of client. These
characteristics may differ from the target population and create a non-representative
sample.

Telephone Surveys

Surveying by telephone is the most popular interviewing method in the USA. This is
made possible by nearly universal coverage (96% of homes have a telephone).

Advantages

• People can usually be contacted faster over the telephone than with other methods. If
the Interviewers are using CATI (computer-assisted telephone interviewing), the
results can be available minutes after completing the last interview.
• You can dial random telephone numbers when you do not have the actual telephone
numbers of potential respondents.
• If you are using computer-assisted interviewing, The Survey System's optional
Interviewing Module (see Chapter 11 in the Main Manual) helps automatically ensure
that questions are skipped when they should be, can check the logical consistency of
answers and can present questions or answers in a random order (the last two are
sometimes important for reasons that are described later).

Disadvantages

• Many telemarketers have given legitimate research a bad name by claiming to be
doing research when they start a sales call. Consequently, many people are reluctant
to answer phone interviews and use their answering machines to screen calls. Since
over half of the homes in the USA have answering machines, this problem is getting
worse.
• The growing number of working women often means that no one is home during the
day. This limits calling time to a "window" of about 6-9 p.m. when you can be sure to
interrupt dinner or a favourite TV program.
• You cannot show or sample products by phone.

Mail Surveys

Advantages

• Mail surveys are among the least expensive.
• This is the only kind of survey you can do if you have the names and addresses of the
target population, but not their telephone numbers.
• The questionnaire can include pictures - something that is not possible over the
phone.
• Mail surveys allow the respondent to answer at their leisure, rather than at the often
inconvenient moment they are contacted for a phone or personal interview. For this
reason, they are not considered as intrusive as other kinds of interviews.

Disadvantages

• Time! Mail surveys take longer than other kinds. You will need to wait several
weeks after mailing out questionnaires before you can be sure that you have gotten
most of the responses.
• In populations of lower educational and literacy levels, response rates to mail surveys
are often too small to be useful. This, in effect, eliminates many immigrant
populations that form substantial markets in many areas. Even in well-educated
populations, response rates vary from as low as 3% up to 90%. As a rule of thumb,
the best response levels are achieved from highly-educated people and people with a
particular interest in the subject (which, depending on your target population, could
lead to a biased sample).

One way of improving response rates to mail surveys is to mail a postcard telling your
sample to watch out for a questionnaire in the next week or two. You can also follow up a
questionnaire mailing after a couple of weeks with another card asking them to return the
questionnaire. The downside is that this doubles or triples your mailing cost. If you have
purchased a mailing list from a supplier you may also have to pay a second (and third) use fee -
you often cannot buy the list once and re-use it.

Another way to increase responses to mail surveys is to use an incentive. One possibility is to
send a dollar bill along with the survey (or offer to donate the dollar to a charity specified by the
respondent.) Another is to include the people who return completed surveys in a drawing for a
prize. A third is to offer a copy of the (non-confidential) result highlights to those who complete
the questionnaire. Any of these techniques will increase the response rates.

Remember that if you want a sample of 1,000 people, and you estimate a 10% response
level, you need to mail 10,000 questionnaires. You may want to check with your local post office
about bulk mail rates - you can save on postage using this mailing method. However, many
researchers do not use bulk mail, because many people associate "bulk" with "junk" and will
throw it out without opening the envelope, lowering your response rate.
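The arithmetic behind this rule is just the target divided by the expected response rate, rounded up. A minimal sketch, with a hypothetical function name and the response rate given in percent:

```python
import math


def questionnaires_to_mail(target_responses, response_rate_pct):
    """Number of questionnaires to mail so the expected returns reach the target."""
    if not 0 < response_rate_pct <= 100:
        raise ValueError("response rate must be between 0 and 100 percent")
    return math.ceil(target_responses * 100 / response_rate_pct)


print(questionnaires_to_mail(1000, 10))   # the example in the text
print(questionnaires_to_mail(1000, 3))    # a response rate at the low end quoted above
```

Because the estimate is so sensitive to the assumed response rate, it is worth running it for a pessimistic rate as well as the expected one before fixing a mailing budget.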

Computer Direct Interviews

These are interviews in which the Interviewees enter their own answers directly into a
computer. They can be used at malls, trade shows, offices, and so on. The Survey
System's optional Interviewing Module and Interview Stations can easily create
computer-direct interviews.

Advantages

• The virtual elimination of data entry and editing costs.
• You will get more accurate answers to sensitive questions. Recent studies of
potential blood donors have shown respondents were more likely to reveal
HIV-related risk factors to a computer screen than to either human interviewers or
paper questionnaires. The National Institute of Justice has also found that computer-
aided surveys among drug users get better results than personal interviews.
Employees are also more often willing to give more honest answers to a computer
than to a person or paper questionnaire.
• The elimination of interviewer bias. Different interviewers can ask questions in
different ways, leading to different results. The computer asks the questions the same
way every time.
• Ensuring skip patterns are accurately followed. The Survey System can ensure
people are not asked questions they should skip, based on their earlier answers.
These automatic skips are more accurate than relying on an Interviewer reading a
paper questionnaire.
• Response rates are usually higher. Computer-aided interviewing is still novel enough
that some people will answer a computer interview when they would not have
completed another kind of interview.

Disadvantages

• The Interviewees must have access to a computer or one must be provided for them.
• As with mail surveys, computer direct interviews may have serious response rate
problems in populations of lower educational and literacy levels. This method may
grow in importance as computer use increases.

E-mail Surveys

E-mail surveys are both very economical and very fast. More people have e-mail than
have full Internet access. This makes e-mail a better choice than a Web page survey
for some populations. On the other hand, e-mail surveys are limited to simple
questionnaires, whereas Web page surveys can include complex logic.

Advantages

• Speed. An e-mail questionnaire can gather several thousand responses within a day
or two.

• There is practically no cost involved once the set up has been completed.
• You can attach pictures and sound files.
• The novelty element of an e-mail survey often stimulates higher response levels than
ordinary “snail” mail surveys.

Disadvantages

• You must possess (or purchase) a list of e-mail addresses to mail to.
• Some people will respond several times or pass questionnaires along to friends to
answer. Many programs have no check to stop people from responding multiple
times and biasing the results.
• Many people dislike unsolicited e-mail even more than unsolicited regular mail. You
may want to send e-mail questionnaires only to people who expect to get mail from
you.
• You cannot use e-mail surveys to generalize findings to the whole population.
People who have e-mail are different from those who do not, even when matched on
demographic characteristics, such as age and gender.

Many e-mail programs are limited to plain ASCII text questionnaires and cannot show
pictures. E-mail questionnaires from The Survey System can attach graphic or sound files.

Although use of e-mail is growing very rapidly it is not universal - and is even less so outside the
USA (three-quarters of the world's e-mail traffic takes place within the USA). Many “average”
citizens still do not possess e-mail facilities. So e-mail surveys do not reflect the population as a
whole. At this stage they are probably best used in a corporate environment where e-mail is
much more common or when most members of the target population are known to have e-mail.

Internet/Intranet (Web Page) Surveys

Web surveys are rapidly gaining popularity. They have major speed and cost advantages,
but also major sampling limitations. These limitations make software selection especially
important and restrict the groups you can study using this technique.

Advantages

• Web page surveys are extremely fast. A questionnaire posted on a popular Website
can gather several thousand responses within a few hours. Many people who will
respond to an e-mail invitation to take a Web survey will do so the first day; most will
do so within a few days.
• There is practically no cost involved once the set up has been completed. Large
samples do not cost more than smaller ones (except for any cost to acquire the
sample).
• You can show pictures and play sounds.
• Web page questionnaires can use complex question skipping logic, randomizations
and other features not possible with paper questionnaires or most e-mail surveys.

• Web page questionnaires can use colours, fonts and other formatting options not
possible in most e-mail surveys.
• On average, people give longer answers to open-ended questions on Web page
questionnaires than they do on other kinds of self-administered surveys.

Disadvantages

• Current use of the Internet is far from universal. Internet surveys do not reflect the
population as a whole. This is true even if a sample of Internet users is selected to
match the general population in terms of age, gender and other demographics.
• People can easily quit in the middle of a questionnaire. They are not as likely to
complete a long questionnaire on the Web as they would be if talking with an
interviewer.
• Depending on your software, you may have no control over who replies: anyone
from Afghanistan to Zanzibar cruising that Web page may answer.
• There is often no control over people responding multiple times to bias the results.

At this stage we recommend using the Internet for surveys only when your target population
consists entirely of Internet users. Business-to- business research and employee attitude surveys
can often meet this requirement. Surveys of the general population usually will not.

In either case, be sure your survey software prevents people from completing more than
one questionnaire. You may also want to restrict access by requiring a password (The
Survey System’s Internet Module allows you to do this) or by putting the survey on a page that
can only be accessed directly (there are no links to it).

Scanning Questionnaires

Scanning questionnaires is a method of data collection that can be used with paper questionnaires
that have been administered in face-to-face interviews, mail surveys or surveys completed by an
interviewer over the telephone.

Advantages

• Scanning can be the fastest method of data entry for paper questionnaires.
• Scanning is more accurate than a person in reading a properly completed
questionnaire.

Disadvantages

• Scanning is best-suited to "check the box" type surveys and bar codes. Scanning
programs have various methods to deal with text responses, but all require additional
data entry time.
• Scanning is less forgiving (and so less accurate) than a person in reading a poorly marked
questionnaire.
• Scanning requires investment in additional hardware to do the actual scanning.

Summary of Survey Methods

Your choice of survey method will depend on several factors. These include:

• Speed: E-mail and Web page surveys are the fastest methods, followed
by telephone interviewing. Mail surveys are the slowest.

• Cost: Personal interviews are the most expensive, followed by
telephone, then mail. E-mail and Web page surveys are the least
expensive for large samples.

• Internet Usage: E-mail and Web page surveys offer significant
advantages, but you cannot generalize their results to
the population as a whole.

• Literacy Levels: Illiterate and less-educated people rarely respond to
mail surveys.

• Sensitive Questions: People are more likely to answer sensitive
questions when

Unit 5: MEASURES OF CENTRAL TENDENCY

These are statistics that describe the typical score in a data set, reflecting
how the data are similar. "Average" is a commonly used term; in statistics it
covers three different measures: the mean, the median and the mode (see the
Notes on Mean, Median, Mode and Standard Deviation later in this manual).

The appropriateness of which measure to use depends on the data type (see
below), and its use:

Data type         Appropriate average
                  Mode    Median    Mean
Nominal           Yes     No        No
Ordinal           Yes     Yes       No
Interval/Ratio    Yes     Yes       Yes

In addition it is good practice to quote one or more of these statistics with
the relevant measure of dispersion.

Central tendency - representative of all the values in the
sample.
1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 6
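
As a minimal sketch, the three averages for the thirteen sample values above can be checked with Python's standard statistics module. The variable name sample is chosen here for illustration only; any statistics package would give the same figures.

```python
import statistics

# The thirteen sample values listed above
sample = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 6]

mean = statistics.mean(sample)      # sum of the values / number of values
median = statistics.median(sample)  # middle value of the ordered data
mode = statistics.mode(sample)      # most frequently occurring value

print(f"mean = {mean:.2f}, median = {median}, mode = {mode}")
```

For this small sample the median and mode coincide, while the mean is pulled slightly lower by the smaller values.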

Which average best represents the waiting times data?

To a certain extent, the answer to this question depends on who is preparing
the statistics and who is reading them! Are your interests those of a manager
or of a user of services?

The sample data yielded average waiting times:

• mean = 5.25 months
• median = 4.50 months
• mode = 3 months

3 months is the most common length of waiting time, but this does not take
into account that more than half the sample waited longer than this. The
mode is also vulnerable to data collection errors. The median gives a better
representation, but still does not take into account the longer waiting times
(ie the outliers). Is this important? The mean waiting time of 5.25 months
does reflect the several longer waiting times. The mean is more useful, in
that many further statistical analyses use it.

Many basic statistics textbooks give more details, e.g. Campbell & Machin
(1999), Clegg (1982), Cobby & Gilchrist (2003).

Computer output

SPSS: Enter the data in a column, and choose:
Analyze > Descriptive Statistics > Frequencies.
You will need to click the Statistics button to select
the statistics you need (mean, median and mode).

Types of data

Think about any collected data that you have experience of; for example, weight,
sex, ethnicity, job grade, and consider their different attributes. These variables can
be described as categorical or quantitative. The table summarises data types and
their associated measurement level, plus some examples. It is important to
appreciate that appropriate methods for summary and display depend on the type
of data being used. This is also true for ensuring the appropriate statistical test is
employed.

Type of data         Level of measurement                          Examples

Categorical          Nominal                                       Eye colour, ethnicity, diagnosis
                     (no inherent order in categories)

                     Ordinal                                       Job grade, age groups
                     (categories have inherent order)

                     Binary                                        Gender
                     (2 categories – special case of above)

Quantitative         Discrete                                      Size of household (ratio)
(Interval/Ratio)     (usually whole numbers)
(NB units of
measurement used)    Continuous                                    Temperature °C/°F (no absolute zero) (interval)
                     (can, in theory, take any value in a          Height, age (ratio)
                     range, although necessarily recorded
                     to a predetermined degree of precision)

Illustrative example

Consider the following results recorded from a test for 10 students

Student      Mark out of 100%   Mark relative to 40% pass mark   Position    Result
             (Ratio)            (Interval)                       (Ordinal)   (Nominal)
Ahmed        56                 16                               6           Pass
Ben          48                 8                                7           Pass
Ceri         65                 25                               3           Pass
Desmond      73                 33                               2           Pass
Esme         62                 22                               4           Pass
Francesca    35                 -5                               9           Fail
Gemma        20                 -20                              10          Fail
Hannah       38                 -2                               8           Fail
Ian          58                 18                               5           Pass
Julie        82                 42                               1           Pass
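
The way one set of marks can be re-expressed at each level of measurement can be sketched in a few lines of Python. This is an illustration only: the names marks, PASS_MARK, relative, position and result are assumptions made for the sketch, and a mark of exactly 40 is assumed to count as a pass.

```python
# Marks out of 100 (ratio scale) for the ten students in the table above
marks = {
    "Ahmed": 56, "Ben": 48, "Ceri": 65, "Desmond": 73, "Esme": 62,
    "Francesca": 35, "Gemma": 20, "Hannah": 38, "Ian": 58, "Julie": 82,
}
PASS_MARK = 40  # assumption: a mark of exactly 40 counts as a pass

# Interval: distance from the pass mark (zero no longer means "no marks")
relative = {name: mark - PASS_MARK for name, mark in marks.items()}

# Ordinal: rank order only, 1 = highest mark
ranked = sorted(marks, key=marks.get, reverse=True)
position = {name: rank for rank, name in enumerate(ranked, start=1)}

# Nominal: a category label with no order implied
result = {name: "Pass" if mark >= PASS_MARK else "Fail"
          for name, mark in marks.items()}

for name in marks:
    print(f"{name:<10} {marks[name]:>3} {relative[name]:>4} "
          f"{position[name]:>3}  {result[name]}")
```

Each step discards information: the rank no longer shows how far apart two students are, and the pass/fail label no longer shows who did better.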

Data type         Appropriate average
                  Mode    Median    Mean
Nominal           Yes     No        No
Ordinal           Yes     Yes       No
Interval/Ratio    Yes     Yes       Yes

UNIT 5: MEASURES OF DISPERSION

These statistics describe how the data varies or is dispersed (spread out).

The two most commonly used measures of dispersion are the range and the
standard deviation. Rather than showing how data are similar, they show
how data differ (their variation, spread, or dispersion).

Quoting both a measure of central tendency and the relevant measure of
dispersion for one set of data gives a much better picture of the data than
quoting one alone.

Other measures of dispersion that may be encountered include the
interquartile range (IQR) and the semi-interquartile range (SIQR).

Standard Deviation

The standard deviation (SD) is usually quoted along with the mean. It is a
measure of the variability of the sample data around the mean.

For example, in a study comparing two groups of subjects in a trial, details
are given of the age of subjects in each group so that the extent to which the
groups are similar, and therefore comparable, can be determined.

                     Experimental group    Control group
                     n = 30                n = 30
Mean age (SD)        70 (5) yrs            71 (10) yrs

In this table (which is very typical of what might be seen in a research
paper) it can be observed that the mean age for each group is very similar,
but the standard deviation is quite different. This might suggest that the
groups are not as similar as the mean suggests, and the greater variability
in the control group might have an influence on the findings.

Interpreting the SD

The figure quoted for the SD represents one standard deviation from the
mean. One standard deviation is added to and subtracted (+/-) from the
mean to give a range within which, for many sets of data, about two-thirds
of subjects may fall.

                     Experimental group    Control group
                     n = 30                n = 30
Mean age (SD)        70 (5) yrs            71 (10) yrs
Mean age +/- 1 SD    65 to 75 yrs          61 to 81 yrs

It can be observed that the age range is much greater in the control group.
If the study purported to have a sample of elderly people, the age range in
the control group might contradict this, as clearly some are under 65 (and it
is very likely some are even younger than 61!). Ideally it would be more
informative to have the minimum and maximum values given.

Assuming the data are approximately normally distributed, 68.3% of the
data will lie within 1 SD of the mean, about
95% within 2 SDs, and virtually all the data (99.7%) within 3 SDs.
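
These percentages follow from the normal distribution itself and can be checked with a short sketch. It uses the error function from Python's standard math module; the function name coverage_within is an illustrative assumption, not part of any statistics package.

```python
import math

def coverage_within(k):
    """Proportion of a normal distribution lying within k SDs of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} SD: {coverage_within(k) * 100:.1f}%")
```

The figures apply only when the data are close to normally distributed; for skewed data the proportions can differ noticeably.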

The SD may be calculated manually from the formula below, but in practice the
computer does this (using software like SPSS, Minitab or Excel), as will any
calculator that has statistical functions.

$$SD = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

Where: SD = standard deviation
n = number in sample
$x_i$ = individual sample values
$\bar{x}$ (x "bar") = sample mean

Estimation

To obtain an accurate estimate of a population parameter, the sample must be
representative of the population. To avoid bias the sample items should be selected
from the population at random. This means that all members of the population
have an equal chance of being in the sample.

The precision of the estimate depends on the size of the sample. Clearly the larger
the sample the better the estimate will be. Precision is measured by calculating the
standard error of the estimate or a confidence interval (usually the 95% confidence
interval).

Worked example

Consider the following times (to the nearest hour) that 16 patients experience relief
from a migraine after taking a certain drug:

7 8 1 2 6 3 5 2 4 9 4 6 5 6 9 8

mean = 5.312 hours

standard deviation = 2.522 hours

Thus the estimate of the mean time for a patient to experience relief is 5.312 hours

The 95% confidence interval for the mean time to experience relief is
calculated to be 3.968 to 6.657 hours.

It can be said that there is a probability of 0.95 that the population mean
lies between 3.968 hours and 6.657 hours. This provides a clear idea of how
precisely the population mean has been estimated by these data.

The precision of estimates should always be reported alongside the estimate,
and sometimes authors quote the standard error rather than a confidence
interval.

Although 95% confidence intervals are most often reported, you will
sometimes see 99% confidence intervals, in which case the confidence
interval contains the population parameter with probability 0.99 and will,
consequently, be wider than the corresponding 95% confidence interval. To
calculate a 99% confidence interval, the factor 2 is replaced by 2.6.

Calculation of a 95% confidence interval for the mean

The 95% confidence interval for a mean is calculated (approximately) from:

Sample mean - 2 x (standard error of the mean) to sample mean + 2 x
(standard error of the mean),

where the standard error of the mean is $SE = SD / \sqrt{n}$.

The factor 2 varies according to the sample size but only varies from 2.201
to 1.960 for sample sizes greater than 10, so that 2 is an adequate
approximation in most cases. If you use a computer package that calculates
the confidence interval the exact factor will be used.

Since the sample mean and standard deviation are estimates of fixed (albeit
unknown) quantities the only way of affecting the confidence interval is by
altering the sample size, n. Increasing n will reduce the standard error of
the mean and thus the width of the interval. But notice that to halve the
width of the interval we have to quadruple the sample size (because of the
square root in the formula).
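
The calculation can be sketched in Python for the 16 relief times in the worked example. This is an illustration under stated assumptions: the function name confidence_interval is made up for the sketch, and the value 2.131 is the t-table factor for 15 degrees of freedom, typed in by hand rather than computed.

```python
import math
import statistics

# Relief times (hours) for the 16 patients in the worked example
times = [7, 8, 1, 2, 6, 3, 5, 2, 4, 9, 4, 6, 5, 6, 9, 8]

def confidence_interval(data, factor=2.0):
    """Return (lower, upper) as mean -/+ factor x standard error."""
    mean = statistics.mean(data)
    sd = statistics.stdev(data)        # sample SD, divisor n - 1
    se = sd / math.sqrt(len(data))     # standard error of the mean
    return mean - factor * se, mean + factor * se

approx = confidence_interval(times)          # rule-of-thumb factor of 2
exact = confidence_interval(times, 2.131)    # t value for 15 df, from tables

print("approximate 95% CI:", [round(x, 3) for x in approx])
print("t-based 95% CI:    ", [round(x, 3) for x in exact])

# With the same SD, a sample four times as large halves the standard error
sd = statistics.stdev(times)
se_ratio = (sd / math.sqrt(64)) / (sd / math.sqrt(16))
print("SE ratio for n = 64 versus n = 16:", se_ratio)
```

The factor of 2 gives a slightly narrower interval than the t-based factor, which is why a computer package reports marginally wider limits for a sample this small.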

Computer Output

Confidence intervals for the mean in Minitab

To obtain a confidence interval for a set of data in Minitab, click on Stat >
Basic Statistics > 1-Sample t…

Data from the worked example above, entered in column C1, gives the following output:

T Confidence Intervals

Variable    N     Mean     StDev    SE Mean    95.0 % CI
C1          16    5.312    2.522    0.631      ( 3.968, 6.657)

Changing the confidence level to 99.0 gives the following output:

T Confidence Intervals

Variable    N     Mean     StDev    SE Mean    99.0 % CI
C1          16    5.312    2.522    0.631      ( 3.454, 7.171)

Confidence intervals for the mean in SPSS

To obtain a confidence interval, click on Analyze > Compare Means > One-Sample T Test…
The output also includes a single-sample t-test.

Confidence intervals for the mean in Excel

Not easy and best to be avoided! But if you have to, the following is an
example of how to calculate an approximate lower 95% confidence limit of a
set of data in cells A1 to A16 using Excel’s CONFIDENCE function.

=AVERAGE(A1:A16)-CONFIDENCE(1-0.95,STDEV(A1:A16),COUNT(A1:A16))

The upper limit is

=AVERAGE(A1:A16)+CONFIDENCE(1-0.95,STDEV(A1:A16),COUNT(A1:A16))

Notes on Mean, Median, Mode and Standard Deviation

Mode, median and mean are all types of average.

The mean is what the non-statistician calls the average. It is determined by
adding all the values up and then dividing the total by the number of values
in the dataset. The mean is a useful measure, but it can be distorted
by outliers.

The mode is the most frequently occurring value in the dataset.

The median is the value in the middle of the ordered dataset, that is, the
value that divides the dataset into two halves.

UNIT 6: CHOOSING SIMPLE RANDOM SAMPLE

In doing a sample survey, it is standard practice to denote the number of units in the population by N, and
the number of units in the sample by n. If n = N, this is called a census and the normal
requirements of sampling do not apply. To select a random sample of size n from a population
of N, you will need to develop:

1. a list of all the N units of the population, numbered 1 to N
2. a mechanism to select n different numbers from the range 1 to N

Let’s assume that we want a sample of 15 from a population of N = 100. We could use a
standard random number table or one generated by the computer. The example below shows part of
a random number table.

25 85 52 40 80 50 80 78 58 42 11 31 85 77 77 25 16 08 54 37
58 73 38 58 78 92 12 38 43 41 31 77 97 30 33 45 00 17 60 35
66 04 44 17 00 38 61 37 54 84 38 54 05 96 18 96 20 83 65 29
96 22 27 19 23 83 09 18 22 67 17 31 63 08 80 18 68 08 47 88
83 86 48 37 00 91 51 91 62 88 04 62 12 46 51 12 55 22 43 34
23 34 45 56 18 56 34 78 59 90 67 56 65 43 23 12 11 01 23 45

Since the population has 100 units, two-digit numbers are enough (with 00 standing for unit 100). Let us assume
that we want to start with the fifth column of the second row, that is, with the number 78. We
could go vertically as well; it would not matter. Accordingly, we would select numbers, skipping any repeats, until we
reach the sample size of 15, to get:

78 92 12 38 43 41 31 77 97 30 33 45 00 17 60

If the population N was 1000, and we needed a sample size of 50, we would need to choose three-
digit numbers using the same process (with 000 standing for unit 1000). Starting with the fifth column of the second row and
reading the digits in groups of three, we would select numbers until we reach the sample size of 50, to get:

789 212 384 341 317 797 303 345 001 760 356 604 441 700 386 137 ….
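
The two-digit selection above, and the computer-generated alternative, can be sketched in Python. This is an illustration only: the function names sample_from_table and simple_random_sample are assumptions made for the sketch, and only the first three rows of the table are typed in.

```python
import random

# First three rows of the partial random number table above
TABLE = [
    "25 85 52 40 80 50 80 78 58 42 11 31 85 77 77 25 16 08 54 37",
    "58 73 38 58 78 92 12 38 43 41 31 77 97 30 33 45 00 17 60 35",
    "66 04 44 17 00 38 61 37 54 84 38 54 05 96 18 96 20 83 65 29",
]

def sample_from_table(rows, start_row, start_col, n):
    """Read two-digit numbers left to right from a start cell, skipping repeats."""
    cells = [cell for row in rows for cell in row.split()]
    width = len(rows[0].split())
    start = start_row * width + start_col
    chosen = []
    for cell in cells[start:]:
        if cell not in chosen:
            chosen.append(cell)
        if len(chosen) == n:
            break
    return chosen

def simple_random_sample(population_size, n, seed=None):
    """Draw n different unit numbers from 1..population_size by computer."""
    rng = random.Random(seed)
    return rng.sample(range(1, population_size + 1), n)

# Second row, fifth column (zero-based indices 1 and 4)
from_table = sample_from_table(TABLE, 1, 4, 15)
by_computer = simple_random_sample(100, 15, seed=1)

print(" ".join(from_table))
print(sorted(by_computer))
```

Sampling without replacement guarantees n different units, which is the same effect as skipping repeated numbers when reading from a printed table.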

Qualitative vs Quantitative data


Statistics is concerned with collecting, processing, analyzing and interpreting data obtained
through observations, interviews or experiments. Data can be either qualitative or quantitative.
The attributes being measured are referred to as random variables or variates. Quantitative variables are
usually discrete numbers or ranges that can be ordered
or placed in some logical sequence, such as 2, 4, 6 or tall, taller, tallest, whereas qualitative
variables usually define categories which are not necessarily quantifiable.

Data representation
Data can be represented in many forms; primary, first, secondary or tertiary levels. Here,
primary level refers to the presentation of raw or unorganized data, that is the exact way we get
the information. First level data will have undergone some re-arrangement or ordering, but no
real transformation. Secondary level data will have undergone some transformation or
processing. And tertiary level data have undergone substantive transformation.

The first level data usually puts the raw data in some order or sequence (in the case of
quantitative data) or in summary, group, cluster, or category (in the case of non-quantitative
data). With the primary (quantitative) data, we can derive plots (scatter plots as they are called);
and from the first and secondary level (quantitative) data we can derive graphs, tables, charts and
regressions. Using secondary level data we can derive predictive models.

Tables
This is a very practical and popular method of presenting data. Tables can be of single-vector,
multi-vector or matrix format, depending on the number of variables, the links, and their complexity.
Tables may be arranged horizontally or vertically, and they can get quite sophisticated, with
multi-level rows and columns of information.

Midterm marks for 28 students in Econ 1

Primary (Raw) data

89,60,92,74,76,65,77,83,87,62,85,64,79,77,96,80,70,85,80,81,82,81,86,71,90,87,71,72

First level data

Single vector:
60,62,64,65,70,71,71,72,74,76,77,77,79,80,80,81,81,82,83,85,85,86,87,87,89,90,92,96

Double vector:
60 62 64 65 70 71 71 72 74 76 77 77 79 80
80 81 81 82 83 85 85 86 87 87 89 90 92 96

Simple Table

Marks     Frequency
60-64     3
65-69     1
70-74     5
75-79     4
80-84     6
85-89     6
90-94     2
95-99     1
Total     28

Multi-Level Table

          Male             Female           Male             Female
Age       Marks   Freq     Marks   Freq     Marks   Freq     Marks   Freq
18-25     60-79   3        60-79   2        80-99   1        80-99   2
26-33     60-79   2        60-79   3        80-99   2        80-99   3
>33       60-79   2        60-79   2        80-99   1        80-99   1
Total     60-79   7        60-79   7        80-99   4        80-99   6
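
Moving from raw marks to ordered data and then to a grouped frequency table can be sketched in Python. The class width of 5 marks and the names ordered, WIDTH and counts are assumptions made for this illustration.

```python
from collections import Counter

# Raw midterm marks for the 28 students
marks = [89, 60, 92, 74, 76, 65, 77, 83, 87, 62, 85, 64, 79, 77,
         96, 80, 70, 85, 80, 81, 82, 81, 86, 71, 90, 87, 71, 72]

ordered = sorted(marks)  # first level: re-ordering only, no transformation

WIDTH = 5  # class width in marks (an assumption for this sketch)
# Secondary level: each mark is replaced by the lower limit of its class
counts = Counter((mark // WIDTH) * WIDTH for mark in marks)

for lower in range(60, 100, WIDTH):
    print(f"{lower}-{lower + WIDTH - 1}   {counts[lower]}")
print("Total  ", sum(counts.values()))
```

Counting by computer is a useful check on a hand-built table, since the class frequencies must add up to the number of students.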

Graphs
There are many graphical ways to represent data. The common ones are:
• the scatter plot (see figure A below)
• the line or curve diagram (see figure B below)
• the bar diagram or chart (see figure C below)
• the pie diagram or chart (see figure D below)

[Figure A: scatter plot. Figure B: line or curve diagram. Figure C: bar diagram or chart. Figure D: pie diagram or chart.]

UNIT 7: REVIEW OF BASIC STATISTICS AND STATISTICAL TOOLS

What is statistics?

Statistics is the process of collecting, presenting and interpreting numerical
data. Quantitative data analysis usually includes both descriptive
statistics and inferential statistics.

A. Descriptive statistics include measures of central tendency (averages -
mean, median and mode) and measures of variability about the average
(range and standard deviation). These give the reader a 'picture' of the data
collected and are used in research projects.

B. Inferential statistics are the outcomes of statistical tests, helping
managers to make deductions from the data collected, to test the hypotheses
set and to relate findings to a population.

What are descriptive (summary) statistics?

Quantitative research may well generate masses of data. For example, a
comparatively small study that distributes 200 questionnaires with maybe 20
items on each can generate potentially 4000 items of raw data.

To make sense of this data it needs to be summarised in some way, so that
the reader has an idea of the typical values in the data, and how these vary.
To do this researchers use descriptive or summary statistics: they describe
or summarise the data, so that the reader can construct a mental picture of
the data and the people, events or objects they relate to.

Types of descriptive statistics


All quantitative studies will have some descriptive statistics, as well as
frequency tables. For example, sample size, maximum and minimum values,
averages and measures of variation of the data about the average. In many
studies this is a first step, prior to more complex inferential analysis.

The two main types of descriptive statistics encountered in research papers
are:

(a) measures of central tendency (averages), and
(b) measures of dispersion.

The choice of which particular descriptive statistics to report will affect
the “picture” that is presented of the data, and there is the potential
to mislead.

UNIT 9: SURVEY DESIGN

[Figure: survey design flowchart with the stages Management Problem, Conceptual Problem, Identify Research Questions, List of All Information Required, Review Other Research Methods, Draft Questionnaire Design for Survey, Pilot Test, Final Design]

Open-ended questions: The interviewer asks questions without any prompting.
In a questionnaire, lines are provided for the respondent to write
answers.

Closed or pre-coded questions: The interviewer asks questions and provides
respondents with a range of answers to choose from. In a questionnaire,
respondents are asked to tick the appropriate boxes.

Combination approach: Use of both types of questions.

Formats for Measuring attitudes and opinions

• Open-ended or direct questions
• Checklist
• Ranking
• Likert Scales
• Attitude statements
• Semantic differentials

Likert scales: respondents are asked to indicate their agreement or
disagreement with a proposition, or the importance they attach to a factor, using
a standard set of responses.

Attitude statements: a means of exploring respondents' attitudes toward a
wide range of possibly complex issues. Here respondents are shown a series
of statements and asked to indicate, using a scale, the extent to which
they agree or disagree with them. Example: 5 = agree strongly; 4 = agree, etc.

Semantic differentials: respondents are offered pairs of contrasting descriptors
and asked to indicate how the concept being studied relates to the
descriptors.

a. Open-ended What attracted you to apply for this training course?

---------------------------------------------------------

---------------------------------------------------------

b. Checklist:
A. Good reputation
B. Easy access
C. Curriculum
D. Management pays fees
E. Easy parking

c. Ranking: Please rank the following items in terms of their importance to you.
Rank them from 1 for the most important to 5 for the least important.

                              Rank
A. Good reputation            ____
B. Easy access                ____
C. Curriculum                 ____
D. Management pays fees       ____
E. Easy parking               ____

d. Likert scales: How important was each of the following items in your decision to choose
this training course?

                              Very         Quite        Not very     Not at all
                              important    important    important    important
A. Good reputation            ____1        ____2        ____3        ____4
B. Easy access                ____1        ____2        ____3        ____4
C. Curriculum                 ____1        ____2        ____3        ____4
D. Management pays fees       ____1        ____2        ____3        ____4
E. Easy parking               ____1        ____2        ____3        ____4

e. Attitude statement: Please read the statements below and indicate your level of agreement.

                                        Agree                 No                     Disagree
                                        strongly    Agree     opinion    Disagree    strongly
The learning experience is more
important than the qualification
in education                            ____5       ____4     ____3      ____2       ____1

Graduate course fees are too high       ____5       ____4     ____3      ____2       ____1

f. Semantic differential: Please look at the list and tick the line where you think this
training course falls in relation to each factor listed.

Difficult       !______!______!______!______!  Easy
Irrelevant      !______!______!______!______!  Relevant
Professional    !______!______!______!______!  Unprofessional
Dull            !______!______!______!______!  Interesting

UNIT 10: Questionnaire Design

General Considerations

The first rule is to design the questionnaire to fit the medium. Phone
interviews cannot show pictures. Survey-by-mail respondents cannot ask,
“What exactly do you mean by that?” if they do not understand a question.
Intimate, personal questions are sometimes best handled by mail or
computer, where anonymity is most assured.

A mail survey will often not give the same answers as the same survey done
by phone or in person. If you used one method in the past and need to
compare results, stick to that method, unless there is a compelling reason to
change.

KISS - keep it short and simple. If you present a 20-page questionnaire,
most potential respondents will give up in horror before even starting. Ask
yourself what you will do with the information from each question. If you
cannot give yourself a satisfactory answer, leave it out. Avoid the
temptation to add a few more questions just because you are doing a
questionnaire anyway. If necessary, place your questions into three groups:
must know, useful to know and nice to know. Discard the last group, unless
the previous two groups are very short.

Start with an introduction or welcome message. In the case of mail
questionnaires, this message can be in a cover letter or on the questionnaire
form itself. If you are sending e-mails that ask people to take a Web page
survey, put your main introduction or welcome message in the e-mail.
When practical, state who you are and why you want the information in the
survey. A good introduction or welcome message will encourage people to
complete your questionnaire.

Allow a “Don't Know” or “Not Applicable” response to all questions, except to
those in which you are certain that all respondents will have a clear answer.
In most cases, these are wasted answers as far as the researcher is
concerned, but they are necessary alternatives to avoid frustrating respondents.
Sometimes “Don't Know” or “Not Applicable” will really
represent some respondents' most honest answers to some of your
questions.

Respondents who feel they are being coerced into giving an answer they do
not want to give often do not complete the questionnaire. For the same
reason, include “Other” or “None” whenever either of these is a logically
possible answer. When the answer choices are a list of possible opinions,
preferences or behaviours you should usually allow these answers.

On paper, computer direct and Internet surveys these four choices should
appear as appropriate. You may want to combine two or more of them into
one choice, if you have no interest in distinguishing between them. You will
rarely want to include “Don't Know,” “Not Applicable,” “Other” or “None” in
a list of choices being read over the telephone or in person, but you should
allow the interviewer the ability to accept them when given by respondents.

Question Types

Researchers use three basic types of questions: multiple choice, numeric
open end and text open end (sometimes called
"verbatims"). Examples of each kind of question follow:

Multiple Choice

1. Where do you work?
North ____
South ____
East ____
West ____

Numeric Open End

2. How much did you spend on Drugs this month? _______

Text Open End

3. How can the firm improve its working conditions?
__________________________________

__________________________________

Rating Scales

4. How would you rate this product?
Excellent ___
Good ___
Fair ___
Poor ___

5. On a scale where “10” means you have a great amount of interest in a
subject and “1” means you have none at all, how would you rate your interest in
each of the following topics?

Domestic politics ____
Foreign affairs ____
Science and Health ____
Business ____

Agreement Scale

6. How much do you agree with each of the following statements?

                                                     Strongly                          Strongly
                                                     Agree       Agree     Disagree    Disagree
My manager provides constructive criticism           _____       ____      ____        ____
Our medical plan provides adequate coverage          _____       ____      ____        ____
I would prefer to work longer hours on fewer days    _____       ____      ____        ____

Rating Scales and Agreement Scales are two common types of questions
that some researchers treat as multiple choice questions and others treat as
numeric open end questions. Questions 4 to 6 above are examples of these kinds of questions.

Question and Answer Choice Order

There are two broad issues to keep in mind when considering question and
answer choice order. One is how the question and answer choice order can
encourage people to complete your survey. The other issue is how the order
of questions or the order of answer choices could affect the results of your
survey.

Ideally, the early questions in a survey should be easy and pleasant to
answer. These kinds of questions encourage people to continue the survey.
In telephone or personal interviews they help build rapport with the
interviewer. Grouping together questions on the same topic also makes the
questionnaire easier to answer.

Whenever possible leave difficult or sensitive questions until near the end of
your survey. Any rapport that has been built up will make it more likely
people will answer these questions. If people quit at that point anyway, at
least they will have answered most of your questions.

Answer choice order can make individual questions easier or more difficult to
answer. Whenever there is a logical or natural order to answer choices, use
it. Always present agree-disagree choices in that order. Presenting them in
disagree-agree order will seem odd. For the same reason, positive to

negative and excellent to poor scales should be presented in those orders.
When using numeric rating scales higher numbers should mean a more
positive or more agreeing answer.

Question order can affect the results in two ways. One is that mentioning
something (an idea, an issue, a brand) in one question can make people
think of it while they answer a later question, when they might not have
thought of it if it had not been previously mentioned.

The other way question order can affect results is habituation. This problem
applies to a series of questions that all have the same answer choices. It
means that some people will usually start giving the same answer, without
really considering it, after being asked a series of similar questions. People
tend to think more when asked the earlier questions in the series and so
give more accurate answers to them.

One way to reduce this problem is to ask only a short series of similar
questions at a particular point in the questionnaire. Then ask one or more
different kinds of questions, and then another short series, if needed.

Another way to reduce habituation is to change the “positive” answer. This
applies mainly to level-of-agreement questions. You can word some
statements so that a high level of agreement means satisfaction (e.g., “My
supervisor gives me positive feedback”) and others so that a high level of
agreement means dissatisfaction (e.g., “My supervisor usually ignores my
suggestions”). This technique forces the respondent to think more about
each question.

One negative aspect of this technique is that you will usually have to do
Data Transformations on some of the questions after the results are entered,
because having the higher levels of agreement always mean a positive (or
negative) answer makes the analysis much easier. However, the few
minutes extra work may be a worthwhile price to pay to get more accurate
data.
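
The kind of data transformation involved is usually a simple reversal of the scores on the negatively worded statements. A minimal sketch in Python follows, assuming a hypothetical 5-point agreement scale; the item names and the function reverse_score are made up for illustration.

```python
# Hypothetical 5-point agreement scale: 5 = agree strongly ... 1 = disagree strongly
SCALE_MIN, SCALE_MAX = 1, 5

def reverse_score(score):
    """Flip a rating so that 5 becomes 1, 4 becomes 2, and so on."""
    return SCALE_MAX + SCALE_MIN - score

# One respondent's raw answers; the item names are made up for illustration
answers = {"positive_feedback": 4, "ignores_suggestions": 2}
negatively_worded = {"ignores_suggestions"}

# After recoding, a high score means satisfaction on every item
recoded = {
    item: reverse_score(score) if item in negatively_worded else score
    for item, score in answers.items()
}
print(recoded)
```

Here disagreeing with “My supervisor usually ignores my suggestions” (a raw score of 2) is recoded to 4, so that it lines up with agreeing that the supervisor gives positive feedback.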

The order in which the answer choices are presented can also affect the
answers given. People tend to pick the choices nearest the start of a list
when they read the list themselves on paper or a computer screen. People
tend to pick the most recent answer when they hear a list of choices read to
them.

As mentioned previously, sometimes answer choices have a natural order
(e.g., Yes, followed by No; or Excellent - Good - Fair - Poor). If so, you
should use that order. At other times, questions have answers that are
obvious to the person that is answering them (e.g., “What brand(s) of car
do you own?”). In these cases, the order in which the answer choices are
presented is not likely to affect the answers given. However, there are kinds
of questions, particularly questions about preference or recall or questions
with relatively long answer choices that express an idea or opinion, in which
the answer choice order is more likely to affect which choice is picked.

Other Tips

Keep the questionnaire as short as possible. We mentioned this principle
before, but it is so important it is worth repeating. More people will complete
a shorter questionnaire, regardless of the interviewing method. If a question
is not necessary, do not include it.

Start with a Title (e.g., Leisure Activities Survey). Always include a short
introduction - who you are and why you are doing the survey. It is often a
good idea to give the name of the research company rather than the client
(e.g., XYZ Research Agency rather than the manufacturer of the product/
service being surveyed).

Many firms create a separate research company name (even if it is only a
direct phone line to the research department) to disguise themselves. This is
to avoid possible bias, since people rarely like to criticize someone to their
face and are much more open to a third party.

Reassure your respondent that his or her responses will not be revealed to
your client, but only combined with many others to learn about overall
attitudes.

Include a cover letter with all mail surveys. A good cover letter will increase
the response rate. A bad one, or none at all, will reduce the response rate.
Include the information in the preceding two paragraphs and mention the
incentive (if any). Describe how to return the questionnaire. Include the
name and telephone number of someone the respondent can call if they
have any questions. Include instructions on how to complete the survey
itself.

Mail questionnaires should be numbered on each page and include the return
address on the questionnaire itself, because pages and envelopes can be
separated from each other. Envelopes should have return postage prepaid.
Using a postage stamp often increases response rates, but is expensive,
since you must stamp every envelope - not just the returned ones.

You may want to leave a space for the respondent to add their name and
title. Some people will put in their names, making it possible for you to re-
contact them for clarification or follow-up questions. Indicate that filling in
their name is optional. Do not have a space for a name, if the questions are
sensitive in nature. Some people would become suspicious and not complete
the survey.

If you hand out questionnaires on your premises, you obviously cannot
remain anonymous, but keep the bias problem in mind when you consider
the answers.

If the survey contains commercially sensitive material, ask a "security"
question up front to find whether the respondent or any member of his
family, household or any close friend works in the industry being surveyed.
If so, terminate the interview immediately. They (or family or friends) may
work for the company that commissioned the survey - or for a competitor.
In either case, they are not representative and should be eliminated. If they
work for a competitor, the nature of the questions may betray valuable
secrets.

The best way to ask security questions is in reverse (i.e., if you are
surveying for a pharmaceutical product, phrase the question as "We want to
interview people in certain industries - do you or any member of your
household work in the pharmaceutical industry?"). If the answer is "Yes",
thank the respondent and terminate the interview. Similarly, it is best to
eliminate people working in the advertising, market research or media
industries, since they may work with competing companies.

After the security question, start with general questions. If you want to limit
the survey to users of a particular product, you may want to disguise the
qualifying product. As a rule, start from general attitudes to the class of
products, through brand awareness, purchase patterns, specific product
usage to questions on specific problems (e.g., work from "What
types of coffee have you bought in the last three months?" to "Do you recall
seeing a special offer on your last purchase of Brand X coffee?"). If possible,
put the most important questions into the first half of the survey. If a person
gives up halfway through, at least you have the most important
information.

Make sure you include all the relevant alternatives as answer choices.
Leaving out a choice can give misleading results. For example, a number of
recent polls that ask Americans whether they support the death penalty, yes
or no, have found 70-75% of the respondents choosing "yes." But polls that
offer the choice between the death penalty and life in prison without the
possibility of parole show support for the death penalty at about 50-60%,
while polls that offer the alternatives of the death penalty or life in prison
without the possibility of parole, with the inmates working in prison to pay
restitution to their victims' families, have found support for the death
penalty closer to 30%.

So what is the true level of support for the death penalty? The lowest figure
is probably best, since it represents the percentage that favour that penalty
regardless of the alternative offered. The need to include all relevant
alternatives is not limited to political polls. You can get misleading data
anytime you leave out alternatives.

Do not put two questions into one. Avoid questions such as "Do you buy
frozen meat and frozen fish?" A "Yes" answer can mean the respondent
buys meat or fish or both. Similarly with a question such as "Have you ever
bought Product X and, if so, did you like it?" A "No" answer can mean "never
bought" or "bought and disliked." Be as specific as possible.

"Do you ever buy pasta?" can include someone who once bought some in
1990. It does not tell you whether the pasta was dried, frozen or canned and
may include someone who had pasta in a restaurant. It is better to say
"Have you bought pasta (other than in a restaurant) in the last three
months?" "If yes, was it frozen, canned or dried?" Few people can remember
what they bought more than three months ago unless it was a major
purchase such as an automobile or appliance.

The overriding consideration in questionnaire design is to make sure your
questions can accurately tell you what you want to learn. The way you
phrase a question can change the answers you get. Try to make sure the
wording does not favour one answer choice over another.

Avoid emotionally charged words or leading questions that point towards a
certain answer. You will get different answers from asking "What do you
think of the XYZ proposal?" than from "What do you think of the Republican
XYZ proposal?" The word "Republican" in the second question would cause
some people to favour or oppose the proposal based on their feelings about
Republicans, rather than about the proposal itself. It is very easy to create
bias in a questionnaire. This is another good reason to test it before going
ahead.

If you are comparing different products to find preferences, give each one a
neutral name or reference. Do not call one "A" and the second one "B." This
immediately brings images of A grades and B grades to mind, with the
former being seen as superior to the latter. It is better to give each a
"neutral" reference such as "M" or "N", which do not carry as strong a
quality difference image. If possible, just refer to the "first" product and the
"second" product.

Avoid technical terms and acronyms, unless you are absolutely sure that
respondents know what they mean. MHETEC, ADB, GPA (Grade Point Average) and
UNDP are all well-known acronyms to people in those particular fields, but
very few people would understand all of them. If you must use an acronym,
spell it out the first time it is used.

Make sure your questions accept all the possible answers. A question like
"Do you use regular or premium gas in your car?" does not cover all possible
answers. The owner may alternate between both types. The question also
ignores the possibility of diesel or electric-powered cars. A better way of
asking this question would be "Which type(s) of fuel do you use in your
cars?" The responses allowed might be:

o Regular gasoline
o Premium gasoline
o Diesel
o Other
o Do not have a car

If you want only one answer from each person, ensure that the options are
mutually exclusive. For example:

In which of the following do you live?


o a house
o an apartment
o the suburbs

This question ignores the possibility of someone living in a house or an
apartment in the suburbs.

Score or Scale questions (e.g., "If '5' means very good and '1' means very
poor, how would you rate this product?") are a particular problem. Researchers
are very divided on this issue. Many surveys use a ten-point scale, but
there is considerable evidence to suggest that anything over a five-point
scale is irrelevant. This depends partially on education. Among university
graduates a ten-point scale will work well. Among people with less than a
high school education five points is sufficient. In third world countries, a
three-point scale (good/acceptable/bad) is often all a respondent can
understand. Another problem is that you are assuming that the difference in
the factors is within the scale limits - you may have a five-point scale but in
a respondent's mind one factor may rate 10 points in comparison to the
others.

If you do use a rating scale be sure the labels are meaningful. For example:

What do you think about product X?


o It's the best on the market
o It's about average
o It's the worst on the market

A question phrased like the one above will force most answers into the
middle category, resulting in very little usable information.

If you have used a particular scale before and need to compare results, use
the same scale. Four on a five-point scale is not equivalent to eight on a
ten-point scale. Someone who rates an item "4" on a five-point scale might
rate that item anywhere between "6" and "9" on a ten-point scale.

Be aware of cultural factors. In the third world, respondents have a strong
tendency to exaggerate answers. Researchers are often perceived as being
government agents, with the power to punish or reward according to the
answer given (and this is sometimes true).

Accordingly they often give "correct" answers rather than what they really
believe. Even when the questions are not overtly political and deal purely
with commercial products or services, the desire not to disappoint important
visitors with answers that may be considered negative may lead to
exaggerated scores. Always discount "favorable" answers by a significant
factor in all cases. The desire to please is not limited to the third world.

In personal interviews it is vital for the Interviewer to have empathy with the
Interviewee. In general, Interviewers should try to "blend" with respondents
in terms of race, language, sex, age, etc. Choose your Interviewers
according to the likely respondents.

Leave your demographic questions (age, sex, income, education, etc.) until
the end of the questionnaire. By then the Interviewer should have built a
rapport with the Interviewee that will allow honest responses to such
personal questions. Mail questionnaires should do the same, although the
rapport must be built by good question design, rather than personality.

Exceptions are any demographic questions that qualify someone to be
included in the survey. For example, many researchers limit some surveys
to people in certain age groups. These questions must come near the
beginning.

Paper questionnaires requiring text answers should always leave sufficient
space for handwritten answers. Lines should be about half-an-inch (one cm.)
apart. The number of lines you should have depends on the question. Three
to five lines are average. Leave a space at the end of a questionnaire
entitled "Other Comments."

Sometimes respondents offer casual remarks that are worth their weight in
gold and cover some area you did not think of, but which respondents
consider critical. Many products have a wide range of secondary uses that
the manufacturer knows nothing about but which could provide a valuable
source of extra sales if approached properly. In one third world market, a
major factor in the sale of candles was the ability to use the spent wax as
floor polish - but the manufacturer only discovered this by a chance remark.

Always consider the layout of your questionnaire. This is especially important
on paper, computer direct and Internet surveys. You want to make it
attractive, easy to understand and easy to complete. If you are creating a
paper survey, you also want to make it easy for your data entry personnel.

Try to keep your answer spaces in a straight line, either horizontal or
vertical. A single answer choice on each line is best. Eye tracking studies
show the best place to use for answer spaces is the right hand edge of the
page. It is much easier for a field worker or respondent to follow a logical
flow across or down a page. Using the right edge is also easiest for data
entry.

Questions and answer choice grids, as in the second of the following
examples, are popular with many researchers. They can look attractive
and save paper, or computer screen space. They also can avoid a long
series of very repetitive question and answer choice lists. Unfortunately,
they also are a bit harder than the repeated lists for some people to
understand. As always, consider whom you are studying when you create
your questionnaire.

Look at the following layouts and decide which you would prefer to use:

Do you agree, disagree or have no opinion that this company has:

• A good vacation policy - agree/not sure/disagree.
• Good management feedback - agree/not sure/disagree.
• Good medical insurance - agree/not sure/disagree.
• High wages - agree/not sure/disagree.

An alternative layout is:

Do you agree, disagree or have no opinion that this company has:

                              Agree    Not Sure    Disagree
A good vacation policy          1          2           3
Good management feedback        1          2           3
Good medical insurance          1          2           3
High wages                      1          2           3

The second example shows the answer choices in neat columns and has
more space between the lines. It is easier to read. The numbers in the
second example will also speed data entry.

Surveys are a mixture of science and art, and a good researcher will save
their cost many times over by knowing how to ask the correct questions.

Pre-test Questionnaire

The last step in questionnaire design is to test a questionnaire with a small
number of interviews before conducting your main interviews. Ideally, you
should test the survey on the same kinds of people you will include in the
main study. If that is not possible, at least have a few people, other than the
question writer, try the questionnaire. This kind of test run can reveal
unanticipated problems with question wording, instructions to skip
questions, etc. It can also help you see whether the interviewees understand
your questions and give useful answers.

If you change any questions after a pre-test, you should not combine the
results from the pre-test with the results of post-test interviews. The Survey
System will invariably provide you with mathematically correct answers to
your questions, but choosing sensible questions and administering surveys
with sensitivity and common sense will improve the quality of your results
dramatically.

UNIT 9: REVIEW OF RESEARCH METHODS AND ETHICS

• Qualitative

• Experimental

• Observation

• Questionnaire-based survey

• Other methods

• Textual analysis: study content of annual reports, speeches, manuals, etc

• Longitudinal studies: the same sample of individuals/organizations is studied
periodically over a number of years. Longitudinal studies, though very
effective, are usually expensive because they require tracking the sample members
over several years. They are, however, ideal for studying organizational change and the
combined effects of social change, age and experience.

• Case studies: study of an example - a case - to understand more …

Note: Quantitative methods are not categorized as research methods. They are regarded as
analytic tools.

Research ethics

Ethical behaviour is critical in research. There are issues of plagiarism,
validation of data and sources, honesty in processing and reporting
results, protection and citation of sources. Recently, issues of ethics have
been highlighted in genetic, human, biological and social science
research.

Among the main issues are:

• Harm: risk to one's health or property
• Privacy - confidentiality
• Informed consent
• Competence
• Literature review

UNIT 10: FORMAT OF A RESEARCH PROPOSAL/PROJECT OR THESIS

1. Introduction
   A. The problem statement
   B. Rationale for the research
      • Statement of the research objective
   C. Hypothesis
   D. Definition of terms
   E. Summary, including restatement of the problem

2. Literature review (primary and secondary sources, relevance and dateline)
   A. Current status of the topic
   B. Relationship between literature and problem statement

3. Method
   A. Research design
   B. Data collection - sample, questionnaire
   C. Analysis of data
   D. Results display - tables, charts, graphics, etc.
   E. Discussion, inferences, implications, limitations
   F. Conclusions

Appendices
References

Evaluating a research study/project

Although each research project will have its own peculiarities, in general the headings below
should serve as a useful guide. Note that some headings may be combined or appear as
subheadings, depending on the importance attached to them.

1. Background to issues/problems
2. Identify Problem – problem statement clarity
3. Identify Purpose of research
4. Scope, limitations and risks associated with study
5. Identify Hypothesis
6. Identify Method/methodology/theoretic framework
7. Research strategy to obtain, express and process data and report results
8. Analyze and discuss results in relation to problem/hypothesis - confirmation/refutation –
limitations, risks
9. Conclusion and Recommendations
10. Review references & citations - primary and secondary sources, footnotes, – relevance
and dateline
11. Prepare General comments about the report

UNIT: REVIEW OF REGRESSION ANALYSIS

Regression
Regression is the study of the nature of the relationship between one variable (the dependent)
and another or others (the independent). The variable we are trying to predict is called the
dependent variable and is conventionally plotted on the y-axis. The independent variable is
plotted on the x-axis. If the relationship between the dependent and the independent variable
is linear, then we can represent its equation by:

$$\hat{y} = a + bX$$

where the example below uses n = 15 observations and
$\hat{y}$ is an estimate of the average value of Y corresponding to a given value of X
X is the actual value of the independent variable
a is a constant: an estimate of $\alpha$, the y-intercept of the regression line
b is also a constant: an estimate of $\beta$, the slope of the regression line

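The statistics necessary to find a and b are the variance of X and what we call the covariance
of X and Y, where n is the number of observations. The variance of X is:

$$s_x^2 = \frac{\sum X^2}{n} - \left(\frac{\sum X}{n}\right)^2$$

The covariance is symbolized as $s_{xy}$:

$$s_{xy} = \frac{\sum XY}{n} - \frac{\sum X \sum Y}{n \cdot n}$$

Then

$$b = \frac{\text{covariance}}{\text{variance}} = \frac{s_{xy}}{s_x^2} \quad \text{and} \quad a = \frac{1}{n}\left(\sum Y - b \sum X\right)$$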

  X ($)     Y       XY        X²       Y²

    100     9      900    10,000       81
    105     8      840    11,025       64
     90     5      450     8,100       25
     80     2      160     6,400        4
     80     4      320     6,400       16
     85     6      510     7,225       36
     87     4      348     7,569       16
     92     7      644     8,464       49
     90     6      540     8,100       36
     95     7      665     9,025       49
     93     5      465     8,649       25
     85     5      425     7,225       25
     85     4      340     7,225       16
     70     3      210     4,900        9
     85     3      255     7,225        9

∑ 1,322    78    7,072   117,532      460

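$$s_{xy} = \frac{\sum XY}{n} - \frac{\sum X \sum Y}{n \cdot n} = \frac{7{,}072}{15} - \frac{1{,}322 \times 78}{15 \times 15} = 13.17$$

$$s_x^2 = \frac{\sum X^2}{n} - \left(\frac{\sum X}{n}\right)^2 = \frac{117{,}532}{15} - \left(\frac{1{,}322}{15}\right)^2 = 67.98$$

Hence

$$b = \frac{13.17}{67.98} = 0.1937$$

and

$$a = \frac{1}{n}\left(\sum Y - b \sum X\right) = \frac{1}{15}\left(78 - 0.1937 \times 1{,}322\right) = -11.87$$

so that Y = 0.19X - 11.87. For example, when X = 75:

$$Y = (75 \times 0.1937) - 11.87 \approx 2.66$$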
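
The hand calculation above can be checked with a short script. The following is a minimal sketch in Python, not part of the original manual; the function name `fit_line` is an illustrative assumption. It computes the variance of X and the covariance of X and Y from the column totals, in the same way as the worked example.

```python
def fit_line(xs, ys):
    """Return (a, b) for the line Y = a + bX using the variance/covariance method."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    var_x = sum_x2 / n - (sum_x / n) ** 2            # variance of X
    cov_xy = sum_xy / n - (sum_x * sum_y) / (n * n)  # covariance of X and Y

    b = cov_xy / var_x           # slope
    a = (sum_y - b * sum_x) / n  # intercept
    return a, b


# The 15 observations from the table above
X = [100, 105, 90, 80, 80, 85, 87, 92, 90, 95, 93, 85, 85, 70, 85]
Y = [9, 8, 5, 2, 4, 6, 4, 7, 6, 7, 5, 5, 4, 3, 3]

a, b = fit_line(X, Y)
print(f"b = {b:.4f}, a = {a:.2f}")
print(f"Predicted Y at X = 75: {a + b * 75:.2f}")
```

Because the script carries full precision instead of rounding b before computing a, it gives an intercept of about -11.88 rather than the -11.87 obtained by hand; the prediction at X = 75 is essentially unchanged.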

Handling Time Series Data


In real life data is not static. We are frequently presented with data consisting of observations
taken at successive points in time. Such data are said to form a time series. These may be taken
at equal or irregular intervals. For example number of deaths per day; monthly export
production of diamonds; quarterly unemployment figures for Windhoek, etc.

Data of this form usually have some dependent relationship on prior observation or previous
data, and as such it may be possible to establish this dependence and use it to form the basis for
short term prediction of future behaviour or value. Observations in a time series can be
represented graphically in a line diagram where the observations are plotted against time
(on the horizontal axis).

Example: 1

Consider the import table for sweet crude oil into Namibia between 1974 and 2001.

Namibia imports of sweet crude oil, 1974-2001 (in '000 of barrels)

Year  Sweet crude    Year  Sweet crude    Year  Sweet crude    Year  Sweet crude

1974     6902        1981     7882        1988     8971        1995     6428
1975     7228        1982     7568        1989     8165        1996     7348
1976     6829        1983     7804        1990     8221        1997     6369
1977     6833        1984     9103        1991     8181        1998     6709
1978     6421        1985     8639        1992     8630        1999     7053
1979     7260        1986     8248        1993     9746        2000     6131
1980     7686        1987     8734        1994     7483        2001     5649

If the import manager asked you, as the research economist, to clarify the inherent pattern in
the country's import of sweet crude oil over time, what would you do, and why? You could do
this by using various systems of moving averages, whereby an observation is replaced by the
mean of a number of observations centred on the one in question. For example, if we take a
5-year moving average, the 1976 figure is replaced by the mean volume of imports for the years
1974 to 1978, i.e.

$$\frac{6902 + 7228 + 6829 + 6833 + 6421}{5} = 6842.6$$

Likewise, the 1977 figure is replaced by the mean of years 1975 to 1979, i.e.

$$\frac{7228 + 6829 + 6833 + 6421 + 7260}{5} = 6914.2$$

and so on.

[Charts: 5-year moving average, yearly imports and trend]
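
The moving-average calculation can be repeated for every year with a few lines of code. This is a minimal sketch in Python, assuming the import figures from the table in Example 1; the function name `centred_moving_average` is illustrative and does not come from the manual.

```python
def centred_moving_average(values, window=5):
    """Return the centred moving averages of `values` for an odd window length."""
    if window % 2 == 0:
        raise ValueError("window must be odd so each average is centred on one observation")
    half = window // 2
    return [
        sum(values[i - half:i + half + 1]) / window
        for i in range(half, len(values) - half)
    ]


years = list(range(1974, 2002))
imports = [
    6902, 7228, 6829, 6833, 6421, 7260, 7686,  # 1974-1980
    7882, 7568, 7804, 9103, 8639, 8248, 8734,  # 1981-1987
    8971, 8165, 8221, 8181, 8630, 9746, 7483,  # 1988-1994
    6428, 7348, 6369, 6709, 7053, 6131, 5649,  # 1995-2001
]

smoothed = centred_moving_average(imports, window=5)

# The first average is centred on 1976 and the last on 1999,
# because two years are lost at each end of the series.
for year, value in zip(years[2:-2], smoothed):
    print(year, round(value, 1))
```

A 5-year window on 28 yearly observations yields 24 smoothed values. A wider window smooths more strongly but loses more years at each end of the series.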
A linear regression line has the form:

Y = a + bX + e

where
Y = the dependent variable
X = the independent variable
a = the y-intercept (i.e. the value of the dependent variable (Y) when the independent
variable (X) is equal to zero)
b = the regression coefficient; it is the slope of the regression line, which
measures how much the dependent variable (Y) changes per unit change in the
independent variable (X):

$$b = \frac{\partial Y}{\partial X}$$

e = an error term, or average residual, that exists because regressions generally are
not perfect or reliable models

The values of a and b that satisfy the least squares principle are determined from the following
equations:

$$b = \frac{N \sum XY - \sum X \sum Y}{N \sum X^2 - \left(\sum X\right)^2}$$

$$a = \bar{Y} - b\bar{X} = \frac{\sum Y}{N} - b\,\frac{\sum X}{N}$$

      X      Y     XY     X²    X - X̄    Y - Ȳ

      5     10     50     25       6        8
    -10    -15    150    100      -9      -17
     10     15    150    100      11       13
      0      5      0      0       1        3
    -10     -5     50    100      -9       -7
∑    -5     10    400    325       0        0

$\bar{X} = -5/5 = -1$ and $\bar{Y} = 10/5 = 2$

$$b = \frac{N \sum XY - \sum X \sum Y}{N \sum X^2 - \left(\sum X\right)^2} = \frac{5(400) - (-5)(10)}{5(325) - (-5)^2} = \frac{2050}{1600} = 1.28$$

$$a = \bar{Y} - b\bar{X} = \frac{10}{5} - 1.28\left(\frac{-5}{5}\right) = 2 + 1.28 = 3.28$$

so that Y = 3.28 + 1.28X
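
The least squares formulas can be applied directly to the five observations in the table. The sketch below is in Python and is only an illustration; the function name `least_squares` is an assumption, not something defined in the manual. It also computes the residuals e = Y - (a + bX) to show that, for a fitted line with an intercept, they sum to zero.

```python
def least_squares(xs, ys):
    """Return (a, b) for Y = a + bX using the least squares formulas."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)  # slope
    a = sum_y / n - b * sum_x / n                                  # intercept
    return a, b


# The five observations from the table above
X = [5, -10, 10, 0, -10]
Y = [10, -15, 15, 5, -5]

a, b = least_squares(X, Y)
print(f"Y = {a:.2f} + {b:.2f}X")

# Residuals: the part of each Y that the fitted line does not explain
residuals = [y - (a + b * x) for x, y in zip(X, Y)]
print("Sum of residuals:", round(sum(residuals), 10))
```

The individual residuals are not zero, which is why the model includes the error term e, but they cancel out on average around the fitted line.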

Trend

To isolate the trend, we need to use a moving average of such period as to eliminate the seasonal
effects or to reformulate the table in such a way that we can do a regression analysis.
