Applied Social Research Methods
2nd Edition

Edited by
Leonard Bickman, Vanderbilt University
Debra J. Rog, Westat
All rights reserved. No part of this book may be reproduced or utilized in any form or by any
means, electronic or mechanical, including photocopying, recording, or by any information
storage and retrieval system, without permission in writing from the publisher.
For information:
SAGE Publications, Inc.
2455 Teller Road
Thousand Oaks, California 91320
E-mail: [email protected]

SAGE Publications India Pvt. Ltd.
B 1/I 1 Mohan Cooperative Industrial Area
Mathura Road, New Delhi 110 044
India
H62.H24534 2009
300.72 dc22    2008008495
Printed on acid-free paper
08 09 10 11 12 10 9 8 7 6 5 4 3 2 1
Contents
3. Practical Sampling
Gary T. Henry
6. Quasi-Experimentation
Melvin M. Mark and Charles S. Reichardt
Acknowledgments
The editors are grateful for the assistance of Peggy Westlake in managing the
complex process of developing and producing this Handbook.
Publisher's Acknowledgments
SAGE Publications gratefully acknowledges the contributions of the following
reviewers:
Introduction
Why a Handbook of Applied
Social Research Methods?
Leonard Bickman
Debra J. Rog
and models from the behavioral and organization sciences to help identify what is
going on in an organization and to help guide decisions based on this information.
In addition to reflecting any new developments that have occurred (such as the
technological changes noted above), other changes that have been made in this edi-
tion respond to comments made about the first edition, with an emphasis on
increasing the pedagogical quality of each of the chapters and the book as a whole.
In particular, the text has been made more classroom friendly with the inclusion
of discussion questions and exercises. The chapters also are current with new
research cited and improved examples of those methods. Overall, however, research
methods are not an area that is subject to rapid changes.
This version of the Handbook, like the first edition, presents the major method-
ological approaches to conducting applied social research that we believe need to be
in a researcher's repertoire. It serves as a handy reference guide, covering key yet often diverse themes and developments in applied social research. Each chapter summarizes and synthesizes the major topics and issues of its method from a broad perspective, while also pointing to additional resources for more in-depth treatment of any one topic or issue.
Applied social research methods span several substantive arenas, and the bound-
aries of application are not well-defined. The methods can be applied in educa-
tional settings, environmental settings, health settings, business settings, and so
forth. In addition, researchers conducting applied social research come from several
disciplinary backgrounds and orientations, including sociology, psychology, busi-
ness, political science, education, geography, and social work, to name a few.
Consequently, a range of research philosophies, designs, data collection methods,
analysis techniques, and reporting methods can be considered to be applied social
research. Applied research, because it consists of a diverse set of research strategies,
is difficult to define precisely and inclusively. It is probably most easily defined by
what it is not, thus distinguishing it from basic research. Therefore, we begin by
highlighting several differences between applied and basic research; we then present
some specific principles relevant to most of the approaches to applied social
research discussed in this Handbook.
Differences in Purpose
Knowledge Use Versus Knowledge Production. Applied research strives to improve
our understanding of a problem, with the intent of contributing to the solution
of that problem. The distinguishing feature of basic research, in contrast, is that it
is intended to expand knowledge (i.e., to identify universal principles that con-
tribute to our understanding of how the world operates). Thus, it is knowledge, as
an end in itself, that motivates basic research. Applied research also may result in
new knowledge, but often on a more limited basis defined by the nature of an
immediate problem. Although it may be hoped that basic research findings will
eventually be helpful in solving particular problems, such problem solving is not
the immediate or major goal of basic research.
Broad Versus Narrow Questions. The applied researcher is often faced with fuzzy
issues that have multiple, often broad research questions, and addresses them in a
messy or uncontrolled environment. For example, what is the effect of the provi-
sion of mental health services to people living with AIDS? What are the causes of
homelessness?
Even when the questions are well-defined, the applied environment is complex,
making it difficult for the researcher to eliminate competing explanations (e.g.,
events other than an intervention could be likely causes for changes in attitudes or
behavior). Obviously, in the example above, aspects of an individual's life other than mental health services received will affect that person's well-being. The number and
complexity of measurement tasks and dynamic real-world research settings pose
major challenges for applied researchers. They also often require that researchers
make conscious choices (trade-offs) about the relative importance of answering var-
ious questions and the degree of confidence necessary for each answer.
In contrast, basic research investigations are usually narrow in scope. Typically,
the basic researcher is investigating a very specific topic and a very tightly focused
question. For example, what is the effect of white noise on the short-term recall of
nonsense syllables? Or what is the effect of cocaine use on fine motor coordination?
The limited focus enables the researcher to concentrate on a single measurement
task and to use rigorous design approaches that allow for maximum control of
potentially confounding variables. In an experiment on the effects of white noise,
the laboratory setting enables the researcher to eliminate all other noise variables
from the environment, so that the focus can be exclusively on the effects of the vari-
able of interest, the white noise.
Practical Versus Statistical Significance. There are also differences between the ana-
lytic goals of applied research and those of basic research. Basic researchers gener-
ally are most concerned with determining whether or not an effect or causal
relationship exists, whether or not it is in the direction predicted, and whether or
not it is statistically significant. In applied research, both practical significance and
statistical significance are essential. Besides determining whether or not a causal
relationship exists and is statistically significant, applied researchers are interested
in knowing if the effects are of sufficient size to be meaningful in a particular con-
text. It is critical, therefore, that the applied researcher understands the level of out-
come that will be considered significant by key audiences and interest groups. For
example, what level of reduced drug use is considered a practically significant out-
come of a drug program? Is a 2% drop meaningful? Thus, besides establishing
whether the intervention has produced statistically significant results, applied
research has the added task of determining whether the level of outcome attained
is important or trivial.
Differences in Context
Open Versus Controlled Environment. The context of the research is a major factor
in accounting for the differences between applied research and basic research. As
noted earlier, applied research can be conducted in many diverse contexts, includ-
ing business settings, hospitals, schools, prisons, and communities. These settings,
and their corresponding characteristics, can pose quite different demands on
applied researchers. The applied researcher is more concerned about generalizabil-
ity of findings. Since application is a goal, it is important to know how dependent
the results of the study are on the particular environment in which it was tested. In
addition, lengthy negotiations are sometimes necessary for a researcher even to
obtain permission to access the data.
Basic research, in contrast, is typically conducted in universities or similar aca-
demic environments and is relatively isolated from the government or business
worlds. The environment is within the researcher's control and is subject to close
monitoring.
Client-Initiated Versus Researcher-Initiated. The applied researcher often receives
research questions from a client or research sponsor, and sometimes these ques-
tions are poorly framed and incompletely understood. Clients of applied social
research can include federal government agencies, state governments and legisla-
tures, local governments, government oversight agencies, professional or advo-
cacy groups, private research institutions, foundations, business corporations and
organizations, and service delivery agencies, among others. The client is often in
control, whether through a contractual relationship or by virtue of holding a
higher position within the researcher's place of employment (if the research is
being conducted internally). Typically, the applied researcher needs to negotiate
with the client about the project scope, cost, and deadlines. Based on these param-
eters, the researcher may need to make conscious trade-offs in selecting a research
approach that affects what questions will be addressed and how conclusively they
will be addressed.
University basic research, in contrast, is usually self-initiated, even when fund-
ing is obtained from sources outside the university environment, such as through
government grants. The idea for the study, the approach to executing it, and even
the timeline are generally determined by the researcher. The reality is that the basic
researcher, in comparison with the applied researcher, operates in an environment
with a great deal more flexibility, less need to let the research agenda be shaped by
project costs, and less time pressure to deliver results by a specified deadline. Basic
researchers sometimes can undertake multiyear incremental programs of research
intended to build theory systematically, often with supplemental funding and sup-
port from their universities.
Differences in Methods
External Versus Internal Validity. A key difference between applied research and
basic research is the relative emphasis on internal and external validity. Whereas
internal validity is essential to both types of research, external validity is much more
important to applied research. Indeed, the likelihood that applied research findings
will be used often depends on the researcher's ability to convince policymakers that
the results are applicable to their particular setting or problem. For example, the
results from a laboratory study of aggression using a bogus shock generator are not
as likely to be as convincing or as useful to policymakers who are confronting the
problem of violent crime as are the results of a well-designed survey describing the
types and incidence of crime experienced by inner-city residents.
The Construct of Effect Versus the Construct of Cause. Applied research concen-
trates on the construct of effect. It is especially critical that the outcome mea-
sures are valid; that is, that they accurately measure the variables of interest. Often, it
is important for researchers to measure multiple outcomes and to use multiple
measures to assess each construct fully. Mental health outcomes, for example,
may include measures of daily functioning, psychiatric status, and use of hospi-
talization. Moreover, measures of real-world outcomes often require more than
self-report and simple paper-and-pencil measures (e.g., self-report satisfaction
with participation in a program). If attempts are being made to address a social
problem, then real-world measures directly related to that problem are desirable.
For example, if one is studying the effects of a program designed to reduce inter-
group conflict and tension, then observations of the interactions among group
members will have more credibility than group members' responses to questions
about their attitudes toward other groups. In fact, there is much research evi-
dence in social psychology that demonstrates that attitudes and behavior often
do not relate.
Basic research, on the other hand, concentrates on the construct of cause. In lab-
oratory studies, the independent variable (cause) must be clearly explicated and not
confounded with any other variables. It is rare in applied research settings that con-
trol over an independent variable is so clear-cut. For example, in a study of the
effects of a treatment program for drug abusers, it is unlikely that the researcher can
isolate the aspects of the program that are responsible for the outcomes that result.
This is due to both the complexity of many social programs and the researcher's
inability in most circumstances to manipulate different program features to discern
different effects.
Multiple Versus Single Levels of Analysis. The applied researcher, in contrast to the
basic researcher, usually needs to examine a specific problem at more than one
level of analysis, not only studying the individual, but often larger groups, such as
organizations or even societies. For example, in one evaluation of a community
crime prevention project, the researcher not only examined individual attitudes
and perspectives but also measured the reactions of groups of neighbors and
neighborhoods to problems of crime. These added levels of analysis may require
that the researcher be conversant with concepts and research approaches found in
several disciplines, such as psychology, sociology, and political science, and that
he or she develop a multidisciplinary research team that can conduct the multi-
level inquiry.
Similarly, because applied researchers are often given multiple questions to
answer, because they must work in real-world settings, and because they often use
multiple measures of effects, they are more likely to use multiple research methods,
often including both quantitative and qualitative approaches. Although using mul-
tiple methods may be necessary to address multiple questions, it may also be a strat-
egy used to triangulate on a difficult problem from several directions, thus lending
additional confidence to the study results. Although it is desirable for researchers to
use experimental designs whenever possible, often the applied researcher is called
in after a program or intervention is in place, and consequently is precluded from
building random assignment into the allocation of program resources. Thus,
applied researchers often use quasi-experimental studies. The reverse, however, is
rarer; quasi-experimental designs are generally not found in the studies published
in basic research journals.
The Iterative Nature of Applied Research. In most applied research endeavors, the
research question, the focus of the effort, is rarely static. Rather, to maintain
the credibility, responsiveness, and quality of the research project, the researcher
must typically make a series of iterations within the research design. The iteration
is necessary not because of methodological inadequacies, but because of succes-
sive redefinitions of the applied problem as the project is being planned and
implemented. New knowledge is gained, unanticipated obstacles are encountered,
and contextual shifts take place that change the overall research situation and in
turn have effects on the research. The first chapter in this Handbook, by Bickman
and Rog, describes an iterative approach to planning applied research that con-
tinually revisits the research question as trade-offs in the design are made. In
Chapter 7, Maxwell also discusses the iterative, interactive nature of qualitative
research design, highlighting the unique relationships that occur in qualitative
research among the purposes of the research, the conceptual context, the ques-
tions, the methods, and validity.
Multiple Stakeholders. As noted earlier, applied research involves the efforts and
interests of multiple parties. Those interested in how a study gets conducted and its
results can include the research sponsor, individuals involved in the intervention or
program under study, the potential beneficiaries of the research (e.g., those who
could be affected by the results of the research), and potential users of the research
results (such as policymakers and business leaders). In some situations, the cooper-
ation of these parties is critical to the successful implementation of the project.
Usually, the involvement of these stakeholders helps ensure that the results of the research will be relevant, useful, and, it is hoped, actually used to address the problem that the research was intended to study.
Many of the contributors to this volume stress the importance of consulting and
involving stakeholders in various aspects of the research process. Bickman and Rog
describe the role of stakeholders throughout the planning of a study, from the spec-
ification of research questions to the choice of designs and design trade-offs.
Similarly, in Chapter 4, on planning ethically responsible research, Sieber empha-
sizes the importance of researchers attending to the interests and concerns of all
parties in the design stage of a study. Kane and Trochim, in Chapter 14, offer con-
cept mapping as a structured technique for engaging stakeholders in the decision
making and planning of research.
Ethical Concerns. Research ethics are important in all types of research, basic or
applied. When the research involves or affects human beings, the researcher must
attend to a set of ethical and legal principles and requirements that can ensure the
protection of the interests of all those involved. Ethical issues, as Boruch and col-
leagues note in Chapter 5, commonly arise in experimental studies when individu-
als are asked to be randomly assigned into either a treatment condition or a control
condition. However, ethical concerns are also raised in most studies in the develop-
ment of strategies for obtaining informed consent, protecting privacy, guaranteeing
anonymity, and/or ensuring confidentiality, and in developing research procedures
that are sensitive to and respectful of the specific needs of the population involved
in the research (see Sieber, Chapter 4; Fetterman, Chapter 17). As Sieber notes,
although attention to ethics is important to the conduct of all studies, the need for
ethical problem solving is particularly heightened when the researcher is dealing
with highly political and controversial social problems, in research that involves
vulnerable populations (e.g., individuals with AIDS), and in situations where stake-
holders have high stakes in the outcomes of the research.
Enhancing Validity. Applied research faces challenges that threaten the validity of
studies' results. Difficulties in mounting the most rigorous designs, in collecting
data from objective sources, and in designing studies that have universal generaliz-
ability require innovative strategies to ensure that the research continues to produce
valid results. Lipsey and Hurley, in Chapter 2, describe the link between internal
validity and statistical power and how good research practice can increase the sta-
tistical power of a study. In Chapter 6, Mark and Reichardt outline the threats to
validity that challenge experiments and quasi-experiments and various design
strategies for controlling these threats. Henry, in his discussion of sampling in
Chapter 3, focuses on external validity and the construction of samples that can
provide valid information about a broader population. Other contributors in Part
III (Fowler & Cosenza, Chapter 12; Lavrakas, Chapter 16; Mangione & Van Ness,
Chapter 15) focus on increasing construct validity through the improvement of the
design of individual questions and overall data collection tools, the training of data
collectors, and the review and analysis of data.
Conclusion
We hope that the contributions to this Handbook will help guide readers in select-
ing appropriate questions and procedures to use in applied research. Consistent
with a handbook approach, the chapters are not intended to provide the details
necessary for readers to use each method or to design comprehensive research;
rather, they are intended to provide the general guidance readers will need to
address each topic more fully. This Handbook should serve as an intelligent guide,
helping readers select the approaches, specific designs, and data collection proce-
dures that they can best use in applied social research.
PART I
Approaches to
Applied Research
The four chapters in this section describe the key elements and approaches
to designing and planning applied social research. The first chapter by
Bickman and Rog presents an overview of the design process. It stresses the
iterative nature of planning research as well as the multimethod approach.
Planning an applied research project usually requires a great deal of learning about
the context in which the study will take place as well as different stakeholder per-
spectives. It took one of the authors (L.B.) almost 2 years of a 6-year study to decide
on the final design. The authors stress the trade-offs that are involved in the design
phase as the investigator balances the needs for the research to be timely, credible,
within budget, and of high quality. The authors note that as researchers make trade-
offs in their research designs, they must continue to revisit the original research
questions to ensure either that they can still be answered given the changes in the
design or that they are revised to reflect what can be answered.
One of the aspects of planning applied research covered in Chapter 1, often over-
looked in teaching and in practice, is the need for researchers to make certain that the
resources necessary for implementing the research design are in place. These include
both human and material resources as well as other elements that can make or break
a study, such as site cooperation. Many applied research studies fail because the
assumed community resources never materialize. This chapter describes how to
develop both financial and time budgets and modify the study design as needed based
on what resources can be made available.
The next three chapters outline the principles of three major areas of design:
experimental designs, descriptive designs, and making sure that the design meets
ethical standards. In Chapter 2, Lipsey and Hurley highlight the importance of plan-
ning experiments with design sensitivity in mind. Design sensitivity, also referred to
as statistical power, is the ability to detect a difference between the treatment and control conditions.
CHAPTER 1
Leonard Bickman
Debra J. Rog
[Figure: the Planning and Execution stages of the research process]
Other types of applied research need to consider the interests and needs of the research sponsor, but no other area involves as wide a variety of participants (e.g., program staff, beneficiaries, and community stakeholders) in the planning stage as program evaluation does.
Stage I of the research process starts with the researcher's development of an
understanding of the relevant problem or societal issue. This process involves work-
ing with stakeholders to refine and revise study questions to make sure that
the questions can be addressed given the research conditions (e.g., time frame,
resources, and context) and can provide useful information. After developing poten-
tially researchable questions, the investigator then moves to Stage II, developing the
research design and plan. This phase involves several decisions and assessments,
including selecting a design and proposed data collection strategies.
As noted, the researcher needs to determine the resources necessary to conduct
the study, both in the consideration of which questions are researchable as well as
in making design and data collection decisions. This is an area where social science
academic education and experience are most often deficient, which is one reason why
academically oriented researchers may at times fail to deliver research products on
time and on budget.
Assessing the feasibility of conducting the study within the requisite time frame
and with available resources involves analyzing a series of trade-offs in the type of
design that can be employed, the data collection methods that can be implemented,
the size and nature of the sample that can be considered, and other planning deci-
sions. The researcher should discuss the full plan and analysis of any necessary
trade-offs with the research client or sponsor, and agreement should be reached on
its appropriateness.
As Figure 1.2 illustrates, the planning activities in Stage II often occur simulta-
neously, until a final research plan is developed. At any point in the Stage II process,
the researcher may find it necessary to revisit and revise earlier decisions, perhaps
even finding it necessary to return to Stage I and renegotiate the study questions or
timeline with the research client or funder. In fact, the researcher may find that the
design that has been developed does not, or cannot, answer the original questions.
The researcher needs to review and correct this discrepancy before moving on to
Stage III, either revising the questions to bring them in line with what can be done
with the design that has been developed or reconsidering the design trade-offs that
were made and whether they can be revised to be in line with the questions of interest. At times, this may mean increasing the resources available, changing the sample being considered, and making other decisions that can increase the ability of the design to address the questions of interest.

[Figure 1.2. Stage II planning activities: identify questions; refine/revise questions; determine trade-offs; inventory resources; assess feasibility; then proceed to execution]
Depending on the type of applied research effort, these decisions can be made either in tandem with a client or by the research investigator alone. Clearly,
involving stakeholders in the process can lengthen the planning process and at
some point, may not yield the optimal design from a research perspective. There
typically needs to be a balance in determining who needs to be consulted, for
what decisions, and when in the process. As described later in the chapter, the
researcher needs to have a clear plan and rationale for involving stakeholders in
various decisions. Strategies such as concept mapping (Kane & Trochim, Chapter 14)
provide a structured mechanism for obtaining input that can help in designing a
study. For some research efforts, such as program evaluation, collaboration and
consultation with key stakeholders can help improve the feasibility of a study and
may be important to improving the usefulness of the information (Rog, 1985).
For other research situations, however, there may be need for minimal involve-
ment of others to conduct an appropriate study. For example, if access or buy-in
is highly dependent on some of the stakeholders, then including them in all major
decisions may be wise. However, technical issues, such as which statistical tech-
niques to use, generally do not benefit from, or need, stakeholder involvement. In
addition, there may be situations in which the science collides with the prefer-
ences of a stakeholder. For example, a stakeholder may want the research done more quickly or with fewer participants. In cases such as these, it is critical for the researcher to provide persuasive information about the possible trade-offs of following the stakeholder's advice, such as reducing the ability to find an effect if one is actually present (that is, lowering statistical power). Applied researchers often
find themselves educating stakeholders about the possible trade-offs that could
be made. The researcher will sometimes need to persuade stakeholders to think
about the problem in a new way or demonstrate the difficulties in implementing
the original design.
The culmination of Stage II is a comprehensively planned applied research proj-
ect, ready for full-scale implementation. With sufficient planning completed at this
point, the odds of a successful study are significantly improved, but far from guar-
anteed. As discussed later in this chapter, conducting pilot and feasibility studies
continues to increase the odds that a study can be successfully mounted.
In the sections to follow, we outline the key activities that need to be conducted
in Stage I of the planning process, followed by highlighting the key features that
need to be considered in choosing a design (Stage II), and the variety of designs
available for different applied research situations. We then go into greater depth
on various aspects of the design process, including selecting the data collection
methods and approach, determining the resources needed, and assessing the
research focus.
Developing a Consensus on
the Nature of the Research Problem
Before an applied research study can even begin to be designed, there has to be
a clear and comprehensive understanding of the nature of the problem being
addressed. For example, if the study is focused on evaluating a program for home-
less families being conducted in Georgia, the researcher should know what research
and other available information has been developed about the needs and charac-
teristics of homeless families in general and specifically in Georgia; what evidence
base exists, if any, for the type of program being tested in this study; and so forth.
In addition, if the study is being requested by an outside sponsor, it is important to
have an understanding of the impetus of the study and what information is desired
to inform decision making.
Strategies that can be used in gathering the needed information include the
following:
& Rog, 1986). The questions to be addressed by an applied study tend to be posed
by individuals other than the researcher, often by nontechnical persons in non-
technical language.
Therefore, one of the first activities in applied research is working with the study
clients to develop a common understanding of the research agenda: the research
questions. Phrasing study objectives as questions is desirable in that it leads to more
clearly focused discussion of the type of information needed. It also makes it more
likely that key terms (e.g., welfare dependency, drug use) will be operationalized
and clearly defined. Using logic models also helps focus the questions on what is expected from the program and move toward measurable variables for studying both the process of an intervention or program and its expected outcomes. Later,
after additional information has been gathered and reviewed, the parties will need
to reconsider whether these questions are the right questions and whether it is
possible, with a reasonable degree of confidence, to obtain answers for these ques-
tions within the available resource and time constraints.
Once a conceptual framework has been agreed on, the researcher can further refine the study questions, grouping them and identifying which are primary and which are secondary. Areas that need clarification include the time frame of the data
collection (i.e., Will it be a cross-sectional study or one that will track individuals
or cohorts over time; how long will the follow-up period be?); how much the client
wants to generalize (e.g., Is the study interested in providing outcome information
on all homeless families that could be served in the program or only those families
with disabilities?); how certain the client wants the answers to be (i.e., How pre-
cise and definitive should the data collected be to inform the decisions?); and what
subgroups the client wants to know about (e.g., Is the study to provide findings on
homeless families in general only or is there interest in outcomes for subgroups of
families, such as those who are homeless for the first time, those who are homeless
more than once but for short durations, and those who are chronically home-
less?). The level of specificity should be very high at this point, enabling a clear
agreement on what information will be produced. As the next section suggests,
these discussions between the researcher and the research client often take on the flavor of a negotiation.
research questions need to be refined. The researcher and client then typically
discuss the research approaches under consideration to answer these questions as
well as the study limitations. This gives the researcher an opportunity to introduce
constraints into the discussion regarding available resources, time frames, and any
trade-offs contemplated regarding the likely precision and conclusiveness of
answers to the questions.
In most cases, clients want sound, well-executed research and are sympathetic to
researchers' need to preserve the integrity of the research. Some clients, however,
have clear political, organizational, or personal agendas, and will push researchers
to provide results in unrealistically short time frames or to produce results sup-
porting particular positions. Other times, the subject of the study itself may gener-
ate controversy, a situation that requires the researcher to take extreme care to
preserve the neutrality and credibility of the study. Several of the strategies dis-
cussed later attempt to balance client and researcher needs in a responsible fashion;
others concentrate on opening research discussions up to other parties (e.g., advi-
sory groups). In the earliest stages of research planning, it is possible to initiate
many of these kinds of activities, thereby bolstering the study's credibility, and often
its feasibility.
Under most circumstances, the simple pre-post design should not be used if the
purpose of the study is to draw causal conclusions.
Usefulness refers to whether the design is appropriately targeted to answer the
specific questions of interest. A sound study is of little use if it provides definitive
answers to the wrong questions. Feasibility refers to whether the research design can
be executed, given the requisite time and other resource constraints. All three
factors (credibility, usefulness, and feasibility) must be considered to conduct
high-quality applied research.
Design Dimensions
Maximizing Validity
In most instances, a credible research design is one that maximizes validity: it
provides a clear explanation of the phenomenon under study and controls all plau-
sible biases or confounds that could cloud or distort the research findings. Four
types of validity are typically considered in the design of applied research
(Bickman, 1989; Shadish, Cook, & Campbell, 2002).
Internal validity: the extent to which causal conclusions can be drawn or the
degree of certainty that A caused B, where A is the independent variable
(or program) and B is the dependent variable (or outcome).
External validity: the extent to which it is possible to generalize from the data
and context of the research study to other populations, times, and settings
(especially those specified in the statement of the original problem/issue).
Construct validity: the extent to which the constructs in the conceptual
framework are successfully operationalized (e.g., measured or implemented)
in the research study. For example, does the program as actually implemented
accurately represent the program concept and do the outcome measures
accurately represent the outcome? Programs change over time, especially if
fidelity to the program model or theory is not monitored.
Statistical conclusion validity: the extent to which the study has used appro-
priate sample size, measures, and statistical methods to enable it to detect the
effects if they are present. This is also related to statistical power.
All types of validity are important in applied research, but the relative emphases
may vary, depending on the type of question under study. With questions dealing
with the effectiveness of an intervention or impact, for example, more emphasis
should be placed on internal and statistical conclusion validity than on external
validity. The researcher of such a study is primarily concerned with finding any evi-
dence that a causal relationship exists and is typically less concerned (at least ini-
tially) about the transferability of that effect to other locations or populations. For
descriptive questions, external and construct validity may receive greater emphasis.
Here, the researcher may consider the first priority to be developing a comprehen-
sive and rich picture of a phenomenon. The need to make cause-effect attributions
is not relevant. Construct validity, however, is almost always relevant.
Outlining Comparisons
An integral part of design is identifying whether and what comparisons can be
made; that is, which variables must be measured and compared with other variables
or with themselves over time. In simple descriptive studies, there are decisions to be
made regarding the time frame of an observation and how many observations are
needed. Typically, there is no explicit comparison in simple descriptive studies.
Normative studies are an extension of descriptive studies in that the interest is in com-
paring the descriptive information to some appropriate standard. The decision for
the researcher is to determine where that standard will be drawn from or how it will
be developed. In correlative studies, the design is again an extension of simple descrip-
tive work, with the difference that two or more descriptive measures are arrayed
against each other to determine whether they covary. Impact or outcome studies, by
far, demand the most judgment and background work. To make causal attributions
(X causes Y), we must be able to compare the condition of Y when X occurred with
what the condition of Y would have been without X. For example, to know if a drug
treatment program reduced drug use, we need to compare drug use among those who were in the program with drug use among those who did not participate.
Level of Analysis
Knowing what level of analysis is necessary is also critical to answering the
right question. For example, if we are conducting a study of drug use among high school students in Toledo, are we interested in drug use by individual students, in aggregate survey totals at the school level, in aggregate totals for the school district, or in totals for the city as a whole?
Correct identification of the proper level or unit of analysis has important impli-
cations for both data collection and analysis. The Stage I client discussions should
clarify the desired level of analysis. It is likely that the researcher will have to help
the client think through the implications of these decisions, providing information
about research options and the types of findings that would result. In addition, this
is an area that is likely to be revisited if initial plans to obtain data at one level (e.g.,
the individual student level) prove to be prohibitively expensive or unavailable. A
design fallback position may be to change to an aggregate analysis level (e.g., the
school), particularly if administrative data at this level are more readily available
and less costly to access.
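As a small, hypothetical illustration of such a fallback (the school names and the variable below are invented, and the pandas library is assumed only as a convenient tool), student-level records can be aggregated to the school level as follows:

    # Hypothetical student-level records aggregated to the school level.
    import pandas as pd

    students = pd.DataFrame({
        "school": ["Lincoln", "Lincoln", "Roosevelt", "Roosevelt", "Roosevelt"],
        "reported_drug_use": [0, 1, 1, 0, 1],  # 1 = any use reported
    })

    # Individual level of analysis: each row is one student.
    individual_rate = students["reported_drug_use"].mean()

    # Aggregate level of analysis: each school becomes a single case,
    # and within-school variation is no longer available for analysis.
    school_rates = students.groupby("school")["reported_drug_use"].mean()

    print(individual_rate)   # overall proportion across students
    print(school_rates)      # one proportion per school

The choice matters for analysis as well as data collection: with schools as the unit, the effective sample size becomes the number of schools rather than the number of students.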
Level of Precision
Knowing how precise an answer must be is also crucial to design decisions. The
level of desired precision may affect the rigor of the design. When sampling is used,
the level of desired precision also has important ramifications for how the sample is
drawn and the size of the sample used. In initial discussions, the researcher and the client should reach agreement on how precise the answers need to be.
Choosing a Design
There are three main categories of applied research designs: descriptive, exper-
imental, and quasi-experimental. In our experience, developing an applied
research design rarely allows for implementing a design straight from a textbook;
rather, the process more typically involves the development of a hybrid, reflecting
combinations of designs and other features that can respond to multiple study
questions, resource limitations, dynamics in the research context, and other con-
straints of the research situation (e.g., time deadlines). Thus, our intent here is to
provide the reader with the tools to shape the research approach to the unique
aspects of each situation. Those interested in more detailed discussion should
consult Mark and Reichardt's work on quasi-experimentation (Chapter 6) and Boruch and colleagues' chapter on randomized experiments (Chapter 5). In addi-
tion, our emphasis here is on quantitative designs; for more on qualitative
designs, readers should consult Maxwell (Chapter 7), Yin (Chapter 8), and
Fetterman (Chapter 17).
Key Features. Because the category of descriptive research is broad and encompasses
several different types of designs, one of the easiest ways to distinguish this class of
research from others is to identify what it is not: It is not designed to provide infor-
mation on cause-effect relationships.
Variations. There are only a few features of descriptive research that vary. These are the representativeness of the study data sources (e.g., the subjects/entities), that is, the manner in which the sources are selected (e.g., universe, random sample, stratified sample, nonprobability sample); the time frame of measurement, that is, whether the study is a one-shot, cross-sectional study or a longitudinal study; whether the study involves some basis for comparison (e.g., with a standard, another group or population, data from a previous time period); and whether the design is focused on a simple descriptive question, on a normative question, or on a correlative question.
Strengths. Exploratory descriptive studies can be low cost, relatively easy to imple-
ment, and able to yield results in a fairly short period of time. Some efforts, however,
such as those involving major surveys, may sometimes require extensive resources
and intensive measurement efforts. The costs depend on factors such as the size of
the sample, the nature of the data sources, and the complexity of the data collection
methods employed. Several chapters in this volume outline approaches to surveys,
including mail surveys (Mangione & Van Ness, Chapter 15), internet surveys (Best
& Harrison, Chapter 13), and telephone surveys (Lavrakas, Chapter 16).
Variations. The most basic experimental study is called a post-only design, in which
individuals are randomly assigned either to a treatment group or to a control group,
and the measurement of the effects of the treatment is conducted at a given period
following the administration of the treatment. There are several variations to this
simple experimental design that can respond to specific information needs as well as
provide control over possible confounds or influences that may exist. Among the
features that can be varied are the number and scheduling of posttest measurement
or observation periods, whether a preobservation is conducted, and the number of
treatment and control groups used. The post-only design is rarely used, however, because faulty random assignment may result in the control and treatment groups not being equivalent at the start of the study. Few researchers are so (over)confident in the implementation of a field randomized design that they would risk having the results interpreted as being caused by faulty implementation of the design.
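A minimal simulation sketch may help make the logic of the post-only design concrete (the data, group sizes, and effect below are invented for illustration; NumPy and scipy are assumed only as convenient tools):

    # Sketch of a post-only randomized design: random assignment to two
    # groups, a single posttest measurement, and a two-sample comparison.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)

    # Hypothetical pool of 200 participants randomly assigned to two groups.
    ids = rng.permutation(200)
    treatment_ids, control_ids = ids[:100], ids[100:]

    # Hypothetical posttest scores; the treatment group is simulated with a
    # mean about 0.4 standard deviations higher than the control group.
    treatment_scores = rng.normal(loc=52, scale=10, size=treatment_ids.size)
    control_scores = rng.normal(loc=48, scale=10, size=control_ids.size)

    # With successful random assignment, a simple comparison of posttest
    # means estimates the treatment effect.
    t_stat, p_value = stats.ttest_ind(treatment_scores, control_scores)
    print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

The sketch also shows why the design leans so heavily on the randomization itself: with no pretest, a failure of random assignment cannot be detected after the fact.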
Quasi-Experimental Designs
Description and Purpose. Quasi-experimental designs have the same primary purpose
as experimental studies: to test the existence of a causal relationship between two or
more variables. They are used when random assignment is not feasible or desired.
Variations. Quasi-experiments vary along several of the same dimensions that are
relevant for experiments. Overall, there are two main types of quasi-experiments:
those involving data collection from two or more nonequivalent groups and those
involving multiple observations over time. More specifically, quasi-experimental
designs can vary along the following dimensions: the number and scheduling of
pre- or postobservation periods; the nature of the observations, that is, whether the preobservation uses the same measurement procedure as the postobservation, or whether both are using measures that are proxies for the real concept; the manner
in which the treatment and comparison groups are determined; and whether the
treatment group serves as its own comparison group or a separate comparison
group or groups are used.
Some of the strongest time-series designs supplement a time series for the treat-
ment group with comparison time series for another group (or time period).
Another powerful variation occurs when the researcher is able to study the effects
of an intervention over time under circumstances where that intervention is both
initiated and later withdrawn. A third strong design is the regression discontinuity
design in which participants are assigned to a treatment or comparison group
based on a clearly designated pretest score. Although this design has been used in
clinical screening (e.g., CATS Consortium, 2007), it is rarely used as most studies
do not involve the use of a pretest score as a cutoff.
When to Use. A quasi-experimental design is not the method of choice but rather a
fallback strategy for situations in which random assignment is not feasible. Situations
such as these include when the nature of the independent variable precludes the use
of random assignment (e.g., exposure or involvement in a natural disaster); retro-
spective studies (e.g., the program is already well under way or over); studies focused
on economic or social conditions, such as unemployment; when randomization is
too expensive, not feasible to initiate, or impossible to monitor closely; when there are
obstacles to withholding the treatment or when it seems unethical to withhold it; and
when the timeline is tight and a quick decision is mandated.
Sources of Data
The researcher should identify the likely sources of data to address the research
questions. Data sources typically fall into one of two broad categories: primary and
secondary. Among the potential primary data sources that exist for the applied
researcher are people (e.g., community leaders, program participants, service
providers, the general public), independent descriptive observations of events and
activities, physical documents, and test results. These data are most often collected
by the investigator as part of the study through one or more methods (e.g., ques-
tionnaires, interviews, observations). Secondary sources can include administrative
records, management information systems, economic and social indicators, and
various types of documents (e.g., prior research studies, fugitive unpublished
research literature) (Gorard, 2002; Hofferth, 2005; Stewart & Kamins, 1993).
Typically the investigator does not collect these data but uses already existing
sources such as census data, program administrative records, and others. In recent
years, there has been an increasing emphasis on performance-monitoring systems
and the implementation of management information systems, especially in agen-
cies and organizations that receive government funding. These systems can often be
considered potential sources to tap in applied research projects depending on the
quality and completeness of the data collected (as discussed below).
Self-Report Data
When dealing with self-reported data, the researcher may ask individual
research participants to provide, to the best of their ability, information on the areas
of interest. These inquiries may be made through individual interviews, through
telephone or mail surveys, Web-based surveys, or through written corroboration or
affirmation. Self-report data may be biased if the questions deal with socially desir-
able behavior, thoughts, or attitudes. In general, people like to present themselves
in a positive way. Making the data collection anonymous may improve the accuracy
of these data, especially about sensitive topics. However, anonymous data can be
difficult, but not impossible, to use in the conduct of longitudinal studies.
Extant Databases
When dealing with extant data from archival sources, the researcher is generally
using the information for a purpose other than that for which it was originally
collected. There are several secondary data sources that are commonly used, such as
those developed by university consortia, federal sources such as the Bureau of the
Census, state and local sources such as Medicaid databases, and commercial sources
such as Inform, a database of 550 business journals.
Given the enormous amount of information routinely collected on individuals
in U.S. society, administrative databases are a potential bonanza for applied
researchers. More and more organizations, for example, are computerizing their
administrative data and archiving their full databases at least monthly. Management
information systems, in particular, are becoming more common in service settings
for programmatic and evaluation purposes as well as for financial disbursement
purposes.
Administrative data sets, however, have one drawback in common with databases of past research: they were originally constructed for operational purposes, not to meet the specific objectives of the researcher's task. When the data are to be
drawn from administrative databases, the researcher should ask the following ques-
tions: Are the records complete? Why were the data originally collected? Did the
database serve some hidden political purpose that could induce systematic distor-
tions? What procedures have been used to deal with missing data? Do the comput-
erized records bear a close resemblance to the original records? Are some data items
periodically updated or purged from the computer file? How were the data col-
lected and entered, and by whom?
Biobehavioral Data
Biobehavioral measures are becoming increasingly important, especially in
health and health-related research. Body mass index, for example, is often used in
research on obesity as a measure of fitness (Flegal, Carroll, Ogden, & Johnson,
2002). Increasingly, in studies of illegal behavior, such as drug use, biobehavioral
measures using urinalysis are viewed as more valid than self-reports due to the
stigma associated with the behavior (e.g., Kim & Hill, 2003). Many of the measures,
however, require the use of advanced technology and can increase the expense of
data collection.
Observational Data
Observational procedures become necessary when events, actions, or circum-
stances are the major form of the data. If the events, actions, or circumstances are
repetitive or numerous, this form of data can be easier to collect than data com-
posed of rare events that are difficult to observe. Because the subject of the data col-
lection is often complex, the researcher may need to create detailed guidelines to
structure the data collection, coding, and analysis (see Maxwell, Chapter 7, for more
detail on qualitative data categorization and analysis).
Documents
Documentary evidence may also serve as the basis for an applied researcher's
data collection. Particular kinds of documents may allow the researcher to track
what happened, when it happened, and who was involved. Examples of documen-
tary data include meeting minutes, journals, and program reports. Investigative
research may rely on documentary evidence, often in combination with data from
interviews.
Amount of Data
The research planner must anticipate the amount of data that will be needed to
conduct the study. Planning for the appropriate amount involves decisions regard-
ing the number and variety of data sources, the time periods of interest, and the
number of units (e.g., study participants), as well as the precision desired. As noted
earlier, statistical conclusion validity concerns primarily those factors that might
make it appear that there were no statistically significant effects when, in fact, there
were effects.
Effect size is defined as the proportion of variance accounted for by the treat-
ment, or as the difference between a treatment and control group measured in stan-
dard deviation units. The purpose of using standard deviation units is to produce
a measure that is independent of the metric used in the original variable. Thus, we
can discuss universal effect sizes regardless of whether we are measuring school
grades, days absent, or self-esteem scores. This makes possible the comparison of
different studies and different measures in the same study. Conversion to standard
deviation units can be obtained by subtracting the mean of the control group from
the mean of the treatment group and then dividing this difference by the pooled or
combined standard deviations of the two groups.
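In standard notation (a conventional rendering of the calculation just described, not notation taken from the chapter itself), the standardized mean difference is

    d = \frac{\bar{X}_T - \bar{X}_C}{s_{pooled}},
    \qquad
    s_{pooled} = \sqrt{\frac{(n_T - 1)\,s_T^2 + (n_C - 1)\,s_C^2}{n_T + n_C - 2}}

where \bar{X}_T and \bar{X}_C are the treatment and control group means, s_T and s_C their standard deviations, and n_T and n_C the group sizes. For example, a 5-point difference between group means on a scale with a pooled standard deviation of 10 corresponds to an effect size of 0.5.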
There are several factors that could account for not finding an effect when there
actually is one. As Lipsey and Hurley (Chapter 2) indicate, there are four factors that
govern statistical power: the statistical test, the alpha level, the sample size, and the
effect size. Many researchers, when aware of power concerns, mistakenly believe that
increasing sample size is the only way to increase statistical power. Increasing the
amount of data collected (the sample size) is clearly one route to increasing power;
however, given the costs of additional data collection, the researcher should consider
an increase in sample size only after he or she has thoroughly explored the alterna-
tives of increasing the sensitivity of the measures, improving the delivery of treat-
ment to obtain a bigger effect, selecting other statistical tests, and raising the alpha
level. If planning indicates that power still may not be sufficient, then the researcher
faces the choice of not conducting the study, changing the study to address more
qualitative questions, or proceeding with the study but informing the clients of the
risk of missing effects below a certain size. (More information on how to improve
the statistical power of a design can be found in Lipsey & Hurley, Chapter 2.)
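As a rough planning sketch (the effect sizes, alpha levels, and target power below are illustrative assumptions, and the statsmodels power module is used only as one convenient tool), the interplay of these factors can be explored numerically:

    # How the required sample size changes as the assumed effect size and
    # alpha level change, holding desired power at .80 (independent-samples t test).
    from statsmodels.stats.power import TTestIndPower

    analysis = TTestIndPower()

    # Small-to-medium effect (d = 0.3), two-sided alpha = .05, power = .80.
    n_small = analysis.solve_power(effect_size=0.3, alpha=0.05, power=0.80)
    print(round(n_small))   # roughly 175 participants per group

    # A larger assumed effect (d = 0.5) and a more lenient alpha (.10)
    # cut the required sample size sharply.
    n_larger = analysis.solve_power(effect_size=0.5, alpha=0.10, power=0.80)
    print(round(n_larger))  # roughly 50 per group

Sketches like this make the trade-offs discussed above explicit: more sensitive measures and better-delivered treatments raise the expected effect size, which in turn can reduce the amount of data that must be collected.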
With qualitative studies, the same set of trade-offs is made in planning how much data to collect; that is, consideration of the number and variety of data sources available, the time periods of interest, and the number of units, as well as the precision desired (see Harrison, Chapter 10). Precision in qualitative studies, however, refers not so much to statistical power as to the need for triangulation to
establish the validity of conclusions. Triangulation refers to the use of multiple data
sources and/or methods to measure a construct or a phenomenon in order to see if
they converge and support the same conclusions. The more diverse the sources and
methods, the greater confidence there is in the convergence of the findings.
Maxwell (Chapter 7) describes a number of strategies, including triangulation, for
ensuring and assessing the validity of conclusions from qualitative data.
Design Fit
Even when accurate and reliable data exist or can be collected, the researcher
must ask whether the data fit the necessary parameters of the design. Are they available
on all necessary subgroups? Are they available for the appropriate time periods? Is
it possible to obtain data at the right level of analysis (e.g., individual student vs.
school)? Do different databases feeding into the study contain comparable vari-
ables? Are they coded the same way?
If extant databases are used, the researcher may need to ask if the database is suf-
ficiently complete to support the research. Are all variables of interest present? If an
interrupted time-series design is contemplated, the researcher may need to make
certain that data are available for enough time points before and after the intervention.
Tests
In applied studies, researchers are more likely to make use of existing instru-
ments to measure knowledge or performance than to develop new ones. Whether
choosing to use a test off the shelf or to capitalize on an existing database that
includes such data, it is very important that the researcher be thoroughly familiar
with the content of the instrument, its scoring, the literature on its creation and
norming, and any ongoing controversies about its accuracy. There are several com-
pendiums of tests available that describe their characteristics (e.g., Robinson,
Shaver, & Wrightsman, 1999).
Typically, a computer program is used to extract the data from records and read the
information into the research file. In circumstances where there are multiple
sources of data (e.g., monthly welfare caseload data tapes), it may be necessary to
apply these procedures to multiple data sources, using another program to merge
the information into the appropriate format for analysis.
Telephone surveys are similar in many respects to in-person surveys, the key difference
being the mode of data collection. In Chapter 16, Lavrakas
describes telephone survey methods, including issues of sampling and selection of
respondents and supervision of interviewers. Computer-assisted telephone inter-
viewing (CATI), the oldest form of computer-assisted interviewing, allows inter-
viewers to ask questions over the telephone and key the data directly into the
computer system. As with CAPI, CATI has a strong advantage in situations where
the interview has a complex structure (e.g., complicated skip patterns) and also
provides the ability to reconcile data inconsistencies at the point of data collection
(e.g., Fowler, 2002). In Chapter 15, Mangione and Van Ness provide more detail on
the use of mail surveys.
Resource Planning
Before making final decisions about the specific design to use and the type of data
collection procedures to employ, the investigator must take into account the
resources available and the limitations of those resources. Resource planning is an
integral part of the iterative Stage II planning activities (see Figure 1.2). Resources
important to consider are the following:
Data: What are the sources of information needed and how will they be
obtained?
Time: How much time is required to conduct the entire research project,
including final analyses and reporting?
Personnel: How many researchers are needed and what are their skills?
Money: How much money is needed to implement the research and in what
categories?
Data as a Resource
The most important resource for any research project consists of the data
needed to answer the research question. As noted, data can be obtained primarily
in two ways: from original data collected by the investigator and from existing data.
We discuss below the issues associated with primary data collection and the issues
involved in the use of secondary data.
Site Selection. Applied research and basic research differ on several dimensions, as
discussed earlier, but probably the most salient difference is in the location of the
research. The setting has a clear impact on the research, not only in defining the
population studied, but also in the researcher's formulation of the research ques-
tion, the research design, the measures, and the inferences that can be drawn from
the study. The setting can also determine whether there are enough research par-
ticipants available.
Deciding on the appropriate number and selection of sites is an integral part of
the design/data collection decision, and often there is no single correct answer. Is it
best to choose "typical" sites, a range of sites, "representative" sites, the "best" site,
or the "worst" site? There are always more salient variables for site selection than
resources for study execution, and no matter what criteria are used, some critics will
claim that other more important site characteristics were omitted. For this reason,
we recommend that the researcher make decisions regarding site selection in close
coordination with the research client and/or advisory group. In general, it is also
better to concentrate on as few sites as are required, rather than stretching the time
and management efforts of the research team across too many locations.
There is another major implication connected with site selection. As noted ear-
lier, multilevel designs have implications for the number and type of sites selected.
In hierarchical designs, if the research intervention is at the site level (as in the ear-
lier school example), then the investigator needs to have a sufficient number of sites
in each experimental condition to maintain enough statistical power to detect a
meaningful effect. For example, if a drug prevention program is instituted at the
school level, then the number of schools, not classes or students, is what is impor-
tant. One of the problems of using units lower in the hierarchy, such as classes, is that
there may be concern about contamination from one condition to another. In the
case where teachers are delivering the intervention and they teach in more than one
classroom, then it should be obvious that classroom is not a suitable unit of analy-
sis. Even if there is little or no chance of contamination, the observations still may
be correlated and not independent of each other. This correlation produces what is
sometimes called a design effect, which reduces statistical power by reducing, in effect,
the number of participants or units. Proper design and analysis require multiple units, with the
implication that enough units have to exist in the environment to do the study. In
the case of schools, there may be a sufficient number in a given city. The same may
not be true for hospital emergency rooms, public housing units, or mental health
centers. Studies with these organizations will typically require the participation of
multiple cities. More about designing and analyzing these site-based hierarchical
designs can be found in Raudenbush and Bryk (2002) and Graham et al. (2008).
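As a rough illustration of how this correlation erodes the effective sample size, the Python sketch below applies the common variance-inflation approximation, DEFF = 1 + (m − 1) × ICC, where m is the cluster size and ICC is the intraclass correlation. The numbers are hypothetical, and the formula is the standard approximation rather than a calculation given in this chapter.

def design_effect(cluster_size, icc):
    """Variance inflation from clustering: DEFF = 1 + (m - 1) * ICC."""
    return 1 + (cluster_size - 1) * icc

def effective_sample_size(n_total, cluster_size, icc):
    """Number of independent observations the clustered sample is roughly 'worth'."""
    return n_total / design_effect(cluster_size, icc)

# Hypothetical numbers: 20 schools with 100 students each, modest within-school correlation
n_students, students_per_school, icc = 2000, 100, 0.05
print(f"Design effect: {design_effect(students_per_school, icc):.2f}")
print(f"Effective sample size: {effective_sample_size(n_students, students_per_school, icc):.0f}")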
The distinction between "frontstage" and "backstage" made by Goffman (1959)
also helps assess the openness of the setting to research. Frontstage activities are
available to anyone, whereas backstage entrance is limited. Thus in a trial, the
actions that take place in the courtroom constitute frontstage activity, open to any-
one who can obtain a seat. Entrance to the judge's chambers is more limited, pres-
ence during lawyer-client conferences is even more restricted, and the observation
of jury deliberations is not permitted. The researcher needs to assess the openness
of the setting before taking the next step: seeking authorization for the research.
Authorization. Even totally open and visible settings usually require some degree of
authorization for data collection. Public space may not be as totally available to the
researcher as it may seem. For example, it is a good idea to notify authorities if a
research team is going to be present in some public setting for an extended period
of time. Although the team members' presence may not be illegal and no permis-
sion is required for them to conduct observations or interviews, residents of the
area may become suspicious and call the police.
If the setting is a closed one, the researcher will be required to obtain the per-
mission of the individuals who control or believe they control access. If there are
several sites that are eligible for participation and they are within one organization,
then it behooves the researcher to explore the independence of these sites from the
parent organization. For example, in doing research in school systems, it might also
be advisable to approach a principal to obtain preliminary approval that then can
support a request for formal approval from the school district's central administration.
In studies involving homeless families, for example, researchers often have to convince program staff
that the research will be worthwhile, will not place the families in the position of
"guinea pigs," and will treat the families with respect and dignity. Most important,
an organization's decision makers must be convinced that the organization will not
be taking a significant risk or taking up valuable time in participating in the study.
The planner must be prepared to present a strong case for why a nonresearch-
oriented organization should want to involve itself in a research project.
Finally, any agreement between the researcher and the organization should be in
writing. This may take the form of a letter addressed to the organization's project
liaison officer (if there is one) for the research. The letter should describe the
procedures that will take place and indicate the dates that the investigator will be
on-site. The agreement should be detailed and should include how the organization
will cooperate with the research.
The importance of site cooperation cannot be stressed too much. Lack of cooper-
ation or dropping out of the study is among the major factors that cause studies to
fail. It is better to recruit more sites than you think you will need because invariably
some will drop out before the study starts, and others will not have the client flow
they assured you they had. This is discussed more in the next section.
In one study, for example, the plan had to account for delays in hiring the full set of case managers as well as other times when one or
more positions were unfilled, delays in enrolling families, and difficulties in both
having full caseloads and moving families out of service in the 9-month time
period due to the problems that families faced. Therefore, with the slippage of each
part of the equation, the number of potential families for the study (before even
considering eligibility criteria and refusal rates) was considerably smaller than ini-
tial expectations.
Special care must be taken in defining exactly who is eligible to participate in
the study. For example, a pipeline study found that there were more than enough
potential participants. However, the participant sample was limited to one child per
family. It was not known until the study was underway that 30% of the potential
participants had a sibling receiving treatment from the same organization.
Related to the number of participants is the assurance that the research design
can be successfully implemented. Randomized designs are especially vulnerable to
implementation problems. It is easy to promise that there will be no new taxes, that
the check is in the mail, and that a randomized experiment will be conducted, but
it is often difficult to deliver on these promises. In an applied setting, the investiga-
tor should obtain agreement from authorities in writing that they will cooperate in
the conduct of the study. This agreement must be detailed and procedurally ori-
ented and should clearly specify the responsibilities of the researcher and those who
control the setting. While a written document may be helpful, it is not a legal con-
tract that can be enforced. An organization's leadership can change, and with it the
permission to conduct the study.
The ability to implement the research depends on the ability of the investigator
to carry out the planned data collection procedures. A written plan for data collec-
tion is critical to success, but it does not assure effective implementation. A pilot
study or walk-through of the procedure is necessary to determine if it is feasible. In
this procedure, the investigator needs to consider both accessibility and other sup-
port. Written plans agreed to before the start of the study are helpful but not the final
word. The researcher needs to monitor the implementation of the research. Studies
can be sabotaged by resentful employees. For example, children eligible for services
were recruited from a mental health center by the staff person who determined the
severity of each case on a 10-point scale. The staff person was instructed that
the mild cases, rated 4 or less, or the emergency cases, rated 10, were not eligible for the
study. That left us with cases rated in the range of 5 to 9, which would supply the needed
number of participants. In the first month, far fewer children entered the study than
expected. It was discovered that the person answering the phone was rating far fewer
cases in the eligible range than needed because she didn't think the study should be
done. Once the director of the center talked to her, the situation was resolved.
Accessibility. There are a large number of seemingly unimportant details that can
damage a research project if they are ignored. Will the research participants have
the means to travel to the site? Is there sufficient public transportation? If not, will
the investigator arrange for transportation? Will families need child care to partic-
ipate? If the study is going to use an organization's space for data collection, will the
investigator need a key? Is there anyone else who may use the space? Who controls
scheduling and room assignments? Has this person been notified? For example, a
researcher about to collect posttest data in a classroom should ensure that he or she
will not be asked to vacate the space before data collection is completed.
Other Support. Are the lighting and sound sufficient for the study? If the study
requires the use of electrical equipment, will there be sufficient electrical outlets?
Will the equipments cords reach the outlets or should the researcher bring exten-
sion cords? Do the participants need food or drink? Space is a precious commod-
ity in many institutions; the researcher should never assume that the research
project will have sufficient space.
Finally, once the researcher has judged the administrative records or other data-
base to be of sufficient quality for the study, he or she must then go through the
necessary procedures to obtain the data. In addition to determining the procedures
for extracting and physically transferring the data, the investigator also must
demonstrate how the confidentiality of the records will be protected. For example,
school systems may want a formal contractual agreement between the university
and the school system before they would release identifiable student achievement
data. Knowledge of relevant laws and regulations is important. In this example,
the researchers had a legitimate right to the identifiable data under federal regula-
tions, namely, the Family Educational Rights and Privacy Act (FERPA) and the
Protection of Pupil Rights Amendment (PPRA). While it may seem to be a simple
request, it took over a year to obtain the data.
Time as a Resource
Time takes on two important dimensions in the planning of applied research:
calendar time and clock time. Calendar time is the total amount of time available
for a project, and it varies across projects. Clock time is the actual number of working
hours needed to complete particular tasks.
Time Budget
In planning to use any resource, the researcher should create a budget that describes
how the resource will be allocated. Both calendar and clock time need to be budgeted.
To budget calendar time, the researcher must know the duration of the entire project.
In applied research, the duration typically is set at the start of the project, and the
investigator then tailors the research to fit the length of time available. There may be
little flexibility in total calendar time on some projects. Funded research projects usu-
ally operate on a calendar basis; that is, projects are funded for specific periods of time.
Investigators must plan what can be accomplished within the time available.
The second time budget a researcher must create concerns clock time. How
much actual time will it take to develop a questionnaire or to interview all the par-
ticipants? It is important for the investigator to decide what units of time (e.g.,
hours, days, months) will be used in the budget. That is, what is the smallest unit of
analysis of the research process that will be useful in calculating how much time it
will take to complete the research project? To answer this question, we now turn to
the concepts of tasks.
It is important to pilot the procedures in all the sites or at least a sample that represents
the range of sites involved. If the data collection approach involves extracting infor-
mation from administrative records, the researcher should pilot test the training
planned for data extractors as well as the data coding process. Checks should be
included for accuracy and consistency across coders.
When external validity or generalizability is a major concern, the researcher will
need to take special care in planning the construction of the sample. The sampling
procedure describes the potential subjects and how they will be selected to partici-
pate in the study. This procedure may be very complex, depending on the type of
sampling plan adopted.
The next phase of research is usually the data collection. The investigator needs
to determine how long it will take to gain access to the records as well as how long
it will take to extract the data from the records. It is important that the researcher
not only ascertain how long it will take to collect the data from the records but also
discover whether information assumed to be found in those records is actually there. If the
researcher is planning to conduct a survey, the procedure for estimating the length
of time needed for this process could be extensive. Fowler and Cosenza (Chapter 12)
describe the steps involved in conducting a survey. These include developing the
instrument, recruiting and training interviewers, sampling, and the actual collec-
tion of the data. Telephone interviews require some special techniques that are
described in detail by Lavrakas (Chapter 16). Time must also be allotted to obtain
institutional review board approval of the project if it involves human subjects. If
a project is involved in federal data collection, review may also be required by the
Office of Management and Budget (OMB), which, depending on the size of the
project, can involve a considerable effort to develop the OMB review package and
up to 4 months for the review to occur.
The next phase usually associated with any research project is data analysis.
Whether the investigator is using qualitative or quantitative methods, time must be
allocated for the analysis of data. Analysis includes not only statistical testing using
a computer but also the preparation of the data for computer analysis. Steps in this
process include cleaning the data (i.e., making certain that the responses are read-
able and unambiguous for data entry personnel), physically entering the data, and
checking for the internal consistency of the data (Smith, Breda, Simmons, Vides de
Andrade, & Bickman, 2008). Once the data are clean, the first step in quantitative
analysis is the production of descriptive statistics such as frequencies, means, stan-
dard deviations, and measures of skewness. More complex studies may require
researchers to conduct inferential statistical tests. As part of the design, a clear and
comprehensive analysis plan should be developed that includes the steps for clean-
ing the data as well as the sequence of analyses that will take place, including analy-
ses that may be needed to test for possible artifacts (e.g., attrition).
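The sketch below illustrates, in Python with pandas, what this cleaning-then-describing sequence might look like in practice; the variable names, plausible-range checks, and data values are invented for illustration and are not taken from any study described here.

import pandas as pd

# Hypothetical raw data; column names, values, and plausible ranges are illustrative only
df = pd.DataFrame({
    "age": [15, 16, 17, 210, 16],        # 210 is a likely keying error
    "days_absent": [3, 0, 5, 2, -1],     # -1 is impossible
    "grade_level": [9, 10, 11, 10, 9],
})

# Cleaning: set values outside plausible ranges to missing before any analysis
plausible = {"age": (10, 20), "days_absent": (0, 180)}
for col, (lo, hi) in plausible.items():
    in_range = df[col].between(lo, hi)
    print(f"{col}: {(~in_range).sum()} out-of-range value(s) set to missing")
    df[col] = df[col].where(in_range)

# First analytic step: descriptive statistics
print(df.describe())                     # means, standard deviations, min/max
print(df.skew(numeric_only=True))        # skewness
print(df["grade_level"].value_counts())  # frequencies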
Finally, time needs to be allocated for communicating the results. An applied
research project almost always requires a final report, usually a lengthy, detailed
analysis as well as one or more verbal briefings. Within the report itself, the
researcher should take the time needed to communicate the data to the audience at
the right level. In particular, visual displays can often communicate even the most
complex findings in a more straightforward manner than prose.
Because most people will not read the entire report, it is critical that the researcher
include a two- or three-page executive summary that succinctly and clearly summa-
rizes the main findings. The executive summary should focus on the findings,
presenting them as the highlights of the study. No matter how much effort and inno-
vation went into data collection, these procedures are of interest primarily to other
researchers, not to typical sponsors of applied research or other stakeholders. The best
the researcher can hope to accomplish with these latter audiences is to educate them
about the limitations of the findings based on the specific methods used.
The investigator should allocate time not only for producing a report but also
for verbally communicating study findings to sponsors and perhaps to other key
audiences. Moreover, if the investigator desires to have the results of the study used,
it is likely that he or she needs to allocate time to work with the sponsor and other
organizations in interpreting and applying the findings of the study. This last
utilization-oriented perspective is often not included by researchers planning their
time budgets.
Time Estimates
Once the researcher has described all the tasks and subtasks, the next part of the
planning process is to estimate how long it will take to complete each task. One way to
approach this problem is to reduce each task to its smallest unit. For example, in the
data collection phase, an estimate of the total amount of interviewing time is needed.
The simplest way to estimate this total is to calculate how long each interview should
take. Pilot data are critical for helping the researcher to develop accurate estimates.
The clock-time budget indicates only how long it will take to complete each task.
What this budget does not tell the researcher is the sequencing and the real calen-
dar time needed for conducting the research. Calendar time can be calculated from
clock-time estimates, but the investigator needs to make certain other assumptions
as well. For example, calendar conflicts need to be considered in the budgeting.
Schools, for instance, have a restricted window of time for data collection, usually
avoiding the month around school entry and any testing periods. As another example, some
service programs have almost no time for researchers around the busy holiday
times, making December a difficult time to schedule any onsite data collection.
Another set of assumptions is based on the time needed for data collection. For
example, if the study uses interviewers to collect data and 200 hours of interviewing
time are required, the length of calendar time needed will depend on several factors.
Most clearly, the number of interviewers will be a critical factor. One interviewer will
take a minimum of 200 hours to complete this task, whereas 200 interviewers could
theoretically do it in 1 hour. However, the larger number of interviewers may create
a need for other mechanisms to be put into place (e.g., interviewer supervision and
monitoring) as well as create concerns regarding the quality of the data. Thus the
researcher needs to specify the staffing levels and research team skills required for the
project. This is the next kind of budget that needs to be developed.
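A back-of-the-envelope calculation of this kind can be scripted. The short Python sketch below assumes, purely for illustration, that each interviewer can devote about 20 productive hours per week to interviewing; the figure of 200 total hours comes from the example above.

def calendar_weeks(total_interview_hours, n_interviewers, productive_hours_per_week=20):
    """Rough calendar time needed to complete a fixed amount of interviewing,
    assuming each interviewer averages a limited number of productive hours per week."""
    return total_interview_hours / (n_interviewers * productive_hours_per_week)

for team_size in (1, 4, 10):
    weeks = calendar_weeks(200, team_size)
    print(f"{team_size} interviewer(s): about {weeks:.1f} calendar weeks")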
Each research project has unique characteristics that make it difficult to gener-
alize from one project to another. Estimating time and expenses is an inexact art. In
most cases, the researcher underestimates the time and cost of a project. Unexpected
events that disrupt the research should be expected. Since research budgets typically
do not permit funds to be reserved for unforeseen events, the planner is advised
build in some aspect of the project that could be sacrificed without affecting the
central features of the research. The time and funds allocated to that task can usu-
ally be used to provide the additional support needed to complete the research.
Personnel as a Resource
Skills Budget
Once the investigator has described the tasks that need to be accomplished, the
next step is to decide what kinds of people are needed to carry out those tasks. What
characteristics are needed for a trained observer or an interviewer? What are the
requirements for a supervisor? To answer these questions, the investigator should
complete a skills matrix that describes the requisite skills needed for the tasks and
attaches names or positions of the research team to each cluster of skills. Typically,
a single individual does not possess all the requisite skills, so a team will need to be
developed for the research project. As noted earlier, in addition to specific research
tasks, the investigator needs to consider management of the project. This function
should be allocated to every research project. Someone will have to manage the var-
ious parts of the project to make sure that they are working together and that the
schedule is being met.
Person Loading
Once the tasks are specified and the amount of time required to complete each
task is estimated, the investigator must assign these tasks to individuals. The assign-
ment plan is described by a person-loading table that shows how much time each
person is supposed to work on each task.
At some point in the planning process, the researcher needs to return to real, or
calendar, time, because the project will be conducted under real-time constraints.
Thus the tasking chart, or Gantt chart, needs to be superimposed on a calendar.
This chart simply shows the tasks on the left-hand side and the months of the study
period at the top. Bars show the length of calendar time allocated for the comple-
tion of specific subtasks. The Gantt chart shows not only how long each task takes,
but also the approximate relationship in calendar time between tasks. Although
inexact, this chart can show the precedence of research tasks and the extent to
which some tasks will overlap and require greater staff time. One of the key rela-
tionships and assumptions made in producing a plan is that no individual will work
more than 40 hours a week. Thus the person-loading chart needs to be checked
against the Gantt chart to make sure that tasks can be completed by those individ-
uals assigned to them within the periods specified in the Gantt chart. Very reason-
ably priced computer programs are available to help the planner do these
calculations and draw the appropriate charts.
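The cross-check between the person-loading chart and the Gantt chart can also be approximated with a few lines of code rather than specialized software. The Python sketch below uses a hypothetical schedule (the tasks, total hours, and start and end weeks are invented) to flag any week in which a person's overlapping tasks would exceed 40 hours.

from collections import defaultdict

# Hypothetical schedule: (person, task, total hours, start week, end week)
assignments = [
    ("interviewer_A", "recruiting",    80, 1, 4),
    ("interviewer_A", "interviewing", 200, 3, 10),
    ("analyst_B",     "data_cleaning", 80, 9, 12),
    ("analyst_B",     "analysis",     150, 12, 18),
]

# Spread each task's hours evenly over its scheduled weeks
weekly_load = defaultdict(float)  # (person, week) -> hours
for person, task, hours, start, end in assignments:
    n_weeks = end - start + 1
    for week in range(start, end + 1):
        weekly_load[(person, week)] += hours / n_weeks

# Flag any week in which someone is booked for more than 40 hours
for (person, week), hours in sorted(weekly_load.items()):
    if hours > 40:
        print(f"{person} is overloaded in week {week}: {hours:.1f} hours")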
Financial Resources
Usually, the biggest part of any research project's financial budget is consumed
by personnel, the research staff. Social science research, especially applied social
science, is very labor-intensive. Moreover, the labor of some individuals can be very
costly. To produce a budget based on predicted costs, the investigator needs to
follow a few simple steps.
Based on the person-loading chart, the investigator can compute total personnel
costs for the project by multiplying the hours allocated to various individuals by
their hourly costs.
The investigator should compute personnel costs for each task. In addition, if
the project will take place over a period of years, the planner will need to provide
for salary increases in the estimates. Hourly cost typically includes salary and fringe
benefits and may also include facilities and administration (F&A) or overhead
costs. (In some instances, personnel costs need to be calculated by some other time
dimensions, such as daily or yearly rates; similarly, project costs may need to be
categorized by month or some time frame other than year.)
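The arithmetic of such a budget is simple enough to sketch in code. The Python example below multiplies hours by hourly rates, applies assumed fringe, F&A, and annual raise percentages (all placeholder figures, not actual institutional rates), and sums the result.

# Hypothetical person-loading (hours per person per project year) and first-year rates;
# the fringe, F&A, and raise percentages are placeholders, not actual institutional rates
hours = {
    "principal_investigator": {1: 200, 2: 150},
    "research_assistant":     {1: 900, 2: 900},
}
hourly_rate_year1 = {"principal_investigator": 80.0, "research_assistant": 30.0}

FRINGE = 0.28   # fringe benefits as a fraction of salary
F_AND_A = 0.55  # facilities & administration (overhead) rate
RAISE = 0.03    # assumed annual salary increase

total = 0.0
for person, by_year in hours.items():
    for year, hrs in sorted(by_year.items()):
        rate = hourly_rate_year1[person] * (1 + RAISE) ** (year - 1)
        cost = hrs * rate * (1 + FRINGE) * (1 + F_AND_A)
        total += cost
        print(f"Year {year}  {person:25s} ${cost:>10,.0f}")
print(f"Total personnel cost: ${total:,.0f}")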
After the budget has been calculated, the investigator may be faced with a total
cost that is not reasonable for the project, either because the sponsor does not have
those funds available or because the bidding for the project is very competitive. If
this occurs, the investigator has several alternatives. Possible alternatives are to elim-
inate some tasks, reduce the scope of others, and/or shift the time from more expen-
sive to less expensive staff for certain tasks where it is reasonable. The investigator
needs to use ingenuity to try to devise not only a valid, reliable, and sensitive project,
but one that is efficient as well. For example, in some cases this may mean recom-
mending streamlining data collection or streamlining the reporting requirements.
The financial budget, as well as the time budget, should force the investigator to
realize the trade-offs that are involved in applied research. Should the investigator
use a longer instrument, at a higher cost, or collect fewer data from more subjects?
Should the subscales on an instrument be longer, and thus more reliable, or should
more domains be covered, with each domain composed of fewer items and thus less
reliable? Should emphasis be placed on representative sampling as opposed to a pur-
posive sampling procedure? Should the researcher use multiple data collection tech-
niques, such as observation and interviewing, or should the research plan include
only one technique, with more data collected by that procedure? These and other
such questions are ones that all research planners face. However, when a researcher
is under strict time and cost limitations, the salience of these alternatives is very high.
Planning at this stage involves (a) assessing design trade-offs and (b) testing the feasibility of the proposed design. These activ-
ities almost always occur simultaneously. The results may require the researcher to
reconsider the potential design approach or even to return to the client to renego-
tiate the study questions.
Generalizability
Generalizability refers to the extent to which research findings can be credibly
applied to a wider setting than the research setting. For example, if one wants to
describe the methods used in vocational computer training programs, one might
decide to study a local high school, an entire community (including both high
schools and vocational education agencies and institutions), or schools across the
nation. These choices vary widely with respect to the resources required and the
effort that must be devoted to constructing sampling frames. The trade-offs here
are ones of both resources and time. Local information can be obtained much more
inexpensively and quickly than can information about a larger area; however, one
will not know whether the results obtained are representative of the methods used
in other high schools or used nationally.
Generalizability can also involve time dimensions, as well as geographic and
population dimensions. Moreover, the researcher needs to have a clear
understanding of the intended generalizability boundaries at the initiation of the study.
Conclusiveness of Findings
One of the key questions the researcher must address is how conclusive the study
must be. Research can be categorized as to whether it is exploratory or confirma-
tory in nature. An exploratory study might seek only to identify the dimensions of
a problem, for example, the types of drug abuse commonly found in a high school
population. More is demanded from a confirmatory study. In this case, the
researcher and client have a hypothesis to test, for example, that among high school
students, use of marijuana is twice as likely as abuse of cocaine or heroin. In this
example, it would be necessary to measure with confidence the rates of drug abuse
for a variety of drugs and to test the observed differences in rate of use.
Precision of Estimates
In choosing design approaches, it is essential that the researcher have an idea of
how small a difference or effect it is important to be able to detect for an outcome
evaluation or how precise a sample to draw for a survey. This decision drives the
choice of sample sizes and sensitivity of instrumentation, and thus affects the
resources that must be allocated to the study.
Sampling error in survey research poses a similar issue. The more precise the
estimate required, the greater the amount of resources needed to conduct a survey.
If a political candidate feels that he or she will win by a landslide, then fewer
resources are required to conduct a political poll than if the race is going to be close
and the candidate requires more precision or certainty concerning the outcome as
predicted by a survey.
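The resource implications of precision can be made concrete with the standard sample size formula for estimating a proportion from a simple random sample, n = z²p(1 − p)/MOE². The Python sketch below applies it for several margins of error, assuming the conservative p = .50; it is an illustration of the general point, not a calculation taken from this chapter.

from math import ceil
from statistics import NormalDist

def poll_sample_size(margin_of_error, confidence=0.95, p=0.5):
    """Simple-random-sample size needed to estimate a proportion
    to within +/- margin_of_error at the given confidence level."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil(z**2 * p * (1 - p) / margin_of_error**2)

# A close race demands a tighter margin of error, and hence many more respondents
for moe in (0.05, 0.03, 0.01):
    print(f"+/-{moe:.0%} margin: n = {poll_sample_size(moe)}")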
Comprehensiveness of Measurement
The last area of choice involves the comprehensiveness of measurement used in
the study. It is usually desirable to use multiple methods or multiple measures in a
study (especially in qualitative studies, as noted earlier) for this allows the researcher
to look for consistency in results, thereby increasing confidence in findings. However,
multiple measures and methods can sometimes be very expensive and potentially
prohibitive. Thus researchers frequently make trade-offs between resources and com-
prehensiveness in designing measurement and data collection approaches.
Choosing the most appropriate strategy involves making trade-offs between the
level of detail that can be obtained and the resources available. Calendar time to exe-
cute the study also may be relevant. Within the measurement area, the researcher
often will have to make a decision about breadth of measurement versus depth of
measurement. Here the choice is whether to cover a larger number of constructs,
each with a brief instrument, or to study fewer constructs with longer and usually
more sensitive instrumentation. Some trade-off between comprehensiveness
(breadth) and depth is almost always made in research. Thus, within fixed resources,
a decision to increase external validity by broadening the sample frame may require
a reduction in resources in other aspects of the design. The researcher needs to con-
sider which aspects of the research process require the most resources, often in con-
sultation with the research sponsor or other possible users of the study findings.
A careful feasibility assessment keeps resources from being wasted on research that has no chance of answering the posed
questions. A no-go decision does not represent a failure on the part of the researcher
but rather an opportunity to improve on the design or research procedures, and it
ultimately results in better research and hopefully better research utilization. A go
decision reinforces the confidence of the researcher and others in the utility of
expending resources to conduct the study.
Once the researcher has appropriately balanced any design trade-offs and deter-
mined the feasibility of the research plan, he or she should hold final discussions with
the research client to confirm the proposed approach. If the client's agreement is
obtained, the research planning phase is complete. If agreement is not forthcoming,
the process may start again, with a change in research scope (questions) or methods.
Conclusion
The key to conducting a sound applied research study is planning. In this chapter,
we have described several steps that can be taken in the planning stage to bolster a
study and increase its potential for successful implementation. We hope that these
steps will help you to conduct applied research that is credible, feasible, and useful.
References
Baron, R. M., & Kenny, D. A. (1986). The moderator-mediator variable distinction in social
psychological research: Conceptual, strategic, and statistical considerations. Journal of
Personality and Social Psychology, 51, 1173–1182.
Bickman, L. (1985). Randomized experiments in education: Implementation lessons. In
R. F. Boruch (Ed.), Randomized field experiments (pp. 39–53). San Francisco: Jossey-Bass.
Bickman, L. (1987). The functions of program theory. In L. Bickman (Ed.), Using program
theory in evaluation (pp. 5–18). San Francisco: Jossey-Bass.
Bickman, L. (1989). Barriers to the use of program theory: The theory-driven perspective.
Evaluation and Program Planning, 12, 387–390.
Bickman, L. (Ed.). (1990). Advances in program theory. San Francisco: Jossey-Bass.
Bickman, L., & Rog, D. J. (1986). Stakeholder assessment in early intervention projects. In
L. Bickman & D. Weatherford (Eds.), Evaluating early childhood intervention programs.
Austin, TX: PRO-ED.
CATS Consortium. (2007). Implementing CBT for traumatized children and adolescents
after September 11th: Lessons learned from the Child and Adolescent Trauma
Treatments and Services (CATS) project. Journal of Clinical Child & Adolescent
Psychology, 36, 581–592.
Chen, H. (1990). Theory-driven evaluations. Newbury Park, CA: Sage.
Cook, T. D. (2002). Randomized experiments in educational policy research: A critical exam-
ination of the reasons the educational evaluation community has offered for not doing
them. Educational Evaluation and Policy Analysis, 24, 175–199.
Cronbach, L. J., Ambron, S. R., Dornbusch, S. M., Hess, R. D., Hornik, R. C., Phillips, D. C.,
et al. (1980). Toward reform of program evaluation. San Francisco: Jossey-Bass.
Dennis, M. L. (1990). Assessing the validity of randomized field experiments: An example
from drug treatment research. Evaluation Review, 14, 347–373.
Dillman, D. A. (1978). Mail and telephone surveys: The total design method. New York:
Wiley-Interscience.
Dillman, D. A. (2000). Mail and internet surveys: The tailored design method. New York: Wiley.
Dillman, D. A. (2006). Why choice of survey mode makes a difference. Public Health Reports,
121, 11–13.
Eid, M., & Diener, E. (Eds.). (2006). Handbook of multimethod measurement in psychology.
Washington, DC: American Psychological Association.
Felce, D., & Emerson, E. (2000). Observational methods in assessment of quality of life. In
T. Thompson, D. Felce, & F. J. Symons (Eds.), Behavioral observation: Technology and
applications in developmental disabilities (pp. 159–174). Baltimore: Paul Brookes.
Flegal, K. M., Carroll, M. D., Ogden, C. L., & Johnson, C. L. (2002). Prevalence and trends
in obesity among US adults, 1999–2000. Journal of the American Medical Association,
288, 1723–1727.
Foster, E. M. (2003). Propensity score matching: An illustrative analysis of dose response.
Medical Care, 41, 1183–1192.
Fowler, F. J., Jr. (2002). Survey research methods (3rd ed.). Thousand Oaks, CA: Sage.
Frechtling, J. A. (2007). Logic modeling in program evaluation. San Francisco: Jossey-Bass.
Goffman, E. (1959). The presentation of self in everyday life. Garden City, NY: Doubleday.
Gorard, S. (2002). The role of secondary data in combining methodological approaches.
Educational Review, 54, 231–237.
Graham, S. E., Singer, J. D., & Willett, J. B. (2008). An introduction to the multilevel model-
ing of change. In P. Alasuutari, L. Bickman, & J. Brannen (Eds.), The SAGE handbook of
social research methods (pp. 869–899). London: Sage.
Henry, G. T. (1990). Practical sampling. Newbury Park, CA: Sage.
Hofferth, S. L. (2005). Secondary data analysis in family research. Journal of Marriage and
Family, 67, 891–907.
Kim, M. T., & Hill, M. N. (2003). Validity of self-report of illicit drug use in young hyper-
tensive urban African American males. Addictive Behaviors, 28, 795–802.
Macias, C., Hargreaves, W., Bickman, L., Fisher, W., & Aronson, E. (2005). Impact of referral
source and study applicants' preference in random assignment on research enrollment, ser-
vice engagement, and evaluative outcomes. American Journal of Psychiatry, 162, 781–787.
McLaughlin, J. A., & Jordan, G. B. (2004). Using logic models. In H. P. Hatry, J. S. Wholey, &
K. E. Newcomer (Eds.), Handbook of practical program evaluation (2nd ed., pp. 7–32).
San Francisco: Jossey-Bass.
New Hampshire-Dartmouth Psychiatric Research Center. (1995). Residential follow-back
calendar. Lebanon, NH: Dartmouth Medical School.
Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and data
analysis methods (2nd ed.). Thousand Oaks, CA: Sage.
Records, K., & Rice, M. (2006). Enhancing participant recruitment in studies of sensitive
topics. Journal of the American Psychiatric Nurses Association, 12, 28–36.
Riccio, J. A., & Bloom, H. (2002). Extending the reach of randomized social experiments:
New directions in evaluations of American welfare-to-work and employment initia-
tives. Journal of the Royal Statistical Society: Series A (Statistics in Society), 165, 13–30.
Robinson, J. P., Shaver, P. R., & Wrightsman, L. S. (Eds.). (1999). Measures of political atti-
tudes. San Diego, CA: Academic Press.
Rog, D. J. (1985). A methodological analysis of evaluability assessment. PhD dissertation,
Vanderbilt University, Nashville, TN.
Rog, D. J. (1994). Expanding the boundaries of evaluation: Strategies for refining and
evaluating ill-defined interventions. In S. L. Friedman & H. C. Haywood (Eds.),
Developmental follow-up: Concepts, genres, domains, and methods (pp. 139–154). New
York: Academic Press.
Rog, D. J., & Huebner, R. (1992). Using research and theory in developing innovative
programs for homeless individuals. In H. Chen & P. H. Rossi (Eds.), Using theory to
improve program and policy evaluations (pp. 129–144). Westport, CT: Greenwood Press.
Rog, D. J., & Knickman, J. (2004). Strategies for comprehensive initiatives. In M. Braverman,
N. Constantine, & J. Slater (Eds.), Foundations and evaluations: Contexts and practices
for effective philanthropy (pp. 223–235). San Francisco: Jossey-Bass.
Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observa-
tional studies for causal effects. Biometrika, 70, 41–55.
Rosenbaum, P. R., & Rubin, D. B. (1984). Reducing bias in observational studies using
subclassification on the propensity score. Journal of the American Statistical Association,
79, 516–524.
Rubin, D. B. (1997). Estimating causal effects from large data sets using propensity scores.
Annals of Internal Medicine, 127, 757–763.
Shadish, W. R., Cook, T., & Campbell, D. T. (2002). Experimental and quasi-experimental
designs for generalized causal inference. Boston: Houghton Mifflin.
Smith, C. M., Breda, C. B., Simmons, T. M., Vides de Andrade, A. R., & Bickman, L. (2008).
Data preparation and data standards: The devil is in the details. In A. R. Stiffman (Ed.),
The nitty gritty of managing field research. New York: Oxford University Press.
Stewart, D. W., & Kamins, M. A. (1993). Secondary research: Information sources and methods
(2nd ed.). Newbury Park, CA: Sage.
Tsemberis, S., McHugo, G., Williams, V., Hanrahan, P., & Stefancic, A. (2006). Measuring
homelessness and residential stability: The residential time-line follow-back inventory.
Journal of Community Psychology, 35, 29–42.
Wholey, J. S. (2004). Evaluability assessment. In J. S. Wholey, H. P. Hatry, & K. E. Newcomer
(Eds.), Handbook of practical program evaluation (2nd ed., pp. 33–61). San Francisco:
Jossey-Bass.
CHAPTER 2
Design Sensitivity
Statistical Power for
Applied Experimental Research
Mark W. Lipsey
Sean M. Hurley
the intervention under investigation. What, then, determines our ability to detect
it? Answering this question requires that we specify what is meant by detecting a
difference in experimental research. Following current convention, we will take this
to mean that statistical criteria are used to reject the null hypothesis of no differ-
ence between the mean on the outcome measure for the persons in the treatment
condition and the mean for those in the control condition. In particular, we con-
clude that there is an effect if an appropriate statistical test indicates a statistically
significant difference between the treatment and control means.
Our goal in this chapter is to help researchers tune experimental design to
maximize sensitivity. However, before we can offer a close examination of the prac-
tical issues related to design sensitivity, we need to present a refined framework for
describing and assessing the desired result: a high probability of detecting a given
magnitude of effect if it exists. This brings us to the topic of statistical power, the
concept that will provide the idiom for this discussion of design sensitivity.
[Table 2.1 crosses the conclusion drawn from the statistical test on sample data (a significant difference vs. no significant difference) with the true population circumstances (T and C differ vs. T and C do not differ), yielding two correct conclusions plus Type I error (probability α) and Type II error (probability β).]
Statistical power is the probability that the statistical test will reject the null hypothesis,
given that there really is an intervention effect. This is the probability that must be
maximized for a research design to be sensitive to actual intervention effects.
Note that α and β in Table 2.1 are statements of conditional probabilities. They are
of the following form: If the null hypothesis is true (false), then the probability of an
erroneous statistical conclusion is α (β). When the null hypothesis is true, the prob-
ability of a statistical conclusion error is held to 5% by the convention of setting α =
.05. When the null hypothesis is false (i.e., there is a real effect), however, the proba-
bility of error is β, and β can be quite large. If we want to design experimental research
in which statistical significance is found when the intervention has a real effect, then
we must design for a low β error, that is, for high statistical power (1 − β).
An important question at this juncture concerns what criterion level of statisti-
cal power the researcher should strive for; that is, what level of risk for Type II
error is acceptable? By convention, researchers generally set α = .05 as the maxi-
mum acceptable probability of a Type I error. There is no analogous convention for
beta. Cohen (1977, 1988) suggested β = .20 as a reasonable value for general use
(more specifically, he suggested that power, equal to 1 − β, be at least .80). This sug-
gestion represents a judgment that Type I error is four times as serious as Type II
error. This position may not be defensible for many areas of applied research where
a null statistical result for a genuinely effective intervention may represent a great
loss of valuable practical knowledge.
A more reasoned approach would be to analyze explicitly the cost-risk issues that
apply to the particular research circumstances at hand (more on this later). At the
first level of analysis, the researcher might compare the relative seriousness of Type
I and Type II errors. If they are judged to be equally serious, the risk of each should
be kept comparable; that is, alpha should equal beta. Alternatively, if one is judged
to be more serious than the other, it should be held to a stricter standard even at the
expense of relaxing the other. If a convention must be adopted, it may be wise to
assume that, for intervention research of potential practical value, Type II error is at
least as important as Type I error. In this case, we would set β = .05, as is usually done
for α, and thus attempt to design research with power (1 − β) equal to .95.
Sample Size. Statistical significance testing is concerned with sampling error, the
expectable discrepancies between sample values and the corresponding population
value for a given sample statistic such as a difference between means. Because sam-
pling error is smaller for large samples, it is less likely to obscure real differences
between means and statistical power is greater.
Alpha Level. The level set for alpha influences the likelihood of statistical signifi-
cancelarger alpha makes significance easier to attain than does smaller alpha. When
the null hypothesis is false, therefore, statistical power increases as alpha increases.
Effect Size. If there is a real difference between the treatment and control conditions,
the size of that difference will influence the likelihood of attaining statistical signif-
icance. The larger the effect, the more probable is statistical significance and the
greater the statistical power. For a given dependent measure, effect size can be
thought of simply as the difference between the means of the treatment versus con-
trol populations. In this form, however, its magnitude is partly a function of how the
dependent measure is scaled. For most purposes, therefore, it is preferable to use an
effect size formulation that standardizes differences between means by dividing by
the standard deviation to adjust for arbitrary units of measurement. The effect size
(ES) for a given difference between means, therefore, can be represented as follows:
ES = (μt − μc) / σ,
where μt and μc are the respective means for the treatment and control popula-
tions and σ is their common standard deviation. This version of the effect size
index was popularized by Cohen (1977, 1988) for purposes of statistical power
analysis and is widely used in meta-analysis to represent the magnitude of inter-
vention effects (Lipsey & Wilson, 2000). By convention, effect sizes are computed
so that positive values indicate a better outcome for the treatment group than
for the control group, and negative values indicate a better outcome for the
control group.
For all but very esoteric applications, the most practical way actually to estimate
the numerical values for statistical power is to use precomputed tables or a com-
puter program. Particularly complete and usable reference works of statistical
power tables have been published by Cohen (1977, 1988). Other general reference
works along similar lines include those of Kraemer and Thiemann (1987), Lipsey
(1990), and Murphy and Myors (2004). Among the computer programs available
for conducting statistical power calculations are Power and Precision (from Biostat),
nQuery Advisor (from Statistical Solutions), and SamplePower (from SPSS). In
addition, there are open access power calculators on many statistical Web sites. The
reader should turn to sources such as these for information on determining statis-
tical power beyond the few illustrative cases presented in this chapter.
Figure 2.1 presents a statistical power chart for one of the more common situa-
tions. This chart assumes (a) that the statistical test used is a t test, one-way ANOVA,
or other parametric test in this same family (more on this later) and (b) that the
conventional α = .05 level is used as the criterion for statistical significance. Given
these circumstances, the chart shows the relationships among power (1 − β), effect
size (ES), and sample size (n for each group) plotted on sideways log-log paper,
which makes it easier to read values for the upper power levels and the lower
sample sizes.
[Figure 2.1: Statistical power as a function of sample size for each group, with separate curves for effect sizes ranging from ES = .10 to ES = 2.00, assuming α = .05 and a t test or one-way ANOVA. Power is plotted on the vertical axis and sample size for each group on the horizontal axis.]
This chart shows, for instance, that if we have an experiment with
40 participants in each of the treatment and control groups (80 total), the power to
detect an effect size of .80 (.8 standard deviations difference between the treatment
and control group means) is about .94 (i.e., given a population ES = .80 and group
n = 40, statistical significance would be expected 94% of the time at the α = .05 level
with a t test or one-way ANOVA).
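The same figure can be checked with a power calculator. The following Python sketch uses the statsmodels package (assuming it is installed) to compute the power of a two-sided independent-samples t test; it is offered as a cross-check on reading the chart, not as the authors' own procedure.

from statsmodels.stats.power import TTestIndPower

# Power of a two-sided independent-samples t test, alpha = .05,
# for ES = .80 with 40 participants per group
power = TTestIndPower().power(effect_size=0.80, nobs1=40, ratio=1.0,
                              alpha=0.05, alternative="two-sided")
print(f"Power: {power:.2f}")  # approximately .94, matching the chart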
Sample Size
The relationship between sample size and statistical power is so close that many
textbooks discuss power only in terms of determining the sample size necessary to
attain a desired power level. A look at Figure 2.1 makes clear why sample size war-
rants so much attention. Virtually any desired level of power for detecting any given
effect size can be attained by making the samples large enough.
The difficulty that the relationship between sample size and statistical power
poses for intervention research is that the availability of participants is often lim-
ited. Although a researcher can increase power considerably by parading a larger
number of participants through the study, there must be individuals ready to march
before this becomes a practical strategy. In practical intervention situations, rela-
tively few persons may be appropriate for the intervention or, if there are enough
appropriate persons, there may be limits on the facilities for treating them. If facil-
ities are adequate, there may be few who volunteer or whom program personnel are
willing to assign; or, if assigned, few may sustain their participation until the study
is complete. The challenge for the intervention researcher, therefore, is often one of
keeping power at an adequate level with modest sample sizes. If modest sample
sizes in fact generally provided adequate power, this particular challenge would not
be very demanding. Unfortunately, they do not.
Suppose, for instance, that we decide that ES = .20 is the minimal effect size
that we would want our intervention study to be able to detect reliably. An ES of
.20 is equivalent to a 22% improvement in the success rate for the treatment
group (more on this later). It is also the level representing the first quintile in the
effect size distribution derived from meta-analyses of psychological, behavioral,
and education intervention research (Lipsey & Wilson, 1993). Absent other con-
siderations, therefore, ES = .20 is a reasonable minimal effect size to ask research
to detectit is not so large that it requires heroic assumptions to think it might
actually be produced by an intervention and not so small that it would clearly lack
practical significance.
If we calculate the sample size needed to yield a power level of .95 (α = β = .05),
we find that the treatment and control group must each have a minimum of
about 650 participants for a total of about 1,300 in both groups (see Figure 2.1).
The sample sizes in social intervention research are typically much smaller
than that, often less than 100 in each group. If we want to attain a power level for
ES = .20 that makes Type II error as small as the conventional limit on Type I
error through sample size alone, then we must increase the number of partici-
pants quite substantially over the average in present practice. Even attaining the
more modest .80 power level suggested as a minimum by Cohen (1988) would
require a sample size of about 400 per treatment group, larger than many studies
can obtain.
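These sample size figures can be reproduced with the same kind of power software mentioned earlier. The Python sketch below (again assuming statsmodels is available) solves for the per-group n needed to detect ES = .20 at power .95 and .80 with a two-sided t test at α = .05.

from statsmodels.stats.power import TTestIndPower

solver = TTestIndPower()
for target_power in (0.95, 0.80):
    n_per_group = solver.solve_power(effect_size=0.20, power=target_power,
                                     alpha=0.05, ratio=1.0, alternative="two-sided")
    print(f"Power {target_power:.2f}: about {round(n_per_group)} participants per group")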
Increased sample size is thus an effective way to boost statistical power and
should be employed whenever feasible, but its costs and limited availability of par-
ticipants may restrict the researchers ability to use this approach. It is important,
therefore, that the researcher be aware of other routes to increasing statistical
power. The remainder of this chapter discusses some of these alternate routes.
Alpha Level
Alpha is conventionally set at .05 for statistical significance testing and, on the sur-
face, may seem to be the one straightforward and unproblematic element of statisti-
cal power for the intervention researcher. That impression is misleading. An α of .05
corresponds to a .95 probability of a correct statistical conclusion only when the null
hypothesis is true. However, a relatively conservative alpha makes statistical signifi-
cance harder to attain when the null hypothesis is false and, therefore, decreases the
statistical power. Conversely, relaxing the alpha level required for statistical signifi-
cance increases power. The problem is that this reduction in the probability of a Type
II error comes at the expense of an increased probability of a Type I error. This means
that the researcher cannot simply raise alpha until adequate power is attained but,
rather, must find some appropriate balance between alpha and beta. Both Type I error
(α) and Type II error (β) generally have important implications in the investigation
of intervention effects. Type I error can mean that an ineffective or innocuous inter-
vention is judged beneficial or, possibly, harmful, whereas Type II error can permit a
truly effective intervention (or a truly harmful one) to go undiscovered. Though little
has been written in recent years about how to think about this balancing act, useful
perspectives can be found in Brown (1983), Cascio and Zedeck (1983), Nagel and
Neef (1977), and Schneider and Darcy (1984). In summary form, the advice of these
authors is to consider the following points in setting error risk levels.
Prior Probability. Because the null hypothesis is either true or false, only one type of
inferential error is possible in a given study: Type I for a true null hypothesis and
Type II for a false null hypothesis. The problem, of course, is that we do not know
if the null hypothesis is true or false and, thus, do not know which type of error is
relevant to our situation. However, when there is evidence that makes one alternative
more likely, the associated error should be given more importance. If, for example,
prior research tends to show an intervention effect, the researcher should be especially
concerned about protection against Type II error and should set beta accordingly.
Relative Costs and Benefits. Perhaps the most important aspect of error risk in inter-
vention research has to do with the consequences of an error. Rarely will the costs of
each type of error be the same, nor will the benefits of each type of correct inference.
Sometimes, intervention effects and their absence can be interpreted directly in
terms of dollars saved or spent, lives saved or lost, and the like. In such cases, the
optimal relationship between alpha and beta error risk should be worked out
according to their relative costs and benefits. When the consequences of Type I and
Type II errors cannot be specified in such definite terms, the researcher may still be
able to rely on some judgment about the relative seriousness of the risks. Such judg-
ment might be obtained by asking those familiar with the intervention circum-
stances to rate the error risk and the degree of certainty that they feel is minimal for
the conclusions of the research. This questioning, for instance, may reveal that
knowledgeable persons believe, on average, that a 95% probability of detecting a
meaningful effect is minimal and that Type II error is three times as serious as Type
I error. This indicates that β should be set at .05 and α at .15. Nagel and Neef (1977)
provided a useful decision theory approach to this judgment process that has the
advantage of requiring relatively simple judgments from those whose views are
relevant to the research context.
If some rational analysis of the consequences of error is not feasible, it may be
necessary to resort to a convention (such as α = .05) as a default alternative. For
practical intervention research, the situation is generally one in which both types of
errors are serious. Under these circumstances, the most straightforward approach is
to set alpha risk and beta risk equal unless there is a clear reason to do otherwise. If
we hold to the usual convention that α should be .05, then we should design research
so that β will also be .05. If such high standards are not practical, then both alpha
and beta could be relaxed to some less stringent level, for example, .10 or even .20.
To provide some framework for consideration of the design issues related to the
criterion levels of alpha and beta set by the researcher, Table 2.2 shows the required
sample size per group for the basic two-group experimental design at various effect
sizes under various equal levels of alpha (two-tailed) and beta. It is noteworthy that
maintaining relatively low levels of alpha and beta risk (e.g., .05 or below) requires
either rather large effect sizes or rather large sample sizes. Moreover, relaxing alpha
levels does not generally yield dramatic increases in statistical power for the most
difficult to detect effect sizes. Manipulation of other aspects of the power function,
such as those described later, will usually be more productive for the researcher
seeking to detect potentially modest effects with modest sample sizes.
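For readers who want to reproduce values of the kind summarized in Table 2.2, the same normal approximation can be looped over a grid of effect sizes and equal alpha and beta levels. This is only a sketch of the idea; the grid of effect sizes and error levels shown here is illustrative rather than the exact layout of the published table.

    from scipy.stats import norm

    def n_per_group(es, alpha, beta):
        """Approximate n per group for two-tailed alpha and Type II error rate beta."""
        z = norm.ppf(1 - alpha / 2) + norm.ppf(1 - beta)
        return int(round(2 * (z / es) ** 2))

    for level in (0.20, 0.10, 0.05, 0.01):            # equal alpha and beta levels
        row = [n_per_group(es, level, level) for es in (0.20, 0.40, 0.60, 0.80, 1.00)]
        print(f"alpha = beta = {level:.2f}: n per group = {row}")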
Statistical Test
Consider the prototypical experimental design in which one treatment group is
compared with one control group. The basic statistical tests for analyzing this
design are the familiar t test and one-way analysis of variance (ANOVA). These tests
use an error term based on the within-group variability in the sample data to
assess the likelihood that the mean difference between the groups could result from
sampling error. To the extent that within-group variability can be eliminated, min-
imized, or somehow offset, intervention research will be more powerfulthat is,
more sensitive to true effects if they are present.
Two aspects of the statistical test are paramount in this regard. First, for a given set
of treatment versus control group data, different tests may have different formulations
of the sampling error estimate and the critical test values needed for significance.

Table 2.2 Approximate Sample Size for Each Group Needed to Attain Various
Equal Levels of Alpha and Beta for a Range of Effect Sizes
For instance, nonparametric tests (those that use only rank order or categorical
information from dependent variable scores) generally have less inherent power
than do parametric tests, which use scores representing degrees of the variable
along some continuum.
The second and most important aspect of a statistical test that is relevant to power
is the way it partitions sampling error and which components of that error variance
are used in the significance test. It is often the case in intervention research that some
of the variability on a given dependent measure is associated with participant char-
acteristics that are not likely to change as a result of intervention. If certain factors
extraneous to the intervention effect of interest contribute to the population vari-
ability on the dependent measure, the variability associated with those factors can be
removed from the estimate of sampling error against which differences between treat-
ment and control means are tested with corresponding increases in power.
A simple example might best illustrate the issue. Suppose that men and women,
on average, differ in the amount of weight they can lift. Suppose further that we
want to assess the effects of an exercise regimen that is expected to increase muscu-
lar strength. Forming treatment and control groups by simple random sampling of
the undifferentiated population would mean that part of the within-group vari-
ability that is presumed to reflect the luck of the draw (sampling error) would be
the natural differences between men and women. This source of variability may
well be judged irrelevant to an assessment of the intervention effect: the interven-
tion may rightfully be judged effective if it increases the strength of women relative
to the natural variability in women's strength and that of men relative to the nat-
ural variability in men's strength. The corresponding sampling procedure is not
simple random sampling but stratified random sampling, drawing women and
men separately so that the experimental sample contains identified subgroups of
women and men. The estimate of sampling error in this case comes from the
within-group variance (within experimental condition, within gender) and omits
the between-gender variance, which has now been identified as having a source
other than the luck of the draw.
All statistical significance tests assess effects relative to an estimate of sampling
error but they may make different assumptions about the nature of the sampling
and, hence, the magnitude of the sampling error. The challenge to the intervention
researcher is to identify the measurable extraneous factors that contribute to pop-
ulation variability and then use (or assume) a sampling strategy and corresponding
statistical test that assesses intervention effects against an appropriate estimate of
sampling error. Where there are important extraneous factors that correlate with
the dependent variable (and there almost always are), using a statistical significance
test that partitions them out of the error term can greatly increase statistical power.
With this in mind, we review below some of the more useful of the variance con-
trol statistical designs with regard to their influence on power.
Analysis of Covariance
One of the most useful of the variance control designs for intervention
research is the one-way analysis of covariance (ANCOVA). Functionally, the
ANCOVA is like the simple one-way ANOVA, except that the dependent variable
variance that is correlated with a covariate variable (or linear combination of
covariate variables) is removed from the error term used for significance testing.
For example, a researcher with a reading achievement test as a dependent variable
may wish to remove the component of performance associated with IQ before
comparing the treatment and control groups. IQ differences may well be viewed
as nuisance variance that is correlated with reading scores but is not especially rel-
evant to the impact of the program on those scores. That is, irrespective of a
student's IQ score, we would still expect an effective reading program to boost the
reading score.
It is convenient to think of the influence of variance control statistical designs on
statistical power as a matter of adjusting the effect size in the power relationship.
Recall that ES, as it is used in statistical power determination, is defined as (μt − μc)/σ,
where σ is the pooled within-groups standard deviation. For assessing the power of
variance control designs, we adjust this ES to create a new value that is the one that
is operative for statistical power determination. For the ANCOVA statistical design,
the operative ES for power determination is as follows:
ESac = (μt − μc) / (σ√(1 − rdc²)),

where ESac is the effect size formulation for the one-way ANCOVA; μt and μc are
the means for the treatment and control populations, respectively; σ is the common
standard deviation; and rdc is the correlation between the dependent variable and the
covariate. As this formula shows, the operative effect size for power determination
using ANCOVA is inflated by a factor of 1/√(1 − r²), which multiplies ES by 1.15 when
r = .50 and by 2.29 when r = .90. Thus, when the correlation of the covariate(s) with the
dependent variable is substantial, the effect of ANCOVA on statistical power can be
equivalent to more than doubling the operative effect size. Examination of Figure 2.1
reveals that such an increase in the operative effect size can greatly enhance power at
any given sample size.
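A minimal sketch of this adjustment, assuming the covariate-outcome correlation rdc is known or can be estimated from prior research, shows how the operative effect size and the corresponding sample size requirement change.

    from math import sqrt
    from scipy.stats import norm

    def operative_es(es, r):
        """Operative effect size for ANCOVA, given covariate-outcome correlation r."""
        return es / sqrt(1 - r ** 2)

    def n_per_group(es, alpha=0.05, power=0.80):
        z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
        return int(round(2 * (z / es) ** 2))

    for r in (0.0, 0.5, 0.7, 0.9):
        es_ac = operative_es(0.20, r)
        print(f"r = {r:.1f}  operative ES = {es_ac:.2f}  n per group = {n_per_group(es_ac)}")
    # r = .50 raises ES = .20 to about .23; r = .90 raises it to about .46,
    # sharply reducing the sample size needed for .80 power.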
An especially useful application of ANCOVA in intervention research is when
both pretest and posttest values on the dependent measure are available. In many
cases of experimental research, preexisting individual differences on the character-
istic that intervention is intended to change will not constitute an appropriate stan-
dard for judging intervention effects. Of more relevance will be the size of the
intervention effect relative to the dispersion of scores for respondents that began at
the same initial or baseline level on that characteristic. In such situations, a pretest
measure is an obvious candidate for use as a covariate in ANCOVA. Because
pretest-posttest correlations are generally high, often approaching the test-retest
reliability of the measure, the pretest as a covariate can dramatically increase the
operative effect size in statistical power. Indeed, ANCOVA with the pretest as the
covariate is so powerful and so readily attainable in most instances of intervention
research that it should be taken as the standard to be used routinely unless there are
good reasons to the contrary.
The same logic applies when a blocking factor, rather than a covariate, is used to
partition nuisance variance out of the error term. For the blocked one-way ANOVA,
the operative effect size is

ESab = (μt − μc) / (σ√(1 − PVb)),

where ESab is the effect size formulation for the blocked one-way ANOVA, σ is the
pooled within-groups standard deviation (as in the unadjusted ES), and PVb is
σb²/σ², with σb² the between-blocks variance and σ² the common variance of the
treatment and control populations.
The researcher, therefore, can estimate PVb, the between-blocks variance, as a
proportion of the common (or pooled) variance within experimental groups and
use it to adjust the effect size estimate in such a way as to yield the operative effect
size associated with the statistical power of this design. If, for instance, the blocking
factor accounts for as much as half of the common variance, the operative ES
increases by more than 40%, with a correspondingly large increase in power.
Effect Size
The effect size parameter in statistical power can be thought of as a signal-to-
noise ratio. The signal is the difference between treatment and control population
means on the dependent measure (the ES numerator, μt − μc). The noise is the
within-groups variability on that dependent measure (the ES denominator, σ).
Effect size and, hence, statistical power is large when the signal-to-noise ratio is
high, that is, when the ES numerator is large relative to the ES denominator. In the
preceding section, we saw that variance control statistical designs increase statisti-
cal power by removing some portion of nuisance variance from the ES denomina-
tor and making the operative ES for statistical power purposes proportionately
larger. Here, we will look at some other approaches to increasing the signal-to-noise
ratio represented by the effect size.
Operative ES multiplier, 1/√(1 − proportion), for the proportion of dependent
variable variance associated with a covariate (rdc²) or blocking factor (PVb):

Proportion    Multiplier        Proportion    Multiplier
.05           1.03              .55           1.49
.10           1.05              .60           1.58
.15           1.08              .65           1.69
.20           1.12              .70           1.83
.25           1.15              .75           2.00
.30           1.20              .80           2.24
.35           1.24              .85           2.58
.40           1.29              .90           3.16
.45           1.35              .95           4.47
.50           1.41              .99           10.00

Dependent Measures

The dependent measures in intervention research yield the set of numerical values
on which statistical significance testing is performed. Each such measure chosen
for a study constitutes a sort of listening station for certain effects expected to result
from the intervention. If the listening station is in the wrong place or is unrespon-
sive to effects when they are actually present, nothing will be heard. To optimize the
signal-to-noise ratio represented in the effect size, the ideal measure for interven-
tion effects is one that is maximally responsive to any change that the intervention
brings about (making a large ES numerator) and minimally responsive to anything else
(making a small ES denominator). In particular, three aspects of outcome mea-
surement have direct consequences for the magnitude of the effect size parameter
and, therefore, statistical power: (a) validity for measuring change, (b) reliability,
and (c) discrimination of individual differences among respondents.
Validity for Change. For a measure to respond to the signal, that is, to intervention
effects, it must, of course, be a valid measure of the characteristic that the interven-
tion is expected to change. But validity alone is not sufficient to make a measure
responsive to intervention effects. What is required is validity for change. A measure
can be a valid indicator of a characteristic but still not be a valid indicator of change
on that characteristic. Validity for change means that the measure shows an observ-
able difference when there is, in fact, a change on the characteristic measured that is
of sufficient magnitude to be interesting in the context of application.
There are various ways in which a measure can lack validity for change. For one,
it may be scaled in units that are too gross to detect the change. A measure of mor-
tality (death rate), for instance, is a valid indicator of health status but is insensitive
to variations in how sick people are. Graduated measures, those that range over
some continuum, are generally more sensitive to change than categorical measures,
because the latter record changes only between categories, not within them. The
number of readmissions to a mental hospital, for example, constitutes a continuum
that can differentiate one readmission from many. This continuum is often repre-
sented categorically as "readmitted" versus "not readmitted," however, with a con-
sequent loss of sensitivity to change and statistical power.
Another way in which a measure may lack validity for measuring change is by
having a floor or ceiling that limits downward or upward response. A high school-
level mathematics achievement test might be quite unresponsive to improve-
ments in Albert Einsteins understanding of mathematicshe would most likely
score at the top of the scale with or without such improvements. Also, a measure
may be specifically designed to cancel out certain types of change, as when scores
on IQ tests are scaled by age norms to adjust away age differences in ability to
answer the items correctly.
In short, measures that are valid for change will respond when intervention alters
the characteristic of interest and, therefore, will differentiate a treatment group from
a control group. The stronger this differentiation, the greater the contrast between
the group means will be and, correspondingly, the larger the effect size.
Reliability. Turning now to the noise in the signal detection analogy, we must con-
sider variance in the dependent measure scores that may obscure any signal due
to intervention effects. Random error variance, that is, unreliability in the measure,
is obviously such a noise. Unreliability represents fluctuations in the measure that
are unrelated to the characteristic being measured, including intervention effects on
that characteristic. Measures with lower measurement error will yield less variation
in the distribution of scores for participants within experimental groups. Because
within-groups variance is the basis for the denominator of the ES ratio, less mea-
surement error makes that denominator smaller and the overall ES larger.
Some measurement error is intrinsic: it follows from the properties of the mea-
sure. Self-administered questionnaires, for instance, are influenced by fluctuations
in respondents' attention, motivation, comprehension, and so forth. Some mea-
surement error is procedural: it results from inconsistent or inappropriate appli-
cation of the measure. Raters who must report on an observed characteristic,
for instance, may not be trained to use the same standards for their judgment, or
the conditions of observation may vary for different study participants in ways that
influence their ratings.
Also included in measurement error is systematic but irrelevant variation: the
response of the measure to characteristics other than the one of interest. When
these other characteristics vary differently than the one being measured, they intro-
duce noise into a measure. For example, frequency of arrest, which may be used to
assess the effects of intervention for juvenile delinquency, indexes police behavior
(e.g., patrol and arrest practices) as well as the criminal behavior of the juveniles. If
the irrelevant characteristic to which the measure is also responding can be identi-
fied and separately measured, its influence can be removed by including it as a
covariate in an ANCOVA, as discussed above. For instance, if we knew the police
precinct in which each arrest was made, we could include that information as con-
trol variables (dummy coding each precinct as involved vs. not involved in a given
arrest) that would eliminate variation in police behavior across precincts from the
effect size for a delinquency intervention.
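A sketch of this adjustment with entirely synthetic data (the precinct labels, effect sizes, and arrest counts below are hypothetical) shows the basic mechanics: regressing the outcome on a treatment indicator plus precinct dummy variables removes between-precinct variation from the error term, which is one simple way to implement the covariance adjustment described above.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 200
    treat = rng.integers(0, 2, size=n)             # 1 = received the intervention
    precinct = rng.integers(0, 4, size=n)          # four hypothetical police precincts
    precinct_shift = np.array([0.0, 1.5, -1.0, 2.0])[precinct]
    # Simulated arrests: true treatment effect of -1 plus precinct-level differences.
    arrests = 5.0 - 1.0 * treat + precinct_shift + rng.normal(0, 2, size=n)

    # Design matrix: intercept, treatment indicator, precinct dummies (precinct 0 omitted).
    dummies = np.column_stack([(precinct == k).astype(float) for k in (1, 2, 3)])
    X = np.column_stack([np.ones(n), treat, dummies])
    coef, *_ = np.linalg.lstsq(X, arrests, rcond=None)
    print("precinct-adjusted treatment effect:", round(coef[1], 2))   # close to -1.0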
Dose Response. Experimental design is based on the premise that intervention levels
can be made to vary and that different levels might result in different responses.
Generally speaking, the stronger the intervention, the larger the response should
be. One way to attain a large effect size, therefore, is to design intervention research
with the strongest possible dose of the intervention represented in the treatment
condition. In testing a new math curriculum, for instance, the researcher might
want the teachers to be very well-trained to deliver it and to spend a significant
amount of class time doing so. If the intervention is effective, the larger effect size
resulting from a stronger dose will increase statistical power for detecting the effect.
Optimizing the strength of the intervention operationalized in research requires
some basis for judging what might constitute the optimal configuration for pro-
ducing the expected effects. There may be insufficient research directly on the inter-
vention under study (else why do the research), but there may be other sources of
information that can be used to configure the intervention so that it is sufficiently
strong to potentially show detectable effects. One source, for example, is the expe-
rience and intuition of practitioners in the domain where the intervention, or vari-
ants, is applied.
If the treatment and control conditions are not delivered uniformly, the resulting
variation around the group means goes into the within-groups variance of the effect
size denominator, making the overall ES smaller. Maintaining
a uniform application of treatment and control conditions is the best way to pre-
vent this problem. One useful safeguard is for the researcher to actually measure the
amount of intervention received by each participant in the treatment and control
conditions (presumably little or none in the control). This technique yields infor-
mation about how much variability there actually was and generates a covariate
that may permit statistical adjustment of any unwanted variability.
Control Group Contrast. Not all aspects of the relationship between the independent
variable and the effect size have to do primarily with the intervention. The choice
of a control condition also plays an important role. The contrast between the treat-
ment and control means can be heightened or diminished by the choice of a con-
trol that is more or less different from the treatment condition in its expected
effects on the dependent measure.
Generally, the sharpest contrast can be expected when what the control group
receives involves no aspects of the intervention or any other attention; that is, a
"no treatment" control. For some situations, however, this type of control may be
unrepresentative of participants' experiences in nonexperimental conditions or
may be unethical. This occurs particularly for interventions that address problems
that do not normally go unattended: severe illness, for example. In such situations,
other forms of control groups are often used. The "treatment as usual" control
group, for instance, receives the usual services in comparison to a treatment group
that receives innovative services. Or a placebo control might be used in which the
control group receives attention similar to that received by the treatment group but
without the specific active ingredient that is presumed to be the basis of the inter-
vention's efficacy. Finally, the intervention of interest may simply be compared with
some alternative intervention, for example, traditional psychotherapy compared
with behavior modification as treatment for anxiety.
The types of control conditions described above are listed in approximate order
according to the magnitude of the contrast they would generally be expected to
show when compared with an effective intervention. The researcher's choice of a
control group, therefore, will influence the size of the potential contrast and hence
of the potential effect size that appears in a study. Selection of the control group
likely to show the greatest contrast from among those appropriate to the research
issues can thus have an important bearing on the statistical power of the design.
Determinants of Statistical
Power for Multilevel Designs
Basically, the same four factors that influence power in single-level designs
apply to multilevel designs: sample size, alpha level, the statistical test (especially
whether variance controls are included), and effect size. The alpha level at which the
intervention effect is tested and the effect size are defined virtually the same way in
multilevel designs as in single-level ones and function the same way in power analy-
sis. It should be particularly noted that despite the greater complexity of the struc-
ture of the variance within treatment and control groups in multilevel designs, the
effect size parameter remains the same. It is still defined as the difference between
the mean score on the dependent variable for all the individuals in the treatment
group and the mean for all the individuals in the control group divided by the com-
mon standard deviation of all the scores within the treatment and control groups.
In a multilevel design, the variance represented in that standard deviation could,
in turn, be decomposed into between- and within-cluster components or built up
from them. It is, nonetheless, the same treatment or control population variance
(estimated from sample values) irrespective of whether the participants providing
scores have been sampled individually or clusterwise.
The statistical analysis, on the other hand, will be different: it will involve a multi-
level statistical model that represents participant scores at the lowest level and the
clusters that were randomized at the highest level. One important implication of
this multilevel structure is that variance control techniques, such as use of selected
covariates, can be applied at both the participant and cluster levels of the analysis.
Similarly, sample size applies at both levels and involves the number of clusters
assigned to experimental conditions and the number of participants within clusters
who provide scores on the dependent measures.
One additional factor distinctive to multilevel designs also plays an important
role in statistical power: the intracluster correlation (ICC; Hox, 2002; Raudenbush
& Bryk, 2002; Snijders & Bosker, 1999). The ICC is a measure of the proportion of
the total variance of the dependent variable scores that occurs between clusters. It
can be represented as follows:
ρ = σ²between / (σ²between + σ²within),
where the numerator is the variance between the clusters and the denominator is
the total variance in the model (between-cluster plus within-cluster variance).
If none of the variability in the data is accounted for by between-cluster differ-
ences, then the ICC will be 0 and the effective sample size for the study will simply
be the total number of participants in the study. If, on the other hand, all the vari-
ability is accounted for by between-cluster differences, then the ICC will be 1 and
the effective N for the study will be the number of clusters. In practice, the ICC will
be somewhere between these two extremes, and the effective N of the study will be
somewhere in between the number of participants and the number of clusters.
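A small sketch of this definition, together with one common way of expressing the resulting effective sample size (the design effect for equal cluster sizes, 1 + (m − 1)ICC, which is a standard approximation rather than a formula given in this chapter):

    def icc(var_between, var_within):
        """Intracluster correlation: share of total variance lying between clusters."""
        return var_between / (var_between + var_within)

    def effective_n(n_clusters, cluster_size, rho):
        """Effective N under the design effect 1 + (m - 1) * rho, equal cluster sizes."""
        total_n = n_clusters * cluster_size
        return total_n / (1 + (cluster_size - 1) * rho)

    rho = icc(0.2, 0.8)                        # 20% of the variance lies between clusters
    print(round(effective_n(50, 15, rho)))     # 750 participants behave like roughly 197
    print(round(effective_n(50, 15, 0.0)))     # ICC = 0: effective N is the full 750
    print(round(effective_n(50, 15, 1.0)))     # ICC = 1: effective N is the 50 clusters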
Figure 2.2 contains a graph that depicts the effect of the magnitude of the ICC on
the power to detect an effect size of .40 at α = .05 with 50 clusters total (evenly
divided between treatment and control) and 15 participants per cluster. As the
figure shows, even small increases in the ICC can substantially reduce the power.
Figure 2.2 The Relationship Between ICC and Power to Detect an Effect Size of
.40, With 50 Clusters Total, 15 Participants per Cluster, and α = .05
(graph generated using Optimal Design software)
Clearly, the ICC is crucial for determining statistical power when planning a
study. Unfortunately, the researcher has no control over what the ICC will be for a
particular study. Thus, when estimating the statistical power of a planned study, the
researcher should consider the ICC values that have been reported for similar
research designs. For example, the ICCs for the educational achievement outcomes
of students clustered within classroom or schools typically range from approxi-
mately .15 to .25 (Hedges & Hedberg, 2006).
Unlike the ICC, the number of clusters and the number of participants within
each cluster are usually within the researchers control, at least to the extent that
resources allow. Unfortunately, in multilevel analyses the total number of participants
(which are usually more plentiful) has less of an effect on power than the number of
clusters (which are often available only in limited numbers). This is in contrast to
single-level designs in which the sample size at the participant level plays a large role
in determining power. See Figure 2.3 for a graph depicting the relationship between
sample size at the participant level and power to detect an effect size of .40 at α = .05
for a study with 50 clusters total and an ICC of .20. Once clusters have around 15 par-
ticipants each, adding additional participants yields only modest gains in power.
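The diminishing return from adding participants within clusters can be illustrated with a rough normal-approximation version of the cluster-randomized power calculation: a sketch of what programs such as Optimal Design compute more exactly, assuming equal cluster sizes and ignoring degrees-of-freedom corrections.

    from math import sqrt
    from scipy.stats import norm

    def cluster_power(es, icc, n_per_cluster, j_clusters, alpha=0.05):
        """Approximate power for a balanced two-arm cluster-randomized design."""
        # Variance of the treatment-control mean difference, in effect size units.
        var_diff = 4 * (icc + (1 - icc) / n_per_cluster) / j_clusters
        return norm.cdf(es / sqrt(var_diff) - norm.ppf(1 - alpha / 2))

    for n in (5, 15, 30, 50):
        print(n, round(cluster_power(0.40, 0.20, n, 50), 2))
    # With 50 clusters and ICC = .20, power climbs quickly up to about 15 per
    # cluster and only slowly thereafter.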
Figure 2.3 The Relationship Between Cluster Size and Power to Detect an Effect
Size of .40, With 50 Clusters Total, an ICC of .20, and α = .05 (graph
generated using Optimal Design software)
Figure 2.4 depicts the relationship between the number of clusters and the
power to detect an effect size of .40 at α = .05 for a study with 15 participants per
cluster and an ICC of .20. As that graph shows, a power of .80 to detect this effect
size is only achieved when the total number of clusters is above 50, and it requires
82 clusters for .95 power. In many research contexts, collecting data from so many
clusters may be impractical and other techniques for attaining adequate power
must be employed.
Figure 2.4 The Relationship Between Number of Clusters and Power to Detect
an Effect Size of .40, With 15 Participants per Cluster, an ICC of .20,
and α = .05 (graph generated using Optimal Design software)
One of the most useful of these techniques is including a covariate, particularly a
pretest on the outcome measure or a closely related one. Doing so can reduce the
number of clusters required to achieve adequate power anywhere from one half to
one tenth, and
cluster-level pretest scores (the mean for each cluster) may be just as useful as
participant-level pretest scores (Bloom, Richburg-Hayes, & Black, 2005).
Figure 2.5 illustrates the change in power associated with adding a cluster-level
covariate that accounts for varying proportions of the between-cluster variance on the
outcome variable. Without a covariate, 52 clusters (26 each in the treatment and con-
trol groups) with 15 participants per cluster and an ICC of .20 are required to detect
an effect size of .40 at α = .05 with .80 power. With the addition of a cluster-level
covariate that accounts for 66% of the between-cluster variance (i.e., correlates about
.81), the same power is attained with half as many clusters (26 total). Accounting for
that proportion of between-cluster variance would require a strong covariate (or set of
covariates), but not so strong as to be unrealistic for many research situations.
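The same rough approximation can represent a cluster-level covariate by shrinking the between-cluster variance component by the proportion of variance the covariate explains; the sketch below, under the same simplifying assumptions as before, reproduces the pattern described above.

    from math import sqrt
    from scipy.stats import norm

    def cluster_power_cov(es, icc, n, j, r2_between, alpha=0.05):
        """Approximate power when a cluster-level covariate explains r2_between
        of the between-cluster variance."""
        var_diff = 4 * (icc * (1 - r2_between) + (1 - icc) / n) / j
        return norm.cdf(es / sqrt(var_diff) - norm.ppf(1 - alpha / 2))

    print(round(cluster_power_cov(0.40, 0.20, 15, 52, 0.00), 2))  # about .8 with 52 clusters
    print(round(cluster_power_cov(0.40, 0.20, 15, 26, 0.66), 2))  # about .8 with only 26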
Figure 2.5 Power for Detecting an Effect Size of .40, With 26 Clusters,
15 Participants per Cluster, ICC of .20, and α = .05, as Influenced
by the Addition of a Cluster-Level Covariate of Various Strengths
(graph generated using Optimal Design software)
Estimating power for a multilevel design thus depends on the number of clusters,
the number of participants within each cluster, the ICC associated with those clusters,
and any covariates or blocking factors involved in the design. Given all these
considerations, it is not surprising
that computing power estimates is rather complicated (see Raudenbush, 1997;
Snijders & Bosker, 1993, for examples of computational techniques). Fortunately,
there is software available that facilitates these computations. One of the best doc-
umented and easiest to use is Optimal Design, based on the calculations outlined
in Raudenbush and Liu (2000) (available without cost at the time this chapter was
written at http://sitemaker.umich.edu/group-based/optimal_design_software).
Optimal Design was used to generate the graphs in Figures 2.2, 2.3, 2.4, and 2.5.
Power Analysis in Two-Level Designs (PINT), developed by Snijders and his col-
leagues and using the formulas derived in Snijders and Bosker (1993), is another
package that provides similar power calculations, but is currently more limited in
the research designs that it can accommodate (PINT is available at the time this
chapter was written at http://stat.gamma.rug.nl/snijders).
With an operative effect size and a desired power level now established, the
researcher is ready to turn to the question of the size of the sample in each experi-
mental group. This is simply a matter of looking up the appropriate value using
a statistical power chart or computer program. If the result is a sample size the
researcher can achieve, then all is well.
If the required sample size is larger than can be attained, however, it is back to
the drawing board for the researcher. The options at this point are limited. First, of
course, the researcher may revisit previous decisions and further tune the design:
for example, enhancing the treatment versus control contrast, improving the sensi-
tivity of the dependent measure, or applying a stronger variance control design. If
this is not possible or not sufficient, all that remains is the possibility of relaxing one
or more of the parameters of the study. Alpha or beta levels, or both, might be
relaxed, for instance. Because this increases the risk of a false statistical conclusion,
and because alpha levels particularly are governed by strong conventions, this must
obviously be done with caution. Alternatively, the threshold effect size that the
research can reliably detect may be increased. This amounts to reducing the likeli-
hood that effects already assumed to be potentially meaningful will be detected.
Despite best efforts, the researcher may have to proceed with an underpowered
design. Such a design may be useful for detecting relatively large effects but may have
little chance of detecting smaller, but still meaningful, effects. Under these circum-
stances, the researcher should take responsibility for communicating the limitations
of the research along with its results. To do otherwise encourages misinterpretation
of statistically null results as findings of no effect when there may be a reasonable
probability of an actual effect that the research was simply incapable of detecting.
As is apparent in the above discussion, designing research sensitive to interven-
tion effects depends heavily on an advance specification of the magnitude of statis-
tical effect that represents the threshold for what is important or meaningful in the
intervention context. In the next section, we discuss some of the ways in which
researchers can approach this judgment.
Such effect size estimates can then be used as a basis for judging the likelihood that
the research being planned will produce effects of a specified size. For example, a
study could reliably detect 80% of the likely effects if it is designed to have sufficient
power for the effect size at the 20th percentile of the distribution of effect sizes
found in similar studies.
Other than the problem of finding sufficient research literature to draw on, the
major difficulty with the actuarial approach is the need to extract effect size esti-
mates from studies that typically do not report their results in those terms. This,
however, is exactly the problem faced in meta-analysis when a researcher attempts
to obtain effect size estimates for each of a defined set of studies and do higher-
order analysis on them. Books and articles on meta-analysis techniques contain
detailed information about how to estimate effect sizes from the statistics provided
in study reports (see, e.g., Lipsey & Wilson, 2000).
A researcher can obtain a very general picture of the range and magnitude of
effect size estimates in intervention research by examining any meta-analyses that
have been conducted on similar interventions. Lipsey and Wilson (1993) reported
the distribution of effect sizes from more than 300 meta-analyses of psychological,
behavioral, and educational intervention research. That distribution had a median
effect size of .44, with the 20th percentile at .24 and the 80th percentile at .68. These
values might be compared with the rule of thumb for effect size suggested by Cohen
(1977, 1988), who reported that across a wide range of social science research, ES =
.20 could be judged as a small effect, .50 as medium, and .80 as large.
An effect size can also be expressed as the correlation r between the dichotomous
independent variable (treatment vs. control group membership) and the dependent
variable:

r = ES / √(ES² + 4).
For example, if the correlation between the independent variable and the depen-
dent variable is .24, then the difference between the success proportions of the
groups is .24, evenly divided around the .50 point, that is, .50 ± .12, or 38% success
in the control group, 62% in the treatment group. More generally, the distribution
with the lower mean will have .50 − (r/2) of its cases above the grand median suc-
cess threshold, and the distribution with the greater mean will have .50 + (r/2) of
its cases above that threshold. For convenience, Table 2.4 presents the BESD terms
for a range of ES and r values as well as Cohen's U3 index described above.
The most striking thing about the BESD and the U3 representations of the
effect size is the different impression that they give of the potential practical
significance of a given effect from that of the standard deviation expression.

Figure 2.6 Depiction of the Percentage of the Treatment Distribution Above the
Success Threshold Set at the Mean of the Control Distribution (ES = .50)

Table 2.4 Effect Size Equivalents for ES, r, U3, and BESD

For
example, an effect size of one fifth of a standard deviation (ES = .20) corresponds
to a BESD success rate differential of .10that is, 10 percentage points between
the treatment and control group success rates (55% vs. 45%). A success increase
of 10 percentage points on a control group baseline of 45% represents a 22% improve-
ment in the success rate (10/45). Viewed in these terms, the same intervention
effect that may appear rather trivial in standard deviation units now looks poten-
tially meaningful.
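The conversions used in this example, and tabulated in Table 2.4, are straightforward to compute. The sketch below assumes the equal-group-size case on which the BESD and the r conversion are based.

    from math import sqrt
    from scipy.stats import norm

    def es_to_r(es):
        """Correlation equivalent of a standardized mean difference (equal group sizes)."""
        return es / sqrt(es ** 2 + 4)

    def besd(es):
        """Binomial effect size display: control and treatment 'success' rates."""
        r = es_to_r(es)
        return 0.50 - r / 2, 0.50 + r / 2

    def u3(es):
        """Cohen's U3: proportion of the treatment group above the control group mean."""
        return norm.cdf(es)

    for es in (0.20, 0.50, 0.80):
        control, treatment = besd(es)
        print(f"ES {es:.2f}: r = {es_to_r(es):.2f}, U3 = {u3(es):.2f}, "
              f"BESD = {control:.0%} vs. {treatment:.0%}")
    # ES = .20 corresponds to r = .10 and success rates of 45% versus 55%.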
Conclusion
Attaining adequate statistical power in intervention research is not an easy matter.
The basic dilemma is that high power requires a large effect size, a large sample size,
or both. Despite their potential practical significance, however, the interventions of
interest all too often produce modest statistical effects, and the samples on which
they can be studied are often of limited size. Intervention researchers need to learn
to live responsibly with this problem. The most important elements of a coping
strategy are recognizing the predicament and attempting to overcome it in every
possible way during the design phase of a study. The keys to designing sensitive
intervention research are an understanding of the factors that influence statistical
power and the adroit application of that understanding to the planning and imple-
mentation of each study undertaken. As an aid to recall and application, Table 2.5
lists the factors discussed in this chapter that play a role in the statistical power of
experimental research along with some others of an analogous sort.
Table 2.5 Factors That Influence the Statistical Power of Experimental Research

Independent variable
Strong treatment, high dosage in the treatment condition
Untreated or low-dosage control condition for high contrast with treatment
Treatment integrity; uniform application of treatment to recipients
Control group integrity; uniform control conditions for recipients
Study participants
Large sample size (or number of clusters in the case of multilevel research) in each
experimental condition
Deploying limited participants into few rather than many experimental groups
Little initial heterogeneity on the dependent variable
Measurement or variance control of participant heterogeneity
Differential participant response accounted for statistically (interactions)
Dependent variables
Validity for measuring characteristic expected to change
Validity, sensitivity for change on characteristic measured
Fine-grained units of measurement rather than coarse or categorical
No floor or ceiling effects in the range of expected response
Mastery or criterion-oriented rather than individual differences measures
Inherent reliability in measure, unresponsiveness to irrelevant factors
Consistency in measurement procedures
Aggregation of unreliable measures
Timing of measurement to coincide with peak response to treatment
Statistical analysis
Larger alpha for significance testing
Significance tests for graduated scores, not ordinal or categorical
Statistical variance control; blocking, ANCOVA, interactions
Discussion Questions
1. In your area of research, which type of error (Type I or Type II) typically
carries more serious consequences? Why?
2. In your field, would it ever be sensible to perform a one-tailed significance
test? Why or why not?
3. In your field, what are some typical constructs that would be of interest as
outcomes, and how are those constructs usually measured? What are the pros and
cons of these measures in terms of validity for measuring change, reliability, and
discrimination of individual differences?
4. In your research, what are some extraneous factors that are likely to be
correlated with your dependent variables? Which of these are measurable so that
they might be included as covariates in a statistical analysis?
5. What are some ways that you might measure implementation of an inter-
vention in your field of research? Is it likely that interventions in your field are deliv-
ered uniformly to all participants?
6. Is the use of no treatment control groups (groups that receive no form of
intervention) typically possible in your field? Why or why not?
Exercises
References
Bloom, H. S. (1995). Minimum detectable effects: A simple way to report the statistical
power of experimental designs. Evaluation Review, 19(5), 547–556.
Bloom, H. S. (2005). Randomizing groups to evaluate place-based programs. In H. S. Bloom
(Ed.), Learning more from social experiments: Evolving analytic approaches (pp. 115–172).
New York: Russell Sage Foundation.
Bloom, H. S., Richburg-Hayes, L., & Black, A. R. (2005). Using covariates to improve precision:
Empirical guidance for studies that randomize schools to measure the impacts of educational
interventions (MDRC Working Papers on Research Methodology). New York: MDRC.
Brown, G. W. (1983). Errors, Type I and II. American Journal of Diseases of Children,
137, 586–591.
Carver, R. P. (1974). Two dimensions of tests: Psychometric and edumetric. American
Psychologist, 29, 512–518.
Cascio, W. F., & Zedeck, S. (1983). Open a new window in rational research planning: Adjust
alpha to maximize statistical power. Personnel Psychology, 36, 517–526.
Cohen, J. (1977). Statistical power analysis for the behavioral sciences (Rev. ed.). New York:
Academic Press.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale,
NJ: Lawrence Erlbaum.
Hedges, L. V., & Hedberg, E. C. (2006). Intraclass correlation values for planning group ran-
domized trials in education (Institution for Policy Research Working Paper). Evanston,
IL: Northwestern University.
Hox, J. (2002). Multilevel analysis: Techniques and applications. Hillsdale, NJ: Lawrence Erlbaum.
Kraemer, H. C., & Thiemann, S. (1987). How many subjects? Statistical power analysis in
research. Newbury Park, CA: Sage.
Lipsey, M. W. (1990). Design sensitivity: Statistical power for experimental research. Newbury
Park, CA: Sage.
Lipsey, M. W., & Wilson, D. B. (1993). The efficacy of psychological, educational, and behav-
ioral treatment: Confirmation from meta-analysis. American Psychologist, 48, 1181–1209.
Lipsey, M. W., & Wilson, D. B. (2000). Practical meta-analysis. Thousand Oaks, CA: Sage.
Murphy, K. R., & Myors, B. (2004). Statistical power analysis: A simple and general model for
traditional and modern hypothesis tests (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum.
Murray, D. M., & Blitstein, J. L. (2003). Methods to reduce the impact of intraclass correla-
tion in group-randomized trials. Evaluation Review, 27, 79–103.
Nagel, S. S., & Neef, M. (1977). Determining an optimum level of statistical significance. In
M. Guttentag & S. Saar (Eds.), Evaluation studies review annual (Vol. 2, pp. 146–158).
Beverly Hills, CA: Sage.
Rasbash, J., Steele, F., Browne, W. J., & Prosser, B. (2004). A user's guide to MLwiN (Version
2.0). London: Institute of Education.
Raudenbush, S. W. (1997). Statistical analysis and optimal design for cluster randomized
trials. Psychological Methods, 2, 173–185.
Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and data
analysis methods (2nd ed.). Thousand Oaks, CA: Sage.
Raudenbush, S. W., Bryk, A. S., & Congdon, R. (2004). Hierarchical linear and nonlinear
modeling. Lincolnwood, IL: SSI.
Raudenbush, S. W., & Liu, X. (2000). Statistical power and optimal design for multisite
randomized trials. Psychological Methods, 5(2), 199–213.
Rosenthal, R., & Rubin, D. B. (1982). A simple, general purpose display of magnitude of
experimental effect. Journal of Educational Psychology, 74, 166–169.
Schneider, A. L., & Darcy, R. E. (1984). Policy implications of using significance tests in
evaluation research. Evaluation Review, 8, 573–582.
Snijders, T. A. B., & Bosker, R. J. (1993). Standard errors and sample sizes for two-level
research. Journal of Educational Statistics, 18, 237–259.
Snijders, T. A. B., & Bosker, R. J. (1999). Multilevel analysis: An introduction to basic and
advanced multilevel modelling. London: Sage.
CHAPTER 3
Practical Sampling
Gary T. Henry
the relationship between their responses and the actual vote can be used to predict
the voting totals accurately.
Just as the researchers can exercise judgment in the selection processes, the indi-
viduals selected have a right to choose if they will participate in a study. Individuals,
whether they have been selected by random processes or human judgments, have a
right to exercise their own judgments about participation in the study. While prob-
ability samples eliminate researchers' judgments about which individuals will be
selected to participate in a study, both probability and nonprobability samples have
the potential for systematic error, also referred to as bias, in attributing sample char-
acteristics to the entire study population when individuals decide not to participate
in a study. An important difference between the use of probability samples and
nonprobability samples is in the rigorous tracking and reporting of the potential for
bias from probability samples. For example, it is often required or at least commonly
expected that researchers using probability samples will use standard definitions for
calculating response rates, such as those that have been promulgated by the American
Association for Public Opinion Research (2006). Response rates are the number of
selected sample members who participated in the study divided by the total number
selected, expressed in percentage terms. Reporting the response rates using the standard calculation
methods makes the potential for bias transparent to the reader. It is very difficult, if
not impossible, to specify what response rates are necessary to reduce bias to a neg-
ligible amount. For example, Keeter, Miller, Kohut, Groves, and Presser (2000) show
that it is extremely rare for findings to differ in a statistically significant way between
a survey with an exceptionally high response rate (60.6%) and one with a more
common response rate (36.0%). While similar monitoring and reporting procedures
could be applied to nonprobability samples, presenting information about partici-
pation rates is highly variable and much less standardized.
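In its simplest form, the calculation described above is just a ratio; the sketch below uses hypothetical counts and ignores the finer case dispositions (refusals, noncontacts, ineligibles) that the standard definitions distinguish.

    def response_rate(participated, total_sampled):
        """Percentage of selected sample members who participated in the study."""
        return 100 * participated / total_sampled

    print(round(response_rate(606, 1000), 1))   # 60.6% -- an exceptionally high rate
    print(round(response_rate(360, 1000), 1))   # 36.0% -- a more common rate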
As this discussion begins to show, probability and nonprobability samples differ
in very fundamental and significant ways. Perhaps the most significant difference
is whether the sample data are used to present a valid picture of the study population
or rather to provide evidence about the individuals or cases in the sample itself. Before
beginning to develop a sampling plan, the research team must make a definitive
statement about the purpose for which the study is undertaken. For studies that are
undertaken to describe the study population or test hypotheses that are to be attrib-
uted to the membership of the study population, probability samples are required.
Nonprobability sampling is appropriate when individuals or cases have intrinsic
interest or when contrasting cases can help to develop explanations or theories
about why differences occur. The evaluation literature is filled with exemplary or
successful case studies and studies that seek to contrast successful cases and
unsuccessful ones. Using nonprobability samples for these studies makes good
sense and can add explanatory evidence to the discussion about how to improve
social programs. However, once the decision is made to use nonprobability sam-
pling methods, it is inappropriate to present the findings in ways that suggest that
they apply to the study population. Conversely, probability samples will not always
produce sharp contrasts that allow for the development of explanatory theories.
Therefore, the next section of the chapter provides some guidance about the types
of nonprobability samples that applied research could consider and the methods
Nonprobability Sampling
Nonprobability samples are important tools for applied research. The most common
types of nonprobability samples are summarized below.

Convenience: Select cases based on their availability for the study and ease of
data collection.

Contrasting cases: Select cases that are judged to represent very different
conditions; often well used when a theoretically or practically important variable
can be used as the basis for the contrast.

Typical cases: Select cases that are known beforehand to be useful and not to be
extreme.

Critical cases: Select cases that are key or essential for overall acceptance or
assessment.

Snowball: Group members identify additional members to be included in the sample.

Quota: Interviewers select a sample that yields the same proportions as in the
population on easily identified variables.
Although the researchers may wish to generalize about the effects of viewing
violent movies on the U.S. population, the use of a convenience sample severely
constrains the study's external validity. The differences in these two groups cannot
be used to formally estimate the impact of violent movies on the U.S. population.
Other conditions, such as age, may alter responses to seeing violent movies. The
students in this sample are likely to be in their teens and early 20s if they were
attending a traditional college or university, and their reactions to the violent movie
may be different from the reactions of older adults. Applying the effects found
in this study to the entire U.S. population could be misleading. The randomized
assignment that was used increases the internal validity of a study, but it should not
be confused with random sampling. Random sampling is a probability sampling
technique that increases external validity. Although applied studies can be designed
to provide high levels of both internal validity and generalizability, most prioritize
one over the other because of practical concerns, such as cost or the study's purpose,
or because the research is designed to fill a particular gap in current knowledge
about the topic.
Convenience sampling and contrasting cases sampling are but two of the many
types of nonprobability sampling that are frequently used in applied social research.
Quota sampling, which was mentioned earlier, was frequently used by polling firms
and other survey research organizations but has been largely discarded. Quota sam-
ples exactly match the study population on easily observed characteristics, but
because the interviewers select the respondents, bias can produce significant differ-
ences between the sample and the study population. Snowball samples are very com-
monly used for studies where the study population members are not readily
identified or located. Examples of these types of populations are individuals involved
with gangs, drugs, or other activities that are not condoned by society or populations
that may be stigmatized or potentially suffer other consequences if their membership
in the group is known, such as individuals living with HIV/AIDS or undocumented
workers. Snowball sampling involves recruiting a few members of the study popula-
tion to participate in the study and asking them to identify or help recruit other
members of the study population for the study. Snowball samples may be signifi-
cantly biased if the individuals recruited for the study have limited knowledge of
other members of the group. However, snowball samples may be used to obtain
evidence about some members of the study population, when time and resources are
limited or when developing a list of the members is considered unethical.
Probability Samples
As I stated earlier, probability samples have the distinguishing characteristic that
each unit in the population has a known, nonzero probability of being selected for
the sample. To have this characteristic, a sample must be selected through a random
mechanism. Random selection mechanisms are independent means of selection
that are free from human judgment and the other biases that can inadvertently
undermine the independence of each selection.
Random selection mechanisms include a lottery-type procedure in which balls labeled
with members of the population are drawn from a well-mixed
bowl, a computer program that generates a random list of units from an
automated listing of the population, and a random digit-dialing procedure that
provides random lists of four digits matched with working telephone prefixes in
the geographic area being sampled (see, e.g., Lavrakas, Chapter 16, this volume).
Random selection requires ensuring that the selection of any unit is not affected by
the selection of any other unit. The procedure must be carefully designed and car-
ried out to eliminate any potential human or inadvertent biases. Random selection
does not mean arbitrary or haphazard selection (McKean, 1987).
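A sketch of two such random selection mechanisms, using a seeded pseudorandom generator so that each selection is independent of human judgment (the population listing and telephone prefixes below are invented for illustration):

    import random

    rng = random.Random(2008)     # seeded so that the draw is reproducible

    # Simple random sample from an automated listing of the study population.
    population = [f"member_{i:04d}" for i in range(1, 5001)]
    sample = rng.sample(population, k=25)
    print(sample[:5])

    # Random digit dialing: attach four random digits to working telephone prefixes.
    prefixes = ["615-322", "615-343", "615-875"]      # hypothetical working prefixes
    numbers = [f"{rng.choice(prefixes)}-{rng.randint(0, 9999):04d}" for _ in range(10)]
    print(numbers)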
The random selection process underlies the validity, precision, power, and cred-
ibility of sample data and statistics. The validity of the data affects the accuracy of
generalizing sample results to the study population and drawing correct conclu-
sions about the population from the analytical procedures used to establish differ-
ences between two groups or covariation. Sampling theory provides the basis for
calculating the precision of statistics for probability samples. Because sampling
variability has an established relationship to several factors (including sample size
and variance), the precision for a specific sample can be planned in advance of con-
ducting a study. Power is closely related to precision. Precision applies to the size of
the confidence interval around a parameter estimate such as the mean or a per-
centage. The confidence interval is the interval around the sample mean estimate in
which the true mean is likely to fall given the degree of confidence specified by the
analyst. For example, when a newspaper reports that a poll has a margin of error of
3%, it is a way of expressing the precision of the sample. It means that the analyst
is confident that 95 out of 100 times, the true percentage will fall within 3 percent-
age points of the percentage estimated for the sample. Power refers to the probabil-
ity of detecting a difference of a specified size between two groups or a relationship
of a specified size between two variables given a probability sample of a specific size.
The principal means of increasing precision and power is increasing sample size,
although sample design can have a considerable effect as will be discussed later in
this chapter. Credibility, in large measure, rests on absence of perceived bias in the
sample selection process that would result in the sample being systematically dif-
ferent from the study population. Probability sampling can increase credibility by
eliminating the potential bias that can arise from using human judgment in the
selection process. Credibility is a subjective criterion, while validity, precision, and power are objective criteria with widely agreed-on technical definitions.
A distinct advantage of probability samples is that sampling theory provides the
researcher with the means to decompose and in many cases calculate the probable
error associated with any particular sample. One form of error is known as bias.
Bias, in sampling, refers to systematic differences between the sample and the pop-
ulation that the sample represents. Bias can occur because the listing of the popu-
lation from which the sample has been drawn (the sampling frame) is flawed or because the sampling methods cause some members of the population to be overrepresented in the
sample. Bias is a direct threat to the external validity of the results.
The other form of error in probability samples, sampling variability, is the
amount of variability surrounding any sample statistic that results from the fact
that a random subset of cases is used to estimate population parameters. Because a
probability sample is chosen at random from the population, different samples will yield somewhat different estimates of the study population value. Together, nonsampling bias, sampling bias, and sampling variability make up the total error of a sample.
Each component of error generates specific concerns for researchers and all
three sources of error should be explicitly considered in the sampling plan and
adaptation of the plan during the research process. Each of the three components
of total error and some examples of the sources of each are illustrated in Figure 3.1.
Because sample design takes place under resource constraints, decisions that allo-
cate resources to reduce error from one component necessarily affect the resources
available for reducing error from the other two components. Limited resources
force the researcher to make trade-offs in reducing total error. The researcher must
be fully aware of the three components of error to make the best decisions based on
the trade-offs to be considered in reducing total error. I describe below each of the
three sources of error and then return to the concept of total error for an example.
Nonsampling Bias
Nonsampling bias is the difference between the true target population value and
the population value that would be obtained if the data collection procedures were
administered with the entire population. Nonsampling bias results from decisions
as well as implementation of the decisions during data collection efforts that are
not directly related to the selection of the sample. For example, the definition of the
study population may exclude some members of the target population that the
researcher would like to include in the study findings. Even if data were collected
on the entire study population, in this case, the findings would be biased because of
the exclusion of some target population members.

[Figure 3.1 Components of total error. Nonsampling bias stems from the operational definition of the target population and the measurement instruments (study population, population listing, nonresponse, measurement error). Sampling bias appears in the sample distribution of an estimator (e.g., x̄ or b) around its expected value (selection bias, estimation bias). Sampling variability reflects the sample, the subset of subjects or units for which data are obtained (sample size, sample homogeneity).]

For example, using the Atlanta
telephone directory as the sampling frame for the current residents of the Atlanta
metropolitan area would produce biased estimates of household characteristics due
to unlisted numbers, households with phone service established after the phone
book went to press, and residents without phones, including the homeless and
those who rely exclusively on cellular phones.
Differences between the true mean of the target population and the study population mean arise from several sources. A principal difference relevant to sample design is the difference between the target population and the study population. The target population is the group about which the researcher would like to make statements. The target population can be defined based on conditions and concerns that arise from the theory being tested or on factors specific to the policy or program being evaluated,
such as eligibility criteria. For instance, in a comprehensive needs assessment for
homeless individuals, the target population should include all homeless individu-
als, whether served by current programs or not. On the other hand, an evaluation
of the effectiveness of community mental health services provided to the homeless
should include only homeless recipients of community mental health care, which
may exclude large numbers of the homeless. The target population for the needs
assessment is more broadly defined and inclusive of all homeless.
Sampling Bias
Sampling bias is the difference between the study population value and the
expected value for the sample. The expected value of the mean is the average of the
means obtained by repeating the sampling procedures on the study population.
The expected value of the mean is equal to the study population value if the sam-
pling and calculation procedures are unbiased. Sampling bias can be subdivided
into two components: selection bias and estimation bias. Selection bias occurs when not all members of the study population have an equal probability of selection. When the probabilities of selection are unequal but known, researchers can adjust the estimates of the population parameters by using weights to compensate for the unequal probabilities of selection.
An illustrative example of selection bias is a case in which a sample is selected
from a study population list that contains duplicate entries for some members of
the population. In the citizen survey example presented in Henry (1990), two lists
are combined to form the study population list: state income tax returns and
Medicaid-eligible clients. An individual appearing on both lists would have twice
the likelihood of being selected for the sample. It may take an inordinate amount
of resources to purge such a combined list of all duplicate listings, but it could be
feasible to identify sample members that appeared on both lists and adjust for the
unequal probability of selection that arises.
To adjust for this unequal probability of selection, a weight (w) equal to the inverse of the ratio (r) of the unit's probability of selection to the probability of selection of units listed only once should be applied in the estimation process:

w = 1/r = 1/2 = .5.
The probability of selection for this individual was twice the probability of selec-
tion for the members of the study population appearing on the list only once.
Therefore, this type of individual would receive only one half the weight of the
other population members to compensate for the increased likelihood of appearing
in the sample. The logic here is that those with double listings have been overrep-
resented by a factor of two in the sample and, therefore, must be given less weight
in the estimation procedures to compensate.
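A minimal sketch of how such weights might enter the estimation, assuming hypothetical income values and a flag recording whether each respondent appeared on one list or on both. The variable names and figures are invented for illustration only.

```python
# Hypothetical respondents: "listings" records how many of the combined
# lists (tax returns, Medicaid clients) the person appeared on.
respondents = [
    {"income": 41000, "listings": 1},
    {"income": 38500, "listings": 2},   # double listed: twice the selection probability
    {"income": 52750, "listings": 1},
    {"income": 47300, "listings": 2},
]

# Weight = 1/r, where r is the selection probability relative to a
# single-listed member: 1.0 for single listings, .5 for double listings.
for person in respondents:
    person["weight"] = 1.0 / person["listings"]

weighted_mean = sum(p["weight"] * p["income"] for p in respondents) / sum(
    p["weight"] for p in respondents
)
print(round(weighted_mean, 2))
```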
Estimation bias occurs when the average calculated using an estimation tech-
nique on all possible simple random samples from a population does not equal the
study population value. For example, the median is a biased estimator of the central tendency of the population, because the expected value of the sample medians is not equal to the true study population mean. Generally, biased estimators, such as the median, are used to overcome other issues with the data, and, therefore, the estimation bias is outweighed by other factors. For example, the median income of a population is often estimated rather than the mean income because relatively few very high income individuals can cause the mean to be high relative to the median and to the income that most members of the population actually receive.
Sampling Variability
The final component of total error in a sample is directly attributable to the fact
that statistics from randomly selected samples will vary from one sample to the next
due to chance. In any particular sample, some members of the study population
will be included and others will be excluded, which produces this variation. Because
it is rare for sample estimates to be exactly equal to the study population value, it is
useful to have an estimate of their likely proximity to the population value, or in the
terms that I have used before, the precision of the sample estimate.
Sampling theory can be used to provide a formula to estimate the precision of
any probability sample based on information available from the sample. Two factors have the greatest influence on the standard error: the amount of variation around the mean of the variable (the standard deviation, or square root of the
variance) and the size of the sample. Smaller standard deviations reduce the sam-
pling error of the mean. The larger the sample, the smaller the standard deviation
of the sampling distribution.
Because the standard deviation for the population can be estimated from the
sample information and the sample size is known, a formula can be used to estimate
the standard deviation of the sampling distribution, referred to hereafter as the stan-
dard error of the estimate, in this particular case, the standard error of the mean:
s_x̄ = s/n^(1/2),

where s_x̄ is the estimate of the standard error of the mean, s is the estimate of the
standard deviation, and n is the sample size. Using this formula allows the
researcher to estimate the standard error of the mean, the statistic that measures the
final component of total error, based solely on information from the sample.
The standard error is used to compute a confidence interval around the mean
(or other estimate of a population parameter), or the range which is likely to
include the true mean for the study population. The likelihood that the confidence
interval contains the true mean is based on the product of the t statistic chosen for
the following formula:
I = x̄ ± t(s_x̄).
The confidence interval is the most popular direct measure of the precision
of the estimates, and it is common practice to use the value that represents 95%
confidence, 1.96, for t. In most cases, the researcher should report the confidence
interval along with the point estimate for the mean to give the audience an under-
standing of the precision of the estimates.
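As a concrete illustration of these two formulas, the short Python sketch below computes the standard error of the mean and a 95% confidence interval for a small, invented sample. The data values and the use of t = 1.96 (the large-sample convention noted above) are assumptions for illustration only.

```python
import math
import statistics

# Hypothetical sample values for some outcome measured on 12 respondents.
sample = [12.1, 9.8, 14.3, 11.0, 10.5, 13.2, 9.4, 12.7, 11.9, 10.1, 13.8, 12.4]

n = len(sample)
mean = statistics.mean(sample)
s = statistics.stdev(sample)        # estimate of the standard deviation
se_mean = s / math.sqrt(n)          # standard error of the mean: s / n^(1/2)

t = 1.96                            # conventional value for 95% confidence
ci_low, ci_high = mean - t * se_mean, mean + t * se_mean

print(f"mean = {mean:.2f}, SE = {se_mean:.2f}, "
      f"95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```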
Two more technical points are important for discussion here. First, probability
sampling design discussions thus far in this chapter have assumed that the sample
would be selected without replacement; that is, once a unit has been randomly
drawn from the population to appear in the sample, it is set aside and not eligible to
be selected again. Sampling without replacement limits the cases available for selec-
tion as more are drawn from the population. If a sample is drawn from a finite pop-
ulation, sampling without replacement may cause a finite population correction
(FPC) factor to be needed in the computation of the standard error of the estimate.
For the standard error of the mean, the formula using the FPC is
s_x̄ = (1 − n/N)^(1/2) (s/n^(1/2)).
As a rule of thumb, the sample must contain more than 5% of the population to
require the FPC. This is based on the fact that the FPC factor is so close to 1 when the sampling fraction (n/N) is less than .05 that it does not appreciably affect the standard error calculation.
Second, standard error calculations are specific to the particular population
parameter being estimated. For example, the standard error for proportions is also
commonly used:
s_p = [(pq)/n]^(1/2),

where s_p is the standard error of the proportion, p is the estimate of the proportion, and q = 1 − p. Most statistics textbooks present formulas for the standard error of
several estimators, including regression coefficients. Also, they are calculated for the
statistic being used by almost any statistical software package. These formulas, like
the formulas presented above, assume that a simple random sample design has been
used to select the sample. Formulas must be adjusted for more complex sampling
techniques (Henry, 1990; Kish, 1965).
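The sketch below applies the finite population correction and the standard error of a proportion with assumed planning figures (s = 4.2, n = 300, and N = 2,000 for the mean; p = .46 and n = 1,000 for the proportion). The numbers are illustrative, not drawn from the chapter.

```python
import math

# Finite population correction: here n/N = .15, above the 5% rule of thumb.
s, n, N = 4.2, 300, 2000
se_srs = s / math.sqrt(n)                  # SE ignoring the FPC
se_fpc = math.sqrt(1 - n / N) * se_srs     # SE with the FPC: (1 - n/N)^(1/2) * s/n^(1/2)

# Standard error of a proportion and the familiar poll "margin of error."
p, n_poll = 0.46, 1000
q = 1 - p
se_p = math.sqrt(p * q / n_poll)           # [(pq)/n]^(1/2)
margin_of_error = 1.96 * se_p

print(f"SE of mean: {se_srs:.3f} without FPC, {se_fpc:.3f} with FPC")
print(f"SE of proportion: {se_p:.4f}; margin of error: +/-{margin_of_error:.3f}")
```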
One further note on terminology: The terms sampling error and standard error
are used interchangeably in the literature. They are specific statistics that measure
the more general concept of sampling variability. Standard error, however, is the
preferred term. The common use of sampling error is unfortunate for two reasons.
First, it implies an error in procedure rather than an unavoidable consequence of
sampling. Second, the audience for a study could easily assume that sampling error
is synonymous with the total error concept, which could lead the audience to ignore other sources of error. For example, when newspapers report the margin of error for polling results that they publish (usually 1.96 × s_p), they typically ignore other sources of error, such as nonresponse, which could be indicated by calculating and publishing the response rate using the appropriate formulas published by the American Association for Public Opinion Research (2006).
Total Error
Total error combines the three sources of error described above. Sample design is
a conscious process of making trade-offs to minimize these three components of total
error. Too frequently, reducing the standard error becomes the exclusive focus of sam-
ple design because it can be readily estimated. Because the two bias components can-
not be calculated as readily, they are often given short shrift during the design process.
When this occurs, sampling planning is reduced to the calculation of sample size and
selection of the type of probability sample to be used. However, failing to consider
and to attempt to reduce all three components of total error sufficiently can reduce
the validity and credibility of the study findings. In the next section of this chapter,
the practical sampling design framework will be described. By answering the ques-
tions presented in the framework, applied researchers can assess the options available
to reduce total error while developing a sample plan and adapting the plan to the
unexpected events that occur when the plan is being implemented.
Presampling choices
What is the nature of the study: exploratory, developmental, descriptive,
or explanatory?
What are the variables of greatest interest?
What is the target population for the study?
Are subpopulations important for the study?
How will the data be collected?
Is sampling appropriate?
Sampling choices
What listing of the target population can be used for the sampling frame?
What is the precision or power needed for the study?
What sampling design will be used?
Will the probability of selection be equal or unequal?
How many units will be selected for the sample?
Postsampling choices
How can the impact of nonresponse be evaluated?
Is it necessary to weight the sample data?
What are the standard errors and related confidence intervals for the study
estimates?
The answers to these questions will result in a plan to guide the sampling
process, assist the researchers in analyzing the data correctly, and provide ways to
assess the amount of error that is likely to be present in the sample data. In the next
three sections, we will focus on making choices that impact sample planning and
implementation as well as understanding some of the implications of those choices.
More detail on the implications of the various choices, as well as four detailed
examples that illustrate how choices were actually made in four sample designs, is
provided in Henry (1990). In addition, other chapters in this Handbook provide
discussion of the other issues.
Presampling Choices
What Is the Nature of the Study: Exploratory,
Developmental, Descriptive, or Explanatory?
Establishing the primary purpose of the study is one of the most important
steps in the entire research process (see Bickman & Rog, Chapter 1, this volume).
Once the relevant factors have been identified, an estimate of the frequency with which they occur in the study population could be calculated from the available sample data.
Descriptive research is the core of many survey research projects in which esti-
mates of population characteristics, attributes, or attitudes are study objectives (see
Fowler & Cosenza, Chapter 12; Lavrakas, Chapter 16, this volume). In fact, proba-
bility sampling designs were originally developed for this type of research. Therefore,
most sampling texts, especially older ones, emphasize the use of sample data to
develop estimates of the characteristics of the study population, such as averages
and percentages. But it has become common for probability studies to be used for
explanatory research purposes as well. Explanatory research examines expected dif-
ferences between groups and/or relationships between variables, and the focus of
these studies is explaining variation in one or more variables or estimating the dif-
ference between two groups. Typically, the emphasis for descriptive studies will be
the precision of the estimates, while analytical studies will need to pay attention to
the power to detect effects if the effects actually occur. In practice, many studies
attempt both descriptive and explanatory tasks, which means that the researchers may need to assess both precision and power as decisions about sample design and sample size are being considered.
In addition, it is common that practical considerations lead researchers to con-
duct their explanatory studies in more limited geographic areas than the entire area
in which certain services are provided or programs operate. For example, Gormley
and Gayer (2005) focused their evaluation of the impact of the prekindergarten
program in Oklahoma on the children who participated in the program in Tulsa
Public Schools. Even if a complete census survey of prekindergarteners attending
Tulsa Public Schools had been possible, the effects that were estimated would only
formally generalize the children who attended the Tulsa Public Schools program,
not the other children attending the state-sponsored prekindergarten in Tulsa or the
children served in the prekindergarten programs operated by the other 493 school
districts in the state of Oklahoma. In cases such as these, substantive expertise and knowledge of the populations being served in the locality chosen for the study are required to assess how reasonable it is to suggest that the effects would be similar for other children in the target population who were not eligible for participation in the study. This is an example of researchers placing greater emphasis on
their ability to accurately estimate the size of the effect attributable to a program for
a subset of the participants of the entire program than on the external validity or
generalizability of the effect to the entire population served by the program. Often,
such choices are fruitful and well justified, as was the case with Gormley and Gayer's study, so that gaps in existing knowledge can be reduced and the state of knowledge in a field can move forward. It is the slow and steady increments to knowledge
rather than the ideal that will often shape the decision for the type of study to be
conducted at a particular time and in specific circumstances.
Both descriptive and explanatory studies are concerned with reducing total
error. Although they have similar objectives for reducing both types of bias, the
sampling variability component of total error is quite different. For descriptive
studies, the focus is on the precision needed for estimates. For explanatory studies,
the most significant concern is whether the sample will be powerful enough to
allow the researcher to detect an effect, given the expected effect size. This is done
through a power analysis (see Lipsey & Hurley, Chapter 2, this volume).
Explanatory and descriptive studies will be the primary focus in the responses to
the remaining questions.
When subgroups are important focal points for separate analyses, later
sampling design choices, such as sample size and sampling technique, must con-
sider this. A sample designed without taking the subpopulation into account can
yield too few of the subpopulation members in the sample for reliable analysis.
Increasing the overall sample size or disproportionately increasing the sample size
for the subpopulation of interest, if the members of the subpopulation can be identified before sampling, are potential remedies, as will be discussed later.
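A brief sketch of the disproportionate option, assuming a hypothetical frame in which a "rural" subpopulation makes up only 10% of the units and is deliberately oversampled. The stratum labels, sizes, and allocation are invented for illustration, and the weights compensate for the unequal probabilities of selection, as described earlier.

```python
import random

rng = random.Random(7)

# Hypothetical frame with a stratum flag; only 100 of 1,000 units are rural.
frame = (
    [{"id": i, "stratum": "urban"} for i in range(900)]
    + [{"id": i, "stratum": "rural"} for i in range(900, 1000)]
)

# Disproportionate allocation: a proportionate sample of 200 would yield
# only about 20 rural cases; oversampling the rural stratum yields 50.
allocation = {"urban": 150, "rural": 50}

sample = []
for stratum, n_h in allocation.items():
    units = [u for u in frame if u["stratum"] == stratum]
    weight = len(units) / n_h          # inverse of the selection probability n_h / N_h
    for u in rng.sample(units, n_h):
        sample.append({**u, "weight": weight})

print(sum(1 for u in sample if u["stratum"] == "rural"), "rural cases in the sample")
```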
Is Sampling Appropriate?
The decision to sample rather than conduct a census survey should be made
deliberatively. In most cases, resources available for the study mandate sampling.
Once again, it is important to note that when resources are limited, sampling can
produce more accurate results than a population or census-type study. Often,
resources for studies of entire populations are consumed by attempts to contact
all population members. Response to the first contact is often far less than 50%,
raising the issue of substantial nonsampling bias. Sampling would require fewer
resources for the initial survey administration and could allow the investment of
more resources in follow-up activities designed to increase responses, paying divi-
dends in lowering nonsampling bias. In addition, when access to the target popula-
tion is through organizations that serve the population, gaining access can require substantial resources. For instance, many organizations, such as school districts, have research review committees that require proposals to be submitted, reviewed, and approved, a process that can require substantial revisions before access can be gained. Obviously, these steps increase the time and resources required for data col-
lection. Even when automated databases that contain all members of the popula-
tion are being used, sampling can improve the accuracy of results. Missing data are
a frequent problem with automated databases. Missing data are another form of
nonresponse bias, because the missing data cannot be assumed to be missing at
random. The cost of collecting the data missing from the database or supplement-
ing information for variables that have not been collected will be less for the sam-
ple than for the entire population, in nearly every case.
On the other hand, small populations and the use of the information in a political environment may weigh against sampling. For studies that may affect funding
allocations or when there is expert knowledge of specific cases that may appear to
be unusual or atypical, the use of a sample can affect the credibility of a study.
Credibility is vital when study results are used to inform policy or program deci-
sions. Because program decisions often determine winners and/or losers, credibil-
ity rather than validity may be the criterion on which the use of the findings turns.
Sampling Choices
What Listing of the Target Population
Can Be Used for the Sampling Frame?
The sampling frame, or the list from which the sample is selected, provides the
definition of the study population. Differences between the target population and
the study population as listed in the sampling frame constitute a significant com-
ponent of nonsampling bias. The sampling frame is the operational definition of
the population, the group about which the researchers can reasonably speak.
For general population surveys, it is nearly impossible to obtain an accurate list-
ing of the target population. A telephone directory would seem to be a likely explicit
sampling frame for a study of the population in a community. However, it suffers
from all four flaws that are commonplace in sampling frames:
Omissions: target population units missing from the frame (e.g., new listings
and unlisted numbers)
Duplications: units listed more than once in the frame (e.g., households listed
under multiple names)
Ineligibles: units not in the target population (e.g., households recently
moved out of the area)
Cluster lists: groupings of units listed in the frame (e.g., households, not indi-
viduals, listed)
The most difficult flaw to overcome is the omission of part of the target popula-
tion from the sampling frame. This can lead to a bias that cannot be estimated for the
sample data. An alternative would be to use additional listings that include omitted members of the target population.

What Is the Precision or Power Needed for the Study?
Precision requirements are used in the calculations of efficient sample sizes. The
objective of the researcher is to produce a specified interval within which the true
value for the study population is likely to fall. Sample size is a principal means by
which the researcher can achieve this objective. But the efficiency of the sampling
design can have considerable impact on the amount of sampling error and the esti-
mate of desired sample size.
For explanatory studies, the sample variability that can be tolerated is based on
the desire to be able to detect effects or relationships if they occur. A power analysis
is conducted to assess the needs for a particular study (see Lipsey & Hurley, Chapter 2,
this volume, for more detail). The power analysis requires that the researchers have
an estimate of the size of the effect that they expect the program or intervention to
produce and the probability with which they would like to be able to detect the effects. Effect sizes are stated in standard deviation units; for example, an effect size of .25 means that the effect is expected to be one quarter of a standard
deviation unit. In practice, it has become common to specify an 80% chance of
detecting the effect. Power analysis software is available from several sources to
determine what sample size would be required to detect an effect of a specified size.
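As a rough illustration of the kind of calculation such software performs, the sketch below uses the common normal-approximation formula for the number of cases per group needed to detect a difference between two group means. The effect size of .25, alpha of .05, and power of .80 mirror the conventions just described, but this is a simplified approximation rather than a full power analysis.

```python
from statistics import NormalDist

d = 0.25       # expected effect size in standard deviation units
alpha = 0.05   # two-sided significance level
power = 0.80   # desired probability of detecting the effect

z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96
z_power = NormalDist().inv_cdf(power)           # about 0.84

# Normal-approximation sample size per group for a two-group comparison.
n_per_group = 2 * ((z_alpha + z_power) / d) ** 2
print(f"approximately {n_per_group:.0f} cases per group")   # roughly 251
```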
Table 3.3 Summary of Probability Sampling Designs

Simple random sample
  Definition: An equal probability of selection sample in which n units are drawn from the population list.
  Requirements: A list of the study population; a count of the study population (N); the sample size (n); random selection of individuals or units.
  Benefits: Easy to administer; no weighting required.

Systematic sample
  Definition: An equal probability of selection sample in which a random start that is less than or equal to the sampling interval is chosen, and every unit that falls at the start and at the interval from the start is selected.
  Requirements: A list or physical representation of the study population; an approximate count of the study population (N); the sample size (n); the sampling interval (I = N/n, rounded down to an integer).
  Benefits: Easy to administer in the field or with physical objects, such as files or invoices, when a list is unavailable.

Stratified sample
  Definition: An equal or unequal probability of selection sample in which the study population is divided into strata (or groups) and a simple random sample of each stratum is selected.
  Requirements: A list of the study population divided into strata; a count of the study population for each stratum; a sample size for each stratum.
  Benefits: Reduces the standard error; disproportionate stratification can be used to increase the sample size of subpopulations.

Cluster sample
  Definition: Clusters that contain members of the study population are selected by a simple random sample, and all members of the selected clusters are included in the study.
  Requirements: A list of clusters in which every member of the study population is contained in one and only one cluster; a count of clusters (C); the approximate size of clusters (Nc); the number of clusters to be sampled (c).
  Benefits: A list of the study population is unnecessary.

Multistage sample
  Definition: First, clusters of study population members are sampled; then study population members are selected from each of the sampled clusters, both by random sampling.
  Requirements: A list of primary sampling units; a count of primary sampling units; the number of primary sampling units to be selected; the number of members to be selected from the primary sampling units.
  Benefits: Same benefits as the cluster sample, plus clusters can be stratified for efficiency, which may reduce the standard error.
In cluster and multistage designs, clusters, such as schools or clinics, can be placed into strata and then sampled, either
proportionately or disproportionately. If separate estimates or explanatory analyses
are needed for certain subpopulations or some strata are known to have much
higher variability for important variables, a disproportionate sampling strategy
should be considered, which would result in unequal probability of selection.
The efficient sample size for a simple random sample can be calculated in two steps:

n′ = s²/(te/t)²,

n = n′/(1 + f),

where n′ is the sample size computed in the first step, s is the estimate of the standard deviation, te is the tolerable error, t is the t value for the desired probability level, n is the sample size adjusted using the FPC, and f is the sampling fraction.
The most difficult piece of information to obtain for these formulas, because it must be supplied prior to conducting the actual data collection, is the estimate of the
standard deviation. A number of options are available, including prior studies,
small pilot studies, and estimates using the range.
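A minimal sketch of the two-step calculation above, with assumed planning values (an estimated standard deviation of 15, a tolerable error of 2, t = 1.96, and a study population of 3,000) and with the sampling fraction f interpreted as n′/N. All figures are illustrative.

```python
# Planning values (assumed for illustration).
s = 15.0      # estimated standard deviation
te = 2.0      # tolerable error
t = 1.96      # t value for 95% confidence
N = 3000      # study population size

n_prime = s ** 2 / (te / t) ** 2   # first-step sample size, ignoring the FPC
f = n_prime / N                    # sampling fraction implied by the first step
n = n_prime / (1 + f)              # sample size adjusted with the FPC

print(round(n_prime), round(n))    # about 216 and 202
```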
Although the sample size is the principal means for influencing the precision of
the estimate once the design has been chosen, an iterative process can be used to
examine the impact on efficient sample size if an alternative design were used.
Stratification or the selection of more primary sampling units in multistage sam-
pling can improve the precision of a sample without increasing the number of units
in the sample. Of course, these adjustments may increase costs also, but perhaps less
than increasing the sample size would.
Postsampling Choices
How Can the Impact of Nonresponse Be Evaluated?
Nonresponse, for sampling purposes, refers to the proportion of sampled individuals who did not provide usable responses, calculated by subtracting the response rate from 1. Nonresponse can occur when a respondent refuses to participate in the sur-
vey or when a respondent cannot be contacted. If the nonresponding portion of the
population is reduced, the nonsampling bias is reduced (Kalton, 1983). Also, non-
response can occur when an individual who is participating in a survey cannot or
will not provide an answer to a specific question. Fowler (1993; see also Chapter 12,
this volume) and Dillman (1999) discuss several ways of reducing nonresponse. It
is often necessary for the researcher to evaluate the impact of nonresponse by con-
ducting special studies of the nonrespondents, comparing the sample characteris-
tics with known population parameters, or examining the sensitivity of the sample
estimates to weighting schemes that may provide greater weight to responses from
individuals who are considered to have characteristics more like the nonrespon-
dents (Henry, 1990; see also Braverman, 1996; Couper & Groves, 1996; Krosnick,
Narayan, & Smith, 1996).
Summary
The challenge of sampling lies in making trade-offs to reduce total error while
keeping study goals and resources in mind. The researcher must make choices
throughout the sampling process to reduce error, but reducing the error associated
with one choice can increase errors from other sources. Faced with this complex,
multidimensional challenge, the researcher must concentrate on reducing total
error. Error can arise systematically from bias or can occur due to random fluctua-
tion inherent in sampling. Error cannot be eliminated entirely. Reducing error is
the practical objective, and this can be achieved through careful design.
Discussion Questions
1. What are the main differences between probability and nonprobability samples?
2. For probability samples, what are the main alternatives to simple random
samples? Name one circumstance in which each one might become a preferred
option for the sampling design.
3. What is a confidence interval? What does it measure?
4. How would you go about determining the variable of greatest interest for an
evaluation of adolescent mental health programs?
5. What sample plan would you develop for describing the uninsured popula-
tion of your state?
6. In what circumstances might you choose a convenience sample over a prob-
ability sample?
7. What are the major factors that contribute to standard error of the mean?
Which of the factors can be most easily controlled by researchers?
Exercises
1. Find an evaluation report for which survey data have been collected from a
sample of the population. Answer the following questions:
a. What is the target population?
b. What is the study population?
c. What target population members are omitted from the study population?
d. Was a listing used as the sampling frame? Other than the omissions, are
there issues with the sampling frame that might bias the findings?
e. What sampling design was used for the evaluation?
2. Find a survey conducted by a federal agency and made available on the
Internet. Look at the technical description of the sample. What was the sampling
design that was used? What was the sample size? What factors affected the sample
size? Did the survey researchers oversample to compensate for nonresponse? Did
the researchers oversample a subpopulation or a stratum of the population for other
reasons? If so, what were the reasons?
3. Draw up two approaches for sampling teachers in your home state. The tar-
get population is full-time classroom teachers in public schools in the state. Assume
that you are going to survey the teachers using a mailed survey. One approach
should use a sampling frame. The other approach should use a sample design that
does not require a sampling frame. Compare the nonsampling bias, sampling bias,
and sampling variability of the two approaches. To compare the sampling variabil-
ity, assume that the variable of interest is the percentage of teachers planning to
leave teaching within the next 5 years. Are there differences in costs or in feasibility
that might lead to choosing one of the approaches over the other?
4. Look carefully at the results and description of a national, statewide, or city-
wide poll based on a probability sample (surveys of readers should be excluded)
that you see reported in the media. If reported in print media, you may find more
detail about the survey online. What is the margin of error or confidence interval
around the percentages reported? What other sources of error seem to have
occurred, if any? What was the response rate? What would you like to know about
the poll that is not mentioned in the descriptions?
References
American Association for Public Opinion Research. (2006). Standard definitions: Final dispositions of case codes and outcome rates for surveys (4th ed.). Lenexa, KS: Author.
Braverman, M. T. (1996). Survey use in evaluation. New Directions for Evaluation, 71, 3–15.
Couper, M. P., & Groves, R. M. (1996). Household-level determinants of survey nonresponse. In M. T. Braverman & J. K. Slater (Eds.), Advances in survey research (pp. 63–70). San Francisco: Jossey-Bass.
Dillman, D. A. (1999). Mail and Internet surveys: The tailored design method (2nd ed.). New York: Wiley.
Fowler, F. J., Jr. (1993). Survey research methods (2nd ed.). Newbury Park, CA: Sage.
Gormley, W. T., & Gayer, T. (2005). Promoting school readiness in Oklahoma. Journal of Human Resources, 40(3), 533–558.
Henry, G. T. (1990). Practical sampling. Newbury Park, CA: Sage.
Kalton, G. (1983). Introduction to survey sampling. Beverly Hills, CA: Sage.
Keeter, S., Miller, C., Kohut, A., Groves, R., & Presser, S. (2000). Consequences of reducing nonresponse in a national telephone survey. Public Opinion Quarterly, 64(2), 125–148.
Kish, L. (1965). Survey sampling. New York: Wiley.
Krosnick, J. A., Narayan, S., & Smith, W. R. (1996). Satisficing in surveys: Initial evidence. In M. T. Braverman & J. K. Slater (Eds.), Advances in survey research (pp. 29–44). San Francisco: Jossey-Bass.
Mashburn, A. J., & Henry, G. T. (2004). Assessing school readiness: Validity and bias in preschool and kindergarten teachers' ratings. Educational Measurement: Issues and Practice, 23(4), 16–30.
McKean, K. (1987, January). The orderly pursuit of pure disorder. Discover, 72–81.
Skidmore, F. (1983). Overview of the Seattle-Denver Income Maintenance Experiment: Final report. Washington, DC: Government Printing Office.
Sudman, S. (1976). Applied sampling. New York: Academic Press.
CHAPTER 4
Planning Ethically
Responsible Research
Joan E. Sieber
Applied researchers examine and experiment with issues that directly affect people's lives, such as education, health, family life, work, finances, and access to government benefits, and they must respect the interests of sub-
jects and their communities. There is a practical, as well as a moral, point to this.
Unless all parties concerned are recognized and respected, it is likely that research
questions may be inappropriately framed, participants may be uncooperative, and
findings may have limited usefulness. Consequently, investigators who are thought-
less regarding ethics are likely to harm themselves and their research as well as those
that they study.
This chapter focuses on research planning and ethical problem solving, not on
details of federal or state law governing human research or on preparing research
protocols for institutional review boards (IRBs). Readers may wish to refer to
www.hhs.gov/ohrp for the current federal regulations governing human research.
Details on approaches to compliance with various aspects of federal law, and how
to write a research protocol in compliance with IRB and federal requirements, are
presented on the Web sites of many IRBs and in Planning Ethically Responsible
Research (Sieber, 1992) in the Applied Social Research Methods Series published by
Sage Publications. The readers own IRB can provide information on its specific
requirements.
An Introduction to Planning
The ethics of social and behavioral research is about creating a mutually respectful,
win-win relationship in which important and useful knowledge is sought, participants
106
are pleased to respond candidly, valid results are obtained, and the community con-
siders the conclusions constructive. This requires more than goodwill or adherence
to laws governing research. It requires investigation into the perspectives and cul-
tures of the participants and their community early in the process of research
design, so that their needs and interests are understood and served.
In contrast, a researcher who does not investigate the perspectives of the partic-
ipants and plan accordingly may leave the research setting in pandemonium. The
ensuing turmoil may harm all the individuals and institutions involved, as illus-
trated by the following example, adapted from an actual study.
A researcher sought to gather information that would help local schools meet
the needs of children of migrant farm workers. He called on families at their homes
to ask them, in his halting Spanish, to sign a consent form and to respond to his
interview questions. Most of the families seemed not to be at home, and none
acknowledged having children. Many farm workers are undocumented, and they
assumed that the researcher was connected with the U.S. Immigration and
Naturalization Service (INS). News of his arrival spread quickly, and families
responded accordingly: by fleeing the scene.
Enlightened, ethical research practices make for successful science, yet many researchers have been trained to focus narrowly on their research agendas and to ignore the perceptions and expectations of their participants and of society at large. A researcher who is narrowly focused on completing a project can easily overlook those interests and perspectives, and the likely result is a failed research program as well as a community that has learned to disrespect researchers.
Ethical research practice entails skillful planning and effective communication,
reduction of risk, and creation of benefits, as these issues pertain to the stakehold-
ers in the research. Stakeholders include any persons who have interests in the research or its outcomes.
The protocol has legal status as a control document. It is the paper trail show-
ing that the research is acceptable to a legally constituted board of reviewers. Should
anyone raise questions about the project, the approved protocol shows that the
project is deemed to be of sufficient value to justify any risks involved. Hence, the pro-
tocol must reflect what is actually done in the research. Once the IRB has approved a
protocol for a particular project, the investigator must follow that procedure, have any
desired changes approved by the IRB, or risk a disaster.

The federal regulations governing human research rest on three ethical principles:
Beneficence: maximizing good outcomes for science, humanity and the indi-
vidual research participant, while avoiding or minimizing unnecessary risk,
harm or wrong.
Respect for subjects: protecting the autonomy of (autonomous) persons, and
treating the nonautonomous with respect and special protections.
Justice: ensuring reasonable, nonexploitative, and carefully considered proce-
dures and their fair administration.
In practice, these principles translate into requirements such as weighing risk against benefit, selecting the appropriate kind and number of subjects, obtaining voluntary informed consent, and compensating subjects for injury or at least informing them whether compensation will be available.
The interpretation of regulations needs to evolve as new research challenges arise. The IRB (a committee) is governed by the HRPP
(the administrative policies and program that specify the role of the IRB and
other elements of the system such as education of investigators, students, and IRB
members). The HRPP should take advantage of the flexibility permitted by the fed-
eral regulations to modify the role of the IRB as circumstances require (Rubin &
Sieber, 2006). For example, the HRPP may mandate that the IRB not review mini-
mal risk research, but that these be reviewed outside the IRB, perhaps by IRB
members who expedite the review of minimal risk or exempt protocols within their
department or area of expertise. Researchers who observe the need for more ethi-
cal interpretations of regulations might work with their IRB to empirically test the
efficacy of alternative procedures, as suggested by Levine (2006), for example. Thus,
empirical research to determine what works to satisfy ethical principles can play an
important role in ensuring that regulations are interpreted in ways that are sensible
and ethical.
We turn now to three major aspects of ethical problem solving: consent (includ-
ing debriefing and deception), privacy/confidentiality, and risk/benefit, and finally
to the special needs of vulnerable populations, including children.
Voluntary Informed Consent

The consent statement formalizes the agreement to research participation, but it is not necessarily the final communication about the
conditions of the research. Often, questions and concerns occur to the participants
only after the research is well under way. Sometimes, it is only then that meaning-
ful communication and informed consent can occur. The researcher must be open
to continuing two-way communication throughout the study and afterward as
questions occur to the participants.
Voluntary means without threat or undue inducement. When consent state-
ments are presented as a plea for help or when people are rushed into decisions,
they may agree to participate even though they would rather not. They are then
likely to show up late, fail to appear, or fail to give the research their full attention.
To avoid this, the researcher should urge each subject to make the decision that best
serves his or her own interests. Also, the researcher should not tie participation to
benefits that the subjects could not otherwise afford such as health services, espe-
cially if participants are indigent or otherwise vulnerable to coercion. And, partici-
pants need to know that they can quit at any time without repercussion.
Informed means knowing what a reasonable person in the same situation would
want to know before giving consent, including who the researcher is and why the
study is being done. Mostly, people want to know what they are likely to experience,
including the length of time required, and how many sessions are involved. If the
procedure is unusual or complicated, a videotape of the procedure may be more
informative than a verbal description. People need to be informed in language that
they understand. Two methods of learning the terminology that subjects would use
and understand are described by Willis (2006): the think aloud method and the verbal probing method. In the think aloud method, (surrogate) subjects are asked to externalize their thought processes ("Tell me what you are thinking.") as they respond to materials. For example, as the surrogate subject reads each element of the informed consent, he is to say out loud what it makes him think. In the verbal probing method, the subject is asked to explain each part, and probes such as the following are used: "Tell me more about that . . .," "What does . . . (particular term) mean to you?" "When someone tells you that, what would you want to know?"
Although the competence to understand and make decisions about research par-
ticipation is conceptually distinct from voluntariness, these qualities become blurred
in the case of some populations. Children, adults with intellectual disabilities, the
poorly educated, and prisoners, for instance, may not understand their right to refuse
to participate in research when asked by someone of apparent authority. They may
also fail to grasp details relevant to their decision. The researcher may resolve this
problem by injecting probes (as in cognitive interviewing) into the informed consent
process for each subject, or by appointing an advocate for the research subject, in
addition to obtaining the subject's assent. For example, children cannot legally con-
sent to participate in research, but they can assent to participate, and must be given
veto power over parents or other adults who give permission for them to participate.
Consent means explicit agreement to participate. Competence to consent or
assent and voluntariness are affected by the way the decision is presented (Melton
& Stanley, 1991). An individual's understanding of the consent statement and acceptance of his or her status as an autonomous decision maker will be most
powerfully influenced not by what the individual is told, but by how he or she is
engaged in the communication. There are many aspects of the investigator's speech
and behavior that communicate information to subjects. Body language, friendli-
ness, a respectful attitude, and genuine empathy for the role of the subject are
among the factors that may speak louder than words. To illustrate, imagine a poten-
tial subject who is waiting to participate in a study:
Scenario 1: The scientist arrives late, wearing a rumpled lab coat, and props
himself in the doorway. He ascertains that the subject is indeed the person
whose name is on his list. He reads the consent information without looking
at the subject. The subject tries to discuss the information with the researcher,
who seems not to hear. He reads off the possible risks. The nonverbal commu-
nication that has occurred is powerful. The subject feels resentful and sup-
presses an urge to storm out. What has been communicated most clearly is
that the investigator does not care about the subject. The subject is sophisti-
cated and recognizes that the researcher is immature, preoccupied, and lack-
ing in social skills, yet he feels devalued. He silently succumbs to the pressures
of this unequal status relationship to do the right thing; he signs the consent
form amid a rush of unpleasant emotions.
Scenario 2: The subject enters the anteroom and meets a researcher who is well-
groomed, stands straight and relaxed, and invites the subject to sit down with
him. The researcher's eye contact, easy and relaxed approach, warm but pro-
fessional manner, voice, breathing, and a host of other cues convey that he is
comfortable communicating with the subject. He is friendly and direct as he
describes the study. Through eye contact, he ascertains that the subject under-
stands what he has said. He invites questions and responds thoughtfully to
comments, questions, and concerns. When the subject raises scientific ques-
tions about the study (no matter how naive), the scientist welcomes the
subject's interest in the project and enters into a brief discussion, treating the
subject as a respected peer. Finally, the researcher indicates that there is a for-
mal consent form to be signed and shows the subject that the consent form cov-
ers the issues they have discussed. He mentions that it is important that people
not feel pressured to participate, but rather should participate only if they really
want to. The subject signs the form and receives a copy of the form to keep.
Though the consent forms in these two cases may have been identical, only the
second scenario exemplifies adequate, respectful informed consent. The second
researcher was respectful and responsive; he facilitated adequate decision making.
Congruence, rapport, and trust were essential ingredients of his success.
Congruence of Verbal and Body Language. The researcher in Scenario 1 was incon-
gruent; his words said one thing, but his actions said the opposite. The congruent
researcher in Scenario 2 used vocabulary that the research participant easily under-
stood, spoke in gentle, direct tones, breathed deeply and calmly, and stood or sat
straight and relaxed. To communicate congruently, one's mind must be relatively
clear of distracting thoughts.
Rapport. The researchers friendly greeting, openness, positive body language, and
willingness to hear what each subject has to say or to ask about the study are
crucial to establishing rapport. When consent must be administered to many par-
ticipants, the process can turn into a routine delivered without a feeling of com-
mitment; this should be avoided.
Trust. If participants believe that the investigator may not understand or care about
them, there will not be the sense of partnership needed to carry out the study sat-
isfactorily. The issue of trust is particularly important when the investigator has
higher status than the subject or is from a different ethnic group. It is useful for the
researcher to ask members of the subject population, perhaps in a focus group, to
examine the research procedures to make sure that they are respectful, acceptable,
and understandable to the target population.
There are many ways to build respect, rapport, and trust, to enhance communication, and to increase the benefits to subjects of a research project, depending on the setting and circumstances. When planning research, especially in a field setting, it is
ting and circumstances. When planning research, especially in a field setting, it is
useful for researchers to conduct focus groups drawn from the target population, to
consult with community gatekeepers, or to consult with pilot subjects to learn their
reactions to the research procedures and how to make the research most beneficial
and acceptable to them (see Stewart, Shamdasani, and Rook, Chapter 18, this vol-
ume, for discussion of uses of focus groups). For example, learn what terms to use
when obtaining demographic information such as ethnicity and gender orienta-
tion. In some cases, this consultation should extend to other stakeholders and com-
munity representatives. The rewards to the researcher for this effort include greater
ease of recruiting cooperative participants, a research design that will work, and a
community that evinces goodwill.
In summary, it is important for the researcher to determine what the concerns
of the subject population actually are. Pilot subjects from the research population,
as well as other stakeholders, should have the procedure explained to them and
should be asked to try to imagine what concerns people would have about partici-
pating in the study. Often some of these concerns turn out to be very different from
those that the researcher would imagine, and they are likely to affect the outcome
of the research if they are not resolved, as illustrated by the following case of mis-
informed consent:
Researcher: How do you like the food here?

Mrs. B: Oh, it's great! (She constantly complained to her family about the poor service.)

Mrs. B's anxiety was rising; midway through the questioning she asked, "Did I pass the test?"

Mrs. B spun her chair around and wheeled herself away. (Fisher & Rosendahl, 1990, pp. 47–48)
Debriefing
The benefits of research include its educational or therapeutic value for partici-
pants. Debriefing provides an opportunity for the researcher to consolidate
the value of the research to subjects through conversation and handouts. The
researcher can provide rich educational material immediately, based on the litera-
ture that forms the foundation of the research. Debriefing also offers an opportu-
nity for the researcher to learn about subjects' perceptions of the research: Why did they respond as they did, especially those whose responses were unusual? How do
their opinions about the usefulness of the findings comport with those of the
researcher? Typically, the interpretation and application of findings are strength-
ened by researchers thoughtful discussions with participants. Many a perceptive
researcher has learned more from the debriefing process than the data alone could
ever reveal.
If the researcher or IRB has any concerns about whether subjects experience
misgivings about the research, it is useful to know if, in fact, misgivings or upset do
occur, and whether it is an idiosyncratic concern of just one or a few or a concern
of a substantial proportion of the subjects. It is a mistake to confuse the misgivings
of one or a few with the notion that the research is risky. Newman, Risch, and
Kassam-Adams (2006) summarize research on trauma survivors to show that most find it quite beneficial to be interviewed by an experienced professional about their experiences.
Deception
In deception research, the researcher studies reactions of subjects who are pur-
posely led to have false beliefs or assumptions. This is generally unacceptable in
applied research, but consent to concealment may be defensible when it is the only
viable way (a) to achieve stimulus control or random assignment, (b) to study
responses to low-frequency events (e.g., fights, fainting), (c) to obtain valid data
without serious risk to subjects, or (d) to obtain information that would otherwise
be unobtainable because of subjects' defensiveness, embarrassment, or fear of
reprisal. An indefensible rationale for deception is to trick people into research par-
ticipation that they would find unacceptable if they correctly understood it. If it is
to be acceptable at all, deception research should not involve people in ways that
members of the subject population would find unacceptable.
Deception studies that involve people in doing socially acceptable things, and
pose no threat to persons' self-esteem are little different from many other everyday activities. The few deception studies that have been regarded as questionable or harmful, such as Milgram's (1974) study of obedience in which persons thought
that they were actually delivering high voltage electric shock to others, are ones in
which persons were strongly induced to commit acts that are harmful or wrong,
or were surreptitiously observed engaging in extremely private acts (e.g.,
Humphreys, 1970).
There are three kinds of deception that involve consent and respect subjects' right of self-determination:
2. Consent to deception: Subjects are told that there may be misleading aspects
of the study that will not be explained to them until after they have participated.
A full debriefing is given as promised.
3. Consent to waive the right to be informed: Subjects waive the right to be
informed and are not explicitly forewarned of the possibility of deception. They
receive a full debriefing afterward.
Privacy
What one person considers private, another may not. We certainly know when our
own privacy has been invaded, but the privacy interests of another may differ from
ours. Thus, while researchers should be sensitive to the topics that might be regarded
as private by those they plan to study, to judge what another considers private based
on one's own sense of privacy is to set a capricious and egocentric/ethnocentric stan-
dard for judging privacy. One must let subjects and members of their community
judge for themselves what is appropriate to ask or do in research and how subjects are
to be given an opportunity to control the access of the researcher to themselves.
What is private depends greatly on context and on what we consider to be the
other person's business. The kinds of things we consider appropriate to disclose to
our physician differ from what we disclose to our banker, accountant, neighbor, and
so on. If a highly professional interviewer establishes that a socially important piece
of research hinges on the candid participation of a random sample of the popula-
tion, many would disclose details that they might never tell others. However, a
researcher who took a less professional approach, or sought to do trivial research,
would receive a different reception.
Respecting Privacy. How can investigators protect subjects from the pain of having
their privacy violated? How can investigators guard the integrity of their research
against the lies and subterfuges that subjects will employ to hide some private
truths or to guard against intrusions? Promises of confidentiality and the gathering
of anonymous data may solve some of these problems, but respecting privacy is
more complex than that. An understanding of the privacy concerns of potential
subjects enables the researcher to communicate an awareness of, and respect for,
those concerns, and to protect subjects from invasion of their privacy. Because pri-
vacy issues are often subtle, and researchers may not understand them, appropriate
awareness may be lacking with unfortunate results, such as the following:
honestly. This researcher should have invested time in better scholarship into
the development of privacy needs in children (see Thompson, 1991).
In each of the above cases, the researcher has been insensitive to privacy issues
idiosyncratic to the research population and has not addressed the problems that
these issues pose for the research. Had the researcher consulted the psychological
literature, community gatekeepers, consumers of the research, or others familiar
with the research population, he or she might have identified these problems and
solved them in the design stage. Most of the topics that interest social scientists con-
cern somewhat private or personal matters. Yet most topics, however private, can be
effectively and responsibly researched if investigators employ appropriate sensitiv-
ity and safeguards.
Is There a Right to Privacy? The right to privacy from research inquiry is protected
by the right to refuse to participate in research. An investigator is free to do research
on consenting subjects or on publicly available information, including unobtrusive
observation of people in public places, although the chat room case above illus-
trates that in some contexts a public venue should be treated as private. Researchers
may videotape or photograph the behavior of people in public without consent.
But if they do so, they should heed rules of common courtesy and should be sensi-
tive to local norms. Intimate acts in public places, such as goodbyes at airports and
intimate discussions in chat rooms, should be regarded as private, though done in
a public venue.
Constitutional and federal laws have little to say directly about privacy and
social/behavioral research. Except for HIPAA (see p. 128), which governs health
data, the only definitive federal privacy laws governing social/behavioral research
pertain to school research.
Many claims to privacy are also claims to autonomy. For example, subjects'
privacy and autonomy are violated when their self-report data on marijuana use
become the basis for their arrest, when IQ data are disclosed to schoolteachers who
would use it to track students, or when organizational research data disclosed to
managers become the basis for firing or transferring employees. The most dramatic
cases in which invasion of privacy results in lowered autonomy are those in which
something is done to an individual's thought processes, the most private part of a
person, through behavior control techniques such as psychopharmacology.
Privacy may be invaded when people are given unwanted information. For
example, a researcher may breach a subjects privacy by showing him pornography
or by requiring him to listen to more about some other person's sex life than he
cares to hear. Privacy is also invaded when people are deprived of their normal flow
of information, as when nonconsenting subjects (who do not realize that they are
participating in a study) are deprived of information that they ordinarily would use
to make important decisions.
Unusual personal boundaries were encountered by Klockars (1974), a criminolo-
gist, when he undertook to write a book about a well-known fence. The fence was
an elderly pawnshop owner who had stolen vast amounts earlier in his life. Klockars
told the fence that he would like to document the details of his career, as the world
has little biographical information about the lives of famous thieves. Klockars offered
to change names and other identifying features of the account to ensure anonymity.
The fence, however, wanted to go down in history and make his grandchildren proud
of him. He offered to tell all, but only if Klockars agreed to publish the fence's real
name and address in the book. This was done, and the aging fence proudly decorated
his pawnshop with clippings from the book. (Thus confidentiality does not always
involve a promise not to reveal the identity of research participants; rather, it entails
whatever promise is mutually acceptable to researcher and participant.)
Brokered Data
If it would be too intrusive for an investigator to have direct access to subjects, a
broker may be used. The term broker refers to any person who works in some
trusted capacity with a population to which the researcher does not have access and
who obtains data from that population for the researcher. For example, a broker
may be a psychotherapist or a physician who asks patients if they will provide data
for important research being conducted elsewhere. A broker may serve other func-
tions in addition to gathering data for the researcher, as discussed below.
Additional Roles for Brokers. A broker may (a) examine responses for information
that might permit the researcher to deduce the identity of the respondent and,
therefore, remove that information, (b) add information (e.g., a professional eval-
uation of the respondent), or (c) check responses for accuracy or completeness.
There should be some quid pro quo between researcher and broker. Perhaps the
broker may be paid for his or her time, or the researcher may make a contribution
to the broker's organization.
Confidentiality
Confidentiality refers to access to data, not access to people directly. The
researcher should employ adequate safeguards of confidentiality, and these should
be described in specific terms in the consent statement. For example, confidential-
ity agreements such as the following might be included in a consent letter from a
researcher seeking to interview families in counseling.
To protect your privacy, the following measures will ensure that others do not
learn your identity or what you tell me: No names will be used in transcrib-
ing from the audiotape, or in writing up the case study. Each person will be
assigned a letter name as follows: M for mother, F for father, MS1 for male
first sibling, and so on.
All identifying characteristics, such as occupation, city, and ethnic back-
ground, will be changed.
The audiotapes will be reviewed only in my home or the office of my thesis
adviser. The tapes and notes will be destroyed after my report of this research
has been accepted for publication.
What is discussed during our session will be kept confidential, with two
exceptions: I am compelled by law to inform an appropriate other person if I
hear and believe that you are in danger of hurting yourself or someone else
or if there is reasonable suspicion that a child, elder, or dependent adult has
been abused.5
Noteworthy characteristics of this agreement are that it (a) recognizes the sensi-
tivity of some of the information likely to be conveyed, (b) states what steps will be
taken to ensure that others are not privy to the identity of subjects or to identifiable
details about individuals, and (c) states any legal limitations to the assurance of
confidentiality.
There are several reasons why a researcher might want to collect subjects' names
or other unique identifiers along with the data:
1. They make it possible for the researcher to recontact subjects if their data
indicate that they need help or information.
2. They make it possible for the researcher to link data sets from the same indi-
viduals. (This might also be achieved with code names.)
3. They allow the researcher to mail results to the subjects. (This might also be
achieved by having subjects address envelopes to themselves, which are then
stored apart from the data. After the results are mailed out, no record of the
names of subjects would remain with the researcher.)
4. They make it possible for the researcher to screen a large sample on some
measures in order to identify a low-base-rate sample (e.g., families in which
there are twins).
Note that for the first two reasons, the issue is whether to have names associated
with subjects' data; for the third reason, the issue is whether to have names on file
at all. In the fourth case, identifiers may be expunged from the succeeding study as
soon as those data are gathered. If the data can be gathered anonymously, subjects
will be more forthcoming, and the researcher will be relieved of some responsibili-
ties connected with assuring confidentiality. If the research cannot be done anony-
mously, the researcher must consider procedural, statistical, and legal methods for
assuring confidentiality.
1978). This method enables the researcher to check off those who have responded
and to send another wave of questionnaires to those who have not.
Any of these three methods can be put to corrupt use if the researcher is so
inclined. Because people are sensitive to corrupt practices, the honest researcher
must demonstrate integrity. The researcher's good name and that of the research
institution may reduce the suspicion of potential respondents.
Different procedures are needed if individuals' data files are to be linked perma-
nently, as in longitudinal research, or linked to other independently stored files:
Longitudinal Research. Here, the researcher must somehow link together the vari-
ous responses of particular persons over time. A common way to accomplish this is
to have each subject use an easily remembered code, such as mother's maiden name
as an alias. The researcher must make sure that there are no duplicate aliases. The
adequacy of this method depends on subjects' ability to remember their aliases. In
cases where a subject's mistaken use of the wrong alias might seriously affect the
research or the subject (e.g., the subject gets back the wrong HIV test result), this
method of linking data would be inappropriate.
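As a concrete illustration of this alias-based linkage, the sketch below links two waves of anonymous records on a self-chosen alias and flags duplicate aliases. The data structures and field names are hypothetical; this is only a minimal sketch of the bookkeeping involved, not a recommended system.

```python
from collections import Counter

def check_duplicate_aliases(wave):
    """Return any alias that appears more than once within a single wave.

    Duplicate aliases make subjects indistinguishable, so they must be
    resolved (e.g., by asking subjects to add a digit) before linking.
    """
    counts = Counter(record["alias"] for record in wave)
    return [alias for alias, n in counts.items() if n > 1]

def link_waves(wave1, wave2):
    """Join two waves of anonymous records on the self-chosen alias."""
    wave2_by_alias = {record["alias"]: record for record in wave2}
    linked = []
    for record in wave1:
        match = wave2_by_alias.get(record["alias"])
        if match is not None:  # alias remembered correctly at both waves
            linked.append({"alias": record["alias"],
                           "time1": record["score"],
                           "time2": match["score"]})
    return linked

# Example: two survey waves identified only by mother's-maiden-name aliases.
wave1 = [{"alias": "garcia", "score": 12}, {"alias": "smith", "score": 9}]
wave2 = [{"alias": "garcia", "score": 15}, {"alias": "jones", "score": 11}]

assert check_duplicate_aliases(wave1) == []
print(link_waves(wave1, wave2))  # only "garcia" links; forgotten aliases drop out
```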
Other File Linking. Sometimes, a researcher needs to link each person's records with
some other independently stored records on those same persons (exact matching)
or on persons who are similar on some attributes (statistical matching). A researcher
can link files without disclosing the identity of the individuals by constructing
identifications based on the files, such as a combination of letters from the individ-
ual's name, his or her date of birth and gender, and the last four digits of the per-
son's social security number.
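The sketch below illustrates one way such a constructed identification might be built and used for exact matching. The field names are hypothetical, and a real project would have to handle missing, misspelled, or inconsistently recorded values.

```python
def linkage_id(record):
    """Build a linkage code without carrying the person's full name.

    Combines the first three letters of the last name, date of birth,
    gender, and the last four digits of the social security number.
    """
    return "-".join([
        record["last_name"][:3].upper(),
        record["dob"],            # e.g., "1990-05-17"
        record["gender"],
        record["ssn"][-4:],
    ])

# Two independently stored files describing the same person.
survey_record = {"last_name": "Ramirez", "dob": "1990-05-17",
                 "gender": "F", "ssn": "123456789", "attitude": 4}
school_record = {"last_name": "Ramirez", "dob": "1990-05-17",
                 "gender": "F", "ssn": "123456789", "gpa": 3.4}

# Exact matching: merge on the constructed code so that neither analysis
# file needs to carry the person's name.
assert linkage_id(survey_record) == linkage_id(school_record)
merged = {"link_id": linkage_id(survey_record),
          "attitude": survey_record["attitude"],
          "gpa": school_record["gpa"]}
print(merged)
```

Note that a code built this way still carries partial identifiers, which is one reason the brokered arrangement described next may be preferable for sensitive files.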
Another approach to interfile linkage would be through use of a broker, who
would perform the linkage without disclosing the identity of the individuals. An
example would be court-mandated research on the relationship between academic
accomplishment and subsequent arrest records of juveniles who have been sen-
tenced to one of three experimental rehabilitation programs. The court may be
unwilling to grant a researcher access to the records involved but may be willing to
arrange for a clerk at the court to gather all the relevant data on each subject,
remove identifiers, and give the anonymous files to the researcher. The obvious
advantages of exact matching are the ability to obtain data that would be difficult
or impossible to obtain otherwise and the ability to construct a longitudinal file.
Certificates of Confidentiality
Under certain circumstances, priests, physicians, and lawyers may not be
required to reveal to a court of law the identities of their clients or sources of infor-
mation. This privilege does not extend to researchers. Prosecutors, grand juries, leg-
islative bodies, civil litigants, and administrative agencies can use their subpoena
powers to compel disclosure of confidential research information. What is to pro-
tect research from this intrusion? Anonymous data, aliases, colleagues in foreign
countries to whom sensitive data can be mailed as soon as it is gathered, and statis-
tical strategies are not always satisfactory solutions. The most effective protection
against subpoena is the Certificate of Confidentiality.
In 1988, the U.S. Congress amended the Public Health Service Act, providing for
an apparently absolute researcher-participant privilege when it is covered by a
Certificate of Confidentiality issued by units of the Department of Health and
Human Services. The Certificate of Confidentiality is designed to protect identifi-
able sensitive data against compelled disclosure in any federal, state, or local civil,
criminal, administrative, legislative, or other proceeding (see http://grants1.nih
.gov/grants/policy/coc/background.htm). Wolf and Zandecki (2006) recently sur-
veyed National Institutes of Health (NIH)-funded investigators to learn about their
experience of using Certificates of Confidentiality and found that while most inves-
tigators prefer using them, they cannot gauge how research participants regard
them, and some investigators found them too complex to explain to participants.
Singer (2004) found that mention of a Certificate of Confidentiality increases the
perception of harm, especially among younger respondents.
Example 3: The data will be anonymous. You are asked to write your name on
the cover sheet so that I can make sure your responses are complete. As soon
as you hand in your questionnaire, I will check your responses for complete-
ness and ask you to complete any incomplete items. I will then tear off and
destroy the cover sheet. There will then be no way anyone else can associate
your name with your data.
Example 7: In this study, I will examine the relationship between your child's
SAT scores and his attitude toward specific areas of study. We respect the pri-
vacy of your child. If you give me permission to do so, I will ask your child to
fill out an attitude survey. I will then give that survey to the school secretary,
who will write your child's SAT subscores on it, and erase your child's name
from it. That way, I will have attitude and SAT data for each child, but will not
know the name of any child. The data will then be statistically analyzed and
reported as group data.
These are merely examples. The researcher needs to give careful consideration to
the content and wording of each consent statement.
Data Sharing
If research is published, the investigator is accountable for the results, and is nor-
mally required to keep the data for 5 to 10 years. The editor of the publication in
which the research is reported may ask to see the raw data to check its veracity.
Some funders (e.g., NIH, 2003) require that the documented data be archived in
user-friendly form and made available to other scientists. When data are shared via
a public archive, the researcher must ensure that all identifiers are removed and that
there is no way for anyone to deduce subjects identities.
A variety of techniques have been developed by the Federal government (which
has an obligation to provide to other users the data collected at taxpayer expense)
to transform raw data into a form that prevents deductive disclosure (Zarate &
Zayatz, 2006). The objective is always to preserve the analytical value while remov-
ing the characteristics of the data that would enable one to reidentify the ostensi-
bly deidentified records. Variables or cases with easily identifiable characteristics are
removed. Microaggregation can be employed by ordering microdata along a single
variable and then aggregating adjacent records in groups of three or more. Within each
grouping, the reported (actual) value on all variables is replaced by the average
value of the group for each variable. For details of microaggregation, see O'Rourke
et al. (2006), who provide detailed descriptions of other techniques as well. If the
analytical value of data would be destroyed by using techniques such as those
described by O'Rourke et al., one may provide limited access to the raw data to per-
sons who meet stringent requirements such as administration of the sharing
arrangement by their institution, signing of contractual or licensing agreements,
and so on (see Rodgers & Nolte, 2006, for details of these procedures).
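A minimal sketch of the microaggregation step described in the preceding paragraph, assuming the microdata are held as a simple list of numeric records; this illustrates the idea only, not any agency's production procedure.

```python
def microaggregate(records, sort_key, group_size=3):
    """Order records on one variable, then replace each adjacent group's
    values with the group means, blurring individual values while keeping
    the data usable for analysis. (A production implementation would merge
    a short final group into the preceding one so no group falls below 3.)"""
    ordered = sorted(records, key=lambda r: r[sort_key])
    result = []
    for start in range(0, len(ordered), group_size):
        group = ordered[start:start + group_size]
        means = {k: sum(r[k] for r in group) / len(group) for k in group[0]}
        result.extend(dict(means) for _ in group)
    return result

raw = [
    {"income": 21_000, "age": 34}, {"income": 25_000, "age": 41},
    {"income": 30_000, "age": 29}, {"income": 90_000, "age": 55},
    {"income": 95_000, "age": 60}, {"income": 99_000, "age": 58},
]
for row in microaggregate(raw, sort_key="income"):
    print(row)
# Each group of three adjacent records now reports only group averages.
```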
When health data are to be shared, the Privacy Rule of the Health Insurance
Portability and Accountability Act of 1996 (HIPAA), which is really about
confidentiality, permits a holder of identified health data to release those data with-
out the individual's authorization if certain conditions are met. Either the holder
must delete all of the 18 identifiers specified in HIPAA, or a disclosure expert must
determine whether data elements, alone or combined with others, might lead to
identification of a specific person (for details of HIPAA, see www.hhs.gov/ocr/combinedregtext
.pdf; for details on compliance with HIPAA, see DeWolf, Sieber, Steel, & Zarate, 2006).
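The first of those two routes amounts to stripping the specified identifier fields before release. The hypothetical sketch below drops a handful of the 18 HIPAA identifier categories by column name; a real de-identification effort would have to cover all 18 categories, including dates and small geographic units, as well as any free-text fields.

```python
# A few of HIPAA's identifier categories, mapped to hypothetical column names.
IDENTIFIER_COLUMNS = {
    "name", "street_address", "phone", "email",
    "ssn", "medical_record_number", "full_face_photo_id",
}

def strip_identifiers(record, extra_columns=frozenset()):
    """Return a copy of the record with identifier columns removed."""
    banned = IDENTIFIER_COLUMNS | set(extra_columns)
    return {k: v for k, v in record.items() if k not in banned}

patient = {"name": "J. Doe", "ssn": "123-45-6789", "zip3": "940",
           "diagnosis": "asthma", "age": 47}
print(strip_identifiers(patient))
# {'zip3': '940', 'diagnosis': 'asthma', 'age': 47}
```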
Kinds of Risk. Risk, or the possibility of some harm, loss, or damage, may involve
mere inconvenience (e.g., boredom, frustration, time wasting), physical risk (e.g.,
injury), psychological risk (e.g., insult, depression, upset), social risk (e.g., embar-
rassment, rejection), economic risk (e.g., loss of job, money, credit), or legal risk
(e.g., arrest, fine, subpoena).
What Aspect of Research Creates Risk? Risk may arise from (a) the theory, which may
become publicized and may blame the victim or create wrong ideas; (b) the research
process; (c) the institutional setting in which the research occurs, which may be coer-
cive in connection with the research; and (d) the uses of the research findings.
Kipnis (2001, 2004) identifies several ways in which prospective research subjects
may be vulnerable:
Cognitive vulnerability: Does the person have the capacity to decide whether
to participate?
Juridic vulnerability: Is the person liable to the authority of others who may
have an independent interest in their research participation?
Deferential vulnerability: Does the person have patterns of deferential behav-
ior that may mask an unwillingness to participate?
Medical vulnerability: Has the person been selected for having a serious
health-related condition for which there are no satisfactory remedies?
Allocational vulnerability: Does the person lack important social goods that
will be provided in return for research participation?
Research infrastructure: Does the political, organizational, economic, and social
context of the research have the integrity and resources needed to manage the
study responsibly?
prostitutes, persons with AIDS, victims of violence, and so on. The preceding dis-
cussion about communication, risk/benefit assessment, and privacy/confidentiality
is doubly important for such populations. Furthermore, members of many stigma-
tized and fearful populations are especially unwilling to be candid with researchers
who are interested primarily in discovering scientific truth, rather than helping the
individuals being studied. Contrary to the usual scientific directive to be objective,
the researcher who investigates the lives of runaways, prostitutes, or victims of
domestic violence or spousal rape often must be an advocate for those subjects to
gain their trust and cooperation and must relate in a personal and caring manner
if candor and participation are to be forthcoming from members of the research
population. However, the devil is in the details. General prescriptions pale along-
side accounts of ethical issues in specific contexts. Each vulnerable research popu-
lation has its own special set of fears, its own reasons for mistrusting scientists, and
its own culture, which outsiders can scarcely imagine. Interested readers are
referred to Renzetti and Lee (1993) for further discussion.
She provided the school faculty with materials on learning disabilities, and
gave bag-lunch workshops and presentations on her project. She worked with
teachers who were interested in trying her approaches in their classrooms, urg-
ing them to adapt and modify her approaches as they deemed appropriate,
and asked that they let her know the outcomes. Together, the researcher and
the teachers pilot tested adaptations of the methods concurrently with the for-
mal experiments. All learning disabled children who participated received
special recognition and learned how to assist other students with similar prob-
lems. Two newspaper articles about the program brought favorable publicity
to the researcher, the school, and the researchers university. This recognition
further increased the already high morale of students, teachers, and the
researcher.
Of the six procedures examined, only two showed significant long-term gains
on standardized tests of learning. However, the teachers who had gotten
involved with pilot testing of variations on the treatments were highly enthu-
siastic about the success of these variations. When renewal of funding was
sought, the funder was dissatisfied with the formal findings, but impressed
that the school district and the university, together, had offered to provide in-
kind matching funds. The school administrators wrote a glowing testimony to
the promise of the new pilot procedures and of the overall approach, and the
funder supported the project for a second year. The results of the second year,
based on modified procedures, were much stronger. Given the structure that
had been created, it was easy for the researcher to document the entire proce-
dure on videotape and to disseminate it widely. The funder provided seed
money to permit the researcher, her graduate students, and the teachers who
had collaborated on pilot testing to start a national-level traveling workshop,
which quickly became self-supporting. This additional support provided sum-
mer salary to the researcher, teachers, and graduate students for several years.
This tale of providing benefits to the many stakeholders in the research process
is not strictly relevant to all research. Not every researcher does field research
designed to benefit a community. In some settings, too much missionary zeal to
include others in helping may expose some subjects to serious risk such as breach
of confidentiality. Not all research is funded or involves student assistants. Many
researchers engage in simple, unfunded, unassisted, one-time laboratory studies to
test theory. Even in such uncomplicated research, however, any benefit to the insti-
tution (e.g., a Science Day research demonstration) may favorably influence the
institution to provide resources for future research, and efforts to benefit subjects
may be repaid with their cooperation and respect.
Significant contributions to science and society are not the results of one-shot
activities. Rather, such contributions typically arise from a series of competently
designed research or intervention efforts, which themselves are possible only
because the researcher has developed appropriate institutional or community rap-
port and infrastructures and has disseminated the findings in a timely and effective
manner.
Note that even if the experiment or intervention yields disappointing results, all
but the last benefit might be available to the community, as well as to individual
subjects. Let us now consider the seven kinds of beneficiaries.
The subjects may enjoy such benefits as the respect of the researcher, an interest-
ing debriefing, money, treatment, or future opportunities for advancement.
The community or institution that provides the setting for the field research may
include the subjects' homes, neighborhood, clinic, workplace, or recreation center.
A community includes its members, gatekeepers, leaders, staff, professionals, clien-
tele, and peers or family of the subjects. Benefits to the community are similar to
those for the subjects. Sometimes, community members also serve as research assis-
tants and so would receive benefits associated with those of the next category of
recipients as well.
The researcher, as well as research assistants and others who are associated with
the project, may gain valuable relationships, knowledge, expertise, access to fund-
ing, scientific recognition, and so on, if the research is competently conducted, and
Table 4.1   Benefit Table of a Hypothetical Learning Research Project

Relationships: Respect of researcher (subjects); Ties to university (community); Future access to community (researcher); Improved town-gown relationships (institution); Ties with a successful project (funder); Ideas shared with other scientists (science); Access to a new specialist (society)

Material resources: Workbook (subjects); Books (community); Grant support (researcher); Videotapes of research (institution); Instructional materials (funder); Refereed publications (science); Useful popular literature (society)

Training opportunity: Tutoring skills (subjects); Trained practitioners (community); Greater research expertise (researcher); Student training program (institution); Model project for future grant applicants (funder); Workshop at national meetings (science); Training for practitioners nationally (society)

Do good/earn esteem: Esteem of peers (subjects); Local enthusiasm for project (community); Professional respect (researcher); Esteem of community overseers (institution); Satisfaction of funder (funder); Recognition of scientific contribution (science); Greater respect for science (society)

Empowerment: Earn leadership status (subjects); Prestige from the program (community); National reputation (researcher); Good reputation with funder (institution); Congressional increase in funding (funder); Increased prestige of discipline (science); Increased power to help people (society)

Scientific/clinical success: Improved learning ability (subjects); Effective program (community); Leadership opportunities in national program (researcher); Headquarters for national teacher program (institution); Proven success of funded treatment (funder); Improved training via workshops (science); Nationally successful programs (society)
3. Issues of privacy, which are normally salient for adolescents, are likely to be
even more heightened for this population.
4. Maltreated youngsters are likely to experience the research as more stressful than
are normal children. If the researcher effectively establishes rapport, the young-
ster may reach out for help; the researcher must be prepared to respond helpfully.
Vulnerable Populations
Most high-priority social research is concerned with vulnerable populations: drug
abusers, runaways, prostitutes, persons with AIDS, victims of violence, the mentally
ill, and so on. The foregoing discussions about communication, risk/benefit assess-
ment, and privacy/confidentiality are doubly applicable to these populations.
Additionally, members of many stigmatized and fearful populations are unwilling
to be candid with researchers who are interested primarily in discovering scientific
truth, rather than helping the individuals being studied. Contrary to the usual sci-
entific directive to be objective, the researcher who investigates the lives of such
people as runaways, prostitutes, or victims of domestic violence or spousal rape
must be an advocate for those studied to gain their trust and cooperation (Renzetti
& Lee, 1993). The investigators must relate in a personal and caring manner if can-
dor and participation are to be forthcoming from members of such research pop-
ulations. Critical to success is understanding the ways in which members of such
populations may be vulnerable. Application of Kipnis's categories of vulnerability
discussed above (p. 129) is critically important when analyzing the ways in which
such populations are vulnerable in the research setting, and seeking to minimize
those vulnerabilities.
Discussion Questions
1. Ethics is a win-win matter. Discuss the ways that researchers who are
thoughtful can benefit the many stakeholders in human research (including the
seven categories of stakeholders listed in Table 4.1). Discuss ways that researchers
who are thoughtless of ethics might destroy opportunities to do useful research and
negate possible benefits of research.
2. Discuss ways empirical research can enable investigators and IRBs to estab-
lish truly ethical interpretations of the Belmont principles. (Hint: How can they
create informed consent statements and procedures that are correctly understood
by the target research population; how can they learn what fears subjects have about
breach of confidentiality (whether warranted or not); how can they understand the
privacy interests of some subjects? How can they learn what kinds of benefits sub-
jects would really like to have? How can they learn how subjects respond to the
experience of participating in their research?)
3. What are some of the things one should consider when preparing the
informed consent procedure? Why might this matter? Arguably, the manner of
delivery of the consent procedure is more important than the verbal content of the
statement; explain.
4. Debriefing should be a two-way communication. What do you think are
some of the things that the researcher should seek to learn about the research and
the subjects in the debriefing process?
5. When is deception justified? When not? What are some approaches that
respect subjects' rights of self-determination? Describe a way in which a deception
study can have a "learning not to be fooled" element added to it.
6. Distinguish between privacy, confidentiality, and anonymity. Why are pri-
vacy interests of others difficult to judge? What is the role of informed consent in
respecting privacy? Describe several ways to explore the likely privacy interests of
your research population.
7. Assume that you have plans to gather survey data. What are some of the
confidentiality issues you might explore? What might be the advantages of
anonymity? The disadvantages?
8. What are the provisions of PPRA and FERPA? What are the implications for
planning educational research?
9. Describe several kinds of research in which you may need to use a broker.
How might you organize the brokering procedure in each situation?
10. What are the kinds of risk possibly inherent in research? What are ways,
according to Kipnis, in which one might be vulnerable?
11. Describe some of the kinds of benefits that might be received directly by
subjects when they participate in research. Why would it matter whether your insti-
tution or funder benefited?
12. Minors, as research subjects, are different from adults. What are some of the
ways they are different? Why are troubled youth a particular challenge to study?
Exercises
For purposes of convenience, the exercises presented here are based on material
available on the Internet. Three of the articles you will draw on appear in the March
issue of the Journal of Empirical Research on Human Research Ethics (JERHRE, pro-
nounced "Jerry"). Articles in the March issue of JERHRE can be downloaded free of
charge from http://caliber.ucpress.net/loi/jer.
patterned after the focus group research conducted by Raymond DeVries, Melissa
Anderson, and Brian Martinson (2006), "Normal Misbehavior: Scientists Talk
About the Ethics of Research" (available at http://caliber.ucpress.net/loi/jer). Peruse
this brief article to understand the purpose of the study on which your first prac-
tice exercises will be based.
2. Identify some people who are involved in research, who could serve as
surrogate subjects in your exercise.
3. Review "Tips on Informed Consent" at www.socialpsychology.org/consent
.htm/. Notice that the U.S. government regulations offered in the first set of tips
appear to be designed primarily for biomedical research and are less focused on
social and behavioral research than the second set of tips by the American
Psychological Association. Note that this site also offers tips on developing a con-
sent form for a Web-based study. At the bottom of this Web page, click on "Sample
Consent Form," which is a good example of a consent form that would be clear and
understandable to members of an academic community. Using the ideas presented
at this Web site, draft your consent statement.
4. Describe how you will use cognitive interviewing, both the "think aloud" and
the "verbal probing" procedures, to examine whether your surrogate subjects under-
stand the consent statement you have drafted. A detailed discussion of the use of these
procedures may be found in an article by Gordon Willis (2006) titled "Cognitive
Interviewing as a Tool for Improving the Informed Consent Process," in JERHRE
(available at http://caliber.ucpress.net/loi/jer). Recognizing that your research topic is
a rather unusual one, consider what aspects of it your subjects are likely to misun-
derstand based on your consent statement. Think especially about how you will focus
on these areas of likely misunderstanding in your cognitive interview.
5. Conduct sequential cognitive interviews with your surrogate subjects until you
feel you have addressed the areas of misunderstanding or ambiguity in your consent
statement, and have arrived at a statement that your subjects correctly understand.
6. Conduct the focus group. After your focus group of surrogate subjects has
generated a list of behaviors that they believe to be most threatening to the integrity
of the research enterprise, use their experience to generate your debriefing mate-
rial. (a) Ask the surrogate subjects to discuss what they thought of their research
experience, and what kind of debriefing discussion they think people would want.
(b) Take careful notes on what they say. (c) Probe and ask what privacy interests
subjects participating in the focus groups might have. (d) Ask what other kinds of
risks participants might be concerned about or be exposed to. (e) Ask what bene-
fits they think participants might enjoy from the experience. (f) Administer the
Reactions to Research Participation Questionnaire (RRPQ), which can be down-
loaded from www.personal.utulsa.edu/~elana-newman, asking that respondents
not identify themselves on the questionnaire. (g) Ask if they have any further reac-
tions that they would like to share with the group. (h) After thanking and dismiss-
ing the participants, examine the RRPQ for further ideas about what to add to the
debriefing procedure. (i) Write out the debriefing procedure.
7. Revisit your informed consent statement, taking into account what you
have learned. Can you better describe what people will experience and what risks
or benefits they might perceive from the experience? Do you think that there will
be people who are likely to want to opt out of participating if they fully understand
what they will experience? Have you written the statement to give them that oppor-
tunity? There may be good scientific and practical reasons not to include such
people in your focus groups; if so, state some of these reasons.
8. Suppose that you are now going to conduct a survey of scientists to discover
what percentage of them have committed any of the 10 scientific misbehaviors
described in Brian Martinson, Melissa Anderson, Lauren Crain, and Raymond
DeVries (2006, table 2, p. 58), "Scientists' Perceptions of Organizational Justice and
Self-Reported Misbehavior" (available at http://caliber.ucpress.net/loi/jer). Since
you would be asking people to disclose such egregious wrongdoing as falsifying
data and ignoring human subjects requirements, what confidentiality concerns
would you have? What confidentiality concerns do you think your subjects would
have? What procedure did Martinson et al. employ to resolve confidentiality con-
cerns? Can you think of a different procedure that would work as well or better?
9. Furthermore, suppose that you conducted this survey over the Internet and
that to better understand the reasons why anyone would commit any of these 10
misbehaviors, you further asked your subjects whether you might interview them
by phone and, if so, asked that they contact you. While there is much you could do to
ensure that the data were kept in an anonymous form, you worry that there could
be risk of subpoena of data. Go to http://grants1.nih.gov/grants/policy/coc/back
ground.htm and learn what would be involved in obtaining a Certificate of
Confidentiality that would protect the data from subpoena. Identify two ways in
which your interview subjects might be vulnerable, from Kipnis's vulnerability
factors; see http://www.onlineethics.diamax.com/cms/8087.aspx.
10. Using Table 4.1, identify kinds of benefits you could offer to each of the
seven categories of potential benefit recipients in connection with the hypothetical
study based on Martinson et al. (2006).
11. Do you think your focus group project is a minimal risk project? How
might you be sure whether it is? How would you demonstrate your conclusion to
your IRB? Do you think that the hypothetical second project is a minimal risk
project? Why or why not?
Notes
1. For discussion of Certificates of Confidentiality and how they may be obtained from
a federal agency, see http://grants1.nih.gov/grants/policy/coc/background.htm.
2. Federal regulations governing human research are written largely for biomedical
research and may be difficult to interpret. For an interpretation of the regulations that pro-
vides user-friendly instruction, see excellent online materials created by institutional HRPPs,
such as the Web site from the University of Minnesota, www.research.umn.edu/consent, which
presents separate guidance for biomedical and social/behavioral research focusing primarily
on informed consent and understanding the IRB, and an orientation to the rest of the HRPP
Web site www.research.umn.edu/irb/guidance, which discusses many other issues in depth.
3. The researcher should be aware that the significance of eye contact varies with culture.
Direct eye contact conveys honesty in some cultures, whereas in others it is construed as a
sign of disrespect.
4. The Internet provides many kinds of opportunities for recruiting subjects, doing
online experiments, and observing behavior online. A full discussion of the ways in which
the Internet has changed human research and the distinctive ethical questions raised by these
innovations is beyond the scope of this chapter. An excellent summary of these new
opportunities and challenges may be found in Kraut, Olson, Banaji,
Bruckman, Cohen, and Couper (2004).
5. This example, adapted from a statement developed by David H. Ruja, is discussed in
Gil (1986).
6. See Thompson (1991) for discussion of developmental aspects of vulnerability to
research risk.
References
American Statistical Association. (2004). Committee on Privacy, Confidentiality, and Data
Security Web site. Sponsored by ASA's Committee on Privacy and Confidentiality.
Retrieved March 26, 2008, from www.amstat.org/comm/cmtepc/index.cfm
Cauce, A., & Nobles, R. (2006). With all due respect: Ethical issues in the study of vulnerable
adolescents. In J. Trimble & C. Fisher (Eds.), The handbook of ethical research with
ethnocultural populations and communities (pp. 197–215). Thousand Oaks, CA: Sage.
Citro, C., Ilgen, D., & Marrett, C. (Eds.). (2003). Protecting participants and facilitating social
and behavioral sciences research. Washington, DC: National Academies Press.
DeVries, R., Anderson, M., & Martinson, B. (2006). Normal misbehavior: Scientists talk
about the ethics of research. Journal of Empirical Research on Human Research Ethics,
1(1), 43–50.
DeWolf, V., Sieber, J. E., Steel, P., & Zarate, A. (2006). Part II: HIPAA and disclosure risk
requirements. IRB: Ethics & Human Research, 28(1), 6–11.
Dillman, D. (1978). Mail and telephone surveys: The total design method. New York: Wiley.
Elliott, K., & Urquiza, A. (2006). Ethical research with ethnic minorities in the child welfare
system. In J. Trimble & C. Fisher (Eds.), The handbook of ethical research with ethnocul-
tural populations and communities (pp. 181–195). Thousand Oaks, CA: Sage.
The Family Educational Rights and Privacy Act, 20 U.S.C. 1232g; 34 C.F.R. Part 99 (1974).
Fisher, C. B., & Rosendahl, S. A. (1990). Psychological risk and remedies of research partici-
pation. In C. B. Fisher & W. W. Tryon (Eds.), Ethics in applied developmental psychology:
Emerging issues in an emerging field (pp. 43–59). Norwood, NJ: Ablex.
Fost, N. (1975). A surrogate system for informed consent. Journal of the American Medical
Association, 233(7), 800–803.
Gil, E. (1986). The California child abuse reporting law: Issues and answers for professionals
(Publication No. 132). Sacramento: California Department of Social Services, Office of
Child Abuse Prevention.
Grisso, T. (1991). Minors' assent to behavioral research without parental consent.
In B. Stanley & J. E. Sieber (Eds.), The ethics of research on children and adolescents
(pp. 109–127). Newbury Park, CA: Sage.
Howard, J. (2006, November 10). Oral history under review. Chronicle of Higher Education,
53(12), A14.
Humphreys, L. (1970). Tearoom trade: A study of homosexual encounters in public places.
London: Duckworth.
Jones, J. (1981). Bad blood. New York: Free Press.
Katz, J. (1972). Experimentation with human beings. New York: Russell Sage.
Kipnis, K. (2001). Vulnerability in research subjects: A bioethical taxonomy. In Ethical and pol-
icy issues in research involving human participants: Vol. 2. Commissioned papers and staff
analysis (pp. G-1–G-13). Bethesda, MD: National Bioethics Advisory Commission.
Retrieved March 26, 2008, from http://bioethics.georgetown.edu/nbac/human/overvol2.pdf
Kipnis, K. (2004). Vulnerability in research subjects: An analytical approach. In D. Thomasma
& D. N. Weisstub (Eds.), Variables of moral capacity (pp. 217–231). Dordrecht, The
Netherlands: Kluwer Academic.
Klockars, C. B. (1974). The professional fence. New York: Free Press.
Kraut, R., Olson, J., Banaji, M., Bruckman, A., Cohen, J., & Couper, M. (2004). Psychological
research online: Report of Board of Scientific Affairs Advisory Group on the Conduct
of Research on the Internet. American Psychologist, 59(2), 105–117.
Laufer, R. S., & Wolfe, M. (1977). Privacy as a concept and a social issue: A multidimensional
developmental theory. Journal of Social Issues, 33, 44–87.
Levine, R. (2006). Empirical research to evaluate Ethics Committees' burdensome and per-
haps unproductive policies and practices: A proposal. Journal of Empirical Research on
Human Research Ethics, 1(3), 1–4.
Martinson, B., Anderson, M., Crain, L., & DeVries, R. (2006). Scientists' perceptions of orga-
nizational justice and self-reported misbehavior. Journal of Empirical Research on
Human Research Ethics, 1(1), 51–66.
Melton, G., & Stanley, B. (1991). Research involving special populations. In B. Stanley,
J. Sieber, & G. Melton (Eds.), Psychology and research ethics (pp. 177–202). Lincoln:
University of Nebraska Press.
Milgram, S. (1974). Obedience to authority. New York: Harper & Row.
National Bioethics Advisory Commission. (2001). Report and recommendations: Vol. 1.
Ethical and policy issues in research involving human participants (pp. 11–25). Bethesda,
MD: Author.
National Institutes of Health. (2003). Final NIH statement on sharing of research data.
Retrieved March 26, 2003, from http://grants.nih.gov/grants/guide/notice-files/NOT-
OD-03-032.html
Newman, E., Risch, E., & Kassam-Adams, N. (2006). Ethical issues in trauma-related
research: A review. Journal of Empirical Research on Human Research Ethics, 1(3), 29–46.
Newman, E., Willard, T., Sinclair, R., & Kaloupek, D. (2001). The costs and benefits of
research from the participants' view: The path to empirically informed research prac-
tice. Accountability in Research, 8, 27–47.
O'Rourke, J. M., Roehrig, S., Heeringa, S. G., Reed, B. G., Birdsall, W. C., Overcashier, M., et al.
(2006). Solving problems of disclosure risk while retaining key analytic uses of publicly
released microdata. Journal of Empirical Research on Human Research Ethics, 1(3), 63–84.
Pelto, P. J. (1988, February 18–20). [Informal remarks]. In J. E. Sieber (Ed.), Proceedings of a
conference on sharing social research data, National Science Foundation/American
Association for the Advancement of Science, Washington, DC. Unpublished manuscript.
Public Health Service Act, 301[d], 42 U.S.C. 242a (1988).
Renzetti, C. M., & Lee, R. M. (Eds.). (1993). Researching sensitive topics. Newbury Park, CA: Sage.
Rodgers, W., & Nolte, M. (2006). Solving problems of disclosure risk in an academic setting:
Using a combination of restricted data and restricted access methods. Journal of
Empirical Research on Human Research Ethics, 1(3), 85–97.
Rotheram-Borus, M. J., & Koopman, C. (1991). Protecting children's rights in AIDS research.
In B. Stanley & J. E. Sieber (Eds.), The ethics of research on children and adolescents
(pp. 143–161). Newbury Park, CA: Sage.
Rubin, P., & Sieber, J. (2006). Empirical research on IRBs and methodologies usually associated
with minimal risk. Journal of Empirical Research on Human Research Ethics, 1(4), 1–4.
Sieber, J. E. (1992). Planning ethically responsible research: A guide for students and internal
review boards. Newbury Park, CA: Sage.
Singer, E. (2003). Exploring the meaning of consent: Participation in research and beliefs
about risks and benefits. Journal of Official Statistics, 19, 333–342.
Singer, E. (2004). Confidentiality assurances and survey participation: Are some requests
for information perceived as more harmful than others? [Invited paper]. In S. Cohen &
J. Lepkowski (Eds.), Eighth conference on health survey research methods (pp. 183–188).
Hyattsville, MD: National Center for Health Statistics.
Singer, E., Hippler, H., & Schwarz, N. (1992). Confidentiality assurances in surveys:
Reassurance or threat? International Journal of Public Opinion Research, 4, 256–268.
Thompson, R. A. (1991). Developmental changes in research risk and benefit: A changing
calculus of concerns. In B. Stanley & J. E. Sieber (Eds.), The ethics of research on children
and adolescents (pp. 31–64). Newbury Park, CA: Sage.
Willis, G. (2006). Cognitive interviewing as a tool for improving the informed consent
process. Journal of Empirical Research on Human Research Ethics, 1(1), 9–24.
Wolf, L., & Zandecki, J. (2006). Sleeping better at night: Investigators' experiences with
Certificates of Confidentiality. IRB: Ethics & Human Research, 28(6), 1–7.
Zarate, A., & Zayatz, L. (2006). Essentials of the disclosure review process: A federal perspec-
tive. Journal of Empirical Research on Human Research Ethics, 1(3), 51–62.
PART II
Applied Research
Designs
In this section of the handbook we move from the broader design and planning
issues raised in Part I to more specific research designs and approaches. In Part I,
the contributors noted the unique characteristics of applied research and
discussed issues such as sampling, statistical power, and ethics. In Part II, the
focus narrows to particular types of designs, including experimental and quasi-
experimental designs, case studies, needs analysis, cost-effectiveness evaluations,
and research synthesis.
In Chapter 5, Boruch and his co-authors focus on one type of design, the ran-
domized experiment. The randomized study is considered the gold standard for
studying interventions, both in applied settings and more basic research settings.
Boruch et al. provide justifications for this widespread belief, noting the investiga-
tions that have demonstrated the relative strengths of randomized studies over
quasi-experiments. However, implementing a randomized design in field settings is
difficult. Through the use of multiple examples, the chapter describes some of the
best ways to implement this design. The authors note the need to conduct pipeline
studies, as well as the need for careful attention to the ethical concerns raised by
randomized experiments. They also discuss the management requirements of a
randomized design and issues concerning the reporting of results. Through the use
of examples they illustrate how to plan and implement a randomized experiment.
Although randomized experiments represent the gold standard, it is not always
possible to conduct such research. In Chapter 6, Mark and Reichardt move us from
the simpler, but elegant, randomized design to a discussion of quasi-experiments.
They reconceptualize the traditional ways of thinking about the several forms of
validity. Their approach clarifies many of the problems of previous schemes for
describing the variety of quasi-experiments. Chapter 6 can serve as a guide for
these data, using qualitative and/or quantitative methods, will then lead to more
defensible findings and conclusions. Yin provides four examples of analytic strate-
gies, including pattern-matching, explanation building, chronological analysis, and
constructing and testing logic models. The chapter draws upon numerous examples
from several fields to cover these topics and provide concrete and operational advice
for readers.
In Chapter 9, Tashakkori and Teddlie note the increasing frequency of mixed
methods designs in applied social research. The widespread popularity of mixed
methods is seen in the number of texts written, the growing number of references
on the Internet, and even a journal devoted to the field, the Journal of Mixed Methods
Research. The authors broadly define mixed methods as research in which the
researcher collects and analyzes data from both qualitative and quantitative
approaches, integrates the findings, and draws inferences from the analysis. In this
chapter, the authors begin by offering the assumptions that guide their approach to
mixed methods, chief among them the view that qualitative and quantitative
methods are not dichotomous or discrete but lie on a continuum of approaches.
They then provide an overview of various integrative approaches to sampling, data
collection, data analysis, and inferences, and end with a discussion of the issues
involved in evaluating the inferences made based on the results.
Michael Harrison in Chapter 10 offers an introduction to organizational diagno-
sis, the use of conceptual models and applied research methods to conduct an assess-
ment of an organization that can inform decision-making. Similar to evaluation
research, organizational diagnosis is practically oriented and can involve a focus on
both implementation and effectiveness. What distinguishes organizational diagnosis
is that its focus is typically broader than a program evaluation, with an examination
of organizational features and a wide range of indicators of effectiveness. To provide
both useful and valid information for a client, Harrison highlights three key aspects
of diagnosis: process, modeling, and methods. Process involves interacting with the
clients and other stakeholders over the course of a study. Modeling refers to using
research-based models to guide the study, including models and frames for identify-
ing what to study, framing the problem, choosing effectiveness criteria, determining
which organizational conditions to examine for their influence on effectiveness, and
organizing and providing feedback to the clients. Methods refers to techniques for
gathering, summarizing, and analyzing data that can provide both rigorous and
valid results. Harrison stresses that there is no step-by-step guide to conducting a
diagnosis, but rather a set of choices that the diagnosis practitioner must make. The
ultimate task is to use methods and models from the behavioral and organization
sciences to help identify what is going on in an organization and to help guide clients
in making decisions based on this information.
As we noted in our introduction, a major theme of this handbook is the impor-
tance of accumulating knowledge in substantive areas so as to make possible more
definitive answers to key questions. Do we have the tools and methods in applied
research to pull together the vast number of studies that have been completed? In
Chapter 11, Cooper, Patall, and Lindsay summarize a number of useful meta-analytic
techniques to produce quantitative summaries of often hundreds of studies. Although
most of these techniques have been developed in the past 20 to 25 years, the authors,
in a brief history of research synthesis and meta-analysis, note that the first meta-
analysis was actually published in 1904 by Karl Pearson and was followed by more
than a dozen papers on techniques for statistical combination of findings prior to
1960. In recent years, there has been an explosion of published meta-analyses, and
two networks, the Cochrane Collaboration and the Campbell Collaboration, are
the leading producers of research syntheses in health care and social policy, respec-
tively, and are considered the gold standard for determining the effectiveness of dif-
ferent interventions in these areas.
In addition to presenting a brief history of the method and an overview of a
number of statistical strategies for combining studies, Cooper et al. review the
stages of research synthesis, including problem formulation, literature search, data
evaluation, analysis and interpretation, and public presentation. Because an over-
riding purpose of the chapter is to help researchers distinguish good from bad syn-
theses, the authors discuss the difficult decisions that researchers face in conducting a
meta-analysis (e.g., handling missing data), and address the criteria that need to be
considered in evaluating the quality of both knowledge syntheses more generally
and meta-analysis in particular.
CHAPTER 5
Randomized Controlled
Trials for Evaluation and
Planning
Robert F. Boruch
David Weisburd
Allison Karpyn
Julia Littell
Suppose you were asked to determine the effectiveness of a new police strat-
egy to reduce crime and disorder at crime hot spots. The police had deter-
mined that a limited number of blocks in the city were responsible for a large
proportion of crime and disorder and had decided to crack down on those high
crime areas. The strategy involved concentrating police patrol at the hot spots,
rather than simply having the police spread their resources thinly across the city. A
study of the topic would require comparing the crime rates and disorder at the hot
spots after police intervention, with rates of crime at places that did not receive the
intervention. The study's objective is to establish whether concentrating patrol at
hot spots will reduce crime and disorder at those places.
In an uncontrolled or observational study, particular hot spots would be tar-
geted based on the preferences of police commanders who are often pressured by
citizens to do something about crime on their block. This selection factor, born of
commanders' preferences, leads to two groups of hot spots that are likely to differ
systematically. Those hot spots that receive the innovative policing program may,
for example, have higher rates of crime or disorder. The targeted hot spots, for
instance, may be places with wealthier citizens who are perhaps more able to apply
pressure to the police, or places in which citizens are simply better organized and thus have more contacts with the department. They may be in certain areas of the
city where police patrol is ordinarily concentrated, or areas close to businesses,
schools, or community centers that are seen as deserving special police attention.
Each of these factors, of course, may influence the primary outcomes of interest, crime and disorder, and may affect how effective or ineffective the police are in
doing something about these problems.
The differences between the groups that evolve from natural processes, rather than
a controlled study, will then be inextricably tangled with the actual effect of police
patrol on crime, if indeed there is an effect. A simple difference in crime between the
two naturally occurring (nonrandomized) groups, one that received the intervention
and one that did not, will not then register the effect of the intervention alone. It will
reflect the effect of police patrol at hot spots and the combined effect of all selection
factors: commanders' preferences, political clout, socioeconomic factors, the location
of institutions thought important to the police, and so on. As a consequence, the esti-
mate of the effect of police patrol at hot spots based on a simple difference between
the groups is equivocal at best. It may be misleading at worst.
Crime in the self-selected hot spots policing area, for instance, may be higher
following the intervention, making it appear that hot spots patrol increases crime,
when in fact it had no effect. For example, burglaries may be higher in the hot spots
patrol area because the places targeted included people with higher incomes. Their
relative wealth might have given them preference when the program was initiated,
but it also might suggest higher burglary rates since such places will naturally be
more attractive targets: they have more goods that can be stolen. The point is that
a simple observational study comparing crime hot spots that received extra patrol
and those that did not will yield a result that cannot be interpreted easily.
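A small simulation can make this confounding concrete. The sketch below is a toy model, not the authors' data: hot spots with higher baseline crime are more likely to be selected for extra patrol, the patrol has no true effect at all, and yet the naive comparison of targeted and untargeted spots shows a sizable difference.

```python
import random
import statistics

random.seed(2008)

targeted, untargeted = [], []
for _ in range(1000):
    baseline = random.gauss(50, 10)      # crime level before any program
    true_effect = 0.0                    # the patrol does nothing in this toy model
    outcome = baseline + true_effect + random.gauss(0, 5)
    # Selection factor: commanders are more likely to target high-crime spots.
    if random.random() < baseline / 100:
        targeted.append(outcome)
    else:
        untargeted.append(outcome)

naive_difference = statistics.mean(targeted) - statistics.mean(untargeted)
print(f"Naive 'effect' of patrol: {naive_difference:+.1f} crimes "
      "(purely selection; the true effect is zero)")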
Eliminating the selection factors in evaluations that are designed to estimate the
relative effectiveness of alternative approaches to reducing the incidence of violence
is difficult. The Hot Spots Policing experiments described by Weisburd (2005) met this
challenge through randomized trials. Related kinds of selection issues affect nonran-
domized studies that are used to assess the impact of initiatives in human resources
training programs, health care, education, and welfare, among others. They also affect
studies that purport to match places or individuals in each group to the extent that
matching is imperfect or incomplete in ways that are unknown or unknowable.
That many applied research and evaluation projects cannot take selection factors
into account does not mean such studies are useless. Some of them are, of course.
It does imply that, where appropriate and feasible, researchers ought to exploit valid
methods for estimating the relative effects of initiatives, methods that are not vul-
nerable to selection problems and do not lead to estimates that are equivocal or
biased in unknown ways. Randomized field trials, the focus of this chapter, are less
vulnerable to such problems.
This chapter covers basic definitions and aims of randomized trials and the dis-
tinction between this approach and others that purport to estimate effects of inter-
ventions. Illustrations are considered next, partly to show how trials are mounted
in different arenas, partly to provide evidence against naive academic, institutional,
and political claims that such trials are not feasible. We next consider basic ingredi-
ents of a randomized trial; each ingredient is handled briefly. The final section sum-
marizes efforts to develop capacity. This chapter updates one that appeared in the
earlier edition of Bickman and Rog (1998); the update is no easy task given the
remarkable expansion in trials over the past decade in education, crime and justice,
social services, and other areas.
Distinctions
Randomized trials are different from observational studies in which there is an
interest in establishing cause-effect relations, but there is no opportunity to assign
individuals to alternative interventions using a randomization plan (Cochran,
1983; Rosenbaum, 2002). Such studies are often based on survey samples and
depend on specialized methods for constructing comparison groups and estimating the effects of interventions.
Observational studies can and often do produce high-quality descriptive data on
the state of individuals or groups. They can provide promissory notes on what
works or what does not, conditional on assumptions that one might be willing to
make. They cannot always sustain defensible analyses of the relative effects of dif-
ferent treatments, although they are often employed to this end. Statistical advances
in the theory and practice of designing better observational studies, and in analyz-
ing resultant data and potential biases in estimates of an intervention's effects, are
covered by Rosenbaum (2002).
Randomized field tests also differ from quasi-experiments. Quasi-experiments
have the object of estimating the relative effectiveness of different interventions that
have a common aim, just as randomized experiments do. But the quasi-experiments
depend on methods other than randomization to rule out competing explanations
for differences in the outcomes of competing interventions or to recognize bias in
the estimates of a difference. In some respects, quasi-experiments aim to approxi-
mate the results of randomized field tests (Campbell & Stanley, 1966; Cochran,
1983; Shadish, Cook, & Campbell, 2002).
Important statistical approaches have been invented to try to isolate the relative
effects of different interventions based on analyses of data from observational
surveys and quasi-experiments of the interventions. These approaches attempt to
recognize all the variables that may influence outcomes, including selection factors, to measure them, and to separate the intervention effects from other factors.
Advances in this arena fall under the rubrics of structural models, selection models,
and propensity scores. Antecedents and augmentations to these approaches include
ordinary least squares regression/covariance analysis and matching methods.
The scientific credibility of some of these techniques is reviewed on empirical
grounds by Glazerman, Levy, and Myers (2003) in the context of employment,
training, and education. See Weisburd, Lum, and Petrosino (2001) for criminolog-
ical research comparing results of randomized trials, including quasi-experiments
with the results of nonrandomized trials; and Chalmers (2003) and Deeks et al.
(2003) for analogous comparisons of studies of effects of health interventions.
Victors (2007) dissertation gives a review of statistical matching methods in quasi-
experiments and reports on simulation studies on how propensity scores and ordi-
nary least squares regression can produce better estimates of effect than competing
models/analyses in such quasi-experimental designs. The general conclusion one
reaches based on such empirical work is that estimates of an intervention's effect
based on randomized trials often differ in both magnitude and variability from
those based on nonrandomized studies. The reasons for such differences are an
important target for new methodological research.
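As a rough illustration of one such approach, the sketch below estimates propensity scores with a logistic regression and then compares outcomes within coarse propensity strata. The data are simulated, the libraries (NumPy and scikit-learn) and variable names are ours, and this is only a minimal sketch of the idea, not a substitute for the careful designs and diagnostics that Rosenbaum (2002) describes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000

# Simulated covariates that drive both program take-up and the outcome;
# the true program effect is fixed at 2.0 in this toy data set.
x = rng.normal(size=(n, 2))
p_take_up = 1 / (1 + np.exp(-(0.8 * x[:, 0] - 0.5 * x[:, 1])))
treated = rng.random(n) < p_take_up
outcome = 2.0 * treated + x[:, 0] + 0.5 * x[:, 1] + rng.normal(size=n)

# Step 1: model the probability of treatment given covariates (the propensity score).
propensity = LogisticRegression().fit(x, treated).predict_proba(x)[:, 1]

# Step 2: compare treated and untreated cases within propensity-score strata.
strata = np.digitize(propensity, np.quantile(propensity, [0.2, 0.4, 0.6, 0.8]))
effects, weights = [], []
for s in np.unique(strata):
    m = strata == s
    if treated[m].any() and (~treated[m]).any():
        effects.append(outcome[m][treated[m]].mean() - outcome[m][~treated[m]].mean())
        weights.append(m.sum())

naive = outcome[treated].mean() - outcome[~treated].mean()
stratified = np.average(effects, weights=weights)
print(f"Naive difference: {naive:.2f}; propensity-stratified estimate: {stratified:.2f}")
```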
In this chapter, the phrases randomized experiment and randomized trial will be
used interchangeably with other terms that have roughly the same meaning and are
common in different research literatures. These terms include randomized test and
randomized social experiments, used frequently during the 1970s and 1980s. They
also include randomized clinical trials, a phrase often used to describe the same
design for evaluating the relative effectiveness of medical or pharmaceutical treat-
ments, for example, Piantadosi (1997) and Donner and Klar (2000). Similarly,
the phrases cluster randomized, place randomized, and group randomized are
used interchangeably when independent entities or independent assemblies of
related individuals or entities are randomly assigned to different regimens.
Experiments in Context
The main benefit of a randomized trial is unbiased estimates of the relative effect
of interventions coupled with a statistical statement of one's confidence in results.
The benefit must be put into the broader context of applied social research, of
course. Addressing questions about the nature of the phenomenon or problem at
hand, and producing evidentiary answers, precedes any good trial. Determining
how interventions may be constructed and deployed, and generating evidence on
such determinations, must also precede such trials. It is only after such questions
are addressed that it makes sense to undertake controlled trials so as to answer
questions about effect.
Understanding which questions to address, in what conditions, and when, is an
ingredient of research policy. The need to arrange one's thinking about this under-
standing has been reiterated and elaborated in recent tracts on applied research on
crime prevention (Lipsey et al., 2005), in education (Shavelson & Towne, 2002), and
in the context of federal policies more generally (Julnes & Rog, 2007). The message
in these and other works is that the question to be addressed drives the methods to be
used to generate dependable evidence.
Further questions depend on having answered questions about problem scope,
program deployment, and program effect. What is the cost-effectiveness ratio for
programs that have been tested? How can the evidence on any question be employed
well in systematic reviews, legislation, and generation of practice guidelines? How
can the trialists keep abreast of the state of the art in each question category?
This chapter focuses mainly on randomized trials. Other chapters in this
Handbook carry the weight in addressing other related topics. See also Rossi, Lipsey,
and Freeman (2004) and Stufflebeam and Shinkfield (2007) for randomized trials
in a broader evaluation context.
Education
In education as in other arenas, researchers may randomly allocate eligible and
willing teachers, individuals, classrooms, schools, and other entities to different
interventions in order to produce good estimates of their relative effectiveness. The
choice of the experiment's unit of assignment in education, as in other social sectors, depends on the nature of the intervention and on whether the units can be
regarded as statistically independent. For instance, entire schools have been ran-
domly assigned to alternative regimens in dozens of studies designed to determine
whether schoolwide campaigns could delay or prevent youngsters' use of tobacco,
alcohol, and drugs (e.g., Flay & Collins, 2005). In a milestone experiment on class
size, students and teachers were randomly assigned to small classes or to regular
classes in Tennessee to learn whether smaller classes would yield higher achieve-
ment levels and for whom (Finn & Achilles, 1990; Mosteller, Light, & Sachs, 1995).
See Stufflebeam and Shinkfield (2007) for a description of this and other remark-
able precedents.
Over the past decade, the role of randomized trials in education has changed
remarkably. Between 1999 and 2006, for instance, the Interagency Education
Research Initiative funded about 20 small-, moderate-, and large-scale trials. This
joint effort to develop and evaluate programs in science, mathematics, and reading
involved thousands of students in at least a dozen states over 5 years (Brown,
McDonald, & Schneider, 2006). In the United States, the Institute of Education
Sciences (IES) began in 2000 to lead the way toward more dependable evidence on the effects of interventions from randomized trials, in the face of notable criticism. The
IES Director's Report to the Congress notes that only one substantial trial was
underway in 2000 (U.S. Department of Education, 2007). Spybrook's (2007) fine
dissertation on statistical power in certain kinds of trials identified nearly 60 trials
supported by IES between 2001 and 2006. This is a lower bound on the number of
recently sponsored trials in that Spybrook focused only on group randomized
trials in her research and could not handle trials undertaken by Regional Education
Laboratories during 2006–2007.
The William T. Grant Foundation (2007) played a leadership role in the private foundation sector through its support of randomized trials and its efforts to build the research community's capacity to implement such trials. After-school programs and summer
programs in math and reading, for instance, have been a special focus. Large-scale
cluster trials have been supported on schoolwide mentoring, socioemotional learn-
ing, literacy, positive youth development, school-based prevention, and reading.
The effects of a program cannot be separated from the characteristics of eligible individuals who elect (or do not elect) to enter a new program, unless a controlled randomized trial is done.
Tax Administration
The interests of the U.S. Internal Revenue Service (IRS) and of tax agencies in
other countries lie partly in understanding how citizens can be encouraged to pay
the proper amount of taxes. Randomized trials in this arena have also been under-
taken. For example, delinquent taxpayers identified by the IRS have been randomly
assigned to different strategies to encourage payment, and they are then tracked to
determine which strategies yielded the best returns on investment (Perng, 1985).
Other experiments have been undertaken to determine how tax forms may be sim-
plified and how taxpayer errors might be reduced through various alterations in tax
forms (e.g., Roth, Scholz, & Witte, 1989). Such research extends a remarkable early
experiment by Schwartz and Orleans (1967) to learn how people might be per-
suaded to report certain taxable income more thoroughly. In an ambitious update
of this work, Koper, Poole, and Sherman (2006) focused on 7,000 businesses in
Pennsylvania that had not complied with the state's sales tax code. Moral appeals, personal letters, and threats were tested in a randomized trial to understand whether they had appreciable effects on payment.
A National Academy of Sciences panel concluded that studies that focused police resources on
crime hot spots provide the strongest collective evidence of police effectiveness that
is now available (Skogan & Frydl, 2004; see also Weisburd & Eck, 2004).
Trialists have undertaken several substantial reviews of randomized field exper-
iments in civil and criminal justice. Dennis (1988), for instance, analyzed the fac-
tors that influenced the quality of 40 such trials undertaken in the United States.
His dissertation updated Farrington's (1983) examination of the rationale, conduct,
and results of randomized experiments in Europe and North America. Farrington
and Welsh's (2005) review covers more than 80 trials.
The range of interventions whose effectiveness has been evaluated in these ran-
domized controlled trials is remarkable. They have included efforts to appraise rel-
ative effects of different appeals processes in civil court, telephone-based appeals
hearings, restorative justice programs, victim restitution plans, jail time for offend-
ers, diversion from arrest, arrest versus mediation, juvenile diversion and family
systems intervention, probation rules, bail procedures, work-release programs for
prisoners, and sanctions that involve community service rather than incarceration.
Nutrition
With rates of obesity approaching 20% for children and 60% for adults in the
United States, there is increasing interest in understanding effective prevention and
intervention strategies (University of Virginia Health Systems, 2008). Programs that
demonstrably prevent overweight and obesity are of interest in school and com-
munity settings. As a result, randomized trials have been undertaken to assess
school-based nutrition education and environmental change efforts, programs to
maximize nutrition and health prevention efforts among those receiving federal assistance program benefits, and work-site interventions.
unclear, and in which biases may be chronic. Their paper reviews some efforts to
mount controlled trials in the interest of better evidence for evidence-based man-
agement. Such trials, they report, have been undertaken at times to understand
the effects of different marketing strategies in the hotel and legal gambling busi-
ness, global Web-based services industry, convenience store chains, and elsewhere.
Individual customers, corporate units, or stores may be the units of random allocation. The authors' examples are brief but provocative. However, it is difficult
to gauge scope and quality in this arena of applications on account of proprietary
aspects of the research.
In another kind of marketing arena, Gerber (2004) points out that "virtually all of the work on candidate spending effects have been based on nonexperimental evidence" (p. 544). His article reviews the few and very recent efforts to assess effects
of political campaign spending (and different campaign programs) on vote share,
voter preferences, and other election outcomes. Randomization in some experi-
ments is at the household level; in others it is at the ward level. Gerber's handling of the topic is distinctive in trying to synthesize and reconcile results of both the trials and related nonexperimental studies and in building more nuanced theory (mod-
els) of when and how incumbent spending has positive, negative, or no effects.
The first three topics are considered next. The subsequent topics are considered
in the following section under the rubric of the Experiment's Design.
In the Hot Spots Patrol experiments, for example, the primary question was,
"Does the focus of police resources such as preventative patrol in specific areas where crime is high, as opposed to a more even spread of policing activities in a city, lead to crime prevention benefits?" The question was developed from theoretical
debate and empirical evidence that crime is tightly clustered in urban areas and that
such clustering is due to the presence of specific opportunities for crime and the
presence of motivated offenders at crime hot spots.
Cohen and Felson's (1979) theory of routine activities was an important cata-
lyst for the hot spots policing studies (Weisburd, 2005). Prior theorizing in crimi-
nology had focused on individual offenders and the possibilities for decreasing
crime by focusing criminal justice resources either on their incapacitation or rehabilitation or on deterring them from future offending. This offender-based crimi-
nology dominated crime and justice interventions for most of the past century, but
it was criticized extensively beginning in the 1970s for failing to provide the crime
prevention benefits that were often promised (Brantingham & Brantingham, 1975;
Martinson, 1974). Cohen and Felson (1979) observed that for criminal events to
occur, there is need not only of a criminal but also of a suitable target and the
absence of a capable guardian. Their theory suggested that crime rates could be
affected by changing the nature of targets or of guardianship, without a specific
focus on offenders themselves. Drawing on similar themes, British scholars led by
Ronald Clarke began to explore the theoretical and practical possibilities of situa-
tional crime prevention (Clarke, 1983, 1992, 1995; Cornish & Clarke, 1986). Their
focus was on criminal contexts and the possibilities for reducing the opportunities
for crime in very specific situations. Their approach, like that of Cohen and Felson,
placed opportunities for crime at the center of the crime equation. One natural
outgrowth of these perspectives was that the specific places where crime occurs
would become an important focus for crime prevention researchers (Eck &
Weisburd, 1995; Taylor, 1997).
In the mid- to late 1980s, a group of criminologists began to examine the distri-
bution of crime at places such as addresses, street segments, and small clusters of
addresses or street segments. Perhaps the most influential of these studies was con-
ducted by Sherman, Gartin, and Buerger (1989). Looking at crime addresses in the
city of Minneapolis, they found a concentration of crime there that was startling.
Only 3% of the addresses in Minneapolis accounted for 50% of the crime calls to
the police. Similar results were reported in a series of other studies in different loca-
tions and using different methodologies, each suggesting a very high concentration
of crime in microplaces (e.g., see Pierce, Spaar, & Briggs, 1986; Weisburd, Bushway,
Lum, & Yang, 2004; Weisburd & Green, 1994; Weisburd, Maher, & Sherman, 1992).
This empirical research reinforced theoretical perspectives that emphasized the
importance of crime places and suggested a focus on small areas, often encompassing only one or a few city blocks, that could be defined as crime hot spots.
While the Minneapolis Hot Spots Patrol Experiment (Sherman & Weisburd,
1995) examined whether extra police presence would have crime prevention
impact at hot spots, other studies began to study whether different types of police
strategies such as problem-oriented policing would enhance crime prevention ben-
efits at hot spots (see, e.g., Braga, Weisburd, Waring, & Mazerolle, 1999; Weisburd
& Green, 1995). Importantly, later studies also examined the theory that crime
would simply be displaced to other areas near the targeted hot spots. If crime
simply moved around the corner, then such hot spots approaches would not be
very useful for decreasing crime and disorder more generally in a city (Weisburd
et al., 2006). In the Jersey City Drug Market Analysis Experiment (Weisburd &
Green, 1995), for example, displacement within two block areas around each hot
spot was measured. No significant displacement of crime or disorder calls was
found. Importantly, however, the investigators found that drug-related and public
morals calls actually declined in the displacement areas. This "diffusion of crime control benefits" (Clarke & Weisburd, 1994) was also reported in the New Jersey
Violent Crime Places experiment (Braga et al., 1999) and the Oakland Beat Health
experiment (Mazerolle & Roehl, 1998).
Rossi et al. (2004) and Stufflebeam and Shinkfield (2007) elaborated on the role
of theory in the context of randomized trials and other types of evaluation that
address questions that precede or succeed an impact evaluation. Wittman and
Klumb (2006) provided counsel about how researchers might deceive themselves
about testing theory in the context of randomized experiments, considering the topic's history since the 1950s.
Substantive theory, implicit or explicit, also drives the choice of outcome vari-
ables to be measured in a randomized trial. In the Crime Hot Spots experiments,
researchers relied on emergency calls for police service to measure program out-
comes, because such calls were seen as a direct measure of criminal activity in the
hot spots. The question was not whether individual offenders reduced their motivations to commit crime, which would have been best noted in surveys or interviews with offenders, but whether crime and disorder were reduced. In Tennessee's exper-
iments on class size, Finn and Achilles (1990) measured student achievement as an
outcome variable based on theory and earlier research about how class size might
enhance children's academic performance.
Well-articulated theory can also help to determine whether and which context
(setting) variables need to be measured. For instance, most trials on new employ-
ment and training programs have measured the local job market in which the
program is deployed. This is based on rudimentary theory of demand for and sup-
ply of workers. Knowing that there are no jobs available in an area, for example, is
important for understanding the results of a trial that compares wage rates of par-
ticipants in new training programs against wages of those involved in ordinarily
available community employment and training programs.
Finally, theory may also drive how one interprets a simple comparison of the
outcomes of two programs, deeper analyses based on data from the experiment at hand, and broader analyses of the experiment in view of research in the topical area generally. Rossi et al. (2004) discussed different kinds of hypotheses. The implication
is that we ought to have a theory (an enlarged hypothesis or hypothesis system) that
addresses people and programs in the field, a theory about the interventions in the
trial given the field theory, and a theory about what would happen if the results of
the trial were exploited to change things in the field.
A bottom line for trialists is that the theory or logic about how the intervention
is supposed to work ought to be explicit. It is up to the design team for the ran-
domized trial to draw that theory into the open, so as to assure that the trial exploits
all the information that must be exploited in designing the trial.
of the pipeline of cases. For instance, each of the investigators in the SARP studies
(Boruch, 1997; Garner et al., 1995) developed such a study prior to each of six
experiments. In most, the following events and relevant numbers constituted the
evidential base: total police calls received, cases dispatched on call, cases dispatched
as domestic violence cases, domestic cases that were found on site actually to be
domestic violence cases, and domestic cases in which eligibility requirements were
met. In one site over a 2-year period, for example, nearly 550,000 calls were dis-
patched; 48,000 of these were initially dispatched as domestic cases. Of these, only
about 2,400 were actually domestic disputes and met eligibility requirements. That
is, the cases that involved persons in spouselike relationships, in which there were
grounds for believing that misdemeanor assaults had occurred, and so on, were far
fewer than those initially designated as domestic by police dispatchers.
Pipeline studies have been undertaken in other social experiments. See Bickman and
Rog (1998; the earlier edition of this Handbook) for examples from the 1980s and 1990s.
Generally, a pipeline study would describe in quantitative and qualitative terms eli-
gible target populations, obtained samples, and rates of nonparticipation, crossovers,
and attrition. St. Pierre (2004) gives informative examples from education and eco-
nomic trials that would be incorporated into a pipeline study. The pipeline is suffi-
ciently important that the CONSORT statement recommends routine reporting on this
matter in health care trials (Mohler, Schultz, & Altman, 2001). Flay et al. (2005) make
a similar recommendation for the behavioral and education sciences.
Population, power, and pipeline are intimately related to one another in ran-
domized field trials. Considering them together in the study's design is essential.
Where this consideration is inadequate or based on wrong assumptions, and espe-
cially when early stages of the trial show that the flow of cases into the trial is sparse,
drastic change in the trial's design may be warranted. Such changes might include
terminating the study, of course. Change might include extending the time frame
for the trial so as to accumulate adequate sample sizes in each arm of the trial.
Intensifying outreach efforts so as to identify and better engage target cases is
another common tactic for assuring adequate sample size.
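The arithmetic linking the expected pipeline to statistical power can be sketched with the standard normal-approximation formula for a two-arm trial. The sketch below uses purely illustrative numbers; it shows roughly how many eligible cases each arm would need to detect a given standardized effect, and hence how sparse a pipeline can become before the design must change.

```python
import math

def n_per_arm(effect_size, z_alpha=1.96, z_beta=0.84):
    """Approximate cases needed per arm to detect a standardized mean
    difference `effect_size` with a two-sided test at alpha = .05 and
    80% power (normal approximation, equal 1:1 allocation)."""
    return math.ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)

# Illustrative planning table: small to moderate effects need large pipelines.
for d in (0.2, 0.35, 0.5):
    print(f"standardized effect {d}: about {n_per_arm(d)} eligible cases per arm")
```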
Interventions
Interventions here mean the programs or projects, program components, or
program variations whose relative effectiveness is of primary interest in a random-
ized trial. In the simplest case, this implies verifying and documenting activity
undertaken in both the program being evaluated and the control condition in
which that program is absent.
Interventions are, of course, not always delivered as they are supposed to be.
Math curricula have been deployed in schools but teachers have not always deliv-
ered the curriculum as intended. Fertility control devices designed to reduce
birthrates have not been distributed to potentially willing users. Human resources
training projects have not been put into place, in the sense that appropriate staffs
have not been hired. Drug regimens have been prescribed for tests, but individuals
assigned to a drug do not always comply with the regimen.
Random Assignment
Technical advice on how to assign individuals or entities randomly to interven-
tions is readily available in statistical textbooks on design of experiments. Technical
advice is necessary but insufficient. Researchers must also recognize the realities of
field conditions. Inept or subverted assignments are, for example, distinct possibili-
ties. See Boruch (1997) for early examples that are becoming admirably less frequent.
Contemporary good practice focuses on who controls the random assignment
procedure, when the procedure is employed, and how it is structured. Practice is
driven by scientific standards that demand that the random assignment cannot
be anticipated by service providers, for instance, and therefore subverted easily.
Contemporary standards require that the assignments cannot be subverted post
facto and cannot be manipulated apart from the control exercised by a blind assign-
ment process. As a practical matter, these standards usually preclude processes that
are easily subverted, such as coin flips and card deck selections.
In studies such as the Hot Spots Policing experiments, cases that are eligible are
often known in advance to trialists and so the trialist can randomize cases before
the experiment even begins. In this scenario and others, contemporary experiments
employ a centralized randomization procedure that assures quality control and
independence of the interventions delivery. Trials undertaken to test mathematics
curriculum packages by the Mid Atlantic Regional Laboratory, for instance, include
centralized assignment of schools based on well-defined eligibility criteria (Turner,
2007). The Mid Atlantic Regional Education Laboratory's various trials on Odyssey
Math involved 32 schools, 24 classrooms, and 2,800 students. In one such trial, eli-
gible classrooms were randomly assigned to interventions within schools using a
random assignment algorithm that was commercially available (Excel's random function), which was tested by the Laboratory's Technical Group and then applied
by an independent organization, Analytica Inc. (Turner, 2007).
The random allocation's timing is important in several respects. A long interval between the assignment and the intervention's delivery can engender the problem
that assigned individuals disappear, engage in alternative interventions, and so on.
For example, individuals assigned to one of two different employment programs
may, if engagement in the programs is delayed, seek other options. The experiment
then is undermined. A similar problem can occur in tests of programs in rehabili-
tation, medical services, and civil justice. The implication is that assignment should
take place as close as possible to the point of entry to the intervention.
The random assignment process must be structured so as to meet the demands
of both the experiment's design and the field conditions. The individual's or entity's eligibility for intervention, for instance, must usually be determined prior to assign-
ment. Otherwise, there may be considerable wastage of effort and opportunity for
subversion of the trial. Moreover, individuals or entities such as schools or hospitals
may have to be blocked or stratified on the basis of demographic characteristics
prior to their assignments. This is partly to increase precision in (say) a randomized
block design. It may also be done to reduce volatility of issues that the trial might
otherwise engender. For example, in the Odyssey Math trial, each of the 32 schools was used as a blocking factor, and classrooms within schools were then assigned
randomly to Odyssey Math and to the control condition. This was done partly to
increase power; half as many schools were needed as compared with a school randomization design. The design also alleviated school principals' concerns that their
schools might be denied the opportunity to obtain the Odyssey curriculum.
Blocking prior to randomization is also done at lower levels to address volatile
field issues. For example, the trialist involved in an employment experiment may
group four individuals into two blocks consisting of two individuals each, one
block containing two African Americans and the second containing two Hispanics.
The randomization process then involves assigning one African American to one of
the interventions and the second individual to the remaining one. The randomiza-
tion of Hispanics is done separately, within the Hispanic block. This approach
assures that chance-based imbalances will not occur. That is, one will not encounter
a string of Hispanics being assigned to one intervention rather than another. This,
in turn, avoids local quarrels about favoritism. It also enhances the statistical power
of the experiment to the extent that ethnic or racial characteristics influence individuals' responses to the intervention.
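The blocking logic just described can be expressed in a few lines of code. The sketch below is illustrative only, and the participant identifiers and strata are hypothetical: it forms blocks of two within each stratum and assigns one member of each block to each arm at random, so the arms stay balanced on the blocking factor.

```python
import random
from collections import defaultdict

random.seed(42)

# Hypothetical participants, each with a blocking characteristic (stratum).
participants = [("P01", "A"), ("P02", "A"), ("P03", "B"), ("P04", "B"),
                ("P05", "A"), ("P06", "A"), ("P07", "B"), ("P08", "B")]

by_stratum = defaultdict(list)
for pid, stratum in participants:
    by_stratum[stratum].append(pid)

assignments = {}
for stratum, ids in by_stratum.items():
    random.shuffle(ids)
    # Pair adjacent cases into blocks of two; assign one to each arm at random.
    for i in range(0, len(ids) - 1, 2):
        arms = ["intervention", "control"]
        random.shuffle(arms)
        assignments[ids[i]] = arms[0]
        assignments[ids[i + 1]] = arms[1]

for pid, arm in sorted(assignments.items()):
    print(pid, arm)
```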
Simple random allocation of half the eligible units to intervention A and half to
intervention (control) B is common. This tactic maximizes statistical power also,
but good reasons for departing from this simple 1:1 allocation scheme often appear
in the field. The demand for one intervention may be especially strong, and the
supply of eligible candidates for intervention may be ample. This scenario justifies
consideration of allocating in a (say) 2:1 ratio in a two-arm experiment. Allocation
ratios different from 1:1 are of course legitimate and, more important, may resolve
local constraints. They can do so without appreciably affecting the statistical power
of the experiment, if the basic sample sizes are adequate and the allocation ratio
does not depart much from 60:40. Larger differences in ratio require increased
sample size.
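The claim that modest departures from a 1:1 split cost little can be checked with the same normal-approximation arithmetic. The sketch below, with an illustrative total sample size, holds the total fixed and compares the standard error of the estimated difference under several allocation ratios.

```python
import math

def se_of_difference(share_a, total_n=400, sigma=1.0):
    """Standard error of a difference in means when total_n units are split
    share_a : (1 - share_a) between two arms (equal outcome variance)."""
    n_a = total_n * share_a
    n_b = total_n - n_a
    return sigma * math.sqrt(1 / n_a + 1 / n_b)

base = se_of_difference(0.5)   # the 1:1 benchmark
for share in (0.5, 0.6, 2 / 3, 0.75):
    ratio = se_of_difference(share) / base
    print(f"allocation {share:.0%} / {1 - share:.0%}: "
          f"standard error {ratio:.2f}x the 1:1 value")
```

Under these assumptions a 60:40 split inflates the standard error by only a few percent, whereas a 75:25 split begins to exact a noticeable cost, which is consistent with the guidance above.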
A final aspect of the structuring of the random assignment, and the experiment's design more generally, involves a small sample size. For example, experiments
that involve organizations, communities, or crime hot spots (e.g., see Weisburd &
Green, 1995) as the primary unit of random assignment and analysis can often
engage far fewer than 100 entities. Some experiments that focus on individuals as
the unit of random assignment must also contend with small sample size, for
example, local tests of interventions for those who attempt suicide, people who
sexually abuse children, or abusers of some controlled substances.
Regardless of what the unit of allocation is, a small sample presents special prob-
lems. A simple randomization scheme may, by chance, result in imbalanced assign-
ment; for example, eight impoverished schools may be assigned to one health
program and eight affluent schools assigned to a second. The approaches recom-
mended by Cox (1958) are sensible. First, if it is possible to match or block prior to
randomization, this ought to be done. This approach was used both in the Minneapolis
Hot Spots Patrol Experiment and the Jersey City Drug Market Analysis Experiment.
Second, one can catalog all random allocations that are possible, eliminate before-
hand those that arguably would produce peculiarly uninterpretable results, and then
choose randomly from the remaining set of arrangements. This approach is more
complex and, on this account, seems not in favor.
Third, one can incorporate into the experiment's design strategies that can enhance analytic precision despite small sample size. See, for instance, Raudenbush
and Bryk (2002) on matching prior to randomization and on the value of covari-
ates. And see Bloom et al. (2007) and Schochet (2008) on using covariates when
schools are the units of random allocation. The bottom line is that covariates can
be valuable and often inexpensive in place randomized trials.
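The second of these approaches, cataloging the possible allocations and choosing randomly among the acceptable ones, can also be sketched briefly. The school names and poverty rates below are hypothetical; the code enumerates every way of assigning half of eight schools to treatment, discards allocations whose arms are badly imbalanced on the baseline measure, and picks at random from the remainder.

```python
import itertools
import random
import statistics

random.seed(7)

# Hypothetical small-sample trial: 8 schools with a baseline poverty rate.
schools = {"S1": 0.62, "S2": 0.58, "S3": 0.55, "S4": 0.51,
           "S5": 0.22, "S6": 0.18, "S7": 0.15, "S8": 0.12}

names = list(schools)
acceptable = []
for combo in itertools.combinations(names, 4):   # every way to pick the treated half
    treated = set(combo)
    t_mean = statistics.mean(schools[s] for s in treated)
    c_mean = statistics.mean(schools[s] for s in names if s not in treated)
    # Keep only allocations whose arms are roughly balanced on poverty.
    if abs(t_mean - c_mean) < 0.05:
        acceptable.append(treated)

chosen = random.choice(acceptable)
print(f"{len(acceptable)} balanced allocations out of 70; "
      f"chosen treated schools: {sorted(chosen)}")
```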
to this answer (Gibson-Davis & Duncan, 2005). Boys seem to benefit more than
girls in the sense of statistically reduced problem behavior, apparently on account
of mothers investing more resources (day care) in them so as to avert all the higher
risks that mothers perceive.
The frequency and periodicity of observing outcomes on intervention and control groups are important. For instance, theory and prior research may suggest that an intervention's effects decay or appear late, or that particular responses to one
intervention appear at different rates than responses to another. We already noted
the importance of social observations of hot spots in the Minneapolis Hot Spots
Experiment in understanding the decline of the program's effects during the sum-
mer months. No consolidated handling of this matter is available yet in the context
of social experiments. Nonetheless, if the trialist thinks about the arms of a ran-
domized trial as two or more parallel surveys, then one can exploit contemporary
advances in survival analysis, event history analysis, and in longitudinal data analy-
sis. See Singer and Willett (2003) and references therein, generally, and Raudenbush
and Bryk (2002) on multilevel models in which one level involves measures on the
same entities over time.
It is obvious that the interventions that were assigned randomly to people or enti-
ties ought to be recorded, and the interventions that were actually delivered also
ought to be recorded. The simplest recording is a count. In the Minneapolis Hot
Spots trial, for instance, researchers measured the level of police presence each
month through observations and used these data as a method of monitoring the
dosage of police patrol. But measures on at least two deeper levels are commonly
made to inform policy and science on the character of the interventions that are
under scrutiny in the trial. At the study level, the counts on departures from ran-
domization are, as a matter of good practice, augmented by qualitative information.
In the SARP, for instance, departures were monitored and counted at each site to
assure proper execution of the basic experiment's design and to learn about how
departures occurred through qualitative interviews with police officers. At the inter-
vention provider level, measures may be simple, for example, establishing how
many police officers in the SARP contributed how many eligible cases and with what
rate of compliance with assigned treatments. In large-scale education and employ-
ment experiments, measures are often more elaborate. They attend to duration,
character, and intensity of training and support services, and to staff responsible for
them (see, e.g., Gueron & Pauly, 1991; St. Pierre, 2004; and references therein).
Baseline or pretest measures in a randomized field experiment function to pro-
vide evidence that interventions are delivered to the right target individuals or enti-
ties, to reassure the trialist about the integrity of the random assignment process,
to enhance the interpretability of the experiments, and to increase precision in
analysis. Each function is critical and requires a different use of the baseline data.
In the Hot Spots Patrol experiment, for instance, data were generally collected
for more than a year before eligibility was defined to make sure that police efforts
were focused on places that had consistently high levels of crime and disorder. In
the Minneapolis Experiment, researchers required a high level of stability in crime
rates across time, since variability in prior measurement of crime is likely to be
reflected in future measurement.
Consider next what trialists must observe on the trial's context. In experiments on training and employment programs that attempt to enhance participants' wage rates, it is sensible to obtain data on the local job market. This is done to understand
whether programs being evaluated have an opportunity to exercise any effect. The
measurement of job markets, of course, may also be integrated with employment
program operations. Studies of programs designed to prevent school dropout or to
reduce recidivism of former offenders might also, on theoretical grounds, attend to
job markets, though it is not yet common practice to do so.
In some social experiments, measurement of costs is customary. Historically,
trials on employment and training programs, for example, have addressed cost
seriously, as in the Rockefeller Foundation's experiments on programs for single
parents (Gordon & Burghardt, 1990) and work-welfare projects (e.g., Gueron &
Pauly, 1991; Hollister, Kemper, & Maynard, 1984). Producing good estimates of
costs requires resources, including expertise, that are not always available in other
sectors. None of the Hot Spots Policing experiments, for example, focused mea-
surement attention on cost; the focus was on the treatments' effectiveness. This is
despite the fact that the interventions being tested involved substantial and expen-
sive investments of police resources and might have negative as well as positive
impacts on the communities living in the hot spots (Rosenbaum, 2006; Weisburd &
Braga, 2006). Trials sponsored by the IES in education since 2002 seem also not to
include much attention to costs.
Guidelines on measuring different kinds of costs are available in textbooks on
evaluation (see, e.g., Rossi et al., 2004). Illustrations and good advice are contained
in such texts, in reports of the kind cited earlier, and in monographs on cost-
effectiveness analysis (e.g., Gramlich, 1990). Part of the future lies in trialists doing
better at reporting on costs and in journal editors assuring that costs get reported
uniformly.
Missingness here refers to failures to obtain data on who was assigned to and
received what interventions, on what the outcome measurement was for each
individual or unit, and on baseline characteristics of each participant. A missing
data registry, a compilation of what data are missing from whom at what level of
measurement, is not yet a formal part of a measurement system in many ran-
domized controlled trials. The need for such registries is evident. The rate of fol-
low-up on victims in ambitious police experiments such as SARP, for example,
does not exceed 80%. On the other hand, follow-up in studies such as the Hot
Spots Policing experiments based on police records is nearly perfect; missingness
is negligible.
Understanding the missingness rate and especially how the rate may differ
among interventions (and can be affected by interventions) is valuable for the study
at hand and for designing better trials. The potential biases in estimates of effect are
a fundamental reason why the What Works Clearinghouse [WWC] (2007) takes
differential attrition into account in its standards of evidence. Understanding why
data are missed is no less important. But the state of the art in reporting on miss-
ingness in experiments is not well developed. This presents an opportunity for
young colleagues to get beyond precedent.
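A missing data registry supports exactly this kind of accounting. The sketch below, with invented counts, computes each arm's attrition rate and the differential attrition between arms, the quantity that the WWC's evidence standards weigh.

```python
# Invented follow-up counts for a two-arm trial; the WWC standards consider
# both overall attrition and the difference in attrition between arms.
assigned = {"intervention": 250, "control": 250}
followed = {"intervention": 205, "control": 228}

attrition = {arm: 1 - followed[arm] / assigned[arm] for arm in assigned}
for arm, rate in attrition.items():
    print(f"{arm}: attrition {rate:.1%}")

differential = abs(attrition["intervention"] - attrition["control"])
print(f"differential attrition: {differential:.1%}")
```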
Management
Analysis
Contemporary randomized trials in the social sector usually involve at least four
classes of analyses. The first class focuses on quality assurance. It entails developing
information on which interventions were randomly assigned to which individuals
or entities, which interventions were actually received by each, and analyses of
departures from the random assignment. Each experiment in the SARP, for
instance, engaged these tasks to assure that the experiments were executed as
designed and to assess the frequency and severity of departures from design during
the study and at its conclusion. Quality assurance also usually entails examination
of baseline (pretreatment) data to establish that, indeed, the randomized groups
do not differ appreciably from one another prior to the intervention. Presenting
numerical tables on the matter in final reports is typical in reports to government (good) but less common in peer-reviewed journals (poor).
Core analysis here refers to the basic comparisons among interventions that were
planned prior to the experiment. The fundamental theme underlying the core
analysis is to "analyze them as you have randomized them." In statistical jargon, this is an intent-to-treat (ITT) analysis. That is, the groups that are randomly assigned to each
intervention are compared regardless of which intervention was actually received.
At this level of analysis, departures from assignment are ignored.
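The rule "analyze them as you have randomized them" can be made concrete in a few lines. The records below are invented; the point is only the comparison rule: the ITT estimate groups cases by the arm assigned, even for a crossover case that actually received the other treatment, while an "as treated" comparison (not the core analysis) groups cases by the treatment received.

```python
import statistics

# Invented records for a two-arm trial: (arm assigned, arm actually received,
# outcome). One mediation case crossed over to arrest, as can happen in the field.
records = [
    ("arrest", "arrest", 0), ("arrest", "arrest", 1), ("arrest", "arrest", 0),
    ("arrest", "arrest", 1), ("mediation", "mediation", 1),
    ("mediation", "arrest", 1),            # the crossover case
    ("mediation", "mediation", 0), ("mediation", "mediation", 1),
]

def mean_outcome(column, arm):
    """Average outcome for `arm`, grouping rows by the chosen column
    (0 = arm assigned, for ITT; 1 = arm actually received)."""
    return statistics.mean(row[2] for row in records if row[column] == arm)

# Intent to treat: compare the groups as randomized; the crossover stays
# with the arm to which it was assigned.
itt = mean_outcome(0, "mediation") - mean_outcome(0, "arrest")
# An "as treated" comparison regroups cases by treatment actually received.
as_treated = mean_outcome(1, "mediation") - mean_outcome(1, "arrest")

print(f"ITT estimate: {itt:+.2f}   as-treated estimate: {as_treated:+.2f}")
```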
ITT is justified by the statistical theory underlying a formal test of hypothesis
and by the logic of comparing groups that are composed through randomization
so as to undergird fair comparison. It also has a policy justification. Under real field
conditions, one can often expect departures from an assigned treatment. In the
SARP, for instance, some individuals who were assigned to a mediation treatment
then became obstreperous and were then arrested; arrest was a second randomized
treatment. Such departures occur normally in field settings. Comparing randomly assigned groups regardless of the actual treatment delivered recognizes that reality. This approach to core analysis is basic in medical and clinical trials (e.g., Friedman, Furberg, & DeMets, 1985) as in the social and behavioral sciences (Riecken et al., 1974).
The product of the ITT analysis is an estimate of the relative effect of intervention. This product addresses the question "What works?" and provides a statistical statement of confidence in the result, based on the randomized groups. Where departures from
random assignment are substantial, the researcher has to decide whether any ITT
analysis is warranted and indeed whether the experiment has been executed at all.
The experiment or the core analysis, or both, may have to be aborted. If information on
the origins or process of departures from random assignment has been generated,
the researcher may design and execute a better experiment. This sequence of failure
and trying again is a part of science. See, for instance, Silverman's (1980) descriptions of research on retrolental fibroplasia, which cover blindness of premature
infants as a function of enriched oxygen environments.
Deeper levels of analysis than ITT are often warranted on account of the com-
plexity of the phenomenon under study or on account of unanticipated problems
in the studys execution. For example, finding no differences among interventions
may be a consequence of using interventions that were far less different from one
another than the researcher anticipated, or of inadequate statistical power. A no-difference finding may also be on account of unreliable or invalid measures of the out-
comes on each randomized group. Interactions between intervention type and
subgroup, of course, can lead to a naive declaration of no difference. The topic is
understudied, but good counsel has been developed by Yeaton and Sechrest (1986,
1987), and Julnes and Mohr (1989).
A final class of analysis directs attention to how the results of the trial at hand relate to the results of similar studies. Exploring how a given study fits into the larger
scientific literature on related studies is demanding. One disciplined approach to the
task lies in exploiting the practice underlying the idea of systematic reviews and meta-
analyses. That is, the researcher does a conscientious accounting for each study of
who or what was the target (eligibility for treatments, target samples, and popula-
tion), what variables were measured and how, the character of the treatments and
control conditions, how the specific experiment was designed, and so on. The U.S.
General Accounting Office (1994), now called the Government Accountability Office,
formalized such an approach to understand the relative effectiveness of mastectomy
and lumpectomy on 5-year survival rates of breast cancer victims. See Petticrew and Roberts (2006) and the U.S. General Accounting Office (1992, 1994) more generally
on the topic of synthesizing the results of studies. Each contains implications for
understanding how to view the experiments at hand against earlier work.
Reporting
The medical and health sciences led the way in developing standards for
reporting on randomized trials (e.g., Chalmers et al., 1981). Later, Boruch (1997)
provided a checklist that depended partly on one prepared for reports on medical
clinical trials. The Consolidated Standards of Reporting Trials (CONSORT) statement is one of the best-articulated statements of its kind (Mohler et al., 2001). One of CONSORT's innovations is the requirement that authors provide
a flowchart that details case flow into and out of the trial. The flowchart is a
numerical and graphical portrayal of the pipeline discussed earlier in this
chapter. The CONSORT guidelines have been updated and revised to foster stan-
dardized and thorough reporting on cluster randomized trials (Campbell, Elbourne,
& Altman, 2004).
CONSORT's ingredients have informed the WWC's (2007) guidance on how to
report and what to report on controlled trials in education (http://ies.ed.gov/
ncee/wwc). The WWC, a unit of the IES in the United States, has also built on stan-
dards of evidence work by the Society for Prevention Research, the Campbell
Collaboration, and others to develop its standards of evidence. The production and
revision of nongovernmental standards of reporting have begun, in turn, to depend on
the WWC. The sheer volume of research publications (20,000 a year in education alone)
has provoked a move toward standardized abstracts that contain brief statements
about the experiments design elements and results (Mosteller, Nave, & Miech, 2004).
Capacity Building
Developing better capacity to design randomized trials and to analyze results is not
new in one sense. Excellent texts on statistical aspects of randomized trials, and new
ones that cover remarkable advances in the field, such as Raudenbush and Bryk
(2002), Piantadosi (1997), and Donner and Klar (2000), are readily available and are
used in many graduate courses.
Capacity building in the sense of educating ourselves and others about manag-
ing and executing such trials, and handling the political and institutional problems
that they engender, has only recently received serious attention. The World Bank's International Program for Development Evaluation Training (IPDET) included such matters in 2004 and 2005 after years of neglect. NIMH's summer institutes
on trials and the workshops on trials at professional society meetings run by the
American Institutes for Research, Manpower Demonstration Research Corporation,
and others are illustrative. The William T. Grant Foundation invested substantially in
special seminars on the topic for senior and midlevel researchers and civil servants.
Beginning in 2007, the IES invested substantially in training institutes and confer-
ences, in predoctoral and postdoctoral fellowship programs that focused heavily
(not entirely) on randomized trials (U.S. Department of Education, 2007). Partici-
pants have typically included researchers, people from local, state, and federal agen-
cies, and service providers.
Of course, capacity building includes providing resources to different entities to
run trials. The entities include schools, police departments, and other organizations whose cooperation
is essential in generating better evidence. See the examples given earlier. The chal-
lenges for the future include learning how to institutionalize and cumulate the
learning by professionals in these organizations and how to assure that the learning leads to better-informed decisions. This particular challenge is also not new, but the
refreshed interest over the last decade in randomized trials will help to drive more
sophisticated uses of evidence and ways to think about use.
Conclusion
During the 1960s, when Donald T. Campbell developed his prescient essays on the
experimenting society, fewer than 100 randomized field experiments in the social
sector had been mounted to test the effects of domestic programs. The large
number of randomized trials undertaken since then is countable, but not without
substantial effort. Registers of such trials, generated with voluntary resources, such
as the Campbell Collaboration (http://campbellcollaboration.org), yield more than
14,000 entries, and the actual number is arguably far larger. Executing randomized controlled trials helps us to transcend debates about the quality of evidence and, instead, to inform social choices based on good evidence. In the absence of random-
ized controlled experiments on policy and programs, we will, in Walter Lippmann's (1963) words, "leave matters to the unwise . . . those who bring nothing constructive to the process and who greatly imperil the future . . . by leaving great questions
to be fought out by ignorant change on the one hand, and ignorant opposition to
change on the other" (p. 497).
References
Alexander, L. B., & Solomon, P. (Eds.). (2006). The research process in human services.
Belmont, CA: Thomson/Brooks/Cole.
Aos, S. (2007). Testimony of Mr. Steve Aos to the Healthy Families and Communities
Subcommittee of the Committee on Education and Labor United States House of
Representatives. Olympia, WA: Washington State Institute for Public Policy.
Bayley, D. (1994). Police for the future. New York: Oxford University Press.
Bickman, L., & Rog, D. (Eds.). (1998). Handbook of applied social research methods. Thousand
Oaks, CA: Sage.
Birnbaum, A. S., Lytle, L. A., Story, M., Perry, C. L., & Murray, D. M. (2002). Are differences in exposure to a multicomponent school-based intervention associated with varying dietary outcomes in adolescents? Health Education and Behavior, 29(4), 427–443.
Bloom, H. S. (Ed.). (2005). Learning more from experiments: Evolving analytic approaches.
New York: Russell Sage Foundation.
Bloom, H. S., Richburg-Hayes, L., & Black, A. R. (2007). Using covariates to improve preci-
sion for studies that randomize schools to evaluate educational interventions.
Educational Evaluation and Policy Analysis, 29(1), 30–59.
Boruch, R. F. (1997). Randomized controlled experiments for planning and evaluation: A
practical guide. Thousand Oaks, CA: Sage.
Boruch, R. F. (Ed.). (2005, May). Place randomized trials: Experimental tests of public pol-
icy [Special issue]. Annals of the American Academy of Political and Social Science, 599.
Boruch, R. F. (2007). The null hypothesis is not called that for nothing: Statistical tests in
randomized trials. Journal of Experimental Criminology, 3, 1–20.
Braga, A. (2005). Hot spots policing and crime prevention: A systematic review of random-
ized controlled trials. Journal of Experimental Criminology, 1, 317–342.
Braga, A. A., Weisburd, D., Waring, E., & Mazerolle, L. G. (1999). Problem solving in violent
crime places: A randomized controlled experiment. Criminology, 37(3), 541–580.
Brantingham, P. J., & Brantingham, P. L. (1975). Residential burglary and urban form. Urban
Studies, 12(3), 273–284.
Brown, K. L., McDonald, S.-K., & Schneider, B. (2006). Just the facts: Results from IERI scale-
up research. Chicago: Data Research and Development Center, NORC, University of
Chicago. Retrieved May 6, 2008, from http://drdc.uchicago.edu/extra/just-the-facts.pdf
Campbell, D. T., & Stanley, J. C. (1966). Experimental and quasi-experimental designs for
research. Chicago: Rand McNally.
Campbell, M. K., Elbourne, D. R., & Altman, D. G. (2004). CONSORT statement extension
to cluster randomized trials. British Medical Journal, 328, 702708.
Chalmers, I. (2003). Trying to do more good than harm in policy and practice: The role
of rigorous, transparent, up-to-date evaluations. Annals of the American Academy of
Political and Social Science, 589, 22–40.
Chalmers, T., Smith, H., Blackburn, B., Silverman, B., Schroeder, B., Reitman, D., et al. (1981). A method for assessing the quality of a randomized controlled trial. Controlled Clinical Trials, 2(1), 31–50.
Clarke, R. V. (1983). Situational crime prevention: Its theoretical basis and practical scope.
In M. Tonry & N. Morris (Eds.), Crime and justice: An annual review of research (Vol. 4,
pp. 225–256). Chicago: University of Chicago Press.
Clarke, R. V. (1992). Situational crime prevention: Successful case studies. Albany, NY: Harrow
& Heston.
Clarke, R. V. (1995). Situational crime prevention: Achievements and challenges. In M. Tonry &
D. Farrington (Eds.), Building a safer society: Strategic approaches to crime prevention, crime
and justice: A review of research (Vol. 19, pp. 91–150). Chicago: University of Chicago Press.
Clarke, R. V., & Weisburd, D. (1994). Diffusion of crime control benefits: Observations
on the reverse of displacement. In R. V. Clarke (Ed.), Crime prevention studies (Vol. 2,
pp. 165–183). Monsey, NY: Criminal Justice Press.
Cochran, W. G. (1983). Planning and analysis of observational studies (L. E. Moses &
F. Mosteller, Eds.). New York: Wiley.
Cohen, L. E., & Felson, M. (1979). Social change and crime rate trends: A routine activity
approach. American Sociological Review, 44, 588–608.
Cordray, D. S. (2000). Enhancing the scope of experimental inquiry in intervention studies.
Crime & Delinquency, 46(3), 401–424.
Cornish, D. B., & Clarke, R. V. (1986). The reasoning criminal: Rational choice perspectives in
offending. New York: Springer-Verlag.
Cox, D. (1958). Planning of experiments. New York: Wiley.
Datta, L. (2007). Looking at the evidence: What variations in practice might indicate. New Directions for Evaluation, 113, 35–54.
Deeks, J. J., Dinnes, J., D'Amico, R., Sowden, A. J., Sakarovitch, C., Song, F., et al. (2003). Evaluating non-randomized intervention studies. Health Technology Assessment, 7(27), 1–173.
Dennis, M. (1988). Factors influencing quality of controlled randomized trials in criminologi-
cal research. Unpublished doctoral dissertation, Northwestern University, Evanston, IL.
Donner, A., & Klar, N. (2000). Design and analysis of cluster randomization trials in health
care. New York: Oxford University Press.
Doolittle, F., & Traeger, L. (1990). Implementing the National JTPA Study. New York: MDRC.
Eck, J. E., & Weisburd, D. (Eds.). (1995). Crime and place: Crime prevention studies (Vol. 4).
Monsey, NY: Criminal Justice Press.
Farrington, D. P. (1983). Randomized experiments on crime and justice. Crime and Justice:
Annual Review of Research, 4, 257–308.
Farrington, D. P., & Welsh, B. (2005). Randomized experiments in criminology: What have we learned in the last two decades? Journal of Experimental Criminology, 1, 9–38.
Federal Judicial Center. (1983). Social experimentation and the law. Washington, DC: Author.
Finn, J. D., & Achilles, C. M. (1990). Answers and questions about class size: A statewide experiment. American Educational Research Journal, 27, 557–576.
Flay, B., Biglan, A., Boruch, R., Castro, F., Gottfredson, D., Kellam, S., et al. (2005). Standards of evidence: Criteria for efficacy, effectiveness, and dissemination. Prevention Science, 6(3), 151–175.
Flay, B. R., & Collins, L. M. (2005). Historical review of school-based randomized trials for evaluating problem behavior. Annals of the American Academy of Political and Social Science, 599, 115–146.
Foster, G., Sherman, S., Borradaile, K., Grundy, K., Vander Veur, S., Nachmani, J., et al.
(2006). A policy-based school intervention to prevent childhood obesity. Unpublished
manuscript.
Friedman, L. M., Furberg, C. D., & DeMets, D. L. (1985). Fundamentals of clinical trials.
Boston: John Wright.
Garner, J., Fagen, J., & Maxwell, C. (1995). Published findings from the Spouse Assault
Replication Program: A critical review. Journal of Quantitative Criminology, 11(1), 3–28.
Gerber, A. S. (2004). Does campaign spending work? Field experiments provide evidence and suggest new theory. American Behavioral Scientist, 47(5), 541–574.
Gibson-Davis, L. M., & Duncan, G. J. (2005). Qualitative/quantitative synergies in a random-assignment program evaluation. In T. Weisner (Ed.), Discovering successful pathways in children's development (pp. 283–303). Chicago: University of Chicago Press.
Glazerman, S., Levy, D., & Myers, D. (2003). Nonexperimental versus experimental estimates of earnings impacts. Annals of the American Academy of Political and Social Science, 589, 63–94.
Gordon, A., & Burghardt, J. (1990). The minority female teenage single parent demonstration:
Short-term economic impacts. New York: Rockefeller Foundation.
Gortmaker, S. L., Peterson, K., Wiecha, J., Sobol, A. M., Dixit, S., Fox, M. K., et al. (1999).
Reducing obesity via a school-based interdisciplinary intervention among youth: Planet
Health. Archives of Pediatrics and Adolescent Medicine, 153, 409–418.
Gottfredson, M. R., & Hirschi, T. (1990). A general theory of crime. Stanford, CA: Stanford
University Press.
Gramlich, E. M. (1990). Guide to benefit cost analysis. Englewood Cliffs, NJ: Prentice Hall.
Gueron, J. M., & Pauly, E. (1991). From welfare to work. New York: Russell Sage Foundation.
Havas, S., Anliker, J., Greenberg, D., Block, G., Block, T., Blik, C., et al. (2003). Final results of
the Maryland WIC food for life program. Preventive Medicine, 37, 406–416.
Hedges, L., & Hedberg, E. C. (2007). Intraclass correlation values for planning group-randomized trials in education. Educational Evaluation and Policy Analysis, 29(1), 60–87.
Hollister, R., Kemper, P., & Maynard, R. (1984). The national supported work demonstration.
Madison: University of Wisconsin Press.
Julnes, G., & Mohr, L. B. (1989). Analysis of no-difference findings in evaluation research.
Evaluation Review, 13, 628–655.
Julnes, G., & Rog, D. J. (Eds.). (2007, Spring). Informing federal policies on evaluation
methodology: Building the evidence base for method choice in government sponsored
evaluation [Special issue]. New Directions for Evaluation, 2007(113).
Koper, C., Poole, E., & Sherman, L. W. (2006). A randomized experiment to reduce sales
tax delinquency among Pennsylvania businesses: Are threats best? Unpublished report.
Philadelphia: Fels Institute of Government.
Lippmann, W. (1963). The Savannah speech. In C. Rossiter & J. Lare (Eds.), The essential Lippmann. New York: Random House. (Original work published 1933)
Lipsey, M. W., Adams, J. L., Gottfredson, D. C., Pepper, J. V., Weisburd, D., Petrie, C., et al.
(2005). Improving evaluation of anticrime programs. Washington, DC: National Research
Council/National Academies Press.
Littell, J. H., & Schuerman, J. R. (1995). A synthesis of research on family preservation and
family reunification programs. Washington, DC: Office of the Assistant Secretary for
Planning and Evaluation, U.S. Department of Health and Human Services. Retrieved
May 6, 2008, from http://aspe.os.dhhs.gov/hsp/cyp/fplitrev.htm
Martinson, R. (1974). What works? Questions and answers about prison reform. The Public
Interest, 35, 22–54.
Mazerolle, L. G., & Roehl, J. (Eds.). (1998). Civil remedies and crime prevention (Vol. 9).
Monsey, NY: Criminal Justice Press.
Moffitt, R. A. (2004). The role of randomized field trials in social science research: A per-
spective from evaluations of reforms of social welfare programs. American Behavioral
Scientist, 47, 506–540.
Moher, D., Schulz, K. F., & Altman, D. G., for the CONSORT Group. (2001). The CONSORT statement: Revised recommendations for improving the quality of reports of parallel-group randomized trials. Lancet, 357, 1191–1194.
Mosteller, F. (1986). Errors: Nonsampling errors. In W. H. Kruskal & J. M. Tanur (Eds.),
International encyclopedia of statistics (Vol. 1, pp. 208–229). New York: Free Press.
Mosteller, F., & Boruch, R. F. (Eds.). (2005). Evidence matters: Randomized tests in education
research. Washington, DC: Brookings Institution.
Mosteller, F., Light, R. M., & Sachs, J. (1995). Sustained inquiry in education: Lessons from
ability grouping and class size. Cambridge, MA: Harvard University Press, Center for
Evaluation of the Program on Initiatives for Children.
Mosteller, F., Nave, B., & Miech, E. (2004, January/February). Why we need a structured
abstract in education research. Educational Researcher, 33, 29–34.
Murray, D. M. (1998). Design and analysis of group-randomized trials. New York: Oxford
University Press.
Nicklas, T. A., Johnson, C. C., Myers, L., Farris, R. P., & Cunningham, A. (1998). Outcomes
of a high school program to increase fruit and vegetable consumption: Gimme 5, a fresh nutrition concept for students. Journal of School Health, 68, 248–253.
Perng, S. S. (1985). The accounts receivable treatments study. In R. F. Boruch & W. Wothke
(Eds.), Randomization and field experimentation (pp. 55–62). San Francisco: Jossey-Bass.
Pettigrew, M., & Roberts, H. (2006). Systematic reviews in the social sciences: A practical guide.
Oxford, UK: Blackwell.
Pfeffer, J., & Sutton, R. I. (2006). Evidence based management. Harvard Business Review,
84(1), 62–74.
Piantadosi, S. (1997). Clinical trials: A methodologic perspective. New York: Wiley Interscience.
Pierce, G. L., Spar, S., & Briggs, L. R. (1986). The character of police work: Strategic and
tactical implications. Boston: Center for Applied Social Research, Northeastern University.
Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models. Thousand Oaks, CA: Sage.
Reiss, A. J., & Boruch, R. F. (1991). The program review team approach to multi-site experi-
ments: The Spouse Assault Replication Program. In R. S. Turpin & J. N. Sinacore (Eds.),
Multi-site evaluation (pp. 33–44). San Francisco: Jossey-Bass.
Riecken, H. W., Boruch, R. F., Campbell, D. T., Caplan, N., Glennan, T. K., Pratt, J. W., et al.
(1974). Social experimentation: A method for planning and evaluating social programs.
New York: Academic Press.
Rosenbaum, P. R. (2002). Observational studies. New York: Springer-Verlag.
Rosenbaum, P. R. (2006). The limits of hot spots policing. In D. Weisburd & A. Braga (Eds.),
Police innovation: Contrasting perspectives (pp. 245266). Cambridge, UK: Cambridge
University Press.
Rossi, P. H., Lipsey, M., & Freeman, H. F. (2004). Evaluation: A systematic approach (7th ed.).
Thousand Oaks, CA: Sage.
Roth, J. A., Scholz, J. T., & Witte, A. D. (Eds.). (1989). Paying taxes: An agenda for compliance
research (Report of the Panel on Research on Tax Compliance Behavior National
Academy of Sciences). Philadelphia: University of Pennsylvania Press.
Schochet, P. (2008). Statistical power for random assignment evaluations of education
programs. Journal of Educational and Behavioral Statistics, 33(1), 62–87.
Schuerman, J. R., Rzepnicki, T. L., & Littell, J. (1994). Putting families first: An experiment in
family preservation. New York: Aldine de Gruyter.
Schwartz, R. D., & Orleans, S. (1967). On legal sanctions. University of Chicago Law Review,
34(274), 282–300.
Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental
designs for generalized causal inference. Boston: Houghton Mifflin.
Shavelson, R. J., & Towne, L. (Eds.). (2002). Scientific research in education. Washington,
DC: National Research Council/National Academies Press.
Shepherd, J. P. (2003). Explaining feast or famine in randomized field trials: Medical science
and criminology compared. Evaluation Review, 27(3), 290–315.
Sherman, L. W., Gartin, P. R., & Buerger, M. E. (1989). Repeat call address policing: The
Minneapolis RECAP experiment. Final report to the National Institute of Justice.
Washington, DC: Crime Control Institute.
Sherman, L. W., Schmidt, J. D., & Rogan, D. P. (1992). Policing domestic violence: Experiments
and dilemmas. New York: Free Press.
Sherman, L. W., & Weisburd, D. (1995). General deterrent effects of police patrol in crime
hotspots: A randomized controlled trial. Justice Quarterly, 12, 625–648.
Sieber, J. E. (1992). Planning ethically responsible research: A guide for students and internal
review boards. Newbury Park, CA: Sage.
Silverman, W. (1980). Retrolental fibroplasia: A modern parable. New York: Grune & Stratton.
Singer, J. D., & Willett, J. B. (2003). Applied longitudinal data analysis: Modeling change and
event occurrence. New York: Oxford University Press.
Skogan, W., & Frydl, K. (2004). Fairness and effectiveness in policing: The evidence.
Washington, DC: National Academies Press.
Sorensen, G., Thompson, B., Glanz, K., Feng, Z., Kinne, S., DiClemente, C., et al. (1996).
Work site-based cancer prevention: Primary results from the Working Well Trial.
American Journal of Public Health, 86, 939–947.
Spybrook, J. (2007). Examining the experimental designs and statistical power of group ran-
domized trials. Funded by the Institute of Education Sciences. Unpublished doctoral
dissertation, University of Michigan, Ann Arbor.
St. Pierre, R. G. (2004). Using randomized experiments. In J. S. Wholey, H. P. Hatry, & K. E. Newcomer (Eds.), Handbook of practical program evaluation (2nd ed., pp. 150–175).
New York: Jossey-Bass.
Stanley, B., & Sieber, J. E. (Eds.). (1992). Social research on children and adolescents: Ethical
issues. Newbury Park, CA: Sage.
Stufflebeam, D. L., & Shinkfield, A. J. (2007). Evaluation theory, models, and applications.
New York: Jossey-Bass.
Taylor, R. (1997). Social order and disorder of street blocks and neighborhoods: Ecology,
microecology, and the synthetic model of social disorganization. Journal of Research in
Crime and Delinquency, 34(1), 113–155.
Tilley, B., Glanz, K., Kristal, A. R., Hirst, K., Li, S., Vernon, S. W., et al. (1999). Nutrition intervention for high-risk auto workers: Results of the Next Step trial. Preventive Medicine, 28, 284–292.
Turner, H. (2007). Random assignment in the Odyssey Math trial. Philadelphia, PA: Analytica.
U.S. Department of Education. (2007). Toward a Learning Society: Director's Biennial Report to Congress. Washington, DC: Author (IES 2007-6004).
U.S. General Accounting Office. (1992). Cross-design synthesis: A new strategy for medical
effectiveness research (Publication No. GAO/PEMD-92-18). Washington, DC: Government
Printing Office.
U.S. General Accounting Office. (1994). Breast conservation versus mastectomy: Patient
survival in day to day medical practice and in randomized studies (Publication No.
PEMD-95-9). Washington, DC: Government Printing Office.
University of Virginia Health System. (2008, February). For Your child: Childhood obesity
addressed with new program. Retrieved March 4, 2008, from www.healthsystem.virginia
.edu/UVAHealth/news_foryourchild/0802ch.cfm
Victor, T. (2007). Estimating effects based on quasi-experiments: A Monte Carlo simulation
study. Unpublished doctoral dissertation, University of Pennsylvania, Philadelphia.
Weisburd, D. (2000). Randomized experiments in criminal justice policy: Prospects and
problems. Crime & Delinquency, 46(2), 181–193.
Weisburd, D. (2005). Hot spots policing experiments and criminal justice research. Annals of the American Academy of Political and Social Science, 599, 220–245.
Weisburd, D., & Braga, A. (2006). Hot spots policing as a model for police innovation. In D. Weisburd & A. Braga (Eds.), Police innovation: Contrasting perspectives (pp. 225–244).
Cambridge, UK: Cambridge University Press.
Weisburd, D., Bushway, S., Lum, C., & Yang, S. M. (2004). Trajectories of crime at places: A longitudinal study of street segments in the city of Seattle. Criminology, 42(2), 283–321.
Weisburd, D., & Eck, J. (2004). What can police do to reduce crime, disorder, and fear? Annals
of the American Academy of Political and Social Science, 593, 42–65.
Weisburd, D., & Green, L. (1995). Policing drug hot spots: The Jersey City DMA experiment. Justice Quarterly, 12, 711–736.
Weisburd, D., Lum, C., & Petrosino, A. (2001). Does research design affect study outcomes in criminal justice? Annals of the American Academy of Political and Social Science, 578, 50–70.
Weisburd, D., Maher, L., & Sherman, L. W. (1992). Contrasting crime general and crime
specific theory: The case of hot-spots crime. Advances in criminological theory (Vol. 4,
pp. 45–70). New Brunswick, NJ: Transaction Press.
Weisburd, D., Wyckoff, L., Ready, J., Eck, J., Hinkle, J., & Gajewski, F. (2006). Does crime just
move around the corner? A controlled study of spatial displacement and diffusion of
crime control benefits. Criminology, 44, 549–591.
Weisner, T. (Ed.). (2005). Discovering successful pathways in children's development: Mixed methods in the study of childhood and family life. Chicago: University of Chicago Press.
Westat, Inc. (2002). Evaluation of family preservation and reunification programs: Final report.
Washington, DC: U.S. Department of Health and Human Services Assistant Secretary
for Planning and Evaluation. Retrieved May 6, 2008, from http://aspe.os.dhhs.gov/hsp/
fampres94/index.htm
What Works Clearinghouse. (2007). Retrieved May 6, 2008, from http://ies.ed.gov/ncee/
wwc/overview/review.asp
William T. Grant Foundation. (2007). Portfolio of education related grants awarded before
January 1, 2007. New York: Author. Retrieved May 6, 2008, from www.wtgrantfdn.org
Wittman, W. W., & Klumb, P. L. (2006). How to fool yourself with experiments in testing
theories in psychological research. In R. R. Bootzin & P. E. McKnight (Eds.),
Strengthening research methodology: Psychological measurement and evaluation (pp.
185–212). Washington, DC: American Psychological Association.
Yeaton, W. H., & Sechrest, L. (1986). Use and misuse of no difference findings in eliminating
threats to validity. Evaluation Review, 10, 836–852.
Yeaton, W. H., & Sechrest, L. (1987). No difference research. New Directions for Program Evaluation, 34, 67–82.
CHAPTER 6
Quasi-Experimentation
Melvin M. Mark
Charles S. Reichardt
Applied social science researchers often try to assess the effects of an inter-
vention of interest, also known as a treatment. To take just a few examples,
educational researchers have estimated the effects of preschool programs,
economists have examined the consequences of an increase in the minimum wage,
psychologists have assessed the psychological effects of living through a natural
disaster, and legal scholars have studied the results of legal changes such as laws
mandating helmets for motorcycle riders. When an applied social researcher is
interested in estimating the effects of a treatment, a range of research options exists.
One option is to employ a randomized experiment. In a randomized experiment, a
random process, such as a flip of a fair coin, decides which participants receive one
treatment condition (e.g., a new state-supported preschool program) and which
receive no treatment or an alternative treatment condition (e.g., traditional child
care). The randomized experiment is the preferred option for many applied
researchers, and sometimes is held out as the "gold standard" for studies that
estimate the effect of a treatment. In applied social research, however, practical or
ethical constraints often preclude random assignment to conditions. For instance,
it will usually not be feasible to randomly assign people or states to a law that man-
dates helmets for motorcyclists. When random assignment to conditions is not
feasible, as will often (but hardly inevitably) be the case in applied research, a
quasi-experiment may be the method of choice.
Quasi is a Latin term meaning "as if." Donald Campbell, the original architect of the logic of quasi-experimentation (e.g., Campbell & Stanley, 1966; Cook & Campbell, 1979; Shadish, Cook, & Campbell, 2002), coined the term "quasi-experiment." It means an approximation of an experiment, a near experiment. Like
A Review of Alternative
Quasi-Experimental Designs
In this, the longest section of the chapter, we review four quasi-experimental
designs: the one-group pretest-posttest design, the nonequivalent group design,
the interrupted time-series design, and the regression-discontinuity design. In the
context of these designs, we introduce several potential threats to the validity of
inferences from quasi-experiments. We begin with relatively queasy designs that
generally do not provide sufficiently confident causal inferences in applied social
research. Even here, however, the adequacy of a design is not preordained, but
depends. We then move to more compelling quasi-experimental designs and to
additional comparisons that can facilitate causal inference.
O X O.
decline in cancer cases between 2002 and 2003 would not imply an effect of the
reduced use of HRT.
Instrumentation can lead to inaccurate inferences about a treatment's effects when
an apparent effect is instead the result of a change in a measuring instrument. One
reason that instrumentation can occur is because of changes in the definition of an
outcome variable. Paulos (1988) gave an example, noting that "Government employment figures jumped significantly in 1983, reflecting nothing more than a decision to count the military among the employed" (p. 124). Instrumentation would be a prob-
lem in the HRT-cancer study if, for example, the official definition of breast cancer
changed, say, with some of the cases that in 2002 would have been classified as breast can-
cer instead defined in 2003 as lymph node cancer. Instrumentation can also be a
problem when there is not a formal change in definition, if the procedures or stan-
dards of those who record the observations shift over time.
The threat of testing arises when the very act of measuring the pretest alters the
results of the posttest. For example, individuals unfamiliar with tests such as the
SAT may score higher on a second taking of the test than they did the first time,
simply because they have become more familiar with the test format. In the HRT-
cancer investigation, testing appears to be an implausible threat, but it would be
a problem if many women had mammograms in 2002 and by some biological
process this screening itself offered protection against cancer.
Regression toward the mean is an inferential threat that occurs most strongly
when the pretest observation is substantially different than usual, either higher or
lower. When things are unusual at the pretest, the posttest observation often will
return to a more average or normal level even in the absence of a treatment effect.
This kind of pattern is called spontaneous remission in medical treatments or psy-
chotherapy. That is, people often seek out treatment when their physical or emo-
tional conditions are at their worst and, because many conditions get better on their
own, patients often improve without any intervention. In theory, an unusual form
of regression toward the mean could have occurred in the HRT-cancer study.
Publicity about the WHI study results could have created a stampede of women to
get mammograms, including women who otherwise would not have had a
mammogram until 2003 or after. The 2002 tally of breast cancer cases thus might
have been unusually high, with a decline in 2003 to be expected even without any
real effect of the reduction in HRT.
Attrition, alternatively labeled experimental mortality, refers to the loss of partic-
ipants in a study. Such a loss can create a spurious difference in a pretest-posttest
comparison. For example, the average test scores of college seniors tend to be higher
than the average test scores of college freshmen, simply because poor-performing
students are more likely than high-performing students to drop out of school. A
form of attrition could have threatened internal validity in the HRT-cancer study if
fewer women, especially those at high risk, were screened for cancer in 2003 than in
2002. Hypothetically, publicity about the WHI might have made some women too
anxious to be screened or given a false sense of security to women not on HRT. The
WHI study and the associated reduction in HRT therapy would not have caused a
real drop in breast cancer, but would have only reduced detection via attrition from
screening (and thus from the study data).
plausible for organizational performance as an outcome than they are for immedi-
ate outcomes such as knowledge.
Eckert's argument highlights several take-home messages about quasi-experimentation. First, to reiterate, threats do not automatically cripple a quasi-experiment based solely on its design. The specifics of a study, including its context and content (such as what the outcome measure is), determine whether a threat applies in a particular case. Second, quasi-experimentation should not be seen, or practiced, as a mindless or automatic process of selecting from a preexisting menu
of quasi-experimental designs. One consideration in thoughtfully selecting a quasi-
experimental design is the plausibility of internal validity threats in the specific
circumstances of the study. For instance, if Eckert is right that the one-group
pretest-posttest design suffices for evaluating the immediate learning effects of cer-
tain training programs, then it could be a waste of resources to implement a more
complex design. Of course, this argument rests on the assumption that the risk of
the various internal validity threats can be assessed reasonably well in advance.
Moreover, the higher costs of a more rigorous design may sometimes be practically
inconsequential, in which case the stronger design would of course be preferred. Or the
more rigorous design may be more costly (in terms of dollars, time, or other
resources), but this cost could be outweighed by the importance of having a strong
evidentiary base to convince skeptics. Again, the selection of a particular quasi-
experimental design, or the selection of a quasi-experiment versus a randomized
experiment, involves judgment and consideration of trade-offs. A third implication
is that the quasi-experimental researcher often has a larger burden than the
researcher conducting a randomized experiment. Rather than simply reporting the
results of a pretest-posttest evaluation of the effects of a training program on
knowledge, for example, Eckert would need to offer evidence and argument to rule
out the validity threats to which the design is generically susceptible.
Sometimes the evidence that a quasi-experimentalist might add to his or her
argument is relatively direct evidence about the plausibility of a particular validity
threat. For instance, Ross (1973), in a study of a British intervention directed at
road safety, used a variety of sources to see if there were actual history threats such
as other legislation or shifts in gasoline prices. In the HRT-cancer example, the
threat of attrition could be directly assessed by examining whether there was a
decline from 2002 to 2003 in the number of women screened for breast cancer by
mammograms.
Alternatively, the quasi-experimentalist might seek to rule out threats less
directly, by creatively identifying additional comparisons that help render relevant
validity threats implausible. For instance, consider a one-group pretest-posttest eval-
uation of a training program. The researcher could create two knowledge scales, one
closely reflecting the training program's content and the other measuring related knowledge that the program did not teach, but that would be expected to change
if maturation occurred. If the posttest showed improvement on the first but not the
second measure, this would further support the conclusion that the training worked
(vs. the alternative explanation that maturation occurred). In the HRT-cancer study,
a similar strategy was employed. Investigators found that the decline in cancer cases
occurred primarily among women in the age group previously targeted for HRT
therapy and in the types of cancer sensitive to estrogen, a component of HRT. The
logic of adding such comparisons is addressed further in a later section.
Despite the preceding discussion, in most circumstances the one-group pretest-
posttest design will not be adequate for applied social research. This is because one
or more of the previously described threats to internal validity are likely to be suf-
ficiently plausible and sufficiently large in size as to render results from the design
ambiguous. Thus, we turn to other quasi-experimental designs.
X  O
- - - - - - -
    O ,
where the broken line denotes that the groups are nonequivalent, which simply
means that group assignment was not random.
The posttest difference between the groups on the outcome variable is used to
estimate the size of the treatment effect. However, the internal validity threat of
selection usually makes the results of the posttest-only nonequivalent group design
uninterpretable in applied social research. Selection refers to the possibility that ini-
tial differences between groups, rather than an actual treatment effect, are respon-
sible for any observed difference between groups on the outcome measure. When
nonequivalent groups are compared, the selection threat is usually sufficiently plau-
sible that the posttest-only nonequivalent group design is not recommended for
applied social research. That is, differences on the outcome variable seem likely to
result from self-selection or whatever the nonrandom process is that created the
groups, which would of course obscure the effects of the intervention in the
posttest-only design.
In a more prototypical nonequivalent group design, the groups are observed on
both a pretest and a posttest. Diagrammatically, this pretest-posttest nonequivalent
group design is represented as
O  X  O
- - - - - - - -
O       O ,
where the dashed line again denotes nonequivalent groups. With this design, the
researcher can use the pretest to try to take account of initial selection differences.
The basic logic of the pretest-posttest nonequivalent group design can perhaps
most easily be seen from the vantage of one potential data analysis technique, gain
score analysis. Gain (or change) score analysis focuses on the average pretest-to-
posttest gain in each group. The difference between the two groups in terms of
change (i.e., the difference between groups in the average pretest-posttest gain)
serves as the estimate of the treatment effect. That is, the treatment effect is esti-
mated by how much more (or less) the treatment group gained on average than the
control group. Unlike the posttest-only design, the pretest-posttest nonequivalent
group design at least offers the possibility of controlling for the threat of selection
using the pretest to represent the initial difference that is due to selection.
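To make the arithmetic concrete, here is a minimal sketch, not taken from the chapter, that computes a gain score estimate of the treatment effect from invented pretest and posttest scores (the numbers, group labels, and use of Python are purely illustrative):

```python
import numpy as np

# Hypothetical pretest and posttest scores for two nonequivalent groups
# (all numbers invented for illustration).
treat_pre = np.array([52.0, 48.0, 55.0, 60.0, 50.0])
treat_post = np.array([61.0, 57.0, 66.0, 70.0, 58.0])
control_pre = np.array([40.0, 45.0, 38.0, 42.0, 44.0])
control_post = np.array([44.0, 49.0, 41.0, 47.0, 48.0])

# Average pretest-to-posttest gain within each group.
treat_gain = (treat_post - treat_pre).mean()
control_gain = (control_post - control_pre).mean()

# Gain score estimate of the treatment effect: how much more the treatment
# group gained, on average, than the control group.
effect = treat_gain - control_gain
print(f"Treatment group gain: {treat_gain:.2f}")
print(f"Control group gain:   {control_gain:.2f}")
print(f"Gain score estimate:  {effect:.2f}")
```

With these invented numbers, the treatment group gains 9.4 points on average and the control group 4.0, so the estimated treatment effect is 5.4 points.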
Gain score analysis, however, controls only for a simple main effect of initial
selection differences. For example, imagine that (a) the treatment group begins 15 points
higher than the control group at the pretest and (b) it would remain 15 points ahead
at the posttest unless there is an effect of the treatment. In this case, gain score
analysis would perfectly adjust for the effect of the initial selection difference.
However, the analysis does not control for interactions between selection and other
threats. In particular, gain score analysis of data from the pretest-posttest non-
equivalent group design does not control for a selection-by-maturation interaction,
whereby one of the groups improves faster than the other group (i.e., matures at a
different rate) even in the absence of a treatment effect.
Functionally, there are two ways to think about why a selection-by-maturation
interaction would occur. One is captured in the old expression, "The rich get richer."
Certain maturational processes are characterized by increasingly larger gaps over
time between the best and the rest. For example, skill differences are usually less pro-
nounced among younger children and more pronounced among older children.
When such a pattern holds, a gain score analysis will not remove the differential mat-
uration across groups. That is, the initially higher-scoring group would be further
ahead of the other group at the posttest ("the rich get richer"), even in the absence
of a treatment effect. A second (and conceptually related) reason for the selection-
by-maturation pattern is that the pretest might not capture all the relevant initial
differences between groups in the face of certain maturational processes. Consider
the case of a quasi-experimental evaluation of a program intended to prevent drug
use in early adolescents. If the two groups had similar levels of drug use at the
pretest, while at the posttest the comparison group used drugs more than treatment
group youths, a gain score analysis would suggest that the program was effective.
However, the groups might have appeared similar at the pretest because that mea-
surement took place at an age before many youths have begun to use drugs. But if
the two groups differed on risk factors such as community levels of drug use, then
divergence between the two groups over time may be expected even if no treatment
effect occurred. More generally, a single pretest (measured in the same way as the
posttest) may not represent all the factors that should be controlled for.
The task of controlling for initial selection differences can be approached in
several different ways through alternative statistical analyses (Reichardt, 1979;
Shadish et al., 2002). Another common analytic procedure is the analysis of covari-
ance (ANCOVA). In controlling for initial selection differences, in essence ANCOVA
statistically matches individuals in the two treatment groups on their pretest scores
and uses the average difference between the matched groups on the posttest to esti-
mate the treatment effect. Unlike gain score analysis, ANCOVA allows the use of
covariates that are not operationally identical to the posttests, as well as the use of
multiple covariates. However, measurement error in the pretest scores will introduce
bias into the ANCOVAs estimate of the treatment effect, because the statistical
adjustment would not control for the true initial differences. Bias will also arise if the
statistical model does not include all the variables that both affect the outcome vari-
able and account for initial selection differences. There is seldom any way to be con-
fident that all such variables have been appropriately included in the analysis. So the
possibility of bias due to initial selection differences usually remains.
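As a minimal illustration of the ANCOVA logic just described, the sketch below regresses the posttest on the pretest covariate and a treatment indicator using the statsmodels library; the data and variable names are hypothetical, and, as noted above, the resulting estimate is trustworthy only to the extent that the covariates capture the selection differences and are measured without error.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data from a pretest-posttest nonequivalent group design.
df = pd.DataFrame({
    "pretest":  [52, 48, 55, 60, 50, 40, 45, 38, 42, 44],
    "posttest": [61, 57, 66, 70, 58, 44, 49, 41, 47, 48],
    "treated":  [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
})

# ANCOVA: the coefficient on `treated` estimates the treatment effect while
# statistically "matching" treatment and control cases on their pretest scores.
model = smf.ols("posttest ~ pretest + treated", data=df).fit()
print(model.params["treated"])          # covariate-adjusted effect estimate
print(model.conf_int().loc["treated"])  # its confidence interval
```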
Because measurement error in the pretest will introduce bias in ANCOVA
(Reichardt, 1979), latent variable structural equation models are sometimes used
instead (Magidson & Sorbom, 1982; Ullman & Bentler, 2003). These models use
multiple measures of the construct thought to affect the outcome variable and
account for initial selection differences, and these measures are essentially factor
analyzed in an effort to obtain an estimate of the latent variable that effectively is
without measurement error. (Latent variable structural equation models also nicely
support the testing of mediational models, discussed below.) However, the validity
of the estimates that result from these models depends on the accuracy and thor-
oughness of the model, and applied social researchers often cannot be confident
that they have specified a model accurately.
An alternative approach, propensity score analyses, is gaining in popularity of
late. In this approach, the predicted probability of being in the treatment (rather
than the control) group is generated by a logistic regression (Little & Rubin, 2000;
Rosenbaum, 1995; Rosenbaum & Rubin, 1983). An advantage, relative to the sim-
pler ANCOVA, is that the influence of numerous covariates can be captured in a
single propensity score. Cases are then usually stratified into subgroups (commonly
five subgroups) based on their propensity scores, and the treatment effect is computed as a weighted average based on the treatment and control group means
within each subgroup. Alternatively, the propensity score can be treated as a covari-
ate in ANCOVA. Winship and Morgan (1999) provide a useful review of several of
these techniques (also see Little & Rubin, 2000; Shadish et al., 2002).
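The stratification approach described above can be sketched as follows. The data are simulated and all variable names are hypothetical; a real application would also examine covariate balance within strata and consider the cautions raised in the next paragraph.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 500

# Simulated observational data: two covariates influence both selection into
# treatment and the outcome (true treatment effect set to 2.0 for illustration).
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
p_treat = 1 / (1 + np.exp(-(0.8 * x1 + 0.5 * x2)))
treated = rng.binomial(1, p_treat)
outcome = 2.0 * treated + 1.5 * x1 + 1.0 * x2 + rng.normal(size=n)
df = pd.DataFrame({"x1": x1, "x2": x2, "treated": treated, "outcome": outcome})

# Step 1: logistic regression gives each case a propensity score.
ps_model = smf.logit("treated ~ x1 + x2", data=df).fit(disp=False)
df["pscore"] = ps_model.predict(df)

# Step 2: stratify cases into five subgroups (quintiles) of the propensity score.
df["stratum"] = pd.qcut(df["pscore"], 5, labels=False)

# Step 3: weighted average of within-stratum treatment-control mean differences.
effects, weights = [], []
for _, grp in df.groupby("stratum"):
    if grp["treated"].nunique() == 2:  # need both groups present in the stratum
        diff = (grp.loc[grp.treated == 1, "outcome"].mean()
                - grp.loc[grp.treated == 0, "outcome"].mean())
        effects.append(diff)
        weights.append(len(grp))
print(np.average(effects, weights=weights))  # roughly recovers the effect of 2.0
```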
Much uncertainty remains about how to tailor an adequate statistical analysis for
the pretest-posttest nonequivalent group design under different research conditions.
Figure 6.1 Number of Days of Alcohol Use Both Before and After Two Groups of Homeless Individuals Received Different Amounts of Substance Abuse Treatment
SOURCE: Adapted from Braucht et al. (1995, p. 103) by permission. Copyright by Haworth Press, Inc.
O O O O O O X O O O O O O.
Figure 6.2 Time Series From Hypothetical Study of Reduced HRT and Breast Cancer
[Figure: hypothetical breast cancer rate per 100,000 females (vertical axis) plotted by year (horizontal axis).]
outcome variable, as in the hypothetical findings in Figure 6.2, where the interven-
tion appears to have reduced breast cancer cases by a relatively constant amount
over the posttreatment period. Change can also occur in slope, either alone or in
association with a change in level. For instance, a future HRT-cancer time-series
study might show both a reduced level and a declining slope (imagine Figure 6.2
with a downward slope after the intervention). Moreover, a treatment effect could
be either immediate or delayed, and could also be either permanent or temporary.
However, validity threats (history and maturation, respectively, as will be discussed
later) are often more plausible for both delayed and gradual effects than for an
immediate, abrupt effect. The temporal pattern of the effect also can have serious
implications for judgments about the importance of the effect. For example, if the
effects of reduced HRT lasted only 1 year, most observers would judge this as less
important than if the effects were permanent.
How does the simple ITS design fare with respect to internal validity threats?
Like the one-group pretest-posttest design, the simple ITS design estimates the
treatment effect by comparing the same individuals (or the same aggregate group)
at different points in time, before and after the treatment. However, the ITS design
does far better in terms of ruling out several validity threats. Consider the six valid-
ity threats introduced in the earlier discussion of the one-group pretest-posttest
design. While maturation is a plausible threat in the one-group pretest-posttest
design, the pretreatment observations in a time series can allow the researcher to
estimate the pattern of maturation. For example, if maturation follows a simple lin-
ear trend, the researcher can see (often literally) the pattern of maturation and
model it in the statistical analysis. The pretreatment observations in a simple ITS
also can reveal the likely degree of regression toward the mean. That is, with a series
O O O O X O O O O
- - - - - - - - - - - - - - - -
O O O O    O O O O
The top line of Os represents data from the experimental subjects who receive the
treatment, whereas the bottom line of Os represents data from the control subjects
who do not receive the treatment. The broken line indicates that the two time series
of observations did not come from randomly assigned groups. Ideally, the control
time-series of observations would be affected by everything that affects the experi-
mental time series, except for the treatment. To the extent this is the case, the con-
trol series increases one's knowledge of how the experimental series would have behaved in the absence of a treatment, and thereby increases one's confidence in the
estimate of the treatment effect. For example, if the two groups have similar matu-
rational patterns, then the control time series can be used in modeling the pretreat-
ment trend and projecting it into the future. Furthermore, a control time series can
take account of the validity threat of history, to the extent the control time series is
affected by the same history effects. In this case, the treatment effect is estimated as
the size of the change in the experimental series after the treatment is introduced,
minus the size of the change in the control series at the same point in time.
For example, Wagenaar (1981, 1986) was interested in the effect that an increase
in the drinking age had on traffic accidents. In 1979, the drinking age in Michigan
was raised from 18 to 21 years. To assess the effect of this change, Wagenaar (1981)
plotted one experimental time series (for the number of drivers aged between 18 and
20 years who were involved in a crash) and two control series (for the number of
drivers aged between 21 and 24 years or between 25 and 45 years who were involved
in crashes). These time series are reproduced in Figure 6.3. A drop in fatalities
Figure 6.3 The Number of Drivers Involved in Crashes While Drinking, Plotted Yearly Both Before and After the Legal Drinking Age Was Raised in 1979 from 18 to 21
SOURCE: Adapted from Wagenaar (1981) by permission. Copyright by The University of Michigan Transportation Research Institute.
occurred in 1979 only for the experimental time series, that is, only for the data
from the 18- to 20-year-old drivers, which is the only time series of observations that
should have been affected by the treatment intervention. The two control series add
to our confidence that the dip in the experimental series is an effect of the treatment
and not due to other factors that would also affect the control series, such as changes
in the severity of weather patterns or changes in the price of gasoline. As noted ear-
lier, in the case of the HRT-breast cancer relationship, it will be useful to compare
the time series of breast cancer cases for women of the age typical for HRT with the
time series for women of other ages. It would also be useful to compare time series
for estrogen-sensitive cancers (which should be affected by HRT) and nonestrogen-
sensitive cancers (which should not be affected by HRT). This can be labeled a non-
equivalent dependent variables ITS design (Cook & Campbell, 1979), because a
comparison time series of observations exists that consists of a different dependent
variable than the primary dependent, time-series variable.
Other design elaborations can also be useful. When the treatment's effects are
transitory (i.e., they disappear when the treatment is removed), one potentially
useful option is the ITS with removed and repeated treatment. Such a design is
diagrammatically depicted as
O O O X O O O X̄ O O O X O O O X̄ O O O,
where X indicates that the treatment was introduced and X̄ indicates that the treatment was removed. For example, Schnelle et al. (1978) estimated the effects of
police helicopter surveillance, as an adjunct to patrol car surveillance, on the fre-
quency of home burglaries. After a baseline of observations was collected with
patrol car surveillance alone, helicopter surveillance was added for a while, then
removed, and so on. In general, the frequency of burglaries decreased whenever
helicopter surveillance was introduced, while burglaries increased when helicopter
surveillance was removed. The repeated introduction and removal of the treat-
ment can greatly lessen the plausibility of the threat of history. In the Schnelle et
al. study of helicopter surveillance, for example, it is unlikely that historical events
that decrease burglaries would happen to coincide repeatedly with the
multiple introductions of the treatment, while the multiple removals of the treat-
ment would happen repeatedly to coincide with historical events that increased
burglaries.
The statistical analysis of time-series data generally raises complexities. In a time
series, data points that are adjacent in time are likely to be more similar than data
points that are far apart in time. This pattern of similarity, called autocorrelation,
violates the assumptions of typical parametric analyses such as multiple regression
analysis. In short, autocorrelation can bias significance tests and confidence inter-
vals. In ITS studies that examine aggregate data, such as annual number of breast
cancer cases in the United States, autoregressive integrated moving average (ARIMA)
models are frequently suggested (e.g., Box, Jenkins, & Reinsel, 1994; Box & Tiao,
1975). However, the number of time points must be relatively large, perhaps as large
as 50 to 100 observations. When there is a control ITS, ARIMA models could be fit
separately to each of the different time series of observations.
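For readers who want to see what such an analysis looks like in code, here is a hedged sketch of an ARIMA-based intervention analysis using the statsmodels library. The series, the intervention point, and the ARIMA(1, 0, 0) order are all invented for illustration; in practice the model would be identified from the pretreatment observations and checked diagnostically.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(1)

# Simulated annual series: 60 pre-intervention and 20 post-intervention
# observations, with an abrupt, permanent drop of 15 units at the intervention.
n_pre, n_post = 60, 20
level = np.r_[np.full(n_pre, 100.0), np.full(n_post, 85.0)]
series = pd.Series(level + rng.normal(scale=5, size=n_pre + n_post))

# Step regressor: 0 before the intervention, 1 afterward (models a change in level).
exog = pd.DataFrame({"step": np.r_[np.zeros(n_pre), np.ones(n_post)]})

# Fit an ARIMA(1, 0, 0) model with the step as an exogenous regressor; the
# coefficient on `step` estimates the change in level, adjusted for autocorrelation.
fit = ARIMA(series, exog=exog, order=(1, 0, 0)).fit()
print(fit.params["step"])   # should be close to -15 in this simulation
```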
Alternatively, when data are collected over time from numerous cases (e.g.,
annual test scores collected from many students), a variety of techniques can be used
to analyze the data. Importantly, the analysis of such N much greater than 1 (N >> 1)
designs can require far fewer than the 50 to 100 time points of observations that are
necessary for ITS designs that have only a single case (i.e., N = 1 designs), the latter
having to meet the demands of the ARIMA analysis strategy. In other words, having
a large number of observations (i.e., cases) at any one point in time can reduce the
number of different time points of observation that are required. For numerous
cases (N >> 1) designs, the most frequently recommended analysis strategy in the
past was derived from multivariate analysis of variance (MANOVA; Algina &
Olejnik, 1982; Algina & Swaminathan, 1979; Simonton, 1977, 1979). The MANOVA
approach allowed the autocorrelation structure among observations to have any
form over time but fit the same model to the data for each individual.
More recently, two additional statistical approaches have been developed. These
newer approaches model the trajectory of growth for each case (e.g., student) indi-
vidually, which means these two statistical approaches allow trajectories to differ
across the individual cases and allow these differences in trajectories to be explained
using other variables in the model. In addition, different models of the treatment
effect can be fit to each case and differences across cases in the effects of the treat-
ment can be assessed. The first of the two newer approaches has been given a vari-
ety of names, including multilevel modeling and hierarchical linear modeling
(HLM; Raudenbush & Bryk, 2001). An example using HLM with a short time series
is provided by Roderick, Engel, Nagaoka, and Jacob (2003), who evaluated the
effects of a summer school program in the Chicago school district. They provide an
accessible explanation of the benefits of the HLM approach for accounting for sta-
tistical regression in the context of a short time series. The second approach is
called latent growth curve modeling (LGCM; Duncan & Duncan, 2004; Muthén &
Curran, 1997) and is implemented using software for structural equation model-
ing. Under a range of conditions, the HLM and LGCM analyses are equivalent and
produce the same estimates of effects (Raudenbush & Bryk, 2001).
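A rough sketch of the multilevel (HLM-style) growth-modeling approach is given below, using the statsmodels mixed-effects routine on simulated short time series from many cases. The data, effect size, and model specification are invented for illustration; dedicated HLM or LGCM software offers many additional options.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n_cases, n_waves, interruption = 200, 8, 4

# Simulate a short time series for each case: a case-specific intercept and
# growth rate, plus a level shift of 3 units once the treatment is introduced.
rows = []
for i in range(n_cases):
    intercept = rng.normal(50, 5)
    slope = rng.normal(1.0, 0.3)
    for t in range(n_waves):
        post = int(t >= interruption)   # 1 after the treatment is introduced
        y = intercept + slope * t + 3.0 * post + rng.normal(scale=2)
        rows.append({"case": i, "time": t, "post": post, "y": y})
df = pd.DataFrame(rows)

# Multilevel model: fixed effects for the time trend and the post-treatment
# shift; random intercepts and random time slopes across cases.
model = smf.mixedlm("y ~ time + post", data=df,
                    groups=df["case"], re_formula="~time").fit()
print(model.params["post"])   # estimated shift in level after treatment (about 3)
```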
To sum up regarding ITS designs, in these quasi-experiments a series of observa-
tions is collected over time both before and after a treatment is implemented.
Essentially, the trend in the pretreatment observations is projected forward in time
and compared with the trend in the posttreatment observations, and differences
between these two trends are used to estimate the treatment effect. The ITS design
often has the greatest credibility when the effect of the treatment is relatively imme-
diate and abrupt. Some of the advantages of the ITS design are that it (a) can be used
to assess the effects of the treatment on a single individual (or a single aggregated
unit, such as a city), (b) can estimate the pattern of the treatment effect over time,
and (c) can be implemented without the treatments being withheld from anyone.
The researcher can often strengthen the design by removing and then repeating the
treatment at different points in time, adding a control time series, or both. The ITS
design, especially with a control group or other elaborations, is generally recognized
as among the strongest quasi-experimental designs. With more recent advances in
analysis (e.g., the use of HLM for growth curve modeling), the use of shorter time
series with multiple cases appears to have become more commonplace.
Figure 6.4 Hypothetical Data From an R-D Design (depicting no treatment effect)
Figure 6.5 Hypothetical Data From an R-D Design (depicting positive treatment effect)
[Both figures plot the outcome (vertical axis) against the eligibility dimension, 0 to 16 (horizontal axis), for control and treatment cases.]
treatment effect is positive, with the regression line in the experimental group displaced above the regression line in the control group: the treatment group scores are higher than you would expect relative to the regression line in the control group. The estimate of the size of the treatment effect is equal to the vertical displacement between the two regression lines.
The graphical representation of an R-D study's findings, as illustrated in Figure 6.5, highlights the source of the design's inferential strength. In general, it is implau-
sible that any threat to validity, whether selection, statistical regression, or any other
threat, would produce a discontinuity precisely at the cutoff between the treatment
conditions. Put informally, the question is: How likely is it that there would be a
jump in scores on the outcome variable that coincides precisely with the cutoff on
the eligibility criterion, unless there really is a treatment effect? Unless the treatment
really makes a difference, why would individuals who score just below the eligibility
criterion look so different on the outcome than those who score just above it, and
why would this difference between individuals just above and below the cutoff be so
much greater than the difference, say, between those who score right below the cut-
off as compared with those who score just below that? Because there are usually few
plausible answers to these questions, the R-D design has relatively strong internal
validity, approaching that of a randomized experiment (Shadish et al., 2002).
The conventional statistical analysis of the R-D design involves predicting the
outcome variable using regression analyses, where the predictors are (a) the QAV
(transformed by subtracting the cutoff value, so that the treatment effect is esti-
mated at the cutoff point), (b) a dummy variable representing condition (e.g., 1 =
treatment vs. 0 = comparison), and (c) a term representing the interaction of con-
dition and the QAV. The regression coefficient for the dummy variable estimates
the treatment effect (seen visually as the vertical displacement of the regression
lines in Figure 6.5). The interaction term assesses whether the size of the treatment
effect varies across the QAV. For example, imagine that the treatment in Figure 6.5
is more effective for those who initially scored the highest. If so, the two regression
lines would no longer be parallel, and the experimental group's regression line
would be higher on the right side than it is in Figure 6.5.
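The conventional R-D analysis just described can be sketched as follows with simulated data; the cutoff, scale, and effect size are invented, and the variable names are hypothetical. The regression uses the cutoff-centered assignment variable, a condition dummy, and their interaction.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n, cutoff = 400, 8.0

# Hypothetical quantitative assignment variable (QAV) on a 0-16 scale;
# cases at or above the cutoff receive the treatment (true effect = 5 units).
qav = rng.uniform(0, 16, size=n)
treated = (qav >= cutoff).astype(int)
outcome = 20 + 1.2 * qav + 5.0 * treated + rng.normal(scale=3, size=n)

df = pd.DataFrame({
    "qav_c": qav - cutoff,   # (a) QAV centered at the cutoff
    "treated": treated,      # (b) dummy variable for condition
    "outcome": outcome,
})

# (c) the interaction term allows the treatment effect to vary across the QAV.
model = smf.ols("outcome ~ qav_c + treated + qav_c:treated", data=df).fit()
print(model.params["treated"])         # estimated effect at the cutoff (about 5)
print(model.params["qav_c:treated"])   # change in effect across the QAV (about 0)
```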
Curvilinearity in the relationship between the QAV and the outcome variable is
one potential source of bias in an R-D designs estimate of the treatment effect. If
the underlying relationship is curvilinear, but a linear relationship is fit to the data,
a spurious effect may be observed (Exercise 2). To address this problem, curvilin-
earity in the data should be modeled in the analysis. Typically, in practice, this
would be done after visual inspection for curvilinearity in the original and
smoothed data. In the regression analysis, polynomials terms of the (transformed)
QAV and interaction are added. Inclusion of the polynomials serves to test for the
possibility that a nonlinear relationship exists that could otherwise masquerade as
a treatment effect. Trochim (1984) and Reichardt, Trochim, and Cappelleri (1995)
discuss procedures for modeling interactions and curvilinearity, and for perform-
ing the regression analysis. The R-D design has substantially less power than a ran-
domized experiment (Cappelleri, Darlington, & Trochim, 1994). For example, to
have the same precision and power as a randomized experiment (assuming that a
measure analogous to the QAV is used as a covariate), the R-D design must have
helpful to compare findings for women of the age typically treated with HRT
versus findings for women of other ages. Alternatively, comparisons can be drawn
across measures, as Cook and Campbell (1979) demonstrated with the so-called
nonequivalent dependent variable (a comparison across outcome variables, in
Reichardt's language). In the HRT-breast cancer example, a treatment effect would
predict a decline in estrogen-sensitive cancers only, while most alternative explana-
tions would predict a decline in both estrogen-sensitive and nonestrogen-sensitive
cancers. As Reichardt (2006) has noted, competitive elaboration can also take place
with respect to comparisons across variations in settings and times. See Reichardt
(2006) for examples and further discussion.
Implementation Assessment
In early applied social research, researchers often failed to assess systematically
what the treatment and the comparison (or control) actually consisted of in
practice. For example, an early evaluator of the effects of bilingual education prob-
ably would not have observed the education of second-language learners in the so-
called bilingual education schools, nor what transpired in the so-called comparison
group schools. But without attention to the specifics of treatment implementation,
sensible conclusions are hard to reach. For example, if no treatment effect is
observed, the implications would be quite different (a) if bilingual education was
not implemented than if (b) bilingual education was well implemented but
nonetheless ineffective.
Systematic assessment of a treatments implementation is more commonplace
nowadays than in early applied social research. Several approaches to implementa-
tion assessment have been employed (Mark & Mills, 2007). For example, interven-
tions sometimes have a relatively detailed implementation plan, as is the case for
many school-based prevention programs and psychological therapies. In such cases,
implementation assessment may consist of checks, preferably by observation but
perhaps by self-report from program implementers or recipients, on the extent to
which the intervention was implemented with fidelity to the plan. Checks should
also be made about whether the same or similar activities are carried out in the com-
parison or control group. For example, a study of bilingual education should assess
not only the fidelity to the program plan in treatment group schools but also the
extent to which similar activities did not occur in the comparison group. (See Mark
& Mills, 2007, for discussion of alternative models of implementation assessment.)
Information from an implementation assessment is valuable, as already noted,
in terms of facilitating more sensible interpretation of no-effect findings. Implemen-
tation analyses, by allowing better description of the actual intervention, are also
valuable in facilitating dissemination of effective treatments. In some instances,
implementation assessment results can also strengthen causal inference in a quasi-
experiment. For example, there is often variation within the treatment group in
terms of the degree or nature of the exposure to the treatment. Based on a simple
dose-response logic, researchers may seek to test the hypothesis that there are larger
effects for clients who received higher doses of the treatment. However, potential
selection effects may bias this comparison. That is, clients may have self-selected
into different amounts of treatment exposure, and these self-selected subgroups
may differ initially in important ways. Propensity scores or other forms of statisti-
cal adjustment can be used to try to alleviate this bias. See Yoshikawa, Rosman, and
Hsueh (2001) for a related example.
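The following sketch illustrates one way such an adjusted dose-response comparison might look, using propensity-score stratification; the covariates, the high- versus low-dose split, and the simulated data are all illustrative assumptions rather than a prescribed analysis.

# Hypothetical dose-response comparison within a treatment group,
# adjusted for self-selection via propensity-score stratification.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 1000
baseline_risk = rng.normal(0, 1, n)             # pretreatment covariates
motivation = rng.normal(0, 1, n)

# Clients with higher motivation self-select into higher doses.
high_dose = (0.8 * motivation + rng.normal(0, 1, n) > 0).astype(int)
outcome = 1.0 * high_dose + 0.5 * motivation - 0.3 * baseline_risk + rng.normal(0, 1, n)

df = pd.DataFrame({"high_dose": high_dose, "outcome": outcome,
                   "baseline_risk": baseline_risk, "motivation": motivation})

# Propensity of receiving the high dose, estimated from pretreatment covariates only.
ps_model = LogisticRegression().fit(df[["baseline_risk", "motivation"]], df["high_dose"])
df["propensity"] = ps_model.predict_proba(df[["baseline_risk", "motivation"]])[:, 1]

# Compare outcomes within propensity-score quintiles and average the
# within-stratum dose differences; an unadjusted comparison would mix the
# dose effect with preexisting differences between self-selected subgroups.
df["stratum"] = pd.qcut(df["propensity"], 5, labels=False)
by_stratum = df.groupby(["stratum", "high_dose"])["outcome"].mean().unstack("high_dose")
print((by_stratum[1] - by_stratum[0]).mean())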
Mediational Tests
A mediator is a variable that falls between two other variables in a causal chain,
such as between a program and its outcome. Substantively and statistically, the
mediator accounts for or is responsible for the relationship between an intervention
and its outcome. To take an example, for many years the drug abuse prevention
program DARE (Drug Abuse Resistance Education) was based on a mediational
model assuming that the program activities, its lessons and exercises, would cause
an increase in students' refusal skills, the mediator, and these enhanced refusal skills
would in turn translate into reduced drug use by the students, the intended out-
come. In many areas of social research, whether basic or applied, it has become
commonplace to test mediational models. For example, theory-driven evaluation, a
popular approach to program and policy evaluation, includes mediational analyses
as a routine practice (Donaldson, 2003). Mediational tests are often conducted via
structural equation modeling (SEM; e.g., Ullman & Bentler, 2003) or simpler sta-
tistical procedures (e.g., Baron & Kenny, 1986), although more qualitative methods
are sometimes used (e.g., Weiss, 1995). Although these techniques have limits, they
can be useful at least in probing mediation.
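As a hedged illustration of the simpler regression-based approach, the sketch below estimates the direct and indirect paths for a single hypothetical mediator in the spirit of Baron and Kenny (1986); the variable names and simulated data are invented, and a full analysis would also address this approach's well-known limitations (e.g., unmeasured confounding of the mediator-outcome relationship).

# Hypothetical single-mediator test: program -> refusal skills -> drug use.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 800
program = rng.integers(0, 2, n)                          # 1 = received the program
refusal_skills = 0.5 * program + rng.normal(0, 1, n)     # hypothesized mediator
drug_use = -0.4 * refusal_skills + 0.1 * program + rng.normal(0, 1, n)

df = pd.DataFrame({"program": program,
                   "refusal_skills": refusal_skills,
                   "drug_use": drug_use})

a = smf.ols("refusal_skills ~ program", data=df).fit()             # path a
b = smf.ols("drug_use ~ refusal_skills + program", data=df).fit()  # paths b and c'
total = smf.ols("drug_use ~ program", data=df).fit()               # total effect c

indirect = a.params["program"] * b.params["refusal_skills"]        # a * b
print("total effect:   ", total.params["program"])
print("direct effect:  ", b.params["program"])
print("indirect effect:", indirect)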
A mediational model may contain only one mediator, as in the model held by
the original advocates of DARE. Or there may be multiple mediators. Indeed,
research on programs such as DARE has demonstrated that their program activi-
ties influence more than one mediator. In particular, although DARE and similar
programs increase refusal skills, they also make drug use seem more common, and
unfortunately, making drug use seem more common or normative is associated
with a higher level of drug use (e.g., Donaldson, Graham, Piccinin, & Hansen,
1995). This example illustrates some of the benefits of mediational analyses. Like
implementation assessment, mediational tests can facilitate interpretation of the
treatment effect results. For instance, if a study found DARE to be ineffective, the
implications would differ if (a) the program failed to increase refusal skills versus
(b) refusal skills were increased but the program nevertheless failed to achieve
reduced drug use. In addition, the finding that DARE and similar programs affected
perceived norms provides guidance about how to revise DARE.
Mediational analyses can also strengthen confidence that the treatment, rather
than a validity threat, accounts for the observed differences between groups in a
quasi-experiment. This follows from the idea of competitive elaboration discussed
in the previous section. When a theory of the treatment predicts a particular medi-
ational pattern and findings are consistent with that pattern, causal inference is
strengthened to the extent plausible validity threats would not account for the same
pattern. Mediational evidence can also make quasi-experimental (or experimental)
findings easier to communicate and more persuasive. For instance, being able to
explain why DARE is ineffective is likely to be more compelling than simply stating
it is ineffective. Testing mediation also can erase the distinction between applied
and basic research, as when the evaluation of a real-world program includes a test
of a theoretical hypothesis about social norms.
Summary
Implementation assessment, mediational tests, and the study of moderation
have each become more commonplace in applied social research. These procedures
have specific benefits as ancillaries to both experiments and quasi-experiments. For
the quasi-experimentalist, it is important to note that these procedures, in at least
some cases, can also strengthen causal inference. This will especially occur if the
researcher implements these procedures thoughtfully from the perspective of com-
petitive elaboration.
designs might be beset by a selection bias that causes the quasi-experiments on aver-
age to overestimate the real treatment effects, while in another research area the
typical selection bias might lead to an underestimate. And in yet other areas there
may not be a consistent direction of bias. For instance, a particular research area
might not be plagued by consistent selection effects, but history effects might apply.
Given the vagaries of history, this threat would sometimes lead to an overestimate
of the true treatment effect and at other times to an underestimate. A related find-
ing from Lipsey and Wilson is that quasi-experiments were associated with more
variability in effect size estimates. That is, in a given research area, there was less
consistency across studies in the treatment effect estimates from quasi-experiments
than from randomized experiments. This does not seem surprising, in that the
validity threats that generally apply to the quasi-experiments in a given research
area are not likely to operate to the same degree in every study. For example, if
history is an applicable threat, the vagaries of history are in essence adding random
error to the treatment effect estimates across quasi-experimental studies.
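The following toy simulation, with arbitrary numbers that are in no way a reanalysis of Lipsey and Wilson (1993), illustrates the point: giving each hypothetical quasi-experiment its own study-level bias term (standing in for history or other uncontrolled threats with no consistent direction) leaves the estimates centered near the true effect but noticeably more variable than the randomized estimates.

# Illustrative simulation: study-level threats add random error to
# quasi-experimental effect estimates, inflating their variability.
import numpy as np

rng = np.random.default_rng(3)
true_effect, n_studies, sampling_error = 0.40, 200, 0.10

randomized = true_effect + rng.normal(0, sampling_error, n_studies)

# Each quasi-experiment also carries its own uncontrolled bias (e.g., a
# history effect) that differs unpredictably from study to study.
study_bias = rng.normal(0, 0.20, n_studies)
quasi = true_effect + study_bias + rng.normal(0, sampling_error, n_studies)

print("SD of randomized estimates:", round(randomized.std(), 3))
print("SD of quasi estimates:     ", round(quasi.std(), 3))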
Altogether, then, the findings of Lipsey and Wilson (1993) do not inspire confi-
dence that the results of a quasi-experiment will match the results that would have
arisen if a randomized experiment were done instead, although they may do well
in some research areas. Aiken, West, Schwalm, Carroll, and Hsiung (1998) and
Cook and Wong (2008) have summarized other research that compares results
from a set of quasi-experiments and a set of randomized experiments investigating
a particular treatment. In short, their conclusions seem compatible with the find-
ings of Lipsey and Wilson. As both Aiken et al. and Cook and Wong (2008) point
out, however, comparisons of this kind are themselves subject to bias. That is, many
differences on average may exist between the quasi-experimental and the experi-
mental studies in a given research area, including differences in the way the treat-
ments are implemented, differences in the type of individuals receiving the treatment,
differences in the way outcomes are measured, differences in the settings in which
the two types of studies are implemented, and so on.
Other comparisons of study types have taken a more local or within-study
approach (Cook & Wong, 2008). In some cases, the researchers have constructed
both a randomized experimental test and one or more quasi-experimental tests in
the same context (e.g., Aiken et al., 1998; Lipsey, Cordray, & Berger, 1981). In other
studies, the researcher has conducted a randomized experiment; for the quasi-
experiment, data from the randomized experiment's treatment group are compared
with data from another source, typically a large national data set. One problem
with this approach is that, as has been emphasized throughout this chapter, quasi-
experiments are not all alike. Some are queasier than others. And, as Cook and
Wong suggest, an argument can be made that in many of the local comparisons
across study types, a well-designed randomized experiment has been compared
with a mediocre quasi-experiment.
Cook and Wong (2008) indicate that, in those few instances in which random-
ized experiments are compared with the strongest of the quasi-experiments, the
results are similar. In the case of R-D designs, for instance, Aiken et al. (1998) found
similar results for an R-D quasi-experiment and a randomized experiment study-
ing the effects of a remedial writing course. Lipsey et al. (1981) similarly found
design, rather than a queasier one, appears to be highly desirable. Second, not all
comparison groups are alike, and procedures such as using an internal control
group or a cohort control (by creating a comparison group more initially similar
to the treatment group) may result in more accurate findings. Third, statistical
controls for selection bias will be enhanced to the extent the researcher has a good
understanding of the selection process and measures the variables that are involved.
Fourth, rather than relying only on statistical adjustments, the quasi-experimentalist
should rely on the logic of competitive elaboration, considering the full range of
comparisons that can be used to try to deal with selection and other validity threats
(e.g., nonequivalent dependent variables and theory-driven subgroup analyses).
Fifth, although the argument for replication is important in research generally, it
may be stronger for research using quasi-experiments given the possibility not only
of bias despite the researcher's best efforts, but also of more variability in treatment
effect estimates.
Conclusion
A variety of designs are available for estimating the effects of a treatment. No sin-
gle design type is always best. The choice among designs depends on the circum-
stances of a study, particularly on how well potential threats to validity and other
criticisms can be avoided under the given circumstance. For this reason, researchers
would be well-advised to consider a variety of designs before making their final
choices. Researchers should evaluate each design relative to the potential validity
threats that are likely to be most plausible in their specific research contexts.
Researchers should also be mindful that they can rule out threats to validity by
adding comparisons that put the treatment and potential threats into direct com-
petition. Sometimes, researchers can add such a comparison simply by disaggre-
gating data that have already been collected. For example, in studying the
HRT-breast cancer relationship, researchers could render threats implausible by
disaggregating the available data into a subgroup of women of the age typically
treated by HRT and of women of different ages. In other cases, researchers must
plan ahead of time to collect data that allows the additional comparisons needed to
evaluate threats to validity.
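A minimal sketch of such a disaggregation, using entirely invented rates and an assumed age range for HRT-typical treatment, is shown below.

# Hypothetical, made-up illustration of the disaggregation logic: breast
# cancer rates per 1,000 women, split by HRT-typical age and by period.
import pandas as pd

rates = pd.DataFrame({
    "age_group": ["50-69 (HRT-typical)", "50-69 (HRT-typical)",
                  "other ages", "other ages"],
    "period":    ["before HRT decline", "after HRT decline"] * 2,
    "rate_per_1000": [4.1, 3.5, 2.0, 2.0],   # invented numbers
})

table = rates.pivot(index="age_group", columns="period", values="rate_per_1000")
table["change"] = table["after HRT decline"] - table["before HRT decline"]
print(table)
# A decline confined to the HRT-typical rows is consistent with the HRT
# explanation; a comparable decline at other ages would point to rival
# explanations (e.g., changes in screening).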
At its best, quasi-experimentation is not simply a matter of picking a prototyp-
ical design out of a book. Rather, considerable intellectual challenge is encountered
in recognizing potential threats to validity and in elaborating design comparisons
so as to minimize uncertainty about the size of the treatment effect. Indeed, the
fact that it can be challenging to get the right answer with quasi-experiments, espe-
cially the queasier ones, is an argument for the use of randomized experiments. In
this regard, researchers have recently attempted to integrate quasi-experiments with
randomized experiments, such as using ITS designs in conjunction with small
N experiments (Bloom et al., 2005; Riccio & Bloom, 2002). However, when
random assignment is not feasible, implementing a strong quasi-experimental
design and creatively employing the strategy of competitive elaboration is highly recommended.
Discussion Questions
1. Quasi-experiments are appropriate for certain research questions but not
others. Generate four or five examples of research questions for which a quasi-
experiment would make sense and also four or five research questions for which a
quasi-experiment would not make sense.
2. Look at the two sets of research questions you generated in response to the
previous question. What differentiates the two sets?
5. Think about what makes one quasi-experiment queasy and another one
relatively rigorous. Explain.
6. The chapter discussed a possible future study of the effects of the recent rapid
decline in hormone replacement therapy for menopausal women. Discuss the way
that a more elaborate set of evidence could enhance causal inference in that study.
Exercises
1. Identify a real or hypothetical applied social research question that can be
examined quasi-experimentally. Then, in Step 1, describe a relatively weak quasi-
experiment (e.g., a one-group pretest-posttest design or a posttest-only nonequivalent
group design) to examine the research question. In Step 2, apply a pretest-posttest
nonequivalent group design to the same research question. In Step 3, try to apply a
relatively rigorous quasi-experiment (some form of ITS design or a regression-
discontinuity design). At each step, explain what key internal validity threats are
plausible. For the second step (the pretest-posttest nonequivalent group design)
and the third step (the ITS or R-D design), indicate how that design rules out
threats that the weaker design did not.
3. Pretend you were one of the first researchers to try to study the hypothesis
that smoking tobacco causes lung cancer. Using the logic of ruling out threats to
validity, identify an elaborate set of comparisons you could make to assess the
causal hypothesis.
References
Aiken, L. S., West, S. G., Schwalm, D. E., Carroll, J., & Hsiung, S. (1998). Comparison of a
randomized and two quasi-experiments in a single outcome evaluation: Efficacy of a
university-level remedial writing program. Evaluation Review, 22, 207–244.
Algina, J., & Olejnik, S. F. (1982). Multiple group time-series design: An analysis of data.
Evaluation Review, 6, 203–232.
Algina, J., & Swaminathan, H. (1979). Alternatives to Simonton's analyses of the interrupted
and multiple-group time series designs. Psychological Bulletin, 86, 919–926.
Baron, R. M., & Kenny, D. A. (1986). The moderator-mediator variable distinction in social
psychological research: Conceptual, strategic and statistical considerations. Journal of
Personality and Social Psychology, 51, 1173–1182.
Bloom, H. S., Michalopoulos, C., & Hill, C. J. (2005). Using experiments to assess nonexperi-
mental comparison-group methods for measuring program effects. In H. S. Bloom (Ed.),
Learning more from social experiments (pp. 173–235). New York: Russell Sage Foundation.
Box, G. E. P., Jenkins, G. M., & Reinsel, G. C. (1994). Time-series analysis: Forecasting and con-
trol (3rd ed.). Englewood Cliffs, NJ: Prentice Hall.
Box, G. E. P., & Tiao, G. C. (1975). Intervention analysis with applications to economic and
environmental problems. Journal of the American Statistical Association, 70, 70–92.
Braucht, G. N., Reichardt, C. S., Geissler, L. J., Bormann, C. A., Kwiatkowski, C. F., & Kirby,
M. W., Jr. (1995). Effective services for homeless substance abusers. Journal of Addictive
Diseases, 14, 87–109.
Lipsey, M. W., Cordray, D. S., & Berger, D. E. (1981). Evaluation of a juvenile diversion
program: Using multiple lines of evidence. Evaluation Review, 5, 283–306.
Lipsey, M. W., & Wilson, D. B. (1993). The efficacy of psychological, educational, and behav-
ioral treatment: Confirmation from meta-analysis. American Psychologist, 48, 1181–1209.
Little, R. J., & Rubin, D. B. (2000). Causal effects in clinical and epidemiological studies
via potential outcomes: Concepts and analytical approaches. Annual Review of Public
Health, 21, 121–145.
Magidson, J., & Sörbom, D. (1982). Adjusting for confounding factors in quasi-experiments:
Another reanalysis of the Westinghouse Head Start evaluation. Educational Evaluation
and Policy Analysis, 4, 321–329.
Mark, M. M., & Mellor, S. (1991). The effect of the self-relevance of an event on hindsight
bias: The foreseeability of a layoff. Journal of Applied Psychology, 76, 569–577.
Mark, M. M., & Mills, J. (2007). The use of experiments and quasi-experiments in decision
making. In G. Morçöl (Ed.), Handbook of decision making (pp. 459–482). New York:
Marcel Dekker.
Marsh, J. C. (1985). Obstacles and opportunities in the use of research on rape legislation. In
R. L. Shotland & M. M. Mark (Eds.), Social science and social policy (pp. 295–310).
Beverly Hills, CA: Sage.
MSNBC News Services. (2006, December 14). Breast cancer drop tied to less hormone therapy:
Sharp decline in 2003 when older women stopped drugs, research shows. Retrieved
February 10, 2007, from www.msnbc.msn.com/id/16206352
Muthén, B., & Curran, P. (1997). General longitudinal modeling of individual differences in
experimental designs: A latent variable framework for analysis and power estimation.
Psychological Methods, 2, 371–402.
Paulos, J. A. (1988). Innumeracy: Mathematical illiteracy and its consequences. New York: Hill
& Wang.
Raudenbush, S. W., & Bryk, A. S. (2001). Hierarchical linear models: Applications and data
analysis methods (2nd ed.). Thousand Oaks, CA: Sage.
Reichardt, C. S. (1979). The statistical analysis of data from nonequivalent group designs. In
T. D. Cook & D. T. Campbell (Eds.), Quasi-experimentation: Design and analysis issues
for field settings (pp. 147–205). Chicago: Rand McNally.
Reichardt, C. S. (2000). A typology of strategies for ruling out threats to validity. In
L. Bickman (Ed.), Research design: Donald Campbell's legacy (Vol. 2, pp. 89–115).
Thousand Oaks, CA: Sage.
Reichardt, C. S. (2006). The principle of parallelism in the design of studies to estimate treat-
ment effects. Psychological Methods, 11, 1–18.
Reichardt, C. S., Trochim, W. M. K., & Cappelleri, J. C. (1995). Reports of the death of regression-
discontinuity analysis are greatly exaggerated. Evaluation Review, 19, 39–63.
Reynolds, A. J., & Temple, J. A. (1995). Quasi-experimental estimates of the effects of a
preschool intervention: Psychometric and econometric comparisons. Evaluation Review,
19, 347–373.
Riccio, J. A., & Bloom, H. S. (2002). Extending the reach of randomized social experiments:
New directions in evaluations of American welfare-to-work and employment initia-
tives. Journal of the Royal Statistical Society: Series A, 165, 13–30.
Roderick, M., Engel, M., Nagaoka, J., & Jacob, B. (2003). Ending social promotion in Chicago:
Results from Summer Bridge. Chicago: Consortium on Chicago School Research.
Rosenbaum, P. R. (1984). From association to causation in observational studies: The role
of tests of strongly ignorable treatment assignment. Journal of the American Statistical
Association, 79, 40–48.
CHAPTER 7
Designing a
Qualitative Study
Joseph A. Maxwell
the activities of collecting and analyzing data, developing and modifying theory,
elaborating or refocusing the research questions, and identifying and dealing with
validity threats are usually going on more or less simultaneously, each influencing
all of the others. In addition, the researcher may need to reconsider or modify any
design decision during the study in response to new developments or to changes in
some other aspect of the design. Grady and Wallston (1988) argue that applied
research in general requires a flexible, nonsequential approach and an entirely dif-
ferent model of the research process than the traditional one offered in most text-
books (p. 10).
This does not mean that qualitative research lacks design; as Yin (1994) says,
"Every type of empirical research has an implicit, if not explicit, research design"
(p. 19). Qualitative research simply requires a broader and less restrictive concept
of design than the traditional ones described above. Thus, Becker, Geer, Hughes,
and Strauss (1961), authors of a classic qualitative study of medical students, begin
their chapter titled "Design of the Study" by stating,
In one sense, our study had no design. That is, we had no well-worked-out set
of hypotheses to be tested, no data-gathering instruments purposely designed
to secure information relevant to these hypotheses, no set of analytic proce-
dures specified in advance. Insofar as the term "design" implies these features
of elaborate prior planning, our study had none.
If we take the idea of design in a larger and looser sense, using it to identify those
elements of order, system, and consistency our procedures did exhibit, our study
had a design. We can say what this was by describing our original view of the
problem, our theoretical and methodological commitments, and the way these
affected our research and were affected by it as we proceeded. (p. 17)
For these reasons, the model of design that I present here, which I call an inter-
active model, consists of the components of a research study and the ways in which
these components may affect and be affected by one another. It does not presup-
pose any particular order for these components, or any necessary directionality of
influence.
The model thus resembles the more general definition of design employed out-
side research: "An underlying scheme that governs functioning, developing, or
unfolding" and "the arrangement of elements or details in a product or work of art"
(Frederick et al., 1993). A good design, one in which the components work harmo-
niously together, promotes efficient and successful functioning; a flawed design
leads to poor operation or failure.
Traditional (typological or linear) approaches to design provide a model for con-
ducting the research: a prescriptive guide that arranges the components or tasks
involved in planning or conducting a study in what is seen as an optimal order. In
contrast, the model presented in this chapter is a model of as well as for research. It
is intended to help you understand the actual structure of your study as well as to
plan this study and carry it out. An essential feature of this model is that it treats
research design as a real entity, not simply an abstraction or plan. Borrowing
1. Goals: Why is your study worth doing? What issues do you want it to clarify,
and what practices and policies do you want it to influence? Why do you want to
conduct this study, and why should we care about the results?
2. Conceptual framework: What do you think is going on with the issues, set-
tings, or people you plan to study? What theories, beliefs, and prior research find-
ings will guide or inform your research, and what literature, preliminary studies,
and personal experiences will you draw on for understanding the people or issues
you are studying?
3. Research questions: What, specifically, do you want to learn or understand by
doing this study? What do you not know about the things you are studying that you
want to learn? What questions will your research attempt to answer, and how are
these questions related to one another?
4. Methods: What will you actually do in conducting this study? What
approaches and techniques will you use to collect and analyze your data, and how
do these constitute an integrated strategy?
5. Validity: How might your results and conclusions be wrong? What are the
plausible alternative interpretations and validity threats to these, and how will you
deal with these? How can the data that you have, or that you could potentially col-
lect, support or challenge your ideas about what's going on? Why should we believe
your results?
I have not identified ethics as a separate component of research design. This isn't
because I don't think ethics is important for qualitative design; on the contrary, atten-
tion to ethical issues in qualitative research is being increasingly recognized as essen-
tial (Christians, 2000; Denzin & Lincoln, 2000; Fine, Weis, Weseen, & Wong, 2000).
Instead, it is because I believe that ethical concerns should be involved in every aspect
of design. I have particularly tried to address these concerns in relation to methods,
but they are also relevant to your goals, the selection of your research questions, valid-
ity concerns, and the critical assessment of your conceptual framework.
These components are not substantially different from the ones presented in
many other discussions of qualitative or applied research design (e.g., LeCompte &
Preissle, 1993; Lincoln & Guba, 1985; Miles & Huberman, 1994; Robson, 2002).
What is innovative is the way the relationships among the components are concep-
tualized. In this model, the different parts of a design form an integrated and inter-
acting whole, with each component closely tied to several others, rather than being
linked in a linear or cyclic sequence. The most important relationships among these
five components are displayed in Figure 7.1.
Figure 7.1   The interactive model of design: goals and conceptual framework linked to the research questions, which in turn link to methods and validity
There are also connections other than those emphasized here, some of which I
have indicated by dashed lines. For example, if a goal of your study is to empower
participants to conduct their own research on issues that matter to them, this will
shape the methods you use, and conversely the methods that are feasible in your
study will constrain your goals. Similarly, the theories and intellectual traditions
you are drawing on in your research will have implications for what validity threats
you see as most important and vice versa.
The upper triangle of this model should be a closely integrated unit. Your
research questions should have a clear relationship to the goals of your study and
should be informed by what is already known about the phenomena you are study-
ing and the theoretical concepts and models that can be applied to these phenom-
ena. In addition, the goals of your study should be informed by current theory and
knowledge, while your decisions about what theory and knowledge are relevant
depend on your goals and questions.
Similarly, the bottom triangle of the model should also be closely integrated. The
methods you use must enable you to answer your research questions, and also to
deal with plausible validity threats to these answers. The questions, in turn, need to
be framed so as to take the feasibility of the methods and the seriousness of partic-
ular validity threats into account, while the plausibility and relevance of particular
validity threats, and the ways these can be dealt with, depend on the questions and
methods chosen. The research questions are the heart, or hub, of the model; they
connect all the other components of the design, and should inform, and be sensi-
tive to, these components.
There are many other factors besides these five components that should influence
the design of your study; these include your research skills, the available resources,
perceived problems, ethical standards, the research setting, and the data and
preliminary conclusions of the study. In my view, these are not part of the design of
a study; rather, they either belong to the environment within which the research and
its design exist or are products of the research. Figure 7.2 presents some of the envi-
ronmental factors that can influence the design and conduct of a study.

Figure 7.2   Contextual factors influencing research design: perceived problems, personal goals, personal experience, existing theory and prior research, preliminary data and conclusions, ethical standards, researcher skills and preferred style of research, research setting, and research paradigm
I do not believe that there is one right model for qualitative or applied research
design. However, I think that the model I present here is a useful one, for three main
reasons:
SOURCE: From Qualitative Research Design: An Interactive Approach, by J. A. Maxwell, 2005. Copyright
by SAGE.
Because a design for your study always exists, explicitly or implicitly, it is impor-
tant to make this design explicit, to get it out in the open, where its strengths, limi-
tations, and implications can be clearly understood. In the remainder of this chapter,
I present the main design issues involved in each of the five components of my
model, and the implications of each component for the others. I do not discuss in
detail how to actually do qualitative research, or deal in depth with the theoretical
and philosophical views that have informed this approach. For additional guidance
on these topics, see the contributions of Fetterman (Chapter 17, this volume) and
Stewart, Shamdasani, and Rook (Chapter 18, this volume) to this Handbook; the
more extensive treatments by Patton (2000), Eisner and Peshkin (1990), LeCompte
and Preissle (1993), Glesne (2005), Weiss (1994), Miles and Huberman (1994), and
Wolcott (1995); and the encyclopedic handbooks edited by Denzin and Lincoln
(2005) and Given (in press). My focus here is on how to design a qualitative study
that arrives at valid conclusions and successfully and efficiently achieves its goals.
To the extent that you have not made a careful assessment of ways in which your
design decisions and data analyses are based on personal desires, you are in danger of
arriving at invalid conclusions.
However, your personal reasons for wanting to conduct a study, and the experi-
ences and perspectives in which these are grounded, are not simply a source of
bias (see the later discussion of this issue in the section on validity); they can also
provide you with a valuable source of insight, theory, and data about the phenom-
ena you are studying (Marshall & Rossman, 1999, pp. 25–30; Strauss & Corbin,
1990, pp. 42–43). This source is discussed in the next section, in the subsection on
experiential knowledge.
Two major decisions are often profoundly influenced by the researcher's per-
sonal goals. One is the topic, issue, or question selected for study. Traditionally,
students have been told to base this decision on either faculty advice or the litera-
ture on their topic. However, personal goals and experiences play an important role
in many research studies. Strauss and Corbin (1990) argue that
1. Understanding the meaning, for participants in the study, of the events, situ-
ations, and actions they are involved with, and of the accounts that they give of
their lives and experiences. In a qualitative study, you are interested not only in the
physical events and behavior taking place, but also in how the participants in your
study make sense of these and how their understandings influence their behavior.
The perspectives on events and actions held by the people involved in them are not
simply their accounts of these events and actions, to be assessed in terms of truth
or falsity; they are part of the reality that you are trying to understand, and a major
influence on their behavior (Maxwell, 1992, 2004a). This focus on meaning is cen-
tral to what is known as the interpretive approach to social science (Bredo &
Feinberg, 1982; Geertz, 1973; Rabinow & Sullivan, 1979).
2. Understanding the particular context within which the participants act and
the influence this context has on their actions. Qualitative researchers typically
study a relatively small number of individuals or situations and preserve the indi-
viduality of each of these in their analyses, rather than collecting data from large
samples and aggregating the data across individuals or situations. Thus, they are
able to understand how events, actions, and meanings are shaped by the unique
circumstances in which these occur.
3. Identifying unanticipated phenomena and influences and generating new,
grounded theories about the latter. Qualitative research has long been used for
this goal by survey and experimental researchers, who often conduct exploratory
qualitative studies to help them design their questionnaires and identify variables
for experimental investigation. Although qualitative research is not restricted to this
exploratory role, it is still an important strength of qualitative methods.
4. Understanding the processes by which events and actions take place. Although
qualitative research is not unconcerned with outcomes, a major strength of qualita-
tive studies is their ability to get at the processes that lead to these outcomes, processes
that experimental and survey research are often poor at identifying (Maxwell, 2004a).
5. Developing causal explanations. The traditional view that qualitative
research cannot identify causal relationships is based on a restrictive and philo-
sophically outdated concept of causality (Maxwell, 2004b), and both qualitative
and quantitative researchers are increasingly accepting the legitimacy of using qual-
itative methods for causal inference (e.g., Shadish, Cook, & Campbell, 2002). Such
an approach requires thinking of causality in terms of processes and mechanisms,
rather than simply demonstrating regularities in the relationships between vari-
ables (Maxwell, 2004a); I discuss this in more detail in the section on research ques-
tions. Deriving causal explanations from a qualitative study is not an easy or
straightforward task, but qualitative research is not different from quantitative
research in this respect. Both approaches need to identify and deal with the plausi-
ble validity threats to any proposed causal explanation, as discussed below.
These intellectual goals, and the inductive, open-ended strategy that they
require, give qualitative research an advantage in addressing numerous practical
goals, including the following.
Generating results and theories that are understandable and experientially credible,
both to the people being studied and to others (Bolster, 1983). Although quantitative
data may have greater credibility for some goals and audiences, the specific detail
and personal immediacy of qualitative data can lead to the greater influence of the
latter in other situations. For example, I was involved in one evaluation, of how
teaching rounds in one hospital department could be improved, that relied pri-
marily on participant observation of rounds and open-ended interviews with staff
physicians and residents (Maxwell, Cohen, & Reinhard, 1983). The evaluation led
to decisive department action, in part because department members felt that the
report, which contained detailed descriptions of activities during rounds and
numerous quotes from interviews to support the analysis of the problems with
rounds, "told it like it really was" rather than simply presenting numbers and gen-
eralizations to back up its recommendations.
Conducting formative studies, ones that are intended to help improve existing prac-
tice rather than simply to determine the outcomes of the program or practice being
studied (Scriven, 1991). In such studies, which are particularly useful for applied
research, it is more important to understand the process by which things happen in
a particular situation than to measure outcomes rigorously or to compare a given
situation with others.
Engaging in collaborative, action, or empowerment research with practitioners or
research participants (e.g., Cousins & Earl, 1995; Fetterman, Kaftarian, &
Wandersman, 1996; Tolman & Brydon-Miller, 2001; Whyte, 1991). The focus of
qualitative research on particular contexts and their meaning for the participants in
these contexts, and on the processes occurring in these contexts, makes it especially
suitable for collaborations with practitioners or with members of the community
being studied (Patton, 1990, pp. 129–130; Reason, 1994).
A useful way of sorting out and formulating the goals of your study is to write
memos in which you reflect on your goals and motives, as well as the implications
of these for your design decisions (for more information on such memos, see
Maxwell, 2005, pp. 11–13; Mills, 1959, pp. 197–198; Strauss & Corbin, 1990,
chap. 12). See Exercise 1.
Conceptual Framework:
What Do You Think Is Going On?
The conceptual framework of your study is the system of concepts, assumptions,
expectations, beliefs, and theories that supports and informs your research. Miles
and Huberman (1994) state that a conceptual framework "explains, either graphi-
cally or in narrative form, the main things to be studied – the key factors, concepts,
or variables – and the presumed relationships among them" (p. 18). Here, I use the
term in a broader sense that also includes the actual ideas and beliefs that you hold
about the phenomena studied, whether these are written down or not.
Thus, your conceptual framework is a formulation of what you think is going on
with the phenomena you are studying – a tentative theory of what is happening and
why. Theory provides "a model or map of why the world is the way it is" (Strauss,
1995). It is a simplification of the world, but a simplification aimed at clarifying and
explaining some aspect of how it works. It is not simply a framework, although it
can provide that, but a story about what you think is happening and why. A useful
theory is one that tells an enlightening story about some phenomenon, one that
gives you new insights and broadens your understanding of that phenomenon. The
function of theory in your design is to inform the rest of the design – to help you
assess your goals, develop and select realistic and relevant research questions and
methods, and identify potential validity threats to your conclusions.
What is often called the "research problem" is a part of your conceptual frame-
work, and formulating the research problem is often seen as a key task in designing
your study. It is part of your conceptual framework (although it is often treated as
a separate component of a research design) because it identifies something that is
going on in the world, something that is itself problematic or that has consequences
that are problematic.
The conceptual framework of a study is often labeled the "literature review." This
can be a dangerously misleading term, for three reasons. First, it can lead you to
focus narrowly on literature, ignoring other conceptual resources that may be of
equal or greater importance for your study, including unpublished work, commu-
nication with other researchers, and your own experience and pilot studies. Second,
it tends to generate a strategy of "covering the field" rather than focusing specifi-
cally on those studies and theories that are particularly relevant to your research
(Maxwell, 2006). Third, it can make you think that your task is simply descriptive –
to tell what previous researchers have found or what theories have been proposed.
In developing a conceptual framework, your purpose is not only descriptive, but
also critical; you need to treat the literature not as an authority to be deferred to,
but as a useful but fallible source of ideas about what's going on, and to attempt to
see alternative ways of framing the issues (Locke, Silverman, & Spirduso, 2004).
Another way of putting this is that the conceptual framework for your research
study is something that is constructed, not found. It incorporates pieces that are
borrowed from elsewhere, but the structure, the overall coherence, is something
that you build, not something that exists ready-made. Becker (1986, 141ff.) system-
atically develops the idea that prior work provides modules that you can use in
building your conceptual framework, modules that you need to examine critically
to make sure they work effectively with the rest of your design. There are four main
sources for these modules: your own experiential knowledge, existing theory and
research, pilot and exploratory studies, and thought experiments. Before address-
ing the sources of these modules, however, I want to discuss a particularly impor-
tant part of your conceptual framework – the research paradigm(s) within which
you situate your work.
of the term paradigm, which derives from the work of the historian of science
Thomas Kuhn, refers to a set of very general philosophical assumptions about the
nature of the world (ontology) and how we can understand it (epistemology),
assumptions that tend to be shared by researchers working in a specific field or tra-
dition. Paradigms also typically include specific methodological strategies linked to
these assumptions, and identify particular studies that are seen as exemplifying
these assumptions and methods. At the most abstract and general level, examples
of such paradigms are philosophical positions such as positivism, constructivism,
realism, and pragmatism, each embodying very different ideas about reality and
how we can gain knowledge of it. At a somewhat more specific level, paradigms that
are relevant to qualitative research include interpretivism, critical theory, feminism,
postmodernism, and phenomenology, and there are even more specific traditions
within these (for more detailed guidance, see Creswell, 1997; Schram, 2005). I want
to make several points about using paradigms in your research design:
1. Although some people refer to "the qualitative paradigm," there are many dif-
ferent paradigms within qualitative research, some of which differ radically in their
assumptions and implications (see also Denzin & Lincoln, 2000; Pitman & Maxwell,
1992). You need to make explicit which paradigm(s) your work will draw on, since
a clear paradigmatic stance helps guide your design decisions and to justify these
decisions. Using an established paradigm (such as grounded theory, critical realism,
phenomenology, or narrative research) allows you to build on a coherent and well-
developed approach to research, rather than having to construct all of this yourself.
2. You don't have to adopt in total a single paradigm or tradition. It is possible
to combine aspects of different paradigms and traditions, although if you do this
you will need to carefully assess the compatibility of the modules that you borrow
from each. Schram (2005) gives a valuable account of how he combined the ethno-
graphic and life history traditions in his dissertation research on an experienced
teacher's adjustment to a new school and community.
3. Your selection of a paradigm (or paradigms) is not a matter of free choice. You
have already made many assumptions about the world, your topic, and how we can
understand these, even if you have never consciously examined these. Choosing a par-
adigm or tradition primarily involves assessing which paradigms best fit with your
own assumptions and methodological preferences; Becker (1986, pp. 1617) makes
the same point about using theory in general. Trying to work within a paradigm (or
theory) that doesn't fit your assumptions is like trying to do a physically demanding
job in clothes that don't fit – at best you'll be uncomfortable, at worst it will keep you
from doing the job well. Such a lack of fit may not be obvious at the outset; it may
only emerge as you develop your conceptual framework, research questions, and
methods, since these should also be compatible with your paradigmatic stance.
Experiential Knowledge
Traditionally, what you bring to the research from your background and iden-
tity has been treated as bias, something whose influence needs to be eliminated
from the design, rather than a valuable component of it. However, the explicit
incorporation of your identity and experience (what Strauss, 1987, calls "experien-
tial data") in your research has recently gained much wider theoretical and philo-
sophical support (e.g., Berg & Smith, 1988; Denzin & Lincoln, 2000; Jansen &
Peshkin, 1992; Strauss, 1987). Using this experience in your research can provide
you with a major source of insights, hypotheses, and validity checks. For example,
Grady and Wallston (1988, p. 41) describe how one health care researcher used
insights from her own experience to design a study of why many women don't do
breast self-examination.
This is not a license to impose your assumptions and values uncritically on the
research. Reason (1988) uses the term "critical subjectivity" to refer to
However, there are few well-developed and explicit strategies for doing this. The
researcher identity memo is one technique; this involves reflecting on, and writ-
ing down, the different aspects of your experience that are potentially relevant to
your study. Example 7.1 is part of one of my own researcher identity memos, writ-
ten when I was working on a paper on diversity and community; Exercise 1 involves
writing your own researcher identity memo. (For more on this technique, see
Maxwell, 2005.) Doing this can generate unexpected insights and connections, as
well as create a valuable record of these.
I can't recall when I first became interested in diversity; it's been a major
concern for at least the past 20 years . . . I do remember the moment that I
consciously realized that my mission in life was to make the world safe for
diversity; I was in Regenstein Library at the University of Chicago one night
in the mid-1970s talking to another student about why we had gone into
anthropology, and the phrase suddenly popped into my head.
However, I never gave much thought to tracing this position any further
back. I remember, as an undergraduate, attending a talk on some political
topic, and being struck by two students bringing up issues of the rights of
particular groups to retain their cultural heritages; it was an issue that had
never consciously occurred to me. And I'm sure that my misspent youth
reading science fiction rather than studying had a powerful influence on my
sense of the importance of tolerance and understanding of diversity; I wrote
my essay for my application to college on tolerance in high school society.
But I didn't think much about where all this came from.
It was talking to the philosopher Amelie Rorty in the summer of 1991 that
really triggered my awareness of these roots. She had given a talk on the
concept of moral diversity in Plato, and I gave her a copy of my draft paper
on diversity and solidarity. We met for lunch several weeks later to discuss
these issues, and at one point she asked me how my concern with diversity
connected with my background and experiences. I was surprised by the
question, and found I really couldn't answer it. She, on the other hand, had
thought about this a lot, and talked about her parents' emigrating from
Belgium to the United States, deciding they were going to be farmers like
real Americans, and with no background in farming, buying land in rural
West Virginia and learning how to survive and fit into a community
composed of people very different from themselves.
This made me start thinking, and I realized that as far back as I can
remember I've felt different from other people, and had a lot of difficulties
as a result of this difference and my inability to fit in with peers, relatives,
or other people generally. This was all compounded by my own shyness and
tendency to isolate myself, and by the frequent moves that my family made
while I was growing up.
The way in which this connects with my work on diversity is that my main
strategy for dealing with my difference from others, as far back as I can
remember, was not to try to be more like them (similarity-based), but to try
to be helpful to them (contiguity-based). This is a bit oversimplified, because
I also saw myself as somewhat of a social chameleon, adapting to whatever
situation I was in, but this adaptation was much more an interactional
adaptation than one of becoming fundamentally similar to other people.
It now seems incomprehensible to me that I never saw the connections
between this background and my academic work.
[The remainder of the memo discusses the specific connections between
my experience and the theory of diversity and community that I had been
developing, which sees both similarity (shared characteristics) and contiguity
(interaction) as possible sources of solidarity and community.]
it is for most people the more problematic and confusing of the two, and then deal
with using prior research for other purposes than as a source of theory.
Using existing theory in qualitative research has both advantages and dangers. A
useful theory helps you organize your data. Particular pieces of information that
otherwise might seem unconnected or irrelevant to one another or to your research
questions can be related if you can fit them into the theory. A useful theory also illu-
minates what you are seeing in your research. It draws your attention to particular
events or phenomena and sheds light on relationships that might otherwise go
unnoticed or misunderstood.
However, Becker (1986) warns that the existing literature, and the assumptions
embedded in it, can deform the way you frame your research, causing you to over-
look important ways of conceptualizing your study or key implications of your
results. The literature has the advantage of what he calls "ideological hegemony,"
making it difficult for you to see any phenomenon in ways that are different from
those that are prevalent in the literature. Trying to fit your insights into this estab-
lished framework can deform your argument, weakening its logic and making it
harder for you to see what this new way of framing the phenomenon might con-
tribute. Becker describes how existing theory and perspectives deformed his early
research on marijuana use, leading him to focus on the dominant question in the lit-
erature and to ignore the most interesting implications and possibilities of his study.
Becker (1986) argues that there is no way to be sure when the established approach
is wrong or misleading or when your alternative is superior. All you can do is try to
identify the ideological component of the established approach, and see what hap-
pens when you abandon these assumptions. He asserts that a serious scholar ought
routinely to inspect competing ways of talking about the same subject matter, and
warns, "Use the literature, don't let it use you" (p. 149; see also Mills, 1959).
A review of relevant prior research can serve several other purposes in your design
besides providing you with existing theory (see Locke et al., 2004; Strauss, 1987, pp.
48–56). First, you can use it to develop a justification for your study – to show how
your work will address an important need or unanswered question. Second, it can
inform your decisions about methods, suggesting alternative approaches or revealing
potential problems with your plans. Third, it can be a source of data that you can use
to test or modify your theories. You can see if existing theory, the results of your pilot
research, or your experiential understanding is supported or challenged by previous
studies. Finally, you can use ideas in the literature to help you generate theory, rather
than simply borrowing such theory from the literature.
This is not simply a source of additional concepts for your theory; instead, it provides
you with an understanding of the meaning that these phenomena and events have for
the actors who are involved in them, and the perspectives that inform their actions.
In a qualitative study, these meanings and perspectives should constitute an impor-
tant focus of your theory; as discussed earlier, they are one of the things your theory
is about, not simply a source of theoretical insights and building blocks for the latter.
Thought Experiments
Thought experiments have a long and respected tradition in the physical
sciences (much of Einstein's work was based on thought experiments) but have
received little attention in discussions of research design, particularly qualitative
research design. Thought experiments draw on both theory and experience to
answer "what if" questions, to seek out the logical implications of various proper-
ties of the phenomena you want to study. They can be used both to test your cur-
rent theory for logical problems and to generate new theoretical insights. They
encourage creativity and a sense of exploration and can help you make explicit the
experiential knowledge that you already possess. Finally, they are easy to do, once
you develop the skill. Valuable discussions of thought experiments in the social
sciences are presented by Mills (1959) and Lave and March (1975).
Experience, prior theory and research, pilot studies, and thought experiments
are the four major sources of the conceptual framework for your study. The ways in
which you can put together a useful and valid conceptual framework from these
sources are particular to each study, and not something for which any cookbook
exists. The main thing to keep in mind is the need for integration of these compo-
nents with one another and with your goals and research questions.
Concept Mapping
A particularly valuable tool for generating and understanding these connections in
your research is a technique known as concept mapping (Miles & Huberman, 1994;
Novak & Gowin, 1984). Kane and Trochim (Chapter 14, this volume) provide an
overview of concept mapping but focus on using concept mapping with groups of
stakeholders for organizational improvement or evaluation, employing mainly quan-
titative techniques. However, concept mapping has many other uses, including clarifi-
cation and development of your own ideas about what's going on with the phenomena
you want to study. Exercise 2 is designed to help you develop an initial concept map
for your study (for additional guidance, see the sources above and Maxwell, 2005).
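For readers who prefer to draft a concept map digitally rather than by hand, the short Python sketch below draws a toy map whose nodes and links simply mirror the five design components discussed above; the choice of tool is an illustrative assumption, not a recommendation from this chapter.

# Toy concept map of the five design components, drawn with networkx.
import matplotlib.pyplot as plt
import networkx as nx

g = nx.Graph()
links = [("goals", "research questions"),
         ("conceptual framework", "research questions"),
         ("goals", "conceptual framework"),
         ("research questions", "methods"),
         ("research questions", "validity"),
         ("methods", "validity")]
g.add_edges_from(links)

nx.draw_networkx(g, pos=nx.spring_layout(g, seed=4),
                 node_color="lightgray", node_size=2500, font_size=8)
plt.axis("off")
plt.tight_layout()
plt.savefig("concept_map.png")   # or plt.show() for interactive use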
Research Questions:
What Do You Want to Understand?
Your research questions – what you specifically want to learn or understand
by doing your study – are at the heart of your research design. They are the one
component that directly connects to all the other components of the design. More
than any other aspect of your design, your research questions will have an influence
on, and should be responsive to, every other part of your study.
This is different from seeing research questions as the starting point or primary
determinant of the design. Models of design that place the formulation of research
questions at the beginning of the design process, and that see these questions as deter-
mining the other aspects of the design, don't do justice to the interactive and induc-
tive nature of qualitative research. The research questions in a qualitative study
should not be formulated in detail until the goals and conceptual framework (and
sometimes general aspects of the sampling and data collection) of the design are clar-
ified, and should remain sensitive and adaptable to the implications of other parts of
the design. Often, you will need to do a significant part of the research before it is clear
to you what specific research questions it makes sense to try to answer.
This does not mean that qualitative researchers should, or usually do, begin
studies with no questions, simply going into the field with open minds and seeing
what is there to be investigated. Every researcher begins with a substantial base of
experience and theoretical knowledge, and these inevitably generate certain ques-
tions about the phenomena studied. These initial questions frame the study in
important ways, influence decisions about methods, and are one basis for further
focusing and development of more specific questions. However, these specific ques-
tions are generally the result of an interactive design process, rather than the start-
ing point for that process. For example, Suman Bhattacharjea (1994; see also
Maxwell, 2005, p. 66) spent a year doing field research on women's roles in a
Pakistani educational district office before she was able to focus on two specific
research questions and submit her dissertation proposal; at that point, she had also
developed several hypotheses as tentative answers to these questions.
highly inductive, loosely designed studies make good sense when experienced
researchers have plenty of time and are exploring exotic cultures, understud-
ied phenomena, or very complex social phenomena. But if you're new to qual-
itative studies and are looking at a better understood phenomenon within a
familiar culture or subculture, a loose, inductive design is a waste of time.
Months of fieldwork and voluminous case studies may yield only a few banal-
ities. (p. 17)
They also point out that prestructuring reduces the amount of data that you
have to deal with, functioning as a form of preanalysis that simplifies the analytic
work required.
Unfortunately, most discussions of this issue treat prestructuring as a single
dimension, and view it in terms of metaphors such as "hard" versus "soft" and "tight" versus "loose." Such metaphors have powerful connotations (although they are different
for different people) that can lead you to overlook or ignore the numerous ways in
which studies can vary, not just in the amount of prestructuring, but in how pre-
structuring is used. For example, you could employ an extremely open approach to
data collection, but use these data for a confirmatory test of explicit hypotheses
based on a prior theory (e.g., Festinger, Riecken, & Schachter, 1956). In contrast, the
approach often known as ethnoscience or cognitive anthropology (Werner &
Schoepfle, 1987a, 1987b) employs highly structured data collection techniques, but
interprets these data in a largely inductive manner with very few preestablished
categories. Thus, the decision you face is not primarily whether or to what extent
you prestructure your study, but in what ways you do this, and why.
Finally, it is worth keeping in mind that you can lay out a tentative plan for some
aspects of your study in considerable detail, but leave open the possibility of sub-
stantially revising this if necessary. Emergent insights may require new sampling
plans, different kinds of data, and different analytic strategies.
I distinguish four main components of qualitative methods:
1. The research relationship that you establish with those you study
2. Sampling: what times, settings, or individuals you select to observe or inter-
view, and what other sources of information you decide to use
3. Data collection: how you gather the information you will use
4. Data analysis: what you do with this information to make sense of it
In quantitative research, coding typically consists of applying a preestablished set of categories to the data, with the primary goal being to generate frequency counts of the items in each cate-
gory. In qualitative research, in contrast, the goal of coding is not to produce counts
of things but to "fracture" (Strauss, 1987, p. 29) the data and rearrange it into cate-
gories that facilitate comparison between things in the same category and between
categories. These categories may be derived from existing theory, inductively gener-
ated during the research (the basis for what Glaser & Strauss, 1967, term grounded
theory), or drawn from the categories of the people studied (what anthropologists
call emic categories). Such categorizing makes it much easier for you to develop a
general understanding of what is going on, to generate themes and theoretical con-
cepts, and to organize and retrieve your data to test and support these general ideas.
(An excellent practical source on coding is Bogdan & Biklen, 2006.)
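As a rough illustration of what fracturing and re-sorting data into categories can look like once your transcripts are machine-readable, the following Python sketch groups hypothetical coded interview segments by category so that segments in the same category can be read side by side. The codes echo the grade retention example discussed later in this section; the segments themselves are invented, and the interpretive work of coding is not something a script can do for you.

```python
from collections import defaultdict

# Hypothetical coded interview segments: (code, speaker, text excerpt).
coded_segments = [
    ("retention_as_failure", "Teacher 3", "Holding him back felt like admitting we failed."),
    ("retention_as_last_resort", "Teacher 1", "We only retain when nothing else has worked."),
    ("retention_as_failure", "Parent 2", "She thought being kept back meant she was stupid."),
    ("self_confidence_as_goal", "Teacher 1", "My first priority is rebuilding her confidence."),
]

# "Fracture" the transcripts and re-sort the pieces by category,
# so that everything coded the same way can be compared side by side.
by_category = defaultdict(list)
for code, speaker, text in coded_segments:
    by_category[code].append((speaker, text))

for code, segments in by_category.items():
    print(f"\n== {code} ({len(segments)} segments) ==")
    for speaker, text in segments:
        print(f"  {speaker}: {text}")
```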
However, fracturing and categorizing your data can lead to the neglect of con-
textual relationships among these data, relationships based on contiguity rather
than similarity (Maxwell & Miller, 2008), and can create analytic blinders, prevent-
ing you from seeing alternative ways of understanding your data. Atkinson (1992)
describes how his initial categorizing analysis of data on the teaching of general
medicine affected his subsequent analysis of the teaching of surgery:
Substantive categories are primarily descriptive; they stay close to the data and do not inherently imply a more abstract theory. In the study of grade retention mentioned above, examples of substantive categories would be "retention as failure," "retention as a last resort," "self-confidence as a goal," "parents' willingness to try alternatives," and "not being in control (of the decision)" (drawn from McMillan & Schumacher,
2001, p. 472). Substantive categories are often inductively developed through a close
"open coding" of the data (Corbin & Strauss, 2007). They can be used in developing a more general theory of what's going on, but they don't depend on this theory.
Theoretical categories, in contrast, place the coded data into a more general or
abstract framework. These categories may be derived either from prior theory or
from an inductively developed theory (in which case the concepts and the theory
are usually developed concurrently). They usually represent the researcher's concepts (what are called "etic" categories), rather than denoting participants' own concepts ("emic" concepts). For example, the categories "nativist," "remediationist," or "interactionist," used to classify teachers' beliefs about grade retention in terms
of prior analytic distinctions (Smith & Shepard, 1988), would be theoretical.
The distinction between organizational categories and substantive or theoretical
categories is important because some qualitative researchers use mostly organiza-
tional categories to formally analyze their data, and don't systematically develop and
apply substantive or theoretical categories in developing their conclusions. The more
data you have, the more important it is to create the latter types of categories; with
any significant amount of data, you can't hold all the data relevant to particular sub-
stantive or theoretical points in your mind, and need a formal organization and
retrieval system. In addition, creating substantive categories is particularly important
for ideas (including participants' ideas) that don't fit into existing organizational or
theoretical categories; such substantive ideas may get lost, or never developed, unless
they can be captured in explicit categories. Consequently, you need to include strate-
gies for developing substantive and theoretical categories in your design.
Connecting strategies, instead of fracturing the initial text into discrete elements
and re-sorting it into categories, attempt to understand the data (usually, but not nec-
essarily, an interview transcript or other textual material) in context, using various
methods to identify the relationships among the different elements of the text. Such
strategies include some forms of case studies (Patton, 1990), profiles (Seidman, 1991),
some types of narrative analysis (Coffey & Atkinson, 1996), and ethnographic micro-
analysis (Erickson, 1992). What all these strategies have in common is that they look
for relationships that connect statements and events within a particular context into a
coherent whole. Atkinson (1992) states,
I am now much less inclined to fragment the notes into relatively small seg-
ments. Instead, I am just as interested in reading episodes and passages at
greater length, with a correspondingly different attitude toward the act of
reading and hence of analysis. Rather than constructing my account like a
patchwork quilt, I feel more like working with the whole cloth . . . To be more
precise, what now concerns me is the nature of these products as texts. (p. 460)
had to do with how individual historians thought about their work: their theories
theories about how the different topics were connected, and the relation-
ships that they saw between their thinking, actions, and results.
Answering the latter question would have required an analysis that
elucidated these connections in each historian's interview. However, the
categorizing analysis on which the report was based fragmented these
connections, destroying the contextual unity of each historian's views and
allowing only a collective presentation of shared concerns. Agar argues that
the fault was not with The Ethnograph, which is extremely useful for answering questions that require categorization, but with its misapplication. He comments that The Ethnograph represents "a part of an ethnographic research process. When the part is taken for the whole, you get a pathological metonym that can lead you straight to the right answer to the wrong question" (p. 181).
SOURCE: From "The Right Brain Strikes Back," by M. Agar, in Using Computers in Qualitative Research, edited by N. G. Fielding and R. M. Lee, 1991. Copyright by SAGE.
Figure 7.3 Adaptation of the Data Planning Matrix for a Study of American Indian At-Risk High School Students

[Figure 7.3 is a two-page data planning matrix. Its columns are: What do I need to know? Why do I need to know this? What kind of data will answer the questions? Where can I find the data? Whom do I contact for access? Timelines for acquisition. Its rows address questions such as: What are the truancy rates for American Indian students? What is the academic achievement of the students? What is the English-language proficiency of the students? What do American Indian students dislike about school? What do students plan to do after high school? What do teachers know about the home culture of their students? What do teachers do to integrate knowledge of the students' home culture and community into their teaching? For each question, the matrix lists the data sources (e.g., attendance records, test scores, surveys, interviews, lesson plans, classroom observations), the offices and people to contact for access, and a timeline for acquisition.]

SOURCE: This figure was published in Ethnography and Qualitative Design in Educational Research, 2nd ed., by M. D. LeCompte & J. Preissle, with R. Tesch. Copyright 1993 by Academic Press.
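If you keep a planning matrix such as Figure 7.3 in electronic form, even a very simple structure lets you check that every question has a data source, a contact person, and a timeline. The Python sketch below is a hypothetical, stripped-down version of one row of such a matrix, not a reconstruction of LeCompte and Preissle's figure.

```python
# A minimal, hypothetical data-planning matrix: one dictionary per research question.
planning_matrix = [
    {
        "need_to_know": "What are the truancy rates for American Indian students?",
        "why": "To assess the impact of attendance on students' persistence in school",
        "data": "Computerized student attendance records",
        "where": "Attendance offices; assistant principals' offices",
        "contact": "High school assistant principal; middle school principal",
        "timeline": "August: establish database; October: update; June: final tally",
    },
]

# Flag any row that is missing an entry in one of the six planning columns.
required = ["need_to_know", "why", "data", "where", "contact", "timeline"]
for row in planning_matrix:
    missing = [col for col in required if not row.get(col)]
    status = "OK" if not missing else f"missing: {', '.join(missing)}"
    print(f"{row['need_to_know'][:50]}... -> {status}")
```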
Two specific validity threats are often raised about qualitative studies: researcher bias, and the effect of the researcher on the setting or individu-
als studied, generally known as reactivity.
Bias refers to ways in which data collection or analysis are distorted by the
researcher's theory, values, or preconceptions. It is clearly impossible to deal with
these problems by eliminating these theories, preconceptions, or values, as dis-
cussed earlier. Nor is it usually appropriate to try to standardize the researcher to
achieve reliability; in qualitative research, the main concern is not with eliminating
variance between researchers in the values and expectations that they bring to the
study but with understanding how a particular researcher's values influence the
conduct and conclusions of the study. As one qualitative researcher, Fred Hess, has
phrased it, validity in qualitative research "is the result not of indifference, but of integrity" (personal communication).
Reactivity is another problem that is often raised about qualitative studies. The
approach to reactivity of most quantitative research, of trying to control for the
effect of the researcher, is appropriate to a variance theory perspective, in which
the goal is to prevent researcher variability from being an unwanted cause of vari-
ability in the outcome variables. However, eliminating the actual influence of the
researcher is impossible (Hammersley & Atkinson, 1995), and the goal in a qual-
itative study is not to eliminate this influence but to understand it and to use it
productively.
For participant observation studies, reactivity is generally not as serious a valid-
ity threat as many people believe. Becker (1970, 45ff.) points out that in natural set-
tings, an observer is generally much less of an influence on participants' behavior
than is the setting itself (though there are clearly exceptions to this, such as settings
in which illegal behavior occurs). For all types of interviews, in contrast, the inter-
viewer has a powerful and inescapable influence on the data collected; what the
interviewee says is always a function of the interviewer and the interview situation
(Briggs, 1986; Mishler, 1986). Although there are some things that you can do to
prevent the more undesirable consequences of this (such as avoiding leading ques-
tions), trying to minimize your effect on the interviewee is an impossible goal. As
discussed above for bias, what is important is to understand how you are influ-
encing what the interviewee says, and how to most productively (and ethically) use
this influence to answer your research questions.
1. Intensive, long-term involvement: Becker and Geer (1957) claim that long-
term participant observation provides more complete data about specific situations
and events than any other method. Not only does it provide more, and more dif-
ferent kinds, of data, but the data are more direct and less dependent on inference.
Repeated observations and interviews, as well as the sustained presence of the
researcher in the setting studied, can help rule out spurious associations and pre-
mature theories. They also allow a much greater opportunity to develop and test
alternative hypotheses during the course of the research. For example, Becker
(1970, pp. 49-51) argues that his lengthy participant observation research with
medical students not only allowed him to get beyond their public expressions of
cynicism about a medical career and uncover an idealistic perspective, but also
enabled him to understand the processes by which these different views were
expressed in different social situations and how students dealt with the conflicts
between these perspectives.
2. Rich data: Both long-term involvement and intensive interviews enable you
to collect rich data, data that are detailed and varied enough that they provide a
full and revealing picture of what is going on (Becker, 1970, 51ff.). In interview
studies, such data generally require verbatim transcripts of the interviews, not just
notes on what you felt was significant. For observation, rich data are the product of
detailed, descriptive note-taking (or videotaping and transcribing) of the specific,
concrete events that you observe. Becker (1970) argued that such data
counter the twin dangers of respondent duplicity and observer bias by
making it difficult for respondents to produce data that uniformly support
a mistaken conclusion, just as they make it difficult for the observer to
restrict his observations so that he sees only what supports his prejudices
and expectations. (p. 53)
3. Respondent validation: Respondent validation (Bryman, 1988, pp. 78-80; Lincoln & Guba, 1985, refer to this as "member checks") is systematically soliciting feedback about one's data and conclusions from the people you are studying. This is
the single most important way of ruling out the possibility of misinterpreting the
meaning of what participants say and do and the perspective they have on what is
going on, as well as being an important way of identifying your own biases and mis-
understandings of what you observed. However, participants' feedback is no more
inherently valid than their interview responses; both should be taken simply as evi-
dence regarding the validity of your account (see also Hammersley & Atkinson, 1995).
4. Searching for discrepant evidence and negative cases: Identifying and analyzing
discrepant data and negative cases is a key part of the logic of validity testing in
qualitative research. Instances that cannot be accounted for by a particular inter-
pretation or explanation can point up important defects in that account. However,
there are times when an apparently discrepant instance is not persuasive, as when
the interpretation of the discrepant data is itself in doubt. The basic principle here
is that you need to rigorously examine both the supporting and discrepant data to
assess whether it is more plausible to retain or modify the conclusion, being aware
of all of the pressures to ignore data that do not fit your conclusions. In particularly
difficult cases, the best you may be able to do is to report the discrepant evidence
and allow readers to evaluate this and draw their own conclusions (Wolcott, 1990).
5. Triangulation: Triangulation, collecting information from a diverse range of individuals and settings using a variety of methods, was discussed earlier. This
strategy reduces the risk of chance associations and of systematic biases due to a
specific method and allows a better assessment of the generality of the explanations
that one develops. The most extensive discussion of triangulation as a validity-
testing strategy in qualitative research is by Fielding and Fielding (1986).
6. Quasi-Statistics: Many of the conclusions of qualitative studies have an
implicit quantitative component. Any claim that a particular phenomenon is typi-
cal, rare, or prevalent in the setting or population studied is an inherently quanti-
tative claim and requires some quantitative support. Becker (1970) coined the term
"quasi-statistics" to refer to the use of simple numerical results that can be readily derived from the data. He argues that "one of the greatest faults in most observational case studies has been their failure to make explicit the quasi-statistical basis of their conclusions" (pp. 81-82).
Quasi-statistics not only allow you to test and support claims that are inher-
ently quantitative, but also enable you to assess the amount of evidence in your data
that bears on a particular conclusion or threat, such as how many discrepant
instances exist and from how many different sources they were obtained (a brief counting sketch follows the last strategy below).
7. Comparison: Although explicit comparisons (such as control groups) for the
purpose of assessing validity threats are mainly associated with quantitative research,
there are valid uses for comparison in qualitative studies, particularly multisite stud-
ies (e.g., Miles & Huberman, 1994, p. 237). In addition, single case studies often
incorporate implicit comparisons that contribute to the interpretability of the case.
For example, Martha Regan-Smith (1992), in her uncontrolled study of how
exemplary medical school teachers helped students learn, used both the existing
literature on typical medical school teaching and her own extensive knowledge
of this topic to determine what was distinctive about the teachers she studied.
Furthermore, the students that she interviewed explicitly contrasted these teachers
with others whom they felt were not as helpful to them, explaining not only what the
exemplary teachers did that increased their learning, but why this was helpful.
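As a minimal illustration of the quasi-statistics strategy described in item 6 above, the following Python sketch counts how many coded instances bear on a conclusion and how many distinct sources they come from. The codes and sources are invented for illustration.

```python
from collections import Counter

# Hypothetical coded instances: (code, source) pairs drawn from interviews and field notes.
instances = [
    ("cynicism_expressed", "interview_01"),
    ("cynicism_expressed", "fieldnotes_week2"),
    ("idealism_expressed", "interview_01"),
    ("cynicism_expressed", "interview_04"),
    ("discrepant_case", "fieldnotes_week5"),
]

# How many instances support each claim, and from how many different sources?
counts = Counter(code for code, _ in instances)
sources = {code: {src for c, src in instances if c == code} for code in counts}

for code, n in counts.most_common():
    print(f"{code}: {n} instances from {len(sources[code])} different sources")
```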
Exercise 4 is designed to help you identify, and develop strategies to deal with,
the most important validity threats to your conclusions.
Freidson (1975), in his study of social controls on the work of a medical group practice, deliberately selected an atypical practice, one in which the physicians were
better trained and more progressive than usual and that was structured precisely
to deal with the problems that he was studying. He argues that the documented fail-
ure of social controls in this case provides a far stronger argument for the general-
izability of his conclusions than would the study of a typical practice.
The generalizability of qualitative studies is usually based not on explicit sam-
pling of some defined population to which the results can be extended, but on the
development of a theory that can be extended to other cases (Becker, 1991; Ragin,
1987); Yin (1994) refers to this as analytic, as opposed to statistical, generalization.
For this reason, Guba and Lincoln (1989) prefer to talk of transferability rather
than generalizability in qualitative research. Hammersley (1992, pp. 189-191) and Weiss (1994, pp. 26-29) list a number of features that lend credibility to generalizations made from case studies or nonrandom samples, including respondents' own
assessments of generalizability, the similarity of dynamics and constraints to other
situations, the presumed depth or universality of the phenomenon studied, and cor-
roboration from other studies. However, none of these permits the kind of precise
extrapolation of results to defined populations that probability sampling allows.
Conclusion
Harry Wolcott (1990) provided a useful metaphor for research design: "Some of the best advice I've ever seen for writers happened to be included with the directions I found for assembling a new wheelbarrow: Make sure all parts are properly in place before tightening" (p. 47). Like a wheelbarrow, your research design not only needs to have all the required parts, it has to work, to function smoothly and accomplish
its tasks. This requires attention to the connections among the different parts of the
design, what I call coherence. There isn't one right way to create a coherent quali-
tative design; in this chapter I have tried to give you the tools that will enable you
to put together a way that works for you and your research.
Discussion Questions
The following questions are ones that are valuable to review before beginning (or
continuing) with the design of a qualitative study.
1. Why are you thinking of doing a qualitative study of the topic you've chosen?
How would your study use the strengths of qualitative research? How would it deal
with the limitations of qualitative research?
2. What do you already know or believe about your topic or problem? Where
do these beliefs come from? How do the different beliefs fit together into a coher-
ent picture of this topic or problem?
3. What do you not know about your topic or problem that a qualitative study
could help you understand?
Exercises
These exercises give you an opportunity to work through several of the most
important issues in designing a qualitative study. Other important issues are
addressed in the discussion questions.
1. What prior experiences have you had that are relevant to your topic or set-
ting? What assumptions about your topic or setting have resulted from these expe-
riences? What goals have emerged from these? How have these experiences,
assumptions, and goals shaped your decision to choose this topic, and the way you
are approaching this project?
2. What potential advantages do you think these goals, beliefs, and experiences
have for your study? What potential disadvantages do you think these may create
for you, and how might you deal with these?
1. Begin by thinking about your goals for this study. What could you learn in a
research study that would help accomplish these goals? What research questions
does this suggest? Conversely, how do any research questions you may already have
formulated connect to your goals in conducting the study? How will answering
these specific questions help you achieve your goals? Which questions are most
interesting to you, personally, practically, or intellectually?
2. Next, connect these research questions to your conceptual framework. What
would answering these questions tell you that you don't already know? Where are the places in this framework that you don't understand adequately or where you need to
test your ideas? What could you learn in a research study that would help you better
understand what's going on with these phenomena? What changes or additions to
your questions does your framework suggest? Conversely, are there places where
your questions imply things that should be in your framework, but aren't?
3. Now focus. What questions are most central for your study? How do these
questions form a coherent set that will guide your study? You can't study everything
interesting about your topic; start making choices. Three or four main questions are
usually a reasonable maximum for a qualitative study, although you can have addi-
tional subquestions for each of the main questions.
4. In addition, you need to consider how you could actually answer the ques-
tions you pose. What methods would you need to use to collect data that would
answer these questions? Conversely, what questions can a qualitative study of the
kind you are planning productively address? At this point in your planning, this
may primarily involve thought experiments about the way you will conduct the
study, the kinds of data you will collect, and the analyses you will perform on these
data. This part of the exercise is one you can usefully repeat when you have devel-
oped your methods and validity concerns in more detail.
5. Assess the potential answers to your questions in terms of validity. What are
the plausible validity threats and alternative explanations that you would have to
rule out? How might you be wrong, and what implications does this have for the
way you frame your questions?
Don't get stuck on trying to precisely frame your research questions or in spec-
ifying in detail how to measure things or gain access to data that would answer your
questions. Try to develop some meaningful and important questions that would be
worth answering. Feasibility is obviously an important issue in doing research, but
focusing on it at the beginning can abort a potentially valuable study.
A valuable additional step is to share your questions and your reflections on these
with a small group of fellow students or colleagues. Ask them if they understand the
questions and why these would be worth answering, what other questions or
changes in the questions they would suggest, and what problems they see in trying
to answer them. If possible, tape record the discussion; afterward, listen to the tape
and take notes.
Remember that some validity threats are unavoidable; you will need to acknowl-
edge these in your proposal or in the conclusions to your study, but no one expects
you to have airtight answers to every possible threat. The key issue is how plausible
and how serious these unavoidable threats are.
References
Agar, M. (1991). The right brain strikes back. In N. G. Fielding & R. M. Lee (Eds.), Using
computers in qualitative research (pp. 181-194). Newbury Park, CA: Sage.
Atkinson, P. (1992). The ethnography of a medical setting: Reading, writing, and rhetoric.
Qualitative Health Research, 2, 451-474.
Becker, H. S. (1970). Sociological work: Method and substance. New Brunswick, NJ: Transaction
Books.
Becker, H. S. (1986). Writing for social scientists: How to start and finish your thesis, book, or
article. Chicago: University of Chicago Press.
Becker, H. S. (1991). Generalizing from case studies. In E. W. Eisner & A. Peshkin (Eds.),
Qualitative inquiry in education: The continuing debate (pp. 233-242). New York:
Teachers College Press.
Becker, H. S., & Geer, B. (1957). Participant observation and interviewing: A comparison.
Human Organization, 16, 28-32.
Becker, H. S., Geer, B., Hughes, E. C., & Strauss, A. L. (1961). Boys in white: Student culture in
medical school. Chicago: University of Chicago Press.
Berg, D. N., & Smith, K. K. (Eds.). (1988). The self in social inquiry: Researching methods.
Newbury Park, CA: Sage.
Bhattacharjea, S. (1994). Reconciling public and private: Women in the educational
bureaucracy in Sinjabistan Province, Pakistan. Unpublished doctoral dissertation,
Harvard Graduate School of Education.
Bogdan, R. C., & Biklen, S. K. (2006). Qualitative research for education: An introduction to
theory and methods (5th ed.). Boston: Allyn & Bacon.
Bolster, A. S. (1983). Toward a more effective model of research on teaching. Harvard
Educational Review, 53, 294-308.
Bredo, E., & Feinberg, W. (1982). Knowledge and values in social and educational research.
Philadelphia: Temple University Press.
Briggs, C. L. (1986). Learning how to ask: A sociolinguistic appraisal of the role of the interview
in social science research. Cambridge, UK: Cambridge University Press.
Bryman, A. (1988). Quantity and quality in social research. London: Unwin Hyman.
Campbell, D. T. (1988). Methodology and epistemology for social science: Selected papers.
Chicago: University of Chicago Press.
Campbell, D. T., & Stanley, J. C. (1967). Experimental and quasi-experimental designs for
research. Chicago: Rand McNally.
Christians, C. G. (2000). Ethics and politics in qualitative research. In N. K. Denzin & Y. S. Lincoln
(Eds.), Handbook of qualitative research (2nd ed., pp. 133-155). Thousand Oaks, CA: Sage.
Coffey, A., & Atkinson, P. (1996). Making sense of qualitative data: Complementary research
strategies. Thousand Oaks, CA: Sage.
Corbin, J. M., & Strauss, A. C. (2007). Basics of qualitative research: Techniques and procedures
for developing grounded theory (3rd ed.). Thousand Oaks, CA: Sage.
Cousins, J. B., & Earl, L. M. (Eds.). (1995). Participatory evaluation in education: Studies in
evaluation use and organizational learning. London: Falmer Press.
Creswell, J. W. (1997). Qualitative inquiry and research design: Choosing among five traditions.
Thousand Oaks, CA: Sage.
Denzin, N. K. (Ed.). (1970). Sociological methods: A sourcebook. Chicago: Aldine.
Denzin, N. K., & Lincoln, Y. S. (2000). The SAGE handbook of qualitative research (2nd ed.).
Thousand Oaks, CA: Sage.
Denzin, N. K., & Lincoln, Y. S. (2005). The SAGE handbook of qualitative research (3rd ed.).
Thousand Oaks, CA: Sage.
Dey, I. (1993). Qualitative data analysis: A user-friendly guide for social scientists. London:
Routledge.
Eisner, E. W., & Peshkin, A. (Eds.). (1990). Qualitative inquiry in education: The continuing
debate. New York: Teachers College Press.
Emerson, R. M., Fretz, R. I., & Shaw, L. L. (1995). Writing ethnographic fieldnotes. Chicago:
University of Chicago Press.
Erickson, F. (1992). Ethnographic microanalysis of interaction. In M. D. LeCompte,
W. L. Millroy, & J. Preissle (Eds.), The handbook of qualitative research in education
(pp. 201-225). San Diego, CA: Academic Press.
Festinger, L., Riecken, H. W., & Schachter, S. (1956). When prophecy fails. Minneapolis:
University of Minnesota Press.
Fetterman, D. M., Kaftarian, S. J., & Wandersman, A. (Eds.). (1996). Empowerment evaluation:
Knowledge and tools for self-assessment and accountability. Thousand Oaks, CA: Sage.
Fielding, N. G., & Fielding, J. L. (1986). Linking data. Beverly Hills, CA: Sage.
Fine, M., Weis, L., Weseen, S., & Wong, L. (2000). For whom? Qualitative research, represen-
tations, and social responsibilities. In N. Denzin & Y. Lincoln (Eds.), Handbook of qual-
itative research (2nd ed., pp. 107-131). Thousand Oaks, CA: Sage.
Frederick, C. M., et al. (Eds.). (1993). Merriam-Webster's collegiate dictionary (10th ed.).
Springfield, MA: Merriam-Webster.
Freidson, E. (1975). Doctoring together: A study of professional social control. Chicago:
University of Chicago Press.
Geertz, C. (1973). The interpretation of cultures: Selected essays. New York: Basic Books.
Given, L. M. (in press). The SAGE encyclopedia of qualitative research methods. Thousand
Oaks, CA: Sage.
Glaser, B. G., & Strauss, A. L. (1967). The discovery of grounded theory: Strategies for qualita-
tive research. Chicago: Aldine.
Glesne, C. (2005). Becoming qualitative researchers: An introduction (3rd ed.). Boston: Allyn
& Bacon.
Grady, K. E., & Wallston, B. S. (1988). Research in health care settings. Newbury Park, CA:
Sage.
Guba, E. G., & Lincoln, Y. S. (1989). Fourth generation evaluation. Newbury Park, CA: Sage.
Hammersley, M. (1992). What's wrong with ethnography? Methodological explorations.
London: Routledge.
Hammersley, M., & Atkinson, P. (1995). Ethnography: Principles in practice (2nd ed.).
London: Routledge.
Huberman, A. M., & Miles, M. B. (1988). Assessing local causality in qualitative research.
In D. N. Berg & K. K. Smith (Eds.), The self in social inquiry: Researching methods
(pp. 351-381). Newbury Park, CA: Sage.
Jansen, G., & Peshkin, A. (1992). Subjectivity in qualitative research. In M. D. LeCompte,
W. L. Millroy, & J. Preissle (Eds.), The handbook of qualitative research in education
(pp. 681-725). San Diego, CA: Academic Press.
Kaplan, A. (1964). The conduct of inquiry. San Francisco: Chandler.
Kidder, L. H. (1981). Qualitative research and quasi-experimental frameworks. In M. B. Brewer
& B. E. Collins (Eds.), Scientific inquiry and the social sciences (pp. 226-256). San
Francisco: Jossey-Bass.
Lave, C. A., & March, J. G. (1975). An introduction to models in the social sciences. New York:
Harper & Row.
LeCompte, M. D., & Preissle, J. (with Tesch, R.). (1993). Ethnography and qualitative design
in educational research (2nd ed.). San Diego, CA: Academic Press.
Lincoln, Y. S., & Guba, E. G. (1985). Naturalistic inquiry. Beverly Hills, CA: Sage.
Locke, L., Silverman, S. J., & Spirduso, W. W. (2004). Reading and understanding research
(2nd ed.). Thousand Oaks, CA: Sage.
Locke, L., Spirduso, W. W., & Silverman, S. J. (1993). Proposals that work (3rd ed.). Newbury
Park, CA: Sage.
Locke, L., Spirduso, W. W., & Silverman, S. J. (2000). Proposals that work (4th ed.). Thousand
Oaks, CA: Sage.
Marshall, C., & Rossman, G. (1999). Designing qualitative research (3rd ed.). Thousand Oaks,
CA: Sage.
Maxwell, J. A. (1986). The conceptualization of kinship in an Inuit community. Unpublished
doctoral dissertation, University of Chicago.
Maxwell, J. A. (1992). Understanding and validity in qualitative research. Harvard
Educational Review, 62, 279-300.
Maxwell, J. A. (2004a). Causal explanation, qualitative research, and scientific inquiry in
education. Educational Researcher, 33(2), 3-11.
Maxwell, J. A. (2004b). Using qualitative methods for causal explanation. Field Methods,
16(3), 243-264.
Maxwell, J. A. (2005). Qualitative research design: An interactive approach (2nd ed.).
Thousand Oaks, CA: Sage.
Maxwell, J. A. (2006). Literature reviews of, and for, educational research: A response to
Boote and Beile. Educational Researcher, 35(9), 28-31.
Maxwell, J. A., Cohen, R. M., & Reinhard, J. D. (1983). A qualitative study of teaching rounds in a
department of medicine. In Proceedings of the twenty-second annual conference on Research
in Medical Education. Washington, DC: Association of American Medical Colleges.
Maxwell, J. A., & Loomis, D. (2002). Mixed method design: An alternative approach. In
A. Tashakkori & C. Teddlie (Eds.), Handbook of mixed methods in social and behavioral
research (pp. 241-271). Thousand Oaks, CA: Sage.
Maxwell, J. A., & Miller, B. A. (2008). Categorizing and connecting strategies in qualitative
data analysis. In P. Leavy & S. Hesse-Biber (Eds.), Handbook of emergent methods
(pp. 461-477). New York: Guilford Press.
McMillan, J. H., & Schumacher, S. (2001). Research in education: A conceptual introduction.
New York: Longman.
Miles, M. B., & Huberman, A. M. (1994). Qualitative data analysis: An expanded sourcebook
(2nd ed.). Thousand Oaks, CA: Sage.
Mills, C. W. (1959). The sociological imagination. New York: Oxford University Press.
Mishler, E. G. (1986). Research interviewing: Context and narrative. Cambridge, MA: Harvard
University Press.
Mohr, L. (1982). Explaining organizational behavior. San Francisco: Jossey-Bass.
Mohr, L. (1995). Impact analysis for program evaluation (2nd ed.). Thousand Oaks, CA: Sage.
Mohr, L. (1996). The causes of human behavior: Implications for theory and method in the
social sciences. Ann Arbor: University of Michigan Press.
Norris, S. P. (1983). The inconsistencies at the foundation of construct validation theory. In
E. R. House (Ed.), Philosophy of evaluation (pp. 53-74). San Francisco: Jossey-Bass.
Novak, J. D., & Gowin, D. B. (1984). Learning how to learn. Cambridge, UK: Cambridge
University Press.
Oja, S. N., & Smulyan, L. (1989). Collaborative action research: A developmental approach.
London: Falmer Press.
Patton, M. Q. (1990). Qualitative evaluation and research methods (2nd ed.). Newbury Park,
CA: Sage.
Patton, M. Q. (2000). Qualitative evaluation and research methods (3rd ed.). Thousand Oaks,
CA: Sage.
Pitman, M. A., & Maxwell, J. A. (1992). Qualitative approaches to evaluation. In M. D. LeCompte,
W. L. Millroy, & J. Preissle (Eds.), The handbook of qualitative research in education
(pp. 729-770). San Diego, CA: Academic Press.
Rabinow, P., & Sullivan, W. M. (1979). Interpretive social science: A reader. Berkeley:
University of California Press.
Ragin, C. C. (1987). The comparative method: Moving beyond qualitative and quantitative
strategies. Berkeley: University of California Press.
Reason, P. (1988). Introduction. In P. Reason (Ed.), Human inquiry in action: Developments
in new paradigm research (pp. 1-17). Newbury Park, CA: Sage.
Reason, P. (1994). Three approaches to participative inquiry. In N. K. Denzin & Y. S. Lincoln
(Eds.), Handbook of qualitative research (pp. 324-339). Thousand Oaks, CA: Sage.
Regan-Smith, M. G. (1992). The teaching of basic science in medical school: The students'
perspective. Unpublished dissertation, Harvard Graduate School of Education.
Robson, C. (2002). Real world research: A resource for social scientists and practitioner-
researchers (2nd ed.). Oxford, UK: Blackwell.
Sayer, A. (1992). Method in social science: A realist approach (2nd ed.). London: Routledge.
Schram, T. H. (2005). Conceptualizing and proposing qualitative research. Upper Saddle River,
NJ: Merrill Prentice Hall.
Scriven, M. (1991). Beyond formative and summative evaluation. In M. W. McLaughlin &
D. C. Phillips (Eds.), Evaluation and education at quarter century (pp. 19-64). Chicago:
National Society for the Study of Education.
Seidman, I. E. (1991). Interviewing as qualitative research. New York: Teachers College Press.
Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental
designs for generalized causal inference. Boston: Houghton Mifflin.
Smith, M. L., & Shepard, L. A. (1988). Kindergarten readiness and retention: A qualitative study
of teachers' beliefs and practices. American Educational Research Journal, 25(3), 307-333.
Strauss, A. L. (1987). Qualitative analysis for social scientists. New York: Cambridge University
Press.
Strauss, A. L. (1995). Notes on the nature and development of general theories. Qualitative
Inquiry, 1, 7-18.
Tolman, D. L., & Brydon-Miller, M. (2001). From subjects to subjectivities: A handbook of
interpretive and participatory methods. New York: New York University Press.
Tukey, J. (1962). The future of data analysis. Annals of Mathematical Statistics, 33, 1-67.
Weiss, R. S. (1994). Learning from strangers: The art and method of qualitative interviewing.
New York: Free Press.
Weitzman, E. A. (2000). Software and qualitative research. In Denzin & Lincoln (Eds.),
Handbook of qualitative research (2nd ed., pp. 803820). Thousand Oaks, CA: Sage.
Werner, O., & Schoepfle, G. M. (1987a). Systematic fieldwork: Vol. 1. Foundations of ethnogra-
phy and interviewing. Newbury Park, CA: Sage.
Werner, O., & Schoepfle, G. M. (1987b). Systematic fieldwork: Vol. 2. Ethnographic analysis
and data management. Newbury Park, CA: Sage.
Whyte, W. F. (Ed.). (1991). Participatory action research. Newbury Park, CA: Sage.
Wolcott, H. F. (1990). Writing up qualitative research. Newbury Park, CA: Sage.
Wolcott, H. F. (1995). The art of fieldwork. Walnut Creek, CA: AltaMira Press.
Yin, R. K. (1994). Case study research: Design and methods (2nd ed.). Thousand Oaks, CA: Sage.
CHAPTER 8
How to Do Better
Case Studies
(With Illustrations From
20 Exemplary Case Studies)
Robert K. Yin
Although other steps also are important in doing case study research, somehow
these four have posed the most formidable demands. If you can meet them, you will
be able to conduct high-quality case studies, ones that may be better and more
distinctive than those of your peers. Because of the importance of the four steps,
this advantage will prevail whether you are doing a dissertation, case study evalua-
tions (e.g., U.S. Government Accountability Office, 1990), case studies of natural
settings (e.g., Feagin, Orum, & Sjoberg, 1991), or more theory-based (e.g., George
& Bennett, 2004; Sutton & Staw, 1995) or norm-based (e.g., Thacher, 2006) case
study research.3
Second, the chapter goes beyond merely describing the relevant research proce-
dures. It also refers to many exemplary examples from the existing case study litera-
ture.4 The examples include some of the best case studies ever done, including a case
study that is more than 75 years old but that is still in print. The richness of the
examples permits the discussion of the four steps (and especially the fourth and most difficult step, doing case study analysis) to be deeper than commonly found in
other texts. In this sense, this chapter should help you do more advanced case studies.
The exemplary examples come from different fields, such as community sociol-
ogy, public health services, national and international politics, urban planning,
business management, criminal justice, and education. The hope is that among
these examples you will find case studies that cover not only methodologically
important issues but also topics relevant to your interests.
Practical Considerations
From a practical standpoint, you will be devoting significant time to your case
study. You therefore would like to reduce any likelihood of finding that, midstream,
your case will not work out.
The most frequent surprise involves some disappointment regarding the actual
availability, quality, or relevance of the case study data. For instance, you might have
planned to interview several key persons as part of your case study but later found
only limited or no access to these persons. Similarly, you might have planned to use
what you had originally considered to be a rich source of documentary evidence,
only later to find its contents to be unhelpful and irrelevant to your case study.
Last, you might have counted on an organization or agency updating an annual
data set, to provide a needed comparison to earlier years, only later to learn that the
update will be significantly delayed. Any of these three situations could then cause
you to search for another case to study, making you start all over again.
These and other practical situations need, as much as possible, to be investigated
prior to starting your case study. A commonplace practice in other types of
Substantive Considerations
The selection process, however, should not dwell on practical considerations
only. You should be ambitious enough to try to select a significant or special case
for your case study, as a more mundane case may not produce an acceptable study
(or even dissertation). Think of the possibility that your case study may be one of
the few that you ever might complete and that you, therefore, would like to put your
efforts into as important, interesting, or significant a case study as possible.
What makes a case special? One possibility arises if your case covers some distinc-
tive event or condition, such as the revival or renewal of a major organization; the creation and confirmed efficacy of a new medical procedure; the discovery of a new way of reducing youth gang violence; a critical political election; some dramatic neigh-
borhood change; or even the occurrence and aftermath of a natural disaster. By defi-
nition, these are likely to be remarkable circumstances. To do a good case study of any
of them may produce an exemplary piece of research (see Case Studies 1 and 2).
Two historically distinctive, if not unique, events were the Swine Flu Scare
and the Cuban Missile Crisis. Both events became the subjects of now
well-known case studies in the field of political science.
In the first case (Neustadt & Fineberg, 1983), the United States faced a
threat of epidemic proportions from a new, and potentially lethal, influenza
strain. As a result, the U.S. government planned and then tried to immunize
the whole U.S. population. Over a 10-week period, the immunization effort
reached 40 million people before the campaign was ended amidst contro-
versy, delay, administrative troubles, and legal complications.
In the second case (Allison, 1971), a nuclear holocaust between the United
States and the former Soviet Union threatened the survival of the entire world.
The case study investigates how and why military and diplomatic maneuvers
successfully eliminated the confrontation. With the later availability of
new documentation after the fall of the Soviet Union, an entirely updated
and revised version of the case study was written, corroborating but also
refining the understanding of the key decisions (Allison & Zelikow, 1999).
But what if no such distinctive circumstances are available for you to study? Or
what if you deliberately want to do a case study about a common and even every-
day phenomenon? In these situations, you need to define some compelling theo-
retical framework for selecting your case. The more compelling the framework, the
more your case study can contribute to the research literature, and in this sense, you
will have conducted a special case study.
A compelling framework could be based on some historical context or some
sociological insight. Around the context or insight, you would still need to amass
the relevant existing literature, to show how your compelling framework would fit
(or depart from) the literature, and how your case study would eventually extend
that literature. These ingredients would lay the groundwork for your case study
making a significant contribution to the literature (see Case Studies 3 and 4).
CASE STUDY 5: A PROCESS CASE STUDY
Rather, the case study's lasting value derives from its focus on the
decisions made by officials trying to put a federal initiative (the economic
development program) into place in a local community. The authors show
how the decisions were numerous, complex, and interdependent. They use
these decisions to define, operationally, a broader implementation process
that, until that time, had not been fully appreciated in the field of public
policy. Instead of being about the program or the city, the case study
therefore is about a process. The lessons learned have been helpful for
understanding other implementation experiences.
CASE STUDY 6: REPLICATION CASES
The replication logic is analogous to that used in multiple experiments (see Yin,
2003b, pp. 47-52). For example, on uncovering a significant finding from a single
experiment, the immediate research goal would be to replicate this finding by con-
ducting a second, third, and even more experiments. For two-case case studies,
you may have selected both cases at the outset of your case study, anticipating that
they will either produce similar findings (a literal replication) or produce contrast-
ing results, but for predictable reasons (a theoretical replication). With more cases,
the possibilities for more subtle and varied replications increase. Most important,
the replication logic differs completely from the sampling logic used in survey
research.
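One hypothetical way to make the replication logic explicit is to record, for each case, the pattern of results you predicted before data collection and the pattern you actually found: literal replications should match the original finding, while theoretical replications should diverge in the predicted direction. The Python sketch below only illustrates this bookkeeping, not the analysis itself; the cases, roles, and patterns are invented.

```python
# Hypothetical multiple-case design: each case has a predicted and an observed pattern.
cases = [
    {"case": "Site A", "design_role": "literal", "predicted": "improvement", "observed": "improvement"},
    {"case": "Site B", "design_role": "literal", "predicted": "improvement", "observed": "improvement"},
    {"case": "Site C", "design_role": "theoretical", "predicted": "no change", "observed": "no change"},
]

for c in cases:
    replicated = c["predicted"] == c["observed"]
    kind = "literal replication" if c["design_role"] == "literal" else "theoretical replication"
    print(f"{c['case']}: predicted {c['predicted']!r}, observed {c['observed']!r} "
          f"-> {'supports' if replicated else 'challenges'} the {kind}")
```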
CASE STUDIES 7 AND 8: TWO MULTIPLE-CASE STUDIES
Multiple-case studies provide more convincing data and also can permit the
investigation of broader topics than single-case studies.
Case Study 7 (Magaziner & Patinkin, 1989) was one of nine cases amassed
to describe various facets of a global but silent war, involving world
economic competition at all levels. These include the United States'
competition with low-wage countries, with developed countries, and in
relation to future technologies.
Case Study 8 (Derthick, 1972) uses seven cases to illuminate the
weakness of the federal government in addressing local affairs and attempting
to respond to local needs. The federal objective was to implement new
housing programs in seven different cities. The cross-case analysis, based on
the experiences in all seven cities, readily pointed to common reasons for
the problems that arose.
As the ability to expand the number of cases increases, you can start seeing
the advantages of doing multiple-case studies. As part of the same case study,
you might have two or three literal replications and two or three deliberately
contrasting cases. Alternatively, multiple cases covering different contextual con-
ditions might substantially expand the generalizability of your findings to a
broader array of contexts than can a single-case study. Overall, the evidence from
multiple-case studies should produce a more compelling and robust case study.
In principle, you will need more time and resources to conduct a multiple-
rather than single-case study. However, you should note that the classic, single-case
studies nevertheless consumed much time and effort. For instance, Case Study 3
involved a four-person research team living in the city under study for 18 months
just to carry out the data collection. Analysis and writing then took another couple
of years. Other classic single-case studies have involved extensive time commit-
ments made by single investigators. Doing a good single-case study should not
automatically lead to reduced time commitments on your part.
related questions, the more you will be on your way to thinking about the advan-
tages and disadvantages of doing a multiple-case study.
CASE STUDY 9: OBSERVATIONAL EVIDENCE AS PART OF A CASE STUDY
Part of a case study about the firms and working life in Silicon Valley called
for the case study investigators to observe the clean room operations
where silicon chips are made (Rogers & Larsen, 1984). The clean rooms are a
key part of the manufacturing process for producing semiconductor chips.
Among other features, employees wear "bunny suits" of lint-free cloth and
handle extremely small components in these rooms. The case study
observations showed how the employees adapted to the working conditions
in these clean rooms, adding that, at the time, most of the employees were
female while most of the supervisors were male.
Coroners' reports, with their dry and factually operational tone, may serve as
a good model for the desired narrative. Note that such a narrative, whose main function is to present observational evidence, is not the same as the interpre-
tive narrative that will appear elsewhere in the case study. That narrative dis-
cusses evidence and interpretation together, and the case still may be told in a
compelling manner. This latter narrative, in combination with the drier, opera-
tional narrative covering the observational evidence, parallels other types of
research where numeric tables (the evidentiary portion) are accompanied by the
investigator's interpretation of the findings (the interpretive portion). Again, the
main point is that many case studies confuse the two presentations, and yours
should not.
The separate presentation of narrative evidence can assume several forms. One,
the use of vignettes, is illustrated in this very chapter by the material in the boxes
about the individual case studies. Another, the word table, is a table arranged with rows and columns like any other table, but whose cells are filled with words (i.e., categorical or qualitative evidence) rather than the numbers found in
numeric tables.
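A word table can be maintained in any spreadsheet or generated from coded data with a few lines of code. The Python sketch below prints a small, hypothetical word table whose cells hold categorical evidence rather than numbers; the cases and columns are invented for illustration.

```python
# A small, hypothetical word table: rows are cases, columns are qualitative dimensions.
rows = [
    {"case": "School A", "leadership": "principal-driven", "parent role": "advisory only"},
    {"case": "School B", "leadership": "shared committee", "parent role": "co-decision makers"},
]

columns = ["case", "leadership", "parent role"]
# Column widths sized to the longest entry, so the table lines up when printed.
widths = {c: max(len(c), *(len(r[c]) for r in rows)) for c in columns}

header = " | ".join(c.ljust(widths[c]) for c in columns)
print(header)
print("-" * len(header))
for r in rows:
    print(" | ".join(r[c].ljust(widths[c]) for c in columns))
```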
Going beyond this traditional, narrative form of reporting observational data,
you can quantify observations by using a formal observational instrument and then
report the evidence in numeric form (e.g., tables showing the frequency of certain
observations). The instrument typically requires you to enumerate an observed
activity or to provide one or more numeric ratings about the activity (see Case
Study 10). Thus, observational evidence can be reported both as narrative and in
the form of numeric tables.
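If your observational instrument yields coded activities or ratings, turning the raw tallies into frequency tables is straightforward. The Python sketch below counts hypothetical classroom-observation codes overall and by session; the instrument and codes are invented for illustration.

```python
from collections import Counter

# Hypothetical observation log: (session, coded activity) pairs from a structured instrument.
observations = [
    ("session_1", "teacher_lecturing"),
    ("session_1", "small_group_work"),
    ("session_1", "small_group_work"),
    ("session_2", "teacher_lecturing"),
    ("session_2", "individual_seatwork"),
    ("session_2", "small_group_work"),
]

# Frequency of each coded activity across all observed sessions.
overall = Counter(code for _, code in observations)
print("Overall frequencies:", dict(overall))

# Frequency table by session (a simple numeric table of observational evidence).
for session in sorted({s for s, _ in observations}):
    per_session = Counter(code for s, code in observations if s == session)
    print(session, dict(per_session))
```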
CASE STUDY 10: QUANTIFYING OBSERVATIONAL EVIDENCE IN A CASE STUDY
An elementary school was the site for a case study of a new instructional
practice, or innovation (Gross, Bernstein, & Giacquinta, 1971). To judge how
well teachers were implementing the new practice, members of the research
team made classroom observations and quantified their observations.
Archival Records
In contrast to direct observations in the field, case studies also can rely on
archival data, information stored through existing channels, such as electronic
records, libraries, and old-fashioned (paper) files. Newspapers, television, and the
mass media are but one type of channel. Records maintained by public agencies,
such as public health or police records, serve as another. The resulting archival data
can be quantitative or qualitative (or both).
From a research perspective, the archival data can be subject to their own biases
or shortcomings. For instance, researchers have long known that police records of
reported crime do not reflect the actual amount of crime that might have occurred.
Similarly, school systems' reports of their enrollment, attendance, and dropout rates
may be subject to systematic under- or overcounting. Even the U.S. Census strug-
gles with the completeness of its population counts and the potential problems
posed because people residing in certain kinds of locales (rural and urban) may be
undercounted.
Likewise, the editorial leanings of different mass media are suspected to affect
their choice of stories to be covered (and not covered), questions to be asked (and not asked), and details to be reported (and not reported). All these editorial choices can
collectively produce a systematic bias in what would otherwise appear to be a full
and factual account of some important events.
Case studies relying heavily on archival data need to be sensitive to these possi-
ble biases and to take steps to counteract them. With mass media, a helpful proce-
dure is to select two different media that are believed, if not known, to have
opposing orientations. A more factually balanced picture may then emerge (see
Case Study 11). Finding and using additional sources bearing on the same topic
would help even more.
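Once articles from two outlets have been collected and coded for theme, one simple way to implement this balancing procedure is to tabulate their coverage side by side and look for themes that one source emphasizes and the other downplays. The Python sketch below uses invented theme codes and counts purely for illustration.

```python
from collections import Counter

# Hypothetical theme codes assigned to articles from two outlets with different orientations.
outlet_a = Counter({"police_conduct": 40, "property_damage": 120, "community_grievances": 15})
outlet_b = Counter({"police_conduct": 60, "property_damage": 25, "community_grievances": 45})

# Side-by-side tabulation of how often each theme appears in each outlet.
themes = sorted(set(outlet_a) | set(outlet_b))
print(f"{'theme':<22}{'outlet A':>10}{'outlet B':>10}")
for theme in themes:
    print(f"{theme:<22}{outlet_a[theme]:>10}{outlet_b[theme]:>10}")
```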
CASE STUDY 11: A CASE STUDY USING TWO ARCHIVAL SOURCES TO COVER THE SAME COMMUNITY EVENTS
One of the most inflammatory community events in the 1990s came
to be known as the Rodney King crisis. White police officers were
serendipitously videotaped in the act of beating an African American male,
but a year later they all were acquitted. The acquittal sparked a major civil
disturbance in which 58 people were killed, 2,000 injured, and 11,000
arrested.
A case study of this crisis deliberately drew from two different
newspapers: the major daily for the metropolitan area and the most
significant newspaper for the area's African American community (Jacobs,
1996). For the pertinent period surrounding the crisis, the first newspaper
produced 357 articles and the second (a weekly, not daily publication) 137
articles. The case study not only traces the course of events but also shows
how the two papers constructed different but overlapping understandings
of the crisis.
Open-Ended Interviews
A third common type of evidence for case studies comes from open-ended inter-
views. These interviews offer richer and more extensive material than data from
surveys and especially the closed-ended portions of survey instruments. On the
surface, the open-ended portions of surveys may resemble open-ended interviews,
but the latter are generally less structured and even may assume a conversational
manner.
The diminished structure permits open-ended interviews, if properly done,
to reveal how case study interviewees construct reality and think about situa-
tions, not just giving answers to specific questions. For some case studies, the
construction of reality provides important insights into the case. The insights
gain even further value if the interviewees are key persons in the organizations,
communities, or small groups being studied, not just the average member of
such groups. For a case study of a public agency or private firm, for instance, a
key person would be the head of the agency or firm. For schools, the principal
or a department head would carry the same status. Because by definition such
roles are not frequently found within an organization, the open-ended inter-
views also have been called elite interviews. A further requirement is that case
study investigators need to be able to gain access to these elites. Such access is
not always available, and its absence may hamper the conduct of the case study in
the first place (see Case Study 12).
CASE STUDY 12: OPEN-ENDED INTERVIEWS AS A SOURCE OF CASE STUDY EVIDENCE
Integrating Evidence
The preceding paragraphs have covered three types of case study evidence.
Other chapters in this Handbook actually cover some of the other types, such as the
use of focus groups, surveys, and ethnographies. Together, you should now have a
good idea of the different kinds of evidence that you can use in case studies.
More important than reviewing the remaining types at this juncture is the need
to show how various sources of evidence might come together as part of the same
case study. Recall that the preferred integration would position the evidence from
each source in a way that converged with, or at least complemented, the evidence
from other sources.
Such integration readily takes place in many existing case studies. The presenta-
tion of a case study can integrate (a) information from interviews (e.g., quotations
or insights from the interviews appearing in the text, with citations pointing the
reader to the larger interview database) with (b) documentary evidence (e.g., quo-
tations or citations to specific written texts, accompanied by the necessary cita-
tions) and with (c) information drawn from direct observations. The resulting case
study tries to see whether the evidence from these sources presents a consistent pic-
ture. The procedure involves juxtaposing the different pieces of evidence, to see
whether they corroborate each other or provide complementary (or conflicting)
details. If the case study is well documented, all the evidence contains appropriate
footnotes and citations to data collection sources (e.g., the name and date of a
document that was used), and the case study also includes a full description of the
data collection methods, often appearing as an appendix to the case study.
Integrating and presenting the evidence in this manner can be a major challenge
(see Case Studies 13 and 14). Although the final case study still may be criticized for
having undesirable biases, the richness of the evidence should nevertheless shift any
debate into a more empirical mode; that is, critics need to produce contrary evi-
dence rather than simply make alternative arguments. The shift is highly desired,
because case studies should promote sound social science inquiry rather than raw
polemic argument.
CASE STUDIES 13 AND 14: TWO CASE STUDIES THAT BRING THE EVIDENCE TOGETHER
As an alternative strategy, you can bring the evidence together, from multiple
sources, on an even grander scale than just described. Understanding this grander
scale requires an appreciation of the concept of embedded units of analysis (see Yin,
2003b, pp. 42–45).
The concept applies when the data for a case study come from more than a single
layer. For instance, a case study about an organization will certainly include data
about an organizational layer (the organization's overall performance, policies, part-
nerships, etc.). However, depending on the research questions being studied, addi-
tional data may come from a second layer: the organization's employees. Data might
come from an employee survey, which, if used alone, might have served to support a
study of the employees. However, within the context of the case study of the organi-
zation, the employee layer would be an embedded unit of analysis, falling within the
main unit of analysis for the case study, which is the organization as a whole.
You can imagine many situations where case studies will have embedded units
of analysis: a neighborhood case study, where the services or the residents in the
neighborhood might represent embedded units of analysis; a case study of a public
or foundation program that consists of multiple, separately funded projects; a study
of a new technology, with an assessment of the technology's multiple applications
also being part of the case study; or a study of a health services marketplace, with
different health service providers and clients being the embedded units.
In all these examples, the embedded units are embedded within the larger, main
unit of the case study. The main unit is the single entity covered by the single-case study. The
embedded units are more numerous and can produce a large amount of quantitative
data. Nevertheless, the data are still part of the same single case. The most complex
case study design then arises when your case study may contain multiple cases (e.g.,
multiple organizations), each of which has an embedded unit of analysis.
In these situations, the multiple sources of evidence help cover the different
units of analysis: the main and embedded units. In the example of an organization
and its employees, the case study might be about the development of an organiza-
tional culture. At the main unit of analysis, only a single entity (the organization)
exists, and the relevant data could include the kind of observations, key interviews,
and document reviews previously highlighted in Case Studies 13 and 14. At the
embedded unit of analysis (a sample or universe of employees), the relevant data
would include an employee survey or some analysis of employee records. In con-
trast to Case Studies 13 and 14, which did not have an embedded unit of analysis,
Case Study 15 is an older but classic case study of a single organization (a labor
union), with multiple layers and, in fact, several levels of embedded units.
This case study is about a single trade union, the International Typo-
graphical Union, whose membership came from across the country (Lipset,
Trow, & Coleman, 1956). Because of its national coverage, the union, like
many other unions, was organized into a series of locals, each local
representing the members in a local area. Similarly, each local consisted of
a number of shops. Finally, each shop contained individual union members.
From top to bottom, the organization therefore had four layers. As a case
study, the case had one main unit (the union) and three embedded units. In
this sense, the case study was complex.
The research questions called for information at every level. The three
investigators, who ultimately became recognized as prominent scholars in
their fields, designed a variety of data collection activities, ranging from key
interviews with the top officials to observations of informal group behavior
among the locals and shops to a survey of the individual members. For each
of the three embedded levels, the investigators also had to define and
defend their sample selection. The study took 4 years to complete, in
addition to two earlier years when the senior author had begun preliminary
queries.
The absence of any cookbook for analyzing case study evidence has only
partially been offset by the development of prepackaged software to conduct
computer-assisted tallies of large amounts of narrative text. The software helps code
and categorize the words found in a text, as might have been collected from open-
ended interviews or extracted from documents. However, the coding can only attend
to the verbatim or surface language in the texts, potentially serving as a microlevel
starting point for doing case study analysis. Yet the case study of interest is likely to
be concerned with broader themes and events than represented by the surface lan-
guage of texts. To this extent, you still need to have a broader analytic strategy, even
if you have found the computer software to be a useful preliminary tool.
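As an illustration of that microlevel starting point, the sketch below (in Python, with invented transcript text, not output from any particular software package) produces the kind of verbatim word tally such software begins from; the thematic coding and broader analysis would still have to be supplied by the analyst.

```python
# A minimal sketch, assuming two invented interview excerpts: a surface-level
# word tally of narrative text, as a preliminary step before real coding.
import re
from collections import Counter

transcripts = [
    "The reform changed how teachers plan lessons together.",
    "Teachers said the reform gave them more planning time.",
]

stopwords = {"the", "how", "them", "said", "more", "gave"}
words = Counter(
    w for text in transcripts
    for w in re.findall(r"[a-z']+", text.lower())
    if w not in stopwords
)

print(words.most_common(5))  # counts of surface language, not yet case study analysis
```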
Discussed next are four examples of the broader analytic strategies (see also Yin,
2003b, pp. 116–133). The associated case study examples suggest that all the strate-
gies can use either qualitative or quantitative data, or both. This duality reinforces
the positioning of the case study method as a method not limited to either type of
data. An important correlate is that case study investigators, including yourself,
should be acquainted not only with collecting data from the variety of sources of
evidence discussed in the preceding section but also with the analytic techniques
now discussed in the present section.
variances from two groups, you could perform statistical tests of significance. For
instance, a study of math-science education reform might predict a pattern
whereby students test scores in math and science at different grade levels will
improve compared to some baseline period, but that their reading scores at differ-
ent grade levels will remain on the same trend lines compared to the same baseline
period. In this example, you could conduct all the needed matching (comparisons)
through statistical tests.
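A minimal sketch of this statistical form of pattern matching appears below; the test scores are simulated, and the predicted pattern (math and science scores improve over baseline while reading scores do not change) simply mirrors the hypothetical example in the text.

```python
# A minimal sketch, assuming simulated scores: pattern matching by testing
# whether the predicted pattern (math improves, reading stays flat) holds.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(7)
baseline_math = rng.normal(500, 40, size=120)
reform_math = rng.normal(515, 40, size=120)      # predicted: improvement
baseline_reading = rng.normal(480, 40, size=120)
reform_reading = rng.normal(481, 40, size=120)   # predicted: no change

for label, before, after in [
    ("math", baseline_math, reform_math),
    ("reading", baseline_reading, reform_reading),
]:
    t, p = ttest_ind(after, before)
    # The pattern "matches" if math shows a significant change and reading does not.
    matched = (p < 0.05) if label == "math" else (p >= 0.05)
    print(f"{label}: t = {t:.2f}, p = {p:.3f}, pattern matched: {matched}")
```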
More commonly, the variables of interest are likely to be categorical or nominal
variables. In this situation, you would have to judge the presence or absence of the
predicted pattern by setting your own criteria (ahead of time) for what might con-
stitute a match or a mismatch. For instance, a case study investigating the pre-
sumed economic impact of a military base closing argues that the closing was not
associated with the pattern of dire consequences that pundits commonly predicted
would occur as a result of such closings (see Case Study 16).
CASE STUDY 16: PATTERN MATCHING TO SHOW WHY A MILITARY BASE CLOSURE WAS NOT CATASTROPHIC
Many military bases in the United States have been the presumed economic
and residential driving forces of the local community. When such bases
close, the strong belief is that the community will suffer in some catastrophic
manner, leaving behind both economic and social disarray.
A case study of such a closure in California (Bradshaw, 1999) assembled
a broad array of data to suggest that such an outcome did not, in fact,
occur. The analytic strategy was to identify a series of sectors (e.g., retail
markets, housing sales, hospital and health services, civilian employment,
unemployment, and population turnover and stability) where catastrophic
outcomes might have been feared, and then to collect data about each
sector before and after the base closure. In every sector, and also in
comparison to other communities and statewide trends, a pattern-
matching procedure showed that the outcomes were much less severe than
anticipated. The case study also presented potential explanations for these
outcomes, thereby producing a compelling argument for its conclusions.
As but one example presented in Case Study 16, among the predicted conse-
quences was a rise in unemployment. The case study tracked the seasonal pat-
terns of employment and unemployment for several years before and after the
base closing and showed that, allowing for seasonal variations, overall employment
did not appear to decline at all, much less in any precipitous manner. The case
study especially called attention to the employment levels between January and
April 1997, well after the base closing. The levels at these later times exceeded
those of the January and April periods in the previous 5 years, when the base was
still in operation (see Figure 8.1).
[Figure 8.1. Employment in all industries and the unemployment rate, January 1992 through April 1997, marking the point at which Castle AFB closed.]
Important, too, was the breadth of possible consequences covered by the case
study. Thus, the full case study did not rely on the unemployment outcome alone
but showed that similar patterns existed in nearly every other important sector
related to the community's economy. In this same manner, you would want to show
that you had considered a broad array of relevant variables related to your research
questions and also had defined and tested a variety of rival conditions; the more
conditions, the better.
The explanation building in the first case study follows many situations in which
an explanation is built post-hoc, or after the fact. Such a label means that you try
retrospectively to explain an event whose outcome already is known. In this first
case study (see Case Study 17), the known outcome was that a Fortune 50 firm had
gone out of business. The case study tried to explain why this outcome might have
occurred. To do this, the case study posited the downside effects of several of the
firm's cultural tendencies. The case study then offered evidence in support of
these tendencies and explained how they collectively left the firm without a critical
survival motive.
CASE STUDY 17: EXPLANATION BUILDING: WHY A FORTUNE 50 FIRM WENT OUT OF BUSINESS
Business failure has been a common part of the American scene. Less
common is when a failure occurs with a firm that, having successfully grown
for 30 years, had risen to be the number two computer maker in the United
States and, across all industries, among the top 50 corporations in size.
A professor at MIT served as a consultant to the senior management of
the firm during nearly all its history. His case study (Schein, 2003) tries to
explain how and why the company had a "missing gene," critical to the
survival of the business.
As an important part of the explanation, the author argues that the
gene needed to be strong enough to overcome the firm's other cultural
tendencies, which included its inability to address layoffs that might have
pruned deadwood in a more timely manner; set priorities among competing
development projects (the firm developed three different PCs, not just
one); and give more prestige to marketing and business as opposed to
technological functions within the firm.
The case study cites much documentation and interviews but also
includes supplementary chapters permitting key former officials of the firm
to offer their own rival explanations.
The second case study took place in an entirely different setting. In New York
City, a long-time rise in crime from 1970 finally peaked in the early 1990s, starting
a new, declining trend from that time thereafter (see Figure 8.2). The case study (see
Case Study 18) attempts to explain how actions taken by the New York City Police
Department might have contributed to the turnaround. The case study builds a
twofold explanation. First, it devotes several chapters to the nature of the police
department's specific protective actions, showing how they could plausibly reduce
crime. Second, it presents time-series data and suggests that the timing of the
actions fit well the timing of the turnaround. In particular, the case study argues
that, although a declining trend already had started in 1991, an even sharper decline
in murder rates in 1994 coincided with the first full year of new police protection
practices (see Figure 8.2).
[Figure 8.2. Number of murders in New York City per year, 1988 through 1994.]
In New York City, following a parallel campaign to make the city's subways
safer, the city's police department took many actions to reduce crime in the
city more broadly. The actions included enforcing minor violations (order
restoration and maintenance), installing computer-based crime-control
techniques, and reorganizing the department to hold police officers
accountable for controlling crime.
Case Study 18 (Kelling & Coles, 1996) first describes all these actions in
sufficient detail to make their potential effect on crime reduction under-
standable and plausible. The case study then presents time series of the
annual rates of specific types of crime over a 7-year period. During this period,
crime initially rose for a couple of years and then declined for the remainder
of the period. The case study explains how the timing of the relevant actions
by the police department matches the changes in the crime trends. The
authors cite the plausibility of the actions' effects, combined with the timing
of the actions in relation to the changes in crime trends, as part of their
explanation for the reduction in crime rates in the New York City of that era.
Both of these examples show how to build explanations for a rather complex set
of events. Each case study is book length. Neither follows any routine formula or
procedure in the explanation-building process. However, the work in both case
studies suggests the following characteristics that might mark the explanations in
your own case study analyses:
CASE STUDY 19: A CHRONOLOGY SHOWING THE DELAYED START-UP OF A CONTROVERSIAL COMMUNITY PROGRAM
Chronologies offer the additional advantage that chronological data are usually
easy to obtain. One value of using documentary evidence is that the documents fre-
quently cite specific dates. But even in the absence of specific dates, having an esti-
mated month or even season of occurrence may be sufficient to serve your case
study's needs. If so, you need not depend solely on having relevant documentary
evidence. You also can ask your interviewees to estimate when something might
have happened. Such an inquiry does not require them to have been a chronicler.
Rather, you can ask whether something happened before or after a well-known
election, a holiday season, or some other benchmark such as the annual Super Bowl
in professional football. Citing such a benchmark usually can help most people
recall more readily the chronological occurrence of an event or even the chronol-
ogy of a sequence of events.
Chronological data are sufficiently valuable that collecting such information should
be a routine part of all the data collected for your case study. Tracking such chronolo-
gies requires you to take note of the dates that appear in documents and to ask inter-
viewees when something might have occurred, not just whether it had transpired. Even
if you had not identified the need for this information at the outset of your case study,
in later analyzing your data you may find that the chronologies lead to surprising
insights. Evidence about the timing of events also may help you reject some rival expla-
nations, because they may not fit the chronological facts that you have amassed.
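The sketch below illustrates this last point with invented events and dates: once even a rough chronology has been assembled, a rival explanation whose supposed cause turns out to postdate the outcome can be rejected on timing alone.

```python
# A minimal sketch, assuming invented events and dates: using a chronology
# to test whether a rival explanation's cause actually preceded the outcome.
from datetime import date

events = [
    ("program funding approved", date(2005, 3, 15)),
    ("community opposition organized", date(2005, 9, 1)),
    ("program start-up delayed", date(2005, 6, 30)),
]
events.sort(key=lambda e: e[1])  # build the chronology in time order

when = dict(events)
rival_cause = "community opposition organized"
outcome = "program start-up delayed"

if when[rival_cause] > when[outcome]:
    print(f"Rival rejected: '{rival_cause}' happened after '{outcome}'.")
```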
[Figure 8.3. A logic model linking technical assistance brokerage services and the conditions giving rise to the assistance to precursory outcomes (capacity for changes in practice), intermediate outcomes (changes in practice, changed firm capability, changed manufacturing performance, changed benefits to the firm), and later outcomes (changed business performance), with other directly contributing initiatives, other firms' characteristics, and new external market characteristics shown as additional conditions.]
The logic model framework has quantitative counterparts that take the form of
structural equation models (SEMs) and path analyses. For example, schools'
progress in implementing education reform was a major subject of a case study of
a reforming school system. Although the single system was the subject of a single-
case study (see Case Study 20), the size of the system meant that it contained hun-
dreds of schools. The school-level data then became the subject of a path analysis.
Figure 8.4 shows the results of the path analysis, enumerating all the original vari-
ables but then only showing arrows where the standardized regression coefficients
were statistically significant.
CASE STUDY 20: TESTING THE LOGIC OF A SCHOOL REFORM ACT
The case study includes qualitative data about the system as a whole and
about the individual schools in the system. At the same time, the study also
includes a major quantitative analysis that takes the form of structural
equation modeling. The resulting path analysis tests a complex logic model
whereby prereform restructuring is claimed to produce strong democracy, in
turn producing systemic restructuring, and finally producing innovative
instruction, all taking into account a context of basic school characteristics.
The analysis is made possible because the single case (the school system)
contains an embedded unit of analysis (individual schools), and the path
model is based on data from 269 of the elementary schools in the system. The
results of the path model do not pertain to any single school but represent a
commentary about the collective reform experience across all the schools; in
other words, the overall reform of the system (single case) as a whole.
In this example, the schools represented an embedded unit of analysis within the
overall single-case study, and the collective experiences of the schools provided
important commentary about the advances made by the system as a whole. Note
the similarity between the variables used in the path analysis and those that might
have been used in a logic model studying the same situation. Other investigators of
school reform have used the same path analysis method to test the logic of reform
in multiple school systems, not just single systems (see Borman & Associates, 2005).
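For readers who want to see the mechanics, the sketch below uses simulated school-level data and hypothetical variable names (not the data or model from Case Study 20). It estimates a simple recursive path model by regressing each standardized endogenous variable on its hypothesized causes and flags only the statistically significant paths, mirroring the convention of drawing arrows only where coefficients are significant.

```python
# A minimal sketch, assuming simulated data for 269 embedded units (schools):
# a recursive path model estimated with OLS on standardized variables.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 269  # number of elementary schools (embedded units)
df = pd.DataFrame({"prereform_restructuring": rng.normal(size=n)})
df["strong_democracy"] = 0.5 * df["prereform_restructuring"] + rng.normal(size=n)
df["systemic_restructuring"] = 0.6 * df["strong_democracy"] + rng.normal(size=n)
df["innovative_instruction"] = 0.4 * df["systemic_restructuring"] + rng.normal(size=n)

z = (df - df.mean()) / df.std(ddof=0)  # standardize so coefficients act as path weights

def paths(outcome, predictors):
    model = sm.OLS(z[outcome], sm.add_constant(z[predictors])).fit()
    return model.params[predictors], model.pvalues[predictors]

# Each endogenous variable is regressed on its hypothesized causes.
for outcome, preds in [
    ("strong_democracy", ["prereform_restructuring"]),
    ("systemic_restructuring", ["prereform_restructuring", "strong_democracy"]),
    ("innovative_instruction", ["strong_democracy", "systemic_restructuring"]),
]:
    coefs, pvals = paths(outcome, preds)
    for p in preds:
        flag = "*" if pvals[p] < 0.05 else " "  # draw an arrow only if significant
        print(f"{p} -> {outcome}: beta = {coefs[p]: .2f} {flag}")
```

A full structural equation model would estimate these paths simultaneously; the equation-by-equation version here is only meant to show how standardized coefficients and significance tests underlie a figure such as Figure 8.4.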
Summary
This chapter has suggested ways of dealing with four steps that have been the most
challenging in doing case study research. In the first step, investigators like yourself
commonly struggle with how to choose a significant, not mundane, case or cases
for their case studies.
In the second step, having multiple cases within your case study may require
greater effort. However, the benefit will be a more strongly designed case study,
where the cases may replicate or otherwise complement each others experiences.
In the third and fourth steps, creating a strong evidentiary base will provide
greater credibility for your case study, and methodically analyzing these data, using
qualitative or quantitative methods, will then lead to more defensible findings and
conclusions.
By covering these four steps, the chapter follows the spirit of handbooks that try
to provide concrete and operational advice to readers. The chapter's descriptions of
numerous, specific case studies add to the concreteness. If you can emulate some of
these case studies, or if you can successfully implement the four steps more gener-
ally, you may markedly improve your own case studies.
In contrast, the chapter has not attempted another conventional use of hand-
books: to provide a theoretical and historical perspective on the evolution of a
topic such as case study research. Such a perspective already has been provided else-
where by Jennifer Platt (1992), and readers interested in learning more about it
would be well-advised to consult her work.6
Exercises
Different exercises may be relevant, depending on whether a class is at the prelimi-
nary or advanced end of the spectrum of doing social science research.
Exercise 1. Finding and Analyzing an Existing Case Study: Have each student
retrieve an example of case study research from the literature.
Prelim. Class: The case study can be on any topic, but it must have used some
empirical method and presented some empirical data. Questions for discussion:
1. Why is this a case study?
2. What, if anything, is distinctive about the findings that could not be learned
by using some other social science method focusing on the same topic?
Advanced Class: The case study must have presented some numeric (quanti-
tative) as well as narrative (qualitative) data. Questions for discussion:
1. How were these data derived (e.g., from what kind of instrument, if any)
and were they presented clearly and fairly?
2. How were these data analyzed? What were the specific analytic procedures
or methods?
3. Are there any lessons regarding the potential usefulness of having both
qualitative and quantitative evidence within the same case study?
Exercise 2. Designing Case Study Data Collection: Have each student design a
case study on a topic with which he or she is familiar (my family, my school, my
friends, my neighborhood, etc.).
Prelim. Class: What are the case studys questions? Among the various sources
of evidence for the case study, will interviews, documents, observations, and
archival data all be relevant? If so, how?
Exercise 3. Testing for Case Study Skills: Have each student present the following
claims, either in the form of a classroom presentation or written assignment.
Prelim. Class: Why and with what distinctive skills, if any, does a student
believe that he or she is adequately equipped (or not equipped) to do a case
study? Where not well-equipped, what remedies does the student recom-
mend for himself or herself?
Advanced Class: Carry out the same exercise as that of the prelim class. In addi-
tion, however, ask two other students to prepare critiques of the first students
claims and permit the first student time for a brief response or rebuttal.
Notes
1. The chapter is based on and draws heavily from a case study anthology compiled by
the author (see Yin, 2004). See also Yin (2005) for an anthology of case studies devoted solely
to the field of education.
2. Aspiring case study investigators may, therefore, need to consult (and use) the earlier
chapter and the full textbook, as well as several other directly related works by the present
author: Yin (2003a) for in-depth applications of the case study method; Yin (2006a) for guid-
ance in doing case studies in the field of education; and Yin (2006b) if case studies are to be
part of a mixed methods research study. These other works can help investigators address
such questions as when and why to use the case study method in the first place, compared
to other methods.
3. These forms all fall within the domain of case study research. In turn, many special-
ists consider case study research to fall within a yet broader domain of qualitative research
(Creswell, 2007). However, the present approach to case study research resists any catego-
rization under the broader domain, because case study research, as discussed throughout the
present chapter, can include quantitative and not just qualitative methods.
4. The case study anthology (Yin, 2004) referenced in Footnote 1 contains lengthy
excerpts of all the case studies described in the boxes throughout this chapter.
5. Case study evaluations are not necessarily the same as doing your own case stud-
ies. Clients and sponsoring organizations (e.g., private foundations) usually prespec-
ify the research questions as well as the cases to be studied. In this sense, case study
evaluators may not need to decide how to define and select their case studies as cov-
ered in the text.
6. Platt traces the evolution of case study research, starting with the work of the Chicago
School (of sociology) in the 1920s. Despite this auspicious beginning, Platt explains why
case study research became moribund during the post-World War II period, a period so
barren that the term "case study" was literally absent from the methodological texts of the
1950s and 1960s. Platt then argues that the resurgence of case study research occurred in the
early 1980s, crediting the resurgence to a fresh understanding of the benefits that may accrue
when case study research is properly designed.
References
Allison, G. T. (1971). Essence of decision: Explaining the Cuban missile crisis. Boston: Little,
Brown.
Allison, G. T., & Zelikow, P. (1999). Essence of decision: Explaining the Cuban missile crisis
(2nd ed.). New York: Addison-Wesley.
Borman, K. M., & Associates (2005). Meaningful urban education reform: Confronting the
learning crisis in mathematics and science. Albany: State University of New York Press.
Bradshaw, T. K. (1999). Communities not fazed: Why military base closures may not be
catastrophic. Journal of the American Planning Association, 65, 193–206.
Bryk, A. S., Sebring, P. B., Kerbow, D., Rollow, S., & Easton, J. Q. (1998). Charting Chicago
school reform: Democratic localism as a lever for change. Boulder, CO: Westview Press.
Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design and analysis issues for
field settings. Chicago: Rand McNally.
Creswell, J. W. (2007). Qualitative inquiry & research design: Choosing among five approaches
(2nd ed.). Thousand Oaks, CA: Sage.
Derthick, M. (1972). New towns-in-town. Washington, DC: The Urban Institute.
Feagin, J. R., Orum, A. M., & Sjoberg, G. (1991). A case for the case study. Chapel Hill:
University of North Carolina Press.
George, A. L., & Bennett, A. (2004). Case studies and theory development in the social sciences.
Cambridge: MIT Press.
Gross, N. C., Bernstein, M., & Giacquinta, J. B. (1971). Implementing organizational innova-
tions: A sociological analysis of planned educational change. New York: Basic Books.
Hooks, G. (1990). The rise of the Pentagon and U.S. state building. American Journal of
Sociology, 96, 358–404.
Jacobs, R. N. (1996). Civil society and crisis: Culture, discourse, and the Rodney King beat-
ing. American Journal of Sociology, 101, 1238–1272.
Kelling, G. L., & Coles, C. M. (1996). Fixing broken windows: Restoring order and reducing
crime in our communities. New York: Free Press.
Lipset, S. M., Trow, M. A., & Coleman, J. S. (1956). Union democracy. New York: Free Press.
(Copyright renewed in 1984 by S. M. Lipset and J. S. Coleman)
Lynd, R. S., & Lynd, H. M. (1957). Middletown: A study of modern American culture. Orlando,
FL: Harcourt Brace. (Original work published 1929)
Magaziner, I., & Patinkin, M. (1989). Winning with microwaves. The silent war: Inside the
global business battles shaping America's future. New York: Random House.
McAdams, D. R. (2000). Fighting to save our urban schools . . . and winning! Lessons from
Houston. New York: Teachers College Press.
National Institute of Standards and Technology. (1999, April). MEP Successes (Case Study Series):
Transformed Firms Case Studies. Gaithersburg, MD: U.S. Department of Commerce.
National Institute of Standards and Technology. (2000, May). MEP Successes (Case Study
Series): More Transformed Firms Case Studies. Gaithersburg, MD: U.S. Department of
Commerce.
Nelkin, D. (1973). Methadone maintenance: A technological fix. New York: George Braziller.
Neustadt, R. E., & Fineberg, H. V. (1983). The epidemic that never was: Policy-making and the
swine flu scare. New York: Vintage Books.
Platt, J. (1992). Case study in American methodological thought. Current Sociology, 40, 17–48.
Pressman, J. L., & Wildavsky, A. (1973). Implementation: How great expectations in Washington
are dashed in Oakland (3rd ed.). Berkeley: University of California Press.
Rogers, E. M., & Larsen, J. (1984). Silicon Valley fever: Growth of high-technology culture. New
York: Basic Books.
Schein, E. (2003). DEC is dead, long live DEC: Lessons on innovation, technology, and the busi-
ness gene. San Francisco: Berrett-Koehler.
Shavelson, R., & Towne, L. (Eds.). (2002). Scientific research in education. Washington,
DC: National Academy Press.
Sutton, R. I., & Staw, B. M. (1995). What theory is not. Administrative Science Quarterly, 40,
371–384.
Thacher, D. (2006). The normative case study. American Journal of Sociology, 111, 1631–1676.
U.S. Government Accountability Office. (1990). Case study evaluations. Washington, DC:
Government Printing Office.
Warner, W. L., & Lunt, P. S. (1941). The social life of a modern community. New Haven, CT:
Yale University Press.
Wholey, J. (1979). Evaluation: Performance and promise. Washington, DC: The Urban
Institute.
Yin, R. K. (1998). The abbreviated version of case study research. In L. Bickman & D. Rog (Eds.),
Handbook of applied social research methods (1st ed., pp. 229–259). Thousand Oaks, CA: Sage.
Yin, R. K. (2000). Rival explanations as an alternative to "reforms as experiments." In
L. Bickman (Ed.), Validity & social experimentation: Donald Campbell's legacy
(pp. 239–266). Thousand Oaks, CA: Sage.
Yin, R. K. (2003a). Applications of case study research (2nd ed.). Thousand Oaks, CA: Sage.
Yin, R. K. (2003b). Case study research: Design and methods (3rd ed.). Thousand Oaks, CA:
Sage.
Yin, R. K. (Ed.). (2004). The case study anthology. Thousand Oaks, CA: Sage.
Yin, R. K. (Ed.). (2005). Introducing the world of education: A case study reader. Thousand
Oaks, CA: Sage.
Yin, R. K. (2006a). Case study methods. In J. L. Green, G. Camilli, & P. B. Elmore (Eds.),
Complementary methods in education research (pp. 111–122). Mahwah, NJ: Lawrence
Erlbaum. (Published for the American Educational Research Association)
Yin, R. K. (2006b). Mixed methods research: Parallel or truly integrated? Journal of Education
Research, 13, 41–47.
Zigler, E., & Muenchow, S. (1992). Head Start: The inside story of America's most successful
educational experiment. New York: Basic Books.
CHAPTER 9
Integrating Qualitative
and Quantitative
Approaches to Research
Abbas Tashakkori
Charles Teddlie
Onwuegbuzie, 2004; Rao & Woolcock, 2004; Teddlie & Tashakkori, 2006). The term
mixed methodology has been broadly used to denote the academic field or discipline
of studying and presenting the philosophical, theoretical, technical, and practical
issues and strategies for such integration (Teddlie & Tashakkori, in press). In the
following sections, we provide an overview of mixed methodology.
The sections that follow will first examine our guiding assumptions for the
chapter. We then introduce an overview of qualitative, quantitative, and integrated
approaches to sampling, data collection, data analysis, and inference. The chapter
will end with a discussion of issues in evaluating/auditing the inferences that are
made on the basis of the results.
in all stages of the study. Strands of a study might have research questions that are
qualitative or quantitative in approach. However, an overarching question, involv-
ing the integration of subquestions, must drive every mixed methods study.
Throughout the chapter, we make every effort to differentiate between pur-
pose (agenda or reason motivating you to conduct a study), question (the profes-
sional or theoretical issue troubling you that needs an answer or solution), data (the
information you need to answer your research question), data collection methods
(how you collect the information you need for answering your research question),
results (the outcome of summarizing and analyzing your collected data), inferences
(the credible conclusions you make on the basis of the results), and policy/practice
recommendations (credible suggestions you can make for policy and professional
practice on the basis of your inferences).
In one kind of mixed methods study, qualitative and quantitative entities are
in mixed company with each other, while in the other kind, they are actually
blended. In the first kind of mixed methods study, entities are associated with
or linked to each other but retain their essential characters; metaphorically,
apple juice and orange juice both are used, but they are never mixed together
to produce a new kind of fruit juice. (p. 326)
Complementarity: Mixed methods are used to gain complementary views about the same phenomenon or relationship. Research questions for the two strands of the mixed study address related aspects of the same phenomenon.
Completeness: Mixed methods designs are used to make sure a complete picture of the phenomenon is obtained. The full picture is more meaningful than each of the components.
Developmental: Questions of one strand emerge from the inferences of a previous one (sequential mixed methods), or one strand provides hypotheses to be tested in the next one.
Expansion: Mixed methods are used to expand or explain the understanding obtained in a previous strand of a study.
Corroboration/Confirmation: Mixed methods are used to assess the credibility of inferences obtained from one approach (strand). There usually are exploratory and explanatory/confirmatory questions.
Compensation: Mixed methods enable the researcher to compensate for the weaknesses of one approach by using the other. For example, errors in one type of data would be reduced by the other (Johnson & Turner, 2003).
Diversity: Mixed methods are used with the hope of obtaining divergent pictures of the same phenomenon. These divergent findings would ideally be compared and contrasted ("pitted against each other"; Greene & Caracelli, 2003).
SOURCES: This table is constructed on the basis of Greene, Caracelli, and Graham (1989), Patton (2002), Tashakkori and Teddlie (2003a), Creswell (2005), and Rossman and Wilson (1985).
The utilization quality of mixed methods also depends on the design of the mixed
methods study. For parallel mixed methods, the purpose of mixing must be known
from the start. For sequential mixed methods, the purpose might be known from the
start, or it might emerge from the inferences of the first strand. For example, unex-
pected or ambiguous results from a quantitative study might necessitate the collec-
tion and analysis of in-depth qualitative data in a new strand of the study.
Recently, we (Teddlie & Tashakkori, 2006) have categorized mixed designs into
five families: sequential, parallel, conversion, multilevel, and fully integrated. This
classification is based on three key dimensions: (1) number of strands in the
research design, (2) type of implementation process, and (3) stage of integration
(i.e., collecting and analyzing two types of data to answer predominantly qualita-
tive or quantitative questions vs. integration in all stages of research to answer
mixed questions). We do not use the other three criteria noted above in our typol-
ogy, which focuses on the methodological components of research designs.
The first dimension in our typology is the number of strands or phases in the
design. A strand of a research design is a phase of a study that includes three stages:
the conceptualization stage, the experiential stage (methodological/analytical), and
the inferential stage. A monostrand design employs only a single phase and it
encompasses all the stages from conceptualization through inference, while a mul-
tistrand design employs more than one phase, each encompassing all the stages
from conceptualization through inference.
The second dimension of our typology is the type of implementation process:
parallel, sequential, and conversion. Parallel and sequential designs have been
employed by numerous authors writing in the mixed methods tradition. In parallel
mixed designs, the strands of a study occur in a synchronous manner (even though
the data for one strand might be collected with some time lag), while in sequential
designs they occur in chronological order with one strand emerging from the other.
Conversion designs are a unique feature of mixed methods research and include the
transformation of one type of data to another, to be reanalyzed accordingly.
Conversion may be in the form of quantitizing1 (converting qualitative data into
numerical codes that can be reanalyzed statistically) or qualitizing (in which quanti-
tative data are transformed into data that can be reanalyzed qualitatively).
The third dimension of our typology is the stage of integration of the qualitative
and quantitative approaches. The most dynamic and innovative of the mixed
methods designs are mixed across stages. However, various scholars have identified
mixed studies in which two types of data are collected and analyzed to answer a pre-
dominantly qualitative or quantitative type of research question. We call these stud-
ies quasi-mixed designs, because there is no serious integration across the qualitative
and quantitative approaches.
Monostrand conversion designs (also known as the simple conversion design)
are used in single-strand studies in which research questions are answered through
an analysis of transformed data (i.e., quantitized or qualitized data). These studies
are mixed because they switch approach in the methods phase of the study, when
the data that were originally collected are converted into the other form. Monostrand
conversion designs may be planned before the study actually occurs, but many
applications of this design occur serendipitously as a study unfolds. For instance, a
researcher may determine that there are emerging patterns in the information
gleaned from narrative interview data that can be converted into numerical form and
then analyzed statistically, thereby allowing for a more thorough analysis of the data.
The monostrand conversion design has been used extensively in both the quan-
titative and qualitative traditions, without being recognized as mixed (see, e.g.,
Hunter & Brewer, 2003; Maxwell & Loomis, 2003; Waszak & Sines, 2003). An
explicit example of quantitizing data in the mixed methods research literature is
Sandelowski, Harris, and Holditch-Davis's (1991) transformation of interview data
into a frequency distribution that compared the numbers of couples having and
not having an amniocentesis with the number of physicians encouraging or not
encouraging them to have the procedure, which was then analyzed statistically to
determine the relationship between physician encouragement and couple decision
to have an amniocentesis (Sandelowski, 2003, p. 327).
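The sketch below illustrates the quantitizing logic with invented interview codes (not Sandelowski et al.'s data): qualitative codes assigned to each interview are tallied into a 2 x 2 frequency table and then analyzed statistically with a chi-square test of association.

```python
# A minimal sketch, assuming invented interview codes: quantitizing
# qualitative codes into a frequency table and testing the association.
from scipy.stats import chi2_contingency

# Each interview has already been coded qualitatively; the codes are hypothetical.
interviews = [
    {"encouraged": True, "had_amnio": True},
    {"encouraged": True, "had_amnio": True},
    {"encouraged": True, "had_amnio": False},
    {"encouraged": False, "had_amnio": False},
    {"encouraged": False, "had_amnio": True},
    {"encouraged": False, "had_amnio": False},
]

# Convert the coded narratives into a 2 x 2 frequency table (the quantitizing step).
table = [[0, 0], [0, 0]]
for case in interviews:
    row = 0 if case["encouraged"] else 1
    col = 0 if case["had_amnio"] else 1
    table[row][col] += 1

chi2, p, dof, expected = chi2_contingency(table)
print(table, round(chi2, 2), round(p, 3))
```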
Multistrand mixed methods designs are more complex, containing at least two
research strands. Mixing of the qualitative and quantitative approaches may occur
both within and across all stages of the study. Five types of these designs, which we
consider to be the most valuable, are parallel mixed designs, sequential mixed
designs, conversion mixed designs, multilevel mixed designs, and fully integrated mixed designs.
These five types of designs are families, since there may be several permutations of
members of these families based on other design criteria.
Parallel mixed designs are designs in which there are at least two interconnected
strands: one with qualitative questions and data collection and analysis techniques
and the other with quantitative questions and data collection and analysis tech-
niques. Data may be collected simultaneously or with some time lag (for this reason,
we prefer the term "parallel," as compared with "concurrent"). Analysis is performed
independently in each strand, although one might also influence the other.
Inferences made on the basis of the results from each strand are integrated to form
meta-inferences at the end of the study. Using parallel mixed designs enables the
researchers to answer exploratory (frequently, but not always, qualitative) and con-
firmatory (frequently, but not always, quantitative) questions.
Lopez and Tashakkori (2006) provide an example of a parallel mixed study of
the effects of two types of bilingual education programs on attitudes and academic
achievement of fifth-grade students. The quantitative strand of the study included
standardized achievement tests in various academic subjects, as well as measured
linguistic competence in English and Spanish. Also, a Likert-type scale was used to
measure self-perceptions and self-beliefs in relation to bilingualism. The qualitative
strand consisted of interviews with a random sample of 32 students in the two
programs. Each set of data was analyzed independently, and conclusions were
drawn. The findings of the two studies were integrated by (a) comparing and con-
trasting the conclusions and (b) by trying to construct a more comprehensive
understanding of how the two programs affected the children.
Sequential mixed designs are designs in which there are at least two strands that
occur chronologically (QUAN → QUAL or QUAL → QUAN). The conclusions that
are made on the basis of the results of the first strand lead to formulation of ques-
tions, data collection, and data analysis for the next strand. The final inferences are
based on the results of both strands of the study. The second strand of the study is
conducted either to confirm/disconfirm the inferences of the first strand or to pro-
vide further explanation for findings from the first strand. Although the second
strand of the study might emerge as a response to the unexpected and/or inexplic-
able results of the first strand, it is also possible to plan the two strands in advance.
An example of a sequential QUAL → QUAN mixed design comes from the con-
sumer marketing literature (Hausman, 2000). The first part of the study was
exploratory in nature using semistructured interviews to examine several questions
related to impulse buying. Interview results were then used to generate a series of
hypotheses. Trained interviewers conducted 60 interviews with consumers, and the
resultant data were analyzed using grounded theory techniques. Based on these
analyses, a series of five hypotheses were developed and tested using a 75-item ques-
tionnaire generated for the purposes of this study. A final sample of 272 consumers
completed the questionnaire. Hypothesis testing involved both correlational and
analysis of variance techniques.
The conversion mixed design is a multistrand parallel design in which mixing of
qualitative and quantitative approaches occurs in all components/stages, with data
transformed (qualitized or quantitized) and analyzed both qualitatively and quanti-
tatively (Teddlie & Tashakkori, 2006). In these designs, one type of data (e.g., quali-
tative) is gathered and is analyzed accordingly (qualitatively) and then transformed
and analyzed using the other methodological approach. The Witcher, Onwuegbuzie,
Collins, Filer, and Wiedmaier (2003) study is an example of such a design. In this
study, the researchers gathered qualitative data from 912 undergraduate/graduate
students regarding their perceptions of the characteristics of effective college teach-
ers. A qualitative thematic analysis revealed nine characteristics of effective college
teachers, including student centeredness and enthusiasm about teaching. A series of
binary codes (1, 0) were assigned to each student for each effective teaching charac-
teristic. These quantitized data were subjected to a series of analyses that enabled the
researchers to statistically associate each of the nine themes of effective college teach-
ing with four demographic variables (gender, race, undergraduate/graduate status,
preservice status). The researchers were able to connect students with certain demo-
graphic characteristics with preferences for certain effective teaching characteristics.
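A minimal sketch of this conversion step appears below, using invented data and only two of the nine themes: binary theme codes assigned to each respondent are cross-tabulated with a demographic variable and tested for association.

```python
# A minimal sketch, assuming invented theme codes and demographics:
# quantitized binary theme indicators cross-tabulated with gender.
import pandas as pd
from scipy.stats import chi2_contingency

data = pd.DataFrame({
    "gender": ["F", "M", "F", "F", "M", "M", "F", "M"],
    "student_centered": [1, 0, 1, 1, 0, 1, 1, 0],   # hypothetical theme codes
    "enthusiasm":       [1, 1, 0, 1, 1, 0, 1, 1],
})

for theme in ["student_centered", "enthusiasm"]:
    table = pd.crosstab(data["gender"], data[theme])
    chi2, p, dof, _ = chi2_contingency(table)
    print(f"{theme}: chi2 = {chi2:.2f}, p = {p:.3f}")
```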
In a multilevel mixed design, mixing occurs as QUAN and QUAL data from differ-
ent levels of analysis are analyzed and integrated to answer aspects of the same or related
questions. These designs are described in more detail in the sampling section below.
The fully integrated mixed design takes advantage of both a parallel and a sequen-
tial process in which mixing of qualitative and quantitative approaches occurs in an
interactive (i.e., dynamic, reciprocal, interdependent, iterative) manner at all stages of
the study. At each stage, information from one approach (e.g., qualitative) affects the
formulation of the other approach (e.g., quantitative) (Teddlie & Tashakkori, 2006).
It should be evident to the reader that in the multistrand designs, one approach/
strand might only be a small part of the overall study (what Creswell & Plano Clark,
2007, call "embedded designs"). For example, parallel with (or immediately follow-
ing) an extended qualitative study, limited quantitative survey data might be col-
lected and analyzed, to provide insights about a larger respondent group than the
qualitative study included. Despite the larger sample size, such a survey study does
not provide much more insight on the phenomenon than the original qualitative
study. However, it would provide information regarding the degree of transferabil-
ity of the results to the large group/population.
Simple random sampling occurs when each sampling unit in a clearly defined
population has an equal chance of being included in the sample.
Stratified sampling occurs when the researcher divides the population into
subgroups (or strata) such that each unit belongs to a single stratum and then
selects units from those strata.
Cluster sampling occurs when the sampling unit is not an individual but a
group (cluster) that occurs naturally in the population such as neighbor-
hoods or classrooms.
themes) or (2) the sample evolves of its own accord as data are being collected.
Gradual selection is the sequential selection of units or cases based on their rele-
vance to the research questions, not their representativeness (e.g., Flick, 1998).
Sampling techniques: Both probability and purposive, within and across strands.
Rationale for selecting cases/units: Simultaneous attention, across the strands, to representativeness and to the potential for finding answers to the research questions.
Sample size: Multiple samples within and across strands, with equal or different sample sizes.
Depth/breadth of information per case/unit: Focus on both depth and breadth of information, both within and across the strands.
When the sample is selected: Preplanned sampling design, while allowing for the emergence of other samples during the study.
is determined by the dictates of the research questions. There are four types of
mixed methods sampling: basic mixed sampling strategies, sequential mixed sampling,
parallel mixed sampling, and multilevel mixed sampling (Teddlie & Yu, 2007).
The basic mixed methods sampling strategies include stratified purposive sam-
pling and purposive random sampling. These strategies are also identified as pur-
posive sampling techniques (e.g., Patton, 2002), yet by definition they include a
component of probability sampling (stratified, random). We will not discuss these
techniques here since they are widely described elsewhere.
Sequential and parallel mixed methods sampling follow from the design types
described above. Sequential mixed methods sampling involves the selection of units of
analysis for a study through the sequential use of probability and purposive sampling
strategies (QUAN → QUAL) or vice versa (QUAL → QUAN). Parallel mixed methods
sampling involves the selection of units of analysis for a study through the parallel,
or simultaneous, use of both probability and purposive sampling strategies. One
type of sampling procedure does not set the stage for the other in parallel mixed
methods sampling studies; instead, both probability and purposive sampling proce-
dures are used simultaneously.
Multilevel mixed methods sampling is a general sampling strategy in which
probability and purposive sampling techniques are used at different levels (e.g.,
student, class, school, district) (Tashakkori & Teddlie, 2003b, p. 712). This sampling
strategy is common in contexts or settings in which different units of analysis are
nested within one another, such as schools, hospitals, and various bureaucracies
(Collins et al., 2007).
In sequential mixed methods sampling, the results from the first strand typically
inform the methods (e.g., sample, instrumentation) employed in the second strand.
In many QUAN → QUAL studies, the qualitative strand uses a subsample of the
quantitative sample. One example of this comes from the work of Hancock,
Calnan, and Manley (1999), in a study of perceptions and experiences of residents
concerning private/public dental service in the United Kingdom. In the quantita-
tive portion of the study, the researchers conducted a postal survey that involved
both cluster and random sampling: (1) the researchers selected 13 wards out of 365
in a county in southern England using cluster sampling, and (2) they randomly
selected one out of every 28 residents in those wards resulting in an accessible pop-
ulation of 2,747 individuals, from which they received 1,506 responses (55%).
The questionnaires included five items measuring satisfaction with dental care,
which they labeled the DentSat scores. The researchers next selected their sample for
the qualitative strand of the study using intensity and homogeneous sampling:
(1) 20 individuals were selected who had high DentSat scores (upper 10% of scores)
through intensity sampling; (2) 20 individuals were selected who had low DentSat
scores (lower 10% of scores) through intensity sampling; and (3) 10 individuals
were selected who had not received dental care in the past 5 years, but also did not
have full dentures, using homogeneous sampling. This type of sampling is often
used in mixed methods designs that involve extreme groups analysis. A good
example of this sampling and data analysis (called Group-Case Method or GCM)
may be found in Teddlie, Tashakkori, and Johnson (2008).
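The sketch below illustrates the intensity-sampling step with invented satisfaction scores (the variable name dentsat is only a stand-in for the DentSat measure): respondents in the top and bottom 10% of scores are identified, and 20 are drawn from each extreme for follow-up qualitative interviews.

```python
# A minimal sketch, assuming invented satisfaction scores for 1,506 respondents:
# intensity sampling from the upper and lower 10% of a survey measure.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
survey = pd.DataFrame({
    "respondent_id": range(1506),
    "dentsat": rng.normal(loc=50, scale=10, size=1506),  # hypothetical DentSat-like score
})

hi_cut = survey["dentsat"].quantile(0.90)
lo_cut = survey["dentsat"].quantile(0.10)

high_satisfaction = survey[survey["dentsat"] >= hi_cut].sample(n=20, random_state=1)
low_satisfaction = survey[survey["dentsat"] <= lo_cut].sample(n=20, random_state=1)

print(len(high_satisfaction), len(low_satisfaction))
```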
Parasnis, Samar, and Fischer's (2005) study provides an example of parallel
mixed methods sampling. Their study was conducted on a college campus where
there were a large number of deaf students (around 1,200). Selected students were
sent surveys that included closed-ended and open-ended items; therefore, data for
the quantitative and qualitative strands were gathered simultaneously. Data analy-
sis from each strand informed the analysis of the other.
The mixed methods sampling procedure included both purposive and proba-
bility sampling techniques. First, all the individuals in the sample were deaf college
students (homogeneous sampling). The research team had separate sampling pro-
cedures for selecting racial/ethnic minority deaf students and for selecting
Caucasian deaf students. There were a relatively large number of Caucasian deaf
students on campus, and a randomly selected number of them were sent surveys
through regular mail and e-mail. Since there were a much smaller number of
racial/ethnic minority deaf students, the purposive sampling technique known as
complete collection was used (Teddlie & Yu, 2007). In this technique, all members
of a population of interest are selected that meet some special criterion. Altogether,
the research team distributed 500 surveys and received a total of 189 responses,
32 of which were eliminated because they were foreign students. Of the remaining
157 respondents, 81 were from racial/ethnic minority groups (African Americans,
Asians, Hispanics), and 76 were Caucasians. The combination of purposive and
probability sampling techniques in this parallel mixed methods study yielded a
sample that allowed interesting comparisons between the two racial subgroups
on a variety of issues, such as their perception of the social psychological climate
on campus.
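The following sketch illustrates, with an invented sampling frame, how complete collection of a small subgroup can be combined with random sampling of a larger subgroup within a single parallel design; the group sizes and the minority flag are hypothetical rather than figures from Parasnis et al.

    import random

    random.seed(1)

    # Hypothetical frame of deaf students, flagged by racial/ethnic minority status
    frame = [{"id": i, "minority": (i % 6 == 0)} for i in range(1200)]

    minority_students = [s for s in frame if s["minority"]]
    caucasian_students = [s for s in frame if not s["minority"]]

    # Purposive "complete collection": every member of the small subgroup is selected
    minority_sample = minority_students

    # Probability sampling: a simple random sample of the larger subgroup
    caucasian_sample = random.sample(caucasian_students, k=300)

    surveyed = minority_sample + caucasian_sample
    print(len(minority_sample), len(caucasian_sample), len(surveyed))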
The resulting data generated by the open-ended and closed-ended items repre-
sented the experiences of the mothers in all their complexity and ambiguity
(Brannen, 2005, p. 180).
Focus group interviews are another source of data for mixed methods studies
(see Stewart, Shamdasani, & Rook, Chapter 18, this volume). While the focus group is primarily considered a group interviewing technique, observations of shifts of opinion among group members are also a major part of focus group data collection and analysis. Krueger and Casey (2000) defined a focus group study as "a carefully planned series of discussions designed to obtain perceptions on a defined area of interest in a permissive, non-threatening environment" (p. 5).
Most researchers writing about focus groups consider them to be a qualitative
technique, since (1) they are considered to be a combination of interviewing and
observation, both of which are presented as qualitative data collection techniques
in many texts and (2) focus group questions are (typically) open-ended, thereby
generating narrative data. However, focus group studies often yield mixed data.
This outcome from focus groups is more common than described in the traditional
focus group literature and is gaining popularity among researchers.
An example of a study employing focus groups to collect mixed methods data
was reported by Henwood and Pidgeon (2001) in the environmental psychology
literature. In this study, researchers conducted community focus groups in Wales
in which the topic of conversation was the importance, significance, and value of
trees to people. The focus group had a seven-step protocol, which involved open
discussions, exercises, and individual rankings of eight issues both for the partici-
pants individually and for the country of Wales. While the data were primarily
QUAL, the rankings provided interesting information on the importance that par-
ticipants placed on issues related to the value of trees in Wales, from wildlife habitat to commercial-economic.
Questionnaires also may yield both qualitative and quantitative data. When
questionnaires are used in a study, the researcher is employing a research strategy
in which participants self-report their attitudes, beliefs, and feelings toward some
topic. Questionnaire studies have traditionally involved paper-and-pencil methods
for data collection, but personal computers have led to the Internet becoming a
popular venue for data collection. The items in a questionnaire may be closed-
ended, open-ended, or both (also see Fowler & Cosenza, Chapter 12, this volume).
A good example of the use of questionnaires in mixed methods research comes
from the Parasnis et al. (2005) study of deaf students described earlier in the sam-
pling section of this chapter. Selected students were sent questionnaires that
included 32 closed-ended (5-point Likert-type scales) and three open-ended items.
The two types of data were gathered and analyzed simultaneously, and the analysis
of data from each strand informed the analysis of the other. The closed-ended items
addressed a variety of issues, including comparisons between the two campuses
where the information was gathered, the advantages of diversity, the institutional
commitment to diversity, the inclusion of diversity in the curriculum, and so forth.
The open-ended items asked the following questions:
Researchers then interview the same teachers whom they observed, asking ques-
tions about the topic of interest, which may evolve somewhat on the basis of the
quantitative results. For instance, if the average scores for the teachers at a school
were low on measures of classroom management, then researchers might ask open-
ended questions regarding the teachers' perceptions of orderliness in their class-
rooms, why the disorder was occurring, and what could be done to improve
classroom management. The combination of quantitative and qualitative data
resulting from this research strategy is very informative, especially for educators
wanting to improve classroom teaching practices.
Another mixed methods data collection strategy is to use focus groups together
with structured or unstructured interviews. The Nieto, Mendez, and Carrasquilla
(1999) study of attitudes and practices toward malaria control in Colombia is an
example of this combination:
The study included five focus groups that were formed to discuss a wide
range of issues related to generic health problems and malaria in particular.
The focus group results were subsequently employed by the investigators to
construct a questionnaire with closed-ended items.
Interviews were conducted to determine a baseline regarding the knowledge
and practices of the general population based on a probability sample of
1,380 households.
The findings from the qualitative and quantitative components were congruent, as
noted by Nieto et al. (1999): "The information obtained by the two methods was comparable on knowledge of symptoms, causes and ways of malaria transmission, and prevention practices like the use of bednets or provision of health services" (p. 608).
Using quantitative unobtrusive measures together with qualitative interviews is
another commonly occurring mixed methods combination, especially in the eval-
uation literature. In these studies, researchers mix quantitative information that
they have gathered from unobtrusive data sources (e.g., archival records, physical
trace data) together with qualitative interview data from participants. In sequential
studies, the qualitative interview questions may be aimed at trying to understand
the results from the quantitative data generated by the unobtrusive measures.
An example of this combination of strategies comes from Detlor (2003) writing
in the information systems literature. His research questions concerned how indi-
viduals working in organizations search and use information from Internet-based
information systems. There were two primary sources of information in this study:
Web tracking of participants' Internet use, followed by one-on-one interviews with the participants. Web tracking consisted of the use of history files and custom-developed software installed on participants' computers that ran transparently whenever a participant's web browser was used during a two-week monitoring period (Detlor, 2003, p. 123).
The tracking software recorded a large amount of unobtrusive data on the participants' Web actions, including the sites visited and the frequency of Web page visits made by the participants. Log tables indicating extended or frequent visits to particular Web sites were used to pinpoint significant episodes of information seeking.
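A minimal sketch of how such log tables might be summarized, assuming a hypothetical tracking log with one row per page visit; the thresholds used here for "extended" and "frequent" visits are arbitrary illustrations, not Detlor's actual criteria.

    import pandas as pd

    # Hypothetical tracking log: one row per page visit, with site and dwell time in seconds
    log = pd.DataFrame({
        "participant": ["p1", "p1", "p1", "p2", "p2", "p2", "p2"],
        "site": ["intranet", "news", "intranet", "vendor", "vendor", "vendor", "search"],
        "seconds": [420, 35, 610, 95, 260, 310, 20],
    })

    # Summarize visits per participant and site
    episodes = (log.groupby(["participant", "site"])
                   .agg(visits=("site", "size"), total_seconds=("seconds", "sum"))
                   .reset_index())

    # Flag candidate "significant episodes": extended (10+ minutes) or frequent (3+ visits) use
    significant = episodes[(episodes["total_seconds"] >= 600) | (episodes["visits"] >= 3)]
    print(significant)

Episodes flagged this way would then be probed in the follow-up interviews, which is where the qualitative strand does its interpretive work.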
Latent Content Analysis. The distinction between the manifest and latent content of
a document refers to the difference between the surface meaning of a text and the
underlying meaning of that narrative. For example, one could count the number of
violent acts (defined a priori) that occur during a television program and make
conclusions concerning the degree of manifest violence that was demonstrated in
the program. To truly understand the underlying latent content of the violence
within a specific program, however, the context (e.g., Manning & Cullum-Swan,
1994) within which the program occurred would have to be analyzed. In this case,
that context would be the narrative line or plot of the program. A television
program with several violent scenes, yet with an underlying theme of trust or con-
cern among the characters, might generate a latent content analysis very different
from its manifest content analysis.
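The manifest side of this distinction is essentially a counting operation, as the brief sketch below suggests (the programs, scene codes, and coding scheme are invented); the latent analysis would then interpret the resulting counts in light of each program's plot.

    # A priori coding scheme for manifest content: count coded violent acts per program
    coded_scenes = {
        "program_a": ["fight", "dialogue", "chase", "fight", "threat"],
        "program_b": ["dialogue", "dialogue", "threat"],
    }
    violent_codes = {"fight", "chase", "threat"}

    manifest_counts = {
        program: sum(code in violent_codes for code in scenes)
        for program, scenes in coded_scenes.items()
    }
    print(manifest_counts)  # e.g., {'program_a': 4, 'program_b': 1}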
Spradley (1979, p. 157) explicitly defined two of the major principles used in
qualitative data analysis: the similarity principle and the contrast principle. The sim-
ilarity principle states that the meaning of a symbol can be discovered by finding out
how it is similar to other symbols. The contrast principle states that the meaning of
a symbol can be discovered by finding out how it is different from other symbols.
Descriptive statistics are not sufficient for estimation and testing hypotheses.
Data analysis methods for testing hypotheses are based on estimations of how
much error is involved in obtaining a difference between groups, or a relationship
between variables. Inferential statistical analysis, involving significance tests, pro-
vides information regarding the possibility that the results happened just by
chance and random error versus their occurrence due to some fundamentally true
relationship that exists between variables. If the results (e.g., differences between
means) are statistically significant, then the researcher concludes that they did not
occur solely by chance. The basic assumption in such hypothesis testing is that any
apparent relationship between variables (or difference between groups) might, in
fact, be due to random fluctuations in measurement of the variables or in the indi-
viduals who are observed. Inferential statistics are methods of estimating the degree
of such chance variation. In addition, these methods of data analysis provide infor-
mation regarding the magnitude of the effect or the relationship.
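As an illustration of pairing a significance test with an estimate of magnitude, the following sketch runs an independent-samples t test on simulated scores and reports Cohen's d alongside the p value; the data and group labels are invented.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)

    # Simulated scores for two groups (e.g., two instructional conditions)
    group_a = rng.normal(loc=52, scale=10, size=80)
    group_b = rng.normal(loc=48, scale=10, size=80)

    # Significance test: could a difference this large plausibly arise from chance alone?
    t_stat, p_value = stats.ttest_ind(group_a, group_b)

    # Magnitude of the effect (Cohen's d), reported alongside the p value
    pooled_sd = np.sqrt((group_a.var(ddof=1) + group_b.var(ddof=1)) / 2)
    cohens_d = (group_a.mean() - group_b.mean()) / pooled_sd

    print(f"t = {t_stat:.2f}, p = {p_value:.3f}, d = {cohens_d:.2f}")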
(e.g., Kochan, Tashakkori, & Teddlie, 1996). These two types of schools were then
observed and compared with each other to explore possible differences between
them on other dimensions such as school climate.
7. Forming categories of attributes/themes through quantitative analysis, and
then confirming these categories with the qualitative analysis of other data, is sim-
ilar to the construct identification and construct validation procedures described pre-
viously. In this strategy, the objective is to first identify the components of a
construct (subconstructs) through factor analysis of quantitative data and then to
collect qualitative data to validate the categories, or to expand on the information
that is available regarding these subconstructs. An example of such a type of mixed
data analysis might involve the initial classification of dimensions of teachers' per-
ceptions of school climate through factor analysis of survey data completed by a
sample of faculties. Observational and/or other types of data (e.g., focus group
interviews) might then be used to confirm the existence of such dimensions and/or
to explore the degree to which these different dimensions are present in everyday
interactions.
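A minimal sketch of the quantitative first step in such a strategy, using simulated Likert-type responses (the items, the two underlying climate dimensions, and the sample size are all hypothetical): factor analysis suggests candidate subconstructs, which the qualitative strand would then seek to confirm or elaborate.

    import numpy as np
    from sklearn.decomposition import FactorAnalysis

    rng = np.random.default_rng(3)

    # Simulated responses from 200 teachers to 6 school-climate items
    orderliness = rng.normal(size=(200, 1))   # hypothetical "orderliness" dimension
    support = rng.normal(size=(200, 1))       # hypothetical "collegial support" dimension
    items = np.hstack([
        orderliness + rng.normal(scale=0.5, size=(200, 3)),  # items 1-3 load on orderliness
        support + rng.normal(scale=0.5, size=(200, 3)),      # items 4-6 load on support
    ])

    # Extract two factors; the (unrotated) loading pattern suggests candidate subconstructs
    fa = FactorAnalysis(n_components=2, random_state=0)
    fa.fit(items)
    print(np.round(fa.components_, 2))  # rows = factors, columns = item loadings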
Caracelli and Greene (1993) discuss another application of this type of analy-
sis. Unlike the above examples, in this application the objective is not to confirm
or expand the results of construct validation efforts. Instead, the objective is to
develop an initial framework for the qualitative/categorical analysis that follows
as the next step. For example, factor analytic results might be used as a starting
point for the constant comparative analysis defined earlier in this chapter. The cat-
egories of events/observations that are obtained through factor analysis might
then be used for coding the initial qualitative data in the subsequent constant
comparative analysis.
8. Using inherently mixed data analysis techniques. Inherently mixed data analy-
sis techniques are those that provide two types of outputs: qualitative and quanti-
tative. Social network analysis is an example of one such technique. In social
network analysis, the investigator obtains both graphic (qualitative) snapshots of
communication networks and numeric indicators of various aspects of communi-
cation patterns. Another example is the output from computerized data analysis
packages for qualitative research, such as Atlas-ti and others. These programs usu-
ally provide two types of results, one consisting of qualitative themes and the other,
numeric indicators that may be analyzed statistically.
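The following sketch, built on an invented communication network, shows how a single analysis can yield both kinds of output: numeric indicators (density, centrality) and a graphic snapshot for qualitative interpretation.

    import networkx as nx
    import matplotlib.pyplot as plt

    # Hypothetical communication ties among staff members
    edges = [("Ana", "Ben"), ("Ana", "Cruz"), ("Ben", "Cruz"),
             ("Cruz", "Dee"), ("Dee", "Eli"), ("Eli", "Cruz")]
    G = nx.Graph(edges)

    # Quantitative output: numeric indicators of the communication pattern
    print(nx.degree_centrality(G))
    print(nx.density(G))

    # Qualitative output: a graphic snapshot of the network
    nx.draw_networkx(G, with_labels=True)
    plt.savefig("communication_network.png")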
The term inference has been used to denote both a process and an outcome (see
Miller, 2003, for a full discussion). As a process, making inferences consists of a set
of steps that a researcher follows to create meaning out of a relatively large amount
of collected information. As an outcome, inference is a conclusion made on the
basis of obtained data. Such a conclusion may or may not be acceptable to other
scholars and is subject to evaluation by the community of scholars and/or con-
sumers of research. For example, an inference may be evaluated in terms of the
degree to which it is consistent with the theories and the state of knowledge. Or, on
the other hand, one might ask how good the conclusion is in terms of its relevance
and usefulness to policymakers.
Making inferences in mixed methods involves integrating (comparing, contrast-
ing, incorporating, etc.) the findings of the qualitative and quantitative strands of a
study. Such integration is not the same in parallel and in sequential or conversion
designs. In parallel mixed methods designs, two separate but related answers to the
research questions are obtained, one from each strand of the study. The investiga-
tor must make meta-inferences by integrating the two sets of inferences that are
gleaned from the two strands of the study. As we will discuss below, integration and
its adequacy is directly related to the goal of the study and the purpose of using a
mixed methods design.
In sequential and conversion designs, one strand emerges either as a response to
the inferences of the previous one or provides an opportunity to conduct the next
strand. For example, the conclusions gleaned from one strand might be controversial,
incomplete, or highly unexpected. This leads to the need to conduct a second strand,
in order to obtain more in-depth understanding of such findings. Alternatively, one
strand might provide an opportunity for the next one by providing a framework for
sampling (see examples of typology formation discussed above) or lead to procedures
for data collection (e.g., instrument development in one strand, to be used in data
collection for the next). Although there is a temporal sequence of making inferences,
and the two sets of inferences might seem independent, in a mixed methods design
(as compared with quasi-mixed designs), the inferences of each of the two (or more)
strands must be incorporated into a meta-inference.
relationships and events). Credibility is a qualitative term used for both reputational
and accuracy quality.
The terms and examples used in this section are associated with the quality of
data, while the next section concerns quality of design and inference.
Within-Design Consistency: Did the components of the design fit together in a seamless and cohesive manner? Inconsistencies might happen if the data collection procedures (e.g., interview, focus group questions) are not compatible with the sampling process (do not match respondents' level of education, or language ability, etc.).
Analytic Adequacy: Are the data analysis techniques appropriate and adequate
for answering the research questions?
In mixed methods studies, integration does not necessarily mean creating a single
understanding on the basis of the results. We are using the term integration as a mixed
methods term that denotes making meaningful conclusions on the basis of consistent
or inconsistent results. The term incorporates elaboration, complementarity, com-
pleteness, contrast, comparison and so forth. For mixed methods research, the con-
sistency between two sets of inferences derived from qualitative and quantitative
strands has been widely considered an indicator of quality. However, some schol-
ars have also cautioned against a simple interpretation of inconsistency (see Erzberger
& Prein, 1997; Perlesz & Lindsay, 2003). Obtaining two alternative or complementary
meanings is often considered one of the major advantages of mixed methods (see
Tashakkori & Teddlie, 2008).
Inconsistency might be a diagnostic tool for detecting possible problems in data
collection and analysis, or the inferences derived from the results of one strand or
the other. If refocusing does not reveal any problems in the two sets of inferences,
then the next step would be to evaluate the degree to which lack of consistency
might indicate that the two sets are revealing two different aspects of the same phe-
nomenon (complementarity). If no plausible explanation for the inconsistency is reached, the next step would be to explore the possibility that one set of inferences
provides the conditions for the applicability of the other (for detailed examples, see
Perlesz & Lindsay, 2003). If none of these steps provide a meaningful justification
for the apparent inconsistency, the inconsistency might be an indicator of the fact
that there are two plausible but different answers to the question (i.e., two different
but equally plausible realities exist).
use the term transferability to also include the concept of external validity from the
quantitative research literature. Transferability is relative in that any high-quality
inference is applicable to some condition, context, cultural group, organization, or
individuals other than the one studied.
The degree of transferability depends on the similarity between those studied
(sending conditions, contexts, entities, individuals) and the ones that the findings
are being transferred to (receiving conditions, contexts, groups, etc.). Determining
the degree of similarity is often beyond the scope of the investigator's knowledge
and resources. Although it is up to the consumer of research to assess such a degree
of similarity, it is necessary for the researcher to facilitate such a decision by pro-
viding full description of the study and its context, and to employ a research design
that maximizes transferability to other settings.
Although authors often regard sampling adequacy as the main determinant of the degree of transferability, it also depends heavily on design quality and interpretive rigor. Inadequate implementation of the design components or inade-
quate interpretation of the findings would limit the transferability of the inferences
(i.e., noncredible inferences do not hold in any context or group).
If a finding is not transferable to any other context, phenomenon, or group, it is
of little value to scholars and professionals other than the researcher. Therefore, you
are strongly encouraged to maximize the possible transferability of your findings by maximizing the representativeness of your (purposive or probability) sample (of people, observations, entities, etc.) and by providing rich descriptions of your study (procedures, data collection, etc.) and its context.
Summary
Mixed methods designs are used with increasing frequency across disciplines.
Among the reasons for such utilization, researchers and program evaluators
point to the necessity of using all possible approaches/methods (qualitative and
quantitative) for answering their questions. We presented a brief overview of
some of the issues in such utilization and also presented summaries of possible
ways for conducting integrated research. Obviously, the main starting point for
conducting such research is the purpose and research question, which in turn
shapes your ideas about the type of design you might need to reach your objec-
tives. The design you identify as the most appropriate for answering your
research questions (e.g., sequential, parallel, conversion, multilevel, and fully
integrated) would also shape your sampling and data collection procedures,
steps for data analysis, and ultimately your inferences and policy/practice rec-
ommendations/decisions. We believe that the most important part of any study
is when you make final inferences and make policy/practice recommendations
on the basis of your findings. Therefore, we introduced the concept of inference
quality and inference transferability as two categories of audits/assessments about
your overall research.
Discussion Questions
1. Briefly summarize three sampling procedures in integrated research.
2. What are the similarities and differences between a sequential and a parallel
mixed methods design? Provide an example for each.
3. Explain the reasons why Teddlie and Tashakkori (2006) have found it neces-
sary to distinguish between mixed methods and quasi-mixed-methods research
designs.
4. A concern among some researchers is that if mixed methods are used, they
might find inconsistency between the findings of the qualitative and quantitative
strands. Explain why mixed methods researchers consider inconsistency potentially
valuable for understanding the phenomenon under investigation.
5. Explain the reason(s) why the authors of this chapter do not consider classi-
fication of integrated research design on the basis of priority (of qualitative and quantitative approaches) useful.
6. Define/explain inference quality and inference transferability. Why have the
authors of this chapter proposed these terms?
Exercises
1. Mixed methods are appropriate for certain research questions but not others
(see, e.g., Creswell & Tashakkori, 2007). Generate four or five examples of research
questions for which a mixed methods design/approach would make sense. For each,
also write at least one question for each strand (qualitative/quantitative).
3. Think about the mixed methods questions that you generated above. What
mixed methods design is necessary/appropriate for answering each? Write a short
description for a possible study that can potentially answer each research question.
In your description, include brief sections for sampling design, data collection pro-
cedures, and possible data analysis steps.
5. Describe the steps you will take if you find variation (difference, inconsis-
tency) between the inferences drawn from qualitative and quantitative strands of a
mixed methods study.
Note
1. Quantitizing (e.g., Miles & Huberman, 1994) and qualitizing (e.g., Tashakkori &
Teddlie, 1998) are terms that are part of the mixed methodologists' lexicon. They are
employed by almost everyone working in the field (e.g., Sandelowski, 2003).
References
Babbie, E. (2003). The practice of social research (10th ed.). Belmont, CA: Wadsworth.
Brannen, J. (2005). Mixed methods: The entry of qualitative and quantitative approaches into the research process. International Journal of Social Research Methodology, 8(3), 173–184.
Brannen, J., & Moss, P. (1991). Managing mothers: Dual earner households after maternity leave. London: Unwin Hyman.
Brewer, J., & Hunter, A. (1989). Multimethod research: A synthesis of styles. Newbury Park, CA: Sage.
Brewer, J., & Hunter, A. (2006). Foundations of multimethod research: Synthesizing styles (2nd ed.). Thousand Oaks, CA: Sage.
Brophy, J. E., & Good, T. L. (1986). Teacher behavior and student achievement. In M. Wittrock (Ed.), Third handbook of research on teaching (pp. 328–375). New York: Macmillan.
Caracelli, V. J., & Greene, J. C. (1993). Data analysis strategies for mixed-method evaluation designs. Educational Evaluation and Policy Analysis, 15(2), 195–207.
Collins, K. M. T., Onwuegbuzie, A., & Jiao, Q. G. (2007). A mixed methods investigation of mixed methods sampling designs in social and health science research. Journal of Mixed Methods Research, 1(3), 267–294.
Creswell, J. W. (2003). Research design: Qualitative, quantitative, and mixed methods approaches. Thousand Oaks, CA: Sage.
Creswell, J. W. (2005). Educational research: Planning, conducting, and evaluating quantitative and qualitative research. Upper Saddle River, NJ: Merrill Prentice Hall.
Creswell, J. W., & Plano Clark, V. (2007). Designing and conducting mixed methods research. Thousand Oaks, CA: Sage.
Denzin, N. K. (1989). The research act: A theoretical introduction to sociological methods (3rd ed.). New York: McGraw-Hill.
Detlor, B. (2003). Internet-based information systems: An information studies perspective. Information Systems Journal, 13, 113–132.
Erzberger, C., & Prein, G. (1997). Triangulation: Validity and empirically based hypothesis construction. Quality & Quantity, 31(2), 141–154.
Flick, U. (1998). An introduction to qualitative research. Thousand Oaks, CA: Sage.
Gall, M. D., Gall, J. P., & Borg, W. R. (2006). Educational research: An introduction (8th ed.). Boston: Pearson Allyn & Bacon.
Glaser, B. G., & Strauss, A. L. (1967). The discovery of grounded theory: Strategies for qualitative research. Chicago: Aldine.
Greene, J. (2007). Mixing methods in social inquiry. San Francisco: Jossey-Bass.
Greene, J. C., & Caracelli, V. J. (1997). Defining and describing the paradigm issue in mixed-method evaluation. In J. C. Greene & V. J. Caracelli (Eds.), Advances in mixed-method evaluation: The challenges and benefits of integrating diverse paradigms (pp. 5–17). San Francisco: Jossey-Bass.
Greene, J. C., & Caracelli, V. J. (2003). Making paradigmatic sense of mixed-method practice. In A. Tashakkori & C. Teddlie (Eds.), Handbook of mixed methods in social and behavioral research (pp. 91–110). Thousand Oaks, CA: Sage.
Greene, J. C., Caracelli, V. J., & Graham, W. F. (1989). Toward a conceptual framework for mixed-method evaluation designs. Educational Evaluation and Policy Analysis, 11, 255–274.
Hancock, M., Calnan, M., & Manley, G. (1999). Private or NHS dental service care in the United Kingdom? A study of public perceptions and experiences. Journal of Public Health Medicine, 21(4), 415–420.
Hausman, A. (2000). A multi-method investigation of consumer motivations in impulse buying behavior. Journal of Consumer Marketing, 17(5), 403–419.
Henwood, K., & Pidgeon, N. (2001). Talk about woods and trees: Threat of urbanization, stability, and biodiversity. Journal of Environmental Psychology, 21, 125–147.
Hooper, M. L. (1994). The effects of high and low level cognitive and literacy language arts tasks on motivation and learning in multiability, multicultural classrooms. Developmental Studies: Learning-and-Instruction, 4(3), 233–251.
Huberman, A. M., & Miles, M. B. (1994). Data management and analysis methods. In N. K. Denzin & Y. S. Lincoln (Eds.), Handbook of qualitative research (pp. 428–444). Thousand Oaks, CA: Sage.
Hunter, A., & Brewer, J. (2003). Multimethod research in sociology. In A. Tashakkori & C. Teddlie (Eds.), Handbook of mixed methods in social and behavioral research (pp. 577–594). Thousand Oaks, CA: Sage.
Johnson, B., & Onwuegbuzie, A. (2004). Mixed methods research: A research paradigm whose time has come. Educational Researcher, 33(7), 14–26.
Johnson, B., & Turner, L. A. (2003). Data collection strategies in mixed methods research. In A. Tashakkori & C. Teddlie (Eds.), Handbook of mixed methods in social and behavioral research (pp. 297–319). Thousand Oaks, CA: Sage.
Kemper, E., Stringfield, S., & Teddlie, C. (2003). Mixed methods sampling strategies in social science research. In A. Tashakkori & C. Teddlie (Eds.), Handbook of mixed methods in social and behavioral research (pp. 273–296). Thousand Oaks, CA: Sage.
King, G., Keohane, R. O., & Verba, S. (1994). Designing social inquiry: Scientific inference in qualitative research. Princeton, NJ: Princeton University Press.
Kochan, S., Tashakkori, A., & Teddlie, C. (1996, April). You can't judge a high school by achievement alone: Preliminary findings from the construction of behavioral indicators of school effectiveness. Presented at the annual meeting of the American Educational Research Association, New York.
Krathwohl, D. R. (2004). Methods of educational and social science research: An integrated approach (2nd ed.). Long Grove, IL: Waveland Press.
Krueger, R. A., & Casey, M. A. (2000). Focus groups: A practical guide for applied research (3rd ed.). Thousand Oaks, CA: Sage.
Lee, R. M. (2000). Unobtrusive methods in social research. Buckingham, UK: Open University Press.
Lincoln, Y. S., & Guba, E. G. (1985). Naturalistic inquiry. Beverly Hills, CA: Sage.
Lincoln, Y. S., & Guba, E. G. (2000). Paradigmatic controversies, contradictions, and emerging confluences. In N. K. Denzin & Y. S. Lincoln (Eds.), Handbook of qualitative research (2nd ed., pp. 163–188). Thousand Oaks, CA: Sage.
Logan, J. (2006). The impact of Katrina: Race and class in storm-damaged neighborhoods. Providence, RI: Brown University. Retrieved February 18, 2006, from www.s4.brown.edu/katrina/report.pdf
Lopez, M., & Tashakkori, A. (2006). Differential outcomes of TWBE and TBE on ELLs at different entry levels. Bilingual Research Journal, 30(1), 81–103.
Manning, P. K., & Cullum-Swan, B. (1994). Narrative, content, and semiotic analysis. In N. K. Denzin & Y. S. Lincoln (Eds.), Handbook of qualitative research (pp. 463–477). Thousand Oaks, CA: Sage.
Maxwell, J. A., & Loomis, D. (2003). Mixed methods design: An alternative approach. In A. Tashakkori & C. Teddlie (Eds.), Handbook of mixed methods in social and behavioral research (pp. 241–272). Thousand Oaks, CA: Sage.
Miles, M., & Huberman, M. (1994). Qualitative data analysis: An expanded sourcebook (2nd ed.). Thousand Oaks, CA: Sage.
Miller, S. (2003). Impact of mixed methods and design on inference quality. In A. Tashakkori & C. Teddlie (Eds.), Handbook of mixed methods in social and behavioral research (pp. 423–456). Thousand Oaks, CA: Sage.
Morgan, D. (1998). Practical strategies for combining qualitative and quantitative methods: Applications to health research. Qualitative Health Research, 8(3), 362–376.
Morse, J. (1991). Approaches to qualitative-quantitative methodological triangulation. Nursing Research, 40(2), 120–123.
Morse, J. (2003). Principles of mixed methods and multimethod research design. In A. Tashakkori & C. Teddlie (Eds.), Handbook of mixed methods in social and behavioral research (pp. 189–208). Thousand Oaks, CA: Sage.
Newman, I., & Benz, C. R. (1998). Qualitative-quantitative research methodology: Exploring the interactive continuum. Carbondale: Southern Illinois University Press.
Newman, I., Ridenour, C., Newman, C., & DeMarco, G. M. P., Jr. (2003). A typology of research purposes and its relationship to mixed methods research. In A. Tashakkori & C. Teddlie (Eds.), Handbook of mixed methods in social and behavioral research (pp. 167–188). Thousand Oaks, CA: Sage.
Nieto, T., Mendez, F., & Carrasquilla, G. (1999). Knowledge, beliefs and practices relevant for malaria control in an endemic urban area of the Colombian Pacific. Social Science and Medicine, 49, 601–609.
Onwuegbuzie, A. J., & Teddlie, C. (2003). A framework for analyzing data in mixed methods research. In A. Tashakkori & C. Teddlie (Eds.), Handbook of mixed methods in social and behavioral research (pp. 351–384). Thousand Oaks, CA: Sage.
Parasnis, I., Samar, V. J., & Fischer, S. D. (2005). Deaf college students' attitudes toward racial/ethnic diversity, campus climate, and role models. American Annals of the Deaf, 150(1), 47–58.
Patton, M. Q. (2002). Qualitative research and evaluation methods (3rd ed.). Thousand Oaks, CA: Sage.
Perlesz, A., & Lindsay, J. (2003). Methodological triangulation in researching families: Making sense of dissonant data. International Journal of Social Research Methodology, 6(1), 25–40.
Puma, M., Karweit, N., Price, C., Ricciuti, A., Thompson, W., & Vaden-Kiernan, M. (1997). Prospects: Final report on student outcomes. Washington, DC: U.S. Department of Education, Planning and Evaluation Services.
Rao, V., & Woolcock, M. (2004). Integrating qualitative and quantitative approaches in program evaluation. In F. Bourguignon & L. Pereira da Silva (Eds.), The impact of economic policies on poverty and income distribution: Evaluation techniques and tools (pp. 165–190). Oxford, UK: Oxford University Press (for World Bank).
Regehr, C., Chau, S., Leslie, B., & Howe, P. (2001). An exploration of supervisors' and managers' responses to child welfare reform. Administration in Social Work, 26(3), 17–36.
Rossman, G., & Wilson, B. (1985). Numbers and words: Combining quantitative and qualitative methods in a single large scale evaluation study. Evaluation Review, 9, 627–643.
Sandelowski, M. (2003). Tables or tableaux? The challenges of writing and reading mixed methods studies. In A. Tashakkori & C. Teddlie (Eds.), Handbook of mixed methods in social and behavioral research (pp. 321–350). Thousand Oaks, CA: Sage.
Sandelowski, M., Harris, B. G., & Holditch-Davis, D. (1991). Amniocentesis in the context of infertility. Health Care for Women International, 12, 167–178.
Spradley, J. P. (1979). The ethnographic interview. New York: Holt, Rinehart & Winston.
Spradley, J. P. (1980). Participant observation. New York: Holt, Rinehart & Winston.
Strauss, A., & Corbin, J. (1998). Basics of qualitative research: Techniques and procedures for developing grounded theory (2nd ed.). Thousand Oaks, CA: Sage.
Tashakkori, A., & Creswell, J. (2007). Editorial: The new era of mixed methods. Journal of Mixed Methods Research, 1(1), 3–7.
Tashakkori, A., & Teddlie, C. (1998). Mixed methodology: Combining qualitative and quantitative approaches. Thousand Oaks, CA: Sage.
Tashakkori, A., & Teddlie, C. (Eds.). (2003a). Handbook of mixed methods in social and behavioral research. Thousand Oaks, CA: Sage.
Tashakkori, A., & Teddlie, C. (2003b). The past and future of mixed methods research: From data triangulation to mixed model designs. In A. Tashakkori & C. Teddlie (Eds.), Handbook of mixed methods in social and behavioral research (pp. 671–702). Thousand Oaks, CA: Sage.
Tashakkori, A., & Teddlie, C. (2008). Quality of inference in mixed methods research. In M. M. Bergman (Ed.), Advances in mixed methods research: Theories and applications (pp. 101–119). London: Sage.
Teddlie, C., & Meza, J. (1999). Using informal and formal measures to create classroom profiles. In J. Freiberg (Ed.), School climate: Measuring, improving and sustaining healthy learning environments (pp. 48–64). London: Falmer Press.
Teddlie, C., & Tashakkori, A. (2006). A general typology of research designs featuring mixed methods. Research in the Schools, 13(1), 12–28.
Teddlie, C., & Tashakkori, A. (in press). Foundations of mixed methods research: Integrating quantitative and qualitative techniques in the social and behavioral sciences. Thousand Oaks, CA: Sage.
Teddlie, C., Tashakkori, A., & Johnson, B. (2008). Emergent techniques in the gathering and analysis of mixed methods data. In S. Hesse-Biber & P. Leavy (Eds.), Handbook of emergent methods in social research (pp. 389–413). New York: Guilford Press.
Teddlie, C., Virgilio, I., & Oescher, J. (1990). Development and validation of the Virgilio Teacher Behavior Inventory. Educational and Psychological Measurement, 50, 421–430.
Teddlie, C., & Yu, F. (2007). Mixed methods sampling: A typology with examples. Journal of Mixed Methods Research, 1(1), 77–100.
Waszak, C., & Sines, M. (2003). Mixed methods in psychological research. In A. Tashakkori & C. Teddlie (Eds.), Handbook of mixed methods in social and behavioral research (pp. 557–576). Thousand Oaks, CA: Sage.
Webb, E. J., Campbell, D. T., Schwartz, R. D., & Sechrest, L. (1966). Unobtrusive measures: Nonreactive research in the social sciences. Chicago: Rand McNally.
Webb, E. J., Campbell, D. T., Schwartz, R. D., & Sechrest, L. (2000). Unobtrusive measures (Rev. ed.). Thousand Oaks, CA: Sage.
Witcher, A. E., Onwuegbuzie, A. J., Collins, K. M. T., Filer, J., & Wiedmaier, C. (2003, November). Students' perceptions of characteristics of effective college teachers. Paper presented at the annual meeting of the Mid-South Educational Research Association, Biloxi, MS.
CHAPTER 10
Organizational Diagnosis
Michael I. Harrison
What Is Organizational
Diagnosis and How Is It Used?1
Organizational diagnosis is the use of conceptual models and applied research
methods to assess an organization's current state and discover ways to solve prob-
lems, meet challenges, or enhance performance. When in-house or external con-
sultants, applied researchers, or managers engage in diagnosis, they draw on ideas
and techniques from a diverse range of disciplines within behavioral science and
related fields, including psychology, sociology, management, and organization
studies. Diagnosis helps decision makers and their advisers develop workable
proposals for organizational change and improvement. Without careful diagnosis,
decision makers may waste effort by failing to attack the root causes of problems
(Senge, 1994). Hence, diagnosis can contribute to managerial decision making, just
as it can provide a solid foundation for recommendations by organizational and
management consultants.
Here is an example of a diagnostic project that I conducted:
was justified, the head of training in the HMO reached an agreement with the
CHF director to ask an independent consultant to assess the situation. After dis-
cussions between the consultant, the head of training, and the top managers at
CHF, all parties agreed to broaden the study goals to include assessment of the fea-
sibility of the proposed transformation and the staff's readiness for the change.
As the CHF case suggests, diagnosis involves more than just gathering valid data.
A successful diagnostic study must provide its clients with data, analyses, and rec-
ommendations that are useful as well as valid. To meet these dual standards, the
diagnostic practitioner must fill the requirements of three key facets of diagnosis (process, modeling, and methods) and assure good alignments among all three. After
a brief introduction to types of diagnostic studies and a comparison to other forms
of applied research, this chapter introduces each of these three facets. Space limits
prevent exploration of the many delicate interactions among them. These can best
be learned by example (for instance, case studies and descriptions of actual consulting projects) and through mentored experience in conducting a diagnosis.
budgets, within shorter time frames, and rely on less extensive forms of data gath-
ering and analysis.
Despite these differences, many of the models used in diagnosis can contribute to
strategy assessments and program evaluations (Harrison & Shirom, 1999), and diag-
nostic practitioners can benefit from the extensive literature on evaluation techniques
and processes (e.g., Patton, 1999; Rossi et al., 1999; Wholey, Hatry, & Newcomer,
2004). Practitioners of diagnosis can also incorporate concepts and methods from
strategic assessments of intraorganizational factors shaping performance and strate-
gic advantage (Duncan, Ginter, & Swayne, 1998; Kaplan & Norton, 1996).
Process
Phases in Diagnosis
To provide genuinely useful findings and recommendations, consultants need to
create and maintain cooperative and constructive relations with clients. Moreover,
to ensure that diagnosis yields valid and useful results, practitioners of diagnosis
must successfully negotiate their relations with other members of the focal organi-
zation as their study moves through a set of analytically distinct phases (Nadler,
1977). These phases can overlap in practice, and their sequence may vary. As the fol-
lowing description shows, diagnostic tasks, models, and methods shift within and
between phases, as do relations between consultants, clients, and other members of
the client organization:
Entry: Clients and consultants explore expectations for the study; client presents
problems and challenges; consultant assesses likelihood of cooperation with var-
ious types of research and probable receptiveness to feedback; consultant makes
a preliminary reconnaissance of organizational problems and strengths.
Contracting: Consultants and clients negotiate and agree on the nature of the
diagnosis and client-consultant relations.
Study design: Methods, measurement procedures, sampling, analysis, and
administrative procedures are planned.
Data gathering: Data are gathered through interviews, observations, ques-
tionnaires, analysis of secondary data, group discussions, and workshops.
Analysis: Consultants analyze the data and summarize findings; consultants
(and sometimes clients) interpret them and prepare for feedback.
Feedback: Consultants present findings to clients and other members of the
client organization. Feedback may include explicit recommendations or more
general findings to stimulate discussion, decision making, and action planning.
occurred in the CHF case, they will often need to redefine their relations and objec-
tives during the course of the diagnosis to deal with issues that were neglected dur-
ing initial contracting or arose subsequently. To manage the consulting relation
successfully, practitioners need to handle the following key process issues (Nadler,
1977; Van de Ven & Ferry, 1980, pp. 2251) in ways that promote cooperation
between themselves and members of the client organization:
Purpose: What are the goals of the study, how are they defined, and how can
the outcomes of the study be evaluated? What issues, challenges, and prob-
lems are to be studied?
Design: How will members of the organization be affected by the study design
and methods (e.g., organizational features to be studied, units and individuals
included in data gathering, and types of data collection techniques)?
Support and cooperation: Who sponsors and supports the study and what
resources will the client organization contribute? What are the attitudes of other
members of the organization and of external stakeholders toward the study?
Participation: What role will members of the organization play in planning
the study, collecting data, interpreting them, and reacting to them?
Feedback: When, how, and in what format will feedback be given? Who will
receive feedback on the study, and what uses will they make of the data?
Modeling
The success of a diagnosis depends greatly on the ways that practitioners handle
the analytic tasks of deciding what to study, framing and defining diagnostic prob-
lems, choosing criteria for assessing organizational effectiveness, analyzing data to
identify conditions that promote or block effectiveness, organizing findings for
feedback, and providing feedback. Behavioral science models and broader-orienting
metaphors (Morgan, 1996) and frames (Bolman & Deal, 2003) can help practi-
tioners handle these tasks.
Many practitioners use models developed by experienced consultants and
applied researchers to guide their investigations (see Harrison, 2005, appendix B;
Harrison & Shirom, 1999). These models specify organizational features that have
proved critical in the past. Standardized models also help large consulting practices
maintain consistency across projects. Unfortunately, work with available models
runs the risks of generating a lot of hard-to-interpret data that fail to address chal-
lenges and problems that are critical to clients and do not reflect distinctive features
of the client organization. To avoid these drawbacks, consultants often tailor stan-
dardized models to fit the client organization and its circumstances.
(Harrison & Shirom, 1999), the practitioner uses one or more theoretical frames as
orienting devices and then develops a model that specifies the forces affecting the
problems or challenges presented by clients. This model also guides feedback.
Figure 10.1 shows the main steps in applying the sharp-image approach to
developing a diagnostic model. In the CHF case, the diagnosis drew on two theo-
retical frames. The first applied open systems concepts to the analysis of strategic
organizational change (Tichy, 1983). This frame guided analysis of the core chal-
lenge facing CHF: developing an appropriate strategy for revitalizing the organi-
zation and helping it cope with external challenges. Second, a political frame
(Harrison, 2005, pp. 95–104; Harrison & Shirom, 1999, chap. 5; Tichy, 1983) guided analysis of the ability of CHF's director to mobilize support for the proposed trans-
formation and overcome opposition among staff members. For the feedback stage,
elements from both frames were combined into a single model that directed atten-
tion to findings and issues of greatest importance for action planning.
As they examine diagnostic issues and data, practitioners often frame issues dif-
ferently than clients. For example, in the CHF case, the director of CHF originally
defined the problem as one of resistance to change, whereas the director of training
at the HMO phrased the original diagnostic problem in terms of assessing the need
for the proposed training program. The consultant reframed the study task by
dividing it in two: (1) assessing feasibility of accomplishing the proposed organiza-
tional transformation and (2) discovering steps that CHF management and the
HMO could take to facilitate the transformation. This redefinition of the diagnos-
tic task thus included an image of the organization's desired state that fit both
client expectations and social science knowledge about organizational effectiveness.
Moreover, this reformulation helped specify the issues that should be studied in
depth and suggested ways in which the clients could deal with the problem that ini-
tially concerned them. The consultant's recommendations took into account which
possible solutions to problems were more likely to be accepted and could be suc-
cessfully implemented by the clients.
As is the case in any research project, the research design, the measures, and the
findings in a diagnostic study will depend greatly on the choices made about each
of these five facets of effectiveness (for further discussion and illustrations see
Harrison, 2005; Harrison & Shirom, 1999). Let us now turn to examples of broad
and focused models which are useful in diagnosis.
level (e.g., total organization, divisions, departments, units, and work groups) affect
one another. In like manner, open systems studies examine exchanges between a
focal organization or unit and its organizational environments and interdependen-
cies among system subcomponents, including the focal organization's culture and
subcultures, inputs (resources), behavior and processes (both intended and emer-
gent), technologies, structures, and outputs.
There are many specifications of the open systems model that can contribute to
diagnosis (Harrison & Shirom, 1999). One useful approach examines fits among
system features. This approach is based on research showing that good fit among
system parts, levels, or subcomponents contributes to several dimensions of orga-
nizational effectiveness.2 Good fit (or alignment) occurs when elements within a
system reinforce one another, rather than disrupting one another's operations.
Organizational units, system components, or functions fit poorly if their activities
erode or cancel each other; or if exchanges between units or components harm
performance (e.g., by leading to avoidable losses of time, money, or energy).
Common signs of ineffectiveness, such as rapid turnover of personnel, high levels of conflict, low efficiency, and poor quality, are often symptoms of poor sys-
tem fit. The following case (adapted from Beckhard & Harris, 1975) illustrates how
poor fit between managerial processes (goal formation and leadership) and reward
systems (structures and processes) at the divisional level can harm motivation and
lead to unintended consequences:
Figure 10.2 provides a schematic summary of the steps required to diagnose fits.
When starting from presented problems and challenges, practitioners hunt for
related, underlying conditions, such as the reward contingencies in the Advance Inc. firm, that may be causing ineffectiveness. By reporting these underlying con-
ditions, the practitioner may help clients solve the original problems, reduce other
signs of ineffectiveness, and enhance overall organizational effectiveness.
For example, a practitioner who encounters complaints about tasks being
neglected or handled poorly can examine links between structure and two critical
processes: decision making and communication. Responsibility charting, a procedure used in many large organizations (JISC Infonet, n.d.), provides one way to
clarify these links. First, during interviews or workshops, the practitioner asks group
members to list key tasks or decision areas. In a project group, these might include
budgeting, scheduling, allocating personnel, and changing design specifications of a
product.
[Figure 10.2. Steps in diagnosing fits: design study, gather data (research design, methods, data collection); assess degree of fit (needs of units and system parts; conflicts, tensions; actual vs. official practices; organization design models; loose coupling); choose effectiveness criteria; assess impacts (positive, negative); continue diagnosis; summarize data, prepare feedback.]
Second, each member is asked to list the positions that will be involved in
these areas (e.g., project director, general manager, laboratory manager); indicate
who is assigned responsibility for performing tasks; and note who is supposed to
approve the work, be consulted, and be informed. The data usually reveal ambigu-
ities relating to one or more task areas. Consultants can use these data as feedback
to stimulate efforts to redefine responsibilities and clarify relations. Feedback can
also lead clients and consultants to evaluate fundamental organizational features,
such as delegation of authority, coordination mechanisms, and the division of
labor. For instance, discussion of approval procedures for work scheduling might
reveal that many minor scheduling changes are needed and that scheduling would
operate more smoothly if middle-level managers received authority to make such
minor changes and inform the project head afterward.
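A minimal sketch of how a responsibility chart might be tabulated and screened for ambiguities, assuming hypothetical tasks, positions, and codes (R = responsible, A = approves, C = consulted, I = informed); the flagged rows are the kind of material that feedback discussions would focus on.

    # Hypothetical responsibility chart: task -> {position: code}
    chart = {
        "budgeting":            {"project director": "R", "general manager": "A", "lab manager": "C"},
        "scheduling":           {"project director": "A", "general manager": "I", "lab manager": "R"},
        "allocating personnel": {"project director": "R", "general manager": "R", "lab manager": "I"},
        "design changes":       {"project director": "I", "general manager": "I", "lab manager": "I"},
    }

    # Flag ambiguities for feedback: tasks with no one responsible, or more than one
    for task, roles in chart.items():
        responsible = [position for position, code in roles.items() if code == "R"]
        if len(responsible) != 1:
            print(f"{task}: {len(responsible)} position(s) coded responsible -- clarify in feedback")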
members with the necessary skills and knowledge in advance of task activity and in
response to members' needs.
Second, group design and culture can facilitate or hinder group processes and
performance. The most critical task conditions for groups include defining clear
tasks, setting challenging objectives, assigning shared responsibility, and speci-
fying accountability for task performance. In addition, it is important that
groups be as small as possible, since larger groups encounter more coordination
problems. Compositional features that contribute to performance focus include
clear boundaries, inclusion of members possessing the needed skills and knowl-
edge, including interpersonal skills, and creation of a good mix of members in
terms of training and experience. This mix ensures cross-fertilization and cre-
ativity, while avoiding insurmountable divergences of opinion and working
styles. Finally, groups are more successful when they possess clear and strong
norms that regulate behavior and ensure coordinated action. It is also important
that these norms encourage members to act proactively and learn from their
experiences.
[Figure (the Action Model of group effectiveness): organizational context (goals, rewards, information, training) and group design and culture (tasks, composition, norms), together with outside help (coaching and consulting, help from other groups), shape critical group processes (effort, application of skills and knowledge, task performance strategies), which in turn shape group performance (outputs).]
The third set of facilitating conditions refers to access to outside help, such as
coaching and consulting received by members. Like team leaders, external coaches
and consultants can help members anticipate or resolve critical coordination prob-
lems and learn to collaborate effectively. Coaches can also help build commitment
to the group and its task. Leaders and coaches facilitate performance when they
help members decide how best to use participants' skills and knowledge, learn from
one another, and learn from other groups. Leaders or coaches also help groups
avoid performance strategies that are likely to fail and can help group members
think creatively about new ways to handle their tasks.
Fourth, groups need access to appropriate material and technical resources.
Without the needed equipment, funds, or raw material, group outputs will be infe-
rior, even if the group members perform well on all the process criteria. Serious
resource constraints and acute shortages can lead to frustration and even turnover
among potential high performers and can erode a group's long-term performance
capacity. Resource availability is particularly critical in groups that are undergoing
structural change or learning new techniques for handling their tasks. Managers
responsible for introducing change sometimes expect performance to improve
immediately without investing in the necessary processes of learning, training, and
experimentation that occur during change. By singling out material and technical
resources as critical variables that intervene between group processes and perfor-
mance, the Action Model reminds managers and consultants to pay attention to
seemingly mundane issues, as well as examining the subtler questions of the avail-
ability of needed human resources, knowledge, and information.
Drawing on the Action Model, diagnostic studies can examine whether current
conditions in each of these four areas lead to ineffective or effective performance.
For example, based substantially on Hackmans model, Denison, Hart, and Kahn
(1996) developed and validated a set of diagnostic questionnaire items for members
of cross-functional teams. These items ask respondents to report the degree to
which their team enjoys supportive facilitating conditions, handles team processes
effectively, and obtains desired outcomes. Hackman and his colleagues also devel-
oped an instrument that measures concepts in the Action Model, along with those
developed subsequently (Hackman, 2002). This instrument (see www.leadingteams.org/ToolsOnWeb/TDS-Guide.pdf) also assesses how well team members
work together and their levels of motivation and satisfaction.
Another way to use the Action Model in diagnosis is to follow the problem-oriented,
sharp-image logic shown in Figure 10.1. The diagnosis would start with troubling
performance problems and then trace these signs of ineffectiveness back to diffi-
culties in handling one or more critical group processes. Then these difficulties
could be followed back to the other elements in the model, such as group design
and organizational context, which can hinder or facilitate group processes. For
instance, a consultant or manager might trace problems of low quality in an indus-
trial work group back to a critical process such as pursuit of an inappropriate qual-
ity enhancement strategy. If the quality enhancement strategy is inappropriate, then
the solution lies in redesigning the group's task (a facilitating condition) so as to
include appropriate quality assurance techniques. Suppose, on the other hand, that
the group had chosen an appropriate strategy for quality enhancement, but team
members lacked the skills and knowledge needed to implement the strategy. In that
case, the solutions lie in changing other conditions, such as coaching for skill use
and development, training programs, or procedures for selecting team members.
Although the Action Model provides useful starting points for diagnosis, it may
not adequately reflect a group's distinctive challenges and conditions. The distinctive
challenge for air traffic controllers, for example, is reliability, whereas a repertoire the-
ater group faces problems of maintaining spontaneity and artistic vigor night after
night. In a similar fashion, groups and entire organizations face divergent challenges
at different periods in their life cycles (Harrison & Shirom, 1999, pp. 299324). Nor
does the Action Model pay much attention to important soft aspects of group inter-
action, such as mutual expectations and understandings. A further limitation is the
models heavy stress on measurable outputs, which could lead analysts and clients to
pay less attention than needed to other dimensions of effectiveness and ineffective-
ness. Finally, the Action Model builds in strong assumptions about the likely indica-
tors and causes of ineffectiveness and the best ways to intervene to enhance group
performance. Hence, the model may discourage users from attending directly to
client concerns and from identifying causes and possible solutions that reflect the
organizations distinctive features and the contingencies affecting it.
Methods
Besides assuring valid findings, diagnosis requires identifying readily changeable
factors affecting clients' problems. The data-gathering methods should help practi-
tioners uncover these actionable solutions. The methods should also contribute to
constructive relations between consultants and members of the client organization
and enhance the chances that members of the client organization will regard the
findings as valid and useful.
Choosing Methods
To provide valid results, practitioners should employ the most rigorous methods
possible within the practical constraints imposed by the assignment. Rigorous
methods, which need not be quantitative, follow accepted standards of scientific
inquiry (King, Keohane, & Verba, 1994). They have a high probability of producing
results that are valid and reliable (i.e., replicable by other trained investigators;
Trochim, 2001). Nonrigorous approaches can yield valid results, but these cannot
be externally evaluated or replicated. In assessing the validity of their diagnoses,
practitioners need to be aware of the risk of false-positive results that might lead
them to recommend steps that are unjustified, and even harmful, to the client orga-
nization (Rossi & Whyte, 1983).
To achieve replicability, practitioners can use structured data-gathering and
measurement techniques, such as fixed-choice questionnaires (Faletta & Combs,
2002; Harrison, 2005, chap. 3, appendix B) or structured observations (Harrison,
2005, appendix C; Weick, 1985). Unfortunately, it is very hard to structure tech-
niques for assessing many complex but important phenomena, such as the degree
to which managers accurately interpret environmental developments.
To produce valid and reliable results, investigators must often sort out conflict-
ing opinions and perspectives about the organization and construct an indepen-
dent assessment. The quest for an independent viewpoint and scientific rigor
should not, however, prevent investigators from treating the plurality of interests
and perspectives within a focal organization as a significant organizational feature
in its own right (Ramirez & Bartunek, 1989).
Whatever techniques practitioners use in diagnosis, it is best to avoid method-
ological overkill when only a rough estimate of the extent of a particular phenom-
enon is needed. In the Advance Inc. case, for example, the investigators needed to
determine whether division heads were frustrated and dissatisfied and needed to
find the sources of the managers' feelings. The practitioners did not need to specify
the precise degree of managerial dissatisfaction, as they might have done in an
academic research study.
Consultants need to consider the implications of their methods for the consult-
ing process and the analytic issues at hand, as well as weighing strictly practical and
methodological considerations. Thus, consultants might prefer to use less rigorous
methods, such as discussions of organizational conditions in workshop settings
(Biech, 2004; Harrison, 2005, chap. 5), because these methods can enhance the
commitment of participants to the diagnostic study and its findings. Or they might
prefer observations to interviews, so as not to encourage people to expect that the
consultation would address the many concerns raised during interviews.
The methods chosen and the ways in which data are presented to clients also
need to fit the culture of the client organization. In a high-technology firm, for
example, people may regard qualitative research as impressionistic and unscientific.
On the other hand, volunteers at a hospice might view standardized questionnaires
and quantitative analysis as insensitive to their feelings and experiences.
Research Design
Three types of nonexperimental designs seem most appropriate for diagnosis.
The first involves gathering data on important criteria that allow for comparisons
between units, between entire organizations, or over a period of time (Glanz &
Dailey, 1992; Harrison & Shirom, 1999, pp. 217–221). Comparisons may focus on
criteria such as client satisfaction, organizational climate (e.g., perceptions of peer
and subordinate-supervisor relations, identification with unit and organizational
goals), personnel turnover, costs, and sales. Sometimes, practitioners can analyze
available records or make repeated measurements to trace changes in key variables
across time for each unit or for an entire set of related units.
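As a minimal illustration of this first design, the sketch below tabulates one criterion, annual turnover, by unit and year and computes the year-to-year change. The data, unit labels, and use of Python's pandas library are illustrative assumptions, not material drawn from the chapter.

```python
# Hypothetical sketch: tracking one diagnostic criterion (annual turnover)
# across units and years, then computing the change for each unit.
import pandas as pd

records = pd.DataFrame({
    "unit":     ["A", "A", "B", "B", "C", "C"],
    "year":     [2007, 2008, 2007, 2008, 2007, 2008],
    "turnover": [0.12, 0.18, 0.07, 0.06, 0.15, 0.22],  # annual turnover rate per unit
})

# Lay the criterion out as units (rows) by years (columns) and compute change
trend = records.pivot(index="unit", columns="year", values="turnover")
trend["change"] = trend[2008] - trend[2007]
print(trend)
```

A table like this makes it easy to show, during feedback, which units are improving and which are deteriorating on the chosen criteria.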
The second design uses multivariate analysis of data to isolate the causes or pre-
dictors of variables linked to a particular organizational problem, such as work
quality or employee turnover, or to some desirable outcome, such as product inno-
vation or customer satisfaction. This design is less common in diagnosis than in
academic research because of practical constraints during diagnosis on extensive
and lengthy data collection and analysis. The third design uses qualitative field tech-
niques to construct a portrait of the operations of a small organization or subunit
(e.g., the executive team) and obtain in-depth data on subtle, hard-to-measure
features that may be lost or distorted in closed-ended inquiries. Among such fea-
tures are emergent practices, members' perceptions and assumptions, behind-the-
scenes interactions, and work styles. In such qualitative studies, investigators use
data-gathering techniques and inductive forms of inference such as those used in
nonapplied qualitative research (Denzin & Lincoln, 2000; Miles & Huberman,
1994; Yin, 2002). However, to assure quick feedback and reduce costs, diagnostic
studies usually seek less ethnographic detail than nonapplied qualitative research
and use less rigorous forms of recording and analyzing field data. These less rigor-
ous qualitative methods can yield helpful insights, but they are also more likely to
yield biased or superficial interpretations of complex phenomena.
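Returning to the second design described above, the following sketch shows one way a multivariate analysis might isolate predictors of an outcome such as intention to quit. The variables, the simulated data, and the use of ordinary least squares via the statsmodels library are illustrative assumptions rather than a procedure prescribed by the chapter.

```python
# Illustrative sketch of the second design: multivariate analysis to isolate
# predictors of an organizational problem. Data and variable names are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 120  # employees surveyed
survey = pd.DataFrame({
    "supervision":  rng.normal(3.5, 0.8, n),   # perceived supervisory support (1-5)
    "role_clarity": rng.normal(3.2, 0.9, n),   # clarity of job expectations (1-5)
    "workload":     rng.normal(3.8, 0.7, n),   # perceived workload (1-5)
})
# Simulated outcome: intention to quit, partly driven by the predictors
survey["quit_intent"] = (4.0 - 0.5 * survey["supervision"]
                         - 0.3 * survey["role_clarity"]
                         + 0.4 * survey["workload"]
                         + rng.normal(0, 0.5, n))

X = sm.add_constant(survey[["supervision", "role_clarity", "workload"]])
model = sm.OLS(survey["quit_intent"], X).fit()
print(model.summary())   # coefficients suggest which conditions predict quitting
```

In practice the predictors would come from the diagnostic instruments themselves, and the point of the analysis is to flag the conditions most strongly associated with the focal problem, not to establish causality.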
Data Collection
Table 10.1 surveys and assesses data collection techniques frequently used in
diagnosis.4 No single method for gathering and analyzing data can suit every diag-
nostic problem and situation, just as there is no universal model for guiding diag-
nostic analysis or one ideal procedure for managing the diagnostic process. By
using several methods to gather and analyze their data, practitioners can compen-
sate for many of the drawbacks associated with relying on a single method. They
also need to choose methods that fit the diagnostic problems and contribute to
cooperative, productive consulting relations. Let us consider two of the most pop-
ular data collection techniques in greater depth.
Structured Instruments
Self-administered questionnaires provide the least expensive way of eliciting
attitudes, perceptions, beliefs, and reports of behavior from many people.
Questionnaires can be administered in person, by mail, by telephone, or over the
Internet (Miller & Salkind, 2002; Stanton & Rogelberg, 2001). Aggregations of indi-
vidual responses can also provide a substitute for behavioral measures of group and
organizational phenomena. Although questionnaires typically use fixed-choice
answers, a few open-ended questions can be included to give respondents an
opportunity to express themselves. Responses to such open-ended questions are
often informative, but difficult to code. Questionnaires composed of items drawn
from previous research studies and standardized organizational surveys can be pre-
pared and administered rapidly, since there is less need to develop and pretest the
instrument. By including standard measures, consultants may also be able to com-
pare the responses obtained in the client organization with results from other orga-
nizations in which the same instrument was used.
Over the past few years, many standardized organizational survey instruments
have been developed that can be used in diagnostic studies (see Harrison, 2005,
appendix B). Focused instruments cover particular areas that are often of concern,
including team functioning (e.g., the instruments discussed above measuring aspects
of the Action Model for Group Task Performance), human resources practices, orga-
nizational climate and culture, leadership, and communication patterns. Broad
instruments include scales or entire subsections that cover these topics and others of
recurring interest. Classic examples include the well-documented Michigan Organizational Assessment Questionnaire (MOAQ; Cammann et al., 1983) and other broad instruments, such as the OAS and the OAI, discussed below.
Table 10.1   Data Collection Techniques Frequently Used in Diagnosis

Questionnaires (self-administered schedules, fixed choices)
  Advantages: Easy to quantify and summarize; quickest and cheapest way to gather new data rigorously; neutral and objective; useful for large samples, repeat measures, and comparisons among units or to norms; standardized instruments contain pretested items, reflect diagnostic models, and are good for studying attitudes.
  Disadvantages: Hard to obtain data on structure and behavior; little information on how contexts shape behavior; not suited for subtle or sensitive issues; impersonal; risks include nonresponse, biased or invalid answers, and overreliance on standard measures and models.

Interviews (open-ended questions based on a fixed schedule or interview guide)
  Advantages: Can cover many topics; modifiable before or during the interview; can convey empathy and build trust; rich data; allows understanding of respondents' viewpoints and perceptions.
  Disadvantages: Expensive, hard to administer to large samples; respondent bias and socially desirable responses; noncomparable responses; hard to analyze responses to open-ended questions; modification of interviews to fit respondents reduces rigor.

Observations (structured or open-ended observation of people and work settings)
  Advantages: Data are independent of people's self-presentation and biases; data on situational, contextual effects; rich data on hard-to-measure topics (e.g., emergent behavior, culture); data yield new insights and hypotheses.
  Disadvantages: Constraints on access to data; costly, time-consuming; observer bias, low reliability; may affect behavior of those observed; hard to analyze and report; less rigorous, may seem unscientific.

Available records and data (use of documents, reports, files, statistical records, unobtrusive measures)
  Advantages: Nonreactive; often quantifiable; repeated measures show change; organization's members can help analyze data; credibility of familiar measures (e.g., customer complaints, staff turnover); often cheaper and faster than gathering new data; independent sources; data on the total organization, environments, and industries.
  Disadvantages: Access, retrieval, and analysis problems can raise costs; validity and credibility of some sources and derived measures can be low; need to analyze data in context; limited information on many topics (e.g., emergent behavior).

Workshops and group discussions (discussions on group processes, culture, environment, challenges, and strategy, directed by a consultant or manager; simulations, exercises)
  Advantages: Useful data on complex, subtle processes; interaction stimulates creativity, teamwork, and planning; data available for immediate analysis and feedback; members share in diagnosis; self-diagnosis possible; consultant can build trust and empathy.
  Disadvantages: Biases due to group processes, history, and the leader's influence (e.g., boss stifles dissent); requires high levels of trust and cooperation in the group; impressionistic, nonrigorous; may yield superficial or biased results and unsubstantiated decisions.

SOURCE: From Diagnosing Organizations by M. Harrison, 2004, table 1.1. Reprinted with permission of SAGE.
Other instruments have been designed for specific types of organizations (e.g., Lester & Bishop, 2001; Scott, Mannion, Davies,
& Marshall, 2003).
To obtain data on group-level phenomena from questionnaires such as MOAQ,
OAS, and OAI, the responses from members of a particular work group or admin-
istrative unit are averaged to create group scores. For these averages to be meaning-
ful and useful in analysis and feedback, the questionnaires must specify clearly
which work groups and supervisors are referred to.
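A minimal sketch of this aggregation step appears below; the group names, items, and use of the pandas library are hypothetical. Reporting group size and spread alongside the means is an added suggestion, not a requirement stated in the chapter.

```python
# Minimal sketch of aggregating individual questionnaire responses into
# group-level scores. Column names and data are hypothetical.
import pandas as pd

responses = pd.DataFrame({
    "work_group":    ["Assembly", "Assembly", "Assembly", "Packing", "Packing"],
    "participation": [4, 3, 5, 2, 3],   # item: supervisor encourages participation (1-5)
    "cohesion":      [4, 4, 5, 3, 2],   # scale score for group cohesion (1-5)
})

# Group scores: mean of members' responses, plus group size and spread, so
# feedback can show how much members of each unit agree with one another.
group_scores = responses.groupby("work_group").agg(["mean", "std", "count"])
print(group_scores.round(2))
```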
Instruments such as MOAQ, OAS, and OAI contain ready-to-use scales that usu-
ally produce valid and reliable measures for many organizational settings. In keep-
ing with current research and organizational theory, these instruments reflect the
assumption that there is no one best way to organize groups or organizations.
Instead, the optimal combination of system traits is assumed to depend on many
variables, including environmental conditions, tasks, technology, personnel, history,
and size of the organization.
Despite their appeal, standardized diagnostic instruments also have weaknesses
and drawbacks. First, they may give practitioners a false sense of confidence that all
the factors relevant to a particular client organization have been covered adequately.
Second, standard questions are necessarily abstract; hence, they may not be fully
applicable to a particular organization or situation. For example, a typical ques-
tionnaire item in MOAQ asks respondents to indicate their degree of agreement
with the statement, "My supervisor encourages subordinates to participate in making
important decisions" (Cammann et al., 1983, p. 108). But the responses to this
general statement may mask the fact that the supervisor encourages participation
in decisions in one area, such as work scheduling, while making decisions alone in
other areas, such as budgeting. To obtain data on such situational variations, inves-
tigators must determine the situations across which there may be broad variations
and write questions about these situations (e.g., Moch, Cammann, & Cooke, 1983,
pp. 199–200).
Third, as in any questionnaire, even apparently simple questions may contain
concepts or phrases that may be understood in different ways. For instance, when
reacting to the statement, "I get to do a number of different things on my job"
(Cammann et al., 1983, p. 94), one person might see diversity in physical actions
(e.g., snipping vs. scraping) or in minor changes in the tools needed for the job,
whereas another would consider all those operations as doing the same thing.
Fourth, questionnaires are especially vulnerable to biases stemming from the
respondents' desire to give socially acceptable answers or to avoid sensitive issues.
There may also be tendencies to give artificially consistent responses (Salancik &
Pfeffer, 1977; but compare Stone, 1992). Some instruments include questions
designed to detect or minimize biases, whereas others may invite response bias by
phrasing all questions in a single direction.
In designing samples, practitioners consider the attitudes of group members
toward the study and the uses to which the data will be put, as well as strictly
methodological considerations. Standardized diagnostic instruments are often
administered to all members of a unit undergoing diagnosis so as to make the study
findings more relevant and believable to all people who will receive feedback.
Interviews can also be conducted with small units or organizations. Alternatively,
Semistructured Interviews
Semistructured interviews provide practitioners with opportunities to develop
rapport with members of the organization and learn about critical areas that
are not readily assessed through standardized questionnaires. These include organi-
zational processes, basic assumptions and beliefs, and critical organization-level
phenomena, such as management control processes, relations to clients, and busi-
ness strategies. In the exploratory stage of a diagnosis, practitioners often conduct
"orientation interviews" (Harrison, 2005, appendix A) with people who occupy
leadership positions and perform crucial functions within the focal unit or organi-
zation. These interviews provide data on how the focal unit is organized and operates,
as well as the respondents' views of the major challenges or problems facing it.
Topics often include background on the interviewee, the unit's main products and
services, controls and coordinating mechanisms, relations to other units (including
broader organizational units such as divisions or corporate headquarters), relations
to the external environment (markets, stakeholders, suppliers, and regulators),
management structures, processes, and culture.
In seeking information about groups, divisions, or entire organizations, investi-
gators need to pose questions that fit the positions and organizational level of
respondents. For example, department heads may provide basic information on
department regulations, history, and working relations with other departments;
their subordinates may have little knowledge in such areas. In contrast, subordi-
nates sometimes know better than their boss how work is actually done.
Interviews can also be structured around a particular topical area. When practi-
tioners lack detailed knowledge of operations in a particular area or want to allow
their interviews to be responsive to issues that arise during the interview, they may
construct an interview guide, rather than prepare detailed questions in advance.
The guide lists topics to be investigated and then allows the interviewer to frame
questions about each topic that reflect the distinctive circumstances of the client
organization; the guide also provides opportunities to take into account previous
answers. Interview guides thus ensure coverage of major topics while allowing flex-
ibility. But interview guides have lower reliability than standardized questionnaires,
because they allow for more variation between interviews and among interviewers.
Using interview guides also requires more interviewer skill than does the use of
more structured schedules.
Here is an illustration of the major headings for an interview guide that aims
to assess relations between an organization and its external environments (see
Harrison, 2005, chap. 5 on Environmental Relations Assessment):
Each major heading in the guide is broken down into subheadings to cover spe-
cific topics. For example, Items 4 and 5 could be specified as follows (with phrases
in parentheses serving as interviewer guidelines):
Naturally, when practitioners use an interview guide, they prepare for the possi-
bility that the answers will range across the topics listed in the guide. During the
interview, they record the responses in the order given. Afterward, they can reorga-
nize them according to the topics in the guide.
Interview and questionnaire studies are often subject to bias because respon-
dents seek to present themselves in a favorable light or withhold information, such
as negative descriptions of supervisors that might be used against them. By con-
ducting interviews with members from different backgrounds and locations within
a unit and by listening carefully to their accounts of important issues, investigators
can become aware of members' distinct perspectives and viewpoints. For example,
department heads might characterize their organization as dealing honestly and
directly with employee grievances, while subordinates complain that their griev-
ances are ignored or minimized by management. The people interviewed may be
unaware of such a diversity of viewpoints or intolerant of the feelings and percep-
tions of others. In such cases, consultants can summarize the various viewpoints
during feedback to stimulate communication and encourage people to respect
diverse perspectives and opinions. In other instances, consultants can simply take
note of divergent viewpoints and avoid giving undue weight to one particular inter-
pretation when formulating their own descriptions and analyses.
By building relations of trust with group members, consultants can sometimes
overcome people's reluctance to reveal sensitive information during interviews.
Practitioners may also gain the trust of one or more members of an organization
who know a lot about organizational affairs but are somewhat detached from
them.5 Assistants to high-level managers, for example, often have a broad view of
their organization and may be more comfortable providing such information than
are the managers. When such well-placed individuals trust consultants, they may
provide useful information about sensitive subjects, such as the degree of influence
of managers who officially have the same level of authority, or staff members' past
reactions to risk-taking behavior.
The processes of gathering and reporting diagnostic information can pose tricky
ethical and professional issues. These and other ethical issues facing diagnostic
practitioners and other types of consultants deserve advance consideration (see
American Psychological Association, 1992; Harrison, 2005, chap. 6).
Conclusion
Successful diagnosis requires practitioners to deal with three distinct challenges and
to strike a good balance in their tactics for handling each. The process challenge
requires constructive management of interactions with clients and other organiza-
tional stakeholders. The methodological challenge calls for using rigorous and valid
techniques for gathering, summarizing, and analyzing data within the constraints
imposed by the consulting assignment. The analytic challenge involves using
research-based models to identify sources of effectiveness and ineffectiveness,
discover routes toward organizational improvement, and frame feedback.
Despite their usefulness, the models, techniques, and methods reviewed here
and those presented in the literature on diagnosis, applied research, and consulta-
tion cannot serve as step-by-step guides to diagnosis. Nor can they be used like
equations into which bits of data are inserted to produce a completed assessment.
No such recipes for diagnosis or action planning exist, and none is likely to be dis-
covered. Instead, most models and methodological techniques work best as frames
and guides that help both experienced and novice practitioners sort out what is
going on within an organization. Because models and methods focus attention on
particular system levels or types of phenomena, they may distract attention from
other important organizational features. Only by combining frames and methods
can practitioners deal with the multifaceted nature of organizational problems and
challenges (Harrison & Shirom, 1999).
Anyone who undertakes a diagnosis, thus, faces many choices about which mod-
els and methods to use and how to manage the consulting process. In most cases,
each alternative has some advantages and some drawbacks. Emerging relations
between clients and practitioners and practical considerations, such as the accessi-
bility of data, shape choices among alternatives.
To engage in diagnosis is to undertake a difficult but exciting and rewarding
task: to use methods and models from the behavioral and organization sciences to
help people find out what is going on in their organization and why, while engaged
in a complex, changing web of relations; to find a way of serving clients who may
be ambivalent about receiving help and deal with people who may be dead set
against the project; to sort among project constraints and a tangle of compelling
obligations, values, and professional standards (see Harrison, 2005, chap. 6). Readers
who want to develop their ability to handle these challenges should seek firsthand,
supervised experiences in diagnosis and consulting processes, along with advanced
training in organizational analysis and research methods.
Exercises
1. Describe a planned change project with which you are familiar. Report how
the consultants and main clients dealt with the Critical Process Issues discussed on
page 321: the purpose of the diagnosis, its design, sources of support and cooper-
ation, participation, and feedback. Explain how the diagnosis and the change pro-
ject as a whole were affected by the consultant's handling of these Critical Process
Issues. If you are not familiar with an actual change project, propose one for an
organization you know well, explain how the consultant should address each of the
Critical Process Issues, and justify your choices.
2. Describe a team or work group that you know well. Explain how you could
gather diagnostic data about this team that would cover each of the factors high-
lighted in the discussion of the Action Model for Group Task Performance.
3. Use one of the standardized diagnostic instruments discussed in this chapter
or another standardized instrument (questionnaire) to survey at least seven
Notes
1. Portions of this chapter are drawn from Harrison (2005) and Harrison and Shirom
(1999). See those sources for more detailed discussions and further references on the methods,
techniques, and models reviewed here; those sources and the references cited in them provide
many additional tools and diagnostic approaches besides those presented in this chapter.
2. Some effective organizations develop structures and practices that appear to be poorly
aligned with one another. For example, managers in large organizations can use new infor-
mation technologies to closely oversee the practices and performance of subordinate units
or people, while also granting subordinates substantial decision authority and operating
autonomy. See Harrison (2005, p. 91) for further discussion of combinations of opposing
design principles.
3. This presentation of the model reflects both the work of Hackman and his colleagues
and a modification and critique in Harrison and Shirom (1999, pp. 166–173).
4. See Harrison (2005) for additional discussion and references on data collection
techniques.
5. In anthropological studies, such individuals are called "informants," a term that cannot
be used in diagnosis because of its negative connotations.
References
American Psychological Association. (1992). Ethical principles of psychologists and code of
conduct. American Psychologist, 47, 1597–1611.
Ashkanasy, N., Wilderom, C., & Peterson, M. (Eds.). (2000). Handbook of organizational
culture and climate. Thousand Oaks, CA: Sage.
Beckhard, R., & Harris, R. (1975). Strategies for large system change. Sloan Management
Review, 16, 43–55.
Beer, M., & Nohria, N. (2000). Resolving the tension between theories E and O of change.
In M. Beer & N. Nohria (Eds.), Breaking the code of change (pp. 1–34). Boston: Harvard
Business School Press.
Biech, E. (2004). The 2004 Pfeiffer annual: Consulting. San Francisco: Jossey-Bass.
Block, P. (2000). Flawless consulting: A guide to getting your expertise used (2nd ed.). San
Francisco: Jossey-Bass/Pfeiffer.
Bolman, L., & Deal, T. (2003). Reframing organizations: Artistry, choice, and leadership (3rd
ed.). New York: John Wiley.
Cammann, C., Fichman, M., Jenkins, G., & Klesh, J. (1983). Assessing the attitudes and
perceptions of members. In S. Seashore, E. Lawler III, P. Mirvis, & C. Cammann (Eds.),
Assessing organizational change (pp. 71–138). New York: John Wiley.
Cummings, T., & Worley, C. (2001). Organization development and change (7th ed.).
Cincinnati, OH: South-Western.
Denison, D., Hart, S., & Kahn, J. (1996). From chimneys to cross-functional teams:
Developing and validating a diagnostic model. Academy of Management Journal, 39,
1005–1023.
Denzin, N., & Lincoln, Y. (Eds.). (2000). Handbook of qualitative research (2nd ed.).
Thousand Oaks, CA: Sage.
Duncan, J., Ginter, P., & Swayne, L. (1998). Competitive advantage and international organi-
zational assessment. Academy of Management Executive, 12, 6–16.
Faletta, S., & Combs, W. (2002). Surveys as a tool for organization development. In
J. Waclawski & A. Church (Eds.), Organization development: A data-driven approach to
organizational change (pp. 78–102). San Francisco: Jossey-Bass.
Freeman, H., Dynes, R., Rossi, P., & Whyte, W. (Eds.). (1983). Applied sociology. San
Francisco: Jossey-Bass.
Glanz, E. F., & Dailey, L. K. (1992). Benchmarking. Human Resource Management, 31, 9–20.
Gormley, W., & Weimer, D. (1999). Organizational report cards. Cambridge, MA: Harvard
University Press.
Gresov, C. (1989). Exploring fit and misfit with multiple contingencies. Administrative
Science Quarterly, 34, 431–453.
Hackman, J. R. (1987). The design of work teams. In J. Lorsch (Ed.), Handbook of organiza-
tional behavior (pp. 315–342). Englewood Cliffs, NJ: Prentice Hall.
Hackman, J. R. (Ed.). (1991). Groups that work (and those that don't). San Francisco: Jossey-
Bass.
Hackman, J. R. (2002). Leading teams: Setting the stage for great performances. Boston:
Harvard Business School Press.
Harrison, M. (2004). Implementing change in health systems: Market reforms in the United
Kingdom, Sweden, and The Netherlands. London: Sage.
Harrison, M. (2005). Diagnosing organizations: Methods, models, and processes (3rd ed.).
Thousand Oaks, CA: Sage.
Harrison, M., & Shirom, A. (1999). Organizational diagnosis and assessment: Bridging theory
and practice. Thousand Oaks, CA: Sage.
Institute of Medicine. (2001). Crossing the quality chasm: A new health system for the 21st cen-
tury. Washington, DC: National Academy Press.
JISC Infonet. (n.d.). Responsibility charting. Retrieved April 7, 2008, from www.jiscinfonet.ac.uk/infokits/change-management/responsibility-charting
Kaplan, R. S., & Norton, D. (1996). The balanced scorecard: Translating strategy into action.
Boston: Harvard Business School Press.
King, G., Keohane, R., & Verba, S. (1994). Designing social inquiry: Scientific inference in
qualitative research. Princeton, NJ: Princeton University Press.
Kolb, D., & Frohman, A. (1970). An organization development approach to consulting. Sloan
Management Review, 12, 51–65.
Kraut, A. (1996). Organizational surveys: Tools for assessment and change. San Francisco:
Jossey-Bass.
Lester, P., & Bishop, K. (2001). Handbook of tests and measurement in education and the social
sciences (2nd ed.). Lancaster, PA: Technomic.
Lusthaus, C., Adrien, M. H., Anderson, G., Carden, F., Montvalvan, G., Lusthaus, C. A., et al.
(2002). Organizational assessment: A framework for improving performance. Washington,
DC: Inter-American Development Bank.
Majchrzak, A. (1984). Methods for policy research. Beverly Hills, CA: Sage.
Miles, M., & Huberman, A. (1994). Qualitative data analysis: An expanded sourcebook of
new methods (2nd ed.). Thousand Oaks, CA: Sage.
Miller, D., & Salkind, N. (Eds.). (2002). Handbook of research design and social measurement.
Thousand Oaks, CA: Sage.
Moch, M., Cammann, C., & Cooke, R. (1983). Organizational structure: Measuring the
degree of influence. In S. Seashore, E. Lawler, P. Mirvis, & C. Cammann (Eds.), Assessing
organizational change (pp. 177–202). New York: John Wiley.
Morgan, G. (1996). Images of organization (2nd ed.). Thousand Oaks, CA: Sage.
Muldrow, T., Schay, B., & Buckley, T. (2002). Creating high-performing organizations in the
public sector. Human Resource Management, 41(3), 341–354.
Nadler, D. (1977). Feedback and organization development: Using data-based methods.
Reading, MA: Addison-Wesley.
Nadler, D., & Tushman, M. (1980). A congruence model for diagnosing organizational
behavior. In E. Lawler, D. Nadler, & C. Cammann (Eds.), Organizational assessment
(pp. 261–278). New York: John Wiley.
Patton, M. (1999). Utilization-focused evaluation (3rd ed.). Thousand Oaks, CA: Sage.
Ramirez, I. L., & Bartunek, J. (1989). The multiple realities and experiences of organization
development consultation in health care. Journal of Organizational Change Manage-
ment, 2(1), 40–57.
Rossi, P., Lipsey, M., & Freeman, H. (1999). Evaluation: A systematic approach (6th ed.).
Thousand Oaks, CA: Sage.
Rossi, P., & Whyte, W. F. (1983). The applied side of sociology. In H. Freeman, R. Dynes,
P. Rossi, & W. F. Whyte (Eds.), Applied sociology (pp. 5–31). San Francisco: Jossey-Bass.
Rousseau, D. (1990). Assessing organizational culture: The case for multiple methods. In
B. Schneider (Ed.), Climate and culture (pp. 153–192). San Francisco: Jossey-Bass.
Salancik, G., & Pfeffer, J. (1977). An examination of need satisfaction models of job attitudes.
Administrative Science Quarterly, 22, 427–456.
Scott, T., Mannion, R., Davies, H., & Marshall, M. (2003). The quantitative measurement
of organizational culture in health care: A review of the available instruments. Health
Services Research, 38(3), 923–945.
Seashore, S., Lawler, E., Mirvis, P., & Cammann, C. (Eds.). (1983). Assessing organizational
change. New York: John Wiley.
Senge, P. (1994). The fifth discipline: The art and practice of the learning organization. New
York: Doubleday.
Stanton, J., & Rogelberg, S. (2001). Using internet/intranet web pages to collect organiza-
tional research data. Organizational Research Methods, 4, 200–217.
Stone, E. (1992). A critical analysis of social information processing models of job percep-
tions and job attitudes. In C. J. Cranny, P. Smith, & E. Stone (Eds.), Job satisfaction: How
people feel about their jobs and how it affects their performance (pp. 21–44). New York:
Lexington Books.
Tichy, N. (1983). Managing strategic change: Technical, political, and cultural dynamics. New
York: John Wiley.
Trochim, W. (2001). The research methods knowledge base (2nd ed.). Cincinnati, OH: Atomic
Dog.
Turner, A. (1982). Consulting is more than giving advice. Harvard Business Review, 60, 120–129.
Urgent Matters. (2006). Emergency department crowding. Retrieved September 5, 2006, from
www.urgentmatters.org/edCrowding
Van de Ven, A., & Chu, Y. (1989). A psychometric assessment of the Minnesota Innovation
Survey. In A. Van de Ven, H. L. Angle, & M. S. Poole (Eds.), Research on the management
of innovation (pp. 55–103). New York: Harper & Row.
Van de Ven, A., & Ferry, D. (1980). Measuring and assessing organizations. New York: John
Wiley.
Van de Ven, A., & Walker, G. (1984). The dynamics of inter-organizational coordination.
Administrative Science Quarterly, 29(4), 598–621.
Waclawski, J., & Church, A. (Eds.). (2002). Organization development: A data-driven approach
to organizational change. San Francisco: Jossey-Bass.
Weick, K. (1985). Systematic observation methods. In G. Lindzey & E. Aronson (Eds.),
Handbook of social psychology (3rd ed., Vol. 2, pp. 567–634). Reading, MA: Addison-
Wesley.
Wholey, J., Hatry, H., & Newcomer, K. E. (Eds.). (2004). Handbook of practical program eval-
uation. San Francisco: Jossey-Bass.
Yin, R. (2002). Case study research: Design and methods (3rd ed.). Thousand Oaks, CA: Sage.
CHAPTER 11
Research Synthesis
and Meta-Analysis
Harris M. Cooper
Erika A. Patall
James J. Lindsay
As the volume of primary research across all fields of social science contin-
ues to grow at rapid rates, research synthesis has become more important
today than at any other time in history. With the development of meta-
analysis, a set of procedures for summarizing the quantitative results from multiple
studies, the rigor, systematicity, and transparency of research syntheses were greatly
improved. In addition, a number of developments, including the creation of the
Cochrane Collaboration and Campbell Collaboration, have heightened the profile
of meta-analysis in recent years. Furthermore, recent advancements in analytic
strategies, including the use of a random effects model of error, the development
of meta-regression, and improved methods for dealing with missing data and
data censoring, have enhanced the popularity, efficiency, and trustworthiness of
meta-analyses.
We begin this chapter with a brief history of meta-analysis and research synthe-
sis. We then describe the different stages of a rigorous research synthesis. Next, we
outline a set of generally useful meta-analytic techniques and follow this with a dis-
cussion of some of the difficult decisions that research synthesists face in carrying
out a meta-analysis. We conclude by addressing some broader issues concerning
criteria for evaluating the quality of knowledge syntheses in general and meta-
analyses in particular.
A general theme of the chapter is that social scientists who are conducting
research syntheses need to think about what distinguishes a good synthesis from
a bad synthesis. This kind of effort is crucial for assessing the value of existing
research syntheses and for promoting high-quality research synthesis in the future.
Each of these research teams realized that for some topic areas, prodigious
amounts of empirical evidence had been amassed on why people act and feel the
way they do and on the effectiveness of psychological, social, educational, and med-
ical interventions. These researchers concluded that the traditional research syn-
thesis simply would not suffice. Largely independently, the three research teams
rediscovered and reinvented Pearson's and Fisher's solutions to their problem.
In discussing his solution, Glass (1976) coined the term meta-analysis to stand
for "the statistical analysis of a large collection of analysis results from individual
studies for purposes of integrating the findings" (p. 3). Shortly thereafter, other
proponents of meta-analysis demonstrated that traditional synthesis procedures
led to inaccurate or imprecise characterizations of the literature, even when the size
of the literature was relatively small (Cooper, 1979; Cooper & Rosenthal, 1980).
The first half of the 1980s witnessed the appearance of five books devoted
primarily to meta-analytic methods. The first, by Glass, McGaw, and Smith (1981),
presented meta-analysis as a new application of analysis of variance and multiple
regression procedures, with effect sizes treated as the dependent variable. In 1982,
Hunter, Schmidt, and Jackson introduced meta-analytic procedures that focused on
(a) comparing the observed variation in study outcomes to that expected by chance
and (b) correcting observed correlations and their variance for known sources of
bias (e.g., sampling errors, range restrictions, unreliability of measurements).
Rosenthal (1984) presented a compendium of meta-analytic methods covering,
among other topics, the combining of significance levels, effect size estimation, and
the analysis of variation in effect sizes. Rosenthal's procedures for testing modera-
tors of effect size estimates were not based on traditional inferential statistics, but
on a new set of techniques involving assumptions tailored specifically for the analy-
sis of study outcomes.
Another text that appeared in 1984 also helped elevate research synthesis to a
more rigorous level. Light and Pillemer (1984) focused on the use of research syn-
thesis to help decision making in the social policy domain. Their approach placed
special emphasis on the importance of meshing both numbers and narrative for the
effective interpretation and communication of synthesis results.
Finally, in 1985, with the publication of Statistical Methods for Meta-Analysis,
Hedges and Olkin helped to elevate the quantitative synthesis of research to an
independent specialty within the statistical sciences. This book, summarizing and
expanding nearly a decade of programmatic developments by the authors, not only
covered the widest array of meta-analytic procedures but also established their
legitimacy by presenting rigorous statistical proofs.
Meta-analysis did not go uncriticized. Some critics opposed quantitative synthe-
sis, using arguments similar to those used to oppose primary data analysis (Barber,
1978; Mansfield & Bussey, 1977). Others linked meta-analysis with more general
synthesis procedures that are inappropriate, but not necessarily related to the use of
statistics in synthesis. We address several of these issues later in this chapter.
Since the mid-1980s, several other books have appeared on meta-analysis. Some
of these treat the topic generally (e.g., Cooper, 1998; Hunter & Schmidt, 2004;
Lipsey & Wilson, 2001), some treat it from the perspective of particular research
design conceptualizations (e.g., Eddy, Hassleblad, & Schachter, 1992; Mullen, 1989),
some are tied to particular software packages (e.g., Johnson, 1993; Wang &
Bushman, 1999), and some look to the future of research synthesis as a scientific
endeavor (e.g., Cook et al., 1992; Wachter & Straf, 1990).
During and after the years that the works mentioned above were appearing,
literally thousands of meta-analyses were published. In 1994, the first edition of
Handbook of Research Synthesis was published (Cooper & Hedges, 1994). Through
the 1990s, the use of meta-analysis spread from psychology and education (see
Hunt, 1997, for a history of these efforts) through many disciplines, especially social
policy analysis and the medical sciences (see Chalmers, Hedges, & Cooper, 2002, for
a history of meta-analysis in medicine). One of the most notable events in medicine
was the establishment of the U.K. Cochrane Center in 1992. The Center was meant
to facilitate the creation of an international network to prepare and maintain sys-
tematic syntheses of the effects of interventions across the spectrum of health care
practices. At the end of 1993, an international network of individuals, called the
Cochrane Collaboration (www.cochrane.org/index.htm), emerged from this initia-
tive (Bero & Rennie, 1995; Chalmers, 1993). By 2006, the Cochrane Collaboration
was an internationally renowned initiative with 11,000 people in more than 90
countries contributing to its work. The Cochrane Collaboration is now the leading
producer of research syntheses in health care and is considered by many to be the
gold standard for determining the effectiveness of different health care interven-
tions. Its library of systematic syntheses numbers in the thousands. In 2000, an ini-
tiative called the Campbell Collaboration (www.campbellcollaboration.org) was
begun with similar objectives for the domain of social policy analysis, focusing
initially on policies concerning education, social welfare, and crime and justice.
Stages of Research Synthesis

Problem Formulation
  Research question asked: What evidence should be included in the review?
  Primary function in review: Constructing definitions that distinguish relevant from irrelevant studies.
  Procedural differences that create variation: Differences in included operational definitions.
  Sources of potential invalidity in review conclusions: 1. Narrow concepts might make review conclusions less definitive and robust. 2. Superficial operational detail might obscure interacting variables.

Data Collection
  Research question asked: What procedures should be used to find relevant evidence?
  Primary function in review: Determining which sources of potentially relevant studies to examine.
  Procedural differences that create variation: Differences in the research contained in the sources of information.
  Sources of potential invalidity in review conclusions: 1. Accessed studies might be qualitatively different from the target population of studies. 2. People sampled in accessible studies might be different from the target population of people.

Data Evaluation
  Research question asked: What retrieved evidence should be included in the review?
  Primary function in review: Applying criteria to separate valid from invalid studies.
  Procedural differences that create variation: Differences in quality criteria.
  Sources of potential invalidity in review conclusions: 1. Nonquality factors might cause improper weighting of study information. 2. Omissions in study reports might make conclusions unreliable.

Analysis and Interpretation
  Research question asked: What procedures should be used to make inferences about the literature as a whole?
  Primary function in review: Synthesizing valid retrieved studies.
  Procedural differences that create variation: Differences in rules of inference.
  Sources of potential invalidity in review conclusions: 1. Rules for distinguishing patterns from noise might be inappropriate. 2. Review-based evidence might be used to infer causality.

Public Presentation
  Research question asked: What information should be included in the review report?
  Primary function in review: Applying editorial criteria to separate important from unimportant information.
  Procedural differences that create variation: Differences in guidelines for editorial judgment.
  Sources of potential invalidity in review conclusions: 1. Omission of review procedures might make conclusions irreproducible. 2. Omissions of review findings and study procedures might make conclusions obsolete.

SOURCE: From Synthesizing Research: A Guide for Literature Reviews (3rd ed.), by H. M. Cooper, 1998. Reprinted with permission of SAGE.
abstraction. That is, conceptual definitions can differ in breadth, or in the number
of events to which they refer. Let's take as an example the concept of homework.
One synthesist may consider as homework only assignments meant to have students
practice what they have learned in class, whereas another may include assignments
to visit museums or to watch certain television programs. In such a case, the second
synthesist employs a broader conception of homework, and this synthesis will likely
contain more research than will the first.
As in primary research, in order to relate concepts to concrete events, the vari-
ables of interest in a research synthesis also must be operationally defined. An oper-
ational definition provides a description of the characteristics of observable events
that are used to determine whether the event represents an occurrence of the con-
ceptual variable. Synthesists can also vary in the way operations are treated after the
relevant research has been retrieved. Thus, synthesists who employ identical con-
ceptual definitions of homework and who include the same set of studies can still
reach decidedly different conclusions if one synthesist retrieved more information
about the features of studies and recognized a relation between a study feature and
outcome that the other synthesist did not test. One synthesist might discover that
the outcomes of homework studies depended on whether textbook or teacher-
developed tests were used to assess impact, whereas another synthesist never even
coded studies based on this feature of the outcome measure.
Each difference in how a problem is formulated introduces a potential threat to
the trustworthiness of a synthesis's conclusions. First, synthesists who focus on very
narrow conceptualizations provide little information about how many different
contexts a finding applies to. Therefore, synthesists who employ broad conceptual
definitions can potentially produce more valid conclusions than ones using narrow
definitions. However, broad definitions can lead to the erroneous conclusion that
research results are insensitive to variations in a study's context. We can assume,
therefore, that synthesists who examine more operational details within their
broader constructs will produce more trustworthy conclusions. These synthesists
present more information about contextual variations that do and do not influence
the synthesis outcome.
studies), and nonsignificant findings that do not permit rejection of the hypothesis
that the fear-arousing ads had no effect. The synthesist then would declare the
largest pile the winner. In our example, the null hypothesis wins.
This vote count of significant findings has much intuitive appeal and has been
used quite often. However, the strategy is unacceptably conservative. The problem
is that chance alone should produce only about 5% of all reports falsely indicating
that viewing the ads created more negative attitudes toward smoking. Therefore,
depending on the number of studies, 10% or less of positive and statistically signif-
icant findings might indicate a real difference due to the ads. However, the vote-
counting strategy requires that a minimum 34% of findings be positive and
statistically significant before the hypothesis is ruled a winner. Thus, the vote count-
ing of significant findings could, and often does, lead to the suggested abandonment
of hypotheses (and effective treatment programs) when, in fact, no such conclusion
is warranted.
Hedges and Olkin (1980) describe a different way to perform vote counts in
research synthesis. This procedure involves (a) counting the number of positive and
negative results, regardless of significance, and (b) applying the sign test to deter-
mine if one direction appears in the literature more often than would be expected
by chance. This vote-count method has the advantage of using all studies but suf-
fers because it does not weight a studys contribution by its sample size. Thus, a
study with 100 participants is given weight equal to a study with 1,000 participants.
This is a potential problem because large samples are likely to provide more precise
answers to questions. Therefore, results from larger samples should be given more
weight. Furthermore, the revealed magnitude of the hypothesized relation (or
impact of the treatment under evaluation) in each study is not considereda study
showing a small positive attitude change is given equal weight to a study showing a
large negative attitude change. Still, the vote count of directional findings can be an
informative complement to other meta-analytic procedures and can even be used
to generate an effect size estimate (see Bushman, 1994; Hedges & Olkin, 1985).
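The directional vote count just described can be sketched in a few lines of Python; the vector of study directions below is hypothetical, and the sign test is implemented here with an exact binomial test, which is one standard way to carry it out.

```python
# Sketch of a directional vote count: tally positive and negative findings
# (ignoring significance) and apply a sign test against a 50/50 split.
from scipy.stats import binomtest

directions = [+1, +1, -1, +1, +1, +1, -1, +1, +1, +1]  # +1 = ad group more negative toward smoking
n_positive = sum(1 for d in directions if d > 0)

# Under the null hypothesis of no effect, positive and negative directions
# are equally likely (p = .5).
result = binomtest(n_positive, n=len(directions), p=0.5, alternative="greater")
print(f"{n_positive} of {len(directions)} studies favor the ads; p = {result.pvalue:.3f}")
```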
discover if variations in study outcomes exist and, if so, what features of studies
might account for them.
Although numerous estimates of effect size are available, three dominate the
literature. The first, called the d-index by Cohen (1988; also see Hedges & Olkin,
1985; Rosenthal, 1994), is a scale-free measure of the separation between two group
means. Calculating the d-index for any study involves dividing the difference
between the two group means by either their average standard deviation or the
standard deviation of the control group. For example, Cooper, Robinson, and Patall
(2006) examined the difference in academic achievement of students who did and
did not do homework. Across five studies that manipulated the presence of home-
work, the average d-index was 0.60 favoring the homework doers. Thus, the average
academic achievement of students who did homework was 0.60 standard deviations
above the average score of students who did not.
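For readers who want to see the arithmetic, here is a minimal sketch of the d-index for a single hypothetical study. It uses the pooled standard deviation as the denominator; as the text notes, the average or control-group standard deviation may be used instead, and the summary statistics are invented for illustration.

```python
# Sketch of the d-index for one study: difference between group means divided
# by a standard deviation (here, the pooled SD). All values are hypothetical.
import math

mean_hw, sd_hw, n_hw = 78.0, 10.0, 40      # homework group
mean_no, sd_no, n_no = 72.0, 11.0, 38      # no-homework group

pooled_sd = math.sqrt(((n_hw - 1) * sd_hw**2 + (n_no - 1) * sd_no**2)
                      / (n_hw + n_no - 2))
d = (mean_hw - mean_no) / pooled_sd
print(f"d = {d:.2f}")   # positive d favors the homework group
```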
Figure 11.1 presents the d-indices associated with three hypothetical studies. In
Figure 11.1a, the fear-arousing ad has no effect on adolescents reported attitudes
toward smoking, thus d = 0. In Figure 11.1b, the average adolescent viewing the ad
has an attitude score that is four tenths of a standard deviation more negative than
the average adolescent viewing control ads. Here, d = 0.40. In Figure 11.1c, d = 0.85,
indicating an even greater separation between the two group means.
In many instances, synthesists will find that primary researchers do not report
the means and standard deviations of the separate groups. For such cases, meta-
analysts can use one of a number of computational formulas that do not require
means and standard deviations. The interested reader may refer to Rosenthal
(1994) or Lipsey and Wilson (2001) for listings of algebraically equivalent formu-
las that can be used to compute an effect size from various statistical information.
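One such algebraically equivalent conversion turns an independent-samples t value and the two group sizes into a d-index. The sketch below applies the standard conversion to hypothetical values; it is offered as an illustration of the kind of formula listed in those sources, not a reproduction of any particular one.

```python
# Recovering d from a reported independent-samples t statistic and group sizes:
# d = t * sqrt(1/n1 + 1/n2). The t value and sample sizes are hypothetical.
import math

t_value, n1, n2 = 2.10, 45, 47
d = t_value * math.sqrt(1 / n1 + 1 / n2)
print(f"d = {d:.2f}")
```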
Another effect size metric is the r-index, or the Pearson product-moment correla-
tion coefficient. Typically, it is used to measure the degree of linear relation between
two variables. The correlation coefficient is familiar to most researchers and is most
appropriate when describing the relationship between two continuous variables. For
example, Cooper and colleagues (2006) found 32 studies that described the correla-
tions between the time a student spent on homework and a measure of academic
achievement. The average correlation for the 32 studies was r = 0.24, suggesting that
more time spent on homework is related to greater academic achievement.
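Averaging r-indices across studies is often done by converting each correlation to Fisher's z, weighting by sample size, and converting back. The chapter does not prescribe this particular procedure, so the sketch below, with hypothetical study values, should be read as one common approach rather than the method used by Cooper and colleagues.

```python
# One common way to average r-indices: weight each study's Fisher z by (n - 3),
# average, and back-transform. Study values are hypothetical.
import math

studies = [(0.30, 120), (0.18, 250), (0.25, 90)]   # (r, sample size) per study

num = sum((n - 3) * math.atanh(r) for r, n in studies)   # weighted Fisher z values
den = sum(n - 3 for _, n in studies)
mean_r = math.tanh(num / den)
print(f"weighted mean r = {mean_r:.2f}")
```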
The third effect size metric is the odds ratio. The odds ratio is applicable when
both variables are dichotomous and findings are presented as frequencies or pro-
portions. This measure of effect is used most in medical sciences, in which the
researcher is often interested in the effect of a treatment on mortality or the appear-
ance or disappearance of disease. It also appears frequently in studies of educational
interventions when the outcome of interest is drop-out or retention rates or crim-
inal justice studies where the outcome is recidivism. For example, if the synthesist
was interested in whether exposure to fear-arousing ads led adolescents to continue
or quit smoking, then an odds ratio would be an appropriate effect size metric.
First, the odds of smoking must be determined for each condition, that is, for partici-
pants exposed to fear-arousing advertisements and for those exposed to control
advertisements. The odds ratio is then calculated by dividing the odds of smoking in
the fear-arousing advertisement condition by the odds in the control condition.
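A worked example with hypothetical counts makes the calculation concrete:

```python
# Sketch of the odds ratio from a hypothetical 2 x 2 table: smoking status by
# advertisement condition.
smokers_fear, nonsmokers_fear = 30, 70          # exposed to fear-arousing ads
smokers_control, nonsmokers_control = 45, 55    # exposed to control ads

odds_fear = smokers_fear / nonsmokers_fear           # odds of smoking, fear-ad group
odds_control = smokers_control / nonsmokers_control  # odds of smoking, control group
odds_ratio = odds_fear / odds_control
print(f"odds ratio = {odds_ratio:.2f}")   # values below 1 favor the fear-arousing ads
```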
Figure 11.1   Three Relations Between Fear-Arousing Ads and Attitudes Toward
Smoking Expressed by the d-Index. Panel a: d = 0 (identical distributions for the ad
group and the control group). Panel b: d = .40 (ad group vs. control group). Panel c:
d = .85 (ad group vs. control group).
Moderator Analyses
Another advantage of performing a statistical integration of research is that it
allows synthesists to test hypotheses about why the outcomes of studies differ. To
continue with the fear-arousing ad example, the synthesist might calculate average
d-indices for subsets of studies, deciding that he or she wants different estimates
based on certain characteristics of the data. For example, the synthesist might want
to compare separate estimates for studies that use different outcomes, distinguish-
ing between those that measured likelihood of smoking and those that measured
attitude toward smoking. Or, the synthesist might wish to compare the average
effect sizes for different media formats, distinguishing print from video advertise-
ments. Or, the synthesist might want to look at whether advertisements are differ-
entially effective for males and females.
The ability to ask these questions about variables that moderate effects reveals one
of the major contributions of research synthesis. Specifically, even if no individual
study has compared different outcomes, media, or adolescent sexes, by comparing
results across studies the synthesist can get a first hint about whether these variables
would be important to look at in future research and/or as guides to policy.
Without the aid of statistics, the synthesist simply examines the differences in
outcomes across studies, groups them informally by study features, and decides
(based on an "interocular inference test") whether the feature is a significant pre-
dictor of variation in outcomes. At best, this method is imprecise. At worst, it leads
to incorrect inferences. In contrast, meta-analysis provides a formal means for test-
ing whether different features of studies explain variation in their outcomes. After
calculating the average effect sizes for different subgroups of studies, the synthesist
can statistically test whether these factors are reliably associated with different mag-
nitudes of effect, again using homogeneity analyses. As previously suggested, homo-
geneity analysis allows meta-analysts to test whether sampling error alone accounts
for variation in effect sizes or whether features of studies, samples, treatment
designs, or outcome measures also play a role. This test is analogous to conducting
an analysis of variance, in that a significant homogeneity statistic indicates that
at least one group mean differs from the others. It is relatively simple to carry out
a homogeneity analysis; formulas are described in Cooper (1998), Cooper and
Hedges (1994), Hedges and Olkin (1985), and Lipsey and Wilson (2001).
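A bare-bones version of such a homogeneity test can be sketched as follows. The effect sizes and variances are hypothetical, and the simple fixed-effect weighting shown here is only one of the approaches described in the sources just cited.

```python
# Sketch of a fixed-effect homogeneity (Q) test: weight each effect size by the
# inverse of its sampling variance, compute the weighted mean, and compare the
# weighted sum of squared deviations to a chi-square distribution with k - 1
# degrees of freedom. All values are hypothetical.
from scipy.stats import chi2

effects   = [0.10, 0.35, 0.42, 0.05, 0.28]   # d-indices from k = 5 studies
variances = [0.02, 0.03, 0.04, 0.02, 0.05]   # sampling variance of each d

weights = [1 / v for v in variances]
mean_d = sum(w * d for w, d in zip(weights, effects)) / sum(weights)
q_stat = sum(w * (d - mean_d) ** 2 for w, d in zip(weights, effects))
p_value = chi2.sf(q_stat, df=len(effects) - 1)
print(f"mean d = {mean_d:.2f}, Q = {q_stat:.2f}, p = {p_value:.3f}")
```

A significant Q suggests that sampling error alone does not explain the variation in effect sizes, prompting a search for moderating study features.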
An alternative strategy for examining whether particular characteristics of stud-
ies are related to the sizes of the treatment effect is meta-regression. Unlike the
strategy previously discussed, meta-regression allows the meta-analyst to explore
the relationship between continuous, as well as categorical, characteristics and
effect size, and allows the effects of multiple factors to be investigated simultane-
ously (Thompson & Higgins, 2002). In our example, imagine that our studies
ranged in the duration of exposure to fear-arousing ads. One option would be to
group studies into several distinct categories of duration of exposure to fear-arousing
ads and continue with subgroup moderator analyses as previously discussed.
However, an alternative would be to employ meta-regression, leaving this charac-
teristic continuous. The interested reader may refer to Thompson and Higgins
(2002) or Higgins and Thompson (2004) for a full discussion of this method.
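As a rough sketch of what such a meta-regression involves, the code below fits a fixed-effect weighted least squares regression of invented d-indexes on a continuous exposure-duration moderator. A random-effects meta-regression would add a between-study variance component to the weights; the dedicated routines discussed in the sources above handle that refinement.

```python
import numpy as np

# Hypothetical effect sizes, sampling variances, and seconds of ad exposure for eight studies.
d = np.array([0.10, 0.22, 0.35, 0.30, 0.48, 0.55, 0.62, 0.70])
v = np.array([0.05, 0.04, 0.06, 0.05, 0.04, 0.05, 0.06, 0.04])
exposure = np.array([15.0, 30.0, 30.0, 45.0, 60.0, 60.0, 90.0, 120.0])

# Fixed-effect meta-regression: weighted least squares with weights 1/v.
W = np.diag(1.0 / v)
X = np.column_stack([np.ones_like(exposure), exposure])  # intercept and slope
beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ d)
se = np.sqrt(np.diag(np.linalg.inv(X.T @ W @ X)))

print(f"intercept = {beta[0]:.3f}, slope per second of exposure = {beta[1]:.4f} (SE {se[1]:.4f})")
```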
In sum, a generic meta-analysis might contain three or four separate sets of
statistics: (a) a frequency analysis of positive and negative results, (b) estimates of
average effect sizes with confidence intervals, (c) homogeneity analyses to assess
dispersion and examine study features that might influence study outcomes, and
(d) when continuous moderators are of interest, meta-regression analyses.
Publish or Perish
Research synthesists disagree about how exhaustive a literature search needs to
be. Some synthesists go to great lengths to locate as much relevant material as pos-
sible; others are less thorough. Typically, disagreement centers on the importance
of including unpublished research in syntheses.
Those in favor of limiting syntheses to only published material argue that pub-
lication is an important screening device for maintaining quality control. Because
published research has been reviewed for quality, it provides the best evidence avail-
able. Also, the inclusion of unpublished material typically does not change the con-
clusions drawn by synthesists. Therefore, the studies found in unpublished sources
do not warrant the additional time and effort needed to obtain them.
Those who argue that research should not be judged based on publication
status give three rationales. First, they dispute the claim that published research and
unpublished research yield similar results; statistically significant results are more
likely to be published (Begg, 1994). That is, studies revealing smaller effects may be
systematically censored from the published literature, making relationships appear
stronger than if all estimates were retrieved (Rothstein, Sutton, & Borenstein, 2005).
Lipsey and Wilson (1993) compared the magnitudes of effects reported in pub-
lished versus unpublished studies contained in 92 different research syntheses. They
reported that the impacts of interventions in unpublished research were, on aver-
age, one third smaller than published effects.
Second, even if publication status does relate to the quality of research, there will
still be much overlap in the quality of published and unpublished studies. Superior
studies sometimes are not submitted or are turned down for publication for other
reasons. Inferior studies sometimes find their way into print. Application of the
"publish or perish" rule may lead to the omission of numerous high-quality stud-
ies and will not ensure that only high-quality studies are included in the synthesis.
And finally, in a meta-analysis, both the reliability of effect size estimates,
expressed through the size of confidence intervals, and tests for effect size modera-
tors will depend on the amount of available data. Therefore, synthesists may unnec-
essarily impede their ability to make confident statistical inferences by excluding
unpublished studies (Rothstein et al., 2005).
Consequently, it is accepted practice that rigorous research syntheses should
always access multiple channels to retrieve studies and operate with the goal of
obtaining all relevant research (Cooper, 1998; Lipsey & Wilson, 2001), regardless of
whether or where it was published. If the synthesis includes only published research
it must be accompanied by a convincing justification.
One proposed remedy is the creation of research registers, which require the design of a
study as well as eventual results to be publicly available. This would create an unbi-
ased compilation of studies for subsequent meta-analyses and allow the synthesist
to obtain information and results about studies regardless of the significance of
their findings or publication status. In prospective meta-analysis, studies are iden-
tified and determined to be eligible before the results of any of the studies are
known. Prospective meta-analysis may be accomplished when multiple groups
of investigators agree to combine their findings on completion. Furthermore, the
comparability of research included in the meta-analysis is improved when investi-
gators also decide prospectively to employ the same methods and assessment
instruments across studies. Because the studies and specific analyses to be included
in the meta-analysis are determined prior to any single study being conducted,
missing data and data censoring are virtually eliminated.
A related issue is how to handle studies that contribute more than one effect size. For example, if a single study of
adolescents used two different measures, one attitudinal and one behavioral, two
separate d-indexes would be calculated. In the shifting unit of analysis approach,
for estimating the overall relation between exposure to fear-arousing ads and
smoking, statistical independence would be maintained by averaging these two
d-indexes prior to entry into the analysis, so that the study only contributes one
effect size. However, in an analysis that examined the effect of measurement
characteristics, attitudinal or behavioral, on smoking outcomes, this sample
would contribute one estimate to each category in the moderator analysis. This
shifting unit of analysis approach retains as much data as possible from each
study while holding to a minimum any violations of the assumption that data
points are independent.
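The bookkeeping behind the shifting unit of analysis can be sketched as follows; the study identifiers, outcome labels, and d-indexes are invented, and the averaging shown is the simple unweighted version described in the text.

```python
from collections import defaultdict

# Hypothetical effect sizes: (study, outcome type, d-index).
effects = [
    ("study1", "attitude", 0.50),
    ("study1", "behavior", 0.30),  # study1 contributes two d-indexes
    ("study2", "attitude", 0.45),
    ("study3", "behavior", 0.20),
]

# Overall analysis: average within each study so that every study contributes one effect size.
by_study = defaultdict(list)
for study, _, d in effects:
    by_study[study].append(d)
overall_units = {study: sum(ds) / len(ds) for study, ds in by_study.items()}

# Moderator analysis by outcome type: the same study may now contribute one estimate per category.
by_outcome = defaultdict(list)
for study, outcome, d in effects:
    by_outcome[outcome].append(d)

print(overall_units)     # study1 contributes the average of its two d-indexes (0.4)
print(dict(by_outcome))  # study1 contributes one estimate to each outcome category
```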
Models of Error
Another aspect of conducting a meta-analysis that has recently received consid-
erable attention involves the decision about whether a fixed-effects or random-
effects model of error underlies the generation of study outcomes. In a fixed-effects
model, all studies are assumed to be drawn from a common population. As such,
variance in effect sizes is assumed to reflect only sampling error, that is, error solely
due to participant differences. However, sometimes other features of studies can be
viewed as random influences. For example, studies that look at the impact of fear-
arousing advertisements on smoking might vary in the length of exposure to ads or
in how the ads are introduced to participants. In this case, it may be most appro-
priate to consider advertisements as randomly sampled from all fear-arousing
advertisements. That is, in a random-effect analysis, study-level variance is assumed
to be present as an additional source of random influence.
The question meta-analysts must ask is whether the effect sizes in their data set
are affected by a large number of these study-level random influences. If it is the
case that the meta-analysts suspect a large number of these additional sources
of random error, then a random-effects model is most appropriate to take these
sources of variance into account. If meta-analysts suspect that the data are
most likely little affected by other sources of random variance, then a fixed-effects
model can be applied. Alternatively, Hedges and Vevea (1998) state that fixed-effect
models of error are most appropriate "when the goal of the research is to make
inferences only about the effect size parameters in the set of studies that are
observed (or a set of studies identical to the observed studies except for uncertainty
associated with the sampling of subjects)" (p. 3). A further consideration is that in
the search for moderators, fixed-effect models may seriously underestimate error
variance and random-effects models may seriously overestimate error variance
when their assumptions are violated (Overton, 1998).
In view of these competing sets of concerns, the meta-analysts might consider
applying both models (e.g., Cooper et al., 2006). Specifically, all analyses could be
conducted twice, once employing fixed-effect assumptions and once using ran-
dom-effect assumptions. Differences in results based on which set of assumptions
is used can be incorporated into the interpretation and discussion of findings.
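Running the analysis under both sets of assumptions can be sketched in a few lines. The d-indexes and variances below are invented, and the between-study variance is estimated here with the common DerSimonian-Laird method, which is one of several estimators described in the texts cited above; the random-effects interval comes out wider because study-level variance is added to each weight.

```python
import numpy as np

# Hypothetical d-indexes and their sampling variances for five studies.
d = np.array([0.20, 0.35, 0.60, 0.10, 0.55])
v = np.array([0.03, 0.04, 0.05, 0.03, 0.04])

def pooled(d, v, tau2=0.0):
    """Inverse-variance weighted average; tau2 > 0 adds between-study (random-effects) variance."""
    w = 1.0 / (v + tau2)
    mean = np.sum(w * d) / np.sum(w)
    se = np.sqrt(1.0 / np.sum(w))
    return mean, se

fe_mean, fe_se = pooled(d, v)  # fixed-effect estimate

# DerSimonian-Laird estimate of the between-study variance (tau squared).
w = 1.0 / v
q = np.sum(w * (d - np.sum(w * d) / np.sum(w)) ** 2)
c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
tau2 = max(0.0, (q - (len(d) - 1)) / c)

re_mean, re_se = pooled(d, v, tau2)  # random-effects estimate

print(f"fixed effect:   {fe_mean:.3f} +/- {1.96 * fe_se:.3f}")
print(f"random effects: {re_mean:.3f} +/- {1.96 * re_se:.3f}")
```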
Discussion Questions
1. What is the primary impetus for adoption of meta-analysis in the social
sciences?
2. Name several channels by which to search for relevant literature. What are
the strengths, weaknesses, and cost-effectiveness of each?
4. What criteria are most crucial to consider when evaluating the quality of
primary research?
5. What criteria are most crucial to consider when evaluating the quality of a
research synthesis?
Exercises
1. Identify a conceptual variable and list the operational definitions associated
with it that are known to you now.
2. List the keywords that you would use to search for articles relevant to your
conceptual variable in electronic reference databases. Use them to identify other
related terms in the thesauri of at least two reference databases. What did you learn
about your concepts from the new keywords you discovered? Did the keywords
differ for the different reference databases and if so, how?
3. Find several reports that describe research relevant to your topic. How many
new operational definitions did you find? Evaluate these with regard to their corre-
spondence to the conceptual variable.
4. Read two research syntheses. Outline what the authors report on each of the
following: (a) how the literature search was conducted, (b) what rules were used to
decide if studies were relevant to the hypothesis, and (c) what rules were used to
decide if cumulative relations existed. Was there any information that the synthe-
sists did not report that would be needed to fully evaluate the quality of the research
syntheses?
References
Barber, T. (1978). Expecting expectancy effects: Biased data analyses and failure to exclude
alternative interpretations in experimenter expectancy research. Behavioral and Brain
Sciences, 3, 38.
Becker, B. J. (2005, November). Synthesizing slopes in meta-analysis. Paper presented at
the meeting on Research Synthesis and Meta-Analysis: State of the Art and Future
Directions, Durham, NC.
Begg, C. B. (1994). Publication bias. In H. M. Cooper & L. V. Hedges (Eds.), Handbook of
research synthesis (pp. 399–409). New York: Russell Sage Foundation.
Begg, C. B., & Mazumdar, M. (1994). Operating characteristics of a rank correlation test for
publication bias. Biometrics, 50, 1088–1101.
Berlin, J. A., & Ghersi, D. (2005). Preventing publication bias: Registries and prospective meta-
analysis. In H. R. Rothstein, A. J. Sutton, & M. Borenstein (Eds.), Publication bias in meta-
analysis: Prevention, assessment and adjustments (pp. 35–48). Chichester, UK: John Wiley.
Berlin, J. A., & Rennie, D. (1999). Measuring the quality of trials. Journal of the American
Medical Association, 282, 1083–1085.
Bero, L., & Rennie, D. (1995). The Cochrane Collaboration: Preparing, maintaining, and
disseminating systematic reviews of the effects of health care. Journal of the American
Medical Association, 274, 1935–1938.
Borenstein, M., Hedges, L., Higgins, J., & Rothstein, H. (2005). Comprehensive Meta Analysis
(Version 2.1) [Computer software]. Englewood, NJ: BioStat.
Bushman, B. J. (1994). Vote-counting procedures in meta-analysis. In H. M. Cooper &
L. V. Hedges (Eds.), Handbook of research synthesis (pp. 193–213). New York: Russell
Sage Foundation.
Campbell, D. T., & Stanley, J. C. (1966). Experimental and quasi-experimental designs for
research. Chicago: Rand McNally.
Chalmers, I. (1993). The Cochrane Collaboration: Preparing, maintaining and disseminat-
ing systematic reviews of the effects of health care. Annals of the New York Academy
of Sciences, 703, 156–163.
Chalmers, I., Hedges, L. V., & Cooper, H. (2002). A brief history of research synthesis.
Evaluation & the Health Professions, 25, 12–37.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences. Hillsdale, NJ: Lawrence
Erlbaum.
Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design and analysis issues
for field settings. Chicago: Rand McNally.
Cook, T. D., Cooper, H. M., Cordray, D. S., Hartmann, H., Hedges, L. V., Light, R. J., et al.
(1992). Meta-analysis for explanation: A casebook. New York: Russell Sage Foundation.
Cooper, H. M. (1979). Statistically combining independent studies: A meta-analysis of sex dif-
ferences in conformity research. Journal of Personality and Social Psychology, 37, 131–146.
Cooper, H. M. (1982). Scientific guidelines for conducting integrative research reviews.
Review of Educational Research, 52, 291–302.
Cooper, H. M. (1998). Synthesizing research: A guide for literature reviews (3rd ed.).
Thousand Oaks, CA: Sage.
Cooper, H. M., & Hedges, L. V. (Eds.). (1994). Handbook of research synthesis. New York:
Russell Sage Foundation.
Cooper, H., Robinson, J. C., & Patall, E. A. (2006). Does homework improve academic achieve-
ment? A synthesis of research, 1987–2003. Review of Educational Research, 76, 1–62.
Cooper, H. M., & Rosenthal, R. (1980). Statistical versus traditional procedures for summa-
rizing research findings. Psychological Bulletin, 87, 442–449.
Duval, S., & Tweedie, R. (2000a). A nonparametric "trim and fill" method of accounting
for publication bias in meta-analysis. Journal of the American Statistical Association,
95, 89–98.
Duval, S., & Tweedie, R. (2000b). Trim and fill: A simple funnel plot-based method of
testing and adjusting for publication bias in meta-analysis. Biometrics, 56, 276–284.
Eddy, D. M., Hasselblad, V., & Schachter, R. (1992). Meta-analysis by the confidence profile
method. New York: Academic Press.
Egger, M., Davey Smith, G., Schneider, M., & Minder, C. (1997). Bias in meta-analysis
detected by a simple, graphical test. British Medical Journal, 315, 629–634.
Eysenck, H. (1978). An exercise in mega-silliness. American Psychologist, 33, 517.
Feldman, K. A. (1971). Using the work of others: Some observations on synthesizing and
integrating. Sociology of Education, 4, 86–102.
Fisher, R. A. (1932). Statistical methods for research workers. London: Oliver & Boyd.
Glass, G. V. (1976). Primary, secondary, and meta-analysis of research. Educational Researcher,
5, 3–8.
Glass, G. V., McGaw, B., & Smith, M. L. (1981). Meta-analysis in social research. Beverly Hills,
CA: Sage.
Glass, G. V., & Smith, M. L. (1979). Meta-analysis of research on class size and achievement.
Educational Evaluation and Policy Analysis, 1, 2–16.
Gleser, L. J., & Olkin, I. (1994). Stochastically dependent effect sizes. In H. Cooper & L. V. Hedges
(Eds.), Handbook of research synthesis (pp. 339–355). New York: Russell Sage Foundation.
Greenhouse, J. B., & Iyengar, S. (1994). Sensitivity analysis and diagnostics. In H. M. Cooper
& L. V. Hedges (Eds.), Handbook of research synthesis (pp. 383–398). New York: Russell
Sage Foundation.
Greenwald, R., Hedges, L. V., & Laine, R. D. (1996). The effect of school resources on student
achievement. Review of Educational Research, 66, 361–396.
Hanushek, E. A. (1989). The impact of differential expenditures on school performance.
Educational Researcher, 18, 45–51.
Hedges, L. V., Cooper, H. M., & Bushman, B. J. (1992). Testing the null hypothesis in meta-
analysis: A comparison of combined probability and confidence interval procedures.
Psychological Bulletin, 111, 188–194.
Hedges, L. V., & Olkin, I. (1980). Vote-counting methods in research synthesis. Psychological
Bulletin, 88, 359–369.
Hedges, L. V., & Olkin, I. (1985). Statistical methods for meta-analysis. Orlando, FL: Academic
Press.
Hedges, L. V., & Vevea, J. L. (1998). Fixed and random effects models in meta-analysis.
Psychological Methods, 3, 486–504.
Higgins, J. P. T., & Thompson, S. G. (2004). Controlling the risk of spurious findings from
meta-regression. Statistics in Medicine, 23, 1663–1682.
Hunt, M. (1997). How science takes stock: The story of meta-analysis. New York: Russell Sage
Foundation.
Hunter, J. E., & Schmidt, F. L. (2004). Methods of meta-analysis: Correcting error and bias in
research findings (2nd ed.). Thousand Oaks, CA: Sage.
Hunter, J. E., Schmidt, F. L., & Hunter, R. (1979). Differential validity of employment tests by
race: A comprehensive synthesis and analysis. Psychological Bulletin, 86, 721–735.
Hunter, J. E., Schmidt, F. L., & Jackson, G. B. (1982). Meta-analysis: Cumulating research find-
ings across studies. Beverly Hills, CA: Sage.
Jackson, G. B. (1980). Methods for integrative reviews. Review of Educational Research,
50, 438–460.
Johnson, B. T. (1993). DSTAT: Software for the meta-analytic synthesis of research/book, update
and disc. Hillsdale, NJ: Erlbaum.
Jüni, P., Witschi, A., Bloch, R., & Egger, M. (1999). The hazards of scoring the quality of clini-
cal trials for meta-analysis. Journal of the American Medical Association, 282, 1054–1060.
Light, R. J., & Pillemer, D. B. (1984). Summing up: The science of reviewing research.
Cambridge, MA: Harvard University Press.
Light, R. J., & Smith, P. V. (1971). Accumulating evidence: Procedures for resolving contra-
dictions among research studies. Harvard Educational Review, 41, 429–471.
Lipsey, M. W., & Wilson, D. B. (1993). The efficacy of psychological, educational, and
behavioral treatment: Confirmation from meta-analysis. American Psychologist, 48,
1181–1209.
Lipsey, M. W., & Wilson, D. B. (2001). Practical meta-analysis. Thousand Oaks, CA: Sage.
Mahoney, M. (1977). Publication prejudice: An experimental study of confirmatory bias in
the peer review system. Cognitive Therapy and Research, 1, 161–175.
Wachter, K. W., & Straf, M. L. (Eds.). (1990). The future of meta-analysis. New York: Russell
Sage Foundation.
Wang, M. C., & Bushman, B. J. (1999). Integrating results through meta-analytic synthesis
using SAS software. Cary, NC: SAS Institute.
Wu, M., & Becker, B. J. (2004, April). Synthesizing results from regression studies: What can we
learn from combining results from studies using large data sets? Paper presented at the
annual meeting of the American Educational Research Association, San Diego, CA.
PART III
Practical Data Collection
decide whether the focus group approach would be useful for answering their
research questions. The authors outline the steps to designing, conducting, and analyzing
a focus group, including framing its purpose, selecting the participants, developing
the interview guide, conducting the group, and analyzing and interpreting the data.
Opportunities offered by new technology, both in analyzing the data and in conduct-
ing virtual groups (groups whose members cannot be brought together in one location), are described.
The authors provide the important reminder that, regardless of the technology used
either in the analysis or conduct of the focus group, validity is not ensured and needs
to be addressed throughout the focus group process.
CHAPTER 12
Design and Evaluation of Survey Questions
Floyd J. Fowler Jr.
Carol Cosenza
A critical part of the science of survey research is the empirical evaluation of sur-
vey questions. Like measurement in all sciences, the quality of measurement in sur-
vey research varies. Good science entails attempting to minimize error and taking
steps to measure the remaining error so that we know how good our data are and
we can continue to improve our methods.
There are two types of question evaluation: those aimed at evaluating how well
questions meet the four standards above, which can be thought of as process stan-
dards, and those aimed at assessing the validity of answers that result. In order to
assess the extent to which questions meet process standards, we can take a number
of possible steps. These include (a) systematic question review; (b) cognitive inter-
views, in which people's comprehension of questions and how they go about
answering them are probed and evaluated; and (c) field pretests under realistic
conditions. Each of these activities has strengths and limitations in terms of the
kinds of information they provide about questions. However, in the past decade,
there has been growing appreciation of the importance of evaluating questions
before using them in a research project, and a great deal has been learned about
how to use these techniques to provide systematic information about questions
(see, e.g., Presser et al., 2004).
The evaluation of validity usually occurs after data have been collected and
entails specific analyses aimed at producing evidence that the answers are measur-
ing what they were intended to measure.
We begin this chapter by describing what we know about how to design survey
questions. The discussion is separated by whether the focus is on measuring objec-
tive facts or subjective states of respondents, such as knowledge, opinions, or feel-
ings. The latter part of the chapter is devoted to the objective evaluation of survey
questions. The overall goal in this chapter is to describe how to design survey ques-
tions that will be good measures.
Question Objectives
One of the hardest tasks for methodologists is to induce researchers, people who
want to collect data, to define their objectives. The difference between a question
objective and the question itself is a critical distinction. The objective defines the
kind of information that is needed. Designing the particular question or questions
to achieve the objective is an entirely different step. In fact, this chapter is basically
about the process of going from a question objective to a set of words, a question,
the answers to which will achieve that objective.
Sometimes the distance between the objective and the question is short:
Objective: Age
Possible Example 1a: How old were you on your last birthday?
Possible Example 1b: On what date were you born?
The answer to either of these questions probably will meet this question
objective most of the time. An ambiguity might be whether age is required to
the exact year, or whether broad categories, or a rounded number, will suffice.
Example 1a produces more ages rounded to 0 or 5. Example 1b may be less sen-
sitive to answer than Example 1a for some people, because it does not require
that the respondent explicitly state an age. There also may be some difference
between the questions in how likely people are to err in their answers due to
recall or miscalculations. However, the relationship between the objective and
the information asked for in the questions is close, and the two questions yield
similar results.
Each of these three questions has been used as a measure of ethnicity. However,
the results are very different. Which question is best depends on the way the analyst
plans to use the results and what is to be measured. The first question measures
race, but it does not take into account national or cultural issues. The most com-
mon measures in the United States include at least one additional question that
identifies those of Hispanic background. However, Hispanic is not a race; it cuts
across race, as there are black, white, and Indian Hispanics. Example 2a also has a
perceptual component for all those respondents who have some degree of racial
mixture in their backgrounds, so that two people with the same racial backgrounds
could answer the question differently.
When a question includes words or phrases that a respondent cannot define, there
are several things that the respondents can do: they can try to guess what the ques-
tion is asking and answer the question anyway, they can skip the question and not
answer it at all, or they can just choose an answer at random. All these options are
detrimental to the reliability of the data. It is the responsibility of the researcher to
provide the respondent with all the information needed to answer a question
including definitions or examples of words that may not be universally understood.
Sometimes, question ambiguity arises from using a common abstract word or
phrase without a definition. When that happens, it is easy for respondents to
wrongly assume that they know what the question means.
Alternative 3: Do you have access to a car or other vehicle you can use every day
to get to work?
Proper question design means making certain that the researcher and all respon-
dents are using the same definitions when classifying people or counting events. In
general, researchers have tended to solve the problem by giving the respondents a
definition to use and then asking the respondents to do the classification work.
Example 4: A health provider is anyone you would see for health care. In the last
12 months, not counting the times you needed health care right away, did you
make any appointments with a doctor or other health provider for health care?
The problem with this question is that there are numerous issues about how to
calculate income. Among them are whether income is current or for some period of
time in the past, whether it is only income earned from salaries and wages or
includes income from other sources, and whether it is only the person's own income
that is at issue or includes income of others in which the respondent might share.
Alternative 5: Next we need to get an estimate of the total income for you and
family members living with you during 2008. When you calculate income, we
would like you to include what you and other family members living with you
made from jobs and also any income that you or other family members may
have had from other sources, such as rents, welfare payments, social security,
pensions, or even interest from stocks, bonds, or savings. So including income
from all sources, before deductions for taxes, for you and for family members
living with you, how much was your total family income in 2008?
Example 6: In the last 12 months, how many times have you seen or talked with
a doctor?
It has been found that receiving advice over the telephone from a physician,
seeing nurses or assistants who work for a physician, and receiving services from
physicians who are not always thought of as medical doctors (such as psychia-
trists) often are left out. One solution is to ask a general question and then ask some
follow-up questions:
Example 6a: Other than the visits to doctors that you just mentioned, how many
times in the last 12 months have you gotten medical advice from a physician
over the telephone?
Example 6b: Other than what you've already mentioned, how many times in the
last 12 months have you gotten medical services from a psychiatrist?
Using multiple questions to cover all aspects of what is to be reported, rather than
trying to pack everything into a single definition, can be an effective way to simplify
the reporting tasks for respondents. It is one of the easiest ways to make sure that
commonly omitted types of events are included in the total count that is obtained.
In addition to being able to understand the vocabulary used in a question, it is
also important for respondents to understand for what time period they should be
answering. For any question that could reasonably be expected to vary from day to
day, week to week, or month to month, the researcher should include a time frame
or reference period. Without a time frame, it is left up to the respondent whether to
answer about today, last week, or some longer period.
Example 7: How often do you ski: more than once a week, about once a week,
two to four times a month, or less often than that?
If this question is asked in the winter, the same person could answer differently
than if it was asked in the summer. By not including any specific reference period,
respondents must choose on their own what time periods to think about. If they
choose to think about the last 30 days, the answer will likely be different than
if they think about the entire year. By allowing respondents to make their own
choices about a time frame, answers can vary for that reason alone and the data
reliability is reduced. In addition, this question also assumes a pattern of regular-
ity that may not always be the case. The alternative question below fixes both of
these problems.
Alternative 7: In the last 12 months, on about how many days did you ski?
Example 8: In the last 6 months, how often did you buy a newspaper at a
newsstand: always, sometimes, rarely, never?
This question requires at least three cognitive steps. First, respondents have to
decide whether they have bought any newspapers in the last 6 months. Then, they
have to figure out how many times they bought a newspaper at a newsstand. Then,
they have to figure out the ratio of newspapers bought at a newsstand to newspa-
pers bought elsewhere and decide which of the adjectival responses best describes
their situation. A better way to ask these kinds of complex questions is to ask about
each part separately.
Alternative 8a: In the last 30 days, about how many newspapers have you bought?
Alternative 8b: (if any) And about how many of those newspapers did you buy
at a newsstand?
at a gym or health club. The simplest solution would be to ask the questions
in reverse order, asking about exercising first, so question 9a would not be part of
the context for 9b. Another alternative would be to add a phrase to the second
question asking respondents to think about all the different places that they might
have exercised.
Another example of the influence that context can have on answers is given below.
Example 10a: The next questions refer to the joints in your body. Please do not
include the back or neck. During the past 30 days, have you had symptoms of
pain, aching, or stiffness in or around a joint?
Example 10b:
1. Have you ever been told by a doctor or other health professional that
you have some form of arthritis, rheumatoid arthritis, gout, lupus, or
fibromyalgia?
2. During the past 30 days, have you had symptoms of pain, aching, or stiff-
ness in or around a joint?
Example 10a above asks first about pain and then asks about a diagnosis. In a
recent study comparing the two examples, 58.8% of the people who answered
Example 10a answered that they had joint pain while 49.4% who answered
Example 10b said that they had joint pain. One possible explanation might be that
asking about the long list of medical conditions first gives the respondent the sense
that the questions are asking only about significant or major pain. In Example 10a,
with no previous mention of medical diagnosis, people may be more likely to
report less significant pain.
There are other characteristics of questions that can lead to inconsistent under-
standing by respondents. Good survey questions about factual data should be
about what people know and can answer. Since behavior is largely determined by
situations, asking hypothetical questions about what respondents might do in the
future is less of a factual question and more of a guess or opinion. In general, people
are not good at predicting what they will do in circumstances that they have not yet
encountered. Since it has yet to happen, respondents have to fill in what they imag-
ine might happen. However, the more experience a respondent has with similar sit-
uations, the more likely meaningful answers can be provided. When questions are
truly hypothetical, asking about situations or things with which the respondent has
little or no experience, answers are unlikely to be meaningful. Moreover, because
people have to fill in their own assumptions about what the situation will be, each
respondent is likely to be answering a different question.
Another potential problem is multibarreled questions. If a question asks about
more than one issue (e.g., "Do you want to be rich and famous?" or "Are you
unhappy and overworked?"), respondents are faced with the task of deciding, on
their own, what to do if the answer to each barrel is different ("I am not over-
worked, but I am unhappy"). To the extent that they decide differently, respondents
are answering different questions.
1. The respondent may never have had the information needed to answer the
question.
2. The respondent may once have known the information, but may have diffi-
culty recalling it.
3. For questions that require reporting events that occurred in a specific time
period, respondents may recall that the events occurred, but have difficulty
accurately placing them in the time frame called for in the question.
Lack of Knowledge
Sometimes, respondents simply do not have the information needed to answer
a question. One critical part of the preliminary work a researcher must do in
designing a survey instrument is to find out whether or not questions have been
included to which some respondents do not know the answers. The limit of survey
research is what people are able and willing to report. If a researcher wants to find
out something that is not commonly known by respondents, the researcher must
find another way to get the information.
Example 11: In the last 6 months, how often did you feel that your personal doctor
had all the information needed to correctly diagnose and treat your health problems?
Asking about other people: Sometimes, the problem of asking people questions to
which they do not know the answers is one of respondent selection rather than
question design. Many surveys ask a specific member of a household to report
information about other household members or about the household as a whole.
When such designs are chosen, a critical issue is whether or not the information
required (such as insurance or employment status) is known to the person who will
be doing the reporting.
In other situations, researchers make conscious decisions to ask a proxy respon-
dent for information, rather than talking to the person of interest. For example, it
is common to ask parents about their children and to ask family members to
report on experiences of nursing home residents. However, in situations such as
these, researchers need to be careful about the questions that they ask and the
assumptions that they make. Parents could answer factual questions about the
grade their child is in or how their child gets to school, but they are not the best
reporters of whether their child is happy in school or how many cigarettes the
child smokes. Family members of nursing home residents can report on what type
of room the resident lives in and, of course, on their own experiences of visiting
the nursing home, but they will most likely not be able to reliably answer how
often a call light is answered quickly when help is requested or how day and night
staffs compare.
There is a large literature comparing self-reporting with proxy reporting (Cannell,
Marquis, & Laurent, 1977; Clarridge & Massagli, 1989; Moore, 1988; O'Muircheartaigh,
1991; Rodgers & Herzog, 1989; Tourangeau, Rips, & Rasinski, 2000). Across all topics,
usually self-respondents are better reporters than proxy respondents.
Recall
Memory researchers tell us that few things, once directly experienced, are for-
gotten completely. The readiness with which information and experiences can be
retrieved follows some fairly well-developed principles (Cannell, Marquis, et al.,
1977; Eisenhower, Mathiowetz, & Morganstein, 1991; Tourangeau et al., 2000).
If the researcher wants information about very small events that had minimal
impact, it follows that the reference period should be quite short. For example,
when researchers want reporting about dietary intake or soft drink consumption, it
has been found that even a 24-hour recall period can produce deterioration and
reporting error due to recall. When people are asked to report their behavior over
1 or 2 weeks, they resort to giving estimates of their average or typical behavior,
rather than trying to remember (Blair & Burton, 1987).
So if a researcher wants accurate information about something such as how
many glasses of water someone drank, having respondents report for a very short
period, such as a day, is probably the only way to get reasonably accurate answers
(Smith, 1991). However, if a researcher is asking about events that probably had a
greater impact in someones life, a longer time period could be asked about.
Table 12.1 is from a study in which people were sampled based on having had a
hospital stay. The survey asked respondents about recent hospital stays; then the
researchers compared the survey responses with the actual hospital records. The table
reports the percentages of known hospitalizations that were and were not reported.
The shorter the stay in the hospital and the greater the time period between the dis-
charge and the interview, the less likely the respondent was to report the hospital stay.
More than 30% of the 1-day hospitalizations 40 weeks before the interview were not
reported at all, while only 5% of the longer stays within 20 weeks were not mentioned.
This table shows that the more important the event (such as a long hospital stay), the
more likely it was to be reportedboth in the immediate and recent past.
Table 12.1 Percentage of Known Hospitalizations Not Reported, by Weeks Between
Discharge and Interview and by Length of Stay (columns run from the shortest to the longest stays)
1–20 weeks: 21, 5, 5
21–40 weeks: 27, 11, 7
41–52 weeks: 32, 34, 22
Researchers have explored strategies for improving the quality of the recall per-
formance of respondents. One example is decomposing the question and asking
several questions about smaller parts. Asking multiple questions improves the
probability that an event will be recalled and reported (Cannell, Marquis, et al.,
1977; Sudman & Bradburn, 1982).
Example 12: During the past 30 days, how many times have you used oils to cook
food or added oils to foods like salad, pasta, or bread?
Alternative 12a: The next few questions are about oils used with food. You
should include things like vegetable oil or olive oil, but not butter or margarine.
During the past 30 days, how many times have you used oils to cook food?
Alternative 12b: During the past 30 days, how many times have you added oil to
salads, such as oil and vinegar?
Alternative 12c: During the past 30 days, how many times have you added oils to
other foods like pasta or bread?
In a recent study in which these two question series were compared, the average
number of times someone reported using oil was 11.9 times when asked Example
12 and 16.6 times when asked the three alternative questions.
Another strategy for increasing recall is stimulating associations likely to be tied
to what the respondent is supposed to report. Activating the cognitive and intellec-
tual network in which a memory is likely to be embedded is likely to improve recall
as well (Eisenhower et al., 1991).
Example 13: In the last 12 months, to how many organizations did you volun-
teer your time?
There are many different things that could count as volunteering. To help
remind the respondent of all the different things that could be included, a
researcher could provide some cues that might stimulate memories. This could be
done by asking additional questions or adding an introduction to the question.
Alternative 13: There are many ways that people volunteer their time: they
could help at a church or school, help provide meals to the homeless during the
holidays, or participate in a charity walk or other event. In the last 12 months,
for how many organizations did you volunteer your time?
There are limits to what people are able to recall. If a question requires infor-
mation that most people cannot recall easily, the data will almost certainly suffer.
However, even when the recall task is comparatively simple for most people, if get-
ting an accurate count is important, asking multiple questions and developing
questions that trigger associations that may aid recall are both effective strategies
for improving the quality of the data.
In order to improve the ability of respondents to place events in time, the simplest
step is to help clarify the time frame. For example, rather than "In the last 6 months,
did you go to any museums?" the question could add the actual month: "In the last 6
months, that is, since March, did you go to any museums?" This simple addition could
help focus the respondent on exactly what is being asked about. For in-person surveys,
showing respondents a calendar with the reference period outlined may be helpful. In
addition, respondents can be asked to recall what was going on and what kinds of
things were happening in their lives at the time of the boundary of the reporting
period. Filling in any life events, such as birthdays or jobs, can help make the dates on
the calendar more meaningful (Belli, 1998; Sudman, Finn, & Lannon, 1984).
A very different approach to improving the reporting of events in a time period is
to create an actual boundary for respondents by conducting two or more interviews.
During the initial interview, respondents are asked about events and situations that
happened during some time period before the interview. In the subsequent interview,
they are then asked about what has happened between the time of the initial inter-
view and the time of the second interview. This method is used in several national
surveys, including the National Crime Victimization Survey (NCVS, formerly called
the National Crime Survey; Groves et al., 2004). For example, the NCVS surveys are
done every 6 months, and respondents are asked about any crimes they were a victim
of in the last 6 months. To prevent telescoping of events (reporting events that
happened outside the 6-month time frame), respondents' answers are compared with
their answers in the prior survey, and duplicate events are eliminated. Obviously, such
reinterview designs are much more expensive to implement than are one-time
surveys. However, when accurate reporting of events in time is very important, they
provide a strategy that improves the quality of data.
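The deduplication step in such a reinterview design can be pictured with a small sketch; the incident records below are invented, and matching on an exact event type and date is a simplification of the editing rules actually used in surveys such as the NCVS.

```python
# Hypothetical victimization reports (event type, date) from two interview waves.
wave_1 = {("burglary", "2008-01-12"), ("theft", "2008-03-02")}
wave_2_raw = [
    ("theft", "2008-03-02"),    # telescoped: already reported in the prior wave
    ("assault", "2008-07-19"),
    ("theft", "2008-08-30"),
]

# Bounded recall: drop any current-wave report that duplicates a prior-wave report.
wave_2_bounded = [event for event in wave_2_raw if event not in wave_1]

print(wave_2_bounded)  # [('assault', '2008-07-19'), ('theft', '2008-08-30')]
```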
This could be answered in many ways: "since I was 18," "13 years," or "a long
time." All of these could be correct answers to a question about how long something
has been going on. By not providing the respondents with information about what unit
of measure to use, the researcher may be left with data that are not comparable. The
alternative version below provides enough detail for the respondent to know how
to answer.
Alternative 14: For how many years have you been working here?
Example 15 is another example where the response task is unclear.
Example 16: In the last week, did you drink coffee at breakfast: Yes, always; Yes,
sometimes; or No?
The researcher correctly worried that this question is not a simple yes or no
question. However, rather than altering the question, the response task was changed
to combine the yes/no task with a frequency question (which is not explicitly
asked). The cognitive complexity of the question can be reduced by either asking
two questions or simply changing it to a frequency.
Alternative 16: In the last week, on how many days did you drink coffee at break-
fast: every day, some days, no days?
The answer categories provided in this question are not mutually exclusive. It is
possible that someone is working full- or part-time and is also a student. Or some-
one could be retired and a homemaker. Thus, respondents could legitimately put
themselves into more than one category. Respondents (and interviewers) must
decide how to handle this situationmark more than one answer, choose one or
the other, skip the question, or write something in the margins. When there is a
possibility of people being in more than one category, it is sometimes better to ask
a series of yes/no questions to describe the respondents situation.
Closed-ended response tasks also need to be exhaustiveevery situation must
be taken into account in the answer choices available. Frequency scales, especially
those that use a number of times per unit of measurement (see example below), are
notorious for not being exhaustive.
Example 18: How often have you attended a sporting event: several times a week,
about once a week, about once a month, a few times a year, about once a year?
In addition to the fact that this question is assuming a regularity that may not
exist, this is also an example of answer choices that are not exhaustive. There is no
answer to describe the situation of someone who goes to sporting events every
other week, every other month, or less than once a year. With no place that exactly
describes their situation, respondents will be left on their own to figure out the
closest fitand respondents who have the same answer to give will answer differ-
ently from one another.
As discussed earlier, often the answer to a question varies over time. When a ques-
tion makes the assumption of regularity, respondents who vary will have trouble.
Example 19: In the last 30 days, were you able to climb a flight of stairs with no
difficulty, with some difficulty, or were you not able to climb stairs at all?
This question imposes an assumption: that the respondents situation was stable
for 30 days. For a study of patients with AIDS, we found that questions in this form
did not fit the answers of respondents, because their symptoms (and ability to
climb stairs) varied widely from day to day (Cleary et al., 1993).
Explain the purposes of questions so that respondents can see why they are
appropriate.
Frame questions, and take care in wording, to reduce the extent to which
respondents will perceive that particular answers will be interpreted in a
negative or inaccurate light.
These steps are likely to improve the quality of reporting in every area of a
survey, not just those deemed to be particularly sensitive. Researchers never know
when a question may cause a respondent some embarrassment or unease. A survey
instrument should be designed to minimize the extent to which such feelings will
affect answers to any question asked.
This does not mean that there are no standards for questions designed to mea-
sure subjective states. The standards are basically the same as for questions about
factual things: Questions should be understood consistently by all respondents so
they are all answering the same question, they should usually cover topics with
which most respondents are familiar, and the response task, the way respondents
are asked to answer the questions, should be one that respondents can use consis-
tently and that provides meaningful information about what they have to say.
By far, the largest number of survey questions ask about respondents' perceptions
or feelings about themselves, others, or ideas. The basic task for the respondent on
most questions in this category is to place answers on a continuum. Such questions
all have the same basic framework, which consists of three components: (a) what is
to be rated, (b) what dimension or continuum the rated object is to be placed on,
and (c) the characteristics of the continuum that are offered to the respondent.
Example 20: In general, do you think government officials care about your inter-
ests a lot, some, only a little, or not at all?
Example 21a: Overall, how would you rate your health: excellent, very good, good,
fair, or poor?
Example 21b: Consider a scale from 0 to 10, where 10 represents the best your
health can be, where 0 represents the worst your health can be, and the numbers
in between represent health states in between. What number would you give
your health today?
Example 21c: Overall, would you say you are in good health?
These three questions all ask the same thing; they differ only in the ways in
which the respondents are asked to use the continuum.
When the goal is to have respondents place themselves or something else along
a continuum, the researcher must make choices about the characteristics of the
scale or response task to be offered to respondents. Two key issues include (a) how
many categories to offer and (b) whether to use scales defined by numbers or by
adjectives. In general, the goal of any rating task is to provide the researcher with as
much information as possible about where respondents stand compared with
others. Consider a continuum from positive to negative and the results of a question
such as the following:
Example 23: In general, would you rate the job performance of the President as
good or not so good?
Such a question divides respondents into two groups. That means that the infor-
mation coming from this question is not very refined. Respondents who answer
"good" are more positive than the people who say "not so good," but there is no
information about the relative feelings of all the people who answer "good," even
though there may be quite a bit of variation among them in the degree of positive-
ness that they feel about the President's job performance.
There is another issue as well: the distribution of answers. In the above example,
suppose most of the respondents answered the question in a particular way; for
example, suppose 90% said that the President is doing a good job. In that case, the
value of the question is particularly minimal. The question gives meaningful infor-
mation only for about 10% of the population, the 10% who responded "not so good."
For the 90% of the population that answered "good," absolutely nothing was
learned about where they stand compared with others who gave the same answer.
This analysis suggests that there are two general principles for thinking about
optimal categories for a response task. First, to the extent that valid information can
be obtained, more categories are better than fewer categories. Second, generally
speaking, an optimal set of categories along a continuum will maximize the extent
to which people are distributed across the response categories.
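One way to see why spreading respondents across categories matters is to quantify how well a distribution of answers distinguishes respondents. The sketch below uses Shannon entropy for that purpose, which is our choice of illustration rather than anything prescribed in this chapter; the response distributions are invented.

```python
import math

def entropy(proportions):
    """Shannon entropy in bits: higher values mean answers are spread more evenly across categories."""
    return -sum(p * math.log2(p) for p in proportions if p > 0)

# Two-category item on which 90% of respondents answer "good."
print(round(entropy([0.90, 0.10]), 2))                    # about 0.47 bits

# A five-category version of the same rating that spreads respondents out.
print(round(entropy([0.15, 0.30, 0.25, 0.20, 0.10]), 2))  # about 2.23 bits
```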
Given these considerations, is there any limit to the number of categories that
are useful? Is it always better to have more categories? There are at least two limit-
ing factors to the principle that using more categories produces better measure-
ment. First, there appear to be real limits to the extent to which people can use
scales to provide meaningful information. Although the optimal number of cate-
gories on a scale may vary, in part with the dimension and in part based on the dis-
tribution of people or items rated, most studies have shown that little new valid
information is provided by response tasks that provide more than 10 categories
(Andrews, 1984). Beyond that, people seem not to provide new information; the
variation that is added seems to be mainly a reflection of the different ways that
people use the scales. In fact, five to seven categories are probably as many cate-
gories as most respondents can use meaningfully for most rating tasks.
A second issue has to do with ease of administration. If the survey instrument is
being self-administered (with respondents reading the questions to themselves) or
administered by an in-person interviewer (who can hand respondents a list of the
response categories), long lists of scale points do not pose any particular problem.
However, when surveys are done on the telephone, it is necessary for respondents
to retain all the response options as the interviewer reads them in order to answer
the question. There clearly are limits to peoples abilities to retain complex lists of
categories.
When long, complex scales are presented by telephone, sometimes it is found
that this produces biases simply because respondents cannot remember the cate-
gories well. For example, there is some tendency for respondents to remember the
first or the last categories better than some of those in the middle (Schwarz &
Hippler, 1991). When questions are to be used on the telephone, researchers often
prefer to use scales with only three or four response categories in order to ease the
response task and ensure that respondents are aware of all the response alternatives
when they answer questions.
Example 24: Higher taxes generally hurt the rich and benefit the poor. Do you
agree or disagree?
Example 25: America is getting so far away from the true American way of life
that force may be necessary to restore it.
This statement raises three issues: how far America is from the true American way, whether or not the
true American way should be restored, and whether or not force may be needed (or
desirable) to restore it.
Example 26: There is little use writing public officials because they often aren't
really interested in the problems of the average man.
This statement raises two issues: the value of writing to officials and how interested officials are in the
problems of the average man.
With respect to both of these questions, it is not possible to define what an
"agree" or "disagree" answer actually means.
There are three common problems with questions in the agree-disagree form
or related question forms such as the oppose-favor form. First, many questions in
this form do not produce interpretable answers, either because they are not on a
clearly defined place on a continuum or because they reflect more than one dimen-
sion. Those problems can be solved through careful question design. However, two
other problemsthat these questions usually sort people into only two groups
(agree or disagree) and that they often are cognitively complexare more generic
to the question form.
The most important limitation of such questions, however, is that the question
form itself introduces error into the measurement process that is unnecessary.
Essentially the same question can be answered in either a direct or indirect way.
Examples 27a and 27b illustrate the indirect and direct approaches to asking the
same question.
Example 27a: Consider the statement, "Federal income taxes should be reduced."
Would you say you completely agree, generally agree, neither agree nor disagree,
generally disagree, or strongly disagree with that statement?
Example 27b: How do you feel about the level of federal income taxes: would
you say they should be much higher, a little higher, about as they are now, a little
lower, or much lower?
For Example 27b, the respondent directly puts where he or she wants taxes to be
on a continuum from much higher to much lower. In Example 27a, the respondent
is asked to report on the distance between his or her own views and the position
stated in the question stem. Figure 12.1 is a pictorial representation of the Example
27a task. The steps include the following:
1. Figuring out where on the continuum one's own views are (I in the figure). This
has to be done to answer either Example 27a or 27b. In addition, however, to
answer Example 27a, the respondent must
2. Evaluate the distance between one's view and the position stated in the ques-
tion stem (O in the figure).
To the extent that two respondents have differences of opinion about how close
their views need to be to the stated position (O) in order to be considered agreement,
they could give different answers for that reason alone, even if their views on income
taxes are the same. In essence, indirect ratings introduce an additional source of
potential error into the measurement process. This can be denoted as follows:
X = t + e_d + e_i,
where X is the answer, t is the true value or score that we want the respondent to
report, e_d is the error related to the way the respondent performs the direct rating
task of locating his or her own views on the oppose-favor continuum, and e_i is the
error related to the way the respondent performs the process of coding the distance
from his or her answer to the point stated in the question stem into the agree-disagree
format.
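A small simulation makes the consequence of the extra term visible; the normal distributions and error standard deviations below are invented solely for illustration, and the point is simply that the indirect answers carry the variance of both error terms.

```python
import random

random.seed(1)
direct, indirect = [], []
for _ in range(10_000):
    t = random.gauss(0, 1)          # true position on the oppose-favor continuum
    e_d = random.gauss(0, 0.5)      # error in the direct rating task
    e_i = random.gauss(0, 0.5)      # extra error from coding distance to the question stem
    direct.append(t + e_d)          # X = t + e_d
    indirect.append(t + e_d + e_i)  # X = t + e_d + e_i

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

print(round(variance(direct), 2), round(variance(indirect), 2))  # the indirect answers are noisier
```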
It is fairly obvious that indirect ratings are cognitively more complicated than
direct ratings. They also introduce a second task of coding the distance between
the stimulus and the respondent's views that will be done differently from
respondent to respondent and, hence, introduce an additional source of
measurement error into the answer. We think researchers will almost always be
better served by using direct ratings and avoiding agree-disagree and related
question forms.
Rank Ordering
There are occasions when researchers want respondents to compare objects on
some dimension.
The basic question objectives can all be met through one of four tasks for
respondents:
Task 1. Respondents can be given a list of options and asked to rank order them
from top to bottom on some continuum.
Task 2. Respondents can be given a list of options and asked to name the most
(second most, third most, and so on) extreme on the rating dimension.
Task 3. Respondents can be given pairs of options and asked to choose the one in
each pair that is more extreme on the rating dimension.
Task 4. Respondents can be given a list and asked to rate each one using some scale
(rather than just putting them in order or picking one or more of the most
extreme).
If there is a short list of options, Task 1 is not hard to do. However, as the list
becomes longer, the task is harder, soon becoming impossible on the telephone,
when respondents cannot see all the options. Task 2 is easier than Task 1 when the
list is long (or even when the list is short). Often researchers are satisfied to know
which are the one or two most important, rather than having a complete rank
ordering. In that case, Task 2 is attractive. Psychometricians often like the paired
comparison approach of Task 3, in which each alternative is compared with every
other alternative, one pair at a time.
Narrative Answers
When the goal is to place answers on a continuum, allowing people to answer in
their own words will not do. Consider an open-ended question that asks people how
they feel about something. People can answer in all kinds of ways: Some will say
"fine," some will say "great," some will say "not bad." If one were trying to order such
comments, some ordinal properties would be clear. Those who say "terrible" would
obviously be placed at a different point on a continuum from those who say "great."
However, there is no way to order responses such as "not bad," "pretty good," "good
enough," or "satisfactory."
In contrast, when the purpose of a question is to identify priorities or prefer-
ences among various items, there is a choice to be made between the following two
approaches:
Example 33a: What do you consider to be the most important problem facing
your local city government today?
Example 33b: The following is a list of some of the problems that are facing your
local city government. Which do you consider to be most important?
Crime
Tax rates
Schools
Trash collection
The open-ended approach has several advantages. It does not limit answers to those
the researcher thought of, so there is opportunity to learn the unexpected. It also
requires no visual aids, so it works on the telephone. On the other hand, the diversity
of answers may make the results hard to analyze. The more focused the question and
the clearer the kind of answer desired, the more analyzable the answers. Moreover,
Schuman and Presser (1981) found that the answers are probably more reliable and
valid when a list is provided than when the question is asked in open form.
If the list of possible answers is not known or is very long, the open form may be
the right approach. Although computer-assisted interviewing creates great
pressure to use only fixed-response questions, respondents like to answer some ques-
tions in their own words. The measurement result may not be as easy to work with,
but asking some questions to be answered in narrative form may be justified for that
reason alone. However, if good measurement is the goal and the alternatives can be
specified, providing respondents with a list and having them choose is usually the best approach.
The mode of data collection and the language of administration raise many issues for the
design of questions, and this chapter will not address most of them. Dillman (2007)
and Groves et al. (2004) are two good places to look for more information on those
issues. However, we wanted to point out one very important principle that will have
an important effect on data quality.
If a survey is going to be administered in more than one mode or more than one
language, the researcher wants the results to be as comparable as possible. To that
end, the questions that are asked should be identical. For that to happen, the survey
should be designed from the beginning to be used in more than one language
and/or mode. Designing a survey for one mode or one language and then trying to
adapt it to another language or mode is the wrong way to proceed.
With respect to language, there are some words that translate much better than
others. For example, the categories "excellent," "good," "fair," and "poor" are frequently used
in English surveys. However, "poor" and particularly "fair" do not translate easily
into other languages. If a researcher is thinking about how precisely questions can
be translated when the initial questions are being written, choices can be made that
will greatly increase the comparability of the questions across languages (Harkness,
van de Vijver, & Mohler, 2007).
The same is true for mode of data collection issues (Dillman, 2007). If a survey is
going to be administered in person or in a self-administered questionnaire, it is pos-
sible to ask respondents to choose from a long list of answers. However, if that same
survey is going to be used on the telephone, respondents will not be able to remem-
ber more than a few answer options. In a self-administered survey, questions do not
have to include all the words; the respondent can combine the question and the
printed answer categories to understand what is wanted. However, a telephone interviewer must
have a complete script so that the words that are read give the respondent all the
information needed to know what is being asked and how to answer.
If the researcher thinks about the way questions will work in multiple languages
or modes from the beginning, problems of comparability can be minimized.
However, attempts to adapt surveys designed for a single mode or language to other
modes or languages almost always produce major problems of comparability.
Question appraisal checklists have been developed to identify question characteristics
that are indicative of potential problems (Lessler & Forsyth, 1996). Willis and
Lessler's (1999) Question Appraisal System (QAS) has an 8-step process that looks
at everything, including the readability and clarity of the question, whether the
question contains unstated assumptions or is inherently sensitive or biased, the knowl-
edge and recall skills needed to answer the question, and characteristics of the
response categories. The Question Understanding Aid (QUAID) is a computerized
version of a QAS (Graesser, Cai, Louwerse, & Daniel, 2006). A computer program
analyzes the wording of a question, comparing it against a set of programmed algorithms
that check for uncommon words, vague terms, complex syntax, and the number of
clauses in the question.
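As an illustration of how such automated checks can work, the short sketch below applies a few heuristics of this kind; the word lists, rules, and thresholds are invented for demonstration and are not the QUAID algorithms or the QAS itself:

import re

# Illustrative word lists; a real appraisal tool would use much larger,
# empirically derived lexicons.
VAGUE_TERMS = {"often", "regularly", "usually", "recently", "several"}
REFERENCE_PERIOD_CUES = {"past", "last", "per", "week", "month", "year", "today"}

def appraise(question):
    """Return a list of potential problems flagged for one survey question."""
    flags = []
    words = set(re.findall(r"[a-z']+", question.lower()))

    if words & VAGUE_TERMS:
        flags.append("contains vague quantifier(s): " + ", ".join(sorted(words & VAGUE_TERMS)))
    if not words & REFERENCE_PERIOD_CUES:
        flags.append("no apparent reference period")
    if " and " in question.lower():
        flags.append("possible multibarreled question (contains 'and')")
    if len(question.split()) > 30 or question.count(",") > 3:
        flags.append("long or syntactically complex")
    return flags

if __name__ == "__main__":
    print(appraise("How often do you usually exercise and watch your diet?"))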
Often, question appraisal checklists require the appraiser to make some sort of
judgment: whether the question is hard to understand or asks for information that
a person may not have is ultimately based on the impression of the appraiser. In Table 12.3,
we present a systematic appraisal form that requires minimal judgment from the
appraiser. Questions identified as having the characteristics listed in Table 12.3
can be rewritten or revised to make them better questions before any testing occurs.
In some cases, if a particular question has a pedigree or no suitable alternative way
of asking the question can be found, the appraisal can flag questions and issues for
subsequent testing.
Comprehension Issues
1. Does the question have a reference period (time)? This applies to any question for which the
answer could reasonably be expected to vary from day to day, week to week, or month to month.
2. Is the question hypothetical?
3. Are there multiple questions being asked in a single question? (Is the question multibarreled?)
4. Does the question include an abstract noun that is not defined?
Retrieval of Information
5. Is the question cognitively complex? Does the question require multiple calculations in order to
answer the question?
Formation of Answer
6. Does the question contain assumptions about the respondents situation, or the way the
respondent thinks about things, that are not necessarily true but that are critical to answering
the question?
7. Does the question make the response task clear to the respondent; that is, is it clear what kind of
answer is required, and at what level of detail, in order to meet the question objectives?
8. (If fixed-response question) Are the answer categories mutually exclusive and exhaustive?
9. Does the question give respondents a task other than a direct rating to provide information about
where something (an idea, experience, person, or institution) is seen to lie on some continuum?
Usability Concerns
10. (If interviewer-administered question) Is the question fully scripted, including when and how to use
any optional text?
11. Does the question end with the question itself? (Are definitions and introductory phrases at the
beginning of the question?)
12. Are there appropriate skip instructions so that respondents are asked to answer only those
questions that apply to them?
13. Are the response tasks that respondents are supposed to use appropriate to the question that
is asked?
Cognitive Interviews
In cognitive interviews, there usually is no effort to replicate the data collection procedures to be used in the full-scale survey.
The basic protocol involves reading questions to respondents (or having them read
the questions themselves), having respondents answer the questions, and then hav-
ing a specially trained interviewer use some strategy to find out what was going on
in the respondents' minds during the question and answer process.
There are three common procedures for trying to monitor the cognitive processes
of the respondent who is answering questions: think-aloud interviews; asking
probe or follow-up questions after each question or short series of questions; and
going through the questions twice, first having respondents answer them in the
usual way, then returning to the questions and having a discussion with respon-
dents about the response tasks.
Field Pretesting
A field pretest generally replicates procedures that will be used in the survey
itself. The pretest should provide information about the usability of the proposed
survey instrument for respondents and, if they are used, interviewers. If it is an
interviewer-administered instrument, it should also provide information about
how well the instrument facilitates a standardized question and answer process.
If a survey is self-administered, there are two approaches that can be used.
Individuals or groups can be invited to a central location to fill out the instrument,
then be debriefed about the experience. For a mail survey, a small mail pilot study
can be undertaken. Feedback from respondents about usability and individual
questions can come either from a series of debriefing questions at the end of the
instrument itself, or, much better, from an interviewer-administered debriefing
after the questionnaire has been filled out.
If a survey is interviewer-administered, one source of information is debriefing
the interviewers who conducted the pretest interviews. The traditional field pretest
for interviewer-administered surveys follows a standard prototype. When a survey
instrument is in near final form, experienced interviewers conduct 15 to 35 inter-
views with people similar to those who will be respondents in the planned survey.
Data collection procedures are designed to be similar to those to be used in the
planned survey, except that the people interviewed are likely to be chosen on the
basis of convenience and availability, rather than according to some probability
sampling strategy. Question evaluation from such a survey mainly comes from
interviewers (Converse & Presser, 1986).
Summary
A sensible protocol for the development of a survey instrument prior to virtu-
ally any substantial survey would include all the steps outlined above: systematic
question review, cognitive interviewing, and field pretests with behavior coding.
Moreover, in the ideal situation, at least two field pretests would be done, the sec-
ond to make sure the problems identified in the first field pretest have been solved.
Arguments against this kind of question evaluation usually focus on time and
money. Certainly the elapsed calendar time for the question design process will be
longer if the researcher includes cognitive interviews than if he or she does not;
however, these processes can be carried out in a few weeks. The time implications
of question testing have less to do with the amount of time it takes to gather infor-
mation about the questions than with the time it takes to design new and better
questions when problems are found. For almost any survey, experience shows that
each of these steps yields information that will enable the researcher to design
better questions.
In recent years, there has been increased attention given to the evaluation of sur-
vey questions from the cognitive and interactional perspectives. The basic idea is
that before a question is asked in a full-scale survey, testing should be done to find
out if respondents can understand it, if they can perform the tasks that it requires,
and if the interviewers can and will read it as worded.
When answers cannot be checked directly against records or other objective standards, which is
common with respect to questions about facts and always the case for measures of
subjective states, validity is assessed by studying the relationship between
answers to a question and the answers to other questions. If answers are good mea-
sures of their intended constructs, there should be a set of predictable relationships.
For example, a good measure of health status should have predictable relationships
to the amount of medical care a person receives, the number of days of work that
are missed, and how able a person is to perform difficult physical tasks. Stewart and
Ware (1992) provide a kind of prototype for how to systematically develop and
validate measures of important health concepts. McDowell (2006) provides a com-
pendium on the evidence for the reliability and validity of many of the measures
related to health research. In the process, he describes the steps that researchers do
(and sometimes do not) take to psychometrically evaluate their measures.
Validation studies are highly desirable, but they are not done routinely. Ideally, they
should be done with the population in which the measures are being used. On occasion, mea-
sures are referred to as if being validated were some absolute state, such as beatifica-
tion. Validity is the degree of correspondence between a measure and what is
measured. Measures that can serve some purposes well are not necessarily good for
other purposes. For example, some measurements that work well for group averages
and to assess group effects are quite inadequate at an individual level (Ware, 1987).
Validation studies for one population may not generalize to others (Kulka et al., 1989).
The challenges are of two sorts. First, we need to continue to encourage
researchers to evaluate the validity of their measurement procedures routinely from
a variety of perspectives. Second, we particularly need to develop clear standards for
what validation means for particular analytic purposes.
Conclusion
To return to the topic of total survey design, no matter how big and representative
the sample, no matter how much money is spent on data collection and what the
response rate is, the quality of the resulting data from a survey will be no better than
the questions that are asked. Although we can certainly hope that the number and
specificity of principles for good question design will grow with time, the principles
outlined in this chapter constitute a good, systematic core of guidelines for writing
good questions. In addition, whereas the development of evaluative procedures
will also evolve with time, cognitive testing, good field pretests, and appropriate vali-
dating analyses provide scientific, replicable, and quantified standards by which the
success of question design efforts can be measured.
A final word is in order about standards for survey questions. In fact, there are
four kinds of standards for survey questions:
1. Are they measuring the right thing, that is, what is needed for an analysis?
2. Do they meet cognitive standards?
3. Do they meet psychometric standards?
4. Do they meet usability standards?
The first three kinds of standards have been the primary focus of this chapter.
The fourth refers to the fact that questions also have to work in the mode in which
they are used. If a survey is interviewer administered, an interview schedule is also
a protocol for an interaction. It has been shown that the quality of measurement
can be compromised by the way the questions affect the way interviewers and
respondents interact (Mangione, Fowler, & Louis, 1992; Schaeffer, 1991; Suchman
& Jordan, 1990). If the survey is being done by mail or via the Internet, the ques-
tions also must be demonstrated to be able to be used comfortably by respondents.
Indeed, with no interviewer to help, it is particularly important that the questions
delivered in those modes be easy for respondents to manage.
A tension is created because these standards are not necessarily positively
related, and in fact they can work against each other. For example, the easiest ques-
tions from a cognitive perspective may be weak psychometrically. One reason for
weak survey questions is that researchers attend to one standard while neglecting the
others (Fowler, 2001). A real challenge is to design questions that meet all four of
these kinds of standards.
That said, certainly the most important challenge is to induce researchers to
evaluate questions routinely. Unfortunately, there is a long history of researchers
designing questions in haphazard ways that do not meet adequate standards and
have not even been well evaluated. Moreover, we have a large body of social and
medical science, collected over the past 50 years, that includes some very bad ques-
tions. The case for holding on to the questions that have been used in the past, in
order to track change or to compare new results with those from old studies, is not
without merit. However, a scientific enterprise is probably ill served by repeated use
of poor measures, no matter how rich their tradition. In the long run, science will
be best served by the use of survey questions that have been carefully and system-
atically evaluated and that meet the standards enunciated in this chapter.
Discussion Questions
4. When asking people to do ratings, which kind of rating scale seems better:
those that use numbers, such as 0 to 10, or those that use adjectives, such as "excel-
lent" to "poor"? What are the pros and cons of each? Which provides a better way for
people to say what they have to say?
5. How important is it to give respondents a time frame for questions? What are
some of the kinds of questions for which a time frame is essential? Are there any
kinds of questions for which a time frame is not important?
6. Why would someone want to ask about a respondent's income in a survey?
What are some of the constructs for which income might be a measure? What are
some examples of analysis questions for which a measure of income might be help-
ful? Depending on the hypotheses to be tested, what are the implications for what
measure of income one might want to use?
7. Survey questions either ask respondents to choose from a set of provided
answer categories or ask them to respond in their own words. If one were trying to
describe how people felt about a government official or about their significant
others, which would be a better kind of question to ask? What are the pros and
cons of each approach?
Exercises
1. Take a set of questions that have been used in professional surveys and cog-
nitively test them with two or three people. Ask the questions, then probe until you
understand how people understood the questions and whether or not their answers
were good measures of what the questions are designed to measure. Write a critical
evaluation of the questions as measures, based on your results. The Behavioral Risk
Factor Surveillance System, conducted by the Centers for Disease Control and Prevention,
is a good source of questions on various aspects of health and health-related behavior. Questions
used can be accessed at www.cdc.gov/brfss.
2. Write three questions in an agree-disagree form. Then design three questions
in a direct rating form that measure the same constructs.
3. Write three questions that include a noun that could be interpreted in more
than one way. Examples used in the chapter include "crime," "income," "car," and
"political leaders," but you can use your own vague nouns. Then, for each, write
another question in which you explain, define, or clarify the term so that everyone
will understand the question in the same way.
4. Use the standards outlined in Table 12.3 to critically evaluate the following
questions. Refer to the numbers in the table in your answers.
a. How often have you been feeling stressed: always, usually, sometimes,
rarely, or never?
b. Where did you live before you moved here?
c. Given the crime rate where you live, how likely are you to move somewhere
else in the next year or two: very likely, fairly likely, or not likely at all?
d. When you go to the movies, how often do you have popcorn: very often,
fairly often, not very often, or not at all?
e. If an interviewer contacted you about being in a survey about using drugs
and alcohol, do you think you would agree to be interviewed?
f. How often do you have at least one alcoholic beverage to drink: every
day, a couple of times a week, once a week, once a month, or less often?
g. Are you married, living with a partner, divorced, separated, widowed, or
have you never married?
5. Write questions to measure three of the following constructs:
a. Age
b. Weight
c. Number of offspring
d. Sexual orientation
e. Physical fitness
f. Mood
g. Political conservatism
h. Religiosity
i. Soft drink consumption
j. Music preferences
References
Anderson, B., Silver, B., & Abramson, P. (1988). The effects of race of the interviewer on measures of electoral participation by blacks. Public Opinion Quarterly, 52, 53-83.
Andrews, F. M. (1984). Construct validity and error components of survey measures: A structural modeling approach. Public Opinion Quarterly, 48, 409-422.
Aquilino, W. S., & Losciuto, L. A. (1990). Effects of interviewers on self-reported drug use. Public Opinion Quarterly, 54, 362-391.
Belli, R. (1998). The structure of autobiographical memory and the event history calendar. Memory, 6, 383-406.
Blair, E., & Burton, S. (1987). Cognitive process used by survey respondents in answering behavioral frequency questions. Journal of Consumer Research, 14, 280-288.
Brener, N. D., Eaton, D. K., Kann, L., Grunbaum, J. A., Gross, L. A., Kyle, T. M., et al. (2006). The association of survey setting and mode with self-reported health risk behaviors among high school students. Public Opinion Quarterly, 70(3), 354-374.
Cannell, C. F., Marquis, K. H., & Laurent, A. (1977). A summary of studies. In Vital and health statistics (Series 2, No. 69). Washington, DC: Government Printing Office.
Cannell, C. F., Oksenberg, L., & Converse, J. (1977). Experiments in interviewing techniques: Field experiments in health reporting: 1971-1977. Hyattsville, MD: National Center for Health Services Research.
Clarridge, B. R., & Massagli, M. P. (1989). The use of female spouse proxies in common symptom reporting. Medical Care, 27, 352-366.
Cleary, P. D., Fowler, F. J., Weissman, J., Massagli, M. P., Wilson, I., Seage, G. R., et al. (1993). Health-related quality of life in persons with acquired immune deficiency syndrome. Medical Care, 31, 569-580.
Converse, J. M., & Presser, S. (1986). Survey questions: Handcrafting the standardized questionnaire. Beverly Hills, CA: Sage.
Cronbach, L., & Meehl, P. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281-302.
DeMaio, T. J., & Landreth, A. (2004). Do different cognitive interview methods produce different results? In S. Presser, J. M. Rothgeb, M. P. Couper, J. T. Lessler, E. Martin, J. Martin, et al. (Eds.), Methods for testing and evaluating survey questionnaires (pp. 89-108). New York: John Wiley.
DeMaio, T. J., & Rothgeb, J. M. (1996). Cognitive interviewing techniques: In the lab and in the field. In N. A. Schwarz & S. Sudman (Eds.), Answering questions: Methodology for determining cognitive and communicative processes in survey research (pp. 177-195). San Francisco: Jossey-Bass.
Dillman, D. A. (2007). Mail and Internet surveys: The tailored design method (2nd ed.). New York: John Wiley.
Edwards, W. S., Winn, D. M., Kurlantzick, V., Sheridan, S., Berk, M. L., Retchin, S., et al. (1994). Evaluation of National Health Interview Survey diagnostic reporting. In Vital and health statistics (Series 2, No. 120). Hyattsville, MD: National Center for Health Statistics.
Eisenhower, D., Mathiowetz, N. A., & Morganstein, D. (1991). Recall error: Sources and bias reduction techniques. In P. N. Biemer, R. M. Groves, L. E. Lyberg, N. A. Mathiowetz, & S. Sudman (Eds.), Measurement errors in surveys (pp. 367-392). New York: John Wiley.
Fowler, F. J., Jr. (1997). Choosing questions to measure the quality of experience with medical care providers and health care plans. In 1997 Proceedings (pp. 51-54), Survey Methods Section, American Statistical Association.
Fowler, F. J., Jr. (2001). Why it is easy to write bad questions. ZUMA-Nachrichten, 48(25), 49-66.
Fowler, F. J., Jr. (2002). Survey research methods. Thousand Oaks, CA: Sage.
Fowler, F. J., Jr. (2004). The case for more split-sample experiments in developing survey instruments. In S. Presser, J. M. Rothgeb, M. P. Couper, J. T. Lessler, E. Martin, J. Martin, et al. (Eds.), Methods for testing and evaluating survey questionnaires (pp. 173-188). New York: John Wiley.
Fowler, F. J., Jr., & Cannell, C. F. (1996). Using behavioral coding to identify cognitive problems with survey questions. In N. Schwartz & S. Sudman (Eds.), Answering questions (pp. 15-36). San Francisco: Jossey-Bass.
Fowler, F. J., Jr., & Mangione, T. W. (1990). Standardized survey interviewing: Minimizing interviewer-related error. Newbury Park, CA: Sage.
Graesser, A. C., Cai, Z., Louwerse, M. M., & Daniel, F. (2006). Question Understanding Aid (QUAID): A Web facility that tests question comprehensibility. Public Opinion Quarterly, 70(1), 3-22.
Groves, R. M., Fowler, F. J., Couper, M., Lepkowski, J. M., Singer, E., & Tourangeau, R. (2004). Survey methodology. New York: John Wiley.
Harkness, J. A., van de Vijver, F. J. R., & Mohler, P. Ph. (2007). Cross-cultural survey methods. New York: John Wiley.
Jabine, T. B. (1987). Reporting chronic conditions in the National Health Interview Survey: A review of tendencies from evaluation studies and methodological tests. In Vital and health statistics (Series 2, No. 105, DHHS Publication No. PHS 87-1397). Washington, DC: Government Printing Office.
Jabine, T. B., Straf, M. L., & Tanur, J. M. (1984). Cognitive aspects of survey methodology: Building a bridge between disciplines. Washington, DC: National Academy Press.
Kulka, R. A., Schlenger, W. E., Fairbank, J. A., Hough, R., Jordan, B. K., Marmar, C., et al. (1989). Validating questions against clinical evaluations: A recent example using diagnostic interview schedule-based and other measures of post-traumatic stress disorder. In F. J. Fowler Jr. (Ed.), Conference proceedings: Health survey research methods (pp. 27-34; DHHS Publication No. PHS 89-3447). Washington, DC: National Center for Health Services Research.
Lessler, J. T., & Forsyth, B. H. (1996). A coding system for appraising questionnaires. In N. A. Schwartz & S. Sudman (Eds.), Answering questions (pp. 259-292). San Francisco: Jossey-Bass.
Lessler, J. T., & Tourangeau, R. (1989, May). Questionnaire design in the cognitive research laboratory. In Vital and health statistics (Series 6, No. 1). Washington, DC: Government Printing Office.
Locander, W., Sudman, S., & Bradburn, N. (1976). An investigation of interview method, threat and response distortion. Journal of the American Statistical Association, 71, 269-275.
Mangione, T. W., Fowler, F. J., & Louis, T. A. (1992). Question characteristics and interviewer effects. Journal of Official Statistics, 8(3), 293-307.
McDowell, I. (2006). Measuring health: A guide to rating scales and questionnaires. New York: Oxford University Press.
Moore, J. C. (1988). Self/proxy response status and survey response quality. Journal of Official Statistics, 4, 155-172.
Nunnally, J. C. (1978). Psychometric theory. New York: McGraw-Hill.
Oksenberg, L., Cannell, C. F., & Kalton, G. (1991). New strategies for testing survey questions. Journal of Official Statistics, 7, 349-365.
O'Muircheartaigh, C. (1991). Simple response variance: Estimation and determinants. In P. N. Biemer, R. M. Groves, L. E. Lyberg, N. A. Mathiowetz, & S. Sudman (Eds.), Measurement errors in surveys (pp. 287-310). New York: John Wiley.
Parry, H., & Crossley, H. (1950). Validity of responses to survey questions. Public Opinion Quarterly, 14, 61-80.
Presser, S. (1989). Pretesting: A neglected aspect of survey research. In F. J. Fowler Jr. (Ed.), Conference proceedings: Health survey research methods (pp. 35-38; DHHS Publication No. PHS 89-3447). Washington, DC: National Center for Health Services Research.
Presser, S., Rothgeb, J. M., Couper, M., Lessler, J. T., Martin, E., Martin, J., et al. (2004). Methods for testing and evaluating survey questionnaires. New York: John Wiley.
Rasinski, K. A. (1989). The effect of question wording on public support for government spending. Public Opinion Quarterly, 53, 388-394.
Robinson, J. P., & Shaver, P. R. (1973). Measures of social psychological attitudes (Rev. ed.). Ann Arbor, MI: Institute for Social Research, Survey Research Center.
Rodgers, W. L., & Herzog, A. R. (1989). The consequences of accepting proxy respondents on total survey error for elderly populations. In F. J. Fowler Jr. (Ed.), Conference proceedings: Health survey research methods (pp. 139-146; DHHS Publication No. PHS 89-3447). Washington, DC: National Center for Health Services Research.
Schaeffer, N. C. (1991). Interview: Conversation with a purpose or conversation? In P. N. Biemer, R. M. Groves, L. E. Lyberg, N. A. Mathiowetz, & S. Sudman (Eds.), Measurement errors in surveys (pp. 367-393). New York: John Wiley.
Schuman, H. H., & Presser, S. (1981). Questions and answers in attitude surveys. New York: Academic Press.
Schwartz, N., & Hippler, H. (1991). Response alternatives: The impact of their choice and presentation order. In P. N. Biemer, R. M. Groves, L. E. Lyberg, N. A. Mathiowetz, & S. Sudman (Eds.), Measurement errors in surveys (pp. 41-56). New York: John Wiley.
Smith, A. F. (1991). Cognitive processes in long-term dietary recall. In Vital and health statistics (Series 6, No. 4). Washington, DC: Government Printing Office.
Stewart, A. L., & Ware, J. E., Jr. (Eds.). (1992). Measuring functioning and well-being: The medical outcomes study approach. Durham, NC: Duke University Press.
Suchman, L., & Jordan, B. (1990). Interactional troubles in face-to-face survey interviews. Journal of the American Statistical Association, 85, 232-241.
Sudman, S., & Bradburn, N. (1974). Response effects in surveys. Chicago: Aldine.
Sudman, S., & Bradburn, N. (1982). Asking questions. San Francisco: Jossey-Bass.
Sudman, S., Finn, A., & Lannon, L. (1984). The use of bounded recall procedures in single interviews. Public Opinion Quarterly, 48, 520-524.
Tourangeau, R., Rips, L. J., & Rasinski, K. (2000). The psychology of survey response. Cambridge, UK: Cambridge University Press.
Turner, C. F., Forsyth, B. H., O'Reilly, J. M., Cooley, P. C., Smith, T. K., Rogers, S. M., et al. (1998). Automated self-interviewing and the survey measurement of sensitive behaviors. In M. P. Couper, R. P. Baker, J. Bethlehem, C. Z. F. Clark, J. Martin, W. L. Nicholls II, et al. (Eds.), Computer assisted survey information collection (pp. 455-473). New York: John Wiley.
Turner, C. F., Lessler, J. T., & Gfroerer, J. C. (1992). Survey measurement of drug use: Methodological studies. Washington, DC: U.S. Department of Health and Human Services, National Institute on Drug Abuse.
Ware, J. (1987). Standards for validating health measures: Definition and content. Journal of Chronic Diseases, 40, 473-480.
Weisberg, H. F. (2005). The total survey error approach. Chicago: University of Chicago Press.
Willis, G. B. (2005). Cognitive interviewing. Thousand Oaks, CA: Sage.
Willis, G. B., DeMaio, T., & Harris-Kojetin, B. (1999). Is the bandwagon headed to the methodological promised land? Evaluating the validity of cognitive interviewing techniques. In M. G. Sirken, D. G. Herrmann, S. Schechter, N. Schwarz, J. M. Tanur, & R. Tourangeau (Eds.), Cognition and survey research (pp. 133-154). New York: John Wiley.
Willis, G. B., & Lessler, J. (1999). The BRFSS-QAS: A guide for systematically evaluating survey question wording. Rockville, MD: Research Triangle Institute.
CHAPTER 13
Chase H. Harrison
On the other hand, if many members of the target population for a study do not typically have
Internet access, then the ability to generalize to all members of this group will be limited.
For example, a survey of university faculty, at an institution where Internet access
is universal, might be appropriately accomplished over the Internet. However, an
Internet survey designed to study the characteristics of homeless persons would be
futile and inappropriate because many such persons do not have access to the
Internet in any meaningful way.
In samples of special populations of specific persons, researchers typically use
some type of directory or database as a sample frame. Researchers can limit target
populations to those possessing Internet access and for whom they can acquire a
complete list of e-mail addresses, such as the many private organizations, public
bureaucracies, trade associations, and schools that produce comprehensive e-mail
directories of individuals affiliated with these institutions. If an available database
or directory contains a valid e-mail address for most, or all, persons in the target
population, or if such an e-mail address can be added from secondary sources, then
the database can be very functional as a sample frame.
Internet users get their news from different sources than nonusers (Fox, 2005; Stempel, Hargrove, & Bernt, 2000),
participate in different social activities (Fox, 2005; Pew Internet and American Life,
2000), and socialize in different ways (Boase, Horrigan, Wellman, & Rainie, 2006;
Nie & Erbring, 2000). The only way to reliably estimate such differences is to draw
a probability sample of the U.S. population, which, of course, is not
currently possible in the online environment (Mitofsky, 1999).
Although scientific surveys of many populations may be difficult or impossible
to conduct solely over the Internet, Internet data collection is increasingly used in
conjunction with other methodologies to enhance or improve the ability to easily
and effectively contact individuals (Dillman, 2007). These multimode surveys often
begin with a sampling method that selects individuals or establishments through
specific addresses or telephone numbers. Rather than using a single method of col-
lecting data, however, multimode surveys employ multiple methodologies, either to
improve levels of survey response or to optimize the advantages of different data
collection strategies.
Alternatively, representative samples of adults can be obtained by traditional
communication modes and then outfitted with the equipment necessary to receive
and/or respond to online instruments (Huggins & Eyerman, 2001). For example,
the company Knowledge Networks recruits households through random-digit-dial
telephone calling and then equips them with free connections to the Internet and
hardware (a WebTV unit) to use it. In exchange for the Internet access they receive,
each household member must regularly participate in online instruments trans-
mitted directly to their unit.
Incentives are best not mentioned in the subject field. Focus instead should be on
legitimizing the e-mail, either by referencing the researcher's home institution or
the objectives of the study. The "cc" and "bcc" fields should remain blank. The car-
bon copy and blind carbon copy fields enable identical e-mails to be transmitted to
multiple e-mail addresses simultaneously. Although these features offer an efficient
approach to transmitting bulk e-mail, they also are likely to trigger spam filters.
Researchers are well-advised to use e-mail software that can be configured to send
e-mails one-at-a-time. Last, the attachment field should be used prudently. Some
users will not open e-mails containing attachments for fear of acquiring computer
viruses. Therefore, the line should remain blank, if possible. Attachments contain-
ing the instrument, audio files, video streams, or any other extraneous materials,
should never be affixed to the message.
The body of the message should disclose the objectives, procedures, expecta-
tions, and authors of the study, as well as how the individual's name and e-mail
address were obtained. These messages should be brief and crafted as carefully as the
header. As with subject fields, researchers should avoid using words and phrases
commonly found in product advertisements. Intrinsic appeals are less likely to be
flagged than extrinsic appeals. Requests for survey participation should include a
hyperlink to a Web page hosting the instrument. However, researchers cannot rely
solely on the hyperlink to direct potential respondents to the data collection instru-
ment. Some subjects will not be connected to the Internet when they open their
e-mail, or their e-mail programs will not be configured to process the hyperlink.
Therefore, the URL address of the Web site as well as instructions regarding how to
import it into a browser should be included in the e-mail as well.
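As a minimal sketch of this advice (the SMTP host, sender address, subject line, survey URL, and recipient list below are placeholders, not details from the chapter), invitations can be sent one recipient at a time, with the cc, bcc, and attachment fields left empty and the survey URL spelled out in the body as well as hyperlinked:

import smtplib
from email.message import EmailMessage

SMTP_HOST = "smtp.example.edu"               # placeholder mail server
SENDER = "survey-team@example.edu"           # placeholder sender address
SURVEY_URL = "https://survey.example.edu/q/start"  # placeholder instrument URL

def send_invitations(recipients):
    """Send one e-mail per recipient: no cc, no bcc, no attachments."""
    with smtplib.SMTP(SMTP_HOST) as server:
        for address in recipients:
            msg = EmailMessage()
            msg["From"] = SENDER
            msg["To"] = address              # a single recipient per message
            msg["Subject"] = "Example University study of campus life"
            msg.set_content(
                "You are invited to take part in a study conducted by the\n"
                "Example University Survey Center. To participate, click the\n"
                f"link below or copy this address into your browser:\n\n{SURVEY_URL}\n"
            )
            server.send_message(msg)

if __name__ == "__main__":
    send_invitations(["respondent1@example.com", "respondent2@example.com"])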
Soliciting Visitors to Web Sites. Research participants can also be recruited by soliciting visitors
to a Web site. Online solicitations are most often used to recruit large, diverse, non-
probabilistic samples. Millions of Internet users surf the contents of the Web daily.
By posting advertisements on frequented Web pages or popular search engines,
researchers can invite a variety of visitors to take part in their studies. Interested
parties simply click through the advertisement and are immediately directed to the
research site, where they can be formally recruited and, if receptive, directed to the
appropriate version of the instrument to complete. Such Web advertisements are
the virtual equivalent of recruitment posters, with the advantage of providing
potential subjects immediate access to research materials.
In a small number of cases, usually limited to those where a researcher wants to
generalize to the visitors to a specific Web page, survey solicitations on a Web site
can result in scientific samples. For example, a corporation or organization that is
attempting to better design its Web page might seek a scientific survey of Web visitors.
In this case, the sample frame (visitors who view a Web page) corresponds exactly,
or almost exactly, to the target population of the survey, making scientific sampling
from a full sample frame possible.
Online advertisements come in two forms: embedded and intercept advertise-
ments. Embedded advertisements are displayed as part of a Web page. In contrast,
intercept advertisements appear in a separate browser window from the one being
used to retrieve a particular Web page. Whereas embedded advertisements are part
of the page being retrieved, not interfering with its contents, intercept ads obstruct
the content of the requested page if they are to be read. There are numerous types
of intercept ads; pop-up ads appear over the requested page, floating ads move across
the content of the page, and pop-under ads appear under the page. Interstitial ads
appear before the browser brings up the requested page, while hijack ads divert the
user from the requested page entirely, redirecting the user to the new browser win-
dow instead. Regardless of type, they too can take on any size, appear anywhere in
the viewing frame, and feature text, graphics, or animation.
Intercept advertisements have traditionally enjoyed a higher degree of forced
exposure than embedded advertisements (Comley, 2000). Whereas embedded
advertisements can be ignored, intercept advertisements must be either opened or
closed before the requested Web page can be viewed. This requirement not only
ensures that all intercept advertisements will be seen but also that all viewers must
consciously decide whether or not to participate.
intercept advertisements is the increasing use of software capable of blocking inter-
cept advertisements from overriding the requested Web page. Since embedded
advertisements are part of the requested page, such programs cannot filter them. In
either case, click-through rates are usually extremely low.
In recent years, a growing number of studies have assessed factors underlying
effective Web advertisements, offering guidelines for the construction of successful
research invitations. Banner advertisements that use an intrinsic appeal, such as
"Contribute to an important study," have been found to be more effective than those
with an external appeal, such as "Win valuable prizes" (Tuten, Bosnjak, & Bandilla,
1999). Consumers who were exposed to more colorful, image-laden Web sites, rather
than monotone, simple Web sites, were more likely to browse, engage in more
unplanned purchasing, and seek out more stimulating products (Menon & Kahn,
2002). Advertisements with stationary black backgrounds have been found to have
significantly more positive effects on judgments of the advertisement and purchas-
ing intention than advertisements with blinking phrases and moving images
(Stevenson, Bruner, & Kumar, 2000). Researchers who attempt to solicit participants
through Web recruitment must be careful, however, to design ads that are neutral
with respect to the underlying goals of the survey. If certain types of people are more
likely than others to respond to a particular Web ad, selection bias can result.
Obtaining a high click-through rate, though, is not the same as securing research
participants. Once Web users have clicked through the advertisement, they still
must be formally recruited. A Web page must be constructed that informs visitors
of the objectives, expectations, and procedures of the study in a manner that
appeals for their participation. Generally, Web designers should focus on creating
pages that are as basic as possible, with limited graphics or images. This ensures that
visitors, regardless of the specifications of their browser, connection speed, or
hardware, will be able to view the page as researchers intended.
Regardless of the approach adopted, the likelihood that Web surfers will happen
across these pages by chance is slim. Most Web surfers are not looking to be participants
in various research studies; hence, they are unlikely to look for them in search
engines or visit Web sites of professional organizations. Research sites will have to be
promoted by word of mouth or in offline publications. Such approaches, though,
may defeat the purpose of Web recruitment, resulting in participants who could
have been secured without Internet-based initiatives altogether.
Item Delivery
Researchers must initially determine the method for delivering individual items
to subjects. Researchers can either display items on a single static Web page or dis-
seminate them over multiple interactive Web pages. Each approach has advantages
and disadvantages that researchers should weigh before making a selection.
Static delivery displays the entire instrument on a single Web page. Subjects can
view all the questions simultaneously without having to access a new page. They
can scroll from item to item either forward or backward through the instrument
without limitation. They transmit responses to the server on one occasion, after
they click a "submit responses" button at the end of the instrument. Static instruments
are the electronic equivalent of a pencil-and-paper questionnaire.
Static Web instruments are easy to implement. They can be programmed straight-
forwardly with HTML forms. Client-side coding can be added without jeopardiz-
ing the integrity of the instrument. Conditional branching, whereby subjects are
does so in an overbearing fashion that may lead subjects to drop out of the instru-
ment. Researchers are better served by incorporating pop-up screens or conditional
pages designed to inform respondents when they fail to answer questions and
encourage them to reconsider (DeRouvray & Couper, 2002).
Interactive delivery has many advantages. It ensures greater uniformity and con-
trol of response conditions, reducing question order effects. It permits the analysis
of dropouts through the inspection of their partially completed instruments.
Prompts can be introduced after any page that is completed incorrectly or is left
blank. And subjects who wish to pause and resume the instrument at a later time
can do so from the point where they left off if a "stop temporarily" or "quit for now"
button is included on a page. On the flip side, interactive delivery requires many
more interactions with the host server, increasing download times and the possi-
bility of connection failures. Navigation can be more challenging. And respondents
prevented from inspecting the entire instrument simultaneously may lose track of
the context of various questions.
In most cases, the design of the instrument dictates the choice between static and
interactive delivery. For example, interactive delivery is the optimal choice if com-
plex question ordering is a priority, whereas static displays are better suited for
shorter instruments targeting technologically varied populations.
Response Style
Another important decision that researchers must make when formatting an
instrument is the type of items to be asked. Questions can be asked open ended or
closed ended. Open-ended questions enable subjects to answer in their own words,
whereas closed-ended questions force subjects to choose from a predetermined set
of responses. Researchers should recognize that open-ended questions, though
beneficial for some analyses, require more effort from online subjects and often
induce unit and item nonresponse (Knapp & Heidingsfelder, 1999).
Open-ended questions are straightforward to implement. Researchers simply
insert a text-input field below the question for typed entry. These text-input fields
may be programmed to limit responses to a fixed number of characters or accept as
much text as desired. In either case, researchers must determine the initial size of
the field that subjects confront. Prior research has found that longer entry fields
elicit less nonresponse, lengthier responses, and more explicit answers than shorter
ones (Couper, 2000; Couper et al., 2001; Fuchs & Couper, 2001). However, there is
also evidence that longer entry fields are more prone to receiving invalid entries
from subjects than shorter entry fields (Couper et al., 2001). These findings suggest
that researchers should pretest the length of entry fields, spacing them according to
what is expected to be the typical response.
Closed-ended questions pose challenges to researchers as well. Researchers can
choose from text-input fields, pull-down menus, click tags, or slider bars. Each can
be adapted to solicit single or multiple responses.
Text input fields are designated boxes on a Web page where subjects can indicate
their preference by typing a character, usually an "X" or a numerical value. They
can be programmed to accept single or multiple responses, enable options to be
rank-ordered, and even compute running totals. The downside is that they require
time and effort for subjects to complete and programming skills from researchers
to ensure that they actually prevent invalid responses.
Another response format available on the Web is pull-down menus (or drop-
boxes). Pull-down menus conceal the list of response options, save for a default cate-
gory, until the subjects click on the menu with their cursor. Subjects indicate their
preferences by clicking again on the appropriate response category. Researchers can
program pull-down menus to accept multiple responses. Since respondents see only
one response category until clicking the menu, researchers must be careful to set the
default option to a blank rather than to one of the standard response categories, to ensure
that they can determine whether subjects actually responded. Pull-down menus have the
advantage of taking up little space on the screen, making an instrument appear shorter
to subjects. Unfortunately, the two-step process that must be completed to respond
both decreases usability and increases the time necessary to complete the instrument
(Dillman, 2007). Respondents answering questions in this method have also been
found to be more likely to select choices toward the top of the list, have higher nonre-
sponse rates, and be more likely to inadvertently select unintended answers when
using certain types of mice (Couper, Tourangeau, Conrad, & Crawford, 2004; Healey,
2007). Thus, although drop-down boxes are common features of Web surveys and
forms, researchers are generally well-advised to avoid these answer formats.
The Web also enables researchers to collect responses to closed-ended questions
with click tags. In this format, subjects respond by maneuvering their cursor over
the input tag of their preferred choice and clicking their mouse. Click tags can be
radio buttons or check boxes. Radio buttons are circular click tags that appear filled
when selected. Radio buttons allow one and only one choice from the predeter-
mined categories, thereby preventing multiple responses. In contrast, check boxes
are square tags that display a checkmark when selected. Check boxes accommodate
as many responses as desired by the individual taking the instrument. Click tags are
easy to understand and fast to employ, but take up considerable space, and require
hand-eye coordination to use efficiently.
Finally, the Web offers the opportunity to introduce slider bars. Slider bars align
response options along a track containing a pointer or bar that can be moved back-
and-forth. Subjects slide the bar until it aligns with the preferred response. Sliders
are a particularly attractive option for questions with rating scales because they
offer the sense of a continuum (Arnau, Thompson, & Cook, 2001). They can also
be designed to permit more response options than their counterparts, while occu-
pying no more space. However, sliders may not appear identically across all
browsers, and it is difficult to differentiate preferences from nonresponse when
the default position is left untouched. Moreover, respondents may be less likely to
continue with a survey when receiving a slider-bar question than more common
formats (Watson, Lissitz, & Rudner, 2006).
No consensus has emerged concerning the effectiveness of different closed-ended
formats. A series of studies has demonstrated that radio buttons produce faster
completion times. Otherwise, choices on closed-ended response formats are best
guided by the nature of the question, space considerations, and technical capabilities
of the sample.
Researchers must also decide how to handle instances where subjects do not
know the answer or do not care to convey it. Some researchers are tempted not to
provide such options in an effort to increase the proportion of substantive
responses. This is not only likely to increase measurement error from
subjects who feel compelled to respond but also prevents researchers from differen-
tiating among the reasons subjects leave a question blank. Researchers are better served
by including "don't know" and "decline to answer" options after the response cate-
gories. Although this will generate fewer substantive responses, the effect can be
diminished in multiple-page instruments, where pop-up screens prompting respon-
dents to reconsider such responses can be added (DeRouvray & Couper, 2002).
Alignment
Researchers must also decide how to align or position items on subjects' com-
puter screens. Item placement, like display configuration, can be fluid or fixed. Fluid
layout enables items to expand or contract to fit various display configurations.
Since this can obviously change the appearance of text for different subjects, fluid
layouts should be avoided. Instead, researchers should implement fixed layouts,
where items are positioned to originate from a particular part of the screen.
There are several different alignment decisions that must be made when for-
matting instruments. First, researchers must determine the horizontal positioning
of questions. Although questions can be left justified, centered, or right justified on
a computer screen, researchers should only employ left justification, as it is both
consistent with user expectations and easier to follow as subjects move from top to
bottom.
Second, researchers must determine the alignment of response options for
closed-ended questions. Responses can be positioned either vertically (one below
another) or horizontally (one after another) under the questions. Vertical position-
ing is less prone to alignment problems from technical variation, but it takes up
more space, extending the physical length of the instrument. Conversely, horizon-
tal positioning saves space but may extend past users' screen configurations, requir-
ing horizontal scrolling. In general, horizontal positioning is more appropriate
when response options are intended to convey the sense of a continuum, whereas
vertical positioning is more suitable when there are an extensive number of
response options. In either approach, researchers should remain consistent through-
out the instrument to avoid confusion.
Researchers choosing to vertically align response categories must also determine
their arrangement relative to the questions that precede them. They must decide
whether to place response options in a single column or in multiple columns.
Although there is no evidence supporting one approach over another, Couper
(2001) did find that users tend to gravitate toward the top half and leftmost options
in columned categories. Moreover, researchers must determine whether to left
justify, center, or right justify columns. Left justification is more familiar, centering is
more visually appealing, and right justification is closer to the arrow keys used
for navigation. Experimentally manipulating left-justified and right-justified
response options, however, Bowker and Dillman (2000) found no statistical or sub-
stantive differences in users' preferences or performance.
Length
Last, researchers must decide on the length of the instrument. The length of an
instrument can be measured in one of two ways, either as the number of items
administered or the time it takes to complete them. Studies suggest that the length
of online instruments can have detrimental effects on response and dropout rates.
This not only reduces the number of cases available for analysis but can also
increase bias in the data if the response and dropout rates correlate with the vari-
ables of interest.
Research suggests that the length of online instruments is correlated with
dropout. Dropout occurs when subjects fail to complete an instrument after they
have begun, leaving the remaining questions unanswered. Though researchers need
to ensure that their instruments provide a sufficient number of variables for appro-
priate analysis, they also need to be mindful that each additional question appears
to increase the odds that subjects will fail to complete the instrument (Galesic,
2006). Crawford, Couper, and Lamias (2001) offer some evidence that disclosing
the length of the instrument to subjects before they begin may lessen
the impact of length on response and dropout rates.
Providing Instructions
Instructions to participants must be formulated so that all targeted respondents
clearly understand how to complete and submit the data collection instrument.
Nonexistent or poorly worded instructions may induce subjects to perform
tasks incorrectly, skip particular portions of the instrument, or fail to participate altogether.
Instructions must describe not only how to complete various tasks but also how to navigate through an instrument. Researchers delivering instruments in a single Web
page should inform subjects how to use scrolling bars, whereas those employing
multiple-page instruments should describe how to operate applicable action but-
tons. If subjects are provided with an option to quit, they should be instructed how
to resume later and what they can expect when they do so. Researchers employing conditional branching on single-page instruments must ensure that subjects can skip to the appropriate items. Those employing nonautomated delivery
should include skip instructions to the right of the item where subjects can easily
see them after reading the response option, while those employing single-screen
instruments with automated skips should forewarn subjects of the movement to
reduce their disorientation after it occurs.
Balancing the need for and threats from extensive instructions can be challeng-
ing. Researchers posting their instruments on the Web, though, possess several tools
to make this effort easier. They can insert hyperlinks or pop-ups to provide more detailed instruction without lengthening or disrupting the continuity of the instrument. These should not entirely replace embedded instructions, since subjects may miss or ignore them. Instead, they should be used to provide greater depth for complicated explanations or information that is not applicable to everyone. Hyperlinks or pop-ups should be set off from, but adjacent to, related instructions with a clear, concise remark, such as "For further details, click here," with the words programmed to activate the hyperlink or pop-up. Researchers are also well served by affording respondents opportunities to relay comments or questions to the researcher.
Collecting Submissions
After inducing subjects to complete the instrument, researchers must provide
the means by which they can submit their responses. The approaches are somewhat
different depending on whether the survey uses a static Web page or an interactive
Web page. Each has several advantages and disadvantages.
Researchers administering static Web instruments must instruct subjects that they can return them by clicking a submit button included at the end of the instrument. They should program the button both to transmit the instrument and to click respondents through to a corresponding page notifying them that the instrument has been successfully transmitted and thanking them for their cooperation. Submitted instruments are then e-mailed to the researcher's workstation unbeknownst to subjects.
This approach has several benefits. The instrument can easily be programmed to transform closed-ended responses into a preassigned numerical format and to import them automatically into a database, saving researchers considerable time and effort.
Moreover, the submission mechanism does not directly expose any personal iden-
tifying information, thereby accentuating the perception of anonymity. The flip-
side, though, is that subjects unfamiliar with Web transmissions can easily lose their
responses by inadvertently closing their browser, instead of clicking the submit
button. Moreover, if the connection fails or the transmission is corrupted, the entire
set of responses vanishes as well.
Researchers administering interactive Web instruments can instruct subjects to
submit either completed pages or the entire instrument. Since interactive Web
instruments are usually administered over a series of pages, submission procedures
are typically designed to appear as continuation buttons. When subjects click con-
tinue or next page buttons affixed to the bottom of the page, responses are trans-
mitted directly to the corresponding Web server, without invoking e-mail. After
arriving, responses are automatically compiled for each subject and then added to
a preformatted database.
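To make the mechanics concrete, the following is a minimal sketch, not drawn from the chapter, of what an interactive-submission back end might look like. It assumes Python with the Flask package and a local SQLite file named responses.db; the route name, table layout, and field names are illustrative only.

```python
# Hypothetical sketch of the interactive-submission pattern described above:
# each survey page POSTs its answers to the Web server, which appends them to
# a preformatted table (no e-mail involved).
import sqlite3
from flask import Flask, request

app = Flask(__name__)
DB = "responses.db"  # illustrative file name

def init_db():
    with sqlite3.connect(DB) as con:
        con.execute(
            "CREATE TABLE IF NOT EXISTS answers "
            "(respondent_id TEXT, page INTEGER, item TEXT, value TEXT)"
        )

@app.route("/submit/<int:page>", methods=["POST"])
def submit(page):
    # request.form holds the fields submitted from the current survey page
    rid = request.form.get("respondent_id", "anonymous")
    rows = [(rid, page, item, value)
            for item, value in request.form.items()
            if item != "respondent_id"]
    with sqlite3.connect(DB) as con:
        con.executemany("INSERT INTO answers VALUES (?, ?, ?, ?)", rows)
    # Acknowledge receipt so the respondent can continue to the next page
    return f"Page {page} received. Please continue."

if __name__ == "__main__":
    init_db()
    app.run()
```

A design note on this sketch: because each page is written to the database as it arrives, a dropped connection loses at most the current page, which is the advantage over static, all-at-once submission described in the next paragraph.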
Interactive instrument submission possesses several advantages over its static counterpart. It loses little existing data when Internet connections fail or subjects abandon the instrument. By avoiding e-mail altogether, it is likely to induce stronger perceptions of anonymity. And it places far fewer demands on researchers' workstations. Unfortunately, these benefits come at a price. Interactive submissions are
more expensive to manage than their counterparts and require more advanced
programming skills to implement.
Conclusion
The Internet is an exciting and increasingly popular method for collecting survey
and other sorts of data. Compared with other data collection modes, the Internet
often has a relatively low marginal cost for conducting interviews, particularly
when large samples are desired. Equally important, the Internet offers a way of
incorporating experiments and visual stimuli into self-administered surveys. In
many cases, experimental researchers find the Internet to be an efficient way of con-
ducting studies among populations that are far broader and more representative
than those typically found in a psychology lab.
At the same time, though, the Internet faces shortcomings that researchers
need to be aware of. In cases where scientific samples of the general population
are not important, or where probabilistic samples of Web users can be generated,
the Internet can serve as an optimal data collection tool. The Internet is often
used successfully in studies employing multimode data collection approaches
that provide respondents with different options for completing survey question-
naires. However, Internet studies are usually seen as inadequate when used for
estimating population parameters for groups that might not all have Internet
access or who might not be easily identified or included in scientifically devel-
oped Internet sampling frames. Consequently, when considering Internet data
collection, researchers need to think very carefully about the goals of their study.
In particular, researchers need to be especially careful in specifying and consider-
ing the relationship between the target population of their study and the available
sampling methods. Though Internet data collection offers great promise, it also
has limitations that can make an otherwise useful data collection method inap-
propriate for some studies.
Discussion Questions
1. For what types of samples would the Internet be an appropriate tool to
recruit subjects? For what types of samples would the Internet be inappropriate?
2. What methods are available for contacting individuals and soliciting partici-
pation? What are the advantages and disadvantages of each?
3. What two approaches can be taken to question delivery on the Web? What
considerations should researchers weigh before making a decision?
4. What response styles are available to researchers designing questions for
Internet surveys? What types of questions are best suited for each? How should they
be aligned on a Web page?
5. Why are instructions so important for Web surveys? What conventions
should be adopted to ensure that respondents understand how to complete Web
surveys correctly?
6. What options are available for collecting survey submissions? What are the
strengths and weaknesses of each option?
Exercises
1. Design an e-mail to send to prospective subjects inviting them to participate
in a Web survey. Make sure to describe how you will construct the heading and the
body of the message.
2. Construct a 20-item Web survey. Detail how you will approach the following
considerations:
a. What method will you use to deliver the questions to subjects?
b. What response style will you use with each question?
c. How will you align each question on the page?
d. What instructions will you use to explain how the survey should be completed?
e. How will the submissions be collected?
3. Design an Internet sample:
a. Who, specifically, do you want to target?
b. What percentage of these people are likely to have Web access? How can
you find this out?
c. How are you going to develop your sample frame? What specific lists or
sources will you use to develop your sample frame? How will you get
access to these lists or sources; will you need permission?
d. How well does your sample frame cover the population that you are
intending to study? If the sample frame does not cover the entire popula-
tion, how might the people you exclude be different from those who are
in the list? Is this a problem for your data?
References
Arnau, R. C., Thompson, R. L., & Cook, C. (2001). Do different response formats change the latent structure of responses? An empirical example using taxometric analysis. Educational and Psychological Measurement, 61(1), 23–44.
Bell, D. S., Mangione, C. M., & Kahn, C. E., Jr. (2001). Randomized testing of alternative survey formats using anonymous volunteers on the World Wide Web. Journal of the American Medical Informatics Association, 8, 616–620.
Best, S. J., & Krueger, B. (2002). New approaches to assessing opinion: The prospects for electronic mail surveys. International Journal of Public Opinion Research, 14, 73–92.
Best, S. J., & Krueger, B. (2004). Internet data collection. Thousand Oaks, CA: Sage.
Best, S. J., Krueger, B., Hubbard, C., & Smith, A. (2001). An assessment of the generalizability of Internet surveys. Social Science Computer Review, 19, 131–145.
Boase, J., Horrigan, J., Wellman, B., & Rainie, L. (2006). The strength of Internet ties: The Internet and email aid users in maintaining their social networks and provide pathways to help when people face big decisions (Research Report). Pew Internet and American Life Project. Retrieved June 11, 2007, from www.pewinternet.org/pdfs/PIP_Internet_ties.pdf
Bowker, D., & Dillman, D. A. (2000, May). An experimental evaluation of left and right oriented screens for web questionnaires. Paper presented at the annual meeting of the American Association for Public Opinion Research, Portland, OR.
Comley, P. (2000, April). Pop-up surveys: What works, what doesn't work and what will work in the future. Paper presented at the ESOMAR Net Effects Internet Conference, Dublin, Ireland.
Couper, M. P. (2000). Web surveys: A review of issues and approaches. Public Opinion Quarterly, 64, 464–494.
Couper, M. P. (2001, August). Web surveys: The questionnaire design challenge. Invited paper presented at the International Statistical Institute, Seoul, South Korea.
Couper, M. P., Tourangeau, R., Conrad, F., & Crawford, S. (2004). What they see is what we get: Response options for Web-based surveys. Social Science Computer Review, 22(2), 111–127.
Couper, M. P., Traugott, M., & Lamias, M. (2001). Web survey design and administration. Public Opinion Quarterly, 65(2), 230–253.
Crawford, S., Couper, M. P., & Lamias, M. J. (2001). Web surveys: Perceptions of burden. Social Science Computer Review, 19(2), 146–162.
DeRouvray, C., & Couper, M. P. (2002). Designing a strategy for capturing respondent uncertainty in web-based surveys. Social Science Computer Review, 20(1), 3–9.
Dillman, D. A. (2007). Mail and internet surveys: The tailored design method (2nd ed.). Hoboken, NJ: Wiley.
Fallows, D. S. (2005). How women and men use the Internet: Women are catching up to men in most measures of online life. Men like the Internet for the experiences it offers, while women like it for the human connections it promotes (Research Report). Pew Internet and American Life Project. Retrieved June 11, 2007, from www.pewinternet.org/pdfs/PIP_Women_and_Men_online.pdf
Fallows, D. S. (2007). The volume of spam is growing in Americans' personal and workplace email accounts, but email users are less bothered by it (Data Memo). Pew Internet and American Life Project. Retrieved June 11, 2007, from www.pewinternet.org/pdfs/PIP_Spam_May_2007.pdf
Fox, S. (2005). Digital divisions: There are clear differences among those with broadband connections, dial-up connections, and no connections at all to the Internet (Research Report). Pew Internet and American Life Project. Retrieved June 11, 2007, from www.pewinternet.org/pdfs/PIP_Digital_Divisions_Oct_5_2005.pdf
Fox, S., & Livingston, G. (2007). Latinos online: Hispanics with lower levels of education and English proficiency remain largely disconnected from the Internet (Research Report). Pew Internet and American Life Project. Retrieved June 11, 2007, from www.pewinternet.org/pdfs/Latinos_Online_March_14_2007.pdf
Fuchs, M., & Couper, M. P. (2001). Length of input field and the responses provided in a self-administered survey: A comparison of paper and pencil and a web survey. Paper presented at the International Conference on Methodology and Statistics, Ljubljana, Slovenia.
Galesic, M. (2006). Dropouts on the Web: The effects of interest and burden experienced during an online survey. Journal of Official Statistics, 22(2), 313–328.
Grimes, G. A. (2006). Online behaviors affected by spam. Social Science Computer Review, 24(4), 507–515.
Healey, B. (2007). Drop downs and scroll mice: The effect of response option format and input mechanism employed on data quality in Web surveys. Social Science Computer Review, 25(1), 111–128.
Heerwegh, D. (2005). Effects of personal salutations in e-mail invitations to participate in a web-based survey. Public Opinion Quarterly, 69, 588–598.
Heerwegh, D., & Loosveldt, G. (2006a). An experimental study of the effects of personalization, survey length statements, progress indicators, and survey sponsor logos in Web surveys. Journal of Official Statistics, 22(2), 191–210.
Heerwegh, D., & Loosveldt, G. (2006b). Personalizing e-mail contact: Its influence on Web survey response rate and social desirability bias. International Journal of Public Opinion Research, 19(2), 258–268.
Huggins, V., & Eyerman, J. (2001, February). Probability based Internet surveys: A synopsis of early methods and survey research results. Paper presented at the Federal Committee on Statistical Methodology Research Conference, Arlington, VA.
Knapp, F., & Heidingsfelder, M. (1999). Drop-out Analyse: Wirkungen des Untersuchungsdesigns [Drop-out analysis: The effect of research design]. In U.-D. Reips, B. Batinic, W. Bandilla, M. Bosnjak, L. Graf, K. Moser, et al. (Eds.), Current Internet science: Trends, techniques, results. Zurich, Switzerland: Online Press. Retrieved November 15, 2002, from www.pewinternet.org/pdfs/PIP_Religion_Report.pdf
Lee, S. (2004). Statistical estimation methods in volunteer panel Web surveys. Unpublished doctoral dissertation, University of Maryland, Joint Program in Survey Methodology.
Lee, S. (2006a). An evaluation of nonresponse and coverage errors in a prerecruited probability Web panel survey. Social Science Computer Review, 24(4), 460–475.
Lee, S. (2006b). Propensity score adjustment as a weighting scheme for volunteer Internet surveys. Journal of Official Statistics, 22(2), 329–349.
Menon, S., & Kahn, B. (2002). Cross-category effects of induced arousal and pleasure on the Internet shopping experience. Journal of Retailing, 78, 31–40.
Mitofsky, W. J. (1999). Pollsters.com. Public Perspective, 10, 24–26.
Nie, N., & Erbring, L. (2000). Internet and society: A preliminary report. Report from the Stanford Institute for the Quantitative Study of Society, Palo Alto, CA.
Pew Internet and American Life. (2000). Wired churches, wired temples: Taking congregations and missions into cyberspace. Retrieved November 15, 2002, from http://63.210.24.35/reports/pdfs/PIP_Religion_Report.pdf
Peytchev, A., Couper, M. P., McCabe, S. E., & Crawford, S. (2006). Web survey design: Paging versus scrolling. Public Opinion Quarterly, 70(4), 596–607.
Rosenbaum, P. R. (1995). Observational studies. New York: Springer-Verlag.
Schonlau, M., Zapert, K., Simon, L. P., Sanstad, K., Marcus, S., Adams, J., et al. (2004). A comparison between a propensity weighted Web survey and an identical RDD survey. Social Science Computer Review, 22, 128–138.
Smith, R. M., & Kiniorski, K. (2003, May). Participation in online surveys: Results from a series of experiments. Paper presented at the annual meeting of the American Association for Public Opinion Research, Nashville, TN.
Stempel, G. H., Hargrove, T., & Bernt, J. P. (2000). Relation of growth of use of the Internet to changes in media use from 1995 to 1999. Journalism and Mass Communication Quarterly, 77, 71–79.
Stevenson, J. S., Bruner, G. C., II, & Kumar, A. (2000). Web page background and viewer attitudes. Journal of Advertising Research, 40(1/2), 29–34.
Tuten, T. L., Bosnjak, M., & Bandilla, W. (2000). Banner-advertised Web surveys. Marketing Research, 11(4), 17–21.
Vehovar, V., Lozar Manfreda, K., & Batagelj, Z. (2000). Design issues in WWW surveys. In 2000 Proceedings of the section on survey research methods (pp. 983–988). Alexandria, VA: American Statistical Association.
Watson, J. T., Lissitz, R. W., & Rudner, L. M. (2006). The influence of Web-based questionnaire presentation variations on survey cooperation and perceptions of survey quality. Journal of Official Statistics, 22(2), 271–291.
CHAPTER 14
William M. Trochim
We pitched camp, lasted out the snowstorm, and then with the map we dis-
covered our bearings. And here we are. The lieutenant borrowed this remark-
able map and had a good look at it. He discovered to his astonishment that it
was not a map of the Alps but of the Pyrenees. (Weick, 1995, p. 54)
This chapter is about developing maps, not of geographical territory, but of the-
ories and ideas. It describes a structured applied social research methodology that
can be used to connect theory to observation and research to practice.
Concept mapping is a method for designing and populating conceptual models,
to inform, confirm, or revise a testable theory. Recognizing that a wide range of
thought and practice are associated with the term concept mapping, we concentrate
on one particular approach that has special relevance for applied social research and is especially appropriate for this Handbook. This approach has strong
roots in several important traditions in applied social research, and the method has
broad utility in many applied social research contexts.
In this chapter, we place concept mapping within the more general context of
structured conceptualization methods; we then describe the specific steps in imple-
menting this methodology, from initiation of a project to utilization of results. We
consider the variety of ways this concept mapping approach has been or could be
used in applied social research. Finally, we discuss some of the current issues related to concept mapping and how it might evolve in the near term.
(GSR)i → V

This model indicates that all process activities (G, S, and R) are accomplished essentially as one process step (thus enclosed within common parentheses), from the perspective of a single individual (the subscript i), and that the result is a verbal (V) representation.

[Figure: Representational forms of models include verbal, pictorial, and mathematical.]

(G)g(SR)g → V

In this second form, the group (subscript g) generates the ideas as a distinct step, structuring and representation are accomplished together, and the result is again a verbal representation.
Concept Mapping
The structured conceptualization model enables a more formal description of the
central focus of this chapter: Concept mapping is a collaborative and algorithmic-
structured conceptualization process that results in a visual representation of ideas
and their interrelationships. In notational form, any process that generates a picto-
rial (P) representation could be described as concept mapping. While the notion of
relating concepts to each other is as old as thought itself, the idea that the result
might be represented visually is a relatively modern phenomenon.
Included under the broad rubric of concept mapping are approaches such as idea
maps (Armbruster & Anderson, 1982, 1984), mind maps (Buzan & Buzan, 1993),
mental maps (Dillon, Richardson, & McKnight, 1993), cognitive maps (Axelrod,
1976), and a host of literatures related to how to generate such structures, including
lateral thinking (DeBono, 1971, 1973), brainstorming (Adams, 1979), and brainwrit-
ing (Hiltz & Turoff, 1978; Rothwell & Kazanas, 1989). In social science and education,
a variety of different concept mapping approaches represent several traditions and
methods. Many are individual learning, organizing, or writing methodologies.
In contrast, collaborative group concept mapping methods are explicitly
designed to collect input about ideas from several or many individuals, identify
how they organize the interrelationships among the ideas, and represent their
group thinking pictorially or graphically. These approaches are highly structured,
with each process step performed as a distinct activity.
Concept mapping as discussed in the remainder of this chapter is of this form:
(G)g(S)g(R)a → P
The generation of the domain of ideas takes place first, typically (although not
necessarily) through some form of group brainstorming. Individuals contribute to
the delineation of ideas, so it is notated with a subscripted g. Structuring of the
ideas is a distinct second step (within its own parentheses), usually accomplished by having each of the individuals in a group sort the ideas. The product of this step is also group
based, even though each individual separately sorts the ideas before aggregation. An
algorithm is used to compute the map, in this case, a sequence of multivariate sta-
tistical analyses as described later. The result is a map (P), which the participants
discuss and interpret. This type of conceptualization method is both a child and a
parent of applied social research methods: Its analytical tools and group processes
are rooted in social research; and its integrated process is frequently used as a
methodology in applied social research that generates and explores conceptual
structures of group thinking.
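For readers who prefer to see the notation set out with explicit subscripts, the process model just described can be restated as a display; this is simply a typeset version of the form given above, in which g marks a group activity, a marks the algorithmic step, and P the pictorial result.

```latex
% Restatement of the concept mapping process model described in the text:
% generation (G) and structuring (S) are group (g) activities, the
% representation (R) is computed algorithmically (a), and the result is a
% pictorial (P) map.
\[
  (G)_{g}\,(S)_{g}\,(R)_{a} \;\longrightarrow\; P
\]
```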
Group concept mapping was developed in the early 1980s (Kane & Trochim,
2006; Trochim & Linton, 1986), and had its foundations in a variety of applied
social research and organizational behavior traditions, including
The emergent map, developed through the engagement of multiple stakeholders and the application of rigorous processes and tools, is a product that, in applied social research, would be difficult to arrive at through other, more traditional means.
In practical terms, concept mapping helps a group solve a problem, articulate
a group need or desire, author a plan, or develop a program or intervention. A
researcher might consider concept mapping an appropriate methodology when a
group has unique experience that can inform theory or represents a range of opin-
ions that are not easily reconciled in traditional group conceptualization modes;
when the power differential in a group has the effect of reducing contributions of
thought from certain quarters; or when the desired outcome of a group's thinking
is not well articulated. Concept mapping is an especially applicable methodology
for research or evaluation in organizations or communities where there is a history
or culture of community participation in decision making and planning.
The focus prompt should express a specific need or interest from the participants. The following are examples of focus prompts from a variety of concept mapping projects:
A specific issue that affects the mental health of women and girls is . . .
In order to improve community services to vulnerable new residents in a
city, the community clinic system should . . .
We will know that our after-school program is a success when . . .
The Delaware Example
For the DCC Cancer plan, 32 individuals conducted individual pile sorts of
the final statements, and the data were used as the foundation for the
development of the concept map.
The project asked participants to rate (or provide value observations on)
importance and feasibility. A total of 93 participants provided ratings on
importance, and 80 provided ratings on feasibility.
Participants were asked to provide nonidentifying information in response
to the following characteristics:
County of residence
Relationship to cancer control
Type of organization
Ratings may take different forms and ask a range of questions, and are collected
on each statement from each stakeholder participant. Although a standard Likert-like
Binary square similarity matrix for one person:

      1  2  3  4  5  6  7  8  9  10
 1    1  1  0  0  0  1  0  0  1  0
 2    1  1  0  0  0  1  0  0  1  0
 3    0  0  1  1  0  0  0  0  0  0
 4    0  0  1  1  0  0  0  0  0  0
 5    0  0  0  0  1  0  0  1  0  0
 6    1  1  0  0  0  1  0  0  1  0
 7    0  0  0  0  0  0  1  0  0  0
 8    0  0  0  0  1  0  0  1  0  0
 9    1  1  0  0  0  1  0  0  1  0
10    0  0  0  0  0  0  0  0  0  1
Figure 14.3 Transforming Sort Data Into a Binary Square Similarity Matrix
SOURCE: From Concept Mapping for Planning and Evaluation by M. Kane and W. M. Trochim,
2006. Reprinted with permission of SAGE.
consisting only of 0s and 1s. If two statements were placed together in a pile by the
individual, their corresponding row and column numbers would contain a 1. If
they weren't placed together, their joint row-column value would hold a 0. Because
a statement is always sorted into the same pile as itself, the diagonal of the matrix
always consists of 1s. The matrix is symmetric because, for example, if Statement 5
is sorted with Statement 8, it must always be the case that Statement 8 is sorted with
Statement 5. Thus, the concept mapping analysis begins with construction from the
sort information of an N × N (where N = the number of statements) binary, sym-
metric matrix of similarities, Xij. For any two items i and j, a 1 is placed in Xij if the
two items were placed in the same pile by the participant, otherwise a 0 is entered
(Weller & Romney, 1988, p. 22).
This creates a common data structure that is the same size for all participants,
permitting aggregation across participants' input. Figure 14.4 shows how this might
look when aggregating sort results from 5 participants who each sorted the same
10-statement set. The figure illustrates that, in effect, the individual binary matrices
are stacked on top of each other and added. Thus, any cell in this aggregate matrix
could take integer values between 0 and 5 (i.e., the number of people who sorted
the statements); the value indicates the number of people who placed the i, j pair in
the same pile. The total N × N similarity matrix, Tij, is obtained by summing across the individual Xij matrices.
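As a minimal illustration (not taken from the chapter), the construction of the individual binary matrices and the total similarity matrix described above can be sketched in a few lines of Python; the pile list below reproduces the sort shown in Figure 14.3, and the function names are illustrative.

```python
# Hypothetical sketch: build one N x N binary similarity matrix (Xij) per
# sorter and sum them into the total similarity matrix (Tij).
import numpy as np

def binary_similarity(piles, n_statements):
    """piles: list of piles, each a list of 1-based statement numbers."""
    x = np.zeros((n_statements, n_statements), dtype=int)
    for pile in piles:
        for i in pile:
            for j in pile:
                x[i - 1, j - 1] = 1  # 1 if i and j share a pile (diagonal = 1)
    return x

def total_similarity(all_sorts, n_statements):
    """all_sorts: one pile list per participant; returns Tij = sum of Xij."""
    return sum(binary_similarity(p, n_statements) for p in all_sorts)

# Example: the sorter in Figure 14.3 grouped (1, 2, 6, 9), (3, 4), (5, 8), (7), (10)
sorter_1 = [[1, 2, 6, 9], [3, 4], [5, 8], [7], [10]]
print(binary_similarity(sorter_1, 10))
```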
This total similarity matrix Tij is the input for nonmetric MDS analysis with a
two-dimensional solution. The solution is limited to two dimensions for ease of
use, as recommended by Kruskal and Wish (1978).
The analysis yields a two-dimensional (x, y) configuration of the set of state-
ments based on the criterion that statements piled together by more people are
Figure 14.4 Aggregating Sort Data Across Five Participants Into the Total Square
Similarity Matrix for a 10-Statement Map
SOURCE: From The Concept System Training Manual, 2007. Reprinted with permission of Concept
Systems, Inc. www.ConceptSystems.com
located closer to each other in two-dimensional space while those piled together
less frequently are further apart. There are numerous mathematical descriptions of
the MDS process (Davison, 1983; Kruskal & Wish, 1978); a visual, nonmathemati-
cal explanation follows to provide insight for social researchers whose work
requires explanation of this analysis to others.
Here, we use a hypothetical example to illustrate the analysis. In the example, 80
statements are assumed to be generated by 10 participants. The first 10 statements
from this example are given in Table 14.1.
Table 14.1 The First 10 Brainstormed Statements (of 80) From a Hypothetical
Example Concept Mapping Process on Organizational Development
and Sustainability
10. Conduct program effectiveness analysis for all major current programs
Figure 14.5 shows an excerpt of the aggregate 80 × 80 sort matrix that shows the
results for the first 10 statements. Each cell shows how many of the 10 participants
sorted each statement with each other statement. The maximum number in each
cell is necessarily 10, since that is the total number of sorters. The minimum
number in each cell is 0, since that is the lowest possible number of sorters con-
necting a specific statement to another statement.
Expanding this example to the entire data set, the data we have are 10 individual
sorts of the 80 statements. MDS takes a square matrix of similarities3 for a set of
items/objects, such as the one above, as input and produces a map4 as output.
The map shown in Figure 14.6 represents statements and their relationship
to each other, and highlights the first 10 statements. The cells in the table in
Figure 14.5 indicate, for example, that 8 out of 10 people sorted Statements 5 and 7
together; and these statements are consequently located next to each other on the
bottom of the map. Similarly, 8 out of 10 people sorted Statements 3 and 9 together
and they are located next to each other on the top of the map. On the other hand,
none of the participants sorted Statement 3 with either 5 or 7 or Statement 9 with
either 5 or 7. Statements 3 and 9 on the top are located far away from 5 and 7 on
the bottom. Interstatement relationships in the similarity matrix are translated by
MDS into distances on the map.
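A hedged sketch of this step, using scikit-learn's nonmetric MDS rather than any particular concept mapping software, might look as follows. The conversion from similarities to dissimilarities (subtracting each count from the maximum) is one common, illustrative choice rather than the chapter's prescription.

```python
# Hypothetical sketch: nonmetric MDS on the total similarity matrix Tij,
# producing one (x, y) point per statement.
import numpy as np
from sklearn.manifold import MDS

def point_map(T):
    T = np.asarray(T, dtype=float)
    dissimilarity = T.max() - T      # frequently co-sorted pairs become close
    np.fill_diagonal(dissimilarity, 0.0)
    mds = MDS(n_components=2, metric=False,
              dissimilarity="precomputed", random_state=0)
    xy = mds.fit_transform(dissimilarity)
    return xy, mds.stress_           # coordinates and the final stress value

# Example: xy, stress = point_map(T)   # T as in the aggregate matrix of Figure 14.5
```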
How does MDS take the aggregate sort matrix and produce the two-dimensional
point map? The following simple illustration of MDS is not an exact explanation
for how the statistical algorithm works, but rather provides a visual metaphor that
suggests what the formula is grappling with. We find this example useful for
students and nonstatisticians who are interested in the analysis. For an exact treatment, readers should consult the mathematical descriptions cited above (Davison, 1983; Kruskal & Wish, 1978).
Statement 1 2 3 4 5 6 7 8 9 10
1 10 0 0 0 0 0 0 0 0 5
2 0 10 0 2 2 1 1 0 0 0
3 0 0 10 1 0 0 0 0 8 0
4 0 2 1 10 1 2 0 1 2 0
5 0 2 0 1 10 0 8 0 0 0
6 0 1 0 2 0 10 0 4 0 0
7 0 1 0 0 8 0 10 0 0 0
8 0 0 0 1 0 4 0 10 0 0
9 0 0 8 2 0 4 0 0 10 0
10 5 0 0 0 0 0 0 0 0 10
Figure 14.5 Excerpt of the Aggregate Similarity Matrix Showing Results for the First 10 Statements for 10 Participants in an 80 × 80 Similarity Matrix
SOURCE: From The Concept System Training Manual, 2007. Reprinted with permission of Concept
Systems, Inc. www.ConceptSystems.com
Figure 14.6 Final Map of 80 Statements as Sorted by 10 Participants, With the First 10 Statements
Highlighted
SOURCE: From Concept Mapping for Planning and Evaluation by M. Kane and W. M. Trochim, 2006. Reprinted with
permission of SAGE.
Similarity Matrix

     1  2  3
1    5  1  2
2    1  5  0
3    2  0  5

(The figure shows concentric circles around Statement 1: the more of the five sorters who grouped Statement 1 with Statement 2, the closer Statement 2 is placed to Statement 1.)
Figure 14.7 Similarity Matrix for Three Items, and Theoretical Distance of Item 2 From Item 1, Based on Sorters' Input
SOURCE: From The Concept System Training Manual, 2007. Reprinted with permission of Concept
Systems, Inc. www.ConceptSystems.com
as indicated in Figure 14.8. Two people sorted Statements 1 and 3, and none sorted
2 and 3. To place Statement 3, we locate the position that is simultaneously three cir-
cles from Statement 1 and five from Statement 2. Figure 14.8 shows that there are
two equally accurate locations for Statement 3. We arbitrarily select the one on the
upper left (the highlighted Statement 3).
With only three points in two dimensions, it is always possible to place the
points exactly in two dimensions. However, the process gets more complicated with
a fourth statement to add to the project. In the lower left of Figure 14.8, we see the
same hypothetical similarity matrix for the five sorters, but with a fourth statement
added. To place this fourth point on the map, we need to locate its distance simul-
taneously from each of the other three statements. The concentric circles in
Figure 14.8 show the possibilities. The best location is an intersection that is simul-
taneously one unit away from Statement 1, five units away from Statement 2, and
two units away from Statement 3. But note that the required concentric circles do
not have such an intersection point. In two dimensions, the best we can do is to
locate Statement 4 as closely as possible to the intersection, as shown in Figure 14.8.
Several important insights emerge from this simple visual description of what MDS
is grappling with. MDS does not know directions, so that when we place Point 2 in
Similarity Matrix (three statements)

     1  2  3
1    5  1  2
2    1  5  0
3    2  0  5

Similarity Matrix (with a fourth statement added)

     1  2  3  4
1    5  1  2  4
2    1  5  0  0
3    2  0  5  3
4    4  0  3  5
Figure 14.8 Constructing an MDS-like Point Plot for the Hypothetical Case of Four Statements and
Five Sorters
SOURCE: From The Concept System Training Manual, 2007. Reprinted with permission of Concept Systems, Inc.
www.ConceptSystems.com
In practice, maps with lower stress values are not necessarily more interpretable or useful than ones with considerably higher stress. The Stress Value is sensitive to even the smallest distance discrepancies on the map. Interpretation of
micromeasurements of distances among points is typically neither necessary nor use-
ful. Slight variances between the input and the placement can contribute to higher
stress without diminishing the general map result or its interpretability.
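For reference, one common formulation of the stress value for nonmetric MDS is Kruskal's Stress-1 (see Kruskal & Wish, 1978); the specific form below is supplied here for convenience rather than taken from the chapter.

```latex
% Kruskal's Stress-1: d_ij are the distances between statements i and j on the
% map, and \hat{d}_ij are the disparities (the monotone transformation of the
% input dissimilarities fitted by the nonmetric algorithm).
\[
  \text{Stress} =
  \sqrt{\frac{\sum_{i<j}\bigl(d_{ij}-\hat{d}_{ij}\bigr)^{2}}
             {\sum_{i<j} d_{ij}^{2}}}
\]
```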
The discussion thus far shows how the concept mapping analysis uses aggregate
sorting results and MDS to produce the basic point map that is the foundation for
all other maps. While this is a useful result in itself, it is helpful to be able to view a
concept map at different levels of detail. The point map generated by MDS is a fairly
detailed map, especially when it contains as many as a hundred points. To arrive at
a higher-level view of the map, a procedure known as hierarchical cluster analysis
(Anderberg, 1973; Everitt, 1980) is used. The input to the cluster analysis is the point
map, specifically, the x, y values for all the points, or units of input, on the MDS
map. Using the MDS configuration as input to the cluster analysis forces the clus-
ter analysis to partition the MDS configuration into nonoverlapping clusters in
two-dimensional space. Mathematicians do not agree on what constitutes a cluster
mathematically, so several algorithms exist for conducting cluster analysis, each of
them likely to yield different results. In group concept mapping, we typically con-
duct hierarchical cluster analysis using Ward's algorithm (Everitt, 1980) as the basis for defining a cluster. Ward's algorithm has the advantage of being especially appro-
priate with the type of distance data that comes from the MDS analysis. The hier-
archical cluster analysis uses the point map data to construct a tree that at one
extreme represents all points together (in the trunk of the tree) and at another rep-
resents all points as individual end points of the branches. Cluster analysis
approaches can be classified as either divisive (i.e., top down) or agglomerative (i.e.,
bottom up). Ward's algorithm is an agglomerative approach.
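A minimal sketch of this step (assuming the SciPy library; not the authors' implementation) applies Ward's method to the (x, y) MDS coordinates and then cuts the resulting tree at a chosen number of clusters.

```python
# Hypothetical sketch: hierarchical cluster analysis with Ward's method on the
# (x, y) MDS coordinates, cut to a requested number of clusters.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_point_map(xy, n_clusters):
    xy = np.asarray(xy, dtype=float)        # shape: (n_statements, 2)
    tree = linkage(xy, method="ward")       # full agglomerative merge tree
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")
    return labels                           # one cluster label per statement

# Example: labels = cluster_point_map(xy, n_clusters=5)   # the five-cluster slice
```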
Figure 14.9 illustrates how agglomerative hierarchical cluster analysis is related to an
MDS point map. Returning to the 10-statement example, the top of the figure shows a
10-statement point map. The bottom shows the cluster analysis tree. Each statement is
the end-point of a branch. The tree shows, moving from top to bottom, how statements
are agglomerated and eventually combined onto a single trunk, a one-cluster solution.
To illustrate, it is visually apparent that Statements 1 and 6 are closer to each other than
any other pair of statements on the map. In the cluster tree, they are the first two
branches that are merged. The next closest pair is Statements 5 and 7, and they are
grouped next. The merge table on the bottom left of Figure 14.9 shows which state-
ments (or previously formed clusters of statements) are combined at each number of
clusters. By taking horizontal slices at different heights of the tree, one can look at dif-
ferent numbers of clusters. For instance, for a five-cluster solution, we look at the hori-
zontal slice at the 5 Number of Clusters level in the cluster tree to see that the
following statements would be grouped in clusters: (1, 6, 8) (3, 4) (7, 5) (9, 10) (2).
The resulting graphic representation partitions the universe of statements (ideas, issues, or articulations of knowledge) into groups or clusters that appropriately represent the map content on a higher conceptual level. The process of
selecting the appropriate level of detail, or concept, is typically driven by the needs
of the group in the study or the research. In brief, the process relies on qualitative
review of a range of cluster solutions, from fairly granular in the case of a map
[Figure 14.9 shows the 10-statement point map together with the corresponding hierarchical cluster tree and the merge table described above. The cluster map that follows in the original labels its clusters Employee Issues, Employee Relations/Communication, Partnership, and Information Technology.]
The analysis thus yields several related products:
Basic concept maps (point maps and cluster maps) that form the foundation for further analysis.
Pattern matches, which use the value ratings gathered at the structuring phase (Step 3) to show consensus, or differences of opinion or judgment, between groups at the cluster level.
Bivariate value plots, called go zones, which use the value rating data on a statement-by-statement level within each cluster.
Ratings data from Step 3 can be used by the researcher to describe the ratings of
all participants or a subgroup, to compare across subgroups; or to compare all par-
ticipants across different dimensions, such as importance, feasibility, or potential
impact. Figure 14.11 represents a cluster ratings map, which illustrates the range of,
in this case, importance levels that the participants as a whole associate with each
conceptual cluster on the map.
The overall values related to each concept provide rich feedback to the
researcher and community of interest. Here, the group would likely notice that the
"northeast ridge" indicates high importance associated with the concepts of employee issues, employee relations, and efficiency. Some might note the comple-
mentary relationship among those concepts, as a region of interest for planning.
In contrast, the "west coast" is relatively less important to the organization, accord-
ing to those who participated.
More detailed rating comparison tools are also typically used. Pattern matching
(Trochim, 1985, 1989b) is used to explore consensus across different stakeholders
or stakeholder groups. Pattern matching is both a statistical and a graphic analysis.
Graphically, a pattern match is portrayed using a ladder graph that consists of two
vertical axes (one for each pattern) as shown in Figure 14.12. The vertical axes are
joined by lines that indicate the average values for each cluster on the concept map
for any variable and group specified. Statistically, the two patterns are compared
with a Pearson Product Moment correlation that is displayed at the bottom of the
ladder graph. The graphic is derived from the ratings data taken on each statement
from each participant and the demographic information collected at the same time.
The analysis segments the stakeholders by self-identified group; it also averages the
value ratings of the statements within each cluster (as on the cluster rating map)
and aligns them on a vertical number line for each subgroup of interest. Connecting
Cluster A on the left side with Cluster A on the right side shows us graphically the
relative importance of the opinions between Groups 1 and 2. Figure 14.12 repre-
sents a pattern match that compares, in this case, managers' and staff's opinions of
importance on each of the concepts or clusters.
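As an illustration only (the function name and data layout are assumed, not taken from the chapter), the cluster-level averages and the correlation reported at the foot of the ladder graph could be computed as follows.

```python
# Hypothetical sketch of a pattern match: average each cluster's statement
# ratings separately for two subgroups, then correlate the two patterns.
import numpy as np
from scipy.stats import pearsonr

def pattern_match(ratings, is_group_a, cluster_labels):
    """ratings: (n_participants, n_statements) array of value ratings;
    is_group_a: boolean flag per participant (e.g., True = manager);
    cluster_labels: cluster label per statement."""
    ratings = np.asarray(ratings, dtype=float)
    is_group_a = np.asarray(is_group_a, dtype=bool)
    cluster_labels = np.asarray(cluster_labels)
    clusters = np.unique(cluster_labels)
    means_a = [ratings[is_group_a][:, cluster_labels == c].mean() for c in clusters]
    means_b = [ratings[~is_group_a][:, cluster_labels == c].mean() for c in clusters]
    r, _ = pearsonr(means_a, means_b)   # the correlation shown under the ladder graph
    return list(zip(clusters, means_a, means_b)), r
```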
This pattern match is a cluster-level representation of the average values of the
statements in each cluster, and how they compare for two subgroups. The cluster names from the concept map are arrayed according to their average rating, in comparison with the other clusters.

Figure 14.12 Pattern Match Comparing Managers and Staff on Importance Ratings by Cluster

SOURCE: From The Concept System Training Manual, 2007. Reprinted with permission of Concept Systems, Inc. www.ConceptSystems.com

In this pattern match, managers and staff agree that
community relations is not as important as the other concepts associated with
organizational planning (which is the focus of this project). On the other hand, the
conceptual areas that managers feel are important are rated relatively low by the
staff and vice versa. As a planning tool, pattern matches can point to elements that
require attention before decisions are made, as in this case.
The next step in exploring concept map data allows the researcher to come full
circle, back to the level of specific statements or issues in the domain. The
researcher can develop bivariate value plots, also labeled go zones, that show the
average rating values of each statement in relation to the other statements in its
conceptual cluster. An example is shown in Figure 14.13. The horizontal axis shows
the importance rating for managers; the vertical axis shows the rating for staff. Statements are displayed with their identifying numbers. The plot is divided into quadrants based on
the average for each axis. The upper-right quadrant indicates the statements that
are rated above average on importance by both managers and staff. The plot takes
its name from this quadrant, which is sometimes called the go zone, to indicate that
these are the first issues one might typically go to in thinking about action plan-
ning, because they are the ones both managers and staff agree are important. The
participants review these plots and use them as the basis for an initial discussion
about action. Such plots can be valuable to planners, researchers, and evaluators in
agencies or organizations because they enable one to identify issues that are high
value by agreement.
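A small sketch (illustrative, with assumed variable names) of how the go-zone quadrant can be identified from the two sets of statement-level means follows.

```python
# Hypothetical sketch: a statement falls in the "go zone" when it is rated
# above the mean by both subgroups within its cluster.
import numpy as np

def go_zone(means_group_a, means_group_b, statement_ids):
    a = np.asarray(means_group_a, dtype=float)
    b = np.asarray(means_group_b, dtype=float)
    in_zone = (a > a.mean()) & (b > b.mean())   # upper-right quadrant
    return [s for s, keep in zip(statement_ids, in_zone) if keep]

# Example with made-up ratings and statement numbers:
# go_zone([3.1, 4.2, 4.6], [2.9, 4.4, 4.1], [12, 47, 65])  ->  [47, 65]
```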
Step 6: Utilization
The last step in the process is utilization. The critical issue here is ensuring that the
work undertaken to construct the rich conceptual framework, both by the researcher
and by the community of interest, is used as the foundation for whatever application
is desired. The next section describes a variety of applications in social science
research, in both use (theory building, program development, measurement and
evaluation) and area of study or research (health, mental health, education, etc.).
Looking back at our Delaware Advisory Council on Cancer Incidence and
Mortality example gives us some insight into how a larger-scale project might
evolve. Figure 14.14 shows the cluster map of 118 statements, which was authored,
in effect, by the sorts of 32 individuals. The key concepts for programming and
innovation focus are the ring of clusters surrounding the central cluster; and the
central cluster itself represents management and oversight as part of the overall
plan. Figure 14.15 is an example of one of the DCC clusters' Go Zones, indicating
in the top-right quadrant the items of highest importance and highest feasibility
according to participants. This provided the consortium with specific recommen-
dations for action to address the focus of the initiative; the Go Zone for each clus-
ter was queried, interpreted, and used to inform the plan and set milestones for
each topic.
Figure 14.14 Final Concept Map of the Delaware Advisory Council on Cancer Incidence and Mortality
SOURCE: Delaware Department of Health and Social Services, Division of Public Health.
The plot arrays each statement by its average importance and feasibility ratings; its legend lists the statements in the cluster, including the following (statement numbers in parentheses):
Look at whether the high cancer rate is related to an aging population. (27)
Use miscarriage/birth defect case data to look at the effects of carcinogenic exposure. (29)
Determine how many deaths from the 6 leading cancers in DE were avoidable. (30)
Determine the effect of the Cancer Society on recidivism rates in DE. (32)
Consider the potential years of life lost to certain types of cancer when setting state priorities. (33)
Conduct a study to determine the impact of the use and abuse of alcohol. (48)
Address the high incidence and death rate for African Americans. (54)
Standardize specific, accurate reporting of the cause of death for cancer patients. (60)
Examine how the high incidence of HIV and other sexually transmitted diseases impacts the cancer rate (sarcomas, cervical). (64)
Use registry data to create a geographic map of cancer incidence (by type), including the risk factors and causes associated with each area. (65)
Focus on survival rates due to early detection and advancements in treatment. (69)
Research cancer incidence and cure rates for children (18 yrs. and younger). (81)
Determine why DE cancer rate is so high. (88)
Figure 14.15 An Example of a Bivariate Go Zone Plot for Cluster Research and Data Analysis in
the Delaware Advisory Council on Cancer Incidence and Mortality Project
SOURCE: Delaware Department of Health and Social Services, Division of Public Health.
Theory Development
Because concept mapping is a structured methodology for identifying what a
group of people think about some topic, it is hardly surprising that one of its major
uses in applied social research contexts has been for exploring or developing theo-
ries or models that can subsequently be assessed empirically. Over the past several
decades, there has been a broad recognition in applied social research that articula-
tion of program theory (Bickman, 1986; Chen & Rossi, 1990) is critical to the
understanding of causal relationships between interventions and outcomes (Chen
& Rossi, 1983, 1984; Trochim, 1985, 1989b).
Concept mapping has often been employed to explore the theory or meaning of
some construct or area from a multistakeholder perspective. It has been used to
explore multistakeholder perspectives on primary health care services (Southern, Young, Dunt, Appleby, & Batterham, 2002); how patients cope with illness and with
the health care system (DeRidder, Richardson, Severens, & Malsch, 1997); the needs of
children in pediatric hospice and palliative care (Donnelly, Huff, Lindsey, McMahon,
& Schumacher, 2005); how stakeholders perceive services in mental health (Johnsen,
Biegel, & Shafran, 2000); the barriers to racial or ethnic minority application and
competition for NIH (National Institutes of Health) research funding (Shavers et al.,
2005); the problems that persons with traumatic brain injury face (J. P. Donnelly,
K. Z. Donnelly, & Grohman, 2005; K. Z. Donnelly, J. P. Donnelly, & Grohman, 2000);
what is meant by systems thinking in public health (Trochim, Cabrera, Milstein,
Gallagher, & Leischow, 2006); staff's views of a supported employment program for
persons with severe mental illness (Trochim, Cook, & Setze, 1994); gender differences
in perceptions of sexual harassment in the workplace (Hurt, Weiner, Russell, &
Mannen, 1999); what clients perceive as helpful in counseling (Paulson, Truscott, &
Stuart, 1999); factors that affect psychiatric hospitalization (Dumont, 1993); quality of
life (Boevink, Wolf, van Nieuwenhuizen, & Schene, 1995; van Nieuwenhuizen, Schene,
Koester, & Huxley, 2001); quality of care (VanderWaal, Casparie, & Lako, 1996); the
differences in perceptions between student employees and recreational sports admin-
istrators about student employee work in a recreational sports setting (Miller &
Grayson, 2006); student perceptions of issues in their lives as students (Trochim, 1989a); and what clients experience as helpful in counseling (Paulson et al., 1999). In many of these projects, formal models or theories were
not the explicit goal, even though they sometimes resulted from the process.
Concept mapping has been used explicitly for more formal development of a
theory, model, or framework. Some examples include the development of theories
or models of multiconstruct issues such as general practice in health (Batterham
et al., 2002); depression in college students (Daughtry & Kunkel, 1993); tobacco
industry tactics to undermine tobacco control (Trochim, Stillman, Clark, &
Schmitt, 2003); women's perceptions of intimate partner violence experiences
(Burke et al., 2005); and group conflict in organizations (Jackson & Trochim, 2002).
Researchers sometimes focus on a specific construct or concept such as comple-
mentary and alternative medicine (Baldwin, Kroesen, Trochim, & Bell, 2004); fem-
inism (Linton, 1989a, 1989b); caring in nursing (Valentine, 1989); and the construct
of listening (Witkin & Trochim, 1997). In applied studies, researchers engage stake-
holders to validate or extend theories and models, as in the study regarding the
challenges faced by foster parents (Brown & Calder, 1999).
both the overall differences and, if significant effects are found, of multiple com-
parisons to identify specific differences, in a fashion directly analogous to multiple
comparison tests in analysis of variance frameworks.
Concept mapping is a pattern-oriented method. One of the most intriguing
issues is whether patterns can be used in applied social research to help address
problems of noise or variation in measures, that is, to improve statistical power.
Consider a common situation in applied social research where we might conduct
many similarly structured analyses. For instance, using a multi-item scale, we might estimate change from before to after an intervention separately for each item. If none of the item-level comparisons reaches significance, we might be tempted to conclude that the intervention had no effect. But if it were possible to rank order or scale the
expected outcomes of the set of tests, we would be able to correlate the expected
outcomes with the observed test values. It is possible that this correlation is statis-
tically significant, even though none of the individual tests was. If it is easier to
detect significant patterns in situations of low statistical power than to detect point-
specific predictions (Trochim, 1989b), this approach may provide support for an
expanded application of pattern matching. This pattern-matching approach was
taken in a test of the effects of a psychiatric rehabilitation program for supported
employment (Trochim & Cook, 1992). Here, the intriguing finding was that there
was a significant negative correlation between theoretical expectations and observed
change scores (estimated through t tests), leading program staff and researchers to
rethink the theory of the program. In a similar manner, it may be that by overlay-
ing statistical results onto a concept map, we could detect patterns of similarity
among treatment effect estimates not detectable from the estimates themselves
(Caracelli, 1989). This notion has yet to be thoroughly investigated but continues
to have significant potential.
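The logic of that pattern-matching test can be made concrete with a toy sketch; the numbers below are entirely hypothetical and serve only to show the calculation of the correlation between expected and observed outcomes.

```python
# Hypothetical illustration: correlate an expected ordering of item-level
# outcomes with the observed item-level effect estimates (e.g., t statistics).
from scipy.stats import pearsonr

expected_order = [1, 2, 3, 4, 5, 6]          # theoretical ranking of six items
observed_t = [0.3, 0.7, 0.9, 1.2, 1.6, 1.8]  # made-up item-level t statistics

r, p = pearsonr(expected_order, observed_t)
print(f"pattern-match correlation: r = {r:.2f}, p = {p:.3f}")
```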
Although concept mapping has traditionally been viewed as a method for devel-
oping conceptual frameworks, theories, or constructs, it opens up new possibilities
as an analysis approach for qualitative data. Many qualitative methods use proce-
dures that are analogous to the steps in concept mapping or that could easily be
coupled with it. For example, if one had transcribed interview text, typical qualita-
tive analysis would involve the identification of key themes and the organization of
these themes into broader frameworks or rubrics. Concept mapping suggests a way
to do this collaboratively as a type of participatory qualitative analysis where a
group of individual interviewees could be directly involved in the collective the-
matic analysis of their own interview data through sorting and rating of a common
set of excerpted statements from each of their interviews. Such approaches are
already being explored in the use of concept mapping for the participatory analysis
of short open-ended questions on surveys (Burke et al., 2005; Jackson & Trochim,
2002) and in the conduct of community-based participatory research projects
(Trochim et al., 2004).
Group process issues related to concept mapping benefit from ongoing attention
and research. At the very beginning of a project, for example, developing the focus
statement is one of the most critical tasks, but no clear standard method exists to
accomplish this. A structured method for developing and pilot testing alternative
focus statements for the context at hand would be useful (Mercer, 1992).
What are the different types of focus statements that have been used?
How many statements are typically brainstormed in a concept mapping
project?
What's the typical person-to-statement brainstorming ratio?
How many piles do people typically sort statements into?
What is the length of time for participation in each step of the mapping process?
What is the typical distribution of rating variables?
How many clusters do maps typically have, and how much does this vary
from project to project?
What is the distribution of pattern-matching or go-zone correlations?
This points to what may be the most important eventual evolution of concept
mapping: the development of complex and adaptive mapping. Currently, the focus
prompt sets the direction of the project and the brainstormed set of statements is
the universe of available data on that issue. But evolving computer power and the
availability of continuously networked Web-based participant groups suggest a dif-
ferent approach, where focus statements evolve over time and new statements may
be created as previous ones decline in relevance, based on algorithms that use par-
ticipant input. This kind of dynamic modeling would enable maps to evolve in real
time and change as our understanding of the problem changes. This would almost
certainly require the evolution of new statistical and analytic methods, and new
data structures, integrating the principles of MDS but using different algorithms
than those currently used. Where the current method assumes a fixed number of
sorters who process a fixed set of statements, this dynamic alternative would
assume that different people would organize different subsets of statements where
the overlap enables broader common data structures to be estimated and to
emerge. The future may focus on meta-mapping that knits together multiple
existing maps or shards of maps. Newer and more dynamic analytic methods and
data structures will make this more feasible, suggesting the possibility that concept
mapping might be a foundation for a broader, more integrated, and continuously
adaptive mapping of a more general semantic space. Expanding the definitions of
theory, concept, and model with which we began this chapter is both a process, and
a result, of the continued exploration of concept mapping in applied social
research.
Discussion Questions
Exercises
Focus
1. Instruct the group to identify and agree on a social issue or a problem in their
context that requires group input to address. An example might be improving
student housing or increasing sustainability of a specific social program.
2. Draft the focus prompt using the following structure:
A specific (thing, issue, element, need) we need to (do, investigate, identify,
solve) in order to (accomplish the goal of the project) is . . .
Discuss and get agreement on the wording, for the purposes of the exercise.
Brainstorm
1. Identify a facilitator from the group.
2. Instruct the facilitator that the brainstorm is a focused response to the prompt
described above. Basic rules are as follows: all input to the focus is acceptable; no
editing of others' input except for clarity and understanding. Also, redundancy is
acceptable at the brainstorming stage. Items not related to the focus should be
recorded separately so that the input is captured without sidetracking the topic.
Statements should be short and contain only one main idea in response to the prompt.
3. The facilitator will state the focus prompt and ask for input. Another partic-
ipant will write these statements on a white board, or type on a computer, so that
the group can see the statements.
4. After 12 specific statements are generated, the facilitator can end the session.
Sorting
The sorting routine that each stakeholder conducts is the key to the data input
and analysis.
1. Provide each student with a number of blank index cards or small slips of
paper to correspond to the number of statements.
2. Instruct each person to write the statements down and number them exactly as they
were numbered in the brainstorming session. The card should contain the state-
ment and the statement number (in parentheses).
3. Each person will conduct a sort of the 12 statements, according to the fol-
lowing rules:
The focus is on how similar in meaning the statements are to one another.
There is no specific number of clusters that is better than any other.
A statement must be put in only one group; it cannot be put in two at the
same time.
An individual statement may be considered its own group; do not put all
statements in one pile.
Do not create a miscellaneous pile or sort things according to impor-
tance; this is a meaning sort, not a sort for value.
After giving the group 5 to 7 minutes to complete the sort, ask for a show of
hands on the following questions:
How many ended up with 10 piles, how many with 9, how many with 8, etc.?
Ask the group for observations: Where did most people end up on the sort
spectrum? What was the range of sort numbers?
Reinforce that all sort results are valid, provided they follow the simple rules
described.
This part of the exercise has two objectives: to show the linkage between the
individual sort data and the similarity matrix, and to enhance understanding of
the unit of interest in MDS as applied in concept mapping.
Using the sort piles that the participants developed in the previous exercise,
they will create a matrix of similarities. This matrix is the data source for the MDS
analysis.
1. Instruct the participants to look at their sorts and finalize them if needed. Then
have each person fill in a matrix like the one below, as follows:
Statement   1   2   3   4   5   6   7   8   9  10  11  12
        1   /
        2       /
        3           /                   /       /
        4               /
        5                   /
        6                       /
        7                           /
        8           /                   /       /
        9                                   /
       10           /                   /       /
       11                                           /
       12                                               /
Put a slash in the diagonal boxes in the middle of the matrix. Each state-
ment is always sorted with itself.
Pick up one of the piles that you have sorted the statements into and
notice which statements are sorted together in that pile by their identify-
ing number.
Look at the first pair of statements (e.g., 3 and 8). On the matrix, put a
slash in the boxes that represent where 3 and 8 intersect. There will be two
of them.
Look at what else is sorting with Statement 3 (e.g., 10). Put a slash in the
boxes that represent where 3 and 10 intersect. There will be two.
Notice that if you put 3 and 8 together and 3 and 10 together, then 8 and
10 are also together. Put a slash in the two boxes that represent where 8 and
10 intersect.
Continue in this way until all statements that were placed together are
recorded.
2. After 7 or 8 minutes, check to see whether the group is finished or has reached a
point of understanding.
3. Draw the students' attention to the fact that they have each created a binary
square similarity matrix.
4. Instruct each person to work with another person and combine their sorts.
Instruct as follows:
Get together with the person next to you. Decide which sort matrix you'll
use as the base for combining your sorts. It doesn't matter which.
Review the piles that you each had: How many did you each have?
Look at the two matrices side by side. Do they look the same?
Using the base matrix, transfer the information from the second matrix
to combine them on one sheet.
5. This should take about 5 to 7 minutes. Ask
How complicated was that to do? What would a database of 80 statements
with, say, 25 sorters, look like? Would it be feasible to do?
This is a manual exercise to show the construction of the database that
the analysis is built on. This should help to illustrate that the data unit
of interest is not the person who sorted, or even the piles of sorted
items, but rather, the relationship of one idea to every other idea in
the set.
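For readers who want to see this bookkeeping expressed in code, the following is a minimal Python sketch, not part of the exercise itself and not the Concept System software; the pile contents are hypothetical, and the point is simply how individual sorts become binary matrices that are summed across sorters.

```python
# A minimal Python sketch (hypothetical sorts, not the Concept System software)
# of the same bookkeeping: each person's piles become a binary square matrix,
# and the matrices are summed across sorters.
import numpy as np

def sort_to_matrix(piles, n_statements):
    """Turn one person's sort (a list of piles, each a list of 1-indexed
    statement numbers) into a binary n x n co-occurrence matrix."""
    m = np.zeros((n_statements, n_statements), dtype=int)
    np.fill_diagonal(m, 1)  # every statement is always sorted with itself
    for pile in piles:
        for i in pile:
            for j in pile:
                m[i - 1, j - 1] = 1  # marks both (i, j) and (j, i)
    return m

# Two hypothetical sorts of the 12 brainstormed statements
sort_a = [[3, 8, 10], [1, 5], [2], [4, 6, 7], [9, 11, 12]]
sort_b = [[3, 10], [8, 9], [1, 2, 5], [4, 6], [7, 11, 12]]

combined = sort_to_matrix(sort_a, 12) + sort_to_matrix(sort_b, 12)
print(combined)  # cell (i, j) = number of sorters who put i and j together
```

The combined matrix is exactly the data structure the class builds by hand, which is why the unit of analysis is the relationship between ideas rather than the individual sorter.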
Notes
1. The authors wish to thank the Division of Public Health, Delaware Department
of Health and Social Services, for permission to reproduce the information related to the
Delaware Cancer Consortium.
2. The analysis can be accomplished in most standard statistical packages such as SAS or
SPSS. Some programming in these statistical packages would typically be required to get the
data into the appropriate form and to sequence the analytic steps appropriately. Alternatively,
the entire sequence of analytic steps has already been integrated along with data entry and
graphics output of maps, pattern matches, and go-zone graphs into the Concept System soft-
ware that is available from Concept Systems Incorporated (http://www.conceptsystems.com).
3. The term (dis)similarity is used in the MDS literature to indicate that the data can
consist of either dissimilarities or similarities. In concept mapping, the data are always the
square symmetric similarity matrix that is generated from the sorting data, so this discussion
only considers similarity input.
4. The map is the distribution of points that represent the location of objects in
N-dimensional space. In concept mapping, the objects are the brainstormed (or otherwise
generated) statements and the map that MDS produces is the point map in two dimensions.
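To make Notes 2 through 4 concrete, here is a brief, hedged sketch of the core analytic sequence (nonmetric MDS followed by hierarchical clustering) using open-source Python libraries rather than SAS, SPSS, or the Concept System software; the similarity matrix below is a random stand-in, and the choice of four clusters is arbitrary.

```python
# A rough sketch, using scikit-learn and SciPy (an assumption, not the packages
# named in the notes), of the analytic steps described in Notes 2-4.
import numpy as np
from sklearn.manifold import MDS
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
n_statements, n_sorters = 12, 8

# Stand-in for the aggregated square symmetric similarity matrix (Note 3):
# cell (i, j) = number of sorters who placed statements i and j in the same pile.
sims = rng.integers(0, n_sorters + 1, size=(n_statements, n_statements))
sims = np.minimum(sims, sims.T)          # force symmetry
np.fill_diagonal(sims, n_sorters)        # every statement sorts with itself

# MDS expects dissimilarities, so subtract similarities from the maximum possible.
dissims = n_sorters - sims

# Two-dimensional nonmetric MDS produces the point map (Note 4).
points = MDS(n_components=2, metric=False, dissimilarity="precomputed",
             random_state=0).fit_transform(dissims)

# Hierarchical (Ward's) clustering of the point coordinates is one common way
# to group the statements into candidate clusters for a cluster map.
clusters = fcluster(linkage(points, method="ward"), t=4, criterion="maxclust")
print(points.round(2))
print(clusters)
```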
References
Abrahams, D. A. (2004). Technology adoption in higher education: A framework for identifying
and prioritizing issues and barriers to adoption. Unpublished doctoral dissertation,
Cornell University, Ithaca, NY.
Adams, J. L. (1979). Conceptual blockbusting: A guide to better ideas (2nd ed.). New York:
W. W. Norton.
Anderberg, M. R. (1973). Cluster analysis for applications. New York: Academic Press.
Anderson, L. A., Gwaltney, M. K., Sundra, D. L., Brownson, R. C., Kane, M., Cross, A. W.,
et al. (2006). Using concept mapping to develop a logic model for the prevention
research centers program. Preventing Chronic Disease: Public Health Research, Practice
and Policy, 3(1), 19.
Armbruster, B. B., & Anderson, T. H. (1982). Idea mapping: The technique and its use in the classroom, or simulating the ups and downs of reading comprehension (Tech. Rep. No. 36). Urbana: University of Illinois Center for the Study of Reading.
Armbruster, B. B., & Anderson, T. H. (1984). Mapping: Representing informative text graphically. In C. D. Holley & D. F. Dansereau (Eds.), Spatial learning strategies (pp. 198-209). New York: Academic Press.
Axelrod, R. (1976). Structure of decision: The cognitive maps of political elites. Princeton, NJ: Princeton University Press.
Baldwin, C. M., Kroesen, K., Trochim, W. M., & Bell, I. R. (2004). Complementary and conventional medicine: A concept map. BMC Complementary and Alternative Medicine, 4(2). Retrieved April 24, 2008, from www.biomedcentral.com/1472-6882/4/2
Basbøll, T., & Graham, H. (2006). Substitutes for strategy research: Notes on the source of Karl Weick's anecdote of the young lieutenant and the map of the Pyrenees. Ephemera, 6(2), 195-204.
Batterham, R., Southern, D., Appleby, N., Elsworth, G., Fabris, S., Dunt, D., et al. (2002). Construction of a GP integration model. Social Science & Medicine, 54(8), 1225-1241.
Bickman, L. (Ed.). (1986). Using program theory in evaluation. New directions for program evaluation (Series No. 31). San Francisco: Jossey-Bass.
Biegel, D. E., Johnsen, J. A., & Shafran, R. (1997). Overcoming barriers faced by African-American families with a family member with mental illness. Family Relations, 46(2), 163-178.
Boevink, W., Wolf, J., van Nieuwenhuizen, C. H., & Schene, A. H. (1995). Quality of life of long-term mentally ill patients: A conceptual exploration (in Dutch). Tijdschr Psychiatrie, 37, 97-110.
Brown, J., & Calder, P. (1999). Concept-mapping the challenges faced by foster parents. Children and Youth Services Review, 21(6), 481-495.
Burke, J. G., O'Campo, P., Peak, G. L., Gielen, A. C., McDonnell, K. A., & Trochim, W. (2005). An introduction to concept mapping as a participatory public health research methodology. Qualitative Health Research, 15(10), 1392-1410.
Buzan, T., & Buzan, B. (1993). The mind map book: Radiant thinking, the major evolution in human thought. London: BBC Books.
Caracelli, V. (1989). Structured conceptualization: A framework for interpreting evaluation results [Special issue]. Evaluation and Program Planning, 12(1), 45-52.
Carpenter, B. D., Van Haitsma, K., Ruckdeschel, K., & Lawton, M. P. (2000). The psychosocial preferences of older adults: A pilot examination of content and structure. The Gerontologist, 40(3), 335-348.
Carroll, J. D., & Wish, M. (1975). Multidimensional scaling: Models, methods, and relations to Delphi. In H. A. Linstone & M. Turoff (Eds.), The Delphi method: Techniques and applications (pp. 402-431). Reading, MA: Addison-Wesley.
Chen, H., & Rossi, P. (1983). Evaluating with sense: The theory-driven approach. Evaluation Review, 7, 283-302.
Chen, H., & Rossi, P. (1984). Evaluating with sense: The theory-driven approach. In R. F. Conner (Ed.), Evaluation studies: Review annual (Vol. 9). Beverly Hills, CA: Sage.
Chen, H., & Rossi, P. (1990). Theory-driven evaluations. Thousand Oaks, CA: Sage.
Cooksy, L. (1989). In the eye of the beholder: Relational and hierarchical structures in conceptualization. Evaluation and Program Planning, 12(1), 59-66.
Cousins, J. B., & MacDonald, C. J. (1998). Conceptualizing the successful product development project as a basis for evaluating management training in technology-based companies: A participatory concept mapping application. Evaluation and Program Planning, 21(3), 333-344.
Coxon, A. P. M. (1999). Sorting data: Collection and analysis. Thousand Oaks, CA: Sage.
Daughtry, D., & Kunkel, M. A. (1993). Experience of depression in college students: A concept map. Journal of Counseling Psychology, 40(3), 316-323.
Davis, J. (1989). Construct validity in measurement: A pattern matching approach [Special issue]. Evaluation and Program Planning, 12(1), 31-36.
Davison, M. L. (1983). Multidimensional scaling. New York: John Wiley.
DeBono, E. (1971). Lateral thinking for management: A handbook of creativity. London: American Management Association.
DeBono, E. (1973). Lateral thinking: Creativity step by step. New York: Harper & Row.
Delaware Advisory Council on Cancer Incidence and Mortality. (2004). Turning commitment into action: Recommendations of the Delaware Advisory Council on Cancer Incidence and Mortality. Dover: Delaware Department of Health.
DeRidder, D., Depla, M., Severens, P., & Malsch, M. (1997). Beliefs on coping with illness: A consumer's perspective. Social Science & Medicine, 44(5), 553-559.
Dillon, A., Richardson, J., & McKnight, C. (1993). Space: The final chapter or why physical representations are not semantic intentions. In C. McKnight, A. Dillon, & J. Richardson (Eds.), Hypertext: A psychological perspective (pp. 169-192). Chichester, UK: Ellis Horwood.
Donnelly, J. P., Donnelly, K. Z., & Grohman, K. J. (2005). A multi-perspective concept mapping study of problems associated with traumatic brain injury. Brain Injury, 19(13), 1077-1085.
Donnelly, J. P., Huff, S. M., Lindsey, M. L., McMahon, K. A., & Schumacher, J. D. (2005). The needs of children with life-limiting conditions: A healthcare-provider-based model. American Journal of Hospice & Palliative Care, 22(4), 259-267.
Donnelly, K. Z., Donnelly, J. P., & Grohman, K. J. (2000). Cognitive, emotional, and behavioral problems associated with traumatic brain injury: A concept map of patient, family, and provider perspectives. Brain and Cognition, 44(1), 21-25.
Dumont, J. M. (1993). Community living and psychiatric hospitalization from a consumer/survivor perspective: A causal concept mapping approach. Unpublished doctoral dissertation, Cornell University, Ithaca, NY.
Everitt, B. (1980). Cluster analysis (2nd ed.). New York: Halsted Press.
Galvin, P. F. (1989). Concept mapping for planning and evaluation of a big brother/big sister program: Planning and evaluation example. Evaluation and Program Planning, 12(1), 53-58.
Greene, J. C., & Caracelli, V. J. (1997). Advances in mixed-method evaluation: The challenges and benefits of integrating diverse paradigms. In J. C. Greene & V. J. Caracelli (Eds.), New directions for program evaluation (Vol. 74, pp. 5-18). San Francisco: Jossey-Bass.
Gurowitz, W. D., Trochim, W., & Kramer, H. (1988). A process for planning. Journal of the National Association of Student Personnel Administrators, 25(4), 226-235.
Hiltz, S. R., & Turoff, M. (1978). The network nation: Human communication via computer. London: Addison-Wesley.
Holub, M. (1977). Brief thoughts on maps. Times Literary Supplement, 4, 118.
Hurt, L. E., Wiener, R. L., Russell, B. L., & Mannen, R. K. (1999). Gender differences in evaluating social-sexual conduct in the workplace. Behavioral Sciences & the Law, 17(4), 413-433.
Jackson, K., & Trochim, W. (2002). Concept mapping as an alternative approach for the analysis of open-ended survey responses. Organizational Research Methods, 5(4), 307-336.
Johnsen, J. A., Biegel, D. E., & Shafran, R. (2000). Concept mapping in mental health: Uses and adaptations. Evaluation and Program Planning, 23(1), 67-75.
Kane, M., & Trochim, W. (2006). Concept mapping for planning and evaluation. Thousand Oaks, CA: Sage.
Keith, D. (1989). Refining concept maps: Methodological issues and an example. Evaluation and Program Planning, 12(1), 75-80.
Krippendorff, K. (2004). Content analysis: An introduction to its methodology (2nd ed.). Thousand Oaks, CA: Sage.
Kruskal, J. B., & Wish, M. (1978). Multidimensional scaling. Beverly Hills, CA: Sage.
Lewin, K. (1951). Frontiers in group dynamics. In D. Cartwright (Ed.), Field theory in social science: Selected theoretical papers (pp. 188-237). New York: Harper & Row.
Linstone, H. A., & Turoff, M. (Eds.). (1975). The Delphi method: Techniques and applications. Reading, MA: Addison-Wesley.
Linton, R. (1989a). Conceptualizing feminism: Clarifying social science concepts. Evaluation and Program Planning, 12(1), 25-30.
Linton, R. (1989b). Toward a feminist research method. In A. M. Jagger & S. R. Bordo (Eds.), Gender/body/knowledge: Feminist reconstructions of being and knowing (pp. 273-292). New Brunswick, NJ: Rutgers University Press.
Luke, D. A. (2004). Multilevel modeling (Vol. 143). Thousand Oaks, CA: Sage.
Mannes, M. (1989). Using concept mapping for planning the implementation of a social technology. Evaluation and Program Planning, 12(1), 67-74.
Marquart, J. M. (1989). A pattern matching approach to assess the construct validity of an evaluation instrument [Special issue]. Evaluation and Program Planning, 12(1), 37-44.
McLinden, D., & Trochim, W. (1998a). From puzzles to problems: Assessing the impact of education in a business context with concept mapping and pattern matching. In J. Phillips (Ed.), Implementing evaluation systems and processes (Vol. 18, pp. 285-304). Alexandria, VA: American Society for Training and Development.
McLinden, D., & Trochim, W. (1998b). Getting to parallel: Assessing the return on expectations of training. Performance Improvement, 37(8), 21-25.
Mercer, M. L. (1992, November). Brainstorming issues in the concept mapping process. Paper presented at the annual conference of the American Evaluation Association, Seattle, WA.
Mercier, C., Piat, M., Peladeau, N., & Dagenais, C. (2000). An application of theory-driven evaluation to a drop-in youth center. Evaluation Review, 24(1), 73-91.
Michalski, G. V., & Cousins, J. B. (2000). Differences in stakeholder perceptions about training evaluation: A concept mapping/pattern matching investigation. Evaluation and Program Planning, 23(2), 211-230.
Miller, G. L., & Grayson, T. E. (2006). Student employees and recreational sports administrators: A comparison of perceptions. Recreational Sports Journal, 30, 53-69.
Osborn, A. F. (1948). Your creative power. New York: Scribner.
Pammer, W., Haney, M., Wood, B. M., Brooks, R. G., Morse, K., Hicks, P., et al. (2001). Use of telehealth technology to extend child protection team services. Pediatrics, 108(3), 584-590.
Paulson, B. L., Truscott, D., & Stuart, J. (1999). Clients' perceptions of helpful experiences in counseling. Journal of Counseling Psychology, 46(3), 317-324.
Rao, J. K., Alongi, J., Anderson, L. A., Jenkins, L., Stokes, G., & Kane, M. (2005). Development of public health priorities for end-of-life initiatives. American Journal of Preventive Medicine, 29(5), 453-460.
Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and data analysis methods. Thousand Oaks, CA: Sage.
Rosas, S. R. (2005). Concept mapping as a technique for program theory development: An illustration using family support programs. American Journal of Evaluation, 26(3), 389-401.
Rosenberg, S., & Kim, M. P. (1975). The method of sorting as a data gathering procedure in multivariate research. Multivariate Behavioral Research, 10, 489-502.
Rothwell, W. J., & Kazanas, H. C. (1989). Strategic human resource development. Englewood Cliffs, NJ: Prentice Hall.
Shavers, V. L., Fagan, P., Lawrence, D., McCaskill-Stevens, W., McDonald, P., Browne, D., et al. (2005). Barriers to racial/ethnic minority application and competition for NIH research funding. Journal of the National Medical Association, 97(8), 1063-1077.
Shepard, R. N., Romney, A. K., & Nerlove, S. B. (1972). Multidimensional scaling: Theory and applications in the behavioral sciences (Vol. 1). New York: Seminar Press.
Shern, D. L., Trochim, W. M. K., & Lacomb, C. A. (1995). The use of concept mapping for assessing fidelity of model transfer: An example from psychiatric rehabilitation. Evaluation and Program Planning, 18(2), 143-153.
Singer, J. D. (1998). Using SAS PROC MIXED to fit multilevel models, hierarchical models and individual growth models. Journal of Educational and Behavioral Statistics, 24(4), 323-355.
Snijders, T., & Bosker, R. (1999). Multilevel analysis: An introduction to basic and advanced multilevel modeling. Thousand Oaks, CA: Sage.
Southern, D. M., Young, D., Dunt, D., Appleby, N. J., & Batterham, R. W. (2002). Integration of primary health care services: Perceptions of Australian general practitioners, non-general practitioner health service providers and consumers at the general practice-primary care interface. Evaluation and Program Planning, 25(1), 47-59.
Stokols, D., Fuqua, J., Gress, J., Harvey, R., Phillips, K., Baezconde-Garbanati, L., et al. (2003). Evaluating transdisciplinary science. Nicotine and Tobacco Research, 5(Suppl. 1), S21-S39.
Trochim, W. (1985). Pattern matching, validity, and conceptualization in program evaluation. Evaluation Review, 9(5), 575-604.
Trochim, W. (1989a). Concept mapping: Soft science or hard art? Evaluation and Program Planning, 12(1), 87-110.
Trochim, W. (1989b). Outcome pattern matching and program theory. Evaluation and Program Planning, 12(1), 355-366.
Trochim, W. (1993, November). The reliability of concept mapping. Paper presented at the annual conference of the American Evaluation Association, Dallas, TX.
Trochim, W. (1996). Criteria for evaluating graduate programs in evaluation. Evaluation News and Comment: The Magazine of the Australasian Evaluation Society, 5(2), 54-57.
Trochim, W., Cabrera, D. A., Milstein, B., Gallagher, R. S., & Leischow, S. J. (2006). Practical challenges of systems thinking and modeling in public health. American Journal of Public Health, 96(3), 538-546.
Trochim, W., & Cook, J. (1992). Pattern matching in theory-driven evaluation: A field example from psychiatric rehabilitation. In H. Chen & P. Rossi (Eds.), Using theory to improve program and policy evaluations (pp. 49-69). New York: Greenwood.
Trochim, W., Cook, J., & Setze, R. (1994). Using concept mapping to develop a conceptual framework of staff's views of a supported employment program for persons with severe mental illness. Journal of Consulting and Clinical Psychology, 62(4), 766-775.
Trochim, W., & Linton, R. (1986). Conceptualization for planning and evaluation. Evaluation and Program Planning, 9, 289-308.
Trochim, W., Milstein, B., Wood, B., Jackson, S., & Pressler, V. (2004). Setting objectives for community and systems change: An application of concept mapping for planning a statewide health improvement initiative. Health Promotion Practice, 5(1), 8-19.
Trochim, W., Stillman, F., Clark, P., & Schmitt, C. (2003). Development of a model of the tobacco industry's interference with tobacco control programs. Tobacco Control, 12, 140-147.
Valentine, K. (1989). Contributions to the theory of care [Special issue]. Evaluation and Program Planning, 12(1), 17-24.
van Nieuwenhuizen, C., Schene, A. H., Koeter, M. W. J., & Huxley, P. J. (2001). The Lancashire quality of life profile: Modification and psychometric evaluation. Social Psychiatry and Psychiatric Epidemiology, 36(1), 36-44.
VanderWaal, M. A. E., Casparie, A. F., & Lako, C. J. (1996). Quality of care: A comparison of preferences between medical specialists and patients with chronic diseases. Social Science & Medicine, 42(5), 643-649.
Weick, K. E. (1995). Sensemaking in organizations. Thousand Oaks, CA: Sage.
Weller, S. C., & Romney, A. K. (1988). Systematic data collection. Newbury Park, CA: Sage.
White, K. S., & Farrell, A. D. (2001). Structure of anxiety symptoms in urban children: Competing factor models of the revised children's manifest anxiety scale. Journal of Consulting and Clinical Psychology, 69(2), 333-337.
Witkin, B., & Trochim, W. (1997). Toward a synthesis of listening constructs: A concept map analysis of the construct of listening. International Journal of Listening, 11, 69-87.
Yampolskaya, S., Nesman, T. M., Hernandez, M., & Koch, D. (2004). Using concept mapping to develop a logic model and articulate a program theory: A case example. American Journal of Evaluation, 25(2), 191-207.
CHAPTER 15
Mail Surveys
Thomas W. Mangione
Authors' Note: The authors would like to acknowledge their colleague John Carper, MLS,
ALM, corporate librarian at John Snow, Inc., for his able assistance in updating the citations
that follow.
Keep in mind that some of these may not be advantageous in your own area of
research. However, a mail survey can be an especially good choice when (a) you
have limited human resources to help you conduct your study, (b) your questions
can be written in a closed-ended style, (c) your research sample has a moderate to
high interest or investment in the topic, and (d) your list of research objectives is
modest in length.
Many key steps to conceptualizing research questions, developing question-
naires, avoiding errors, and ensuring quality are the same for mailed surveys as for
those administered by other means. Rather than repeating the excellent guidance
provided elsewhere, we focus, instead, on the core elements and considerations that
are unique to the mail survey process (e.g., cover letters, questionnaire graphic
design and instructions, procedures for encouraging returns) and on those aspects
where the mailed format itself could increase the risk for error.
A second broad problem area comes from response error, for example, respondents
misunderstanding the wording of the questions as presented. A central tenet of quan-
titative survey research is that all respondents should understand each question in the
same way so that they are able to provide answers to each from the same frame of ref-
erence. This simply stated goal can be vexing to achieve. Two general rules will help
you write good questions: (a) make them clear and (b) keep them simple; do not go
beyond what is reasonable to expect people to understand or remember. There are
many new tools available to help you reach your goal of creating valid questions, but
still it takes effort to get there. For an excellent summary of the major issues you need
to address to avoid response error, see Fowler and Cosenza (Chapter 12, this volume).
A third problem area is item nonresponse error, the failure of respondents to
answer individual questions. Respondents may leave questions blank or acciden-
tally skip over items. They may not follow instructions and then fill out answers
incorrectly. They may write marginal comments that cannot be equated with your
printed answer categories. If this happens often enough, the data that remain may
be biased. With a mail survey, respondents do not have the benefit of an interviewer
who is able to make clarifications or point the way through various skip patterns. We
will discuss some design and content considerations that should help respondents
to fill out your questionnaire properly.
Finally, the most challenging pitfall for the mail survey researcher involves
nonresponse error, the biased nature of the responding sample. It does not matter how
accurately and randomly you draw a sample if returns come mainly from people
who are biased in a particular way. Unfortunately, it can be difficult to determine
whether a responding sample is biased. Thus, the standard safeguard is to aim to
achieve a high response rate so that nonresponders would have to be very different
from responders to affect your overall estimates for the population (Etter & Perneger,
1997). The next section of this chapter outlines in broad strokes the capacity of
nonresponse error to wreak havoc on your data quality and offers two proven
strategies for avoiding this common and potentially fatal problem. Later in the
chapter, we detail additional strategies to ensure that every component of your mail
survey project is carried out with an eye toward maximizing response rates.
Nonresponse Error
Nonresponse error is the bias that results when you do not get returns from 100%
of your sample. Nonresponse errors distort your picture of the population and cre-
ate problems for your study in two ways. First, if those who do not respond hold
different views or behave differently from the majority of people, your study will
incorrectly report the population average. It will also drastically underreport the
number of people who feel as the nonresponders do. How far off the mark you are
depends on how big the nonresponse is and how different the nonresponders are
from responders (Armstrong & Overton, 1977; Barnette, 1950; Baur, 1947; Blair,
1964; Blumberg, Fuller, & Hare, 1974; Brennan & Hoek, 1992; Campbell, 1949;
Champion & Sear, 1969; Clausen & Ford, 1947; Cox, Anderson, & Fulcher, 1974;
Daniel, 1975; Dillman, 1978; Donald, 1960; Eichner & Habermehl, 1981; Filion,
1975; Gannon, Northern, & Carroll, 1971; Gough & Hall, 1977; Jones & Lang, 1980;
Larson & Catton, 1959; Newman, 1962; Ognibene, 1970; Reuss, 1943; Suchman &
McCandless, 1940). Second, even if nonresponders are not that different, low
response rates give the appearance of a poor-quality study and undermine confi-
dence in its results. The study becomes less useful or less influential simply because
it does not have the trappings of quality.
Nonresponse error poses a particular risk for mail surveys, in that it is so very easy
for recipients not to respond. It is not as if they have to close the door in someone's
face, or even hang up the phone on a persistent interviewer; all they have to do is toss
the survey questionnaire into the wastebasket. In addition, some recipients of mail
surveys who are interested and have good intentions to participate become nonre-
sponders simply because they never get around to filling out the questionnaire.
Unfortunately, in many studies very little can be discerned about the nonresponders,
and we are thus left with uncertainty about the quality of the data. By obtaining a very
high return, you reduce the likelihood that the nonresponders will have an impact on
the validity of your population estimates, even if the nonresponders are different.
What is considered a high response rate? Certainly, a rate of return in excess of
85% is viewed as excellent. With such a rate, it would take a highly unusual set of
circumstances to throw off your results by very much. Response rates in the 70%
to 85% range are viewed as very good. While rates in the 60% to 70% range are con-
sidered acceptable, at this level you should begin to feel uneasy about the charac-
teristics of nonresponders. Response rates between 50% and 60% are barely acceptable;
at this level, you really need some additional information that can contribute to
confidence about the quality of your data. Response rates below 50% are not
scientifically acceptable; after all, at this level, the majority of the sample is not
represented in the results.
In addition to striving for high response rates, it is always useful to try to obtain
information about the nonresponders, so that you can compare them with responders.
Sometimes this information is available from the list that you originally sampled.
For instance, city lists that are used to confirm eligibility for voter registration have
each person's age; gender (not listed explicitly, but you can usually figure it out from
the first name); occupation (in broad categories); precinct or voting district;
whether the person is registered to vote or not and, if registered, party affiliation.
By keeping track of who from your original sample has and has not responded, you
can compare the characteristics of one group to the other.
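As a concrete illustration of this kind of responder/nonresponder comparison, the short sketch below assumes, hypothetically, that your sampling frame records voter registration and that you have flagged who returned the questionnaire; the column names and data are invented, and the cross-tabulation and chi-square test use pandas and SciPy.

```python
# A hedged sketch of comparing responders with nonresponders on a frame
# characteristic. Column names and data are hypothetical.
import pandas as pd
from scipy.stats import chi2_contingency

frame = pd.DataFrame({
    "registered_voter": ["yes", "no", "yes", "yes", "no", "yes", "no", "yes"],
    "returned":         [True, False, True, True, False, False, True, True],
})

# Cross-tabulate the frame characteristic against response status ...
table = pd.crosstab(frame["registered_voter"], frame["returned"])
print(table)

# ... and test whether responders differ from nonresponders on it.
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi-square = {chi2:.2f}, p = {p:.3f}")
```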
From a number of studies (Baur, 1947; Campbell, 1949; Gannon et al., 1971; Gelb, 1975;
Goodstadt, Chung, Kronitz, & Cook, 1977; Ognibene, 1970; Peterson, 1975; Robins, 1963;
Suchman, 1962), we can get a picture of some common traits of nonresponders.
Compared with responders,
they tend to be less educated; they also tend to be elderly, unmarried and male; or
they have some characteristic that makes them seem less relevant to the study (e.g.,
abstainers for a drinking study, nondrivers for a traffic safety study, or lower income
people for a study about mortgages).
Although a variety of response rates are reported in the literature, it is safe to
assume that some of the worst response rates never are published. If you were to
simply stuff questionnaires in envelopes and mail them to people asking them to fill
them out, it would be common to see response rates in the 20% range, though it
would not be surprising to see them in the 5% range too. This is much lower than
the rate of 70% or higher that can inspire confidence in the data. So the funda-
mental question is, how can you achieve the highest possible response rates?
Sending Reminders
The key technique for producing high response rates is the use of reminders
(Denton, Tsai, & Chevrette, 1988; Diamantopoulos & Schlegelmilch, 1996; De
Rada, 2005; Dillman, 1978; Dillman et al., 1974; Eckland, 1965; Edwards et al., 2002;
Erdogan & Baker, 2002; Etzel & Walker, 1974; Evangelista, Albaum, & Poon, 1999;
Filion, 1976; Ford & Zeisel, 1949; Fox, Robinson, & Boardley, 1998; Furse, Stewart,
& Rados, 1981; House, Gerber, & McMichael, 1977; Jones & Lang, 1980; Kanso,
2000; Kanuk & Berenson, 1975; Kephart & Bressler, 1958; Linsky, 1975; Scott, 1961;
Yammarino, Skinner, & Childers, 1991). Even under the best of circumstances, you
will not achieve acceptable levels of return if you send no reminders. In fact, it is
important to send out several, and it is imperative to pay attention to their timing.
As you track the daily returns, an interesting pattern becomes apparent. For the
first few days after questionnaires have been mailed, you will receive nothing. This
makes sense because it takes time for the surveys to be delivered, it takes a short
period for respondents to fill them out, and then it takes a day or two for the
respondents to mail them back (actually this can be a day or two longer with busi-
ness-reply returns). About 5 to 7 days after the initial mailing, you will receive a few
returns; then in the next few days you receive many more, with more coming in
each day than the day before. Around the 10th day after the mailing, returns will
start to level off, and around the 14th day they will drop off precipitously.
An abrupt reduction in returns is a signal that whatever motivational influence
your initial letter had is now fading. Those who have not returned the questionnaire
by now are going to begin to forget about doing it, or the survey is going to get
buried on their desks. At this point in the return pattern, about the 14th day, you
want to have your first reminder arrive.
The initial pattern repeats itself after you send out the first reminder. After a few
days of inactivity, a burst of returns with more coming in each day will be followed
by a precipitous decline at about 14 days. Another interesting feature of this pattern
is that whatever return rate you got in the first wave (e.g., 40%), you will get about
half that number in the second wave (e.g., 20%), and so on for each succeeding wave.
Aiming for at least a 75% return rate, you should plan for at least four mailings:
the initial mailing and three reminders. Each of these mailings should be spaced
about 2 weeks apart. This will result in a pattern of returns something like this:
roughly 40% of the sample after the initial mailing, about 20% more after the first
reminder, about 10% after the second, and about 5% after the third, for a total of
about 75%.
Thus, your total mailing period will take about 8 to 9 weeks, leaving some time
after your last reminder for the final returns to come in. A final point of interest
about this pattern is that the rate of returns and the number of reminders are unre-
lated to the total size of your sample: Follow the same procedures whether your
sample size is 200 or 20,000. The only impact of scale is that you need more staff to
put together the mailings in each round.
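The arithmetic behind this wave pattern is simple enough to sketch in a few lines of Python; the 40% first-wave figure and the halving rule are the chapter's illustrative assumptions, not guarantees for any particular study.

```python
# Projected cumulative return rate under the illustrative assumptions above:
# a 40% first wave, with each succeeding wave bringing in about half as much.
def projected_returns(first_wave_rate=0.40, n_mailings=4):
    rates, cumulative, rate = [], 0.0, first_wave_rate
    for _ in range(n_mailings):
        rates.append(rate)
        cumulative += rate
        rate /= 2  # each succeeding wave yields roughly half the previous one
    return rates, cumulative

waves, total = projected_returns()
print([f"{r:.0%}" for r in waves], f"total = {total:.0%}")
# ['40%', '20%', '10%', '5%'] total = 75%
```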
Sending reminders more frequently than every 2 weeks does not speed up the
returns; it merely wastes time and money reminding people who were going to
respond anyway. Conversely, spacing out two or three reminders over a longer
period than at 2-week intervals (hoping to save money on postage, for example) is
not as effective in producing a good return rate. Your reminder sequence will not
build momentum among the nonresponders, because the time lapse is so long that
they would have forgotten about the survey. In this case, each reminder must start
all over again to motivate people to participate.
Tip
If budget constraints are causing you to consider reducing the number of
reminders or not doing any at all, there is a middle ground. From the sample
sent to the initial mailing, select a random subset to receive the full
sequence of reminders. You will be able to compare their results with those
received from the group with fewer reminders to gauge the extent of bias.
One option is simply to send each reminder mailing to everyone. If you do, always
include a line that says, "If you have already sent in your questionnaire, thank you
very much." This strategy has disadvantages in that (a) it wastes
postage, supplies, and resources; (b) it irritates respondents to receive reminders when
they have already returned their questionnaires; (c) it dilutes your message by apolo-
gizing to people who have already returned their questionnaires and not focusing
exclusively on those who have yet to respond; and, furthermore, (d) reminders sent to
all respondents may confuse some, lead them to worry that their surveys got lost in the
mail and prompt them to fill out new ones. With no way of knowing which surveys
might be duplicates, you cannot remove them from your returns.
A technique that one might call the reminder postcard strategy can sidestep these
concerns. This method maintains complete anonymity for the respondents' returned
questionnaires while letting you know who has and has not returned the question-
naire. Thus, reminders need be sent only to those who have yet to respond. When
using the reminder postcard strategy, enclose in the original mailing of the ques-
tionnaire a postage-paid return postcard imprinted with either an identification
code or the recipient's name (or both). Be sure that the questionnaire itself bears no
identification. In the survey instructions, state explicitly that returning the postcard
lets you know that they do not need any more reminders and ask that they mail the
postcard back separately from the questionnaire. By using this procedure, you know
who has returned the questionnaire without having to put any identifying infor-
mation on the questionnaire itself.
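A minimal sketch of the bookkeeping this strategy requires is shown below; the identification codes and names are invented, and the point is simply that the reminder mailing list is driven by returned postcards rather than by anything printed on the questionnaires.

```python
# Tracking returns via postcards while the questionnaires stay anonymous.
# ID codes and names are hypothetical.
sample = {101: "A. Rivera", 102: "B. Chen", 103: "C. Okafor", 104: "D. Smith"}

postcards_returned = {102, 104}  # ID codes from postcards received so far

# Reminders go only to sample members whose postcard has not come back.
reminder_list = {code: name for code, name in sample.items()
                 if code not in postcards_returned}
print(reminder_list)  # {101: 'A. Rivera', 103: 'C. Okafor'}
```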
Some may worry that respondents might simply return the postcard and not the
questionnaire. That would certainly be a problem, but in our experience that has
not been the case. More questionnaires than postcards are returned. Some respon-
dents forget to mail their postcards, some lose them, and a small number (e.g., 5%
or so) purposely do not return them as a way to ensure their anonymity. Thankfully,
there are only a few who take this last route; otherwise the method would not
achieve its intended purpose of providing information about who has responded
while maintaining respondent anonymity.
Providing Incentives
these values to 1990 dollars, and still showed improvements for values less than
50 cents. A more recent study showed that both $2 and $5 generated respectable
and similar response rates (Shaw, Beebe, Jensen, & Adlis, 2001).
The question of whether increased benefit accrues for increasing dollar amounts
is harder to answer definitively. Much of the research to test alternate amounts has
not tended to use sums more than $1; therefore the number of studies we have
available to make generalizations about larger-sized incentives is relatively small. In
their review, Hopkins and Gullickson (1992) did find an increasing percentage of
returns over no incentive control methods for greater incentive values, but their
top group was designated as $2 or more and included only eight studies. A more
recent study by Edwards, Cooper, Roberts, and Frost (2005) showed a steady increase
in response rates for amounts up to $5. In addition, our experience with a recent
nonexperimental study dealing with alcohol use and work included one work site
where we used a $5 prepaid incentive; the resulting response rate was 82%.
Understanding the meaning of the reward to the respondent helps interpret
these findings about larger ($5) and smaller incentives ($1-$2). With small amounts
of money, people clearly do not interpret the reward as fair market exchange for
their time. Even a $1 reward for filling out a 20-minute questionnaire works out to
only a $3 per hour rate of pay. Therefore, people must view the reward in another
light; one idea is that it represents to the respondent a token of good faith or a trust
builder (Dillman, 1978). The respondent feels that the research staff are nice to show
their appreciation by giving the incentive and therefore feels motivated to reciprocate
by filling out the questionnaire.
There have not been many studies reporting on the provision of larger-sized
rewards, but it looks as though response rates tend to be higher for these than for
lesser amounts (Hopkins & Gullickson, 1992; Martinson et al., 2000; Yu & Cooper,
1983). In particular, higher incentive amounts are reported in the literature for
surveys conducted with persons in professional occupations, particularly doctors.
Incentive amounts from $20 to $50 have been used (Godwin, 1979). In these cir-
cumstances, higher response rates are obtained with higher rewards (Berry &
Kanouse, 1987; James & Bolstein, 1992; Jobber, Saunders, & Mitchell, 2004).
Another monetary incentive technique is the use of a lottery prize. This tech-
nique falls within the promised reward category, but has a twist. Respondents are
offered a chance to win a big prize, although they also have, of course, a chance
of getting nothing. Again, research on this variation is limited, so definitive gener-
alizations about its effectiveness are not possible (Gajraj, Faria, & Dickinson, 1990;
Hopkins & Gullickson, 1992; Leung et al., 2002; Lorenzi, Friedmann, & Paolillo,
1988; Martinson et al., 2000). The logic behind this idea is that the chance of hitting
it big will be such an inducement that respondents will fill out their surveys to
qualify. This technique also works well if you are trying to encourage respondents
to mail in their surveys by a particular deadline.
Of course, to give out the lottery prize incentives, respondents cannot remain
anonymous. To conduct a drawing and give out prizes, you need to know the name
and address (and possibly a phone number) associated with each returned survey.
This lack of anonymity may be counterproductive in some circumstances. The
Another variation is to promise a group-level reward, such as a contribution to a
charity, if the sample as a whole achieves a certain number or percentage of returns
(e.g., a 70% return rate). Our recent work
site study included two sites in which we used the group strategy, with a $750 con-
tribution to a local charity. We achieved response rates of 68% and 78%.
Preserving Confidentiality/Anonymity
If respondents believe that their answers will be kept confidential, rather than
being attributed to them directly, they will be more likely to return a survey (Boek
& Lade, 1963; Bradt, 1955; Childers & Skinner, 1985; Cox et al., 1974; Fuller, 1974;
Futrell & Hise, 1982; Futrell & Swan, 1977; Kerin & Peterson, 1977; McDaniel &
Jackson, 1981; Pearlin, 1961; Rosen, 1960; Wildman, 1977). There are a number of
straightforward safeguards for maintaining confidentiality. First, never write respon-
dent names or addresses directly on the questionnaires. Instead, use code numbers
on the surveys and maintain a separate list of names and addresses with their cor-
responding code numbers. Keep the list out of the view of people who are not on
the research team. Second, when the questionnaires come back, do not leave them
lying around for curious eyes to peruse. Store returns in file cabinets, preferably
locked when you are not present; lock your office when you are not there. Third,
do not talk to colleagues, friends, or family about the responses you receive on
Tip
For both community-based surveys and institutional-based surveys, any
publicity you can garner that includes support from leaders (mayor, plant
manager, company doctor, or union leader) will reassure people who are
concerned that the study will not make a difference or that it is not strictly
confidential.
you should not leave questionnaires lying around for curious eyes to view and you
should not report data for small, identifiable groups of respondents.
First-class indicia can also be considered for outgoing postage. This is similar to
the business-reply franking, except that it is used for outgoing first-class mail. You
set up a prepaid account with the postal service and print
your account number and a first-class designation on your
outgoing envelopes (the imprint reads, for example, "First Class Mail / U.S. Postage
PAID / Boston, MA / Permit No. 108"). The postal service keeps track of your mailings
and deducts the postage amounts from your account. This is the least labor-intensive
method of sending out your questionnaires, but it probably suffers somewhat from the
same problem as metered mail in that it may be confused with junk mail.
Another alternative to consider for outgoing postage is to use premium
postage/shipping for mailings, such as special delivery or next-day delivery services.
The research shows that there is some advantage to using this type of postage, but the
costs are so substantial that many consider it prohibitive (Clausen & Ford, 1947;
Hager, Wilson, Pollak, & Rooney, 2003; Kephart & Bressler, 1958). When special
postage is used, it is most often for final reminders. At least at this stage of the process
you are mailing only to part of your sample and therefore the cost impact is less.
However, a study by Schmidt, Calantone, Griffin, and Montoya-Weiss (2005) showed
no extra benefit for certified mail over first-class mail when used as a third reminder.
Jones & Lang, 1980; Jones & Linda, 1978; Kanso, 2000; Peterson, 1975; Roeher,
1963; Watson, 1965). For example, they are more likely to respond to surveys that
are sponsored by government agencies or well-known universities (Houston &
Nevin, 1977; Jones & Lang, 1980; Jones & Linda, 1978; Peterson, 1975). Also, when
the cover letter is on university or government agency letterhead, recipients may be
less concerned that the survey is a ploy to send them a credit card or sell them insur-
ance. Taking further advantage of the institutional affiliation, do not refer to the
study name alone (e.g., the Healthy Family Study); instead, include the name of the
university or research institution as well (e.g., the Famous and Well-Regarded
University's Healthy Family Study).
Start with a first sentence that captures attention and encourages the recipient
to read the rest of the letter. For example, in a study of police officers concerning
gambling enforcement policies, we started our letter with, "We would like the
benefit of your professional experience and 10 minutes of your time!" For a corporate
study of alcohol policies, we started with, "Many people are concerned about
alcohol abuse in the workplace."
Describe why this study is important and how the information may be used.
Respondents want to participate in activities that they think are useful and that
relate to their lives in some specific way.
Explain who is being asked to participate in the survey and how you got this
person's name and address.
Discuss whether this survey is confidential or anonymous, and describe
exactly how privacy will be achieved.
Make it clear that participation in the study is voluntary, but emphasize
the importance of the recipient's participation. If an incentive is to be provided, be
sure to describe it as a good-will gesture, not as a ploy to coerce participation.
Tell the recipient how to get in touch if he or she has questions. Include the
name of a contact person and a phone number, perhaps even a toll-free number
or the instruction to call collect.
Show how the respondent can return the questionnaire to you, pointing out
the return envelope and noting that it is stamped and preaddressed, for example.
Use clear language, and keep the needs of your audience in mind when
choosing font and type size, reading and language level, and layout and reproduc-
tion quality.
question numbers and question text, and consistent line spacing between questions
and their groups of response categories and between questions themselves. On
occasion, a sequence of questions might create the need for a page break three quar-
ters of the way down because the next question will not fit in the remaining space.
It is important to make such a page look complete by increasing the space between
items so that the page balances with the one that it faces. The same principle applies
to the use of pages with a two-column format. Make sure that both columns are
equally filled. In achieving balance, it is better to have more white space around
questions than to produce an overly crammed or squeezed look. This is an area
where you will benefit by having someone with a good eye for graphic detail peri-
odically review the layout of your questionnaire as it takes shape.
Tip
Although research on the impact of colored paper stock (pastel please!) is
not strong, it certainly helps those with messy desks to locate the survey
when they finally decide to complete it.
significantly more pages, then it is not clear that there is a net benefit. If a com-
mercial printer is involved, you can easily specify any dimension; however, if you
are relying on your office printer or copy machine, using odd-size paper may not be
worth the trouble. Finally, smaller-sized questionnaires raise the issue of the size of
the envelope. A small questionnaire rattling around in an envelope made to hold
standard 8.5-by-11-inch paper may not make a polished first impression.
Printing pages back-to-back cuts the number of sheets needed in half, resulting
in a questionnaire that looks less weighty, which may help response rates. We rec-
ommend this style provided that the paper is of sufficient weight to keep the print
from bleeding through to the other side. It also lowers postage costs and blunts
criticism from environmentally conscious respondents.
Another style feature that will have a direct influence on the number of pages in
your questionnaire is the use of a two-column, newspaper-type format. Many ques-
tions that have relatively short response categories (e.g., yes/no, agree/disagree) can be
easily placed in a two-column format. The questions themselves may take up a few
extra lines, but the response categories take up no more space. Using this technique can
reduce the number of pages in your questionnaire by anywhere from 25% to 50%.
With a commercial print vendor, it is a simple task to commission a multipage
questionnaire in a booklet format. However, with a printer/copy machine enabled
for booklet/two-sided printing, you too can produce a polished-looking booklet. For
an 8.5-by-11-inch finished product, use 17-by-11-inch paper (with the pages set up
in the right order to flow in the correct sequence when the pages are folded), fold
each set in the middle and staple into the fold. You may need to purchase a long-arm
stapler if your copier does not have this feature, called saddle stitching. Keep in mind
that the total number of pages in your finished booklet must be divisible by four,
though you can "cheat" if need be by placing the overall instructions on the cover
page and leaving the back page blank (good for inviting additional comments).
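If you are laying out the page order yourself rather than relying on a booklet-printing driver, the small sketch below shows a generic saddle-stitch imposition (an assumption, not a feature of any particular copier): which page numbers share each side of a folded sheet.

```python
# Generic saddle-stitch page imposition for a folded, stapled booklet.
def booklet_imposition(n_pages):
    """Return (side, left page, right page) for each side of each folded sheet.
    The total page count must be divisible by four, as noted above."""
    if n_pages % 4:
        raise ValueError("total page count must be divisible by 4")
    layout = []
    for i in range(n_pages // 4):
        layout.append((f"sheet {i + 1} front", n_pages - 2 * i, 1 + 2 * i))
        layout.append((f"sheet {i + 1} back", 2 + 2 * i, n_pages - 1 - 2 * i))
    return layout

for side, left, right in booklet_imposition(8):
    print(side, "-> pages", left, "and", right)
# sheet 1 front -> pages 8 and 1
# sheet 1 back -> pages 2 and 7
# sheet 2 front -> pages 6 and 3
# sheet 2 back -> pages 4 and 5
```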
As a last step to ensure that your instructions, formatting, and overall layout
actually do make the questionnaire easy to follow and to fill out, see if a few of your
detail-oriented colleagues can complete the survey correctly. Incorporate their feed-
back if appropriate, and then ask a few volunteers who are not part of the survey
research world to do the same.
questionnaire will not. There has been a fair amount of research on this issue, but
the results are muddled because of several confounding factors (Berdie, 1973;
Burchell & Marsh, 1992; Champion & Sear, 1969; Childers & Ferrell, 1979;
Lockhart, 1991; Mason et al., 1961; Roscoe, Lang, & Sheth, 1975; Scott, 1961).
Part of the explanation for these contradictory findings is the different meanings
of "length of a questionnaire" used in these research studies. Is length determined by
the number of questions, the number of pages, or some combination of the two?
For example, 30 questions on three pages may seem different from 30 questions on
six pages. Another confound is that different-length questionnaires may be per-
ceived differently by respondents in terms of interest levels or in terms of impor-
tance. Longer questionnaires may actually be seen as more interesting or more
important because they can impart a fuller picture of a topic than a more cursory
version. Even within one methodological study to test the effects of varying ques-
tionnaire length, it is hard to hold constant other factors that may play a role
in response rates. Many studies that try hard to control for these issues wind up
comparing different-length questionnaires that are actually not so different. For
example, Adams and Gale (1982) compared surveys with one page versus three
pages versus five pages. They found no difference in response rates between one-
and three-page surveys but did find a lower response rate for five-page surveys.
Another limitation on drawing conclusions from findings across a series of studies
is the differences in topics covered, samples, reminder procedures, and so on. In an
ambitious review covering 98 methodological studies, Heberlein and Baumgartner
(1978) were unable to document any zero-order correlation between length mea-
sures and overall responses. However, a more recent review by Edwards et al. (2005)
of randomized clinical trials did find a significant effect for length of the survey.
What is clear from this research is that length by itself is not the sole determining
factor driving response rates. Whatever the length of a questionnaire, other design
factors can influence whether a good response rate is obtained or not. However, in
general, it makes sense that shorter questionnaires will on average do better than
substantially longer versions. To put this statement in its proper context, however,
our recent work site study, which included reminders and incentives, used a
24-page survey and generated an average response rate of 71% across all 16 work sites.
The real challenge for the researcher is to design a questionnaire that efficiently
asks about all the elements that are important to the study. In particular, steer clear
of questions that seem off the topic or that are overly redundant. Avoid long
sequences of questions that try to measure very minor differences in issues: For
example, it is not a good idea to first ask about the length of time the respondent
had to wait in a doctor's waiting room; then ask how long he or she had to wait in
the examining room before the doctor came in; later ask how long the wait was
overall; and finally, prolonging the agony, ask how satisfied the respondent was with
the waiting time (Helgeson et al., 2002).
more generic greeting, such as "Dear Boston resident") or through the use of per-
sonally signed letters. However, neither procedure has consistently shown benefits
for response rates (Andreasen, 1970; Carpenter, 1975; Dillman & Frey, 1974;
Edwards et al., 2002; Frazier & Bird, 1958; Gendall, 2005; Houston & Jefferson,
1975; Kawash & Aleamoni, 1971; Kerin & Peterson, 1977; Kimball, 1961; Rucker,
Hughes, Thompson, Harrison, & Vanderlip, 1984; Simon, 1967; Weilbacher &
Walsh, 1952). Some authors have commented that personalizing the letters may
have just the opposite effect of reducing response rates, because it calls attention to
the fact that the researcher knows the respondent's name.
Giving a Deadline
Giving respondents a deadline encourages them to try harder to return the questionnaire,
rather than putting it aside, meaning to get to it later and then forgetting it. The use
of a deadline gets a little complicated, however, when you are also using reminders.
It is not a good idea to set 2 weeks from now as the deadline for responding, and then
send the respondent a reminder at that time saying, "Please respond; we are giving
you 2 more weeks." On the other hand, giving a deadline of 8 weeks in the future
hardly serves a motivating purpose.
Research, however, does not show any particular advantage in final response
rates from using deadlines. What the research does show is that the returns come in
a little faster (Edwards et al., 2002; Futrell & Hise, 1982; Henley, 1976; Kanuk &
Berenson, 1975; Linsky, 1975; Nevin & Ford, 1976; Roberts, McCrory, & Forthofer,
1978; Vocino, 1977). Consider using soft deadlines that also incorporate the
information about subsequent reminders, for instance, "Please try to respond within
the next week, so we will not have to send you any reminders" (Green, 1996).
The Schedule
Preparing a written schedule will help you manage the mail survey process more
effectively. The schedule allows you to appreciate how the various parts of the mail
survey study must fit together like a jigsaw puzzle for the project to roll out in a timely
fashion. By having a schedule, you can anticipate milestones and their inherent
challenges so that you are not overly rushed to get particular steps accomplished.
In developing a schedule, you will find that several independent processes must,
at various points, merge to create a high-quality mail survey study. These include
the sampling process, the development of the questionnaire, the preparation of the
mailing materials, data collection, and data entry and data file preparation.
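To make these interdependencies concrete, a timeline like the one in the figure that follows can be roughed out in a few lines of code. In the sketch below, the phase names and durations mirror the figure, while the start weeks are purely illustrative placeholders to be replaced with your own dates.

```python
# A minimal scheduling sketch. Durations (in weeks) follow the chapter's
# timeline figure; the start weeks are hypothetical placeholders.
phases = [
    # (phase, start week, duration in weeks)
    ("Sampling",                              1, 13),
    ("Questionnaire development",             1, 16),
    ("Materials preparation",                 8, 11),
    ("Data collection",                      15, 11),
    ("Data entry and data file preparation", 16, 17),
]

project_end = max(start + weeks - 1 for _, start, weeks in phases)
print(f"Planned project span: weeks 1-{project_end}\n")

for name, start, weeks in phases:
    end = start + weeks - 1
    bar = " " * (start - 1) + "#" * weeks          # crude text Gantt bar
    print(f"{name:<38} wk {start:>2}-{end:>2}  {bar}")
```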
Figure: Overall Mail Survey Timeline (in weeks), showing overlapping phases of
Sampling (13 weeks), Questionnaire Development (16 weeks), Materials Preparation
(11 weeks), Data Collection (11 weeks), and Data Entry and Data File Preparation
(17 weeks), spanning roughly Weeks 6 through 32.

Quality Controls

Checking the quality of work from your mailing team is also important. There
are many areas to consider. Everything that is word processed must be sent through
a spell check program with every change reviewed before acceptance. All materials
must be carefully proofread before they are sent to the printer. Special attention
should be paid to contact information and telephone numbers, return addresses
and postal indicia, and consistency in punctuation and layout. Ideally, proofreading
should be done by at least two individuals: Someone who is familiar with the project
and someone who is not involved in the study on a day-to-day basis. Above all, be
extremely careful about last-minute changes; sometimes in the rush to revise some-
thing, new errors are created.
The core tasks for bringing out a mail survey are stuffing the envelopes and
putting on mailing labels. Often, this phase has relatively simple steps: insert a
letter, a numbered questionnaire, and a stamped return envelope into an outgoing
envelope; put a mailing label on it; seal it; affix postage; and mail it. However, even with a
straightforward process, things can, and do, go wrong. Someone can forget to insert
a cover letter or may incorrectly number or forget to number a questionnaire. The
wrong labels could go with the wrong questionnaires or they might be put on
crooked. The postage could be insufficient or missing; the envelopes could be sent
without being sealed or with the seal not firmly glued. Assume that if something can go
wrong, it will go wrong on occasion. All these nightmares have happened at one
point or another on our projects, even though we were trying to be diligent.
If the study is more complex, then even more things can go wrong. As the mail-
ing process requires more steps and more people to carry them out, there is much
more room for things to go wrong. One way to help ensure the ultimate quality of
your product is to analyze the work flow of the questionnaire mailing assembly
process. As you do so, think about mistakes that could be made, then design
processes in a way that minimizes the potential for mistakes and maximizes your
ability to monitor the work of others.
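As one way of designing in that kind of monitoring, the sketch below runs simple consistency checks over a hypothetical assembly log; the record fields and the example packets are invented for illustration and would be adapted to your own mailing procedure.

```python
# Toy quality-control pass over a (hypothetical) mailing-assembly log.
packets = [
    {"questionnaire_id": 101, "label_id": 101, "cover_letter": True,  "return_envelope": True,  "postage": True},
    {"questionnaire_id": 102, "label_id": 102, "cover_letter": True,  "return_envelope": False, "postage": True},
    {"questionnaire_id": 103, "label_id": 104, "cover_letter": True,  "return_envelope": True,  "postage": True},
]

def packet_problems(p):
    """Return a list of assembly mistakes detected for one outgoing packet."""
    problems = []
    if p["questionnaire_id"] != p["label_id"]:
        problems.append("mailing label does not match questionnaire number")
    for item in ("cover_letter", "return_envelope", "postage"):
        if not p[item]:
            problems.append("missing " + item.replace("_", " "))
    return problems

for p in packets:
    issues = packet_problems(p)
    if issues:
        print(f"Packet {p['questionnaire_id']}: " + "; ".join(issues))
```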
Surveys in Cyberspace
Many of us think back to the years before the dawn of the Information Age and
fondly recall the quaint ways we reviewed the literature (went to the library), wrote
proposals (stocked up on correction fluid), bought lunch (slipped out for pizza),
and stayed in touch with Aunt Martha (rummaged for a stamp). Electronic tech-
nologies have transformed our lives to the extent that these and so many other
activities now can be accomplished from the comfort of our offices. When carried
out electronically, each of these involves completing forms or questionnaires of one
kind or another. Furthermore, each transaction is dispatched into the ether with the
sender's faith that information or a product will be returned or that key details will
be securely recorded in a database.
It is astonishing to consider how these new developments in technology will
make things easier as we go forward. This is especially true as we peer over the lead-
ing edge of electronic survey development and administration. With programmers
and Web design professionals on our team, we can use e-mail to contact potential
subjects, explain the study, invite them to participate, and automatically send
reminders. We can embed within that invitation and later reminders a direct link to
the Web site where the survey can be accessed. We can attach a unique code to the
invitation so that respondents can complete the survey only once. Or, if the research
is related to something the respondent registers for online, we can help the
respondent choose a unique user identification and password.
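A minimal sketch of the unique-code idea appears below. The survey URL, the e-mail addresses, and the helper names are hypothetical, and a real system would store the token list securely and server-side rather than in memory.

```python
# One-per-respondent survey tokens, so each invitation can be used only once.
import secrets

respondents = ["ann@example.com", "ben@example.com", "cho@example.com"]
tokens = {email: secrets.token_urlsafe(16) for email in respondents}  # keep private
completed = set()                                   # tokens already redeemed

def invitation_link(email):
    return f"https://survey.example.org/q?token={tokens[email]}"

def accept_submission(token):
    """Accept a web submission only if the token is valid and unused."""
    if token not in tokens.values():
        return "unknown token: reject"
    if token in completed:
        return "token already used: reject duplicate"
    completed.add(token)
    return "accepted"

print(invitation_link("ann@example.com"))
print(accept_submission(tokens["ann@example.com"]))   # accepted
print(accept_submission(tokens["ann@example.com"]))   # rejected as a duplicate
```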
Our questionnaire can be delivered instantaneously with no postal costs; large
numbers of recipients can complete it and submit their responses on the spot (Tse,
1998). Even more exciting to research assistants everywhere, we can help respon-
dents fill out the survey correctly by automatically skipping to the next appropri-
ate question; by insisting (nicely) that some responses are required; by ensuring
that "check one only" instructions are never violated; and by making sure answers
are logically consistent (e.g., not allowing someone to say they were born in the
current year or that they started smoking before they were born).
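The sketch below illustrates, with invented items and rules, the kinds of automated checks such a questionnaire can enforce: required answers, "check one only" items, and simple logical consistency. It is not tied to any particular survey platform.

```python
# Hypothetical edit checks applied to one submitted web response.
from datetime import date

def validate(resp):
    """Return a list of problems found in one response dictionary."""
    problems = []
    current_year = date.today().year

    if resp.get("birth_year") is None:                      # required item
        return ["birth year is required"]
    if resp["birth_year"] >= current_year:                  # consistency check
        problems.append("birth year cannot be the current year or later")

    age_started = resp.get("age_started_smoking")
    if age_started is not None and age_started < 0:         # before being born
        problems.append("cannot have started smoking before being born")

    if len(resp.get("marital_status", [])) > 1:             # 'check one only'
        problems.append("choose only one marital status")
    return problems

print(validate({"birth_year": 2025, "age_started_smoking": -3,
                "marital_status": ["married", "single"]}))
```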
At some point in the future, nearly all mail surveys may be conducted electron-
ically rather than by snail mail. However, that future has not quite arrived. While
good programming can reduce item response and item nonresponse errors to a
great extent, surveys administered electronically are at least as vulnerable to sample
bias and nonresponse error as are their hard-copy cousins (Tse et al., 1995). While
the promise is great, the reality for the moment is that an e-mail/Web survey is a
sensible choice in some fairly limited circumstances (Dillman & Bowker, 2001).
Of course, each mail survey must be sent somewhere, and so potential subjects'
mailing addresses must be known to the research team. When using e-mail to invite
participants to the study, one must have a current list of e-mail addresses. For the
general public, reliable lists of e-mail addresses do not exist. Right now, it is possi-
ble to obtain e-mail lists for affiliates of a particular institution (e.g., a school or
company) if the institution is interested in collecting the data and willing to make
them available to the research team. However, even for institution-based surveys,
some recipients may not use the organization's e-mail system but may use alternative
e-mail systems; some may not check their e-mails very often or at all. This can be
especially true when people have multiple affiliations and use one e-mail system as
their primary one and never or hardly ever check the others.
Some researchers try to overcome the problem of sample bias by disseminating
notification of the study by standard mail and including the address of the Web site
where the survey instrument can be found in the cover letter. This is not a bad solu-
tion by any means; however, it does presume that all who fall in the sample have
ready access to the Internet. Many households do, of course, but not all by a long
stretch (Ranchhod & Zhou, 2001).
However, even if your study design solves the sample access problems, the tradi-
tional, major problem with mail surveys, nonresponse error, is waiting in the wings
(Couper, 2001; Dillman & Bowker, 2001; Kaplowitz, Hadlock, & Levine, 2004; Sills
& Song, 2002; Tourangeau, 2004; Tse, 1998). E-mail and Web-based surveys make
it difficult to implement two critical procedures discussed earlier that ensure good
response rates: reminders and up-front incentives. The reminder problem is
twofold. First, will the respondent even open the e-mail to read the reminder? As
more and more spam saturates the Internet, many of us purposely ignore e-mails
that do not come from familiar sources. Plus, many Internet providers or institutions
use sophisticated spam detectors and filters to block the delivery of "blast"
e-mails (the equivalent of bulk postal mail) or those sent from unknown sources.
Second, how often should we send our reminders? Since e-mails are easy to
produce (no envelopes to stuff!) and cost nothing to send, it can be tempting to
send too many, thereby making the respondent feel harassed. Plus, sending several
reminders to people who may have already completed the questionnaire will make
them very angry indeed. We suggest sending reminders only to those who have not
yet responded. Since there is no need to take into account the time it takes for the
postal service to deliver your letter and return a completed questionnaire, timing
e-mail reminders at intervals of about one week seems appropriate.
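A rough sketch of that logic follows: reminders go only to invitees who have not yet responded, at intervals of about a week, and only up to a small cap. The addresses and the send function are placeholders for whatever your project actually uses.

```python
# Reminder scheduling for an e-mail/Web survey (illustrative only).
from datetime import date, timedelta

invited        = {"ann@example.com", "ben@example.com", "cho@example.com"}
responded      = {"ann@example.com"}          # updated as completions come in
last_contact   = {email: date(2008, 7, 1) for email in invited}
reminders_sent = {email: 0 for email in invited}
MAX_REMINDERS  = 3

def send_reminder(email):                     # stand-in for real e-mail code
    print("reminder sent to", email)

def run_daily(today):
    for email in invited - responded:         # never remind completers
        due = last_contact[email] + timedelta(weeks=1)
        if today >= due and reminders_sent[email] < MAX_REMINDERS:
            send_reminder(email)
            last_contact[email] = today
            reminders_sent[email] += 1

run_daily(date(2008, 7, 9))   # ben@ and cho@ are reminded; ann@ is not
```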
It is very difficult to deliver token up-front incentives by electronic means and,
therefore, most Web-based surveys, if they offer incentives at all, are likely to frame
them as a promised reward (where the respondent may need to provide a mailing
addressanother sticking point, perhaps). E-mailed gift certificates available from
many online merchants can be presented up-front, but this attractive option comes
with a high price tag: minimum denominations range from $5 to $15 or more, a
potential budget breaker. Of late, the Internet has become a distribution medium for
traditional cents-off and dollars-off coupons, which customers can print and
redeem online or at specific retail locations. However, safeguards against coupon
manipulation (counterfeiting, changing the value or expiration date), coupon reuse,
and unauthorized use, along with protections for customer privacy, will have to be
standardized and widely available before these systems can be trusted to distribute
incentives. One solution that currently exists is to make the first contact with a
potential respondent via regular mail; then, of course, you can include the up-front
reward in the mailer.
A final point of concern that may contribute to non-response error is that many
people are justifiably concerned about how personal information relayed electron-
ically is (and often is not) safeguarded. For example, promises of confidentiality
can be viewed with suspicion because of the relative ease of forwarding information
via e-mail (e.g., to a person's supervisor). Your pledge to maintain respondents'
anonymity or confidentiality must be buttressed by reliable security controls over
both electronic and human resources to protect data from hackers, viruses, and
threats to privacy. Be sure to detail these measures and policies, but realize that your
descriptions may be too technical, may be ignored, or simply may not ease some
skeptical subjects' concerns.
In sum, the adoption of electronic methods for survey administration holds
great promise, but we are not there yet. When the technology does arrive in full
force, however, all of the issues we have discussed about improving the quality of
mail surveys will still be relevant.
Summary
We have focused in turn on the various components and phases of the mail survey
process and have tried to give you an in-depth understanding of the unique issues,
potential hazards, and procedures to follow. However, in the real world, we rarely have
the luxury of conducting an ideal project, one where time and money are no
object and where quality can be maximized at each decision point. Instead, each
Discussion Questions
Exercises
Exercise 1
Describe your data collection procedures and develop a detailed timeline for a
15-page mail survey project with a sample size of 2,000 people, for a funder that
will not allow monetary incentives to be used but still demands a high response
rate. Include in the timeline questionnaire development, sampling, details of the
data collection process, data entry, and data analysis.
Exercise 2
How would your procedures and timeline change (if at all) if the funder for the
survey described in Exercise 1 would allow a $10 monetary incentive to be used?
References
Adams, L. L. M., & Gale, D. (1982). Solving the quandary between questionnaire length and
response rate in educational research. Research in Higher Education, 17, 231240.
Allen, C. T., Schewe, C. D., & Wijk, G. (1980). More on self-perception theory's foot
technique in the pre-call/mail survey setting. Journal of Marketing Research, 17, 498–502.
Andreasen, A. R. (1970). Personalizing mail questionnaire correspondence. Public Opinion
Quarterly, 34, 273277.
Armstrong, J. S., & Lusk, E. J. (1987). Return postage in mail surveys. Public Opinion
Quarterly, 51, 233248.
Armstrong, J. S., & Overton, T. S. (1977). Estimating nonresponse bias in mail surveys.
Journal of Marketing Research, 14, 396402.
Baldauf, A., Reisinger, H., & Moncrief, W. C. (1999). Examining motivations to refuse in
industrial mail surveys. Journal of the Market Research Society, 41, 345353.
Barnette, W. L. (1950). Non-respondent problem in questionnaire research. Journal
of Applied Psychology, 34, 397398.
Baur, E. J. (1947). Response bias in a mail survey. Public Opinion Quarterly, 11, 594600.
Berdie, D. R. (1973). Questionnaire length and response rate. Journal of Applied Psychology,
58, 278280.
Berry, S., & Kanouse, D. (1987). Physicians' response to a mailed survey: An experiment in
timing of payment. Public Opinion Quarterly, 51, 102–104.
Biemer, P. N., Groves, R. M., Lyberg, L. E., Mathiowetz, N. A., & Sudman, S. (Eds.). (1991).
Measurement errors in surveys. New York: John Wiley.
Blair, W. S. (1964). How subject matter can bias a mail survey. Mediascope, 8(1), 7072.
Blumberg, H. H., Fuller, C., & Hare, A. P. (1974). Response rates in postal surveys. Public
Opinion Quarterly, 38, 113123.
Blumenfeld, W. S. (1973). Effect of appearance of correspondence on response rate to a mail
questionnaire survey. Psychological Reports, 32, 178.
Boek, W. E., & Lade, J. H. (1963). Test of the usefulness of the postcard technique in a mail
questionnaire study. Public Opinion Quarterly, 27, 303306.
Bradt, K. (1955). Usefulness of a postcard technique in a mail questionnaire study. Public
Opinion Quarterly, 19, 218222.
Brennan, M., & Hoek, J. (1992). Behavior of respondents, nonrespondents and refusers
across mail surveys. Public Opinion Quarterly, 56, 530–535.
Brennan, R. (1958). Trading stamps as an incentive. Journal of Marketing, 22, 306307.
Bright, K. D., & Smith, P. M. (2002). The use of incentives to affect response rates for a mail
survey of U.S. marina decision makers. Forest Products Journal, 52, 2629.
Brook, L. L. (1978). Effect of different postage combinations on response levels and speed of
reply. Journal of the Market Research Society, 20, 238244.
Brunner, A. G., & Carroll, S. J., Jr. (1969). Effect of prior notification on the refusal rate in
fixed address surveys. Journal of Advertising Research, 9, 4244.
Burchell, B., & Marsh, C. (1992). Effect of questionnaire length on survey response. Quality
and Quantity, 26, 233244.
Campbell, D. T. (1949). Bias in mail surveys. Public Opinion Quarterly, 13, 562.
Carpenter, E. H. (1975). Personalizing mail surveys: A replication and reassessment. Public
Opinion Quarterly, 38, 614620.
Champion, D. J., & Sear, A. M. (1969). Questionnaire response rates: A methodological
analysis. Social Forces, 47, 335339.
Childers, T. J., & Ferrell, O. C. (1979). Response rates and perceived questionnaire length in
mail surveys. Journal of Marketing Research, 16, 429431.
Childers, T. L., & Skinner, S. J. (1985). Theoretical and empirical issues in the identification
of survey respondents. Journal of the Market Research Society, 27, 3953.
Clausen, J. A., & Ford, R. N. (1947). Controlling bias in mail questionnaires. Journal of the
American Statistical Association, 42, 497511.
Couper, M. P. (2001). Web surveys: A review of issues and approaches. Public Opinion
Quarterly, 64, 464494.
Cox, E. P., III, Anderson, W. T., Jr., & Fulcher, D. G. (1974). Reappraising mail survey response
rates. Journal of Marketing Research, 11, 413417.
Daniel, W. W. (1975). Nonresponse in sociological surveys: A review of some methods for
handling the problem. Sociological Methods and Research, 3, 291307.
Denton, J., Tsai, C., & Chevrette, P. (1988). Effects on survey responses of subject, incentives,
and multiple mailings. Journal of Experimental Education, 56, 7782.
De Rada, V. D. (2005). Response effects in a survey about consumer behavior. International
Journal of Market Research, 47, 4564.
Diamantopoulos, A., & Schlegelmilch, B. (1996). Determinants of industrial mail survey
response: A survey-on-surveys analysis of researchers' and managers' views. Journal of
Marketing Management, 12, 505–531.
Dickinson, J. R., & Faria, A. J. (1995). Refinements of charitable contribution incentives for
mail surveys. Journal of the Market Research Society, 37, 447453.
Dillman, D. A. (1972). Increasing mail questionnaire response in large samples of the general
public. Public Opinion Quarterly, 36, 254257.
Dillman, D. A. (1978). Mail and telephone surveys: The total design method. New York: John Wiley.
Dillman, D. A., & Bowker, D. K. (2001). The Web questionnaire challenge to survey method-
ologists. In Bosnjak, M. (Eds.), Dimensions of internet science (pp. 159178). Lengerich,
Germany: Pabst Science.
Dillman, D. A., Carpenter, E., Christenson, J., & Brooks, R. (1974). Increasing mail question-
naire response: A four state comparison. American Sociological Review, 39, 744756.
Dillman, D. A., & Frey, J. H. (1974). Contribution of personalization to mail questionnaire
response as an element of a previously tested method. Journal of Applied Psychology,
59, 297301.
Dommeyer, C. J. (1985). Does response to an offer of mail survey results interact with ques-
tionnaire interest? Journal of the Market Research Society, 27, 2738.
Donald, M. N. (1960). Implications of non-response for the interpretation of mail question-
naire data. Public Opinion Quarterly, 24, 99114.
Doob, A. N., Freedman, J. L., & Carlsmith, J. M. (1973). Effects of sponsor and prepayment
on compliance with a mailed request. Journal of Applied Psychology, 57, 346347.
Duncan, W. J. (1979). Mail questionnaires in survey research: A review of response induce-
ment techniques. Journal of Management, 5, 3955.
Eckland, B. (1965). Effects of prodding to increase mail back returns. Journal of Applied
Psychology, 49, 165169.
Edwards, P., Cooper, R., Roberts, I., & Frost, C. (2005). Meta-analysis of randomized trials of
monetary incentives and response to mailed questionnaires. Journal of Epidemiology
and Community Health, 59, 987999.
Edwards, P., Roberts, I., Clarke, M., DiGuiseppi, C., Pratap, S., Wentz, R., et al. (2002).
Increasing response rates to postal questionnaires: Systematic review. British Medical
Journal, 324, 11831192.
Eichner, K., & Habermehl, W. (1981). Predicting response rates to mailed questionnaires.
American Sociological Review, 46, 361363.
Erdogan, B. Z., & Baker, M. J. (2002). Increasing mail survey response rates from an indus-
trial population: A cost-effectiveness analysis of four follow-up techniques. Industrial
Marketing Management, 31, 6573.
Etter, J. F., & Perneger, T. V. (1997). Analysis of non-response bias in a mailed health survey.
Journal of Clinical Epidemiology, 50(10), 11231128.
Etzel, M. J., & Walker, B. J. (1974). Effects of alternative follow-up procedures on mail survey
response rates. Journal of Applied Psychology, 59, 219221.
Evangelista, F., Albaum, G., & Poon, P. (1999). An empirical test of alternative theories of
survey response behavior. Journal of the Market Research Society, 41(2), 227244.
Ferris, A. L. (1951). Note on stimulating response to questionnaires. American Sociological
Review, 16, 247249.
Filion, F. L. (1975). Estimating bias due to nonresponse in mail surveys. Public Opinion
Quarterly, 39, 482492.
Filion, F. L. (1976). Exploring and correcting for nonresponse bias using follow-ups on
nonrespondents. Pacific Sociological Review, 19, 401408.
Ford, N. M. (1967). The advance letter in mail surveys. Journal of Marketing Research, 4, 202204.
Ford, N. M. (1968). Questionnaire appearance and response rates in mail surveys. Journal
of Advertising Research, 8, 4345.
Ford, R. N., & Zeisel, H. (1949). Bias in mail surveys cannot be controlled by one mailing.
Public Opinion Quarterly, 13, 495501.
Fox, C. M., Robinson, K. L., & Boardley, D. (1998). Cost-effectiveness of follow-up strategies
in improving the response rate of mail surveys. Industrial Marketing Management,
27, 127133.
Frazier, G., & Bird, K. (1958). Increasing the response of a mail questionnaire. Journal of
Marketing, 22, 186187.
Fuller, C. (1974). Effect of anonymity on return rate and response bias in a mail survey.
Journal of Applied Psychology, 59, 292296.
Furse, D. H., & Stewart, D. W. (1982). Monetary incentives versus promised contribution
to charity: New evidence on mail survey response. Journal of Marketing Research, 19,
375380.
Furse, D. H., Stewart, D. W., & Rados, D. L. (1981). Effects of foot-in-the-door, cash incen-
tives, and followups on survey response. Journal of Marketing Research, 18, 473478.
Futrell, C., & Hise, R. T. (1982). The effects on anonymity and a same-day deadline on the
response rate to mail surveys. European Research, 10, 171175.
Futrell, C., & Swan, J. E. (1977). Anonymity and response by salespeople to a mail question-
naire. Journal of Marketing Research, 14, 611616.
Gajraj, A. M., Faria, A. J., & Dickinson, J. R. (1990). Comparison of the effect of promised
and provided lotteries, monetary and gift incentives on mail survey response rate, speed
and cost. Journal of the Market Research Society, 32, 141162.
Gannon, M., Northern, J., & Carroll, S. J., Jr. (1971). Characteristics of non-respondents
among workers. Journal of Applied Psychology, 55, 586588.
Gelb, B. D. (1975). Incentives to increase survey returns: Social class considerations. Journal
of Marketing Research, 12, 107109.
Gendall, P. (2005). The effect of covering letter personalization in mail surveys. International
Journal of Market Research, 47(4), 367382.
Gendall, P., Hoek, J., & Brennan, M. (1998). The tea bag experiment: More evidence on
incentives in mail surveys. Journal of the Market Research Society, 40, 347351.
Godwin, K. (1979). Consequences of large monetary incentives in mail surveys of elites.
Public Opinion Quarterly, 43, 378387.
Goodstadt, M. S., Chung, L., Kronitz, R., & Cook, G. (1977). Mail survey response rates:
Their manipulation and impact. Journal of Marketing Research, 14, 391395.
Gough, H. G., & Hall, W. B. (1977). Comparison of physicians who did and did not respond
to a postal questionnaire. Journal of Applied Psychology, 62, 777780.
Green, J. (1996). Warning that reminders will be sent increased response rate. Quality and
Quantity, 30(4), 449450.
Hager, M. A., Wilson, S., Pollak, T. H., & Rooney, P. M. (2003). Response rates for mail
surveys of nonprofit organizations: A review and empirical test. Nonprofit and
Voluntary Sector Quarterly, 32, 252267.
Hancock, J. W. (1940). An experimental study of four methods of measuring unit costs of
obtaining attitude toward the retail store. Journal of Applied Psychology, 24, 213230.
Hansen, R. A. (1980). A self-perception interpretation of the effect of monetary and
non-monetary incentives on mail survey respondent behavior. Journal of Marketing
Research, 17, 7783.
Harris, J. R., & Guffey, H. J., Jr. (1978). Questionnaire returns: Stamps versus business reply
envelopes revisited. Journal of Marketing Research, 15, 290293.
Heaton, E. E., Jr. (1965). Increasing mail questionnaire returns with a preliminary letter.
Journal of Advertising Research, 5, 3639.
Heberlein, T. A., & Baumgartner, R. (1978). Factors affecting response rates to mailed ques-
tionnaires: A quantitative analysis of the published literature. American Sociological
Review, 43, 447462.
Helgeson, J. G., Voss, K. E., & Terpening, W. D. (2002). Determinants of mail-survey response:
Survey design factors and respondent factors. Psychology & Marketing, 19(3), 303328.
Henley, J. R., Jr. (1976). Response rate to mail questionnaires with a return deadline. Public
Opinion Quarterly, 40, 374375.
Hopkins, K. D., & Gullickson, A. R. (1992). Response rates in survey research: A meta-
analysis of the effects of monetary gratuities. Journal of Experimental Education, 61, 5262.
Hopkins, K. D., & Podolak, J. (1983). Class-of-mail and the effects of monetary gratuity
on the response rates of mailed questionnaires. Journal of Experimental Education,
51, 169170.
Hornik, J. (1981). Time cue and time perception effect on response to mail surveys. Journal
of Marketing Research, 18, 243248.
House, J. S., Gerber, W., & McMichael, A. J. (1977). Increasing mail questionnaire response:
A controlled replication and extension. Public Opinion Quarterly, 41, 9599.
Houston, M. J., & Jefferson, R. W. (1975). The negative effects of personalization on response
patterns in mail surveys. Journal of Marketing Research, 12, 114117.
Houston, M. J., & Nevin, J. R. (1977). The effects of source and appeal on mail survey
response patterns. Journal of Marketing Research, 14, 374377.
Hubbard, R., & Little, E. (1988). Promised contributions to charity and mail survey
responses: Replication with extension. Public Opinion Quarterly, 52, 223–230.
James, J. M., & Bolstein, R. (1990). Effect of monetary incentives and follow-up mailings on the
response rate and response quality in mail surveys. Public Opinion Quarterly, 54, 346361.
James, J. M., & Bolstein, R. (1992). Large monetary incentives and their effect on mail survey
response rates. Public Opinion Quarterly, 56, 442453.
Jobber, D., & O'Reilly, D. (1996). Industrial mail surveys: Techniques for inducing response.
Marketing Intelligence & Planning, 14, 29–34.
Jobber, D., & O'Reilly, D. (1998). Industrial mail surveys: A methodological update.
Industrial Marketing Management, 27, 95–107.
Jobber, D., Saunders, J., & Mitchell, V.-W. (2004). Prepaid monetary incentive effects on mail
survey response. Journal of Business Research, 57(4), 347350.
Jolson, M. A. (1977). How to double or triple mail response rates. Journal of Marketing,
41, 7881.
Jones, W. H., & Lang, J. R. (1980). Sample composition bias and response bias in a mail
survey: A comparison of inducement methods. Journal of Marketing Research, 17, 69–76.
Jones, W. H., & Linda, G. (1978). Multiple criteria effects in a mail survey experiment. Journal
of Marketing Research, 15, 280284.
Kalafatis, S. P., & Blankson, C. (1996). An investigation into the effect of questionnaire iden-
tification numbers in consumer mail surveys. Journal of the Market Research Society,
38(3), 277284.
Kalafatis, S. P., & Madden, F. J. (1995). The effect of discount coupons and gifts on mail
survey response rates among high involvement respondents. Journal of the Market
Research Society, 37, 171184.
Kanso, A. (2000). Mail surveys: Key factors affecting response rates. Journal of Promotion
Management, 5, 316.
Kanuk, L., & Berenson, C. (1975). Mail surveys and response rates: A literature review.
Journal of Marketing Research, 12, 440453.
Kaplowitz, M., Hadlock, T., & Levine, R. (2004). A comparison of web and mail survey
response rates. Public Opinion Quarterly, 68(1), 94–101.
Kawash, M. B., & Aleamoni, L. M. (1971). Effect of personal signature on the initial rate of
return of a mailed questionnaire. Journal of Applied Psychology, 55, 589592.
Kephart, W. M., & Bressler, M. (1958). Increasing the responses to mail questionnaires. Public
Opinion Quarterly, 22, 123132.
Kerin, R. A., & Peterson, R. A. (1977). Personalization, respondent anonymity, and response
distortion in mail surveys. Journal of Applied Psychology, 62, 8689.
Kernan, J. B. (1971). Are bulk rate occupants really unresponsive? Public Opinion Quarterly,
35, 420424.
Kimball, A. E. (1961). Increasing the rate of return in mail surveys. Journal of Marketing,
25, 6365.
LaGarce, R., & Washburn, J. (1995). An investigation into the effects of questionnaire format
and color variations on mail survey response rates. Journal of Technical Writing and
Communication, 25(1), 5770.
Larson, P. D., & Chow, G. (2003). Total cost/response rate trade-offs in mail survey research:
Impact of follow-up mailings and monetary incentives. Industrial Marketing Management,
32, 533537.
Larson, R. F., & Catton, W. R., Jr. (1959). Can the mail-back bias contribute to a study's
validity? American Sociological Review, 24, 243–245.
Leung, G. M., Ho, L. M., Chan, M. F., Johnston, J. J., & Wong, F. K. (2002). The effects of cash
and lottery incentives on mailed surveys to physicians: A randomized trial. Journal of
Clinical Epidemiology, 55, 801807.
Linsky, A. S. (1975). Stimulating responses to mailed questionnaires: A review. Public
Opinion Quarterly, 39, 82101.
Lockhart, D. C. (1991). Mailed surveys to physicians: The effect of incentives and length on
the return rate. Journal of Pharmaceutical Marketing and Management, 6, 107121.
Lorenzi, P., Friedmann, R., & Paolillo, J. (1988). Consumer mail survey responses: More
(unbiased) bang for the buck. Journal of Consumer Marketing, 5, 3140.
Martin, J. D., & McConnell, J. P. (1970). Mail questionnaire response induction: The effect of
four variables on the response of a random sample to a difficult questionnaire. Social
Science Quarterly, 51, 409414.
Martinson, B. C., Lazovich, D., Lando, H. A., Perry, C. L., McGovern, P. G., & Boyle, R. G.
(2000). Effectiveness of monetary incentives for recruiting adolescents to an interven-
tion trial to reduce smoking. Preventive Medicine, 31, 706713.
Mason, W. S., Dressel, R. J., & Bain, R. K. (1961). An experimental study of factors affecting
response to a mail survey of beginning teachers. Public Opinion Quarterly, 25, 296299.
McCrohan, K. F., & Lowe, L. S. (1981). A cost/benefit approach to postage used on mail ques-
tionnaires. Journal of Marketing, 45, 130133.
McDaniel, S. W., & Jackson, R. W. (1981). An investigation of respondent anonymity's effect
on mailed questionnaire response rate and quality. Journal of the Market Research
Society, 23, 150–160.
Myers, J. H., & Haug, A. F. (1969). How a preliminary letter affects mail survey return and
costs. Journal of Advertising Research, 9, 3739.
Nederhof, A. J. (1983). The effects of material incentives in mail surveys: Two studies. Public
Opinion Quarterly, 47, 103111.
Nevin, J. R., & Ford, N. M. (1976). Effects of a deadline and a veiled threat on mail survey
responses. Journal of Applied Psychology, 61, 116118.
Newman, S. W. (1962). Differences between early and late respondents to a mailed survey.
Journal of Advertising Research, 2, 3739.
Ognibene, P. (1970). Traits affecting questionnaire response. Journal of Advertising Research,
10, 1820.
Parsons, R. J., & Medford, T. S. (1972). The effect of advance notice in mail surveys of homo-
geneous groups. Public Opinion Quarterly, 36, 258259.
Pearlin, L. I. (1961). The appeals of anonymity in questionnaire response. Public Opinion
Quarterly, 25, 640647.
Suchman, E. A., & McCandless, B. (1940). Who answers questionnaires? Journal of Applied
Psychology, 24, 758769.
Taylor, S., & Lynn, P. (1998). The effect of a preliminary notification letter on response to a
postal survey of young people. Journal of the Market Research Society, 40(2), 165173.
Tourangeau, R. (2004). Survey research and societal change. Annual Review of Psychology,
55, 775801.
Tse, A. C. B. (1998). Comparing the response rate, response speed and response quality of
two methods of sending questionnaires: E-mail vs. mail. Journal of the Market Research
Society, 40, 353361.
Tse, A. C. B., Tse, K. C., Yin, C. H., Ting, C. B., Yi, K. W., Yee, K. P., et al. (1995). Comparing
two methods of sending out questionnaires: E-mail versus mail. Journal of the Market
Research Society, 37, 441446.
Vocino, T. (1977). Three variables in stimulating responses to mailed questionnaires. Journal
of Marketing, 41, 7677.
Walker, B. J., & Burdick, R. K. (1977). Advance correspondence and error in mail surveys.
Journal of Marketing Research, 14, 379382.
Warriner, K., Goyder, J., Gjertsen, H., Hohner, P., & McSpurren, K. (1996). Charities, no; lot-
teries, no; cash, yes: Main effects and interactions in a Canadian incentives experiment.
Public Opinion Quarterly, 60(4), 542562.
Watson, J. (1965). Improving the response rate in mail research. Journal of Advertising
Research, 5, 4850.
Weilbacher, W., & Walsh, H. R. (1952). Mail questionnaire and the personalized letter of
transmittal. Journal of Marketing, 16, 331336.
White, E., Carney, P. A., & Kolar, A. S. (2005). Increasing response to mailed questionnaires
by including a pencil/pen. American Journal of Epidemiology, 162(3), 261266.
Wildman, R. C. (1977). Effects of anonymity and social settings on survey responses. Public
Opinion Quarterly, 41, 7479.
Wotruba, T. R. (1966). Monetary inducements and mail questionnaire response. Journal of
Marketing Research, 3, 398400.
Wynn, G. W., & McDaniel, S. W. (1985). The effect of alternative foot-in-the-door manipu-
lations on mailed questionnaire response rate and quality. Journal of the Market Research
Society, 27, 1526.
Yammarino, F. J., Skinner, S. J., & Childers, T. L. (1991). Understanding mail survey response
behavior. Public Opinion Quarterly, 55, 613639.
Yu, J., & Cooper, H. (1983). A quantitative review of research design effects on response rates
to questionnaires. Journal of Marketing Research, 20, 3644.
CHAPTER 16
Disadvantages
A major disadvantage of telephone surveys, even when well executed, is the
limitations they place on the complexity and length of the interview. Unlike the
dynamics of face-to-face interviewing, the average respondent often finds it tire-
some to be kept on the telephone for longer than 20 minutes, especially when the
topic does not interest her or him. In contrast, personal interviewers do not seem
to notice much respondent fatigue even with interviews that last 30 minutes or
longer. Mail and Web surveys also do not suffer as much from this disadvantage as
those questionnaires often can be completed at a respondents leisure over multiple
sessions. Similarly, complicated questions, especially those that require the respon-
dent to see or read something, heretofore have been impossible to display via the
telephone. With the advent of video telecommunication technology via the Web
and telephones, this limitation should diminish.
Other traditional concerns about telephone surveys include the potential for
coverage error. For example, not everyone in the United States lives in a
household with telephone service, and among those who do, not every demographic
group is equally willing to be interviewed, or equally easy to reach, via telephone.
According to the most recently available Federal Communications Commission
statistics, in 2004 approximately 6% of the U.S. public lived in a home without any
telephone, with Arizona (8%), Arkansas (11%), the District of Columbia (8%),
Georgia (9%), Illinois (10%), Indiana (8%), Kentucky (9%), Louisiana (9%),
Mississippi (10%), New Mexico (9%), Oklahoma (9%), and Texas (8%) having the
highest rates of noncoverage. In contrast, coverage in European Union countries
was less problematic, with only Portugal (90% coverage) and Belgium (94% coverage)
having more than 5 in 100 households without a telephone line (IPSOS-INRA, 2004).
Furthermore, currently there are no scientifically accepted ways to incorporate
cell phone and Voice-Over-Internet (VoIP) telephone numbers into the traditional
sampling methods used to survey the U.S. public via telephone (see Brick et al., 2007;
Brick, Dipko, Presser, Tucker, & Yuan, 2006; www.nielsenmedia.com/cellphone
summit/cellphone.html). By the end of 2007, an estimated 20% of U.S. households
had only cell phone coverage (see Blumberg, Luke, & Cynamon, 2006; Tucker,
Brick, & Meekins, 2007). Thus, landline telephone surveys in the United States are
at a disadvantage in reaching certain segments of the general population such as
renters and adults younger than 25 years of age. For other countries, these problems
do not exist, because the business model used to charge wireless customers does not
hamper respondents' willingness to be interviewed on their wireless phones (as it
often does in the United States), nor are there as many restrictive federal
telecommunications policies of the kind that currently hamper survey researchers in
the United States from surveying respondents reached on a cell phone.
In addition, since the advent of number portability1 in the United States in 2004,
researchers can no longer be certain where (in a geographical sense) a respondent
has been reached when contacted on a telephone. Depending on the extent to which
people continue to exercise their right to port their telephone number(s) (in 2005,
approximately 3 million already had done so), and depending on how far
they move from the original geographic area in which they were assigned their
phone number, telephone surveys may suffer the considerable burden of having to
conduct explicit geographic screening of respondents to determine whether the
respondent lives within the geopolitical area being surveyed (see Lavrakas, 2004). If
this were not done, then serious Errors of Commission (false positives) and Errors
of Omission (false negatives) could result from interviewing respondents who are
geographically ineligible for the survey. Furthermore, geographic screening would
lead to an increase in nonresponse. These problems do not exist for researchers outside
the United States.
Noncoverage
As it applies to telephone surveys, noncoverage is the gap that often exists
between the sampling frame (the set of telephone numbers from which a sample is
drawn) and the larger population the survey is meant to represent. To the extent the
group covered by the sampling frame differs in nonignorable ways on variables of
interest from the group not included in the sampling frame, the survey will have
coverage biases. For example, all household telephone surveys in the United States
using random-digit dialing (RDD) landline sampling frames miss households and
persons without telephones and persons with only cell phone service. Thus, RDD
landline surveys have the potential for coverage error if researchers infer findings to
the general public about issues that are correlated with whether or not someone can
be surveyed via a landline telephone. Worldwide, not having a telephone is related
to very low income, low education, rural residency, younger ages of household
heads, and minority racial status. In the United States, having only wireless phone
service is related to many of these same demographic factors and to being a renter.
Thus, there will be some level of nonnegligible coverage errors in many telephone
surveys that sample only households with wired telephone service.
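To make the mechanics concrete, the following bit of arithmetic (with made-up numbers) shows how the size of the noncovered group and the covered/noncovered difference combine to bias an estimate based only on the covered group.

```python
# Illustrative coverage-bias arithmetic; all values are hypothetical.
w_noncovered    = 0.15   # share of the population missing from the frame
mean_covered    = 0.40   # proportion holding some attitude among the covered
mean_noncovered = 0.55   # the same proportion among the noncovered

true_mean = (1 - w_noncovered) * mean_covered + w_noncovered * mean_noncovered
bias = mean_covered - true_mean   # equals w_noncovered * (mean_covered - mean_noncovered)
print(f"true mean = {true_mean:.3f}; covered-only estimate = {mean_covered:.3f}; "
      f"bias = {bias:+.3f}")
```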
Nonresponse
Nonresponse error in a telephone survey occurs when people who are sampled,
but not interviewed, differ as a group in some nonnegligible way from those who
are successfully interviewed. Nonresponse in telephone surveys is due primarily to:
(a) failure to contact sampled respondents, (b) sampled respondents who refuse to
participate, and (c) sampled respondents who have language or health problems.
Since the early 1990s, response rates in telephone surveys of the United States and
European publics have noticeably and continuously declined each year, albeit
slowly (see Curtin, Presser, & Singer, 2005; de Heer, 1999). This is due to a combi-
nation of the public's increasing unwillingness to participate in telephone surveys
because of busy lifestyles, the problems caused by telemarketers and the public's
behavioral responses to avoid such nuisance calls, and the increase in
telecommunications system challenges to reaching a sampled respondent within a
fixed-length field period, especially within the United States.
In the United States, the implementation of the Do Not Call List (DNCL) in
October 2003 appears to have significantly reduced the telemarketing nuisance call
problem, but it is too soon to know with confidence what long-run effect this will
have on response rates in legitimate telephone surveys. Some evidence to date is
promising in that those listed on the DNCL appear more likely to participate when
subsequently sampled for a telephone survey than those who are not. But other
findings are troubling, in that a large minority of the U.S. public would like to have
the DNCL restrictions extended to opinion polls and other types of research sur-
veys (Lavrakas, 2004).
One of the most effective ways to counter nonresponse in a telephone survey is
to make an advance contact via mail with the sampled household before contacting
them via telephone (see de Leeuw, Joop, Korendijk, Mulders, & Callegaro, 2005).
The most effective type of advance-mailed contact is a polite, informative, and per-
suasive letter that is accompanied by a token cash incentive. Lavrakas and Shuttles
(2004) reported experimental findings in very large national surveys of gains in
RDD response rates of 10 percentage points with as little as $2 mailed in advance
of phone contact. Of course, this advance mail treatment requires the ability to
match sampled telephone numbers with accurate mailing addresses, which is
possible approximately 60% to 70% of the time for many U.S. RDD samples if
researchers use multiple vendors for the matching process.
Special training for interviewers is a different approach to reducing the problem
of refusals in telephone surveys. Groves and others (e.g., Groves & McGonagle,
2001; Shuttles, Welch, Hoover, & Lavrakas, 2002) have made advances using care-
fully controlled experiments in testing a theory-based Refusal Avoidance inter-
viewer training curriculum that includes the following:
(a) focus groups with top-performing interviewers that identify the actual
verbiage these interviewers hear from refusers and then map, to each reason
for refusing, the persuasive replies they use to try to convert reluctant
respondents;
(b) communication discourse techniques for extending the time that reluctant
respondents stay on the telephone before hanging up on the interviewer, for
example, posing a conversational question back to the respondent to engage
her or him in a two-way dialogue;2 and
(c) correctly and rapidly identifying the reasons why the respondent is refusing
and delivering relevant persuasive verbiage to counter them.
The results of these experiments have been mixed, with some studies showing
upwards of a 10-percentage-point gain in cooperation among interviewers receiving
this training and other studies showing no effects whatsoever.
In terms of reducing nonresponse associated with noncontacts in telephone sur-
veys, the basic technique is to make many callbacks, scheduled at various times of
the day and days of the week, over as long a field period as possible. That is, the
more callbacks made and the longer the field period, the higher will be the contact
rate in RDD surveys, all other factors being equal. This is problematical for many
surveys, especially those conducted for news purposes because newsworthiness
often exists only for a brief moment in time. In these instances, the only choices a
researcher has are to exercise care in considering the effect of noncontact-related
nonresponse and to weight the data by gathering information in the survey about
the propensity of the respondent to be at home over a longer field period (e.g., the
past week), with those least likely to be at home being assigned weights greater than
1.0 and those most likely to be at home being assigned weights less than 1.0.
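One simple way (among several) to turn such an at-home measure into weights is sketched below with hypothetical data: respondents who report rarely being home receive weights above 1.0, those home most evenings receive weights below 1.0, and the weights are rescaled to average 1.0.

```python
# At-home propensity weights built from a 'how many of the past 7 evenings
# were you at home?' item; data are invented.
respondents = [
    {"id": 1, "evenings_home": 7},
    {"id": 2, "evenings_home": 3},
    {"id": 3, "evenings_home": 1},
]

for r in respondents:
    p_home = max(r["evenings_home"], 1) / 7       # avoid dividing by zero
    r["raw_weight"] = 1 / p_home                  # rarely home -> larger weight

mean_w = sum(r["raw_weight"] for r in respondents) / len(respondents)
for r in respondents:
    r["weight"] = r["raw_weight"] / mean_w        # rescale to mean 1.0
    print(r["id"], round(r["weight"], 2))
```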
In considering how to handle callbacks during any finite field period, not all
RDD telephone numbers merit equal calling effort since many of them are non-
working or otherwise nonresidential, yet are not reliably detected as such by
autodialers or live interviewers. In the United States, this is due in part to the incon-
sistent manner in which local telephone companies handle such nonresidential
numbers. Using data from several extremely large national RDD surveys, Stec,
Lavrakas, and Shuttles (2005) reported that U.S. telephone numbers that have a
repeated Busy-Signal outcome (>5 times) or a repeated Ring-No-Answer outcome
(>10 times) are very unlikely ever to produce an interview, even with as many as 30 call attempts.
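In that spirit, calling rules can retire numbers whose histories are dominated by such outcomes. The sketch below uses the thresholds just cited, plus an overall attempt cap that is a hypothetical, project-specific choice.

```python
# Disposition-based calling rules (illustrative thresholds and histories).
BUSY_LIMIT   = 5     # repeated busy signals beyond this: stop calling
RNA_LIMIT    = 10    # repeated ring-no-answer outcomes beyond this: stop calling
MAX_ATTEMPTS = 15    # overall cap per number; a project-specific assumption

def keep_calling(history):
    """history: list of outcome codes recorded for one telephone number."""
    if len(history) >= MAX_ATTEMPTS:
        return False
    if history.count("busy") > BUSY_LIMIT:
        return False
    if history.count("ring_no_answer") > RNA_LIMIT:
        return False
    return True

print(keep_calling(["busy"] * 6))                        # False: retire the number
print(keep_calling(["ring_no_answer"] * 4 + ["busy"]))   # True: keep trying
```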
Measurement
Not all data that interviewers record during an interview are accurate measures
of the attitudes, behaviors, and demographics of interest. These inaccuracies, in the
forms of both bias and variance, may be due to errors associated with (a) the ques-
tionnaire and/or (b) the interviewers and/or (c) the respondents (see Biemer,
Groves, Lyberg, Mathiowetz, & Sudman, 1991). In thinking about these potential
sources of measurement error, the researchers should consider ways that the nature
and size of such errors might be measured so that the researcher can consider post
hoc adjustments to the raw data gathered from respondents by interviewers. The
best way to base such adjustments on sound empirical evidence is to build experi-
ments into the telephone questionnaire. This is especially important whenever a
researcher is using questions that have not been used in previous surveys, and thus,
their wording is not validated by solid experience. In this case, a researcher should
use an experimental design to test different wordings, even if only a small part of
the sample is exposed to alternative wordings.
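A split-ballot wording experiment of this kind takes very little code to build into the instrument. In the sketch below, the two wordings, the share assigned to the alternative version, and the seeding scheme are all illustrative.

```python
# Random assignment of question wordings, reproducible per sampled case.
import random

WORDINGS = {
    "A": "Do you favor or oppose the proposed city budget?",
    "B": ("Some people favor the proposed city budget and others oppose it. "
          "What about you: do you favor it, oppose it, or haven't you thought "
          "much about it?"),
}

def assign_version(case_id, share_b=0.25, seed=2008):
    """Assign wording 'B' to roughly share_b of cases, 'A' to the rest."""
    rng = random.Random(seed + case_id)       # deterministic for each case
    return "B" if rng.random() < share_b else "A"

for case_id in range(1, 6):
    version = assign_version(case_id)
    print(case_id, version, WORDINGS[version][:40] + "...")
```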
Cost-Benefit Trade-Offs
Every telephone survey should be viewed as an endeavor with a finite set of
resources available. The challenge faced by the researchers is to deploy those
resources in the most cost-beneficial way possible, so as to maximize the quality of
the data that are gathered. The TSE perspective can guide researchers through a
series of choices (trade-offs) that often pit what they know or assume about one
source of potential error against what they know or assume about another source
of potential error. For novice researchers, these considerations can seem forbidding
or even overwhelming. When faced with all the potential threats to a survey's valid-
ity, some may throw up their hands and question the value of the entire survey
enterprise. To do so, however, is to fail to remember that highly accurate surveys are
routinely conducted by researchers who exercise the necessary care.
This chapter serves as an introduction to these considerations as they apply to tele-
phone surveys. This discussion of TSE is meant to alert future researchers to the many
challenges they face in conducting telephone surveys that will be accurate enough for
the purposes for which they are intended. The message to the novice should be clear:
Planning, implementing, and interpreting a survey that is likely to be accurate is a
methodical and time-consuming process, but one well worth the effort.
Computer-assisted telephone interviewing (CATI) is best viewed as a tool that, when
properly implemented on appropriate studies, has the potential to improve the quality
of the resulting data by reducing TSE and/or by more readily producing data that
allow a researcher to conduct post hoc investigations of possible error sources.
Proper implementation of CATI calls for much more than merely purchasing
computers, other hardware, and software. It also requires a proper channeling of
the physical and social environment within a survey facility (see Hansen, 2008;
Kelly, Link, Petty, Hobson, & Cagney, 2008). Ideally, the use of CATI should be
based on a survey organization's desire to reduce TSE. CATI offers great promise for
those concerned with minimizing TSE, but it should never be viewed as a techno-
logical fix that replaces the need for intensive human quality control procedures.
Just the opposite is true: When properly implemented, CATI allows for an increase
in the quality control that humans can impose on the telephone survey process.
3. Decide on the length, in days, of the field period, and the calling rules that
will be used to reach a proper final disposition for all telephone numbers in the
sampling pool that are dialed within the field period. Also, decide at what hours of
each day and on which days of the week calling will occur. For the calling rules,
decide on the maximum number of call attempts per telephone number, how much
time should be allowed to elapse before recalling a busy number, and whether or
not refusal conversions will be performed. In terms of refusal conversions, decide
how much time should elapse before redialing the number while recognizing that
best practice is to allow as many days as possible to pass, within the finite con-
straints of the field period, before redialing the refusing number (discussed later in
more detail).
4. Produce a call-record for each telephone number that will be used to track
and control its call history during the field period. Most CATI systems that control
the processing of a sample have such a feature built in. The information in these
call-records (sometimes referred to as paradata) is very informative for inter-
viewers to review before making each callback to help prepare themselves for the
recontact. The more detailed the information recorded by the previous interview-
ers who contacted the household, the more prepared an interviewer will be for any
subsequent contacts within the household.
5. As the sampling design is being selected, develop and format a draft ques-
tionnaire, keeping in mind how much time, on average, the questionnaire can afford
to take to complete, given the available resources and the purpose and needs of the
survey project.
6. Develop a draft introduction and respondent selection sequence and draft
fallback statements (persuaders) for use by interviewers to help tailor their intro-
duction and help gain cooperation from reluctant sampled respondents (Lavrakas,
1993).
7. Decide whether advance contact will be made with sampled respondents,
such as an advance letter, and, if so, whether an incentive will be included in the
advance mailing.
8. Pilot test and revise survey procedures and the questionnaire. Pilot testing
of all materials and procedures is an extremely important part of any high-quality
telephone survey; an adequate pilot test often can be accomplished with as few as
20 to 30 practice interviews. As part of the pilot stage, the researcher should hold
a debriefing session with the interviewers who participated, the project manage-
ment team, and (ideally) the survey sponsor, to identify any changes that are needed
before the sampling scheme and the respondent selection procedures are finalized,
and before final versions of the questionnaire and other survey materials are printed
or programmed into CATI.
9. Program the script (introduction, respondent selection method, and ques-
tionnaire) into CATI (see House & Nicholls, 1988) or print them onto paper.
10. Hire interviewers and supervisors, and schedule interviewer training and
the data collection sessions. When doing a survey in more than one language, it is
best, from both a data accuracy and a response rate standpoint, to have inter-
viewers interview in only one language. It is best to use native speakers of a lan-
guage rather than using bilingual speakers whose primary language is not the one
in which they will interview exclusively. The value of this approach is that native
speakers also will share cultural similarities with many respondents who speak that
language and, thus, will be able to gain cooperation more readily and probe unclear
answers more effectively.
11. Train interviewers and supervisors (see Tarnai & Moore, 2008). When doing
a survey in more than one language, each group of interviewers should have super-
visory personnel whose primary language matches the language they will use to
conduct interviews.
12. Conduct fully supervised interviews. Decide what portion, if any, of the
interviewing will be monitored (see Steve et al., 2008) and whether any respondents
will be called back to validate the completed interviews (see Lavrakas, 1993).
14. Assign weights (if any) to correct for unequal probability of selection (e.g.,
for multiple-telephone-line households; the number of adults in a household; the
proportion of time in the past year that the household did not have telephone ser-
vice), and for deviations in sample demographic statistics (gender, age, race, edu-
cation, etc.) from known population parameters. In the latter case, adjustments for
education are likely to be the most important because most telephone surveys of
the general public vastly oversample those with high educational attainment and
vastly undersample those with low educational attainment, and many behaviors
and attitudes are highly correlated with educational attainment.
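A compressed, hypothetical illustration of these two ideas, a selection weight reflecting adults and telephone lines followed by a simple post-stratification factor for education, is sketched below; a real project would use more categories and a more careful raking procedure.

```python
# Design weights plus a one-variable post-stratification adjustment (toy data).
sample = [
    {"id": 1, "adults": 2, "phone_lines": 1, "education": "college"},
    {"id": 2, "adults": 1, "phone_lines": 2, "education": "college"},
    {"id": 3, "adults": 3, "phone_lines": 1, "education": "no_college"},
]

# Selection probability is roughly proportional to phone_lines / adults,
# so the design weight is its inverse.
for r in sample:
    r["design_weight"] = r["adults"] / r["phone_lines"]

# Post-stratify so the weighted education mix matches (assumed) population shares.
population_share = {"college": 0.30, "no_college": 0.70}
total_w = sum(r["design_weight"] for r in sample)
sample_share = {
    level: sum(r["design_weight"] for r in sample if r["education"] == level) / total_w
    for level in population_share
}
for r in sample:
    factor = population_share[r["education"]] / sample_share[r["education"]]
    r["final_weight"] = r["design_weight"] * factor
    print(r["id"], round(r["final_weight"], 2))
```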
1. The population of inference, that is, the group, the setting, and the time to
which the findings must generalize: For many telephone surveys, this will be
the entire adult population within a specific geopolitical area. For example,
in the United States, this might be the entire nation, the 48 contiguous states,
some region of the nation (e.g., the South or the West), a state, a large met-
ropolitan area, a county or a combination of counties, a city, a precinct, or
even a smaller neighborhood area. Another key consideration in choosing
the population of inference is the implications such a decision has on the
language(s) in which the survey will be conducted.
2. The target population, that is, the finite population that is purportedly surveyed.
3. The sampling frame, often in list form, that will operationalize the target
population.
In most instances in which the U.S. general public within a geopolitical area
is being surveyed, including rare subgroups within the general population, a researcher
will need to use an RDD frame.4 In contrast, in many European countries, RDD
sampling is not always necessary to reach a representative sample of the public, as
unlike in the United States, nearly all residences have listed telephone numbers
(Kuusela, 2003; Taylor, 2003). In these instances, a directory may exist that can be
used as the sampling frame. When sampling elites or members of special interest
groups via a telephone survey, RDD essentially is never the preferred frame because
it is highly inefficient in reaching these types of respondents. Instead, a list frame
(e.g., the membership of a professional organization) needs to be acquired (or
built) that well covers the target population of interest.
When the RDD frame was first embraced, the Mitofsky-Waksberg approach
became the standard methodology, but this proved to be difficult and costly to imple-
ment accurately and was rather inefficient. Subsequently, many approaches to list-
assisted RDD sampling were devised that were more easily administered, much more
efficient, and thus less costly in reaching sampled respondents (Brick, Waksberg,
Kulp, & Starer, 1995; Tucker, Lepkowski, & Piekarski, 2002). Nowadays, there are sev-
eral reputable commercial vendors that supply accurate, efficient, and reasonably
priced list-assisted RDD sampling pools to survey the public in just about any geo-
graphical area in the United States and in many other countries as well. Thus, it is
unusual for a researcher to engage in the manual approach to generate an RDD sam-
pling pool for the target population (see Lavrakas, 1987, 1993). For those conducting
cross-national telephone surveys, the work of Kish (1994) and Gabler and Häder
(2001) is recommended for guidance in building sampling frames and probability
sampling designs that best represent the respective target population in each country.
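For readers unfamiliar with the mechanics, the following sketch illustrates only the basic logic of list-assisted RDD, under the simplifying assumption that a set of 100-banks (area code + prefix + first two digits of the suffix) known to contain at least one listed residential number is already available; the banks shown use the fictitious 555 prefix and are not real exchanges. Commercial vendors implement this far more elaborately.

```python
# Illustrative list-assisted RDD sketch: build a sampling pool by appending
# random final digits to 100-banks that contain listed residential numbers.
# The banks below are fictitious placeholders, not real exchanges.
import random

random.seed(42)  # reproducible example

# Hypothetical "1+ listed" 100-banks: area code + prefix + first two suffix digits.
listed_banks = ["61255512", "61255534", "61255590"]

def draw_sampling_pool(n_numbers):
    """Return n_numbers distinct 10-digit numbers drawn from the eligible banks."""
    pool = set()
    while len(pool) < n_numbers:
        bank = random.choice(listed_banks)
        last_two = f"{random.randint(0, 99):02d}"
        pool.add(bank + last_two)
    return sorted(pool)

print(draw_sampling_pool(5))
```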
Best practices have not yet been identified by the survey industry, but the issues that must be balanced are as follows:
(a) the extent to which those who can be reached only by cell phone have dif-
ferent attitudes and behaviors from those who can be reached via a tradi-
tional landline (see Blumberg & Luke, 2007; Callegaro & Poggio, 2004; Keeter,
Kennedy, Clark, Tompson, & Mokrzycki, 2007; Vehovar, Belak, Batagelj, &
Cikic, 2004);
(b) the size of the final sample of cell phone respondents with whom interviews
must be completed and whether this will be restricted to cell phone only
respondents or not (see Kennedy, 2007; Lavrakas et al., 2008);
(c) how wireless phone and wired phone exchanges will be mixed in the sam-
pling pool and how respondents reached via a wired line versus those
reached via a wireless phone will be weighted at the analysis stage (see Brick,
Edwards, & Lee, 2007; Kennedy, 2007; Lavrakas et al., 2008; Link, Battaglia,
Frankel, Osborn, & Mokdad, 2007);
(d) how long a questionnaire is reasonable to use with someone reached on
their cell phone (see Brick et al., 2007);
(e) the considerable greater costs that sampling U.S. cell phone numbers require,
in part, because of the restrictions placed by federal and state regulations on
the use of automatic dialing technologies when calling cell phone numbers
(see Lavrakas, Shuttles, Steeh, & Fienberg, 2007; Lavrakas et al., 2008); and
(f) how respondents reached on a cell phone will be incented, have their safety
protected, and how the accuracy of the responses that they provide will be
maximized (see Lavrakas et al., 2007, 2008).
Furthermore, as of 2008, no data exist in the United States on the percentage of households that are cell phone only at the state, county, or city level (see Lavrakas et al., 2007). As such, researchers who are conducting a subnational telephone survey can at best make only an informed guesstimate about the proportion of a mixed landline and cell phone sample that should come from each frame and how to weight and integrate the data that are gathered from each type of sample.
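One generic way to handle the weighting and integration issue raised in point (c) and in the preceding paragraph is composite estimation: respondents in the overlap domain (reachable by both landline and cell phone) have their weights scaled by a mixing factor so that the domain is not double counted. The sketch below is only an illustration of that idea, with invented weights and an arbitrary mixing factor; it is not a recommendation from this chapter.

```python
# Generic dual-frame composite weighting sketch (hypothetical data).
# Domains: "LL_only", "CP_only", "dual". Dual-service cases appear in both
# frames, so their weights are scaled by lam (landline frame) or 1 - lam
# (cell frame) to avoid double-counting that domain.

lam = 0.5  # mixing factor; in practice chosen to reduce variance and bias

cases = [
    {"frame": "landline", "domain": "LL_only", "w": 1.8},
    {"frame": "landline", "domain": "dual",    "w": 1.2},
    {"frame": "cell",     "domain": "CP_only", "w": 2.5},
    {"frame": "cell",     "domain": "dual",    "w": 1.4},
]

for c in cases:
    if c["domain"] != "dual":
        c["composite_w"] = c["w"]
    elif c["frame"] == "landline":
        c["composite_w"] = lam * c["w"]
    else:
        c["composite_w"] = (1 - lam) * c["w"]
    print(c["frame"], c["domain"], c["composite_w"])
```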
Within-Unit Respondent
Selection/Screening Techniques
Some persons unfamiliar with valid telephone survey methods mistakenly
assume that the person who initially answers the telephone is always the one who is
interviewed. This is almost never the case with any survey designed to gather a rep-
resentative within-unit sample of the general population. For example, although
males and females are born at a near 50:50 rate, the adult population in most urban
communities is closer to a 55:45 female/male split. A survey that strives to conduct
interviews with a representative sample of an area's adult population must rely on
a systematic respondent selection procedure to achieve a valid female/male balance,
in part because, on average, a female is more likely than a male to answer the tele-
phone when an interviewer calls. Thus, always interviewing the first person who
answers the telephone would lead to an oversampling of females.
Obviously, when sampling is done from a list and the respondent is known by
name, respondent selection requires merely that the interviewer ask to speak with
that person. But in many instances with list sampling, and with all RDD sampling,
the interviewer will not know the name of the person within the household who
should be interviewed, unless this has been learned in a previous contact with the
household. Therefore, a survey designed to gather estimates of person-level popu-
lation parameters (as opposed to household-level measures) must employ a sys-
tematic selection technique to maximize external validity by lessening the chance of
within-unit noncoverage error.
As a rule, interviewers should neither be allowed to interview the first person
who answers the telephone nor be allowed to interview anyone who is merely will-
ing to be surveyed. Instead, the interviewer should select one designated respondent
in a systematic and unbiased fashion from among all possible eligible respondents
within the unit, who meet the survey's demographic/experiential definition of
a respondent (e.g., an adult who is 18 years of age or older).
Respondents can be selected within a sampling unit using a true probability sampling scheme, one that gives every possibly eligible respondent a known and nonzero chance of selection, although researchers will not always need, nor necessarily want, to employ such an approach. For the purposes of most surveys, it is
acceptable to use a procedure that systematically balances selection along the lines
of both gender and age. Because most sampling units (e.g., households) are quite
homogeneous on many other demographic characteristics (e.g., race, education,
religion), random sampling of units should provide adequate coverage of the pop-
ulation on these other demographic factors.
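As a minimal sketch of such a true probability selection, assuming the interviewer can first enumerate the eligible adults in the household, the code below selects one eligible adult at random and records the known selection probability for later weighting. The labels are placeholders, not a recommended enumeration script.

```python
# Illustrative within-household probability selection: choose one eligible
# adult at random and record the known selection probability
# (1 / number of eligible adults) for use in weighting.
import random

random.seed(3)

def select_respondent(eligible_adults):
    """eligible_adults: list of labels for the enumerated eligible adults."""
    chosen = random.choice(eligible_adults)
    return chosen, 1.0 / len(eligible_adults)

household = ["adult_1", "adult_2", "adult_3"]
respondent, prob = select_respondent(household)
print(respondent, prob)
```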
During the past 30 years, most of the techniques that have been commonly
employed for respondent selection were devised to be minimally intrusive about
gathering personal information at the start of the interviewers contact with the
household, while attempting to provide a demographically balanced sample of
respondents across an entire survey. Because asking for sensitive information
before adequate trust has been developed by the interviewer can seriously increase
telephone survey refusals, and thus nonresponse, researchers have tried to strike a
somewhat difficult balance in their respondent selection techniques, between
avoiding coverage error and avoiding nonresponse error.
"I'd like to speak with the person in your household, 18 years of age or older, who had the last birthday." There is evidence that the birthday methods lead to the correct eligible respondent being interviewed in most, but not all, cases. Evidence also suggests that
some of these errors are not random across a sample (see Lavrakas et al., 2000). As
such, a prudent strategy when using the birthday method is to randomly assign
sampled households to either the next or last birthday, as it is reasoned that the
errors that occur with each technique will balance out across the two.
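A minimal sketch of the random-assignment strategy just described (an illustration only): each sampled household is assigned, with equal probability, to either the next-birthday or the last-birthday version of the selection request before the first call attempt.

```python
# Randomly assign each sampled household to the next- or last-birthday
# respondent selection script, as suggested above (illustrative only).
import random

random.seed(7)

def assign_birthday_method(household_ids):
    """Return a dict mapping household id -> 'next_birthday' or 'last_birthday'."""
    return {hh: random.choice(["next_birthday", "last_birthday"])
            for hh in household_ids}

assignments = assign_birthday_method(range(1, 11))
print(assignments)
```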
(a) identification of the interviewer (i.e., her or his actual first name at a minimum), the interviewer's affiliation, and the survey's sponsor;
(b) a brief explanation of the purpose of the survey and its sampling area (or
target population);
(c) some positively worded phrase to encourage cooperation;
(a) the purpose of the survey and how the findings will be used;
(b) how the respondent's number was selected;
(c) more about the survey firm and/or sponsor than simply a name; and
(d) why the particular respondent selection method is being used.
For each of these questions, written fallback statements (or persuaders) should
be provided to interviewers to enable them to give honest, standardized answers
to respondents who ask them. The goal of these statements is to help interviewers
convince potential respondents that the survey is a worthwhile (and harmless)
endeavor; this should be kept in mind by the person who composes the statements.
(For more details on telephone survey introductions and fallback statements, see
Frey, 1989, pp. 125-137; Lavrakas, 1993, pp. 100-105.)
The most important of these are the resources spent on (a) developing an effective introductory spiel and (b) employing a skilled and well-trained group of interviewers.
interviewer earn a living (or, for the unpaid interviewer, the respondent is helping
the interviewer fulfill her or his obligation). By personalizing the issue of coopera-
tion, the interviewer is neither referring to an abstract incentive, such as to help
plan better social programs, nor appealing in the name of another party (the sur-
vey organization or sponsor).
In addition to training interviewers about what to say to minimize the refusals they
experience, researchers should train them in how to say it, in terms of both attitude
and voice. Collins, Sykes, Wilson, and Blackshaw (1988) found that less successful
interviewers, when confronted with problems such as reluctant respondents, showed "a lack of confidence and a tendency to panic; they seemed unprepared for problems, gave in too easily, and failed to avoid dead ends" (p. 229). The confidence that suc-
cessful interviewers feel is conveyed in the way they speak. Oksenberg and Cannell
(1988) have reported that dominance appears to win out, with interviewers with low
refusal rates being generally more "potent" (p. 268), rather than trying to be overly friendly, ingratiating, and/or nonthreatening. In terms of interviewers' voices,
Oksenberg and Cannell found that those who spoke somewhat faster, louder, with
greater confidence, and in a falling tone (declarative vs. interrogative) had the lowest
refusal rates (cf. Groves, O'Hare, Gould-Smith, Benki, & Maher, 2008).
Refusal Conversions
Due in part to continuing difficulties in eliciting respondent cooperation over
the past three decades, procedures have been developed and tested that are designed
to lessen the potential problems refusals may cause (see Lyberg & Dean, 1992). One
approach involves the use of a structured refusal report form (RRF) that the inter-
viewer completes after encountering a refusal (see Lavrakas, 1993, pp. 78-81). This
form can provide information that may help the sampling pool controller and
interviewers in subsequent efforts to convert refusals (calling back at another time to try to convince a respondent to complete the interview after a refusal was previously encountered) and may help the researcher learn more about the size and
nature of potential nonresponse error. If a researcher chooses to incorporate an
RRF into the sampling process, it is not entirely obvious what information should
be recorded. That is, even in the late-2000s, use of these forms has not received
much attention in the survey methods literature. With this in mind, I urge inter-
ested readers to consider the following discussion of RRFs as suggestive and to fol-
low the future literature on this topic.
Figure 16.1 is an example of an RRF used at my former university survey orga-
nizations. The interviewer completes the RRF immediately after encountering a
refusal. Using the RRF shown in Figure 16.1, the interviewer would begin by
recording who it was within the household that refused, although this is not
always obvious and depends on information that the interviewer is able to glean
prior to the termination of the call. The interviewer might also code some basic
demographics about the person refusing, but only if the interviewer has some
degree of certainty in doing so. Research suggests that interviewers can do this
accurately in a majority of cases for gender, age, and race (Bauman, Merkle, &
Lavrakas, 1992; Lavrakas, Merkle, & Bauman, 1992). To the extent that this demo-
graphic information is accurate, the supervisor can use it to make decisions about
which interviewers should attempt which subsequent refusal conversions. For
example, my own experience and research suggest that an interviewer of the same
race as the person who initially refused will have better success in converting a
refusal. Furthermore, to the extent that respondent demographic characteristics
correlate with survey measures, the researcher could investigate the effects of
nonresponse by considering the demographic characteristics of the unconverted
refusals; however, much more needs to be learned before the validity of this strat-
egy is known. The interviewer can also rate the severity of the refusal, as shown
in Figure 16.1, as well as add comments and answer other questions that may help
explain the exact nature of the verbal exchange (if any) that transpired prior to the
termination of the call. It also is recommended that households in which someone
has told the interviewer at the initial refusal, "Don't call back!" or made some such explicit comment, not be recontacted.
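Because, as noted above, it is not entirely obvious what an RRF should record, the sketch below simply encodes the kinds of fields discussed in this section as a hypothetical data structure; the field names are my own, and this is not the actual form used in any survey organization.

```python
# Hypothetical refusal report form (RRF) record, reflecting the kinds of
# fields discussed above; not the actual form from the author's survey labs.
from dataclasses import dataclass
from typing import Optional

@dataclass
class RefusalReport:
    interviewer_id: str
    phone_number: str
    refuser_role: Optional[str] = None   # e.g., "selected respondent", "gatekeeper"
    est_gender: Optional[str] = None     # record only if reasonably certain
    est_age_group: Optional[str] = None
    est_race: Optional[str] = None
    severity: int = 2                    # e.g., 1 = mild ... 3 = hostile
    do_not_recontact: bool = False       # True if told "Don't call back!"
    notes: str = ""

rrf = RefusalReport(interviewer_id="A17", phone_number="612-555-0123",
                    est_gender="female", severity=1,
                    notes="Said she was busy; suggested calling after 6 p.m.")
print(rrf)
```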
No definitive evidence exists about the success rate of refusal-conversion attempts,
although Groves and Lyberg (1988) placed it in the 25% to 40% range; my own
experience leads me to put it in the 10% to 20% range nowadays. In making deci-
sions about whether or not to attempt to convert refusals, the researcher is faced
with this trade-off: the investment of resources to convert refusals so as to possibly
decrease potential nonresponse error, versus the possible increase in other potential
sources of survey error that otherwise might be reduced if those same resources
were invested differently (e.g., paying more to have better-quality interviewers or
refining the questionnaire more with additional pilot testing). Of note, Stec and
Lavrakas (2007) reported that it is considerably more cost-efficient to gain com-
pleted interviews from converted refusals than from releasing new numbers from
the sampling pool.
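As a purely hypothetical illustration of that trade-off (the dollar figures and rates below are invented and are not taken from Stec and Lavrakas, 2007), the cost per completed interview can be compared directly for conversion attempts versus fresh numbers released from the sampling pool.

```python
# Hypothetical cost-per-complete comparison; all figures are invented
# for illustration and are not taken from the cited study.

def cost_per_complete(cost_per_attempt, completes_per_100_attempts):
    return cost_per_attempt * 100 / completes_per_100_attempts

fresh = cost_per_complete(cost_per_attempt=3.00, completes_per_100_attempts=8)
conversion = cost_per_complete(cost_per_attempt=4.50, completes_per_100_attempts=15)

print(f"fresh numbers:      ${fresh:.2f} per completed interview")
print(f"refusal conversion: ${conversion:.2f} per completed interview")
```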
Figure 16.1   Example of a Refusal Report Form (RRF). [The form is not fully reproduced here; its items include the interviewer number; whether the person who refused had the last (most recent) birthday; the refuser's estimated age group and race (with "uncertain" options); yes/no items about what the refuser was told (e.g., confidentiality); and an open-ended question asking what the interviewer recommends for gaining respondent/household cooperation if a conversion attempt were made.]
Interviewer Recruitment
A basic consideration regarding interviewers is whether they are paid for their
work or unpaid, such as volunteers or students who do interviewing as part of their
course work. When a telephone survey employs paid interviewers, there should be
a greater likelihood of higher-quality interviewing, due to several factors. In situa-
tions in which interviewers are paid, the researchers can select carefully from
among the most skilled individuals. With unpaid interviewers, researchers have
much less control over who will, and will not, be allowed to interview. Paid interviewers are
more likely to have an objective detachment from the survey's topic. In contrast,
unpaid interviewers often have expectancies of the data; that is, volunteers by
nature are often committed to an organization's purpose in conducting a survey
and may hold preconceived notions of results, which can alter their behavior as
interviewers and contribute bias to the data that they gather. Similarly, students
who interview for academic credit often have an interest in the survey outcomes,
especially if the survey is their class's own project.
Regardless of whether interviewers are paid or unpaid, I recommend that each
interviewer be asked to enter into a written agreement with the researcher. This
agreement should include a clause about not violating respondents' confidentiality.
Also, the researcher must make it very clear to all prospective interviewers that tele-
phone surveys normally require standardized survey interviewing (see Fowler &
Mangione, 1990), a highly structured and rather sterile style of asking questions. Standardized survey interviewing, as opposed to conversational interviewing (Schober & Conrad, 1997), does not allow for creativity on the part of interview-
ers in the ordering or wording of particular questionnaire items or in deciding who
can be interviewed. Furthermore, the researcher should inform all prospective tele-
phone interviewers that constant monitoring will be conducted by supervisors,
including listening to ongoing interviews (see Steve et al., 2008). The researcher's informing prospective interviewers of quality control features such as these, in advance of making a final decision about their beginning to work, will create realistic expectations. In the case of paid interviewers, it may discourage those who are
Interviewer Training
The training of telephone survey interviewers, prior to the on-the-job training
that they should constantly receive by working with their supervisors, has two distinct
components: general training and project-specific training. New interviewers should
receive general training to start their learning process. General training also should be
repeated, or at least refreshed, for experienced interviewers. Project-specific training
is given to everyone, no matter what seniority or ability they have as interviewers.
The following issues should be addressed in the part of training that covers
general practices and expectancies:
All interviewers must be trained in the particulars of each new survey. Generally,
this second, project-specific, part of training should be structured as follows:
develop with interviewers will affect the quality of data produced. To achieve a high
level of quality, there must be constant verbal and/or written feedback from super-
visors to interviewers, especially during the early part of a field period, when
on-the-job training is critical.
Whenever possible, a telephone survey should use a centralized bank of tele-
phones with equipment that allows the supervisor's telephone to monitor all interviewers' lines. There are special telephones that can be used to monitor an ongoing interview without the interviewer or respondent being aware of it. With CATI surveys, monitoring ongoing interviews often is a supervisor's primary responsibility. The use of a structured Interviewer Monitoring Form (IMF) is recommended (see Lavrakas, 1993, pp. 157-161; Steve et al., 2008). Supervisors need not listen to com-
plete interviews, but rather they should systematically apportion their listening, a
few minutes at a time, across all interviewers, concentrating more frequently and at
longer intervals on less-experienced ones. All aspects of interviewer-respondent
contact should be monitored, including the interviewers use of the introduction,
the respondent selection sequence, fallback statements, and administration of the
questionnaire itself. An IMF can (a) aid the supervisor by providing documented
on-the-job feedback to interviewers, (b) generate interviewer performance data for
the field director, and (c) provide the researcher with a valuable type of data for
investigating item-specific interviewer-related measurement error (see Cannell &
Oksenberg, 1988; Groves, 1989, pp. 381-389).
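One hypothetical way to operationalize that apportionment (not a procedure prescribed in the chapter) is to split a supervisor's available monitoring minutes across interviewers using weights that favor the less experienced, as in the sketch below.

```python
# Hypothetical allocation of a supervisor's monitoring minutes, weighted
# toward less-experienced interviewers, as suggested above.

def allocate_monitoring(minutes_available, interviewers):
    """interviewers: dict of name -> months of interviewing experience."""
    # Newer interviewers get proportionally more attention; +1 avoids division by zero.
    weights = {name: 1.0 / (months + 1) for name, months in interviewers.items()}
    total = sum(weights.values())
    return {name: round(minutes_available * w / total, 1)
            for name, w in weights.items()}

staff = {"new_hire": 0, "six_months": 6, "veteran": 36}
print(allocate_monitoring(60, staff))
```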
In addition to noting whether or not interviewers are reading the items exactly
as they are written, supervisors should pay special attention to the ways in which
interviewers probe incomplete, ambiguous, or irrelevant responses, and to whether
or not interviewers adequately repeat questions and define/clarify terms respon-
dents may not understand in an unbiased fashion, if the latter is appropriate for the
survey. Supervisors also need to pay close attention to anything interviewers may be
saying or doing (verbally) that might reinforce certain response patterns that may
bias answers. With many CATI systems, monitoring an ongoing interview includes
being able to view the interviewer's use of the keyboard as it happens. Listening to
ongoing interviewing and providing frequent feedback is especially important in
the early stages of the field period and with new interviewers, and at these times
extra supervisors may be needed.
Ethical Considerations,
Telemarketing, and Pseudopolls
High-quality telephone surveys practice the principle of informed consent.
Respondents are informed, either explicitly or implicitly, that their participation is
voluntary and that no harm will come to them regardless of whether they choose
to participate or not. In addition to practicing these ethical standards, legitimate
telephone surveys assure respondents that the answers they provide will be confi-
dential; that is, no one other than the survey organization will know who said what.
Discussion Questions
1. What were the factors that led to telephone surveys becoming the mode
of preference in the 1980s and 1990s for sampling and gathering data from the
general population?
2. Currently, what are the major advantages and disadvantages in using tele-
phone surveys to sample the general public? What about sampling populations
other than the general public, such as members of a professional organization,
students at a university, or members of a synagogue?
3. Why are there certain states in the United States that have relatively low cov-
erage of residential telephone service, instead of coverage being essentially equal
across all states? What effect does this have on telephone survey results in those
states with low residential telephone coverage rates? What survey topics would be
most biased by low coverage of telephone service?
4. What effects will number portability in the United States have on telephone
surveys of the general population? What effects will the trend toward more U.S. res-
idents using only a cell phone for their telephone service have on telephone sur-
veys?
5. Why have telephone survey response rates dropped in the United States in
the past decade? What direction will this trend likely go in the next decade? What
implications does this have for the accuracy of telephone surveys used to measure
the general population? What can be done to raise telephone survey response rates
in the United States?
6. Discuss how the prior calling history on a given telephone number chosen
for a telephone survey might affect future outcomes when calling back the same
number as part of a telephone survey.
7. What are the advantages and disadvantages of using computer-assisted tele-
phone interviewing (CATI) compared with a paper and pencil (PAPI) method?
What are some circumstances when a telephone survey should be done using PAPI?
Exercises
1. As a class, develop a short questionnaire that measures the type of telephone
service(s) someone has in her or his home residence and the proportion of calls
they receive and make on each type of service; include a few demographic questions
at the end. Have all students complete 10 telephone interviews with other students
at their college/university using the questionnaire. Have students write up and
discuss their experiences as telephone interviewers.
3. Create an introductory spiel for a telephone survey about seat belt usage that
is being conducted by a university for a government agency in which adults will be
sampled.
Notes
1. Number portability refers to an option that went into effect in November 2004 in the
United States allowing people to transfer (port) their 10-digit telephone number to another
geographic area when they moved and/or allowing them to keep the same number when
they changed their telephone service from a landline to a cell phone or vice versa.
2. Starting in 2005, linguist Dr. Erik Camayd-Freixas and researchers at the Nielsen
Company began a series of progressive involvement experiments with training interviewers to
use relatively brief sentences to encourage respondents reached on the telephone to become
engaged in the conversation so as to counter the tendency of many respondents to hang up
within the first few seconds after an interviewer has made contact (see Burks et al., 2007).
3. The concept of a sampling pool is not often addressed explicitly in the survey
methods literature. A naive observer might assume, for example, that a telephone survey in
which 1,000 persons were interviewed, actually sampled only those 1,000 persons and no
others, but this is almost never the case, for many reasons, including the problem of non-
response. Thus, a researcher is faced with the reality of often needing many times more tele-
phone numbers for interviewers to process than the total number of interviews that the
survey requires. Although most researchers refer to the set of telephone numbers that will be
dialed as their sample and also use the word sample to refer to the final number of completed
interviews achieved, Lavrakas (1987, 1993) proposed using the term sampling pool for the
starting set of numbers to be dialed and the word sample for the final set of interviews that
are achieved from the sampling pool.
4. First proposed by Cooper (1964), random-digit dialing, or RDD, comprises a group
of probability sampling techniques that provide a nonzero chance of reaching any household
with a telephone access line in a sampling area (assuming all exchanges/prefixes in the area
are represented in the frame), regardless of whether its telephone number is published or
listed. RDD does not provide an equal probability of reaching every telephone household in
a sampling area because some households have more than one telephone number. For
households with two or more numbers, postsampling adjustments (weighting) typically
need to be made before the data are analyzed to correct for this unequal probability of selec-
tion; thus, data must be gathered via the questionnaire in RDD sampling about how many
telephone numbers reach each household. Recent estimates are that about two in five resi-
dential telephone numbers in the United States are unlisted. In theory, using RDD eliminates
the potential problem of coverage error that might result from missing households with
unlisted telephone numbers.
References
American Association for Public Opinion Research. (2005). Code of professional ethics and prac-
tices. Lenexa, KS: Author. Retrieved April 29, 2008, from www.aapor.org/aaporcodeofethics
Bass, R. T., & Tortora, R. D. (1988). A comparison of centralized CATI facilities for an
agricultural labor survey. In R. M. Groves, P. N. Biemer, L. E. Lyberg, J. T. Massey,
W. L. Nicholls, & J. Waksberg (Eds.), Telephone survey methodology (pp. 497-508). New
York: John Wiley.
Bauman, S. L., Merkle, D. M., & Lavrakas, P. J. (1992, November). Interviewer estimates of
refusers' gender, age, and race in telephone surveys. Paper presented at the 15th annual
conference of the Midwest Association for Public Opinion Research, Chicago.
Biemer, P. N., Groves, R. M., Lyberg, L. E., Mathiowetz, N. A., & Sudman, S. (Eds.). (1991).
Measurement errors in surveys. New York: John Wiley.
Blumberg, S. J., & Luke, J. V. (2007). Coverage bias in traditional telephone surveys of
low-income and young adults. Public Opinion Quarterly, 71(5), 734-749.
Blumberg, S. J., Luke, J. V., & Cynamon, M. (2006). Telephone coverage and health survey
estimates: Evaluating the need for concern about wireless substitution. American
Journal of Public Health, 96, 926-931.
Brick, J. M., Brick, P. D., Dipko, S., Presser, S., Tucker, C., & Yuan, Y. Y. (2007). Cell phone
survey feasibility in the U.S.: Sampling and calling cell numbers versus landline
numbers. Public Opinion Quarterly, 71(1), 23-39.
Brick, J. M., Dipko, S., Presser, S., Tucker, C., & Yuan, Y. Y. (2006). Nonresponse bias in a dual
frame sample of cell and landline numbers. Public Opinion Quarterly, 70(5), 780-793.
Brick, J. M., Edwards, W. S., & Lee, S. (2007). Sampling telephone numbers and adults, inter-
view length, and weighting in the California health interview survey cell phone pilot
study. Public Opinion Quarterly, 71(5), 793-813.
Brick, J. M., Waksberg, J., Kulp, D., & Starer, A. (1995). Bias in list-assisted telephone surveys.
Public Opinion Quarterly, 59, 218-235.
Burks, A. T., Camayd-Freixas, E., Lavrakas, P. J., & Bennett, M. A. (2007, May). The use of pro-
gressive involvement principles in a telephone survey introduction to reduce immediate
refusals. Paper presented at the 62nd annual conference of the American Association for
Public Opinion, Anaheim, CA.
Callegaro, M., McCutcheon, A., & Ludwig, J. (2006, January). Who's calling? The impact
of caller-ID on telephone survey response. Paper presented at the Second International
Conference on Telephone Survey Methodology, Miami, FL.
Callegaro, M., & Poggio, T. (2004). Espansione della telefonia mobile ed errore di copertura
nelle inchieste telefoniche [Mobile telephone growth and coverage error in telephone
surveys]. Polis, 18, 477-506. (English version available at http://eprints.biblio.unitn.it/
archive/00000680)
Cannell, C. F., & Oksenberg, L. (1988). Observation of behavior in telephone interviews. In
R. M. Groves, P. N. Biemer, L. E. Lyberg, J. T. Massey, W. L. Nicholls, & J. Waksberg
(Eds.), Telephone survey methodology (pp. 475-496). New York: John Wiley.
Collins, M., Sykes, W., Wilson, P., & Blackshaw, N. (1988). Nonresponse: The UK experience.
In R. M. Groves, P. N. Biemer, L. E. Lyberg, J. T. Massey, W. L. Nicholls, & J. Waksberg
(Eds.), Telephone survey methodology (pp. 213-232). New York: John Wiley.
Cooper, S. L. (1964). Random sampling by telephone: An improved method. Journal
of Marketing Research, 1(4), 45-48.
Curtin, R., Presser, S., & Singer, E. (2005). Changes in telephone survey nonresponse over the
past quarter century. Public Opinion Quarterly, 69(1), 87-98.
de Heer, W. (1999). International response trends: Results of an international survey. Journal
of Official Statistics, 15(2), 129-142.
de Leeuw, E., Hox, J., Korendijk, E., Mulders, G.-L., & Callegaro, M. (2005). The influence
of advance letters on response in telephone surveys: A meta-analysis. In C. van Dijkum,
J. Blasius, & C. Durand (Eds.), Recent developments and applications in social research
methodology. Proceedings of the RC 33 Sixth International Conference on Social Science
Methodology, Amsterdam 2004 [CD-ROM]. Leverkusen-Opladen, Germany: Barbara
Budrich.
de Leeuw, E. D., & van der Zouwen, J. (1988). Data quality in telephone and face to face surveys:
A comparative meta-analysis. In R. M. Groves, P. N. Biemer, L. E. Lyberg, J. T. Massey,
W. L. Nicholls, & J. Waksberg (Eds.), Telephone survey methodology (pp. 283-300). New
York: John Wiley.
Dillman, D. A., Gallegos, J., & Frey, J. H. (1976). Reducing refusals for telephone interviews.
Public Opinion Quarterly, 40, 99-114.
Dillman, D. A., & Tarnai, J. (1988). Administrative issues in mixed mode surveys. In R. M.
Groves, P. N. Biemer, L. E. Lyberg, J. T. Massey, W. L. Nicholls, & J. Waksberg (Eds.),
Telephone survey methodology (pp. 509-528). New York: John Wiley.
Fowler, F. J., Jr., & Mangione, T. W. (1990). Standardized survey interviewing: Minimizing
interviewer-related error. Newbury Park, CA: Sage.
Frey, J. H. (1989). Survey research by telephone (2nd ed.). Newbury Park, CA: Sage.
Gabler, S., & Häder, S. (2001). Idiosyncrasies in telephone sampling: The case of Germany.
International Journal of Public Opinion Research, 14(3), 339-345.
Gawiser, S. R., & Witt, G. E. (1992). Twenty questions a journalist should ask about poll results.
New York: National Council on Public Polls.
Gaziano, C. (2005). Comparative analysis of within-household respondent selection tech-
niques. Public Opinion Quarterly, 69(1), 124-157.
Groves, R. M. (1989). Survey errors and survey costs. New York: John Wiley.
Groves, R. M., & Lyberg, L. E. (1988). An overview of nonresponse issues in telephone surveys.
In R. M. Groves, P. N. Biemer, L. E. Lyberg, J. T. Massey, W. L. Nicholls, &
J. Waksberg (Eds.), Telephone survey methodology (pp. 191-212). New York: John Wiley.
Groves, R. M., & McGonagle, K. A. (2001). A theory-guided interviewer training protocol
regarding survey participation. Journal of Official Statistics, 17(2), 249-265.
Groves, R. M., O'Hare, B. C., Gould-Smith, D., Benki, J., & Maher, P. (2008). Telephone inter-
viewer voice characteristics and survey participation decision. In J. Lepkowski,
C. Tucker, M. Brick, E. de Leeuw, L. Japec, P. J. Lavrakas, et al. (Eds.), Advances in telephone survey methodology (pp. 385-400). Hoboken, NJ: John Wiley.
Groves, R. M., Singer, E., & Corning, A. (2000). Leverage-saliency theory of survey partici-
pation: Description and an illustration. Public Opinion Quarterly, 64, 299-308.
Hansen, S. E. (2008). CATI sample management. In J. Lepkowski, C. Tucker, M. Brick, E. de
Leeuw, L. Japec, P. J. Lavrakas, et al. (Eds.), Advances in telephone survey methodology (pp. 340-358). Hoboken, NJ: John Wiley.
Henry, G. T. (1990). Practical sampling. Newbury Park, CA: Sage.
House, C. C., & Nicholls, W. L. (1988). Questionnaire design for CATI: Design objectives and
methods. In R. M. Groves, P. N. Biemer, L. E. Lyberg, J. T. Massey, W. L. Nicholls, &
J. Waksberg (Eds.), Telephone survey methodology (pp. 421-426). New York: John Wiley.
IPSOS-INRA. (2004). EU telecomm service indicators. Retrieved April 15, 2008, from
http://ec.europa.eu/information_society/policy/ecomm/info_centre/documentation/
studies_ext_consult/index_en.htm
Keeter, S., Kennedy, C., Clark, A., Tompson, T. N., & Mokrzycki, M. (2007). What's missing
from national landline RDD surveys? The impact of the growing cell-only population.
Public Opinion Quarterly, 71(5), 772-792.
Kelly, J., Link, M., Petty, J., Hobson, K., & Cagney, P. (2008). Establishing a new survey
research call center. In J. Lepkowski, C. Tucker, M. Brick, E. De Leeuw, L. Japec,
P. J. Lavrakas, et al. (Eds.), Advances in telephone survey methodology (pp. 317-339).
Hoboken, NJ: John Wiley.
Kennedy, C. (2007). Evaluating the effects of screening for telephone service in dual frame
RDD surveys. Public Opinion Quarterly, 71(5), 750-771.
Kish, L. (1949). A procedure for objective respondent selection within the household. Journal
of the American Statistical Association, 44, 380-387.
Kish, L. (1965). Survey sampling. New York: John Wiley.
Kish, L. (1994). Multi-population survey designs: Five types with seven shared aspects.
International Statistical Review, 62, 167-186.
Kuusela, V. (2003). Mobile phones and telephone survey methods. In R. Banks, J. Currall,
J. Francis, L. Gerrard, R. Kahn, T. Macer, et al. (Eds.), ASC 2003: The impact of new
technology on the survey process. Proceedings of the fourth ASC International Conference
(pp. 317-327). Chesham Bucks, UK: Association for Survey Computing.
Lavrakas, P. J. (1987). Telephone survey methods: Sampling, selection, and supervision.
Newbury Park, CA: Sage.
Lavrakas, P. J. (1991). Implementing CATI at the Northwestern survey lab: Part I. CATI News,
4(1), 2-3.
Lavrakas, P. J. (1992, November). Attitudes towards and experiences with sexual harassment in
the workplace. Paper presented at the 15th annual conference of the Midwest
Association for Public Opinion Research, Chicago.
Lavrakas, P. J. (1993). Telephone survey methods: Sampling, selection, and supervision (2nd
ed.). Newbury Park, CA: Sage.
Lavrakas, P. J. (1996). To err is human. Marketing Research, 8(1), 30-36.
Lavrakas, P. J. (2004, May). Will a perfect storm of cellular forces sink RDD sampling? Paper presented
at the 56th annual conference of the American Association for Public Opinion, Phoenix, AZ.
Lavrakas, P. J., Harpuder, B., & Stasny, E. A. (2000, May). A further investigation of the last-
birthday respondent selection method. Paper presented at the 52nd annual conference of
the American Association for Public Opinion, Portland, OR.
Lavrakas, P. J., & Merkle, D. A. (1991, November). A reversal of roles: When respondents ques-
tion interviewers. Paper presented at the 13th annual conference of the Midwest
Association for Public Opinion Research, Chicago.
Lavrakas, P. J., Merkle, D. A., & Bauman, S. L. (1992, May). Refusal report forms, refusal con-
versions, and nonresponse bias. Paper presented at the 47th annual conference of the
American Association for Public Opinion Research, St. Petersburg, FL.
Lavrakas, P. J., & Shuttles, C. D. (2004, August). Two advance letter experiments to raise survey
responses rates in a two-stage mixed mode survey. Paper presented at the 2004 Joint
Statistical Meetings, Toronto, Ontario, Canada.
Lavrakas, P. J., Shuttles, C. D., Steeh, C., & Fienberg, H. (2007). The state of surveying cell phone
numbers in the United States: 2007 and beyond. Public Opinion Quarterly, 71(5), 840-854.
Lavrakas, P. J., Steeh, C., Blumberg, S., Boyle, J., Brick, J. M., Callegaro, M., et al. (2008).
Guidelines and considerations for survey researchers when planning and conducting RDD
and other telephone surveys in the U.S. with respondents reached via cell phone numbers.
Lenexa, KS: AAPOR.
Lepkowski, J. M. (1988). Telephone sampling methods in the U.S. In R. M. Groves,
P. N. Biemer, L. E. Lyberg, J. T. Massey, W. L. Nicholls, & J. Waksberg (Eds.), Telephone
survey methodology (pp. 73-98). New York: John Wiley.
Link, M. W., Battaglia, M. P., Frankel, M. R., Osborn, L., & Mokdad, A. H. (2007). Reaching
the U.S. cell phone generation: Comparison of cell phone survey results with an ongo-
ing landline telephone survey. Public Opinion Quarterly, 71(5), 814-839.
Link, M. W., Battaglia, M. P., Frankel, M. R., Osborn, L., & Mokdad, A. H. (2008). A com-
parison of address-based (ABS) versus random-digit dialing (RDD) for general popu-
lation surveys. Public Opinion Quarterly, 72(1), 6-27.
Lyberg, L. E. (1988). The administration of telephone surveys. In R. M. Groves, P. N. Biemer,
L. E. Lyberg, J. T. Massey, W. L. Nicholls, & J. Waksberg (Eds.), Telephone survey method-
ology (pp. 453-456). New York: John Wiley.
Lyberg, L. E., & Dean, P. (1992, May). Methods for reducing nonresponse rates: A review. Paper
presented at the 47th annual conference of the American Association for Public
Opinion Research, St. Petersburg, FL.
Oksenberg, L., & Cannell, C. F. (1988). Effects of interviewer vocal characteristics on nonre-
sponse. In R. M. Groves, P. N. Biemer, L. E. Lyberg, J. T. Massey, W. L. Nicholls, &
J. Waksberg (Eds.), Telephone survey methodology (pp. 257-272). New York: John Wiley.
Pew Research Center. (2004). Polls face growing resistance, but still representative [News release].
Retrieved April 15, 2008, from http://people-press.org/reports/display.php3?ReportID=211
Piazza, T. (1993). Meeting the challenge of answering machines. Public Opinion Quarterly,
57, 219-231.
Rizzo, L. J., Brick, J. M., & Park, I. (2004). A minimally intrusive method for sampling per-
sons in random digit dial surveys. Public Opinion Quarterly, 68, 267-274.
Schober, M. F., & Conrad, F. G. (1997). Does conversational interviewing reduce survey mea-
surement error? Public Opinion Quarterly, 61, 576-602.
Shuttles, C., Welch, J., Hoover, B., & Lavrakas, P. J. (2002, May). The development and exper-
imental testing of an innovative approach to training telephone interviewers to avoid
refusals. Paper presented at the 57th annual conference of the American Association for
Public Opinion, St. Petersburg, FL.
Stec, J. A., & Lavrakas, P. J. (2007, May). The cost of refusals in large RDD national surveys.
Paper presented at the 62nd annual conference of the American Association for Public
Opinion, Anaheim, CA.
Stec, J., Lavrakas, P. J., & Shuttles, C. (2005, May). Gaining efficiencies in scheduling callbacks
in large RDD national surveys. Paper presented at the 60th annual conference of the
American Association for Public Opinion, Miami Beach, FL.
Steve, K., Burks, A. T., Lavrakas, P. J., Brown, K., & Hoover, B. (2008). The development of a
comprehensive behavioral-based system to monitor telephone interviewer performance.
In J. Lepkowski, C. Tucker, M. Brick, E. de Leeuw, L. Japec, P. J. Lavrakas, et al. (Eds.),
Advances in telephone survey methodology (pp. 401-422). Hoboken, NJ: John Wiley.
Steve, K., Daily, G., Lavrakas, P. J., Bourquin, H. C., Yancey, T., & Kulp, D. (2007, May). R&D stud-
ies to replace the random-digit dial frame with an address-based sampling frame. Paper pre-
sented at the 62nd annual conference of the American Association for Public Opinion,
Anaheim, CA.
Tarnai, J., & Moore, D. (2008). Measuring and improving telephone interviewer performance and
productivity. In J. Lepkowski, C. Tucker, M. Brick, E. de Leeuw, L. Japec, P. Lavrakas, et al.
(Eds.), Advances in telephone survey methodology (pp. 359-384). Hoboken, NJ: John Wiley.
Taylor, S. (2003). Telephone surveying for household social surveys: The good, the bad, and
the ugly. Social Survey Methodology Bulletin, 52, 10-21.
Traugott, M. W., & Lavrakas, P. J. (2008). The voter's guide to election polls (4th ed.). Lanham,
MD: Rowman & Littlefield.
Trussell, N., & Lavrakas, P. J. (2005, May). Testing the impact of caller ID technology on
response rates in a mixed mode survey. Paper presented at the 60th annual conference of
the American Association for Public Opinion, Miami Beach, FL.
Tuckel, P., & ONeill, H. (2002). The vanishing respondent in telephone surveys. Journal of
Advertising Research, 42(5), 26-48.
Tucker, C., Brick, J. M., & Meekins, B. (2007). Household telephone service and usage
patterns in the United States in 2004: Implications for telephone samples. Public
Opinion Quarterly, 71(3), 3-22.
Tucker, C., Lepkowski, J., & Piekarski, L. (2002). The current efficiency of list-assisted tele-
phone sampling designs. Public Opinion Quarterly, 66, 321-338.
Vehovar, V., Belak, E., Batagelj, Z., & Cikic, S. (2004). Mobile phone surveys: The Slovenian
case study. Metodološki zvezki, 1(1), 1-19.
CHAPTER 17
Ethnography
David M. Fetterman
Overview
This chapter presents an overview of the steps involved in ethnographic work (see
Fetterman, 1998, for additional detail). The process begins when the ethnographer
selects a problem or topic and a theory or model to guide the study. The ethnog-
rapher simultaneously chooses whether to follow a basic or applied research
approach to delineate and shape the effort. The research design then provides a
basic set of instructions about what to do and where to go during the study.
Fieldwork is the heart of the ethnographic research design. In the field, basic
anthropological concepts, data collection methods and techniques, and analysis are
the fundamental elements of doing ethnography. Selection and use of various
pieces of equipment, including the human instrument, facilitate the work. This process becomes product through analysis at various stages in ethnographic work: in field notes, memoranda, and interim reports, but most dramatically in the published report, article, or book.
This chapter presents the concepts, methods and techniques, equipment, analy-
sis, writing, and ethics involved in ethnographic research. This approach highlights
the utility of planning and organization in ethnographic work. The more organized
the ethnographer, the easier his or her task of making sense of the mountains of
data collected in the field. Sifting through notepads filled with illegible scrawl, lis-
tening to hours of digital voice recordings, labeling and organizing digital pho-
tographs and video, and conducting cross tabs and various data sorts in online
surveys are much less daunting to the ethnographer who has taken an organized,
carefully planned approach.
The reality, however, is that ethnographic work is not always orderly. It involves
serendipity, creativity, being in the right place at the right or wrong time, a lot of
hard work, and old-fashioned luck. Thus, although this discussion proceeds within
the confines of an orderly structure, I have made a concerted effort to ensure that
it conveys as well the unplanned, sometimes chaotic, and always intriguing charac-
ter of ethnographic research.
Whereas in most research analysis follows data collection, in ethnographic
research analysis and data collection begin simultaneously. An ethnographer is a
human instrument and must discriminate among different types of data and ana-
lyze the relative worth of one path over another at every turn in fieldwork, well
before any formalized analysis takes place. Clearly, ethnographic research involves
all different levels of analysis. Analysis is an ongoing responsibility and joy from the
first moment an ethnographer envisions a new project to the final stages of writing
and reporting the findings.
Concepts
The most important concepts that guide ethnographers in their fieldwork include
culture, a holistic perspective, contextualization, emic perspective and multiple
realities, etic perspective, nonjudgmental orientation, inter- and intracultural
diversity, structure and function, symbol and ritual, micro- and macrolevel studies,
and operationalism.
Culture
Culture is the broadest ethnographic concept. Definitions of culture typically
espouse either a materialist or an ideational perspective. The classic materialist inter-
pretation of culture focuses on behavior. In this view, culture is the sum of a social
group's observable patterns of behavior, customs, and way of life (see Harris, 1968,
p. 16; Murphy & Margolis, 1995; Ross, 1980). The most popular ideational definition of
culture is the cognitive definition. According to the cognitive approach, culture com-
prises the ideas, beliefs, and knowledge that characterize a particular group of people
(Strauss & Quinn, 1997). This second, and currently most popular, definition
specifically excludes behavior. Obviously, ethnographers need to know about both
cultural behavior and cultural knowledge to describe a culture or subculture ade-
quately. Although neither definition is sufficient, each offers the ethnographer a
starting point and a perspective from which to approach the group under study.
Both material and ideational definitions are useful at different times in explor-
ing fully how groups of people think and behave in their natural environments.
However defined, the concept of culture helps the ethnographer search for a logi-
cal, cohesive pattern in the myriad, often ritualistic behaviors and ideas that char-
acterize a group.
Anthropologists learn about the intricacies of a subgroup or community to
describe it in all its richness and complexity. In the process of studying these details,
they typically discover underlying forces that make the system tick. These cultural
elements are values or beliefs that can unite or divide a group, but that are com-
monly shared focal points. An awareness of what role these abstract elements play
in a given culture can give the researcher a clearer picture of how the culture works.
Many anthropologists consider cultural interpretation ethnography's primary contribution. Cultural interpretation involves the researcher's ability to describe what he or she has heard and seen within the framework of the social group's view
of reality. A classic example of the interpretive contribution involves the wink and
the blink. A mechanical difference between the two may not be evident. However,
the cultural context of each movement, the relationship between individuals that
each act suggests, and the contexts surrounding the two help define and differenti-
ate these two significantly different behaviors. Anyone who has ever mistaken a
blink for a wink is fully aware of the significance of cultural interpretation (see
Fetterman, 1982, p. 24; Geertz, 1973, p. 6; Roberts, Byram, Barro, Jordan, & Street,
2001; Wolcott, 1980, pp. 57, 59).
decision about the program. In this case, contextualization ensured that the
program would continue serving former dropouts (see Fetterman, 1987a).
In the same study, it was important to describe the inner-city environment in
which the schools were locatedan impoverished neighborhood in which pimping,
prostitution, arson for hire, rape, and murder were commonplace (see Figure 17.2).
This helped policymakers understand the power of certain elements in the com-
munity to distract students from their studies. This description also provided some
insight into the often lucrative alternatives with which the school competed in
attracting and retaining students. Contextualization helped provide a more accu-
rate characterization of the school's degree of difficulty and helped prevent a common error: blaming the victim.
an understanding of why people think and act in the different ways they do.
Differing perceptions of reality can be useful clues to individuals' religious, eco-
nomic, or political status and can help a researcher understand maladaptive
behavior patterns.
An etic perspective is an external, social scientific perspective on reality. Some
ethnographers are interested only in describing the emic view, without placing
their data in an etic or scientific perspective. They stand at the ideational
and phenomenological end of the ethnographic spectrum. Other ethnographers
prefer to rely on etically derived data first, and consider emically derived data
secondary in their analysis. They stand at the materialist and positivist philo-
sophical end of the ethnographic spectrum. At one time, a conflict between the ideational (typically emically oriented) perspective and the materialist (often etically based) perspective consumed the field. Today, most ethnographers
simply see emic and etic orientations as markers along a continuum of styles or
different levels of analysis. Most ethnographers start collecting data from the
emic perspective, then try to make sense of what they have collected in terms of
both the natives' view and their own scientific analysis. Just as thorough fieldwork requires an insightful and sensitive cultural interpretation combined with rigorous data collection techniques, so good ethnography requires both emic and etic perspectives.
A burnt-out building in the inner city across from the alternative school for dropouts provides an excellent example of why it is important to combine emic and etic perspectives (see Figure 17.3). From an initial etic perspective, it looks like there was a fire, possibly due to faulty electrical wiring. A few interviews with the students, and an alternative emic view is revealed. This was arson for hire. Some of the students are hired to torch a building after the landlord has increased the insurance coverage on the building. An interview with the local fire department (another emic view with considerable traditional authority) confirmed the students' emic view, adding a new insight into the alternative school's competition for the students' attention, particularly concerning alternative sources of activity and revenue. An etic view based on these emic views provides a more accurate depiction of what happened to the house and, more to the point, the social circumstances shaping what happened to the house (see Wolcott, 1999, p. 156).
Figure 17.3   Burnt-Out Building in the Inner City
Nonjudgmental Orientation
and Inter- and Intracultural Diversity
A nonjudgmental orientation requires the ethnographer to suspend personal
valuation of any given cultural practice. Maintaining a nonjudgmental orientation
is similar to suspending disbelief while watching a movie or play, or reading a
book: one accepts what may be an obviously illogical or unbelievable set of cir-
cumstances in order to allow the author to unravel a riveting story.
Intercultural diversity refers to the differences between two cultures, intracultural
diversity to the differences between subcultures within a culture. Intercultural dif-
ferences are reasonably easy to see. Compare the descriptions of two different cul-
tures on a point-by-point basis: their political, religious, economic, kinship,
ecological systems, and other pertinent dimensions. Intracultural differences, how-
ever, are more likely to go unnoticed.
These concepts place a check on our observations. They help the fieldworker see
differences that may invalidate pat theories or hypotheses about observed events in
the field. In some cases, these differences are systematic, patterned activities for a
broad spectrum of the community, compelling the fieldworker to readjust the
research focus; to throw away outdated and inappropriate theories, models,
hypotheses, and assumptions; and to modify the vision of the finished puzzle. In
other cases, the differences are idiosyncratic but useful in underscoring another,
dominant pattern: the exception proves the rule. In most cases, however, such dif-
ferences are instructive about a level or dimension of the community that had not
received sufficient consideration.
Housing in the inner city provides an example of intracultural diversity. Most
of the houses in the inner-city neighborhood we were studying were in disrepair,
many were marked by graffiti by local gangs, and entire blocks were in rubble (see
Figure 17.4). This was the norm concerning quality of housing in the neigh-
borhood. However, there were families that were attempting to improve the quality of the neighborhood, and they put their money where their mouths were by painting and repairing their homes. They were, admittedly, in the minority.
However, they represented a special group with a symbolic message of hope in the
community. This is an example of intracultural diversity. (For additional illustra-
tions of intracultural diversity in qualitative research, see Fetterman, 1998; Marcus,
1998, p. 65.)
Ethnographers use the concepts of structure and function to guide their inquiry.
They extract information from the group under study to construct a skeletal structure
and then thread in the social functions: the muscle, flesh, and nerves that fill out
the skeleton. A detailed understanding of the underlying structure of a system pro-
vides the ethnographer with a foundation on and frame within which to construct
an ethnographic description.
In addition, ethnographers look for symbols that help them understand and
describe a culture. Symbols are condensed expressions of meaning that evoke pow-
erful feelings and thoughts. A cross or a menorah represents an entire religion; a
swastika represents a movement, whether the original Nazi movement or one of the
many neo-Nazi movements. A flag represents an entire country, evoking both patri-
otic fervor and epithets.
Symbols may signify historical influences in a community. For example, a Jewish
star or Star of David (with Hebrew words carved into the stone) on a building marred by graffiti and broken glass marks the historical presence of an orthodox
Jewish community (see Figure 17.5). This symbol of the past provides some insight
into the roots of current tensions between young African Americans in the com-
munity and older orthodox Jews (see Abramovitch & Galvin, 2001, p. 252).
Rituals are repeated patterns of symbolic behavior that play a part in both reli-
gious and secular life. Ethnographers see symbols and rituals as a form of cultural
shorthand. Symbols open doors to initial understanding and crystallize critical cul-
tural knowledge. Together, symbols and rituals help ethnographers make sense of
Micro- or Macrolevel
Studies and Operationalism
A microstudy is a close-up view, as if under
a microscope, of a small social unit or an
identifiable activity within the social unit.
Typically, an ethnomethodologist or symbolic
interactionist will conduct a microanalysis (see
Denzin, 1989; Hinkel, 2004, p. 194). The areas of
proxemics and kinesics in anthropology involve
microstudies. Proxemics is the study of how the
socially defined physical distance between people
varies under differing social circumstances
(Barfield, 1997; Birdwhistell, 1970). Kinesics is
the study of body language (Birdwhistell, 1970;
Psathas, 1995, p. 5). A macrostudy focuses on the
large picture. In anthropology, the large picture
can range from a single school to worldwide
systems. The typical ethnography focuses on a community or specific sociocultural system. The selection of a micro- or macrolevel of study depends on what the researcher wants to know, and thus what theory the study involves and how the researcher has defined the problem under study.
Figure 17.5   Yeshiva in the Inner City With Graffiti
Operationalism, simply, means defining one's terms and methods of measure-
ment (Anderson, 1996, p. 19). In simple descriptive accounts, saying that a few
people said this and a few others said that may not be problematic. However, estab-
lishing a significant relationship between facts and theory, or interpreting the
facts, requires greater specificity. Operationalism tests ethnographers and forces
them to be honest with themselves. Instead of leaving conclusions to strong impres-
sions, the fieldworker should quantify or identify the source of ethnographic
insights whenever possible. Specifying how one arrives at one's conclusions gives
other researchers something concrete to go on, something to prove or disprove.
In this section of the chapter, I have provided a discussion of some of the most
important concepts in the profession, beginning with such global concepts as cul-
ture, a holistic orientation, and contextualization, and gradually shifting to more narrow concepts: inter- and intracultural diversity, structure and function, symbol and ritual, and operationalism. In the next section, I detail the ethnographic
methods and techniques that grow out of these concepts and allow the researcher
to carry out the work of ethnography.
Fieldwork
Fieldwork is the hallmark of research for both sociologists and anthropologists.
The method is essentially the same for both types of researchers: working with people for long periods of time in their natural setting. The ethnographer conducts research in the native environment to see people and their behavior given all the real-world incentives and constraints. This naturalistic approach avoids the artificial responses typical of controlled or laboratory conditions. Understanding the world, or some small fragment of it, requires studying it in all its wonder and complex-
ity. The task is in many ways more difficult than laboratory study, but it can also be
more rewarding (see Atkinson, 2002; McCall, 2006).
One of the benefits of fieldwork is that it provides a commonsense perspective on the data. For example, in a study of schools in the rural South, I received boxes of records indicating very low academic performance but very high attendance. This was counterintuitive and contrary to my experience working with schools in urban areas, where students who received poor grades dropped out of school or were often truant or late. However, traveling to the school, watching cotton, rice, and soy fields pass by mile after mile, I realized that the data made sense (see Figure 17.6). There was nothing else to do but show up to school. It was the only social game in town. As one student put it, "It (school) sure beat sittin' in the field, doing nothing, all by yourself."
The fieldworker uses a variety of methods and techniques to ensure the integrity
of the data. These methods and techniques objectify and standardize the researcher's perceptions. Of course, the ethnographer must adapt each one of the methods and techniques discussed later to the local environment. Resource constraints and deadlines may also limit the length of time for data gathering in the field: exploring, cross-checking, and recording information.
Participant Observation
Participant observation characterizes most ethnographic research and is crucial
to effective fieldwork. Participant observation combines participation in the lives
of the people under study with maintenance of a professional distance that allows
adequate observation and recording of data.
Interviewing
The interview is the ethnographer's most important data-gathering technique.
Interviews explain and put into a larger context what the ethnographer sees and
experiences. General interview types include structured, semistructured, informal,
and retrospective interviews.
Formally structured and semistructured interviews are verbal approximations of
a questionnaire with explicit research goals. These interviews generally serve com-
parative and representative purposescomparing responses and putting them in
the context of common group beliefs and themes. A structured or semistructured
interview is most valuable when the fieldworker comprehends the fundamentals of a community from the insider's perspective. At this point, questions are more likely to conform to the native's perception of reality than to the researcher's (see
Schensul, LeCompte, & Schensul, 1999).
Informal interviews are the most common in ethnographic work. They seem to
be casual conversations, but where structured interviews have an explicit agenda,
informal interviews have a specific but implicit research agenda. The researcher
uses informal approaches to discover the categories of meaning in a culture.
Informal interviews are useful throughout an ethnographic study for discovering
what people think and how one person's perceptions compare with another's. Such comparisons help the fieldworker identify shared values in the community, values that inform behavior. Informal interviews are also useful for establishing and main-
taining healthy rapport.
Survey Questions
A survey question, or what Spradley and McCurdy (1972) call a "grand tour" question, is designed to elicit a broad picture of the participant's or native's world, to map the cultural terrain. Survey questions help the ethnographer define the boundaries of a study and plan wise use of resources. The participant's overview of the physical setting, universe of activities, and thoughts helps focus and direct the
investigation.
Once survey questions reveal a category of some significance to both fieldworker
and native, specific questions about that category become most useful. The differ-
ence between a survey question and a specific or detailed question depends largely
on context.
Specific questions probe further into established categories of meaning or activity.
Whereas survey questions shape and inform a global understanding, specific ques-
tions refine and expand that understanding. Structural and attribute questions, subcategories of specific questions, are often the most appropriate approach to this level of inquiry. Structural and attribute questions are useful to the ethnographer in organizing and understanding the native's view. Structural questions reveal the similarities that exist across the conceptual spectrum, in the native's head. (See
Spradley & McCurdy, 1972, for additional information about the construction of
taxonomic definitions. See also Clair, 2003.) Attribute questions, questions about the characteristics of a role or a structural element, ferret out the differences
between conceptual categories. Typically, the interview will juxtapose structural with
attribute questions. Information from a structural question might suggest a question
about the differences among various newly identified categories.
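One way to picture the relationship between structural and attribute questions is as a small data structure: structural questions populate a category and its members, and attribute questions record the characteristics that distinguish one member from another. The sketch below is purely illustrative and is not drawn from Spradley and McCurdy; the category, members, and attributes are hypothetical, and Python is used only for convenience.

    # A toy taxonomic record assembled from interview answers.
    # The category, members, and attributes are hypothetical.
    taxonomy = {
        "people who work at the school": {       # filled in by structural questions
            "teacher":   {"paid": True,  "works with students": True},
            "custodian": {"paid": True,  "works with students": False},
            "volunteer": {"paid": False, "works with students": True},
        },
    }

    def attribute_contrasts(category, attribute):
        """Group members of a category by one attribute (an attribute question in miniature)."""
        contrasts = {}
        for member, attrs in category.items():
            contrasts.setdefault(attrs.get(attribute), []).append(member)
        return contrasts

    school_roles = taxonomy["people who work at the school"]
    print(attribute_contrasts(school_roles, "paid"))
    # {True: ['teacher', 'custodian'], False: ['volunteer']}

The printed output groups the hypothetical roles by the attribute "paid," the kind of contrast an attribute question is designed to surface.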
Ethnographic research requires the fieldworker to move back and forth between
survey and specific questions. Focusing in on one segment of a person's activities or worldview prematurely may drain all the ethnographer's resources before the investi-
gation is half done. The fieldworker must maintain a delicate balance of questions
throughout the study; in general, however, survey questions should predominate in
the early stages of fieldwork, and more specific questions in the middle and final stages.
The ethnographer can check answers rather easily, but must stay on guard against such distortion and contamination. Another, subtler problem occurs when a key actor begins to adopt the ethnographer's theoretical and conceptual framework. The key actor may inadvertently begin to describe the culture in terms of this a priori construct, undermining the fieldwork and distorting the emic or insider's perspective. (For further discussion of the role of key informants, see Dobbert, 1982; Ellen, 1984; Freilich, 1970; Goetz & LeCompte, 1984; Pelto, 1970; Spradley, 1979; Taylor & Bogdan, 1984;
Wolcott, 1999.)
The life history approach is usually rewarding for both key actor and ethnogra-
pher. However, it is exceedingly time-consuming. Approximations of this approach,
including expressive-autobiographical interviewing, are particularly valuable
contributions to a study with resource limitations and time constraints. The expres-
sive-autobiographical interview consists of a highly abbreviated chronological autobiography, interrupted at critical points with questions of concern to the researcher (for example, about stress, puberty, marriage, or employment) to narrow the scope almost immediately (see Spindler & Spindler, 1970, p. 293; 1987, p. 25).
Questionnaires
Structured interviews are close approximations of questionnaires. Questionnaires
represent perhaps the most formal and rigid form of exchange in the interviewing
spectrum: the logical extension of an increasingly structured interview. However, questionnaires are qualitatively different from interviews because of the distance between the researcher and the respondent. Interviews have an interactive nature that questionnaires lack. In filling out a questionnaire, the respondent completes the researcher's form without any verbal exchange or clarification. Knowing whether the researcher and the respondent are on the same wavelength, sharing common assumptions and understandings about the questions, is difficult, perhaps impossible.
Misinterpretations and misrepresentations are common with questionnaires.
Many people present idealized images of themselves on questionnaires, answering
as they think they should to conform to a certain image. The researcher has no con-
trol over this type of response and no interpersonal cues to guide the interpretation
of responses. Other problems include bias in the questions and poor return rates.
Despite these caveats, questionnaires are an excellent way for fieldworkers to
tackle questions dealing with representativeness. They are the only realistic way of
taking the pulse of hundreds or thousands of people. Anthropologists usually
develop questionnaires to explore scientific concerns after they have a good grasp
of how the larger pieces of the puzzle fit together. The questionnaire is a product of
the ethnographers knowledge about the system, and the researcher can adapt it to
a specific topic or set of concerns. Ethnographers also use existing questionnaires
to test hypotheses about specific conceptions and behaviors. However, the ethnog-
rapher must establish the relevance of a particular questionnaire to the target cul-
ture or subculture before administering it.
Online surveys and questionnaires provide an efficient way to document the
views of large groups in a short period of time. The questions, which may include yes/no, check-all-that-apply, open-ended, and 5-point Likert scale items, are posted on the Web. Respondents are notified about the location of the survey on the Web (with a specific URL), enter their responses, and submit their survey online. The results are automatically calculated. The responses are often visually represented in a bar chart or similar graphic display as soon as the data are entered (see Figure 17.8). This saves the ethnographer from the initial mailing costs, time-consuming and expensive postal reminders, and the expense of entering data from all the submitted surveys. Ethnographers can help computer-phobic respondents or those who do not have access to a computer or the Internet complete the survey and enter their responses in the same online database if necessary (see Flick, 2006).
There are also many other ways to conduct surveys, ranging from PDAs to wire-
less polling devices. One of the benefits of wireless polling devices (where people
use a hand-held instrument to record their answers and the results are immediately
tabulated and visible) is the immediacy and transparency of the tool. Participants
can see and share their responses in real time. The approach provides an excellent
vehicle to launch focus group discussions. Individuals are also able to compare their
answers with the group and (if comfortable) discuss their reasons for a specific
response.
The credibility of survey findings (hard copy or online) depends on the response
rate. Response rates refer to the percentage of people who complete a survey. There
are many ways of increasing the response rate, ranging from keeping the survey
short (reducing the respondent burden) to offering incentives. In general, the higher the response rate, the better (see Fink, 2005, p. 6).
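For readers who prefer to tabulate results themselves rather than rely on a survey tool's built-in charts, the underlying arithmetic is simple. The following sketch, written in Python with entirely hypothetical numbers, tallies a single 5-point Likert item and computes the response rate described above (completed surveys divided by the number of people invited).

    from collections import Counter

    # Hypothetical numbers: 12 people invited, 9 completed 5-point Likert responses.
    invited = 12
    responses = [5, 4, 4, 3, 5, 2, 4, 5, 3]

    response_rate = len(responses) / invited * 100
    print(f"Response rate: {response_rate:.0f}%")            # 75%

    tally = Counter(responses)
    for score in range(1, 6):
        count = tally.get(score, 0)
        share = count / len(responses) * 100
        print(f"{score}: {'#' * count}  {share:.0f}%")        # crude text bar chart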
Projective Techniques
Projective techniques supplement and enhance fieldwork; they do not replace it. These techniques are employed by the ethnographer to elicit cultural and often psychological information from group members. Typically, the ethnographer holds an item up and asks the participant what it is. The researcher may have an idea about what the item represents, but that idea is less important than the participant's perception. The participant's responses usually reveal individual needs, fears, inclinations, and general worldview.
I typically share pictures and brief videos of the group I am working with while
I am on site or in their community. In part, it is a natural form of reciprocity.
However, it also yields important data. The pictures or videos elicit both confirm-
ing and unexpected comments. In one case, students yelled, "Idi Amin," when they saw the director's picture. This surprised me because I had only heard high praise about him before that. The reaction led me to understand another side or dimension to the director that made him successful: caring but firm.
Projective techniques, however revealing, rarely stand alone. The researcher
needs to set these techniques in a larger research context to understand the elicited
response completely. Projective techniques can elicit cues that can lead to further
inquiry or can be one of several sources of information to support an ongoing
hypothesis. Only the ethnographers imagination limits the number of possible
projective techniques. However, the fieldworker should use only those tests that are relevant to the local group and the study.
Unobtrusive Measures
I began this section on methods and techniques by stating that ethnographers
are human instruments, dependent on all their senses for data collection and analy-
sis. Most ethnographic methods are interactive: They involve dealing with people.
The ethnographer attempts to be as unobtrusive as possible to minimize effects
on the participants' behavior. However, data collection techniques, except for questionnaires, fundamentally depend on that human interaction.
A variety of other measures, however, do not require human interaction and can
supplement interactive methods of data collection and analysis. These methods
require only that the ethnographer keep eyes and ears open. Ranging from outcroppings to folktales, these unobtrusive measures allow the ethnographer to draw
social and cultural inferences from physical evidence (see Webb, Campbell,
Schwartz, & Sechrest, 2000).
Outcroppings
Outcropping is a geological term referring to a portion of the bedrock that is visible on the surface; in other words, something that sticks out. Outcroppings in inner-city ethnographic research include skyscrapers, burned-out buildings, graffiti, the smell of urine on city streets, yards littered with garbage, a Rolls-Royce, and
a syringe in the schoolyard. The researcher can quickly estimate the relative wealth
or poverty of an area from these outcroppings. Initial inferences are possible with-
out any human interaction. However, such cues by themselves can be misleading. A
house with all the modern conveniences and luxuries imaginable can signal wealth
or financial overextension verging on bankruptcy. The researcher must place each
outcropping in a larger context. A broken syringe can have several meanings,
depending on whether it lies on the floor of a doctor's office or in an elementary
schoolyard late at night. On the walls of an inner-city school, the absence of graffiti
is as important as its presence.
An expensive white elephant of a building takes on special significance when
viewed within the confines of a township lacking rudimentary services and utilities.
The outcropping hints at political patronage, poor planning, and/or misdirected
resources. A South African woman standing in front of her modest home takes on greater meaning and significance when situated within a larger squatter settlement (see Figures 17.9 and 17.10). The image becomes a political statement about
the scope of poverty and injustice.
Changes in a physical setting over time can also be revealing. For example, an
increase in the number of burned-out and empty buildings on a block indicates a
decaying neighborhood. Conversely, an increase in the number of remodeled and repainted homes suggests a neighborhood on the way up.
Folktales
Folktales are important to both literate and nonliterate societies. They crystallize
an ethos or a way of being. Cultures often use folktales to transmit critical cultural
values and lessons from one generation to the next. Folktales usually draw on famil-
iar surroundings and on figures relevant to the local setting, but the stories themselves
are facades. Beneath the thin veneer is another layer of meaning. This inner layer reveals the stories' underlying values. Stories provide ethnographers with insight into
the secular and the sacred, the intellectual and the emotional life of a people.
All the methods and techniques discussed above are used together in ethno-
graphic research. They reinforce one another. Like concepts, methods and tech-
niques guide the ethnographer through the maze of human existence. Discovery
and understanding are at the heart of this endeavor. The next section explores a
wide range of useful devices that make the ethnographers expedition through time
and space more productive and pleasant.
Equipment
Notepads, computers, tape recorders, PDAs, cameras: all the tools of ethnography
are merely extensions of the human instrument, aids to memory and vision. Yet
these useful devices can facilitate the ethnographic mission by capturing the rich
detail and flavor of the ethnographic experience and then helping organize and
analyze these data. Ethnographic equipment ranges from simple paper and pen to
high-tech laptop and mainframe computers, from tape recorders and cameras to
digital camcorders. The proper equipment can make the ethnographer's sojourn in
an alien culture more pleasant, safe, productive, and rewarding.
Laptop Computers
The laptop computer is a significant improvement over pen and notepad. Laptop
computers are truly portable computers for use in the office, on a plane, or in the
field. I often use the laptop in lieu of pen and paper during interviews (once I have
established rapport and as long as it does not distance me from the person I am
working with). In a technologically sophisticated setting, a laptop is rarely obtrusive
or distracting if the fieldworker introduces the device casually and with considera-
tion for the person and the situation. Laptop computers can save ethnographers
time they can better spend thinking and analyzing. They greatly reduce the fieldworker's need to type up raw interview notes every day, because the fieldworker enters these data into the computer only once, during or immediately after an inter-
view. These notes can then be expanded and revised with ease. The files can be trans-
ferred from the laptop to a personal computer or mainframe with an external disk
drive, appropriate software, and/or a high-speed modem or wireless connection.
These files can then be merged with other field data, forming a highly organized
(dated and cross-referenced), cumulative record of the fieldwork.
Laptops also provide the ethnographer with an opportunity to interact with par-
ticipants at critical analytic moments. Ethnographers can share and revise notes,
spreadsheets, and graphs with participants on the spot. I routinely ask participants
to review my notes and memoranda as a way to improve the accuracy of my obser-
vations and to sensitize me to their concerns. We also produce bar charts and other
graphic representations of the data together, providing an immediate cross-check
on the preliminary analysis.
The laptop computer is not a panacea, but it is a real time-saver and is particu-
larly useful in contract research. An ethnographer who conducts multisite research
can carry a laptop to the sites and send files home to a home computer via modem linkup or wireless connection. Laptops also greatly facilitate communication from the field
to the research center through interactive electronic mail systems. Laptops have
drawbacks, of course, as any equipment does. The fieldworker must learn about the operating system and the programs and must configure the computer properly with enough memory and storage. The ethnographer must also possess enough
patience to work through bugs, viruses, slow-downs, and crashes. In addition, the
fieldworker needs to take time to acquaint people with the device before thrusting
it in front of them. Certain people will explicitly or implicitly prohibit the use of
even a pen and notepad, never mind a laptop or other device. Also, the clatter of the
keyboard can be distracting and obtrusive in certain situations. In most cases, how-
ever, a brief desensitization period will make people feel comfortable with the
equipment. In fact, the laptop can be an icebreaker, helping the fieldworker to
develop a strong rapport with people and at the same time inuring them to its pres-
ence. Given a careful introduction, laptops or any other useful pieces of equipment
can greatly facilitate ethnographic work.
Desktop Computers
Many researchers use laptops to compose memos, reports, and articles; to conduct interviews; and to collect data in general, and then upload or send their files
to a desktop computer. There are convenient tools to mechanically transfer files.
However, an increasing number of researchers are skipping the transfer issue com-
pletely by using their laptop or notebook-type computers as their primary com-
puters, because they are as powerful as the larger systems and are more convenient.
Database Software
Database programs enable the ethnographer to play a multitude of "what-if" games, to test a variety of hypotheses with the push of a button (and a few macros, strings of commands, assigned to that button). I have used a variety of database
programs to test my perceptions of the frequency of certain behaviors, to test spe-
cific hypotheses, and to provide new insights into the data. NUDIST, Ethnograph,
HyperQual, HyperResearch, AnSWR, EZ-Text, AskSam, Qualpro, and Atlas/ti
are some programs that are well
suited to ethnographic research
(see Figure 17.11).
These database programs allow for the development of emergent themes. In addition, these tools help the ethnographer visualize and organize the data into "bins" or categories. FileMaker Pro and similar programs are less suitable for field notes, but are useful for more limited data sets and manipulation. Fixed fields do not allow for the addition of new fields that emerge along the way as the ethnographer learns more about the multidimensional nature of the topic and the field. (See Weitzman & Miles, 1995, for a detailed review of qualitative data analysis software. See also Hardy, 2004; O'Reilly, 2005.)
Figure 17.11 Computer Screen Snapshot of NUDIST Software
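The "bins" these packages provide can be illustrated with a toy example. The sketch below does not reproduce how NUDIST, Atlas/ti, or any of the other programs work internally; it simply shows, with hypothetical field-note excerpts and a hypothetical keyword codebook written in Python, what it means to assign text segments to emergent categories and count how often each theme appears.

    from collections import defaultdict

    # Hypothetical field-note excerpts and a hypothetical keyword codebook.
    excerpts = [
        "Teacher praised the student's attendance in front of the class.",
        "Graffiti on the gym wall was painted over by parent volunteers.",
        "A student said school beats sitting in the field doing nothing.",
    ]
    codebook = {
        "attendance":       ["attendance", "truant", "absent"],
        "community repair": ["painted", "repaired", "volunteers"],
        "school as refuge": ["beats sitting", "nothing else to do"],
    }

    coded = defaultdict(list)                  # theme -> indices of matching excerpts
    for i, text in enumerate(excerpts):
        for theme, keywords in codebook.items():
            if any(keyword in text.lower() for keyword in keywords):
                coded[theme].append(i)

    for theme, hits in coded.items():
        print(f"{theme}: {len(hits)} excerpt(s) {hits}")

In practice, of course, the coding categories emerge from repeated reading rather than from a fixed keyword list; the point here is only the mechanics of sorting text into bins and tallying them.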
Internet Telephony
Internet telephone software, such as Skype and Jajah, enables people to speak with one another for free over the Internet. Ethnographers are increasingly using these tools to speak with colleagues and key actors in the field without long-distance charges. These tools are also a free or inexpensive way to maintain contact with community members.
Videoconferencing Technology
Videoconferencing technology allows geographically disparate parties to see and
hear each other around the globe. Free or inexpensive software programs, including
iVisit, iChat, and CU-SeeMe, are available that allow videoconferencing online over the
Internet, with no satellite or long-distance charges. With only this software and a small,
relatively inexpensive digital camera plugged directly into a personal computer,1 indi-
viduals can videoconference through their computer screens with any other similarly
equipped users worldwide. I use videoconferencing to conduct follow-up interviews
and observations at remote sites, after initially interviewing on-site and establishing
rapport in person. I also use it to consult with colleagues and staff members on the
ethnographic research team (see Fetterman, 1996, for additional details).
Videoconferencing was instrumental in a $15 million Hewlett-Packard-funded Digital Divide project (Fetterman, 2004). The purpose of the project was to help people bridge the digital divide, specifically by establishing wireless communication within
and outside the reservation. Videoconferencing facilitated communication through-
out the project. In addition, digital photographs of videoconference exchanges
between Native Americans in the Tribal Digital Village and ethnographers at Stanford
University were used as evidence that the project was successful (see Figure 17.12).
Ethnographers have conducted fieldwork for generations without the benefit of
laptop and desktop computers, printers, database software, and videoconferencing,
and continue to conduct it without them. However, these tools are becoming indis-
pensable in many disciplines, and few anthropologists conduct research without
the use of some type of computer. Yet computers have limitations: They are only as
good as the data the user enters. They still require the eyes and ears of the ethnog-
rapher to determine what to collect and how to record it, as well as how to interpret
the data from a cultural perspective. (For further information about computing in
ethnographic and qualitative research, see Best & Krueger, 2004; Brent, 1984;
Conrad & Reinharz, 1984; Fischer, 1994; Friese, 2006; Podolefsky & McCarthy,
1983; Sproull & Sproull, 1982; Weitzman & Miles, 1995; also see Dow, 1987. My
Web page provides a list of ethnographic resources on the Internet at www.stanford.edu/~davidf/ethnography.html.)
discussion and usually makes them comfortable with the machine. It also enables
me to identify accurately each participant's words long after I have left the field.
Tape recorders do, however, have some hidden costs. Transcribing tapes is an
extremely time-consuming and tedious task (even when they are digitally recorded
and transferred to a computer). Listening to a tape takes as much time as making the
original recordinghours of interview data require hours of listening. Transcribing
tapes adds another dimension to the concept of time-consumption.2 Typically, the
fieldworker edits the tapes, transcribing only the most important sections. This
keeps the ethnographer close to the data, enabling the ethnographer to identify
subtle themes and patterns that might be overlooked by a professional transcriber who is not familiar with the local community. However, a carefully selected profes-
sional transcriber can remove the pedestrian part of the process if funds are avail-
able (see Carspecken, 1996, p. 29; Robinson, 1994; Roper & Shapira, 2000).
Cameras
Cameras, particularly digital cameras, have a special role in ethnographic
research. They can function as a "can opener," providing rapid entry into a commu-
nity or classroom (see Collier, 1967; Fetterman, 1980). They are a known com-
modity to most industrialized and many nonindustrialized groups. I use cameras to
help establish an immediate familiarity with people. Cameras can create pictures
useful in projective techniques or can be projective tools themselves. They are most
useful, however, for documenting field observations.
Cameras document people, places, events, and settings over time. They enable the
ethnographer to create a photographic record of specific behaviors. As Collier (1967)
explains,
Photographs are mnemonic devices. During analysis and writing periods, photo-
graphs can bring a rush of detail that the fieldworker might not remember otherwise.
By capturing cultural scenes and episodes on film at the beginning of a study, before he or she has a grasp of the situation, the ethnographer can use the pictures to interpret events retroactively, producing a rare second chance. Also, the camera
often captures details that the human eye has missed. Although the camera is an
extension of the subjective eye, it can be a more objective observer, less dependent on the fieldworker's biases and expectations. A photographic record provides infor-
mation that the fieldworker may not have noticed at the time. Photographs are also
excellent educational tools, in the classroom, in a sponsor's conference room, or on
a protected blog.
Computer software programs help organize digital photos and videos into fold-
ers based on themes or topics. Similarly, Web storage filing programs, such as
Picasa and Dropshots, make it easy to organize and share photographs and digital videos on the Internet with colleagues and the people you work with. The same
software can be used to tell a story by using these pictures to create digital slide
shows and digital videos. I produce these kinds of videos for many of my projects
and post them on blogs and Web pages. They help document a key event, share
group projects with others who could not attend meetings, and give voice to com-
munity members who would not have otherwise been heard. They also serve as use-
ful projective techniques, particularly as community members provide feedback on
the video during the editing phase of video production.
The use of a camera or any other photographic or audio recording mechanism in fieldwork requires the subject's permission. Some people are uncomfortable having
their pictures taken; others cannot afford exposure. The ethnographer may enter
the lives of people on their terms, but may not invade individual privacy. Photography
is often perceived to be an intrusion. People are usually self-conscious about their
self-presentation and concerned about how and where their pictures will be seen.
An individual's verbal permission is usually sufficient to take a picture. However,
written permission is necessary to publish or to display that picture in a public
forum. Even with verbal and written permission in hand, the ethnographer must
exercise judgment in choosing an appropriate display and suitable forum. Cameras,
too, can be problematic. Inappropriate use of cameras can annoy and irritate
people, undermining rapport and degrading the quality of the data. I typically use a
pocket-sized digital camera that works under low-light conditions to minimize
obtrusiveness. Cameras can also distort reality. A skillful photographer uses angles
and shadows to exaggerate the size of a building or to shape the expression on a person's face. The same techniques can present a distorted picture of an individual's
behavior. Photoshop and related software can easily modify and manipulate visual
images. (See Aldridge, 1995; Becker, 1979, for an excellent discussion of photogra-
phy and threats to validity. See also Pink, 2001, and the visual anthropology jour-
nal Studies in Visual Communication.)
Digital Camcorder
Digital camcorder recordings are extremely useful in ethnographic (and partic-
ularly microethnographic) studies. They are instrumental data collection tools
when producing videos or digital vignettes of social situations. Camcorders can
capture the ebb and flow of an activity or ritual. The three-dimensional movement
brings the viewer closer to the natural movement and activity of the people you are
describing. Raw digital video that is skillfully edited, much like a documentary, can
tell a compelling and authentic story. Most digital cameras have a camcorder built
into them, enabling the ethnographer to combine functions with a single device.
Ethnographers usually have a fraction of a second to reflect on a gesture or a person's posture or gait. Camcorders provide the observer with the ability to stop time.
The ethnographer can tape a class and watch it over and over again, each time
finding new layers of meaning or nonverbal signals from teacher to student, from
student to teacher, and from student to student. Over time, visual and verbal pat-
terns of communication become clear.
Camcorder equipment is essential to any microethnographic research effort.
Gatekeeping procedures (Erickson, 1976) and the politics of the classroom
(McDermott, 1974) are some elements of complex social situations that the field-
worker can capture on tape. However, the fieldworker must weigh the expense of
the equipment and the time required to use it against the value of the information
it will capture. Many ethnographic studies simply do not need fine-grained pictures
of social reality. The equipment can be obtrusive, although many camcorders fit in the palm of
your hand. Even after participants have spent time with the ethnographer with and
without the equipment, mugging and posing for the camera are not uncommon.
The most significant hazard in using a camcorder is the risk of tunnel vision.
Ideally, the ethnographer has studied the social group long enough to know what
to focus on. The ethnographer may need months to develop a reasonably clear
conception of specific behaviors before deciding to focus on them for a time. The
camcorder can focus in on a certain type of behavior to the exclusion of almost
everything else. Thus, the ethnographer may arrive at a very good understanding of
a specific cultural mechanism but achieve little understanding of its real role in a
particular environment.
In spite of the distinctions being made between visual media, the lines between
them, especially digital photography and video, are becoming blurred. I often pro-
duce videos consisting of a combination of digital pictures and video recordings,
with a voice track narrating the video and royalty-free music in the background to convey a culturally appropriate and meaningful tone (see http://homepage.mac.com/profdavidf).
Analysis
Analysis is one of the most engaging features of ethnography. It begins at the
moment a fieldworker selects a problem to study and ends with the last word in the
report or ethnography. Ethnography involves many levels of analysis. Some are
simple and informal; others require some statistical sophistication. Ethnographic
analysis is iterative, building on ideas throughout the study. Analyzing data in the
field enables the ethnographer to know precisely which methods to use next, as well
as when and how to use them. Through analysis, the ethnographer tests hypotheses
and perceptions to construct an accurate conceptual framework about what is hap-
pening in the social group under study. Analysis in ethnography is as much a test of
the ethnographer as it is a test of the data.
Thinking
First and foremost, analysis is a test of the ethnographers ability to thinkto
process information in a meaningful and useful manner. The ethnographer con-
fronts a vast array of complex information and needs to make some sense of it all
piece by piece. The initial stage in analysis involves simple perception. However,
even perception is selective. The ethnographer selects and isolates pieces of information from all the data in the field. The ethnographer's personal or idiosyncratic perspective inevitably shapes that selection.
Triangulation
Triangulation is basic in ethnographic research. It is at the heart of ethnographic
validity, testing one source of information against another to strip away alternative
explanations and prove a hypothesis. Typically, the ethnographer compares infor-
mation sources to test the quality of the information (and the person sharing it), to
understand more completely the part an actor plays in the social drama, and ulti-
mately to put the whole situation into perspective.
In a particular study, I will typically ask a student how he or she is doing. I might also hear reports from the teacher about the student's performance. The student's parent might offer an insight into the student's performance as well. When these three separate sources converge and reinforce each other, I am more confident reporting that the student's performance has indeed improved. At the least, the convergence helps me rule out rival hypotheses concerning the student's performance. (See Flick, Kardorff, & Steinke,
2004; Webb et al., 2000, for a detailed discussion of triangulation.)
Patterns
Ethnographers look for patterns of thought and behavior. Patterns are a form of
ethnographic reliability. Ethnographers see patterns of thought and action repeat in
various situations and among various players. Looking for patterns is a form of
analysis. The ethnographer begins with a mass of undifferentiated ideas and behav-
ior, and then collects pieces of information, comparing, contrasting, and sorting
gross categories and minutiae until a discernible thought or behavior becomes
identifiable. Next the ethnographer must listen and observe, and then compare his
or her observations with this poorly defined model. Exceptions to the rule emerge, and variations on a theme become detectable. These variants help circumscribe the activity
and clarify its meaning. The process requires further sifting and sorting to make a
match between categories. The theme or ritualistic activity finally emerges, consist-
ing of a collection of such matches between the model (abstracted from reality) and
the ongoing observed reality.
Any cultural group's patterns of thought and behavior are interwoven strands.
As soon as the ethnographer finishes analyzing and identifying one pattern, another
pattern emerges for analysis and identification. The fieldworker can then compare
the two patterns. In practice, the ethnographer works simultaneously on many
patterns. The level of understanding increases geometrically as the ethnographer moves up the conceptual ladder, mixing and matching patterns and building theory from the ground up. (See Glaser & Strauss, 1967, for a discussion of
grounded theory.)
The observer can make preliminary inferences about the entire economic system
by analyzing the behavior that is subsumed within the pattern, as well as the pat-
terns themselves. Ethnographers acquire a deeper understanding of and apprecia-
tion for a culture as they weave each part of the ornate human tapestry together, by
observing and analyzing the patterns of everyday life (see Davies, 1999, p. 146;
Wolcott, 1999, p. 256).
Key Events
Key or focal events that the fieldworker can use to analyze an entire culture occur
in every social group. Geertz (1973) eloquently used the cockfight to understand
and portray Balinese life. Key events come in all shapes and sizes. Some tell more
about a culture than others, but all provide a focus of analysis (see also Atkinson,
2002; Geertz, 1957).
Key events, like digital photographs or QuickTime videos, concretely convey a
wealth of information. Some images are clear representations of social activity;
others provide a tremendous amount of embedded meaning. Once the event is
recorded, the ethnographer can enlarge or reduce any portion of the picture. A
rudimentary knowledge of the social situation will enable the ethnographer to infer
a great deal from key events. In many cases, the event is a metaphor for a way of life
or a specific social value. Key events provide lenses through which to view a culture.
Key events are extraordinarily useful for analysis. Not only do they help the field-
worker understand a social group, but the fieldworker in turn can use them to
explain the culture to others. The key event thus becomes a metaphor for the cul-
ture. Key events also illustrate how participation, observation, and analysis are inex-
tricably bound together during fieldwork.
Content Analysis
Ethnographers analyze written and electronic data in much the same way they
analyze observed behavior. They triangulate information within documents to test
for internal consistency. They attempt to discover patterns within the text and seek
key events recorded and memorialized in print.
Ethnographers may subject internal documents to special scrutiny to determine whether they are consistent with the program's philosophy. Reviews may also
reveal significant patterns. It is often possible for the ethnographer to infer the
significance of a concept from its frequency and context in the text (see Graneheim
& Lundman, 2004; Krippendorff, 2004, p. 87; Neuendorf, 2001; Roberts, 1997;
Stemler, 2001; Titscher, 2000, p. 224; Tuval-Mashiach, Zilber, & Lieblich, 1998).
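The frequency-and-context logic can be made concrete in a few lines of code. The sketch below, written in Python with a hypothetical program document and a hypothetical concept list, counts the sentences in which each concept appears and prints them for inspection; it is a bare-bones illustration, not a substitute for the systematic procedures described in the sources cited above.

    import re

    # A hypothetical program document and a hypothetical list of concepts of interest.
    document = (
        "The program stresses self-reliance. Staff model self-reliance daily. "
        "Attendance is rewarded with praise. Self-reliance appears in every mission statement."
    )
    concepts = ["self-reliance", "attendance"]

    sentences = re.split(r"(?<=[.!?])\s+", document)
    for concept in concepts:
        hits = [s for s in sentences if concept.lower() in s.lower()]
        print(f"{concept}: found in {len(hits)} of {len(sentences)} sentences")
        for sentence in hits:
            print(f"  context: {sentence}")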
Statistics
Ethnographers use nonparametric statistics more often than parametric statistics
because they typically work with small samples. Parametric statistics require large
samples for statistical significance. The use of nonparametric statistics is also more
consistent with the needs and concerns of most anthropologists. Anthropologists
typically work with nominal and ordinal scales. Nominal scales consist of discrete
categories, such as sex and religion. Ordinal scales also provide discrete categories as
well as a range of variation within each categoryfor example, reform, conservative,
and orthodox variants within the category of Judaism. Ordinal scales do not deter-
mine the degree of difference between subcategories. The Guttman (1944) scale, also known as cumulative scaling or scalogram analysis (Trochim, 2006a), is one example of an ordinal scale that is useful in ethnographic research.
The chi-square test and the Fisher exact probability test are popular nonpara-
metric statistical tools in anthropology. However, all statistical formulas require
that certain assumptions be met before the formulas may be applied to any situa-
tion. A disregard for these assumptions in the statistical equation is as dangerous as
neglect of comparable assumptions in the human equation in conducting ethno-
graphic fieldwork. Both errors result in distorted and misleading efforts at worst,
and waste valuable time at best.
Ethnographers use parametric statistics when they have large samples and lim-
ited time and resources to conduct all the interviews. Survey and questionnaire
work often requires sophisticated statistical tests of significance. t Tests are used to determine whether the difference between the means of two groups is statistically significant (Trochim, 2006b). Analysis of covariance, a regression-based design, is another common test used in ethnography when sample size permits. Ethnographers also use the results
of parametric statistics to test certain hypotheses, cross-check their own observa-
tions, and generally provide additional insight. (See Fetterman, 1998, for discussion
of problems with statistics. See also Handwerker, 2001, p. 222, for examples rang-
ing from factor analysis to logistic regression.)
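For readers who want to see what these tests look like in practice, the sketch below applies a chi-square test, a Fisher exact test, and a t test to small, entirely hypothetical data sets using the scipy library (an assumption of this example, not a tool prescribed in this chapter). As the preceding paragraphs caution, the assumptions behind each test, such as adequate expected cell counts for the chi-square test and adequate sample size for the t test, must be checked before the results are trusted.

    from scipy import stats

    # Hypothetical 2 x 2 table: program participation (rows) by gender (columns).
    table = [[12, 8],
             [5, 15]]

    chi2, p_chi, dof, expected = stats.chi2_contingency(table)
    print(f"chi-square: chi2={chi2:.2f}, p={p_chi:.3f}")

    # Fisher's exact test is the usual fallback when expected cell counts are small.
    odds, p_fisher = stats.fisher_exact(table)
    print(f"Fisher exact: odds ratio={odds:.2f}, p={p_fisher:.3f}")

    # Hypothetical test scores for two groups (interval-level data).
    group_a = [72, 85, 90, 78, 88, 95, 70, 82]
    group_b = [65, 74, 80, 68, 77, 72, 70, 69]
    t, p_t = stats.ttest_ind(group_a, group_b)
    print(f"t test: t={t:.2f}, p={p_t:.3f}")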
Crystallization
Ethnographers crystallize their thoughts at various stages throughout an ethno-
graphic endeavor. The crystallization may bring a mundane conclusion, a novel
insight, or an earth-shattering epiphany. The crystallization is typically the result of
a convergence of similarities that spontaneously strike the ethnographer as relevant
or important to the study. Crystallization may be an exciting process or the result
of painstaking, boring, methodical work. This research gestalt requires attention to
all pertinent variables in an equation.
Every study has classic moments when everything falls into place. After months
of thought and immersion in the culture, the ethnographer discovers that a special
configuration gels. All the subtopics, miniexperiments, layers of triangulated effort,
key events, and patterns of behavior form a coherent and often cogent picture of
what is happening. One of the most exciting moments in ethnographic research is
when an ethnographer discovers a counterintuitive conception of reality, a conception that defies common sense. Such moments make the long days and nights
worthwhile.
Analysis has no single form or state in ethnography. Multiple analyses and forms
of analyses are essential. Analysis takes place throughout any ethnographic
endeavor, from the selection of the problem to the final stages of writing. Analysis
is iterative and often cyclical in ethnography (see Atkinson, 2002, pp. 52, 384; Goetz
& LeCompte, 1984; Hammersley & Atkinson, 1983; Taylor & Bogdan, 1984). The
researcher builds a firm knowledge base in bits and pieces, asking questions, listen-
ing, probing, comparing and contrasting, synthesizing, and evaluating information.
The ethnographer must run sophisticated tests on data long before leaving the field.
However, a formal, identifiable stage of analysis does take place when the ethnog-
rapher physically leaves the field. Half the analysis at this stage involves additional
triangulation, sifting for patterns, developing new matrices, and applying statistical
tests to the data. The other half takes place during the final stage of writing an
ethnography or an ethnographically informed report.
Writing
Ethnography requires good writing skills at every stage of the enterprise. Research
proposals, field notes, memoranda, blogs, shared collaborative Web-based word
processing and spreadsheet documents, interim reports, final reports, articles, and
books are the tangible products of ethnographic work. The ethnographer can share
these written works with participants to verify their accuracy and with colleagues
for review and consideration. Ethnography offers many intangibles, through the
media of participation and verbal communication. However, written products,
unlike transitory conversations and interactions, withstand the test of time.
Writing good field notes is very different from writing a solid and illuminating
ethnography or ethnographically informed report. Note taking is the rawest kind of
writing. The note taker typically has an audience of one. Thus, although clarity,
concision, and completeness are vital in note taking, style is not a primary consid-
eration (see Emerson, Fretz, & Shaw, 1995).
Writing for an audience, however, means writing to that audience. Reports for
academics, government bureaucrats, private and public industry officials, medical
professionals, and various educational program sponsors require different formats,
languages, and levels of abstraction. The brevity and emphasis on findings in a
report written for a program-level audience might raise some academics' eyebrows and cause them to question the project's intellectual effort. Similarly, a refereed
scholarly publication would frustrate program personnel, who would likely feel
that the researcher is wasting their time with irrelevant concerns, time that they
need to take care of business. In essence, both parties feel that the researcher is
simply not in touch with their reality. These two audiences are both interested in
the fieldwork and the researchers conclusions, but have different needs and con-
cerns. Good ethnographic work can usually produce information that is relevant to
both parties.
This is possible when performance writing is used to guide ethnographic writ-
ing. Performance writing involves writing for an audience, caring about them, and
hoping that your work will make a difference to them (Madison, 2005, p. 192). It is
not unnecessarily complicated. It is relational in that it treats the reader like a gyroscope or a compass around which the writer's words revolve. The skillful
ethnographer will communicate effectively with all audiences, in part because the ethnographer cares about each audience, using the right smoke signals for the
right tribe. However, it is not simply a matter of language. (See Fetterman, 1987b,
for discussion of the ethnographer as rhetorician. See also Yin, 1994, for discussion
of differing audiences in the presentation of a case study.)
Blogs and Web pages provide a powerful medium for writing progress reports,
posting videos of key events, and capturing the spirit of the community you work
with. They are tools to facilitate reciprocity, by posting reports, tools, and informa-
tion the community values. Blogs and Web pages are also easily customized to mul-
tiple audiences, including scholarly audiences, program staff, and members of the
community. These Web-based documents also are highly accessible. They provide
an immediacy and transparency to ethnographic insights and understandings.
They help solidify a sense of community between the ethnographer and the people
they work with. Blogs and Web pages can be informal or scholarly; however, they
are typically a form of writing that falls between field notes and final reports or arti-
cles (with formal articles and publications linked to the blog or Web page).
Writing is part of the analysis process as well as a means of communication (see
also Hammersley & Atkinson, 1983). Writing clarifies thinking. In sitting down to
put thoughts on paper, an individual must organize those thoughts and sort out spe-
cific ideas and relationships. Writing often reveals gaps in knowledge. If the
researcher is still in the field when he or she discovers those gaps, the researcher
needs to conduct additional interviews and observations of specific settings. If the
ethnographer is a collaborative researcher, he or she might share Web-based word pro-
cessing and spreadsheet documents with community members. This enables com-
munity members to edit and cowrite ethnographic insights and findings. This places
a check on the ethnographer's interpretation and promotes collaboration (community building). I use an interactive spreadsheet with an Arkansas tobacco prevention project to manage incoming data on the number of people who quit smoking and how that number translates into dollars saved in excess medical expenses. Data collection for this project is an iterative and collaborative experience. If the
researcher has left the field, field notes, e-mails (including digital photographs), and
telephone calls must suffice (unless they also use Web-based documents and share
them with community members after leaving the field). Embryonic ideas often come
to maturity during writing, as the ethnographer crystallizes months of thought on
a particular topic. From conception, as a twinkle in the ethnographer's eye, to
delivery in the final report, an ethnographic study progresses through written stages.
(For additional discussions of ethnographic writing, see Fetterman, 1998; Madison,
2005; O'Reilly, 2005; Wolcott, 1990. See Van Maanen, 1988, for some of the rhetori-
cal and narrative devices used in ethnographic work, including realist, confessional,
and impressionist tales.)
Ethics
Ethnographers do not work in a vacuum; they work with people. They often pry into people's innermost secrets, sacred rites, achievements, and failures. In pursuing these personal sciences, ethnographers subscribe to a code of ethics that preserves the participants' rights, facilitates communication in the field, and leaves the
door open for further research.
This code specifies first and foremost that the ethnographer do no harm to the
people or the community under study. In seeking a logical path through the cultural
wilds, the ethnographer is careful not to trample the feelings of natives or desecrate
what the culture calls sacred. This respect for social environment ensures not only
the rights of the people but also the integrity of the data and a productive, enduring
relationship between the people and the researcher. Professionalism and a delicate
step demonstrate the ethnographer's deep respect, admiration, and appreciation for the people's way of life. Noninvasive ethnography is not only good ethics but also
good science (see American Anthropological Association, 1990, 1998; Rynkiewich
& Spradley, 1976; Weaver, 1973). Basic underlying ethical standards include the
securing of permission (to protect individual privacy), honesty, trust (both implicit
and explicit), and reciprocity (see Sieber, Chapter 4, this volume).
Permission
Ethnographers must formally or informally seek informed consent to conduct
their work. In a school district, formal written requests are requisite. Often, the
ethnographer's request is accompanied by a detailed account of the purpose and
design of the study. Similarly, in most government agencies and private industry,
the researcher must submit a formal request and receive written permission. The
nature of the request and the consent changes according to the context of the study.
For example, in a study of tramps, there is no formal structure through which the researcher can request access. However, permission is still necessary to conduct a study. In this
situation, the request may be as simple as the following embedded question to a
tramp: "I am interested in learning about your life, and I would like to ask you a few questions, if that's all right with you." In this context, a detailed explanation of pur-
pose and method might be counterproductive unless the individual asks for addi-
tional detail. (See the section on institutional review boards presented later in this
chapter for more discussion on this topic.)
Honesty
Ethnographers must be candid about their task, explaining what they plan to
study and how they plan to study it. In some cases detailed description is appropriate, and in others extremely general statements are best, depending on the type of audience and its interest in the topic. Few individuals want to hear a detailed discussion of the theoretical and methodological bases of an ethnographer's
work. However, the ethnographer should be ready throughout the study to pre-
sent this information to any participant who requests it. Deceptive techniques are
unnecessary and inappropriate in ethnographic research. Ethnographers need
not disguise their efforts or use elaborate ploys to trick people into responding to
specific stimuli.
Trust
Ethnographers need the trust of the people they work with to complete their
task. An ethnographer who establishes a bond of trust will learn about the many
layers of meaning in any community or program under study. The ethnographer
builds this bond on a foundation of honesty, and communicates this trust verbally
and nonverbally. He or she may speak simply and promise confidentiality as the
need arises. Nonverbally, the ethnographer communicates this trust through self-
presentation and general demeanor. Appropriate apparel, an open physical posture,
handshakes, and other nonverbal cues can establish and maintain trust between an
ethnographer and a participant.
Actions speak louder than words. An ethnographer's behavior in the field is usu-
ally his or her most effective means of cementing relationships and building trust.
People like to talk, and ethnographers love to listen. As people learn that the ethnog-
rapher will respect and protect their conversations, they open up a little more each
day in the belief that the researcher will not betray their trust. Trust can be an
instant and spontaneous chemical reaction, but more often it is a long, steady
process, like building a friendship.
Pseudonyms
Ethnographic descriptions are usually detailed and revealing. They probe
beyond the facade of normal human interaction. Such descriptions can jeopardize
individuals. One person may speak candidly about a neighbor's wild parties and
mention calling the police to complain about them. Another individual may reveal
the arbitrary and punitive behavior of a program director or principal. Each indi-
vidual has provided invaluable information about how the system really works.
However, the delicate web of interrelationships in a neighborhood, a school, or an
office might be destroyed if the researcher reveals the source of this information.
Similarly, individuals involved in illegal activity, ranging from handling venomous rattlesnakes in a religious ceremony to selling heroin in East Detroit in order to build a gang empire, have a legitimate concern about the repercussions of the researcher's disclosing their identities.
The use of pseudonyms is a simple way to disguise the identities of individuals
and protect them from potential harm. Disguising the name of the village or
program can also prevent the curious from descending on the community and dis-
rupting the social fabric of its members' lives. Similarly, coding confidential data
helps prevent them from falling into the wrong hands. However, there are limits to
confidentiality in litigation.
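To make the pseudonym and data-coding advice concrete, the short Python sketch below is offered as a minimal illustration only; it is not part of this chapter, and all names, functions, and data in it are hypothetical. It shows one way a fieldworker might assign stable pseudonyms and apply them to field notes before the notes are shared or archived.

import hashlib

# Hypothetical sketch, not the chapter's method: map real names to stable
# pseudonyms so the same person is labeled consistently across field notes.
PSEUDONYMS = {}  # real name -> pseudonym; this key is stored separately


def pseudonym_for(real_name):
    """Return a stable pseudonym, creating one the first time a name is seen."""
    if real_name not in PSEUDONYMS:
        # Derive a short tag from the name so repeated runs give the same label.
        tag = hashlib.sha256(real_name.encode("utf-8")).hexdigest()[:6]
        PSEUDONYMS[real_name] = "Participant-" + tag
    return PSEUDONYMS[real_name]


def code_field_notes(text):
    """Replace every registered real name in a passage of field notes."""
    for real_name, alias in PSEUDONYMS.items():
        text = text.replace(real_name, alias)
    return text


for name in ["Maria Lopez", "James Carter"]:  # hypothetical participants
    pseudonym_for(name)

note = "Maria Lopez said the director reprimanded James Carter in public."
print(code_field_notes(note))

In keeping with the point above about confidential data, the name-to-pseudonym key would be kept separately from the coded notes and protected from disclosure.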
Reciprocity
Ethnographers use a great deal of people's time, and they owe something in
return. In some cases, ethnographers provide a service simply by lending a sympa-
thetic ear to troubled individuals. In other situations, the ethnographer may offer
time and expertise as barter, for example, teaching a participant English or math,
milking cows and cleaning chicken coops, or helping a key actor set up a new com-
puter and learn to use the software. Ethnographers also offer the results of their
research in its final form as a type of reciprocity.
Some circumstances legitimate direct payment for services rendered, such as
having participants help distribute questionnaires, hiring them as guides on expe-
ditions, and soliciting various kinds of technical assistance. However, direct pay-
ment is not a highly recommended form of reciprocity. This approach often
reinforces patterns of artificial dependence and fosters inappropriate expectations.
Direct payment may also shape a person's responses or recommendations through-
out a study. Reciprocity in some form is essential during fieldwork (and, in some
cases, after the study is complete), but it should not become an obtrusive, contam-
inating, or unethical activity.
Conclusion
This chapter has provided a brisk walk through the intellectual landscape of
ethnography, leading the reader step by step through the ethnographic terrain, peri-
odically stopping to smell the roses and contemplate the value of one concept or
technique over another.
Each section of the chapter is built on the one before, as each step on a path follows the step before. Discussion about the selection of a problem or issue has been followed by a detailed discussion of guiding concepts. The ethnographer's next logical step is to become acquainted with the tools of the trade: the methods and techniques required to conduct ethnographic research and the equipment used to
techniques required to conduct ethnographic research and the equipment used to
chisel out this scientific art form. A discussion of analysis in ethnographic research
becomes more meaningful at this stage, once the preceding facets of ethnography
have laid the foundation. Similarly, I have discussed the role of writing in the sec-
ond-to-last section of this chapter because writing is one of the final stages in the
process and because the meaning of writing in ethnography is amplified and made
more illuminating by a series of discussions about what "doing ethnography" entails. Finally, ethics comes last because the complete ethnographic context is nec-
essary to a meaningful discussion of this topic. Step by step, this chapter provides a
path through the complex terrain of ethnographic work.
Exercises
I have presented three assignments that I have found useful in teaching ethnogra-
phy at Stanford University.
The first is called artifacts. It is designed to help students become aware of how
knowledgeable and insightful they already are, relying primarily on their observa-
tional skills and common sense.
The assignment also highlights the limitations of observation and the need to
ask questions and interview people to more accurately learn about what's going on.
The second assignment is designed to help students apply ethnographic con-
cepts and techniques to their observations.
The third assignment is designed to provide students with an opportunity to
apply ethnographic concepts and techniques to the art of interviewing.
Artifacts Assignment
Students are asked to bring in objects, pictures, and other relevant materials that they
can share with a peer. These items should tell someone something about who they are.
Instructions to students:
3. Your partner will share his/her artifacts with you at the same time. You will record what your partner's artifacts say about him/her, but he/she will not help explain anything about the meaning of the artifacts.
4. Both of you will be taking notes on what you observe. Describe the items or
artifacts and then briefly explain what the artifacts mean or tell you about the other
person.
5. Then you will take turns explaining to your partner what you think the artifacts
mean or say about your partner. Do not interrupt your peer. Let them complete their
explanation. If you interrupt or correct them it will alter the rest of their explanation.
6. After your partner has completed his/her explanation, you can confirm and
correct their story about you (based on the artifacts).
7. After completing the exercise, share
a. how powerful the experience was,
b. what you learned about your observational skills, and
c. how important it is to be able to ask people what they think.
Observation Assignment
The second assignment involves observing a situation or event.
Student assignment:
1. We would like you to observe something for 15 to 20 minutes. Write it down
(2-3 pages) and share it with us by posting it in the appropriate observation folder
(in the virtual classroom).
2. Please read as many of your peers' postings as possible. Feel free to comment
on them by posting messages in their folders.
3. Guidelines concerning the selection of a person or situation to observe.
4. You should pick a situation that allows you to observe individuals unobtru-
sively. We do not want you just staring at someone or making them feel uncom-
fortable for 15 to 20 minutes. However, this is an observation, not an interview, so
observe and record your observations without interviewing the individual.
5. During this observation, we want you to use the ethnographic concepts and
methods you have been reading about and hearing about in class. These tools
should guide your observation. For example, you should be observing using con-
cepts such as culture, holistic perspective, emic and etic perspective, nonjudgmen-
tal orientation, symbolism, and so on.
6. You should also be using methods such as participant or nonparticipant observation, outcroppings, and whatever written information is available.
7. You may want to read about field notes and thick description and verbatim
quotations in Fetterman (1998, chap. 6) to assist you in this assignment. Remember, detail is important in description. Concrete description is desired. You want to bring back a description of what you saw with enough detail that the reader feels like they were there, or pretty close to it.
8. We will discuss the assignment in class and provide a brief critique of each
presentation about your observations.
Interview Assignment
The third assignment involves interviewing and critiquing the interview.
Notes
1. Many computers have built-in cameras to facilitate videoconferencing.
2. The accuracy of voice recognition software is improving. However, a fair amount of
time is required to correct transcription errors. In addition, the software must be trained for each person's voice. This limits the software's utility for conducting interviews.
References
Abramovitch, I., & Galvin, S. (2001). Jews of Brooklyn. New England, MA: Brandeis
University Press.
Aldridge, M. (1995). Scholarly practice: Ethnographic film and anthropology. Visual
Anthropology, 7(3), 233-235.
American Anthropological Association. (1990). Principles of professional responsibility.
Arlington, VA: Author.
American Anthropological Association. (1998). Code of ethics of the American Anthropological
Association. Retrieved October 15, 2004, from www.aaanet.org/committees/ethics/
ethcode.htm
Lee-Treweek, G. (2000). The insight of emotional danger. In G. Lee-Treweek & S. Linkogle (Eds.),
Danger in the field: Risk and ethics in social research (pp. 114-131). London: Routledge.
Lewis, E. D. (2004). Timothy Asch and ethnographic film (Studies in Visual Culture). London:
Routledge.
Madison, S. D. (2005). Critical ethnography: Method, ethics, and performance. Thousand
Oaks, CA: Sage.
Marcus, G. (1998). Ethnography: Through thick and thin. Princeton, NJ: Princeton University Press.
Masten, D., & Plowman, T. (2003). Digital ethnography: The next wave in understanding the
consumer experience. Design Management Journal. Retrieved April 8, 2008, from
http://findarticles.com/p/articles/mi_qa4001/is_200304/ai_n9199413
McCall, G. J. (2006). The fieldwork tradition. In D. Hobbs & R. Wright (Eds.), The Sage hand-
book of fieldwork (pp. 3-22). Thousand Oaks, CA: Sage.
McDermott, R. P. (1974). Achieving school failure: An anthropological approach to illiteracy
and social stratification. In G. D. Spindler (Ed.), Education and cultural process: Toward
an anthropology of education (pp. 82-118). New York: Holt, Rinehart & Winston.
Miles, M. B., & Huberman, A. M. (1994). Qualitative data analysis: An expanded sourcebook
(2nd ed.). Thousand Oaks, CA: Sage.
Murphy, M., & Margolis, M. (Eds.). (1995). Science, materialism, and the study of culture.
Gainesville: University of Florida Press.
Neuendorf, K. (2001). The content analysis cookbook. Retrieved April 8, 2008, from http://
academic.csuohio.edu/kneuendorf/content/resources/car.htm
OReilly, K. (2005). Ethnographic methods. London: Routledge.
Osgood, C. (1964). Semantic differential technique in the comparative study of cultures
[Special issue]. American Anthropologist, 66(3), 171-200.
Pelto, P. J. (1970). Anthropological research: The structure of inquiry. New York: Harper & Row.
Pink, S. (2001). Doing visual ethnography: Images, media, and representation in research.
Thousand Oaks, CA: Sage.
Podolefsky, A., & McCarthy, C. (1983). Topical sorting: A technique for computer assisted
qualitative data analysis. American Anthropologist, 85, 886-890.
Polsky, N. (1967). Hustlers, beats, and others. Chicago: Aldine.
Psathas, G. (1995). Conversation analysis: The study of talk-in-interaction. Thousand Oaks,
CA: Sage.
Punch, M. (1994). Politics and ethics in qualitative research. In N. K. Denzin & Y. S. Lincoln
(Eds.), Handbook of qualitative research (pp. 83-97). Thousand Oaks, CA: Sage.
Riddell, S. (1989). Exploiting the exploited? The ethics of feminist educational research. In
R. G. Burgess (Ed.), The ethics of educational research (pp. 77-99). London: Falmer Press.
Roberts, C. W. (1997). Text analysis for the social sciences: Methods for drawing statistical infer-
ences from texts and transcripts. Mahwah, NJ: Lawrence Erlbaum.
Roberts, C., Byram, M., Barro, A., Jordan, S., & Street, B. (2001). Language learners as ethnog-
raphers. Clevedon, England: Multilingual Matters and Channel View.
Robinson, H. (1994). The ethnography of empowerment: The transformative power of class-
room interaction. London: Falmer Press.
Roper, J. M., & Shapira, J. (2000). Ethnography in nursing research. Thousand Oaks, CA: Sage.
Ross, E. (1980). Beyond the myths of culture: Essays in cultural materialism. New York:
Academic Press.
Rouch, J., & Feld, S. (2003). Cine ethnography. Minneapolis: University of Minnesota Press.
Rynkiewich, M. A., & Spradley, J. P. (1976). Ethics and anthropology: Dilemmas in fieldwork.
New York: John Wiley.
Schensul, J., LeCompte, S., & Schensul, S. (1999). Essential ethnographic methods:
Observations, interviews, and questions. New York: AltaMira Press (a division of
Rowman & Littlefield).
Spindler, G. D., & Spindler, L. (1970). Being an anthropologist: Fieldwork in eleven cultures.
New York: Holt, Rinehart & Winston.
Spindler, G. D., & Spindler, L. (1987). Interpretive ethnography of education at home and
abroad. Hillsdale, NJ: Lawrence Erlbaum.
Spradley, J. P. (1979). The ethnographic interview. New York: Holt, Rinehart & Winston.
Spradley, J. P., & McCurdy, D. W. (1972). The cultural experience: Ethnography in complex
society. Palo Alto, CA: Science Research Associates.
Sproull, L. S., & Sproull, R. F. (1982). Managing and analyzing behavior records: Explorations
in nonnumeric data analysis. Human Organization, 41, 283-290.
Stemler, S. (2001). An overview of content analysis. Practical Assessment, Research, & Evaluation,
7(17). Retrieved April 8, 2008, from http://pareonline.net/getvn.asp?v=7&n=17
Strauss, C., & Quinn, N. (1997). A cognitive theory of cultural meaning. Cambridge, UK:
Cambridge University Press.
Swatos, W. (Ed.). (1998). Encyclopedia of religion and society (p. 505). Lanham, MD: AltaMira
Press (a division of Rowman & Littlefield).
Taylor, S. J., & Bogdan, R. C. (1984). Introduction to qualitative research methods: The search
for meanings. New York: John Wiley.
Titscher, S. (2000). Methods of text and discourse analysis. Thousand Oaks, CA: Sage.
Trochim, W. (2006a). Guttman scale. Research methods knowledge base. Retrieved April 8,
2008, from www.socialresearchmethods.net/kb/scalgutt.htm
Trochim, W. (2006b). T-test. Research methods knowledge base. Retrieved April 8, 2008, from
www.socialresearchmethods.net/kb/stat_t.htm
Tuval-Mashiach, R., Zilber, T., & Lieblich, A. (1998). Narrative research: Reading, analysis, and
interpretation. Thousand Oaks, CA: Sage.
Van Maanen, J. (1988). Tales of the field: On writing ethnography. Chicago: University of
Chicago Press.
Weaver, T. (1973). To see ourselves: Anthropology and modern social issues. Glenview, IL: Scott,
Foresman.
Webb, E. J., Campbell, D. T., Schwartz, R. D., & Sechrest, L. (2000). Unobtrusive measures (2nd
ed.). Chicago: Rand McNally.
Weisner, T., Ryan, G., Reese, L., Kroesen, K., Bernheimer, L., & Gallimore, R. (2001). Behavior
sampling and ethnography: Complementary methods for understanding home-school
connections among Latino immigrant families. Field Methods, 13(1), 20-46. Retrieved
April 8, 2008, from http://fmx.sagepub.com/cgi/content/abstract/13/1/20
Weitzman, E. A., & Miles, M. B. (1995). A software sourcebook: Computer programs for quali-
tative data analysis. Thousand Oaks, CA: Sage.
Wolcott, H. F. (1980). How to look like an anthropologist without really being one. Practicing
Anthropology, 3(2), 56-59.
Wolcott, H. F. (1990). Writing up qualitative research. Newbury Park, CA: Sage.
Wolcott, H. F. (1999). Ethnography: A way of seeing. New York: AltaMira Press (a division of
Rowman & Littlefield).
Yin, R. K. (1994). Case study research: Design and methods (2nd ed.). Thousand Oaks,
CA: Sage.
CHAPTER 18
Focus Group Research
David W. Stewart
Prem N. Shamdasani
Dennis W. Rook
F ocus group research is among the most common research methods used by
social scientists, marketers, policy analysts, health and social services profes-
sionals, political consultants, and other scientists and decision makers to
gather information. Originally called "focussed interviews," this technique came into vogue after World War II and has been a part of the social scientist's tool kit
ever since. Focus groups emerged in behavioral science research as a distinctive
member of the qualitative research family, which also includes individual depth
interviewing, ethnographic participant observation, and projective methods,
among others. As with its qualitative siblings, the popularity and status of focus groups among behavioral researchers have ebbed and flowed over the years, with distinctive
patterns in particular fields. For example, in qualitative marketing studies, the use
of focus groups has grown steadily since the 1970s; and today, business expendi-
tures on focus groups are estimated to account for at least 80% of the $1.1 billion
spent annually on qualitative research (Wellner, 2003).
In sociology, arguably the first field to embrace group research, qualitative
research flourished through the 1950s, faded away in the 1960s and 1970s, and
Authors' Note: This chapter is an updated adaptation of Stewart, Shamdasani, and Rook (2007)
and Shamdasani and Stewart (1992).
reemerged in the 1980s. Various patterns of focus group ascendance, decline, and
revival characterize other fields, yet it is reasonable to conclude that focus group
research has never enjoyed such widespread usage across an array of behavioral
science disciplines and subfields as it does today. Focus groups are used by academic
researchers, government policymakers, and business decision makers. Focus groups
provide a rich and detailed set of data about perceptions, thoughts, feelings, and
impressions of group members in the members' own words. They represent a
remarkably flexible research tool, in that they can be adapted to obtain information
about almost any topic in a wide array of settings and from very different types of
individuals. Group discussions may be very general or very specific; they may be
highly structured or quite unstructured. Visual stimuli, demonstrations, or other
activities may be used within the context of a focus group to provide a basis for dis-
cussion. This flexibility makes the focus group a particularly useful tool and explains
its popularity.
A focus group involves a group discussion of a topic that is the focus of the
conversation. The contemporary focus group interview generally involves 8 to 12
individuals who discuss a particular topic under the direction of a professional
moderator, who promotes interaction and ensures that the discussion remains on
the topic of interest. A typical focus group session will last from 1.5 to 2.5 hours.
The most common purpose of a focus group interview is to stimulate an in-depth
exploration of a topic about which little is known. Focus group research is uniquely
suited for quickly identifying qualitative similarities and differences among people.
Focus groups also provide an efficient means for determining the language people
use when thinking and talking about specific issues and objects, and for suggesting
a range of hypotheses about the topic of interest. Focus groups may be useful at vir-
tually any point in a research program, but they are particularly useful for exploratory
research when rather little is known about the phenomenon of interest. As a result,
focus groups tend to be used very early in research projects and are often followed
by other types of research that provide more quantifiable data from larger groups
of respondents.
Focus groups have also been proven useful following analyses of large-scale,
quantitative surveys. In this use, the focus group facilitates interpretation of quan-
titative results and adds depth to the responses obtained in the more structured sur-
vey. Focus groups also have a place as a confirmatory method that may be used for
testing hypotheses. This application may arise when the researcher has strong rea-
sons to believe that a hypothesis is correct, and where disconfirmation by even a
small group would tend to result in rejection of the hypothesis.
Focus groups can produce quantitative data, but this is at odds with their nature
and primary purpose, which is the collection of qualitative data. Focus groups,
when properly designed and conducted, generate a rich body of data expressed in
the respondents' own words and expressions. The degrees of freedom in participants' responses are high, unlike survey questionnaires that narrow responses to
5-point rating scales or other constrained response categories. In focus groups, par-
ticipants can qualify their responses or identify important contingencies associated
with their answers. Thus, responses have a certain ecological validity not found in
traditional survey research. On the other hand, the data provided by focus groups
may be idiosyncratic and unique to the group.
Although focus groups can be conducted in a variety of sites, ranging from
homes to offices, they are typically held in commercial facilities designed especially
for focus group interviewing. Such facilities provide one-way mirrors and viewing
rooms from which observers may unobtrusively watch an interview in progress. Focus
group facilities may also include equipment for audio- or videotaping interviews
and perhaps even small receivers for moderators to wear in their ears, so that
observers may speak to them and thus provide input into interviews. In an age of
online communication and videoconferencing, focus group facilities also tend to be
equipped for virtual groups where the members may be broadly dispersed geo-
graphically and communicate through electronic media. Focus group facilities tend
to be situated either in locations that are easy to get to, such as just off a major com-
muter traffic artery, or in places such as shopping malls, where people tend to
gather naturally.
Today, focus groups are in use almost everywhere around the globe, but they are
particularly important research tools in nations where survey research is difficult to
conduct due to an unavailability of lists of representative customers, norms gov-
erning contact via telephone or mail, unreliable mail or telephone service, or lan-
guage and literacy problems. In such settings, focus groups are often the only
practical vehicle for collecting information, even when other methods might be
more appropriate for the question at hand.
A variety of research needs lend themselves to the use of focus group interviews.
Among the more common uses of focus groups are the following:
7. Learning how respondents talk about the phenomenon of interest (which may,
in turn, facilitate the design of questionnaires, survey instruments, or other
research tools that might be employed in more quantitative research); and
because they allow individuals to respond in their own words using their own cat-
egorizations and perceived associations. They are not completely void of structure,
however, because the researcher does raise questions of one type or another and the
artificial group setting also influences the character of data obtained from focus
groups. Prototypic ethnographic research is probably the most emic due to its
immersion in natural settings and bottom-up approach to data collection. Survey
research and experimentation tend to produce data that are closer to the etic side
of the continuum, because the response categories used by the respondent are gen-
erally prescribed by the researcher. These response categories may or may not be
those with which the respondent is comfortable, though the respondent may still
select an answer. And even when closed-ended survey questions are the only
options available, some respondents elect to give answers in their own words, as
most experienced survey researchers have discovered.
Neither emic nor etic data are inherently better or worse than the other; they
simply differ. Both kinds of data have their place in social science research; they
complement each other, each compensating for the limitations of the other. Indeed,
one way to view social science research is as a process that moves from the emic to
the etic and back, in a cycle. Phenomena that are not well understood are often first
studied with tools that yield more emic data. As a particular phenomenon is better
understood and greater theoretical and empirical structure is built around it, tools
that yield more etic types of data tend to predominate. As knowledge accumulates,
it often becomes apparent that the explanatory structure surrounding a given phe-
nomenon is incomplete. This frequently leads to the need for data that are more
emic, and the process continues. (Further discussion of the philosophical issues
associated with the use of qualitative research and the complementarity of struc-
tured and unstructured approaches to social science research can be found in
Bogdan & Biklen, 2006; Denzin & Lincoln, 2005; Marshall & Rossman, 2006;
Maxwell, Chapter 7, this volume.)
Focus groups are widely used because they provide useful information and offer
researchers a number of advantages. This information and the advantages of the
technique come at a price, however. We review the relative advantages and limita-
tions of focus group research below. We then present a discussion of the steps
involved in the use and design of focus groups.
Advantages
1. Focus groups can collect data from a group of people much more quickly and
at less cost than would be the case if each individual were interviewed separately.
They can also be assembled on much shorter notice than would be required for a
more systematic, larger survey.
2. Focus groups allow researchers to interact directly with respondents. This
provides opportunities for clarification and probing of responses as well as follow-
up questions. Respondents can qualify responses or give contingent answers to
questions. In addition, researchers can observe nonverbal responses, such as ges-
tures, smiles, and frowns that may carry information that supplements and, on
occasion, even contradicts, verbal responses.
3. The open-response format of focus groups provides researchers the oppor-
tunity to obtain large and rich amounts of data in the respondents own words.
Researchers can determine deeper levels of meaning, make important connections,
and identify subtle nuances in expression and meaning.
4. Focus groups allow respondents to react to and build on the responses of other
group members. This synergistic effect of the group setting may result in the pro-
duction of data or ideas that might not have been uncovered in individual interviews.
5. Focus groups are very flexible. They can be used to examine a wide range of
topics with a variety of individuals and in a variety of settings.
6. Focus groups may be one of the few research tools available for obtaining
data from children or from individuals who are not particularly literate.
7. The results of focus group research are usually easy to understand.
Researchers and decision makers can readily understand the verbal responses of
most respondents. This is not always the case with more sophisticated survey
research that employs complex statistical analyses.
8. Multiple individuals can view a focus group as it is conducted or review video-
or audiotape of the group session. This provides a useful vehicle for creating a com-
mon understanding of an issue or problem. Such an understanding can be especially
helpful for team building and for reducing conflict among decision makers.
Limitations
Although the focus group technique is a valuable research tool when used
appropriately and offers a number of advantages, it is not a panacea for all research
needs. It does have significant limitations, many of which are simply the negative
sides of the advantages listed above:
1. The small numbers of respondents that participate in even several different
focus groups and the convenient nature of most focus group recruiting practices
significantly limit generalization to larger populations. Indeed, persons who are
willing to travel to a locale to participate in a 1- to 2-hour group discussion may be
quite different from the population of interest.
2. The interaction of respondents with one another and with the moderator has
two potentially undesirable effects. First, the responses from members of the group
are not independent of one another; this restricts the generalizability of results.
Second, the results obtained in a focus group may be biased by a very dominant or
opinionated member. More reserved group members may be hesitant to talk.
3. The live and immediate nature of the interaction may lead a researcher or
decision maker to place greater faith in the findings than is actually warranted.
There is a certain credibility attached to the opinion of a live respondent that is
often not present in statistical summaries.
4. The open-ended nature of responses obtained in focus groups often makes
summarization and interpretation of results difficult. Statements by respondents
are frequently characterized by qualifications and contingencies that make direct
comparison of respondents' opinions difficult.
5. A moderator, especially one who is unskilled or inexperienced, may bias
results by knowingly or unknowingly providing cues about what types of responses
and answers are desirable.
Focus group research has been the subject of much controversy and criticism.
Such criticism is generally associated with the view that focus group interviews do not yield "hard" data and the concern that group members may not be representa-
tive of a larger population because of both the small numbers and the idiosyncratic
nature of the group discussion. Such criticisms are unfair, however. Although focus
groups do have important limitations of which researchers should be aware, limi-
tations are not unique to focus group research; all research tools in the social
sciences have significant limitations. The key to using focus groups successfully in
social science research is assuring that their use is consistent with the objectives and
purpose of the research. It is also important to recognize and appreciate the philo-
sophical underpinnings of focus group research.
There is a basis for criticizing focus group research that is poorly designed and
applied to inappropriate research questions. These are problems with any type of
research, but focus group research appears to have become especially prone to
abuse and misapplication (Nelems, 2003). The abuse of focus group research is,
in large measure, a result of its apparent ease and low cost, relative to other tools for
social science research. This is, of course, an illusion because a properly designed
focus group, or a collection of focus groups addressing a common research ques-
tion, is not any easier or cheaper than a survey or experimental design and,
indeed, may be more difficult in some situations.
generally not unique; in fact, they are common to other types of both qualitative
and quantitative research. On the other hand, the communal nature of focus groups
makes some research design issues loom large, particularly those related to the
composition and likely interpersonal dynamics of group participants. The main
design elements of focus group research and their attendant considerations are
summarized in Table 18.2, and elaborated in the following discussion.
Group Composition
Once the researcher has generated a clear statement of the research purpose and
key questions, he or she can move to the second stage of focus group research. As
for a survey, it is important for the researcher to identify a sampling frame, that is, a list of people (households, organizations) the researcher has reason to believe is
representative of the larger population of interest. The sampling frame is the oper-
ational definition of the population. The identification of a sound sampling frame
is far more critical in large-scale survey research than it is for focus group research,
however. Because it is generally inappropriate to generalize far beyond the members
of focus groups, the sampling frame need only be a good approximation of the
population of interest. Thus, if the research is concerned with middle-class parents
of schoolchildren, a membership list for the local PTA might be an appropriate
sampling frame.
Indeed, random samples, which are the rule in most survey research, are less fre-
quently employed in focus group research. The reason is that some focus group discussions concern topics that require special expertise, experience,
or unique knowledge. For example, a random sample of the population of any
given country would be unlikely to produce individuals who could talk knowl-
edgeably about the direction of information technology over the next 50 years
or persons who could discuss their feelings about having contracted AIDS. Thus,
purposive sampling, in which respondents are purposely selected because they have
certain characteristics, is often used in focus group research. Random sampling is
also common in recruiting focus group participants, but it is important to recog-
nize that the representativeness of any set of focus group participants is diminished
by their participation in the group experience.
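As a hedged illustration of this difference, the brief Python sketch below (not from this chapter; the roster, field names, and screening rule are all hypothetical) contrasts purposive recruitment from a sampling frame with a simple random draw.

import random

# Hypothetical sampling frame of candidate focus group participants.
frame = [
    {"name": "A", "parent": True,  "income": "middle"},
    {"name": "B", "parent": False, "income": "middle"},
    {"name": "C", "parent": True,  "income": "low"},
    {"name": "D", "parent": True,  "income": "middle"},
    {"name": "E", "parent": False, "income": "high"},
]

# Purposive: keep only candidates with the characteristics the study needs
# (here, middle-income parents), then invite from that screened pool.
eligible = [p for p in frame if p["parent"] and p["income"] == "middle"]
purposive_group = random.sample(eligible, k=min(2, len(eligible)))

# Random: draw from the whole frame, regardless of characteristics.
random_group = random.sample(frame, k=2)

print("Purposive recruits:", [p["name"] for p in purposive_group])
print("Random recruits:", [p["name"] for p in random_group])

In practice the screening would be done through a recruitment questionnaire rather than a list lookup, but the logic is the same: eligibility is defined by the characteristics the research question requires.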
Unlike survey research, where data are obtained from respondents whose
answers are independent of one another, the design of focus group research must
also include consideration of the likely dynamics that will be produced by any par-
ticular combination of individuals (Carey & Smith, 1994). For example, the inter-
action among a group of 15-year-olds will be very different when their parents are
a part of the group versus when they are alone. Similarly, men may respond differ-
ently in groups composed only of other men than the way they would in groups
made up of a mixture of men and women. Furthermore, it may be unwise to
include individuals whose socioeconomic circumstances are quite different. This
idea is illustrated in the focus group application discussed below.
FOCUS GROUP APPLICATION: WHITE- AND BLUE-COLLAR BEER DRINKERS
Care also needs to be exercised in mixing groups across cultures. For example, in
a 90-minute focus group session involving strangers, participants from more aggres-
sive cultures are likely to dominate. Therefore, the safest strategy would be to avoid
such mixing of participants from diverse cultures. Additionally, some topics and
issues (e.g., sexual habits and contraception use) are perceived to be more personal
and sensitive by members of some cultural groups than by others (Asians compared
with Westerners, for instance). Thus, the moderators of focus groups investigating
such sensitive topics need to exercise a great deal of tact and diplomacy, because
members of some cultures are quite reserved and reluctant to discuss openly behav-
iors and issues that may lead to embarrassment or loss of face. The general prefer-
ence for homogeneous group composition has logical foundations, but one caveat
should be mentioned. Many of the social science studies whose findings discouraged
fielding demographically or culturally diverse focus groups were conducted years
ago, and most have not been replicated recently. Arguably, Americans today would
be more comfortable sitting among people who reflect the nation's demographic
and cultural diversity than their parents or grandparents were.
A growing body of research has focused on the use of focus groups with various
special populations. Such research has examined the unique issues that arise in the
use of focus groups in developing nations (Folch-Lyon, de la Macorra, & Schearer,
1981; Fuller, Edwards, Vorakitphokatorn, & Sermsri, 1993; Knodel, 1995; Stewart &
Shamdasani, 1992), with children (Hoppe, Wells, Morrison, Gillmore, & Wilsdon,
1995; Krueger & Casey, 2000; Vaughn, Schumm, & Singagub, 1996), and among
low-income and minority populations (Jarrett, 1993; Magill, 1993). Although such
populations do require some adaptation of technique, they have all been included
successfully in focus group research.
There is no best mix of individuals in a focus group. Rather, the researcher
needs to consider what group dynamic is most consistent with the research objec-
tives. If the interaction of children and their parents is important for purposes of
the research, then groups should be composed of parents and their children. On
the other hand, if the focus of the research is on adolescents' perspectives on a
topic, the presence of parents in the group may reduce the willingness of the ado-
lescents to speak out and express their feelings. In the latter case, it would be more
consistent with research objectives for the researcher to design groups that include
only adolescents.
The interaction among members of a focus group adds a dimension to data col-
lection that is not common in other forms of social science research. Because the
results obtained from a group are the outcome of both the individuals in the group
and the dynamics of the group interaction, it is common for focus group researchers
to use several groups that differ with respect to composition. Indeed, it is uncom-
mon for focus group research to use only a single group. More often, the research
includes multiple groups composed of different types of individuals and different
mixes of individuals. The specific number of groups that may be included in any
research project is a function of the number of distinct types of individuals from
which the researcher wishes to obtain data and the number of mixtures of individ-
uals of interest to the researcher.
the agency's account executive. The moderator got the group off to a good start,
inviting the girls to introduce themselves and share their current hobbies and inter-
ests. Things went downhill quickly. The discussion guide consisted almost entirely of 15 to 20 evaluative ratings of 30 different lip gloss products. After laboring
over their written evaluations of the first product concept, the girls were asked to
explain why they liked or disliked it. Flavor issues loomed large: "I think blueberry is icky." At this point, the girls remained enthusiastic, but then it was back to the
looming stack of paper concepts and ratings. They quickly appreciated how much
work they had to get through and hunkered down in silence to complete the
required forms. Much of the moderators time was spent collecting and collating
the completed materials. As the girls completed each consecutive concept, their dis-
cussion of likes and dislikes diminished to brief phrases, and some declined to com-
ment at all, knowing how much paperwork remained. Each concept evaluation
took about 10 minutes, during which the focus group observers watched the girls
working away in silence. One new ad agency executive who had never attended a focus group asked an agency research staff member, "Is this what focus groups are like?" He was told, "No, this is a group survey, unfortunately." He responded, "It's like watching people take the SAT."
Asking too many questions was only one problem with the lip gloss focus group. Given
the ostensibly broad, exploratory purpose of the research, restricting its scope to
evaluative ratings of alternative product concepts failed entirely to achieve the main
objective. Also, given the age of the participants, other approaches to asking ques-
tions would have yielded richer data. For example, more playful and indirect ques-
tions, or actually trying (rather than reading about) different products would have
generated greater enthusiasm and within-group interaction.
are covered in the time available. A group discussion might never cover particular
topics or issues unless the moderator intervenes. On the other hand, the frequency
and type of intervention by the moderator clearly affects the nature of the discus-
sion. This raises the question of the most appropriate amount of structure for a
given group. There is, of course, no best answer to this question, because the
amount of structure and the directiveness of the moderator must be determined by
the broader research agenda that gave rise to the focus group: the types of infor-
mation sought, the specificity of the information required, and the way the infor-
mation will be used.
There is also a balance that must be struck between what is important to
members of the group and what is important to the researcher. Less structured
groups will tend to pursue those issues and topics of greater importance, relevance,
and interest to the group. This is perfectly appropriate if the objective of the
researcher is to learn about the things that are most important to the group. Often,
however, the researcher has rather specific information needs. Discussion of issues
relevant to these needs may occur only when the moderator takes a more directive
and structured approach. It is important for the researcher to remember that when
this occurs, participants are discussing what is important to the researcher, not nec-
essarily what they consider significant.
It should be noted, however, that the transcript does not reflect the entire
character of the discussion. Nonverbal communication, gestures, and behavioral
responses are not reflected in a transcript. Thus, the interviewer or observer may wish
to supplement the transcript with some additional observational data that were
obtained during the interview, such as a videotape or notes by an observer. Such
observational data may be quite useful, but they will be available only if their col-
lection is planned in advance. Preplanning of the analyses of the data to be obtained
from focus groups is as important as it is for any other type of research.
As with other types of research, the analysis and interpretation of focus group
data require a great deal of judgment and care. Unfortunately, focus group research
is easily abused and often inappropriately applied. A great deal of the skepticism
about the value of focus groups probably arises from (a) the perception that focus
group data are subjective and difficult to interpret and (b) the concern that focus
group participants may not be representative of a larger population because of both
the small numbers and the idiosyncratic nature of the group discussion.
The analysis and interpretation of focus group data can be as rigorous as the
analysis and interpretation generated by any other method. Focus group data can
even be quantified and submitted to sophisticated mathematical analyses, though
the purpose of focus group interviews seldom requires this type of analysis. Indeed,
there is no one best or correct approach to the analysis of focus group data. The
nature of the analysis of focus group interview data should be determined by the
research question and the purpose for which the data are collected. This, in turn,
has implications for the validity of the findings generated from focus groups.
Researchers should constantly be aware of the possible sources of bias at various
stages of the focus group research process and take appropriate steps to deal with
threats to the validity of the results.
A number of books and papers on focus group research have appeared in recent
years (e.g., Fern, 2001; Greenbaum, 2000; Krueger & Casey, 2000; Morgan, 1997;
Templeton, 1994). Although these publications are useful, their focus has tended to
be more on the mechanics of the interviews themselves rather than on the analysis
of the data generated in focus group sessions (see Stewart, Shamdasani, & Rook,
2007, for an exception). Where analysis is treated, the discussion is often limited to
efforts to identify key themes in focus group sessions. Researchers interested in
more sophisticated approaches have limited options. They can consult the rather
voluminous literature on content analysis that exists outside the marketing domain,
but this literature is not always readily accessible to researchers, particularly those
outside academic settings. The more common approaches to content analysis are
described below.
turn, with a brief introduction. The various pieces of interview transcription are used
as supporting materials and incorporated within an interpretative analysis.
Although the cut-and-sort technique is useful, it tends to rely very heavily on the
judgment of a single analyst. This analyst determines which segments of the tran-
script are important, develops a categorization system for the topics discussed by
the group, selects representative statements regarding these topics from the tran-
script, and develops an interpretation of what it all means. There is obviously much
opportunity for subjectivity and potential bias in this approach. Yet it shares many
of the characteristics of more sophisticated and time-consuming approaches. It
may be desirable to have two or more analysts independently code the focus group
transcript. The use of multiple analysts provides an opportunity to assess the relia-
bility of coding, at least with respect to major themes and issues. When determina-
tion of the reliability of more detailed types of codes is needed, more sophisticated
content-analytic coding procedures are required.
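To make the mechanics of cut-and-sort concrete, here is a minimal Python sketch; it is my illustration rather than anything drawn from the chapter, and the excerpts and theme labels are invented. It simply groups coded transcript segments by theme so that representative quotes can be selected for the report.

from collections import defaultdict

# Each tuple: (theme assigned by the analyst, transcript excerpt). Both are
# hypothetical; in practice the themes reflect the analyst's judgment calls.
coded_segments = [
    ("price", "I'd buy it if it were under ten dollars."),
    ("flavor", "The blueberry one tastes like cough syrup."),
    ("price", "It feels overpriced for what you get."),
    ("packaging", "The tube looks like something my grandmother would own."),
]

piles = defaultdict(list)
for theme, excerpt in coded_segments:
    piles[theme].append(excerpt)

# "Sorting the piles": list each theme with its supporting excerpts so the
# analyst can choose representative statements and draft an interpretation.
for theme, excerpts in sorted(piles.items()):
    print(theme, "(" + str(len(excerpts)) + " excerpts)")
    for quote in excerpts:
        print("  -", quote)

Having a second analyst code the same segments independently, as suggested above, would allow the two sets of theme assignments to be compared for agreement.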
any technique (a) for the classification of the sign-vehicles (b) which relies
solely upon the judgments (which theoretically may range from perceptual
discrimination to sheer guesses) of an analyst or group of analysts as to which
sign-vehicles fall into which categories, (c) provided that the analyst's judgments are regarded as the report of a scientific observer. (p. 55)
A sign-vehicle is anything that may carry meaning, though most often it is likely to
be a word or set of words in the context of a focus group interview. Sign-vehicles
may also include gestures, facial expressions, or any of a variety of other means of
communication, however. Indeed, such nonverbal signs may carry a great deal of
information and should not be overlooked as sources of information.
A substantial body of literature now exists on content analysis, including books by
Krippendorf (2004), Neuendorf (2001), and West (2001). A number of specific
instruments have been developed to facilitate content analysis, including the Message
Measurement Inventory (Smith, 1978) and the Gottschalk-Gleser Content Analysis
Scale (Gottschalk, Winget, & Gleser, 1969). The Message Measurement Inventory was
originally designed for the analysis of communications in the mass media, such as
television programming and newsmagazines. The Gottschalk-Gleser Content
Analysis Scale, on the other hand, was designed for the analysis of interpersonal com-
munication. Both scales have been adapted for other purposes, but they are generally
representative of the types of formal content analysis scales that are in use.
Although content analysis is a specific type of research tool, it shares many features
in common with certain types of research. The same stages of the research process are
found in content analysis as are present in any research project (Krippendorf, 2004):
data making, data reduction, inference, analysis, validation, testing for correspon-
dence with other methods, and testing hypotheses regarding other data.
Data Making. Data used in content analysis include human speech, observations of
behavior, and various forms of nonverbal communication. The speech itself may be
recorded, and, if video cameras are available, at least some of the behavior and nonver-
bal communication may be permanently archived. Such data are highly unstructured,
however, at least for the purposes of the researcher. Before the researcher can analyze the
content of a focus group session, he or she must convert it into specific units of infor-
mation. The particular organizing structure a researcher chooses will depend on the
particular purpose of the research, but there are specific steps in the structuring process
that are common to all applications. These steps are unitizing, sampling, and recording.
Unitizing involves defining the appropriate unit or level of analysis. It would be
possible to consider each word spoken in a focus group session as a unit of analy-
sis. Alternatively, the unit of analysis could be a sentence, a sequence of sentences,
or a complete dialogue about a particular topic. Krippendorf (2004) suggests that
in content analysis, there are three kinds of units that must be considered: sampling
units, recording units, and context units. Sampling units are those parts of the
larger whole that can be regarded as independent of each other. Sampling units
tend to have physically identified boundaries. For example, sampling units may be
defined as individual words, complete statements of an individual, or the totality of
an exchange between two or more individuals.
Recording units tend to grow out of the descriptive system that is being
employed. Generally, recording units are subsets of sampling units. For example,
the set of words with emotional connotations would describe certain types of
words and would be a subset of the total words used. Alternatively, individual state-
ments of several group members may be recording units that make up a sampling
unit that consists of all the interaction concerned with a particular topic or issue.
In this latter case, the recording units might provide a means for describing those
exchanges that are hostile, supportive, friendly, and so forth.
Context units provide a basis for interpreting a recording unit. They may be
identical to recording units in some cases, whereas in other cases they may be quite
independent. Context units are often defined in terms of the syntax or structure in
which a recording unit occurs. For example, in marketing research, it is often use-
ful to learn how frequently evaluative words are used in the context of describing
particular products or services. Thus, context units provide a reference for the con-
tent of the recording units.
Sampling units, then, represent the way in which the broad structure of the infor-
mation within the discussion is divided. Sampling units provide a way of organizing
information that is related. Within these broader sampling units, the recording units
represent specific statements and the context units represent the environment or
context in which the statement occurs. The way in which these units are defined can
have a significant influence on the interpretation of the content of a particular focus
group discussion. These units can be defined in a number of different ways. The def-
inition of the appropriate unit of analysis must be driven by both the purpose of the
research and the ability of the researcher to achieve reliability in the coding system.
The reliability of such coding systems must be determined empirically, and in many
cases involves the use of measures of interrater agreement.
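As a hedged sketch of how these three kinds of units might be represented for coding, consider the following Python structures. The classes, field names, and example content are mine, not Krippendorf's; in this representation the context unit is carried as a field of the sampling unit, which is only one of several reasonable choices.

from dataclasses import dataclass, field
from typing import List


@dataclass
class RecordingUnit:
    speaker: str
    statement: str
    codes: List[str] = field(default_factory=list)  # e.g., ["evaluative", "hostile"]


@dataclass
class SamplingUnit:
    topic: str     # the exchange about one topic or issue
    context: str   # context unit: the prompt or surrounding discussion
    recording_units: List[RecordingUnit] = field(default_factory=list)


# Hypothetical exchange from a focus group transcript.
exchange = SamplingUnit(
    topic="price perceptions",
    context="Moderator asked how participants decide a product is worth buying.",
    recording_units=[
        RecordingUnit("P3", "Anything over ten dollars feels like a splurge.",
                      codes=["evaluative"]),
        RecordingUnit("P5", "I agree, unless it is a gift.",
                      codes=["evaluative", "contingent"]),
    ],
)

# A coder works through the recording units within each sampling unit; the
# context string helps the coder interpret what each statement refers to.
for ru in exchange.recording_units:
    print(ru.speaker, ru.codes)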
It is seldom practical to try to unitize all discussion that arises in a focus group.
When multiple focus groups are carried out on the same general topic, complete
unitization becomes even more difficult. For this reason, most content analyses of
focus groups involve some sampling of the total group discussion for purposes of
analysis. The analyst may seek to identify important themes and sample statements
within themes, or use some other approach, such as examining statements made in
response to particular types of questions, or at particular points in the conversa-
tion. Like other types of sampling, the intent of sampling in content analysis is to
provide a representative subset of the larger population. It is relatively easy for a
researcher to draw incorrect conclusions from a focus group if he or she does not
take care to ensure representative sampling of the content of the group discussion.
One can support almost any contention by taking a set of unrepresentative state-
ments out of the context in which they were spoken. Thus, it is important for the
analyst to devise a plan for sampling the total content of group discussions. The
final stage of data making is the recording of the data in such a way as to ensure their reliability and meaningfulness. The recording phase of content analysis is not
simply the rewriting of a statement of one or more respondents. Rather, it is the use
of the defined units of analysis to classify the content of the discussion into cate-
gories such that the meaning of the discussions is maintained and explicated. It is
only after the researcher has accomplished this latter stage that he or she can claim
to actually have data for purposes of analysis and interpretation.
The recording phase of content analysis requires the execution of an explicit set of
recording instructions. These instructions represent the rules for assigning units
(words, phrases, sentences, gestures, and so on) to categories. These instructions must
address at least four different aspects of the recording process (Krippendorf, 2004):
1. The nature of the raw data from which the recording is to be done (transcript, tape recording, film, and so on)
The specific rules referred to above are critical to the establishment of the relia-
bility of the recording exercise and the entire data-making process. Furthermore, it
is necessary that the researcher make these rules explicit and demonstrate that the
rules produce reliable results when used by individuals other than those who devel-
oped them in the first place. Lorr and McNair (1966) question the practice of
reporting high interrater reliability coefficients when they are based solely on the
agreement of individuals who have worked closely together to develop a coding sys-
tem. Rather, these researchers suggest that the minimum requirement for establish-
ing the reliability of a coding system is a demonstration that judges using only the
coding rules exhibit agreement.
Once a set of recording rules has been defined and demonstrated to produce reliable
results, the researcher can complete the data-making process by applying the record-
ing rules to the full content of the material of interest. Under ideal circumstances,
recording will involve more than one judge, so that the coding of each specific unit can
be examined for reliability and sources of disagreement can be identified and cor-
rected. There is a difference between developing a generally reliable set of recording
rules and assuring that an individual element in a transcript is reliably coded.
The assessment of the reliability of a coding system may be carried out in a variety
of ways. As noted above, there is a difference between establishing that multiple
recorders are in general agreement (manifest a high degree of interrater reliability) and
establishing that a particular unit is reliably coded. The researcher must decide which
approach is more useful for the given research question. It is safe to conclude that in
most focus group projects, general rater reliability will be more important because the
emphasis is on general themes in the group discussion rather than specific units.
Computation of a coefficient of agreement provides a quantitative index of the
reliability of the recording system. There exists a substantial literature on coeffi-
cients of agreement. Treatment of this literature and issues related to the selection
of a specific coefficient of agreement are beyond the scope of this chapter. Among
the more common coefficients in use are kappa (Cohen, 1960), pi (Scott, 1955), and
alpha (Krippendorf, 2004). All these coefficients correct the observed level of agree-
ment (or disagreement) for the level that would be expected by chance alone.
Krippendorf offers a useful discussion of reliability coefficients in content analysis,
including procedures for use with more than two judges (see also Spiegelman,
Terwilliger, & Fearing, 1953).
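As a worked illustration of the chance-correction idea, the following minimal Python sketch computes Cohen's kappa for two coders. The codes are invented, and the snippet is not taken from any of the cited sources; it simply applies the standard formula in which observed agreement is adjusted by the agreement expected from each coder's marginal category proportions.

from collections import Counter


def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders who assigned one category per unit."""
    assert len(coder_a) == len(coder_b) and len(coder_a) > 0
    n = len(coder_a)
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n

    # Agreement expected by chance, from each coder's marginal proportions.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    categories = set(coder_a) | set(coder_b)
    expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in categories)

    return (observed - expected) / (1 - expected)


# Hypothetical theme codes assigned by two coders to six recording units.
a = ["theme1", "theme2", "theme1", "theme3", "theme1", "theme2"]
b = ["theme1", "theme2", "theme1", "theme1", "theme1", "theme2"]
print(round(cohens_kappa(a, b), 3))  # prints 0.7 for these toy codes

A kappa of 1.0 would indicate perfect agreement beyond chance, while 0 indicates agreement no better than chance; analogous logic, with different chance models, underlies Scott's pi and Krippendorf's alpha.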
Data making tends to be the most time-consuming of all the stages in content
analysis. It is also the stage that has received the greatest attention in the content
analysis literature. The reason for this is that content analysis involves data making
after observations have been obtained, rather than before. Content analysis uses the
observations themselves to suggest what should be examined and submitted to fur-
ther analysis, whereas many other types of research establish the specific domain of
interest prior to observation. In survey research, much of the data making occurs
prior to administration of the survey. Such data making involves identification of
reasonable alternatives from which a respondent selects an answer. Thus, data making is a step in survey research, as it is in all types of research, but there it occurs prior to observation. In content analysis, data making occurs after observation.
Data Analysis. The recording or coding of individual units is not content analysis.
It is merely the first stage in preparation for analysis. The specific types of analyses
that might be used in a given application will depend on the purpose of the research.
Virtually any analytic tool may be employed, ranging from simple descriptive
analysis to more elaborate data reduction and multivariate associative techniques.
Much of the content analysis work that occurs in the context of focus group data
tends to be descriptive, but this need not be the case. Indeed, although focus group
data tend to be regarded as qualitative, proper content analysis of the data can make
them amenable to the most sophisticated quantitative analysis. This is well illus-
trated by development of computer-assisted methods for content analysis.
Among the more frequently cited software programs for content analysis are
TEXTPACK (Mohler & Zuell, 1998), Concordance (Watt, 2004), Wordstat (Provalis
Research, 2005), and TextQuest (Social Science Consulting, 2005). Software for text
analysis is frequently reviewed in journals such as Computers and the Humanities
and Literary and Linguistic Computing. Specialized dictionaries for use in conjunc-
tion with text analysis programs such as the General Inquirer and TEXTPACK are
also available. Antworth and Valentine (1998) provide a brief introduction to sev-
eral of these specialized programs and dictionaries.
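Whichever package is used, the core descriptive step is a tabulation of how often each coded category occurs, overall and within each group. The short sketch below illustrates this step with invented coded units; it is a generic illustration, not the output or interface of any of the programs named above.

    from collections import Counter, defaultdict

    # Hypothetical output of the data-making stage: each recording unit
    # (here, a coded utterance) is tagged with its focus group and the
    # category a coder assigned to it.
    coded_units = [
        ("group_1", "price"), ("group_1", "service"), ("group_1", "price"),
        ("group_2", "quality"), ("group_2", "price"), ("group_2", "quality"),
        ("group_2", "service"), ("group_3", "quality"), ("group_3", "price"),
    ]

    # Simple descriptive analysis: category frequencies overall and by group.
    overall = Counter(category for _, category in coded_units)
    by_group = defaultdict(Counter)
    for group, category in coded_units:
        by_group[group][category] += 1

    print("Overall category frequencies:", dict(overall))
    for group in sorted(by_group):
        total = sum(by_group[group].values())
        shares = {cat: round(n / total, 2) for cat, n in by_group[group].items()}
        print(group, "category shares:", shares)

Counts of this kind can then feed whatever further analysis the research question requires, from simple cross-tabulations to multivariate techniques.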
Work on content analysis has also built on the research on artificial intelligence
and in cognitive science. This more recent work recognizes that associations among
words are often important determinants of meaning. Furthermore, meaning may
be related to the frequency of association of certain words, the distance between
associated words or concepts (often measured by the number of intervening
words), and the number of different associations. The basic idea in this work is that
the way people use language provides insights into the way people organize infor-
mation, impressions, and feelings in memory and, thus, how they tend to think.
The view that language provides insight into the way individuals think about the
world has existed for many years.
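One simple way to operationalize these ideas is to count how often pairs of words co-occur within a fixed window and how many words typically separate them. The sketch below does this for a single invented utterance; it is a generic illustration of proximity-based association, not the algorithm of any particular program cited in this chapter.

    import re
    from collections import Counter, defaultdict

    def association_profile(text, window=5):
        # Consider word pairs separated by at most `window` intervening words;
        # count each co-occurrence and record the number of intervening words.
        tokens = re.findall(r"[a-z']+", text.lower())
        counts = Counter()
        distances = defaultdict(list)
        for i, word in enumerate(tokens):
            for j in range(i + 1, min(i + 2 + window, len(tokens))):
                pair = tuple(sorted((word, tokens[j])))
                counts[pair] += 1
                distances[pair].append(j - i - 1)
        avg_distance = {p: sum(d) / len(d) for p, d in distances.items()}
        return counts, avg_distance

    # Invented fragment of focus group talk.
    snippet = ("The screen is too small and the screen glare makes "
               "the keyboard hard to use when the screen is bright")
    counts, distance = association_profile(snippet)
    pair = ("glare", "screen")  # pairs are stored in alphabetical order
    print(counts[pair], distance[pair])  # co-occur once, with no intervening words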
The anthropologist Edward Sapir (1929) noted that language plays a critical role in how people experience the world. Social psychologists have also long had an interest in the role language plays in the assignment of meaning and in adjustment to the environment (see, e.g., Bruner, Goodnow, & Austin, 1956; Chomsky, 1965;
Sherif & Sherif, 1969). In more recent years, the study of categorization has become
a discipline in its own right and has benefited from research on naturalistic cate-
gories in anthropology, philosophy, and developmental psychology, and the work
on modeling natural concepts that has occurred in the areas of semantic memory
and artificial intelligence (see Hahn & Ramscar, 2001; Medin, Lynch, & Solomon,
2000, for a review of this literature).
Such research has been extended to the examination of focus groups. Building
on theoretical work in the cognitive sciences (Anderson, 1983; Grunert, 1982),
Grunert and Bader (1986) developed a computer-assisted procedure for analyzing
the proximities of word associations. Their approach builds on prior work on con-
tent analysis as well. Indeed, the data-making phase of the approach uses the KWIC
approach as an interactive tool for designing a customized dictionary of categories.
The construction of a customized dictionary of categories is particularly important
for the content analysis of focus groups because the range and specificity of topics
that may be dealt with by focus group interviews are very broad, and no general-purpose dictionary or set of codes and categories is likely to suit the needs of a
researcher with a specific research application.
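A keyword-in-context (KWIC) listing of the kind used in that data-making step can be produced with a few lines of code: every occurrence of a term is printed together with the words surrounding it, which helps the analyst decide what categories a customized dictionary should contain. The sketch below, with an invented transcript fragment, is a generic illustration rather than the specific procedure of Grunert and Bader (1986) or of any of the packages cited earlier.

    import re

    def kwic(text, keyword, context=4):
        # Print each occurrence of `keyword` with `context` words of
        # surrounding text on either side.
        tokens = re.findall(r"\w+", text)
        for i, token in enumerate(tokens):
            if token.lower() == keyword.lower():
                left = " ".join(tokens[max(0, i - context):i])
                right = " ".join(tokens[i + 1:i + 1 + context])
                print(f"{left:>30} [{token}] {right}")

    # Invented transcript fragment about computer workstations.
    transcript = ("I like the workstation but the monitor is too small and "
                  "the workstation fan is noisy when the room gets warm")
    kwic(transcript, "workstation")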
For example, to analyze focus group sessions designed to examine the way
groups of respondents think and talk about computer workstations, the researcher
will need to develop a dictionary of categories that refer specifically to the features
of workstations, particular applications, and specific work environments. To ana-
lyze focus groups designed to examine the use of condoms among inner-city ado-
lescents, it is likely that a dictionary of categories that includes the slang vernacular of those adolescents will be needed.
Use of virtual groups greatly expands the pool of potential participants and adds
considerable flexibility to the process of scheduling an interview. Busy profession-
als and executives, who might otherwise be unavailable for a face-to-face meeting,
can often be reached by means of information technologies. Virtual focus groups
may be the only option for certain types of samples, but they are not without some
costs relative to more traditional groups. The lack of face-to-face interaction often
reduces the spontaneity of the group and eliminates the nonverbal communication
that plays a key role in eliciting responses. Such nonverbal communication is often
critical for determining when further questioning or probing will be useful, and it
is often an important source of interplay among group members. Use of virtual
groups tends to reduce the intimacy of the group as well, making group members
less likely to be open and spontaneous.
The moderator's role is made more difficult, since it is harder to control the participants. Dominant participants are more difficult to quieten, and less active participants are more difficult to recognize. On the other hand, the moderator's task
can be aided by electronic monitoring equipment that keeps an ongoing record of
who has talked and for how long. A visual display can keep the names and fre-
quency of participation of group members before the moderator. Thus, the mod-
erator can draw out the quiet participant, just as in a more typical focus group.
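The kind of running display described above is easy to picture with a short sketch. The code below tallies turns and words per participant from a chat-style log; the speakers and messages are invented, and the sketch stands in for, rather than reproduces, any commercial monitoring system.

    from collections import defaultdict

    # Hypothetical running log of a chat-based virtual group: (speaker, message).
    log = [
        ("Moderator", "What do you like most about the service?"),
        ("Dana", "The price is fair"),
        ("Lee", "I agree about the price, and support is quick"),
        ("Dana", "Support could be better on weekends"),
    ]

    turns = defaultdict(int)
    words = defaultdict(int)
    for speaker, message in log:
        turns[speaker] += 1
        words[speaker] += len(message.split())

    # A display like this lets the moderator see at a glance who has
    # contributed and how much, so quieter participants can be drawn out.
    for speaker in sorted(turns):
        print(f"{speaker}: {turns[speaker]} turns, {words[speaker]} words")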
Virtual groups can take several forms. Telephonic groups (essentially conference calls) have long been used by researchers, but such groups are very awkward, and it is difficult to manage any serious group interaction. Spontaneity is highly
constrained in such groups. Real-time videoconferences have become a common
means for conducting virtual groups in the last several years. Videoconferencing
via telephone lines or the Internet can provide an opportunity for the moderator
to see participants and for participants to see the moderator and other partici-
pants. The success of such groups critically depends on the reliability of the tech-
nology. It is always important that a technical expert be available during the group
research.
Many research firms that specialize in focus group research now include virtual
group capabilities as part of their facility offerings. Virtual groups conducted by
videoconference are not a perfect substitute for on-site groups. The facial expres-
sions and other behavior of group members may not be visible at all or may not be
as visible as in face-to-face group encounters. Group interaction tends to be less
spontaneous. Such groups are inevitably more expensive than more traditional on-
site groups because of the cost of the technology, the need for a technician, and the
cost of connect time.
Two other alternatives for conducting virtual groups involve the use of chat
rooms and bulletin boards. Chat rooms involve real-time interaction among the
moderator and group members. Bulletin boards are asynchronous, so questions
can be posed and answers provided over some extended period of time. Such vir-
tual groups can be very real social groups, but many people remain uncomfortable
with such online sharing. It is also the case that the moderator and participants
cannot see one another, so information that might be present in facial expressions,
tone of voice, and other nonverbal behavior is lost.
Conclusion
With the advent of computer-assisted analysis and real-time, interactive electronic focus groups, the treatment of validity in focus group research may, on the surface, appear to have reached a higher plane of importance and sophistication now that such technology is widely accessible. However, the use of computers alone does not ensure
validity. Like other quantitative techniques, computer analysis of focus group
results also suffers from the GIGO (garbage in, garbage out) problem. Therefore, it
is worthwhile for social science researchers to take note of Brinberg and McGrath's (1985) succinct reminder that "validity is not a commodity that can be purchased with techniques . . . Rather validity is like integrity, character, or quality, to be assessed relative to purposes and circumstances" (p. 13).
In this regard, the validity of focus group findings should be assessed relative to
the research objectives and circumstances that gave rise to the research. Furthermore,
the issue of validity needs to be addressed throughout the focus group research
process, from planning and data collection to data making, analysis, and interpre-
tation. The execution of each step of this research process has the potential to influ-
ence the validity of focus group findings, either positively or negatively. Understanding
the limitations and possible sources of bias at each stage of the focus group process
will enable the researcher to take appropriate measures to deal objectively with
threats to the integrity of the research results.
Discussion Questions
In this chapter, we have examined many facets of focus group research, including
the appropriate role of such research in the social sciences, the design and conduct
of focus group research, the interpretation of the results of focus group research,
and the types of research questions to which focus group research should appro-
priately be applied. Focus group research is not just a group conversation; it is a
complex research tool. You should carefully review this chapter before embarking
on focus group research. The questions that follow will help you identify some of
the critical issues and decisions associated with the use and conduct of focus group
research.
1. For what types of research is the group depth interview (focus group)
appropriate? For what types of research questions is a focus group inappropriate?
2. What are the differences between etic and emic research? How are these dif-
ferences relevant to the use and conduct of focus group research?
3. What does it mean to say that a focus group produces a single observation
rather than observations associated with each member of the group?
4. How does sampling differ in the context of focus groups as compared with
survey research? What are the implications of these differences for the interpreta-
tion of the results of focus group research?
5. How does the composition of a focus group influence the results obtained?
What are some of the social factors that can influence the interaction of focus group
members?
6. What is the role of the moderator of a focus group? What are the character-
istics of a good focus group moderator? Are there different styles for moderating
groups that may be more or less appropriate for particular types of groups?
7. What is an interview guide? What is a good question for focus group
research? What are the characteristics of good questions for use in a focus group?
8. How are probes and follow-up questions used in focus group research?
What is the effect of using probes and follow-up questions on the generalizability
of focus group research?
9. What types of results are produced by a focus group? How are such results
summarized and interpreted?
10. What is content analysis? How might it be applied to the results obtained
from focus group research?
11. Do you agree or disagree with the statement that focus groups should never
be used for evaluative research? Why or why not?
12. List examples of the types of questions for which focus groups might be
appropriate.
Exercises
1. Go online and do a search using the key words "focus group research." Find a report of a study that uses focus group research as the primary research method. Based on what you have learned from this chapter, critique the research. In developing your critique, consider the following questions:
a. What was the purpose of the research? How appropriate was focus group
research for the research question(s) addressed in the research?
b. How appropriate was the sample employed in the research? How general-
izable are the results of the research?
c. What types of questions were asked of the group(s)? Did these questions
fully address the issues that motivated the research? What other questions
might have been asked?
d. How were the results of the group(s) analyzed? Was this analysis appro-
priate? What alternative analyses would you suggest instead of, or in addi-
tion to, what was reported in the paper?
e. What was concluded as a result of conducting the group(s)? Do you agree
or disagree with the conclusion(s)? What would be an appropriate follow-
up to the research?
2. Pick a topic that you think is appropriate for investigation using a focus
group. Such topics might include determinants of customer satisfaction with a particular product or service.
References
Anderson, J. R. (1983). The architecture of cognition. Cambridge, MA: Harvard University
Press.
Andreasen, A. (1985, May/June). Backward market research. Harvard Business Review, 176–178.
Antworth, E., & Valentine, J. R. (1998). Software for doing field linguistics. In J. Lawler & H. A. Dry (Eds.), Using computers in linguistics: A practical guide (pp. 170–196). New York: Routledge.
Bogdan, R. C., & Biklen, S. K. (2006). Qualitative research for education: An introduction to
theory and methods (5th ed.). Boston: Allyn & Bacon.
Brinberg, D., & McGrath, J. E. (1985). Validity and the research process. Beverly Hills, CA:
Sage.
Bruner, J. S., Goodnow, J. J., & Austin, J. G. (1956). A study of thinking. New York: John Wiley.
Carey, M. A., & Smith, M. (1994). Capturing the group effect in focus groups: A special concern in analysis. Qualitative Health Research, 4, 123–127.
Chomsky, N. (1965). Aspects of the theory of syntax. Cambridge: MIT Press.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37–46.
Denzin, N. K., & Lincoln, Y. S. (2005). The Sage handbook of qualitative research. Thousand
Oaks, CA: Sage.
Fern, E. (2001). Advanced focus group research. Thousand Oaks, CA: Sage.
Folch-Lyon, E., de la Macorra, L., & Schearer, S. B. (1981). Focus group and survey research on family planning in Mexico. Studies in Family Planning, 12, 409–432.
Fuller, T. D., Edwards, J. N., Vorakitphokatorn, S., & Sermsri, S. (1993). Using focus groups to adapt survey instruments to new populations: Experience in a developing country. In D. L. Morgan (Ed.), Successful focus groups: Advancing the state of the art (pp. 89–104). Newbury Park, CA: Sage.
Gottschalk, L. A., Winget, C. N., & Gleser, G. C. (1969). Manual of instructions for using the
Gottschalk-Gleser Content Analysis Scales. Berkeley: University of California Press.
Greenbaum, T. L. (2000). Moderating focus groups: A practical handbook and guide to focus
group research. Thousand Oaks, CA: Sage.
Grunert, K. G. (1982). Linear processing in a semantic network: An alternative view of consumer product evaluation. Journal of Business Research, 10, 31–42.
Grunert, K. G., & Bader, M. (1986, August). A systematic way to analyze focus group data. Paper presented at the summer Marketing Educators' Conference of the American Marketing Association, Chicago.
Hahn, U., & Ramscar, M. (2001). Similarity and categorization. New York: Oxford University
Press.
Henderson, N. (2004). Same frame, new game. Marketing Research, 16, 38–39.
Hoppe, M. J., Wells, E. A., Morrison, D. M., Gillmore, M. R., & Wilsdon, A. (1995). Using focus groups to discuss sensitive topics with children. Evaluation Review, 19, 102–114.
Janis, I. L. (1965). The problem of validating content analysis. In H. D. Lasswell, N. Leites, & Associates (Eds.), Language of politics (pp. 42–67). Cambridge: MIT Press.
Jarrett, R. L. (1993). Focus group interviewing with low-income, minority populations: A research experience. In D. L. Morgan (Ed.), Successful focus groups: Advancing the state of the art (pp. 184–201). Newbury Park, CA: Sage.
Knodel, J. (1995). Focus groups as a qualitative method for cross-cultural research in social gerontology. Journal of Cross-Cultural Gerontology, 10(1/2), 7–20.
Krippendorf, K. (2004). Content analysis: An introduction to its methodology. Thousand Oaks,
CA: Sage.
Krueger, R. A., & Casey, M. A. (2000). Focus groups: A practical guide for applied research (3rd
ed.). Thousand Oaks, CA: Sage.
Lorr, M., & McNair, D. M. (1966). Methods relating to evaluation of therapeutic outcome. In
L. A. Gottschalk & A. H. Auerbach (Eds.), Methods of research in psychotherapy.
Englewood Cliffs, NJ: Prentice Hall.
Magill, R. S. (1993). Focus groups, program evaluation, and the poor. Journal of the Sociology of Social Welfare, 20, 103–114.
Marshall, C., & Rossman, G. B. (2006). Designing qualitative research. Thousand Oaks, CA: Sage.
McCracken, G. (1988). The long interview. Newbury Park, CA: Sage.
Medin, D. L., Lynch, E. B., & Solomon, K. O. (2000). Are there kinds of concepts? Annual Review of Psychology, 51, 121–147.
Mohler, P. Ph., & Zuell, C. (1998). TEXTPACK: Short description. Mannheim, Germany: ZUMA.
Morgan, D. L. (1997). Focus groups as qualitative research (2nd ed.). Thousand Oaks, CA: Sage.
Nelems, J. (2003, February). Qualitatively speaking: The focus group–popular but dangerous. Quirk's Marketing Research Review. Retrieved March 26, 2005, from www.quirks.com/articles/article.asp?arg_ArticleId=1086
Neuendorf, K. A. (2001). The content analysis guidebook. Thousand Oaks, CA: Sage.
Provalis Research. (2005). WORDSTAT v4.0: Content analysis and text mining module for
Simstat and QDA Miner. Montreal, Quebec, Canada: Author.
Rook, D. W. (2003). Out-of-focus groups. Marketing Research, 15(2), 11–15.
Rook, D. W. (2007). Let's pretend: Projective methods reconsidered. In R. W. Belk (Ed.), Handbook of qualitative research methods in marketing (pp. 143–155). Hillsdale, NJ: Erlbaum.
Sapir, E. (1929). The status of linguistics as a science. Language, 5, 207–214.
Scott, W. A. (1955). Reliability of content analysis: The case of nominal scale coding. Public Opinion Quarterly, 19, 321–325.
Shamdasani, P., & Stewart, D. W. (1992). Analytical issues in focus group research. Asian Journal of Marketing, 1(1), 27–42.
Sherif, M., & Sherif, C. W. (1969). Social psychology. New York: Harper & Row.
Smith, R. G. (1978). The message measurement inventory: A profile for communication analy-
sis. Bloomington: Indiana University Press.
Social Science Consulting. (2005). TextQuest: Software for text analysis. Rudolstadt,
Germany: Author.
Spiegelman, M. C., Terwilliger, C., & Fearing, F. (1953). The reliability of agreement in content analysis. Journal of Social Psychology, 37, 175–187.
Stewart, D. W., Shamdasani, P. N., & Rook, D. W. (2007). Focus groups: Theory and practice
(2nd ed.). Thousand Oaks, CA: Sage.
Stone, P. J., Dunphy, D. C., Smith, M. S., & Ogilvie, D. M. (1966). The general inquirer:
A computer approach to content analysis. Cambridge: MIT Press.
Templeton, J. F. (1994). The focus group: A strategic guide to organizing, conducting, and
analyzing the focus group interview (2nd ed.). New York: McGraw-Hill.
Vaughn, S., Schumm, J. S., & Sinagub, J. (1996). Focus group interviews in education and psychology. Thousand Oaks, CA: Sage.
Watt, R. J. C. (2004). Concordance: Manual for version 3.2. Dundee, UK: Concordance Software.
Wellner, A. (2003, March). The new science of focus groups. American Demographics, 25(2), 29–33.
West, M. D. (2001). Theory, method, and practice in computer content analysis. Progress in
communication sciences (Vol. 16). Westport, CT: Ablex.
Samuel J. Best is Associate Professor of Political Science and Director of the Center
for Survey Research and Analysis at the University of Connecticut. He has written
numerous academic articles and books, including a volume for Sage, titled Internet
Data Collection.
Carol Cosenza joined the Center for Survey Research, University of Massachusetts
at Boston in 1988. She is currently a project manager and also coordinates the
Center's cognitive testing and question evaluation work. She has been involved in all phases of the survey process, from question design to data coding and analysis.
The recent focus of her methodological research has been comparing different ways
that survey questions can be evaluated and how to understand what is learned from
that testing. She has also been working on a series of studies of how the details of
question wording affect data quality. She graduated from Dartmouth College and
received her MSW from Boston University.
Floyd J. Fowler Jr. has been a senior research fellow at the Center for Survey
Research at University of Massachusetts Boston since 1971. He was Director of the
Center for 14 years. He is the author (or coauthor) of four textbooks on survey
methods, as well as numerous research papers and monographs. His recent work
has focused on studies of question design and evaluation techniques and applying
survey methods to studies of medical care. He received a PhD from the University
of Michigan in 1966.
Systems, Inc., in 1993 after a successful career in the management and growth of
community-based cultural and learning organizations. Her current methodology
and service interests include supporting grant-funded centers in start-up manage-
ment skills for researchers and the linkage of planning, action, and evaluation in
public sector organizations.
Allison Karpyn is Director of Research and Evaluation at The Food Trust,
a Philadelphia-based nonprofit organization committed to providing access to
affordable nutritious foods. In addition, she teaches program planning and evalua-
tion as well as community assessment courses in the MPH program at Drexel
University. She is a member of the American Public Health Association, the Society for Public Health Education, and the American Evaluation Association, and is certified as a professional researcher by the Marketing Research Association. She earned her bachelor's degree in public health at The Johns Hopkins University and her master's and doctorate degrees in policy research, evaluation, and measurement at The
University of Pennsylvania.
Paul J. Lavrakas is a research psychologist and is currently serving as a method-
ological research consultant for several public and private sector organizations. He
served as vice president and chief methodologist for Nielsen Media Research from
2000 to 2007. Previously, he was a professor of journalism and communication
studies at Northwestern University (1978–1996) and at Ohio State University (OSU; 1996–2000). During his academic career, he was the founding faculty director of the Northwestern University Survey Lab (1982–1996) and the OSU Center for Survey Research (1996–2000). Among his publications, he has written a widely
read book on telephone survey methodology and served as the lead editor for three
books on election polling, the news media, and democracy, as well as coauthoring
four editions of The Voter's Guide to Election Polls. He served as a guest editor for a
special issue of Public Opinion Quarterly on Cell Phone Numbers and Telephone
Surveys (2007, Vol. 71, No. 5), and also is the editor of the Encyclopedia of Survey
Research Methods that Sage will publish in 2008. He was a corecipient of the 2003
AAPOR Innovators Award for his work on the standardization of survey response
rate calculations.
James J. Lindsay has worked as a program evaluator, specializing in developing and
implementing evaluations of publicly funded programs. As a social psychologist
trained in basic research, he has an excellent grasp of research methodology and
statistics and has published papers on multiple topics, including human aggression
and behavior related to the natural environment. As Project Coordinator for the
University of Minnesota Volunteerism Project at the Institute, he is responsible for
the analysis of the data and reporting of results. He earned a PhD in 1999 from the
University of Missouri.
Mark W. Lipsey is Director of the Center for Evaluation Research and
Methodology and Senior Research Associate at the Vanderbilt Institute for Public
Policy Studies. His professional interests are in the areas of program evaluation
research, social intervention, field research methodology, and research synthesis
(meta-analysis). The topics of his recent research have been risk and intervention
for juvenile delinquency and substance use, early childhood education programs,
and issues of methodological quality in program evaluation research. He is a recip-
ient of awards from the American Evaluation Association, the Society of Prevention
Research, and the Campbell Collaboration, a Fellow of the American Psychological
Society, and coauthor of the program evaluation textbook, Evaluation: A Systematic
Approach, and the meta-analysis primer, Practical Meta-Analysis.
Julia Littell, PhD, is a professor at the Graduate School of Social Work and Social
Research, at Bryn Mawr College, where she has taught since 1994. She was Research
Director for the National Family Resource Coalition, a Senior Research Fellow at
the Chapin Hall Center for Children, and a lecturer at the School of Social Service
Administration at the University of Chicago. She is coauthor of Systematic Reviews
and Meta-Analysis, Putting Families First: An Experiment in Family Preservation, and
numerous articles and chapters on research and evaluation methods, research syn-
thesis, and child welfare services. She is a member of the editorial boards of Children
and Youth Services Review and the Journal on Social Work Education. She has served
as adviser on research and evaluation projects for community-based and govern-
mental agencies at all levels and for independent foundations. She currently serves
as Editor and Cochair of the International Campbell Collaboration (C2) Social
Welfare Coordinating Group and is a member of the C2 Steering Group. She is a
2006 recipient of the Pro Humanitate Literary Award presented by the Center for
Child Welfare Policy of the North American Resource Center for Child Welfare to
authors who exemplify the intellectual integrity and moral courage required
to transcend political and social barriers to champion best practice in the field
of child welfare. She earned her undergraduate degree from the University of
Washington and her MSW and PhD from the University of Chicago.
Thomas W. Mangione is senior research scientist at John Snow, Inc., in Boston,
Massachusetts, and is Director of its Survey Research Facility. During his graduate
training he worked at the University of Michigan's Survey Research Center, one of the world's premier survey research facilities. He has had more than 35 years of sur-
vey research experience using in-person, telephone, and self-administered data col-
lection modes. He has published several articles and two books on survey research
methodology. He also has been teaching survey research methodology at both the
Boston University and Harvard University schools of public health since the mid-
1970s. He obtained his PhD in organizational psychology from the University of
Michigan in 1973.
Melvin M. Mark is Professor and Head of Psychology at the Pennsylvania State
University. A past president of the American Evaluation Association, he has also
served as editor of the American Journal of Evaluation, where he is now Editor
Emeritus. His interests include the theory, methodology, practice, and profession of
program and policy evaluation. He has been involved in evaluations in a number of
areas, including prevention programs, federal personnel policies, and various edu-
cational interventions including STEM program evaluation. Among his books are
Evaluation: An Integrated Framework for Understanding, Guiding, and Improving
Policies and Programs (with Gary Henry and George Julnes) and the recent SAGE
Handbook of Evaluation (with Ian Shaw and Jennifer Greene), as well as two new
books Evaluation in Action: Interviews With Expert Evaluators (with Jody Fitzpatrick
and Tina Christie) and What Counts as Credible Evidence in Applied Research and
Contemporary Evaluation (with Stewart Donaldson and Tina Christie, Sage) and
the forthcoming Social Psychology and Evaluation (with Stewart Donaldson and
Bernadette Campbell).
She has served on the Accreditation Council of the Association for the Accreditation
of Human Research Protection Programs (AAHRPP).
David W. Stewart is Dean of the A. Gary Anderson Graduate School of
Management at the University of California, Riverside. He is a past editor of the
Journal of Marketing and is the current editor of the Journal of the Academy of
Marketing Science. He has authored or coauthored more than 200 publications and
7 books. He received his PhD and MA in psychology from Baylor University and his
BA in psychology from the University of Louisiana at Monroe.
Abbas Tashakkori is Professor of Research and Evaluation Methodology and
Associate Dean for Research and Graduate Studies in the College of Education of
Florida International University. He has published extensively in national and inter-
national journals and has coauthored or coedited three books. He has a rich history
of research, program evaluation, and writing on minority and gender issues, uti-
lization of integrated methods of research, and teacher efficacy and job satisfaction.
He is a founding coeditor of the Journal of Mixed Methods Research. His latest work
in press is a book with Charles Teddlie titled Foundations of Mixed Methods Research:
Integrating Quantitative and Qualitative Techniques in the Social and Behavioral
Sciences (Sage, expected 2009).
Charles Teddlie is the Jo Ellen Levy Yates Professor (Emeritus) in the College of
Education at Louisiana State University. He is the author of 12 books and numer-
ous chapters and articles on research methods and school/teacher effectiveness.
These include The Foundations of Mixed Methods Research: Integrating Quantitative
and Qualitative Techniques in the Social and Behavioral Sciences (with Abbas
Tashakkori, 2009), The Handbook of School Effectiveness Research (with David
Reynolds, 2000), and Schools Make a Difference: Lessons Learned from a Ten-Year
Study of School Effects (with Sam Stringfield, 1993).
William M. Trochim is Professor of Policy Analysis and Management at Cornell
University and is the Director of Evaluation for the Weill Cornell Clinical and
Translational Science Center, Director of Evaluation for Extension and Outreach,
and Director of the Cornell Office for Research on Evaluation. He is currently
President of the American Evaluation Association. His research is broadly in the
area of applied social research methodology, with an emphasis on program plan-
ning and evaluation methods. In his career, he developed quasi-experimental alter-
natives to randomized experimental designs, including the regression discontinuity
and regression point displacement designs. He created a structured conceptual
modeling approach that integrates participatory group process with multivariate
statistical methods to generate conceptual maps and models useful for theory
development, planning, and evaluation. He has been conducting research with the
National Institutes of Health and the National Science Foundation on the use of
systems theory and methods in evaluation. He has published widely in the areas of
applied research methods and evaluation and is well-known for his textbook, The
Research Methods Knowledge Base, and for his Web site on social research methods.
He received his PhD from the Department of Psychology at Northwestern
University in methodology and evaluation research.
Anthology (2004) and The World of Education (2005). In 1998, he founded the
Robert K. Yin Fund at MIT, which supports seminars on brain sciences as well
as other activities related to the advancement of predoctoral students. He has a
BA from Harvard College (magna cum laude) and a PhD from MIT (brain and
cognitive sciences).