Conducting Educational Research

Sixth Edition
BRUCE W. TUCKMAN
and
BRIAN E. HARPER
All rights reserved. No part of this book may be reproduced in any form or by any
electronic or mechanical means, including information storage and retrieval systems,
without written permission from the publisher, except by a reviewer who may quote
passages in a review.
∞ The paper used in this publication meets the minimum requirements of American
National Standard for Information Sciences—Permanence of Paper for Printed Library
Materials, ANSI/NISO Z39.48-1992.
Brief Contents

PART 1 Introduction 1
Chapter 1 The Role of Research 3
PART 2 Fundamental Steps of Research 21
Chapter 2 Selecting a Problem 23
Chapter 3 Reviewing the Literature 41
Chapter 4 Identifying and Labeling Variables 67
Chapter 5 Constructing Hypotheses and Meta-Analyses 85
Chapter 6 Constructing Operational Definitions of Variables 105
PART 3 Types of Research 121
Chapter 7 Applying Design Criteria: Internal and External Validity 123
Chapter 8 Experimental Research Designs 149
Chapter 9 Correlational and Causal-Comparative Studies 181
Chapter 10 Identifying and Describing Procedures for Observation
and Measurement 205
Chapter 11 Constructing and Using Questionnaires,
Interview Schedules, and Survey Research 243
PART 4 Concluding Steps of Research 287
Chapter 12 Carrying Out Statistical Analyses 289
Chapter 13 Writing a Research Report 315
References 503
Index 509
About the Authors 517
Supplementary teaching and learning tools have been developed to accompany
this text. Please contact Rowman & Littlefield at textbooks@rowman.com for
more information on the following:
PART 1
INTRODUCTION

CHAPTER ONE
The Role of Research
OBJECTIVES
■ What Is Research?
■ Validity in Research
A study has internal validity if its results are a function of the variables that were
manipulated, measured, or controlled in the study rather than of other factors
not systematically dealt with in the study. Internal validity affects observers’
certainty that the research results can be accepted, based on the design of the
study.
A study has external validity if the results obtained would apply in the real
world to other similar programs and approaches. External validity affects observ-
ers’ ability to credit the research results with generality based on the procedures
used.
The process of carrying out an experiment—that is, exercising some con-
trol over the environment—contributes to internal validity while producing
some limitation in external validity. As the researcher regulates and controls
the circumstances of inquiry, as occurs in an experiment, he or she increases
the probability that the phenomena under study are producing the outcomes
attained (enhancing internal validity). Simultaneously, however, he or she
decreases the probability that the conclusions will hold in the absence of the
experimental manipulations (reducing external validity). Without procedures
to provide some degree of internal validity, one may never know what has
caused observed effects to occur. Thus, external validity is of little value with-
out some reasonable degree of internal validity, which gives confidence in a
study’s conclusions before one attempts to generalize from them.
Consider again the example of the science educator who was designing
a new program for fifth graders. For several reasons, his experiment lacked
internal validity. To begin with, he should have applied his different teaching
techniques to the same material to avoid the pitfall that some material is more
easily learned than other material. He might rather have taught both units to
one group of students using the lecture approach and both units to another
group of students using films. Doing so would help to offset the danger that
films might be especially appropriate tools for a single unit, because this special
appropriateness would be less likely to apply to two units than to one.
By using two different films and two different lectures, he would also
minimize the possibility that the effect was solely a function of the merits of a
specific film; it is less likely that both films will be outstanding (or poor) than
that one will be. In repeating the experiment, the science teacher should be
extremely cautious in composing his two groups; if one group contains more
bright students than the other, obviously that group’s advantage would affect
the results. (However, the use of two groups is the best way to ensure that
one teaching approach does not benefit from the advantage of being applied
last.) The educator should also, of course, ensure that his end-of-unit tests are
representative of the learning material; because both groups will get the same
tests, however, their relative difficulty ceases to be as important as it was in the
original plan.
When examining data from two points in time, researchers may be misled by over-
evaluating or underevaluating the effect of an intervening experimental manipula-
tion. The data from Time 1 may itself be the result of unusual circumstances that
render it nonrepresentative of the prevailing conditions. Moreover, phenomena
other than the manipulation may account for any change that occurs. Data from
more than Time 1 and Time 2 should be examined to evaluate any manipulation
between them.
Clearly, one must put time changes in a proper context. Certain changes will
normally occur over time, having little to do with conditions one has imposed
on the situation. One must distinguish between such ordinary changes over
time and those caused by an intervention.
Comparing Groups
The demands of internal validity, or certainty, are most easily met by confining
research to a laboratory, where the researcher can control or eliminate the irrel-
evant variables and manipulate the relevant ones. However, elimination of many
variables, regardless of their centrality to the problem in question, may limit the
external validity, or generality, of the findings. Success in the laboratory may not
indicate success in the real world, where activities are subject to the influence of
real variables that have been shut out of the laboratory. Thus, many research prob-
lems require field settings to ensure external validity. The crux of the problem is to
operate in the field and still achieve internal validity.
In some situations in the field, however, it is impossible to apply fully the
rules of internal validity. Often, for example, any proposed improvements in a
school must ethically be made at the same time for the whole school rather than
just for certain experimental groups. To evaluate the effects of such changes,
the researcher must choose some approach other than the recommended one
of applying the change to and withholding it from equivalent groups.
An incontrovertible fact remains, however, that research, even that carried
out in the field rather than in the laboratory, must impose some artificialities and
restrictions on the situation being studied. It is often this aspect that administrators
find most objectionable. Campbell and Dunnette react to such criticism as follows:
There are at least two possible replies to the perceived sterility of con-
trolled systematic research. On the one hand, it is an unfortunate fact of
scientific life that the reduction of ambiguity in behavioral data to toler-
able levels demands systematic observations, measurement, and control.
Often the unwanted result seems to be a dehumanization of the behav-
ior being studied. That is, achieving unambiguous results may generate
dependent variables that are somewhat removed from the original objec-
tives of the development program and seem, thereby, to lack relevant
content. This is not an unfamiliar problem in psychological research. As
always, the constructive solution is to increase the effort and ingenuity
devoted to developing criteria that are both meaningful and amenable
to controlled observation and measurement. (Campbell and Dunnette,
1968, p. 101)
■ Survey Research
experience being evaluated, the researcher can discover whether the interpreta-
tions of data correspond to the real situation.
Another type of research that often suffers from the absence of a designed
comparison is the follow-up survey. For example, studies of incomes and pro-
jected lifetime earnings of students who attend college suggest considerable
economic gains from a college education. However, to properly evaluate the
economic benefits of a college education, it would be necessary to compare
the projected incomes of students who attended college for less than 4 years
to the incomes of high school graduates in order to determine whether any
advantage accrues to the college student. Because earnings depend in some
degree on other factors besides, or in addition to, the program of study taken
by students, a researcher should not draw conclusions based on an exam-
ination of the graduates of only one program. Such a conclusion requires
comparisons between students who experience different types of education.
The term comparison should be stressed, because survey research limited
to a single group often leads to invalid conclusions about cause-and-effect
relationships.
Perhaps because of its simplicity, survey research abounds in education.
A potentially useful technique in education, as it is in public opinion polling
and the social sciences, the survey has undeniable value as a means of gather-
ing data. It is recommended, however, that surveys be undertaken within a
research design utilizing comparison groups. When properly constructed and
when employed within a proper design, questionnaires and interviews may be
used to great advantage. This book discusses the survey as a research instru-
ment in Chapter 11.
The matter of ethics is important for educational researchers. Because the sub-
jects of their studies are the learning and behavior of human beings, often children,
research may embarrass, hurt, frighten, impose on, or otherwise negatively affect
the lives of the participants. To deal with this problem, the federal government has
promulgated a Code of Federal Regulations for the Protection of Human Sub-
jects (U.S. Department of Health and Human Services, 1991). This code sets forth
specifications for Institutional Review Boards to review research proposals and
ensure that they provide adequate protection for participants under guidelines set
forth in the code. These protections are described in the sections that follow.
Of course, one may ask, “Why do research at all if even one person
might be compromised?” However, the educational researcher must begin
by asserting, and accepting the assertion, that research has the potential to
help people improve their lives. Therefore it must remain an integral part of
human endeavor. Even if one accepts the assertion that research has value in
contributing to knowledge and, ultimately, to human betterment, it is still necessary
to ask, “What ethical considerations must the researcher take into account in
designing experiments that do not interfere with human rights?” The follow-
ing sections review these considerations and suggest guidelines for dealing
with them.
First and foremost, a person has the full right not to participate at all in a study.
To exercise this right, prospective participants must be informed about the
research, and their formal consent to participate must be obtained. As set forth
in the federal code, informed consent requires that prospective participants be
provided with the following information:
Right to Privacy
All participants in a study enjoy the right to keep from the public certain infor-
mation about themselves. For example, many people would perceive an invasion
of privacy in test items in psychological inventories that ask about religious con-
victions or personal feelings about parents. To safeguard the privacy of the sub-
jects, the researcher should (1) avoid asking unnecessary questions, (2) avoid
recording individual item responses if possible, and, most importantly, (3) obtain
direct consent for participation from adult subjects and from parents and teach-
ers for participation by children.
All participants in human research have the right to remain anonymous, that
is, the right to insist that their individual identities not be salient features of
the research. To ensure anonymity, many researchers employ two approaches.
First, they usually group data rather than report individual data; thus scores
obtained from individuals in a study are pooled and reported as averages. Because
an individual’s scores cannot be identified, such a reporting process provides each
participant with anonymity. Second, wherever possible, subjects are identified by
number rather than by name.
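As a minimal illustration of these two practices, consider the following short Python
sketch. The participant numbers and scores are invented for illustration only; the
point is that scores are keyed by number rather than by name and that only the
pooled group result is reported.

    from statistics import mean

    # Hypothetical scores, keyed by participant number rather than by name.
    scores_by_id = {101: 78, 102: 85, 103: 90, 104: 72, 105: 88}

    # Report only the pooled (group) result; no individual score appears in the output.
    group_average = mean(scores_by_id.values())
    print(f"n = {len(scores_by_id)}, group average = {group_average:.1f}")

Reporting only the group average in this way keeps any single participant’s score
out of the results.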
Before starting any testing, it is wise to explain to the subjects that they
have not been singled out as individuals for study. Rather, they should under-
stand that they have been randomly selected in an attempt to study the popula-
tion of which they are representatives. This information should reassure them
that the researcher has no reason to compromise their right to anonymity.
Similar to the concerns over privacy and anonymity is the concern over
confidentiality: Who will have access to a study’s data? In school studies,
students and teachers both may be concerned that others could gain access
to research data and use them to make judgments of individual character or
performance. Certainly, participants have every right to insist that data col-
lected from them be treated with confidentiality. To guarantee this protec-
tion, the researcher should (1) roster all data by number rather than by name,
(2) destroy the original test protocols as soon as the study is completed,
and, when possible, (3) provide participants with stamped, self-addressed
envelopes to return questionnaires directly (rather than turning them in to a
teacher or principal).
Finally, every participant in a study has the right to expect that the researcher
will display sensitivity to human dignity. Researchers should particularly
reassure potential participants that they will not be hurt by their participation.
Although some studies, by their very nature, require that their true purposes
be camouflaged (or at least not divulged) before their completion, participants
have the right to insist that the researcher explain a study to them after it is
completed. This is a particularly important protection to overcome any nega-
tive effects that might result from participation.
The research process described in this book applies the scientific method: Pose
a problem to be solved, construct a hypothesis or potential solution to that
problem, state the hypothesis in a testable form, and then attempt to verify the
hypothesis by means of experimentation and observation. The purpose of this
book is to provide the potential researcher with the skills necessary to carry
out this research process. This section lists and briefly describes the steps in the
research process, which are discussed in detail in subsequent chapters.
Identifying a problem
Identifying a problem can be the most difficult step in the research process.
One must discover and define for study not only a general problem area but
also a specific problem within that area. Chapter 2 presents sample models for
helping to identify and define problem areas and problems for potential study.
Constructing a hypothesis
After identifying a problem, the researcher often employs the logical pro-
cesses of deduction and induction to formulate an expectation for the outcome
of the study. That is, he or she conjectures or hypothesizes about the relation-
ships between the concepts identified in the problem. This process is the topic
of Chapter 5.
■ Self-Evaluations
This book includes a procedure for self-evaluation, that is, for measuring and
improving your learning of its content and for mastering its objectives: the
Competency Test Exercises that appear at the end of every chapter. These exer-
cises correspond to the objectives listed at the beginning of each chapter. Addi-
tionally, there are supplementary materials that will allow you to evaluate your
understanding of key concepts in each chapter.
■ Summary
d. Identifying a problem
e. Writing a final report
f. Constructing devices for measurement
g. Resolving discipline problems
h. Identifying and labeling variables
i. Constructing an experimental design
j. Adjusting individual initiative
k. Constructing a hypothesis
l. Reviewing the literature
7. Describe in one sentence each of the four individual ethical rights of a par-
ticipant in an experiment.
■ Recommended Reference
Sieber, J. E. (1992). Planning ethically responsible research. Newbury Park, CA: Sage.
PART 2
FUNDAMENTAL STEPS OF RESEARCH

CHAPTER TWO
Selecting a Problem
OBJECTIVES
■ Characteristics of a Problem
Although the task of selecting a research problem is often one of the most
difficult steps in the research process, it is unfortunately the one for which
the least guidance can be given. Problem selection is not subject to specific,
technical rules or requirements like those that govern research design, mea-
surement techniques, and statistics. Fortunately, however, some guidelines can
be offered.
A good problem statement displays the following characteristics:
A research problem is best stated in the form of a question (as distinct from
declarative statements of the hypotheses derived from the problem; see Chapter 5).
Consider some examples:
• The purpose of the study was to discover the relationship between rote
learning ability and socioeconomic status.
Empirical Testability
Questions about ideals or values are often more difficult to study than ques-
tions about attitudes or performance. Some examples show problems that
would be difficult to test: Should people disguise their feelings? Should chil-
dren be seen and not heard? Some problems represent moral and ethical issues,
such as: Are all philosophies equally inspiring? Should students avoid cheating
under all circumstances? These types of questions should be avoided. After
completing Chapter 6 on operational definitions, you may feel that you can
bring some ethical questions into the range of solvable problems, but in general
they are best avoided.
From the infinite number of potential problems for study, it is wise for
researchers to narrow the range of possibilities to problems that correspond to
their interests and skills. To accomplish this goal, some scheme for classifying
problems provides useful help. Two such schemes are offered in Figures 2.1
and 2.2. (These are only basic illustrations; you should feel free to use any other
scheme that more clearly fits your frame of reference.)
FIGURE 2.2 An Inquiry Model Patterned After One Proposed by Cruikshank (1984)
prior achievement. If your interest is context variables, you might look at class
size, amount of funding, or school climate. If your interest is content variables,
you might look at the nature and scope of the curriculum. Choosing instruc-
tion variables would mean looking at such aspects as time-on-task, the model
of instruction employed by the teacher, or the use or nonuse of computers.
Outcomes might cover a wide variety of learner areas, perhaps also dealing
with changes in any of the other categories (for example, teacher variables).
On the basis of this model, you could identify a large number of prospec-
tive studies and then evaluate each based on the conceptual considerations dis-
cussed in the next subsection. Models such as these, or others, may help you to
narrow the range of problems you want to consider.
• Is the relationship between the quality of one’s support system and the
resolution of an encounter experience mediated by ego identity status?
• Do supportive communities produce citizens who are more likely to express
an expectation for racial harmony than do less supportive communities?
• Do more affluent individuals have support systems that are quantitatively
more expansive and qualitatively more supportive than those of less affluent
individuals?
• Does the resolution of an encounter experience affect one’s sense of agency?
Characteristics of Instruction
variables not only of materials (or a curriculum) but also of equipment and of
the philosophy or plan for instructional management embodied in the teacher’s
guide.
Instructional materials may include the following kinds of resources:
Some teachers use teaching styles that are student centered; some rely more
on lecturing; some teachers display warm attitudes, some more formal ones;
Components of Instruction
The general categories that define the three principal components of an instruc-
tional system are student, teacher, and materials (Figure 2.4). The complexity
of classroom activity and the fact that students, teachers, and materials may all
affect outcomes suggest strongly that researchers should simultaneously study
two or possibly all three of these sources. The variable of principal interest
becomes the characteristic of instruction, while the secondary and tertiary vari-
ables, that is, the components of instruction, allow the researcher to extend the
focus of a study from a single cause to multiple potential causes.
Student characteristics that influence the learning process include aptitude,
ability, prior achievement, IQ, learning rate, age, gender, personality, learning
style, and social class. Some students learn faster than others do. Others
bring more prior experience in the instructional area and greater prior achieve-
ment to a learning situation. Either characteristic will affect learning outcomes
apart from any qualities of the teacher or instructional materials. Moreover, a
concern with individual differences, coupled with a realization of their extent
and importance, should compel an examination of at least one individual dif-
ference measure in each classroom study.
Teacher characteristics are often included in research studies as variables.
These may include such background information on the teacher as years of
teaching experience, degrees held, amount of specialized training, and age. An
alternative profile might cover teacher attitudes, beliefs, perceptions, or phi-
losophies as measured by a test completed by the teachers themselves. A third
category of teacher traits addresses the styles or behaviors that characterize
their teaching; that is, the observed behavior of the teachers in contrast to their
own self-descriptions. A sample instrument for reporting on a teacher’s style
as observed by students is the Tuckman Teacher Feedback Form shown later in
the book in Figures 10.7 and 10.8.
The kinds of learning materials used in a classroom and the subject mat-
ter taught may affect instructional outcomes. The same instructional approach
may vary in effectiveness for teaching social studies as compared to teaching
science, for teaching factual content as compared to teaching conceptual con-
tent, for teaching unfamiliar materials as compared to teaching familiar materi-
als, or for teaching materials organized by topic as compared to teaching unor-
ganized material.
Rather than making the overgeneralization that Treatment A is better than
Treatment B for teaching anything, classroom researchers must restrict their
generalizations to the kinds of materials used in their studies or to particular
content or subject matter. To extend these generalizations, they can choose to
examine more than one type of learning material.
Student Outcomes
Figure 2.4 lists five categories of student outcomes (which are similar to the
categories listed by Gagné & Medsker, 1995). The proposed categories have
two noteworthy features. First, they relate to the ultimate recipient of influ-
ence in the classroom—the student. They represent areas in which students
may be expected to change or gain as a result of classroom experiences (the
input). Second, they represent a more complete set of differentiated outcomes
than the single category of outcome that is often the exclusive target of class-
room intervention research, namely, subject matter achievement. Hence, the
Another way to evaluate potential research problems involves the three cat-
egories of variables shown in Table 2.1. Situational variables refer to condi-
tions present in the environment, surrounding or defining the task to be
performed, and the social background of that performance. Tasks can vary
in their familiarity, meaningfulness, difficulty, and complexity, as well as in the
training, practice, and feedback that may be provided. They can also be per-
formed alone or in groups, and groups may vary in size, leadership, roles,
and norms.
Most researchers do not pursue isolated studies; they carry out related stud-
ies within larger programs of research, giving rise to the term programmatic
research. Programmatic research defines an underlying theme or communality,
partly conceptual and partly methodological, for component studies. Concep-
tual communality identifies a common idea or phenomenon that runs through
all the studies in the series or research program. Methodological communality
defines a similar approach for component studies, often typified by reliance
on a single research setting or way of operationalizing variables. Studies built
around reinforcement theory, for example, shared this common conceptual
framework and the common methodology of the Skinner Box to study the
effects of varying reinforcers or schedules of reinforcement on the strength of
the bar-pressing response.
Within programmatic research, one can generate individual studies by
introducing new situational, dispositional, or behavioral variables or new char-
acteristics of instruction, components of instruction, or student outcomes, to
use terms from Table 2.1 and Figure 2.4, respectively. After determining the
conceptual base and methodological approach of a study, numerous research
problems can be identified. Students undertaking research for the first time can
facilitate the process of generating a research problem by identifying an ongo-
ing program of research and “spinning off” a problem that fits within it.
Consider the following example, taken from the senior author’s own
work. The original research problem was to determine ways of helping college
students to learn from text resources, an important practical problem, since
college students are expected to learn much of the content of their courses
by reading textbooks. The obvious outcome choice would focus on course
achievement, as measured by scores on examinations. A review of the research
This section lists and discusses some critical criteria to apply to a chosen prob-
lem before going ahead with a study of it. Try these questions out on your
potential problem statements.
1. Workability. Does the contemplated study remain within the limits of your
resource and time constraints? Will you have access to the necessary sample
in the numbers required? Can you come up with an answer to the problem?
Is the required methodology manageable and understandable to you?
■ Summary
It should not represent an ethical or moral question, but one that can be
tested empirically (that is, by collecting data).
2. Researchers employ schemes for narrowing the range of problems for con-
sideration. Among these tools, certain conceptual models lay out proposed
sets of linkages between specific variables. The input-process-output
model is one model.
3. Classroom research models typically classify variables as representing (a)
characteristics of instruction (such as instructional materials), (b) compo-
nents of instruction (such as teacher or student characteristics), and (c) stu-
dent outcomes (such as learning-related behaviors).
4. Another problem framework involves (a) situational variables (such as task
or social characteristics), and (b) dispositional variables (such as intelligence
or anxiety), as they affect (c) behaviors such as performance or satisfaction.
5. In choosing a problem, pay particular attention to its (a) workability or
demands, (b) critical mass or size and complexity, (c) interest to you and
others, (d) theoretical value or potential contribution to our understanding
of a phenomenon, and (e) practical value or potential contribution to the
practice of education.
7. Construct a problem statement with at least three variables that fits the
attrition model shown in Figure 2.3. Label the category into which each
variable falls.
8. Critique the research problems constructed in Exercises 6 and 7 in terms of
their:
a. Theoretical value
b. Practical value
■ Recommended Reference
Cronbach, L. J., & Snow, R. E. (1981). Aptitudes and instructional methods (2nd ed.).
New York, NY: Irvington.
CHAPTER THREE
Reviewing the Literature
OBJECTIVES
Research begins with ideas and concepts that are related to one another through
hypotheses about their expected relationships. These expectations are then
tested by transforming or operationalizing the concepts into procedures for
collecting data. Findings based on these data are then interpreted and extended
by converting them into new concepts. (This sequence, called the research
spectrum, is displayed later in the book in Figure 6.1.) But where do research-
ers find the original ideas and concepts, and how can they link those elements
to form hypotheses? To some extent the ideas come out of the researchers’
heads, but to a large extent they come from the collective body of prior work
referred to as the literature of a field. For example, reference to relevant studies
helps to uncover and provide:
• information about work that has already been done and that can be mean-
ingfully extended or applied;
• the status of work in a field, reflecting established conclusions and poten-
tial hypotheses;
• meanings of and relationships between variables chosen for study and
hypotheses;
• a basis for establishing the context of a problem;
• a basis for establishing the significance of a problem.
• Bilingual education
• Cooperative learning
• Cultural differences/multicultural education
• Educational goals/goal setting
• Grouping for instruction
• Mathematics learning and achievement
• Preschool interventions
• Reading instruction
• Restructuring schools/shared decision making
• School climate
• Self-regulated learning
• Teacher as researcher
• Teacher efficacy
• Teacher preservice training
• Writing instruction
In situations that call for original research, it is necessary to survey past work
in order to avoid repeating it. More importantly, past work can and should be
viewed as a springboard into subsequent work, the later studies building upon
and extending earlier ones. A careful examination of major studies in a field
of interest may suggest a number of directions worth pursuing to interpret
prior findings, to choose between alternative explanations, or to indicate useful
applications. Many studies, for example, conclude with the researchers’ sug-
gestions for further research. The mere fact that a study has never been done
before does not automatically justify its worth, though. Prior work should
suggest and support the value of a study not previously undertaken. This point
will be discussed further in Chapter 5.
A researcher can acquire much valuable insight by summarizing the past work
and bringing it up to date. Often, such activity yields useful conclusions about
the phenomena in question and suggests how those conclusions may be applied
in practice. Many researchers choose to review literature specifically to reduce
the enormous and growing body of knowledge to a smaller number of work-
able conclusions that can then be made available to subsequent researchers and
practitioners. The constantly expanding body of knowledge can retain its value
if it is collated and synthesized—a process that enables others to see significant
overlaps as well as gaps and to give direction to a field.
Variables must be named, defined, and joined into problems and hypoth-
eses. This is a large task, made both more meaningful and more manage-
able when undertaken in a broader context than a single research situation.
That context comes from the literature relevant to a chosen area of study. If
every researcher were to start anew, constructing entirely original meanings
and definitions of variables and creating his or her own hypothetical links
between them, the growth of knowledge would become a chaotic rather than a
summative undertaking. Synthesis and application would become difficult if not
impossible to achieve. The mere act of creating all these original meanings,
definitions, and hypothesized links would itself pose enormous difficulty,
especially for the novice. To do a mean-
ingful study, prior relationships between variables in the chosen area must be
explored, examined, and reviewed in order to build both a context and a case
for a subsequent investigation with potential merit and applicability. Such a
review process will help both in understanding the phenomena in question
and in explaining them to a report’s readers. It will also be an invaluable
asset in suggesting relationships to expect or to seek. It will provide useful
definitions, suggest possible hypotheses, and even offer ideas about how to
construct and carry out the study itself. It will save much unnecessary inven-
tion while providing insight into methods for applying critical inventiveness
in building upon and extending past work.
A brief example may be helpful. Sutton (1991) reviewed and synthesized a
series of studies on gender differences in computer access and attitudes toward
computers from 1984 to 1990. The relationship between gender and access as
reflected in 15 studies is shown in Table 3.1.
As the last two columns of the table show, three comparisons had deter-
mined that boys had significantly more access than girls to computers in
school, and they had significantly more access to computers at home in 10 of
the comparisons. This synthesis reflects a clear relationship between student
gender and computer access during this period—a balance shifted in favor of
boys.
A subsequent figure later in the same article depicted the results of 43
studies (reporting 48 comparisons) relating student gender to attitudes toward
technology, using the same format as Table 3.1. Of the 48 comparisons, 26
showed significantly more positive attitudes by boys than by girls. Consider-
ing those results together with those in Table 3.1 suggests a possible explana-
tion that attitudes toward new technologies like computers are based on access
to them. The results presented in this review have the potential for generating
a number of hypotheses for further research.
TABLE 3.1 Summary of Research on Gender Differences in Computer Access in School and at Home

Table 3.1 summarizes 15 studies published between 1984 and 1990: Anderson, Welch,
and Harris (1984); Becker and Sterling (1987); Martinez and Mead (1988); Chen (1986);
Linn (1985); Fetler (1985); Miura (1986); Swadener and Jarrett (1986); Campbell (1989);
Arenz and Lee (1990); Collis, Kass, and Kieren (1989); Colbourn and Light (1987);
Culley (1988); Johnson (1987); and Levin and Gordon (1989). The studies were conducted
in the United States, Canada, Britain, and Israel, with samples ranging from fewer than
100 to more than 50,000 respondents (in one study, teachers rather than students) across
the elementary through high school grades. For each study, the table reports the year,
location, sample size, and grade levels; the final two columns code school access and
home access to computers as showing a significant difference favoring boys, data favoring
boys with no significance reported, no significant difference, or no data provided.
The context of a research problem is the frame of reference orienting the reader
to the area in which the problem is found and justifying or explaining why
the phenomenon is, in fact, a problem. This information appears in the open-
ing statement of any research report, and it may draw upon or refer to prior
published or unpublished work. This reference appears, not to justify specific
hypotheses, but to identify the general setting from which the problem has
been drawn.
semiannually in the middle and at the end of each year. RIE identifies all docu-
ments by ED numbers, listing them in numerical order in the Document
Resumes section. The number listings are accompanied by short abstracts
and sets of descriptors, key words that identify their essential subject matter to
aid searches and retrieval. These descriptors are taken from the Thesaurus of
ERIC Descriptors, which lists and defines all descriptors used by the system.
RIE also catalogs each entry into a subject index (by major descriptor), an
author index, and an institution index. A sample entry appears in Figure 3.1.
Each major descriptor in ERIC is identified by an asterisk. For example,
the descriptor “*Student Motivation” appears next to last in Figure 3.1. Any
Abstracts
middle school, social sciences, theory and practice) that replace the descrip-
tors of the other systems. Titles, abstracts, and order numbers are provided.
Volumes of these abstracts appear monthly, with cumulative author indexes
spanning a year. Computerized searches of dissertation abstracts are provided
through DATRIX (the computer retrieval service of University Microfilms)
guided by key words, which match words appearing in the titles or subject
headings of the dissertations. Combinations of these key words allow researchers
to limit searches to specific information. Such a search returns a list of dis-
sertation titles and order numbers fitting the key words. Dissertations on the
list may then be ordered from University Microfilms.
PsycLIT
Indexes
The Education Index (New York: H. W. Wilson Co., 1929–), for example,
appears monthly, listing studies under headings (for example, CLASSROOM
MANAGEMENT), subheadings (for example, Research), and occasionally
sub-subheadings, titles, and references.
This source covers all articles in approximately 200 educational journals
and magazines. A search of the May 1989 issue combining the example head-
ings above discovers the following entry:
entries covering the 12 months from the preceding July. In both monthly and
yearly volumes, entries are indexed both by subject and by author.
A useful monthly index is the Current Index to Journals in Education
(CIJE) (Phoenix, AZ: The Oryx Press, 1969–). Set up in a manner parallel to
RIE (including descriptors and one-sentence abstracts), this source indexes the
contents of almost 800 journals and magazines devoted to education and related
fields from all over the world. By using the Thesaurus of ERIC Descriptors,
this index allows coordinated literature searching of published and unpublished
sources. Volumes of CIJE appear monthly with cumulated volumes appearing
semiannually in June and December.
A useful index for tracing the influence or for following the work of a
given author is the Social Sciences Citation Index (Philadelphia: Institute for
Scientific Information, 1973–). This index lists published documents that refer-
ence or cite a given work by a given author. If you find an important study in
your area of interest that was completed a few years before, you may want to
see if any follow up work has been done by the same or other authors. You can
do this by looking up the important study (by its author’s name) in the Cita-
tion Index, where you will discover titles of later documents that have referred
to it. You can then track down these more recent articles. This index is the only
resource for tracking connections between articles forward in time.
Finally, Xerox’s Comprehensive Dissertation Index lists dissertation titles
indexed by title and coordinated with Dissertation Abstracts International and
DATRIX.
Reviews
Reviews are articles that report on and synthesize work done by researchers
in an area of interest over a period of time. Reviewers locate articles relevant
to their topics; organize them by content; describe, compare, and often cri-
tique their findings; and offer conclusions and generalizations. Of course, such
a review includes full references of all articles on which it reports.
The principal review journal in education is the quarterly Review of Edu-
cational Research (Washington, DC: American Educational Research Associa-
tion, 1931–). This journal presents comprehensive reviews of a wide variety of
educational topics, with an emphasis on synthesis and updating. As an example
of its coverage, a recent issue contains the following titles:
The article by Sutton on gender and computer access discussed above came
from this review journal.
Review articles are excellent sources for researchers who wish to locate the
bulk of work in an area of interest without having to search it out themselves.
Many disciplines related to education have their own review journals (for
example, the Psychological Bulletin; Washington, DC: American Psychological
Association, 1904–).
Review articles are found not only in review journals but also in hand-
books, yearbooks, and encyclopedias. The best-known of these resources for
education and cognate fields are:
Journals and books are primary sources in educational research. They con-
tain the original work, or “raw materials,” for secondary sources like reviews.
Ultimately, researchers should consult the primary sources to which abstracts
and reviews have led them. Also, these primary sources themselves contain
54 ■ CHAPTER THREE
literature reviews (although often short ones) in which researchers will find
useful input for their own planned work. Moreover, as educational research
proliferates, increasing numbers of books attempt to review, synthesize, and
suggest applications of the work in an area.
One distinction separates research journals from other types of journals
or magazines. A research journal publishes reports of original research stud-
ies, including detailed statements of methodology and results. These journals
are refereed; that is, prior to publication, articles are reviewed and critiqued by
other researchers in the area, whose judgments guide decisions about inclusion
and exclusion (or improvement) of submissions. Because they maintain such
high standards, these journals usually reject at least half of the manuscripts they
receive. Non-refereed journals usually contain discursive articles with occasional
reviews or primary research articles included, but these reports may be written
in a less technical manner than research journal articles to meet the needs of their
readers. Researchers interested in technical accounts and actual research results
should consult research journals for information about studies that interest them.
A partial list of research journals in educational areas appears in Figure 3.3.
The Internet
The process for a literature search involves (1) choosing interest areas and
descriptors, (2) searching for relevant titles and abstracts, and (3) locating
important primary source documents.
The matrix of interest areas shown in Figure 3.4 illustrates one simple way
of considering at least two interest areas at once.
For example, consider a researcher studying incentives for changing the
instructional behavior of elementary school teachers as a function of their per-
sonalities. A search of the ERIC system on the Elementary School descriptor
by itself would locate titles numbering in the thousands; such a general search
would be a waste of time and money. Instead, a researcher could request a search
of all articles that simultaneously contained the descriptor Elementary School
(or Elementary School Teachers) plus various combinations of the following
descriptors: Behavior Change, Changing Attitudes, Change Strategies, Change
Agents, Intervention, Incentive Systems, Locus of Control, Credibility, Beliefs,
Personality, Reinforcement, and Diffusion. The resulting search yielded 69 titles,
most of which were highly relevant to the study under consideration.
The key is the choice of descriptors—in various combinations. Begin with
interest areas, such as those shown in Figure 3.4, and then consider two major
sources of descriptors: relevant concepts (for example, Reinforcement, Behav-
ior Change) and variables (for example, Personality, Locus of Control). Gener-
ate as many relevant descriptors as possible. Be sure to consider the potential
variables of your study as a basis for selecting descriptors.
Consider another example where the use of multiple descriptors greatly
narrows the range of articles located, making for more useful and manageable
search results. Suppose you were interested in a topic in health education such
as using counseling as a means of reducing students’ school-related stress. If
you were to do a search of titles using one of the three relevant descriptors: (1)
Stress, (2) Counseling, or (3) Health, you would obtain a list represented by
one of the three circles below.
If, however, you were to combine two of the descriptors, such as Stress
plus Counseling, the search result would be much smaller and more on target,
reflected by the shaded area where the two circles below overlap. Using all
three descriptors at once would yield an even smaller and even more relevant
set of articles, as reflected by the overlapping part of the three circles below.
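To make the effect of combining descriptors concrete, the following short Python
sketch treats each descriptor as a set of document accession numbers and intersects
the sets. The descriptor contents shown are invented for illustration; only the
narrowing pattern matters.

    # Hypothetical sets of document numbers indexed under each descriptor.
    stress = {"ED101", "ED102", "ED103", "ED104", "ED105"}
    counseling = {"ED103", "ED104", "ED106", "ED107"}
    health = {"ED104", "ED105", "ED107", "ED108"}

    print(len(stress))                        # one descriptor: the broadest list
    print(len(stress & counseling))           # two descriptors: smaller and more relevant
    print(len(stress & counseling & health))  # three descriptors: smallest and most relevant

Each added descriptor can only shrink the list, which is why combining descriptors
yields a more manageable and more relevant set of titles.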
FIGURE 3.4 A matrix of possible interest areas defined by the intersection of two
sets of descriptors
A good search should include three major categories of documents: (1) pub-
lished articles, (2) unpublished articles, and (3) dissertations.1 An ERIC search
is a must, because it provides access not only to the ERIC file of unpublished
documents (which are identified by ED numbers) but also to journal articles
(that is, published papers) cataloged in the Current Index to Journals in Edu-
cation (CIJE; identified by EJ numbers). For example, the 69 titles discovered
in the search on teacher change and personality included 13 articles and 56
unpublished documents. Although the journal article titles can be located via
a manual search using the Education Index, Psychological Abstracts, Socio-
logical Abstracts, and so forth, or by searching through issues of CIJE, this
slow and tedious process still would not yield unpublished documents, which
are largely inaccessible from any source other than ERIC. Hence, the second
step in the search after selecting descriptors should be to conduct a computer-
ized search of the ERIC file including CIJE.
The next step should be to carry out a dissertation search. After you con-
tact DATRIX and input a set of key words (its counterpart of descriptors),
it will generate a list of relevant dissertation titles. Again, remember that the
cost of a search is a function of the number of titles located; to minimize this
cost, combine key words rather than searching for single matches. Specificity
increases relevance and reduces cost.
The last step in the general search process is to locate handbooks, year-
books, and encyclopedias in the reference section of the library and read the
relevant sections, taking particular notice of the references provided. The
Review of Educational Research is a particularly useful source of this kind of
reference material on specific topics. Starting about 5 years before the cur-
rent issue, read through the index of all titles to locate any review articles
that seem relevant to your subject. Locate these articles and select from them
the references most relevant to your area of interest. These references will
primarily cite journal articles that previous reviewers have selected for their
relevance.
Titles and abstracts provide limited information about past work. The ERIC
search provides titles and very short (often single-sentence) abstracts. The
DATRIX search provides only titles. Review articles provide titles of sources
they discuss along with descriptions (also usually of limited length) in their
text. Both ERIC and DATRIX provide document identification numbers that
can lead you to full abstracts in RIE and in Dissertation Abstracts Interna-
tional, respectively. However, the only complete description of a resource is to
be found in the original (or primary) source document itself.
Expense and time constraints prevent full consideration of all the titles
yielded by the various searches. The researcher must be selective, choosing titles
that seem most relevant for further examination. Consulting abstracts, where
available, will help in identifying the most potentially useful and relevant arti-
cles. These articles must then be located or obtained. Unpublished documents
identified through the ERIC search can be purchased either in microfiche or on
paper from the ERIC Document Reproduction Service; simply fill in the form
and follow the procedures described in the back of RIE. Often, these docu-
ments can also be obtained from a library housing the complete ERIC collec-
tion. Microfiche copies are considerably less costly than paper copies, and they
may, in fact, be the only ones available from the library.
Dissertations chosen for further examination can be ordered in microfiche
or on paper from University Microfilms, Inc., Ann Arbor, Michigan. Paper
copies are convenient but costly; thus, one should avoid purchasing a large
number of them. Journal articles must be located directly in the journals in
which they appeared. Libraries are the major sources of journal collections
(although reprints of recent articles can often be obtained by writing directly to
their authors). Once a search locates articles, a researcher can photocopy them
for convenient access.
Of the three types of documents—journal articles, dissertations, and
unpublished reports—journal articles are the most concise and the most tech-
nically valuable sources, because they have satisfied the high requirements for
journal publication. Dissertations are lengthy documents, but supervision by
faculty committees helps to ensure good information. Unpublished reports are
usually lengthy and typically are the poorest of the three types in quality,
although, conversely, they are of the greatest usefulness to the practitioner, in con-
trast to the researcher (Vockell & Asher, 1974). Hence, journal sources should
be examined most closely and completely in the literature review.
Primary documents reveal not only potentially useful methodologies and
findings but also additional references to relevant articles. This interconnected-
ness of significant articles on a topical area enables a researcher to backtrack
from one relevant study to others that preceded it. This backtracking process
often leads to the richest sources of useful and important work in the area of
interest, leading to studies that have been singled out by other researchers for
review and inclusion. Hence, one researcher builds a study on his or her previ-
ous work as well as on that of other researchers, adding to the research in an
area. Finding your way into this collection of interlocking research often deliv-
ers the whole collection for your discovery and review. This access is the payoff
of the literature review process—enabling you to fit your own work into the
context of important past research.
Dissertation searching can also have its payoff. Because a dissertation com-
monly contains an extensive review of the literature of its subject, locating a
relevant dissertation can provide you with a lengthy list of significant titles.
These can be tracked down and examined for inclusion in your own review.
Another viable strategy for searching the literature is to start with a journal
article, review article, or dissertation highly relevant to your area of interest
and then search out, locate, and read all of the sources in its reference list. In
this way, you can find the most relevant ones. Each will include a list of refer-
ences that can then be located and read, and you can continue following up
references in each of the next batch of articles. Cooper (1982) calls this model
of tracking backward to locate antecedent studies the ancestry approach.
Another approach, mentioned previously, is to locate an important article
of interest, and then to locate all of the articles that cite it in their reference
lists using the Social Sciences Citation Index (SSCI). Cooper (1982) calls this
method the descendancy approach, since it focuses on the studies that have
“descended” from a major one. For example, when Tuckman and Jensen (1977)
were asked to determine the impact of Tuckman’s (1965) theory of small-group
development on the field, they went to SSCI to locate all the subsequent stud-
ies that had cited the original article. The researchers then reviewed the find-
ings of those later studies.
abstracts of 100 to 200 words. Although you may incorporate this information
into your own abstract, for an important article, it is preferable to prepare a
more detailed version.
The abstract should be headed by the full reference exactly as it will appear
in your final reference list. (Reference format is described in Chapter 13.) The
abstract itself should be divided into the following three sections: (1) purpose
and hypotheses, (2) methodology, and (3) findings and conclusions. You will
probably not use all of this information in writing your review, but because
you do not know what and how much you will need, it is wise to have it all
(particularly if you borrow copies of the articles and have to return them).
Identify and summarize in a sentence or two the purpose of the study you
are reviewing. Then locate the hypotheses or research questions, and record
them verbatim if they are not too long. If they are lengthy, summarize them.
Underline the names of the variables. In the second paragraph of your abstract,
briefly describe the methodology of the research, including sample characteris-
tics and size, methods for measuring or manipulating the variables, design, and
statistics.
The final paragraph should include a brief summary of each finding and a
clear, concise statement of the paper’s conclusion. Do not trust to memory for
recalling important details of the article. Because you will be reviewing many
studies, their details will become blurred in your memory. Any information
that seems important should be put in the abstract.
Because you will ultimately want to categorize the study when writing up
your literature review, it is useful at this time to generate a category system.
Categories usually reflect the variables of the proposed study or the descrip-
tors used in locating the source document. Write this category name at the top
of the card and file it accordingly for easy use. A sample review abstract of a
journal article is shown in Figure 3.5.
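For readers who keep review abstracts electronically rather than on cards, the
following Python sketch shows one possible way of storing the three sections of an
abstract together with its full reference and filing category. The field names and
the sample entry are only an illustration, not a prescribed format.

    from dataclasses import dataclass

    @dataclass
    class ReviewAbstract:
        category: str     # variable name or descriptor used for filing
        reference: str    # full reference, exactly as it will appear in the reference list
        purpose: str      # purpose and hypotheses
        methodology: str  # sample, measures, design, and statistics
        findings: str     # findings and conclusions

    example = ReviewAbstract(
        category="Computer access",
        reference="Sutton, R. E. (1991). ...",  # complete the reference as it will be cited
        purpose="Synthesize studies of gender differences in computer access and attitudes.",
        methodology="Review of studies published between 1984 and 1990.",
        findings="School and home access tended to favor boys.",
    )

Filing such records by category makes it easy to retrieve all abstracts bearing on a
given variable when the review is written.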
Relevance: Do the citations bear on the variables and hypotheses? (6) Organi-
zation: Is the presentation of the literature review well organized with a clear
introduction, subheadings, and summaries? (7) Convincing argument: Does
the literature help in making a case for the proposed study?
The literature search should be a systematic review aimed at both relevance
and completeness. An effort should be made not to overlook any material that
might be important to the purpose of the review. Fortuitous findings are likely,
but on the whole systematic planning yields better results than luck does. The
process begins with the smallest traces of past work—titles—and then expands
to more detailed abstracts and then to the complete articles and documents
themselves. Finally, these complete sources are reduced to relevant review
abstracts that you yourself write in preparation for a review article, if that is
your purpose. The final reduction produces a set of references, which appear at
the end of your report.
■ Summary
1. The literature review provides ideas about variables of interest based on
prior work that has contributed to an understanding of those variables.
Prior work contributes to the development of new hypotheses.
2. Literature sources include major computerized collections of abstracts
such as ERIC for education, PsycLIT for psychology, and DATRIX for
dissertations. Additional sources include indexes such as the Education
Index and Citation Index, review journals such as the Review of Edu-
cational Research and Annual Review of Psychology, and original source
documents such as journals and books.
3. To conduct a literature search, follow these steps: (1) Choose interest areas
and descriptors (that is, variable names or labels that classify studies in
literature collections such as ERIC). (2) Search by hand or computer for
relevant titles and abstracts. (3) Locate, read, and abstract relevant articles
in primary sources.
4. In preparing an abstract of a research article, briefly describe its purpose
and hypotheses, methodology, and findings and conclusions.
5. A good literature review section should be sufficient in its coverage of the
field, clear, empirical, up-to-date, relevant to the study’s problem, well
organized, and supportive of the study’s hypotheses.
8. Turn to the end of Chapter 14, and find the long abstract of a study enti-
tled, Evaluating Developmental Instruction. Prepare a 100-word abstract
of this study.
■ Recommended References
Cooper, H. M. (1989). Integrating research: A guide for literature reviews (2nd ed.).
Newbury Park, CA: Sage.
Cooper, H. M., & Hedges, L. V. (Eds.). (1994). The handbook of research synthesis.
New York, NY: Russell Sage Foundation.
= CHAPTER FOUR
Identifying and Labeling Variables
OBJECTIVES
Consider this research question: Among students of the same age and intelli-
gence, is skill performance directly related to the number of practice trials, the
relationship being particularly strong among boys, but also holding, though
less directly, among girls? This research question, which indicates that practice
increases learning, involves several variables:
For the purpose of explanation, this section deals solely with the relationship
between a single independent variable and a single dependent variable. How-
ever, it is important to note that most experiments involve many variables,
not just a single independent-dependent pair. The additional variables may be
independent and dependent variables or they may be moderator or control
variables.
Many studies utilize discrete—that is, categorical—independent variables.
Such a study looks either at the presence versus the absence of a particular
treatment or approach, or at a comparison between different approaches.
Other studies utilize continuous independent variables. The researcher’s obser-
vations of such a variable may be stated in numerical terms indicating degree or
amount.
The following list reports a number of hypotheses drawn from studies under-
taken in a research methods course; the independent and dependent variables
have been identified for each one.
Consider also the following two examples drawn from journal sources:
researcher varies test conditions between ego orientation (“write your name
on the paper, we’re measuring you”) and task orientation (“don’t write your
name on the paper, we’re measuring the test”). The test taker’s previously mea-
sured test-anxiety level, a “personality” measure characteristic, is included as a
moderator variable. The combined results show that highly test-anxious people
functioned better under task orientation, and people of low-test anxiety func-
tioned better under ego orientation. This interaction between the independent
variable, the moderator variable, and the dependent variable is shown graphi-
cally in Figure 4.3.
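For readers who want a concrete picture of such an interaction, the following minimal sketch (written in Python, using hypothetical cell means rather than the actual data behind Figure 4.3) shows how the direction of the treatment effect reverses across levels of the moderator variable:

# Hypothetical cell means: performance under two test orientations (independent
# variable) at two levels of test anxiety (moderator variable).
cell_means = {
    ("high anxiety", "ego orientation"):  61,
    ("high anxiety", "task orientation"): 74,
    ("low anxiety",  "ego orientation"):  76,
    ("low anxiety",  "task orientation"): 65,
}

for anxiety in ("high anxiety", "low anxiety"):
    ego = cell_means[(anxiety, "ego orientation")]
    task = cell_means[(anxiety, "task orientation")]
    better = "task" if task > ego else "ego"
    print(f"{anxiety}: ego = {ego}, task = {task}; {better} orientation is better")

# Because the direction of the difference flips across anxiety levels, the
# moderator changes the relationship between the independent and dependent
# variables: the interaction pattern graphed in Figure 4.3.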
Because educational research studies usually deal with highly complex
situations, the inclusion of at least one moderator variable in a study is highly
recommended. Often the nature of the relationship between X and Y remains
poorly understood after a study because the researchers failed to single out and
measure vital moderator variables such as Z, W, and so on.
A number of hypotheses drawn from various sources can help to illustrate the
variables. The moderator variable (along with the independent and dependent
variables) has been identified for each example below.
■ Control Variables
A single study cannot enable one to examine all of the variables in a situa-
tion (situational variables) or in a person (dispositional variables); some must
be neutralized to guarantee that they will not exert differential or moderat-
ing effects on the relationship between the independent variable and the
dependent variable. Control variables are factors controlled by the experi-
menter to cancel out or neutralize any effect they might otherwise have on
observed phenomena. The effects of control variables are neutralized; the
effects of moderator variables are studied. (As Chapter 7 will explain, the
effects of control variables can be neutralized by elimination, equating across
groups, or randomization.)
Certain variables appear repeatedly as control variables in educational
research, although they occasionally serve as moderator variables. Gender,
intelligence, and socioeconomic status are three dispositional variables that are
commonly controlled; noise, task order, and task content are common situ-
ational control variables. In constructing an experiment, the researcher must
always decide which variables to study and which to control. Some of the bases
for this decision are discussed in the last section of this chapter.
■ Intervening Variables
1. Students taught by discovery (a1) will perform better on a new but related
task (c) than students taught by rote (a2).
2. Students taught by discovery (a1) will develop a search strategy (b1)—an
approach to finding solutions—that will enable them to perform better
on a new but related task (c), while students taught by rote will learn
solutions but not strategies (b2), thus limiting their ability to solve trans-
fer problems (c).
The symbols a1 and a2 refer to the two levels of the independent variable,
whereas c refers to the dependent variable. The intervening variable (presence
or absence of a search strategy) is identified as b1 and b2 in the second statement.
Intervening variables can often be discovered by examining a hypothesis
and asking the question, “What characteristic of the independent variable will
cause the predicted outcome?”
The relationship between the five types of variables described in this chapter is
illustrated in Figure 4.4. Note that independent, moderator, and control vari-
ables are inputs or causes: the first two types are the causes being studied in the
research whereas the third represents causes neutralized or “eliminated” from
influence. At the other end of the figure, dependent variables represent effects
while intervening variables are conceptual assumptions that intervene between
operationally stated causes and operationally stated effects.
Example 1
Example 2
Example 3
Consider a study designed to provide feedback for teachers about their in-
class behavior from (1) students, (2) supervisors, (3) both students and super-
visors, and (4) neither. Students’ judgments are again obtained after 12 weeks
to determine whether teachers given feedback from different sources have
shown differential changes of behavior in the directions advocated by the
feedback. Differential outcomes are also considered based on years of teach-
ing experience of each teacher.
• The independent variable is source of feedback. Note that this single inde-
pendent variable or factor includes four levels, each corresponding to a
condition (labeled 1, 2, 3, and 4).
• The moderator variable is years of teaching experience. This single factor
includes three levels (1 to 3 years of teaching experience, 4 to 10 years, and
11 or more years).
• Control variables are students’ grade level (10th, 11th, or 12th grade),
students’ curricular major (vocational only), teachers’ subject (vocational
only), and class size (approximately 15 students).
• The dependent variable is change in teachers’ behavior (as perceived by
students). The purpose of the study is to see how feedback from different
sources affects teachers’ behavior.
• The intervening variable could be identified as the responsiveness of the
teacher to feedback from varying sources, based on the perceived motiva-
tion and perceived value of feedback for each teacher.
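To make the structure of Example 3 concrete, the following minimal sketch (in Python; the labels are paraphrased from the example and are not drawn from an actual data file) lays out the 4 x 3 arrangement of conditions produced by crossing the independent variable with the moderator variable:

from itertools import product

feedback_source = ["students", "supervisors", "both", "neither"]   # independent variable: 4 levels
experience = ["1-3 years", "4-10 years", "11 or more years"]       # moderator variable: 3 levels

conditions = list(product(feedback_source, experience))
print(len(conditions), "cells in the design")                      # 12 cells
for source, years in conditions:
    print(f"feedback from {source:<11} | teacher experience: {years}")

# The dependent variable (change in teacher behavior as perceived by students)
# would be measured in every one of these twelve cells, with control variables
# such as class size held constant across them.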
Example 4
After selecting independent and dependent variables for a study, the researcher
must decide which factors to include as moderator variables and which to
exclude or hold constant as control variables. He or she must decide how to
treat the total pool of other variables (other than the independent) that might
affect the dependent variable. In deciding which variables to include and which
to exclude, the researcher should take into account theoretical, design, and
practical considerations.
Theoretical Considerations
Design Considerations
Beyond the questions already cited, a researcher might ask questions that relate
to the experimental design chosen and its adequacy for controlling for sources
of bias. The list should include the question:
• Have my decisions about moderator and control variables met the require-
ments of experimental design for dealing with sources of invalidity?
Practical Considerations
A researcher can study only so many variables at one time. Human and finan-
cial resources limit this choice, as do deadline pressures. By their nature, some
variables are harder to study than to neutralize, while others are as easily stud-
ied as neutralized. Although researchers are bound by design considerations,
they usually find enough freedom of choice that practical concerns come into
play. In dealing with practical considerations, the researcher must ask ques-
tions like these:
This last concern is a highly significant one. Educational researchers often have
less control over their situations than design and theoretical considerations
alone might necessitate. Thus, researchers must take practical considerations
into account when selecting variables.
■ Summary
1. An independent variable, sometimes also called a factor, is a condition
selected for manipulation or measurement by the researcher to determine
its relationship to an observed phenomenon or outcome. It is the presumed
cause of the outcome, and manipulation creates discrete levels. (Measure-
ment may also enable the researcher to divide it into discrete levels.)
2. A dependent variable is an outcome observed or measured following
manipulation or measurement of the independent variable to determine
the presumed effect of the independent variable. It is usually a continuous
quantity.
3. A moderator variable is a secondary independent variable, selected for
study to see if it affects the relationship between the primary independent
variable and the dependent variable. It is usually measured, and it often
represents a characteristic of the study participants (for example, gender or
grade level or ability level). It too is often divided into levels.
4. A control variable is a characteristic of the situation or person that the
researcher chooses not to study. Its presence and potential impact on the
dependent variable must be canceled out or neutralized.
5. An intervening variable is a factor that theoretically explains why the
independent variable affects the dependent variable as it does. It is a con-
cept, created or influenced by the independent variable, through which the
independent variable has its effect. Unlike the other types, which are real
and observable, it is a hypothetical variable.
6. Independent, moderator, and control variables all may affect the depen-
dent variable, presumably by first affecting an intervening variable.
7. Researchers must decide how to deal with potentially influential variables
other than the independent variable. No variable that may affect the depen-
dent variable can be ignored. Each must be treated as either a moderator
variable, and hence studied, or a control variable, and hence eliminated.
8. Choosing to treat a variable as a moderator variable is based on (a) theo-
retical considerations (How likely is the variable to affect the independent
variable-dependent variable relationship?), (b) design considerations (Will
the choice allow adequate control?), and (c) practical considerations (Can
the researcher call on sufficient resources and manageable techniques for
accomplishing it?).
■ Recommended Reference
Martin, D. W. (1991). Doing psychology experiments (3rd ed.). Monterey, CA: Brooks/
Cole.
= CHAPTER FIVE
Constructing Hypotheses
and Meta-Analyses
OBJECTIVES
■ Formulating Hypotheses
What Is a Hypothesis?
The next step in the research process after selecting a problem and identifying
variables is to state a hypothesis (or hypotheses). A hypothesis, a suggested
answer to the selected problem, has the following characteristics:
Thus, hypotheses that might have been derived from the problem state-
ments listed on pages 24 and 25 are:
Hypotheses are often confused with observations. These terms refer, however,
to quite different things. Observation refers to what is—that is, to what can be
seen. Thus, researchers may look around in a school and observe that most of
the students are performing above their grade levels.
From that observation, they may then infer that the school is located in
a middle-class neighborhood. Though the researchers do not know that the
neighborhood is middle-class (that is, they have no data on income level), they
expect that most people living there are of moderate means. By making explicit
their expectation that schools of advanced learners are in middle-class neigh-
borhoods, the researchers make a specific hypothesis setting forth an anticipated
relationship between two variables—academic performance and income level.
To test this specific hypothesis, the researchers could walk around the neigh-
borhood, observe the homes, and ask the residents to reveal their income levels.
From the first statement, the researcher may, for example, deduce the spe-
cific hypothesis that people spend less time doing whatever they do well because
they achieve efficiency at that activity. From the second general statement, the
researcher may instead deduce that people spend more time doing what they do
well because they enjoy doing it. The specific hypothesis deduced depends on
the more general assumptions or theoretical position from which the researcher
begins.
In induction, in contrast, the researcher starts with specific observations
and combines them to produce a more general statement of a relationship,
namely a hypothesis. Many researchers begin by searching the literature for
relevant specific findings from which to induce hypotheses (a process consid-
ered in detail in Chapter 3). Others run exploratory studies before attempting
to induce hypothetical statements about the relationships between the vari-
ables in question. One example of induction began with research findings that
obese people eat as much immediately after meals as they do some hours after
meals, that they eat much less unappealing food than appealing food, and that
they eat when they think it’s time for dinner even if little time has elapsed since
eating last. These observations led a researcher to induce that for obese people,
hunger is controlled externally rather than internally, as it is for people of nor-
mal weight.
Induction begins with data and observations (empirical events) and pro-
ceeds toward hypotheses and theories; deduction begins with theories and
general hypotheses and proceeds toward specific hypotheses (or anticipated
observations).
From any problem statement, it is generally possible to derive more than one
hypothesis. As an example, consider a study based on the problem statement:
What is the combined effect of student personality and instructional procedure
on the amount of learning achieved? Three possible hypotheses that can be
generated from this statement are:
Both induction and deduction are needed to choose among these possi-
bilities. Many theories, both psychological and educational ones, deal with
the relationship between student personality and the effectiveness of differ-
ent teaching techniques. The match-mismatch model described by Tuckman
(1992a) suggests that when teaching approaches are consistent with students’
personalities, students learn more from the experiences. Because a student
most comfortable with concrete learning prefers structure, and one most com-
fortable with abstract learning prefers ambiguity, the logical deduction is that
Hypothesis 1 is the most “appropriate” expectation of the three. Moreover,
observation tends to confirm a strong relationship between what students like
and their personalities. In other words, empirical evidence provides support
for the induction that Hypothesis 1 is the most appropriate choice.
Consider a study based on a problem to determine the effect of group-con-
tingent rewards in modifying aggressive classroom behaviors. (Group-contin-
gent rewards result from an arrangement that aggressive action by one group
member causes rewards to be withheld from all members.) At first glance, these
three hypotheses might be offered:
(concrete) level to the conceptual (abstract) level. This movement to the con-
ceptual level allows generalization of the results of research beyond the specific
conditions of a particular study, giving the research wider applicability.
Research requires the ability to move from the operational to the concep-
tual level and vice versa. This ability influences not only the process of con-
structing experiments but also that of applying their findings.
Consider the following hypothetical study. The staff development depart-
ment of a large school district has decided to run three in-service workshops
for all the schools in the district. The purpose of the workshops is to help
teachers and administrators work together in establishing priorities and pro-
grams for helping inner-city students develop communication and problem-
solving skills.
Label these Workshops A, B, and C. At first glance, the research problem
might seem to compare the relative success of each workshop in helping par-
ticipants to plan programs. However, the researchers may set more ambitious
goals than merely concluding that one workshop was more successful than the
others; they may want to determine how the workshops differed in order to
discover what characteristics of one led to its superior effectiveness.
Two dimensions or concepts were identified to classify these workshops.
The first was the concept of structure, that is, predesignated specification of
what was to happen, when, and for what purpose.
The second concept dealt with the task orientation of the workshops,
that is, what kinds of problems they addressed. The researchers distinguished
between cognitive problems (those dealing with thinking and problem-solv-
ing) and affective problems (those dealing with feelings and attitudes).
The theory of mastery learning (Bloom, 1976) states that if learners possess the
necessary cognitive and affective entry behaviors for a new learning task, and
if the quality of instruction is adequate, then they should all learn the task. The
theory can be diagrammed as follows:
In fact, a number of studies have tested these hypotheses (e.g., Slavin &
Karweit, 1984; see pages 351–368; 723–736).
Thus, with its multiple components and theoretical linkages and connec-
tions, a theory is a bountiful source of hypotheses for researchers to test. The
collective result of their inquiries serves as a test of the validity of the the-
ory. Theory, therefore, represents a good source from which hypotheses can
be derived. Its benefits come from helping to ensure a reasonable basis for
hypotheses and because tests of the hypotheses confirm, elaborate, or discon-
firm the theory.
Incorrect During Instruction
    Feedback        .75   .56   .62   .49   .51   .53   .57
    No Feedback     .25   .41   .17   .21   .42   .13   .20
    Significance*   ++ ++ + ++ ++ + ++ ++ + ++
Key:
* Statistical significance of the comparison of feedback and no-feedback probabilities
++ Statistically significant and positive
+ Nonsignificant and positive
0 No difference
? Nonsignificant with no reported direction
- Nonsignificant and negative
The same authors also report effect size values from eight studies comparing
different types of feedback, as shown in Table 5.2. When subjects are informed
only of whether their answers are right or wrong, effect sizes (the first row of
numbers) are small and mostly negative, averaging –0.08. This figure reflects an
average mean difference only 8 percent the size of the standard deviation. Since
Cohen (1988) considers 0.2 to be a small effect size, 0.5 a moderate one, and 0.8
a large one, the negative average indicates essentially no effect. However, when
subjects are guided to or given the correct answers when they give the wrong
responses, effect sizes (the second row of numbers) are larger and positive, aver-
aging 0.31. Adding explanations to the feedback produces varying effect sizes
(the last row of figures), ranging from almost no effect (0.05) to a substantial one
(1.24).
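For readers unfamiliar with how such figures are obtained, the following minimal sketch (written in Python, with hypothetical scores rather than data from the studies summarized in Table 5.2) computes a standardized mean difference, Cohen's d, for each of three small studies and then averages the values in the way meta-analysts typically report them:

from statistics import mean, stdev

def cohens_d(treatment, control):
    # Standardized mean difference: (treatment mean - control mean) / pooled SD
    n_t, n_c = len(treatment), len(control)
    s_t, s_c = stdev(treatment), stdev(control)
    pooled_sd = (((n_t - 1) * s_t ** 2 + (n_c - 1) * s_c ** 2) / (n_t + n_c - 2)) ** 0.5
    return (mean(treatment) - mean(control)) / pooled_sd

# Hypothetical posttest scores from three small feedback studies
studies = [
    ([14, 16, 15, 18, 17], [12, 13, 14, 12, 15]),
    ([22, 25, 21, 24, 23], [20, 22, 21, 19, 23]),
    ([8, 11, 9, 12, 10], [9, 8, 10, 7, 9]),
]

effect_sizes = [cohens_d(feedback, no_feedback) for feedback, no_feedback in studies]
print("effect size (d) for each study:", [round(d, 2) for d in effect_sizes])
print("average effect size:", round(mean(effect_sizes), 2))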
Based on these and other results, the authors present the following five-
stage model of learning, which suggests the varying role of feedback: (1) ini-
tial state— including prior experience and pretests, (2) activation of search and
retrieval strategies, (3) construction of response, (4) evaluation of response, and
(5) adjustment of cognitive state. The most positive impact of feedback has
appeared in the fourth stage, when it affects the evaluation of the correctness
of a response and guides changes to it, if necessary. However, the model itself
can be used as a source of research hypotheses about the possible impact of
feedback at different points in the learning process. For example, studies might
test two hypotheses:
• The effect of feedback is greater for relatively complex content than for
comparatively simple content.
• The effect of feedback is greater when students receive relatively few cues,
organizers, and other instructional supports than when they receive exten-
sive supports.
The authors found positive but modest effects of moral education, and
their discussion suggests several testable hypotheses for further study, such as:
When all types of reward are aggregated, overall, the results indicate that
reward does not negatively affect intrinsic motivation on any of the four
measures. . . . When rewards are subdivided into reward type . . . , reward
expectancy . . . , and reward contingency, the findings demonstrate that
people who receive a verbal reward spend more time on a task once the
reward is withdrawn; they also show more interest and enjoyment than
nonrewarded persons.
1. Starting with the answer. Meta-analysis often begins with the intention to
support or discredit a theoretical position rather than merely to discover.
This creates the possibility of bias (both in the meta-analysts and in the
responses of critics).
2. Selecting a straw man. If a researcher begins with a theoretical bias, then
the opposing bias becomes the straw man to be knocked down by the
meta-analysis.
3. Averaging across competing effects or different groups. This is a major prob-
lem in meta-analysis, which often averages results of studies that may not be
comparable, overstating some effects and understating others. Averaging, the
core procedure in meta-analysis, can obscure contingent relationships between
variables, depending on which studies are averaged. For example, if tangible
rewards diminish intrinsic motivation while verbal rewards enhance it, then
averaging across studies that evaluate these two conditions may give the
appearance that rewards exert neither helpful nor harmful effects, as the
sketch following this list illustrates.
4. Selectively interpreting results. By reporting averages of results, and even
averages of averages, rather than reporting more detailed results, meta-
analysis can obscure important differences.
5. Falling prey to the quality problem. Meta-analysts sometimes give equal
weight to methodologically sound studies and those with methodological
problems.
6. Falling prey to the quantity problem. Meta-analysts sometimes lump stud-
ies of unique variations together with those of more common variations,
because the researchers cannot find enough of the former to form a group
or cluster. This choice obscures the effects of unique variations and the
possibility of using these results to refine the interpretations of more com-
mon variations.
7. Combining confounding variables. If variables are correlated or tend to
occur together, then they cannot be separated. Such a situation prevents
grouping treatments of specific variables in multiple studies.
8. Disregarding psychological “effect size.” Meta-analysis should weight
results according to the difficulty of obtaining them. For example, results
obtained by observing real behavior under naturally occurring circum-
stances should carry more weight than those where the subjects merely
report to the experimenter what they think they would do; the former
more accurately portray real psychological processes than the latter do.
(Of course, applying the weights would require that the meta-analyst make
judgments, thus increasing the likelihood of bias.)
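The sketch below (in Python, using hypothetical effect sizes) illustrates the averaging problem described in point 3: two bodies of studies with genuine but opposite effects can produce a pooled average close to zero.

from statistics import mean

# Hypothetical effect sizes (d): tangible-reward studies lower intrinsic
# motivation, verbal-reward studies raise it.
tangible_reward = [-0.35, -0.42, -0.28, -0.31]
verbal_reward = [0.30, 0.38, 0.33, 0.29]

print("tangible rewards, average d:", round(mean(tangible_reward), 2))   # clearly negative
print("verbal rewards, average d:", round(mean(verbal_reward), 2))       # clearly positive
print("all studies pooled, average d:", round(mean(tangible_reward + verbal_reward), 2))  # near zero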
In its simplest form, a hypothesis for research in the classroom can be stated
as follows:
Stated differently:
The first format included two variables; the second and third included a
third variable, as well.
These formats could accommodate many specific hypotheses for class-
room research. A change to a specific variable also changes the hypothesis;
however, the formats can be used repeatedly by inserting different variables.
■ Testing a Hypothesis
be rejected if tests find differences large enough to indicate real effects. That
is, a researcher can conclude that it is untrue that nondirective and direc-
tive teachers instruct students with equal effectiveness if one group is shown
clearly to teach more effectively than the other does. Those results would
not, however, justify a conclusion affirming the directional hypothesis that
nondirective teachers instruct students more effectively than do directive
teachers, because variables other than the characteristics of the teachers may
have contributed to the observed outcomes. Although the test allows the
researcher to reject the null hypothesis and conclude that the effectiveness of
the two groups of teachers is not equal, one should not then conclude that a
specified hypothesis is absolutely true or false; different kinds of errors may
still lead to acceptance of hypotheses that are false or to rejection of hypoth-
eses that are true.
Researchers can evaluate a hypothesis without stating it in null form; for
ease of discussion and understanding, they may prefer to state it in directional
form. However, for purposes of statistical testing and interpretation, they
always evaluate the null hypothesis.
In addition to the null hypothesis (A1 = A2) and each of the two possible
directional hypotheses (A1 > A2, A2 > A1), researchers must also acknowledge
what might be called a positive hypothesis (A1 ≠ A2). This position states, unlike
the null hypothesis, that the treatment levels will vary in their effects, but it dif-
fers from the directional hypotheses by not stating which treatment level will
produce the greatest effect. In other words, it is a nondirectional hypothesis. As
such, it adds little to a research document. Directional hypotheses are preferred
because they go beyond merely saying that “something will happen”; they say
exactly what will happen. This specificity helps researchers to determine whether a
study provides a basis for accepting or rejecting expectations of differences.
Offering directional hypotheses helps to give a rationale and a focus to
a study, although statistical tests actually evaluate null hypotheses. Hypoth-
esizing a difference without specifying its direction, as in a so-called positive
hypothesis, adds little or nothing to the process.
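As a concrete illustration of how the null hypothesis is the one actually evaluated, the following minimal sketch (in Python, with hypothetical achievement scores for students of nondirective and directive teachers) uses a simple permutation test and reports both the nondirectional and the directional probabilities:

import random
from statistics import mean

random.seed(1)

a1 = [78, 85, 80, 90, 74, 88, 82]   # hypothetical scores, students of nondirective teachers
a2 = [72, 79, 75, 83, 70, 81, 77]   # hypothetical scores, students of directive teachers

observed = mean(a1) - mean(a2)       # the difference the study actually observed
pooled = a1 + a2

count_two_sided = 0   # differences at least this large in absolute value (A1 not equal to A2)
count_one_sided = 0   # differences at least this large in the predicted direction (A1 > A2)
reps = 10000
for _ in range(reps):
    random.shuffle(pooled)            # reassign scores to groups as if the null were true
    diff = mean(pooled[:len(a1)]) - mean(pooled[len(a1):])
    if abs(diff) >= abs(observed):
        count_two_sided += 1
    if diff >= observed:
        count_one_sided += 1

print(f"observed difference in means: {observed:.2f}")
print(f"two-sided p (evaluates the null A1 = A2): {count_two_sided / reps:.3f}")
print(f"one-sided p (directional prediction A1 > A2): {count_one_sided / reps:.3f}")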
This chapter’s discussion of hypotheses has featured words and phrases
such as effective and structured instructional procedures. As you may have rec-
ognized, words and phrases such as these do not lend themselves to experimen-
tal testing. A hypothesis, even a null hypothesis, is not directly testable in the
form in which it is generally stated. Its very generality, a distinguishing char-
acteristic, limits its direct testability. To become testable, therefore, researchers
must transform it into a more specific or operationalized statement. A hypoth-
esis is operationalized (made testable) by providing operational (testable) defi-
nitions for its terms (variables). But before variables can be defined, they must
be labeled. Approaches to this task are the subject of the following chapter.
■ Summary
■ Recommended References
Hunter, J. E., Schmidt, F. L., & Jackson, G. B. (1982). Meta-analysis: Cumulating
research findings across studies. Beverly Hills, CA: Sage Publications.
Snow, R. E. (1973). Theory construction for research on teaching. In R. M. W. Travers
(Ed.), Second handbook of research on teaching (pp. 77–112). Chicago, IL: Rand
McNally.
= CHAPTER SIX
Constructing Operational
Definitions of Variables
OBJECTIVES
measurement of their speeds; if they exceed 15 miles per hour, the driver receives
a speeding ticket.
In contrast, the driver tries to use a different operational definition of
speeding in a school zone: In a marked school zone, a car is speeding only if the
speed exceeds 15 miles per hour when children are near or on the street. Accord-
ing to the driver, a car is speeding in a school zone if (1) its speed exceeds 15
miles per hour, and (2) children are near or on the street. The driver believes
that the speed of the car is important only when children are present.
Another operational definition, but an impractical one, might define speed-
ing in a school zone on the basis of outcome after the fact: If a car going at any
speed in a school zone hits a child and hurts him or her, then the car was speed-
ing. Thus, if a child hit by a car is injured, the car was speeding, but if the child
gets up and walks away uninjured, the car was not speeding, even though it ran
into a child. For obvious reasons, this operational definition does not provide
a useful criterion for judgments of speeding in a school zone.
Consider an illustration nearer to the subject at hand. Suppose that you
are the school principal and a teacher asks you to remove a youngster from the
class due to aggressiveness. Suppose also that you respond by indicating that
you like aggressive learners and you feel that aggressiveness (that is, active chal-
lenging of the relevance of instructional experiences) is a useful quality to bring
to the learning situation. The teacher responds by saying that aggressiveness
means being “filled with hate and potentially violent.”
These illustrations suggest some conclusions:
language so that any reader from any background understands exactly what is
being said and in sufficient detail to allow replication of the research.
they constitute an adequate definition for scientific purposes. Often, more than
one operational definition can be constructed for a single variable, but each must
be sufficiently operational to meet the criterion of exclusiveness, as discussed in
a later section.
Review a few additional examples of manipulation-based operational
definitions:
self-reports from their subjects; the subjects might, for example, fill out ques-
tionnaires or attitude scales to report their own thoughts, perceptions, and
emotions. Thus, one static-property operational definition of course satisfac-
tion might be the perception—as reported by subjects on questionnaires—that
a course has been an interesting and effective learning experience. In contrast, a
dynamic-property operational definition of course satisfaction would be based
on observable behaviors, such as recommending the course to friends, enroll-
ing in related courses, or enrolling in other courses taught by the same teacher.
Note that static-property operational definitions describe the qualities,
traits, or characteristics of people or things. Thus, researchers can construct
them to define any type of variable, including independent, dependent, and
moderator variables (those not manipulated by the researcher). When such a
definition specifies a person’s characteristic, it cites a static or internal quality
rather than a behavior like that specified by a dynamic-property definition.
Static-property operational definitions often lend themselves to measure-
ment by tests, although feasibility of testing is not a requisite part of the
definition. However, operational definitions are statements of observable
properties—traits, appearances, behaviors—and statements of such proper-
ties are prerequisites to measuring them. For people, a static-property defi-
nition is measured based on data collected directly from the subjects of the
study, representing their self-descriptions of inner states or performances.
To clarify, consider a few additional examples of static-property opera-
tional definitions:
Examples
• Expressive discourse: Writing in which the writer is asked to tell the reader
how the writer feels about or perceives something.
• Explanatory discourse: Writing in which the writer is asked to present
actual information about something to the reader.
• Persuasive discourse: Writing in which the writer is asked to take and sup-
port a position and attempt to convince the reader to agree with it.
Within the process of testing a hypothesis, the researcher must move repeat-
edly from the hypothetical to the concrete and back. To get the maximum value
from data, he or she must make generalizations that apply to situations other
than the experiment itself. Thus, the researcher often begins at the conceptual
or hypothetical level to develop hypotheses that articulate possible linkages
Structure
• Formal planning and structuring of the course
• Minimizing informal work and group work
• Structuring group activity when it is used
• Rigidly structuring individual and classroom activity
• Requiring factual knowledge from students based on absolute sources
Interpersonal
• Enforcing absolute and justifiable punishment
• Minimizing opportunities to make and learn from mistakes
• Maintaining a formal classroom atmosphere
• Maintaining formal relationships with students
• Taking absolute responsibility for grades
and operationalizing have been combined and recombined in the total research
process.
Testability
Predictions
• Prediction: Students who see the school as a place they enjoy and where
they like to be will be less frequently cited for fighting or talking back to a
teacher than those who see the school as a place they do not enjoy.
• Hypothesis: Programs offering stipends are more successful at retaining
students than are programs without such payments.
• Prediction: The dropout rate among adults enrolled in training and retrain-
ing programs will be smaller in programs that pay stipends to students who
attend than in comparable programs that do not pay stipends.
• Hypothesis: Performance in paired-associate tasks is inversely related to
socioeconomic status.
• Prediction: Students whose parents earn more than $50,000 a year will
require fewer trials to perfectly learn a paired-associate task than will stu-
dents whose parents earn less than $20,000 a year.
• Hypothesis: In deciding on curricular innovations, authoritarian school
superintendents will be less inclined to respond to rational pressures and
more inclined to respond to expedience pressures than will nonauthoritar-
ian school superintendents.
• Prediction: In judging curricular innovations, school superintendents who
react to the world in terms of superordinate-subordinate role distinctions
and power and toughness will less frequently acknowledge the opinions
of subject matter experts as sources of influence and more frequently
acknowledge input from their superiors than will superintendents who do
not react to the world in these terms.
The researcher has now developed operational definitions of the variables and
restated the hypotheses in the operational form called predictions. He or she is
now ready to conduct a study to test these predictions and thus the hypotheses.
The next step requires a decision about how to control and/or manipulate the
variables through a research design.
The schematic of the research process in Figure 6.1 places the steps and
procedures already described in perspective relative to those covered in the
remainder of this book. It also outlines the sequence of activities in research
that form the basis for this book.
Note that research begins with a problem and applies both theories and
findings of other studies, located in a thorough literature search, to choose
variables that must be labeled (these points were covered in Chapters 2, 3, and
4) and construct hypotheses, as described in Chapter 5. These hypotheses con-
tain variables that must be then operationally defined, as described in this
chapter, to construct predictions. These steps might be considered the logical
■ Summary
teachers are those who move around a lot and talk loud and fast to
students.
4. The technique of static-property operational definition specifies a variable
according to how people describe themselves based on self-reporting. For
example, self-confidence is confirmed in your statement that you believe
you will succeed.
5. Operational definitions can be evaluated based on exclusiveness, that is,
the uniqueness of the variables that they define. An operational definition
that simultaneously fits a number of variables lacks exclusiveness.
6. Although researchers begin at the conceptual level with broadly defined
variables and hypotheses, to study those variables and test those hypoth-
eses, they must operationalize them. Formulating operational definitions
is an activity that occurs between conceptualizing a study and developing
the methodology to carry it out. Operational definitions help researchers
to make hypotheses into testable predictions.
7. A prediction (previously called a specific hypothesis) is a hypothesis in which
the conceptual names of the variables have been replaced by their opera-
tional definitions. Predictions are then tested by the methods designed for
research studies.
8. The research spectrum treats predictions as the bridge between the logical
and conceptual stage described in previous chapters and the methodologi-
cal stage described in subsequent chapters.
■ Recommended Reference
Martin, D. W. (1991). Doing psychology experiments (3rd ed.). Monterey, CA: Brooks/
Cole.
PART 3
TYPES OF RESEARCH
=
= CHAPTER SEVEN
Applying Design Criteria: Internal and External Validity
OBJECTIVES
To ensure internal validity or certainty for a study, the researcher must estab-
lish experimental controls to support the conclusion that differences occur as a
result of the experimental treatment. In an experiment lacking internal validity,
the researcher does not know whether the experimental treatment or uncon-
trolled factors produced the difference between groups. Campbell and Stanley
(1966) identified classes of extraneous variables that can be sources of internal
bias if not controlled.
This section reviews such factors, dividing them into three groups: (1) expe-
rience bias—based on what occurs within a research study as it progresses; (2)
participant bias—based on the characteristics of the people on whom the study
is conducted; (3) instrumentation bias—based on the way the data are collected.
History
In research, the term history refers to events occurring in the environment at
the same time that a study tests the experimental variable. If a study tests a
specific curriculum on a group of students who are simultaneously experienc-
ing high stress due to an external event, then the measured outcomes of the
experimental test may not reflect the effects of the experimental curriculum but
rather those of the external, historical event. Researchers prevent limitations on
internal validity due to history by comparing results for an experimental group
to those for a control group with the same external or historical experiences
during the course of the experiment.
Testing
Invalidity due to testing results when experience of a pretest affects subsequent
posttest performance. Many experiments apply pretests to subjects to deter-
mine their initial states with regard to variables of interest. The experience of
taking such a pretest may increase the likelihood that the subjects will improve
their performance on the subsequent posttest, particularly when it is identical
to the pretest. The posttest, then, may not measure simply the effect of the
experimental treatment. Indeed, its results may reflect the pretest experience
more than the experimental treatment experience itself (or, in the case of the
control group, the absence of treatment experience). A pretest can also blur
differences between experimental and control groups by providing the control
group with an experience relevant to the posttest.
Researchers often seek to avoid testing problems by advocating unobtru-
sive measures—measurement techniques that do not require acceptance or
awareness by the experimental subjects. In this way, they hope to minimize
the possibility that testing will jeopardize internal validity. (If subjects do not
directly provide data by voluntarily responding to a test, they do not experi-
ence a test exposure that could benefit their performance.)
In more traditional experimental designs, the problem of testing can be
avoided simply by avoiding pretests. (In fact, they are often unnecessary steps.)
The next chapter presents research designs that avoid pretests. Apart from
introducing possible bias, pretests are expensive and time-consuming activities.
Expectancy
A treatment may appear to increase learning effectiveness as compared to that
of a control or comparison group, not because it really boosts effectiveness, but
because either the experimenter or the subjects believe that it does and behave
according to this expectation. When a researcher is in a position to influence
the outcome of an experiment, albeit unconsciously, he or she may behave in
a way that improves the performance of one group and not the other, which can
alter the results. So-called “smart” rats were found to outperform “dumb” ones
when experimenters believed that the labels reflected genuine differences. Such
experimenter bias has been well-documented by Rosenthal (1985).
Subjects may also form expectations about treatment outcomes. Referred
to by some as demand characteristics, these self-imposed demands for perfor-
mance by subjects, particularly by those experiencing an experimental condi-
tion, result from a respect for authority and a high regard for science. Motivated
by these feelings, the subjects attempt to comply with their own expectations
of appropriate results for the experiment. Expectancy effects can be controlled
by use of the double-blind techniques described earlier in the chapter.
Selection
Many studies attempt to compare the effects of different experiences or treat-
ments on different groups of individuals. Bias may result if the group experi-
encing one treatment includes members who are brighter, more receptive, or
older than the group receiving either no treatment or some other treatment.
Results for the first group may change, not because of the treatment itself, but
because the group selected to receive the treatment differs from the other in
some way. The personal reactions and behaviors of individuals within a group
can influence research results. In other words, “people factors” can introduce
a bias.
Random assignment minimizes the problems of selection by ensuring that
any person in the subject pool has an equal probability of becoming a member
of either the experimental group or the control group. Because experimental or
control subjects assigned randomly should not differ in general characteristics,
any treatment effects in the study should not result from the special charac-
teristics of a particular group. In research designs that call for selection as a
variable (for example, intelligence) under manipulation, subjects are separated
systematically into different groups (for example, high and low) based on some
individual difference measure, thus providing for control.
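A minimal sketch of random assignment follows (in Python; the subject pool is hypothetical). Every member of the pool has the same chance of landing in the experimental or the control group, so preexisting differences should not accumulate systematically in either one.

import random

random.seed(42)                                      # fixed seed so the example is reproducible
pool = [f"student_{i:02d}" for i in range(1, 31)]    # hypothetical subject pool of 30

random.shuffle(pool)
experimental = pool[:15]
control = pool[15:]

print("experimental group:", experimental[:5], "...")
print("control group:     ", control[:5], "...")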
Obviously, if a researcher fails to control selection bias, he or she cannot
rule out the possibility that the outcome of the study reflects initial differences
between groups rather than the treatment being evaluated. Detailed procedures for min-
imizing selection bias are described in a later section of this chapter on equating
experimental and control conditions.
Maturation
Maturation refers to the processes of change that take place within subjects
during the course of an experiment. Experiments that extend for long periods
Statistical Regression
When group members are chosen on the basis of extreme scores on a particular
variable, problems of statistical regression occur. Say, for instance, that a group
of students takes an IQ test, and only the highest third and the lowest third are
selected for the experiment, eliminating the middle third. Statistical processes
would create a tendency for the scores on any posttest measurement of the
high-IQ students to decrease toward the mean, while the scores of the low-
IQ students would increase toward the mean. Thus, the groups would differ
less in their posttest results, even without experiencing any experimental treat-
ment. This effect occurs because chance factors are more likely to contribute to
extreme scores than to average scores, and such factors are unlikely to reappear
during a second testing (or in testing on a different measure). The problem is
controlled by avoiding the exclusive selection of extreme scorers and including
average scorers.
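The simulation sketched below (in Python, with invented score distributions) shows the regression effect directly: when only the highest and lowest thirds on a first test are retained, their averages on a second test drift back toward the overall mean even though no treatment has been applied.

import random
from statistics import mean

random.seed(0)
true_level = [random.gauss(100, 10) for _ in range(1000)]     # stable individual levels
test1 = [t + random.gauss(0, 8) for t in true_level]           # observed score = level + chance error
test2 = [t + random.gauss(0, 8) for t in true_level]           # a second, independent testing

scored = sorted(zip(test1, test2))                              # order everyone by the test 1 score
low_third = scored[:333]
high_third = scored[-333:]

print("low third:  test 1 mean", round(mean(t1 for t1, _ in low_third), 1),
      "-> test 2 mean", round(mean(t2 for _, t2 in low_third), 1))    # rises toward 100
print("high third: test 1 mean", round(mean(t1 for t1, _ in high_third), 1),
      "-> test 2 mean", round(mean(t2 for _, t2 in high_third), 1))   # falls toward 100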
Experimental Mortality
Researchers in any study should strive to obtain posttest data from all subjects
originally included in the study. Otherwise bias may result if subjects who
withdraw from the study differ from those who remain. Such differences rel-
evant to the dependent variable introduce posttest bias (or internal invalidity
based on mortality). This bias also occurs when a study evaluates more than
one condition, and subjects are lost differentially from the groups experiencing
the different conditions.
As an example, consider a study to follow up and compare graduates of
two different educational programs. The researchers may fail to reach some
members of each group, for example, those who have joined the armed forces.
Moreover, one of the two groups may have lost more members than the other.
The original samples may now be biased by the selective loss of some indi-
viduals. Because the groups have not lost equally, the losses may not be ran-
dom results; rather, they may reflect some bias in the group or program. If the
purpose of the follow-up study were to assess attitudes toward authority, for
example, graduates who had joined the armed services would differ systemati-
cally from other graduates on this variable. Failure to obtain data from these
individuals, then, would bias the outcome and limit its effectiveness in assess-
ing the attitudes produced by the educational program. Data from a represen-
tative sample of the graduates might support conclusions quite different from
those indicated by the more limited input.
To avoid problems created by experimental mortality, researchers often
must choose reasonably large groups, take steps to ensure their representative-
ness, and attempt to follow up subjects who leave their studies or for whom
they lack initial results.
Of course, factors that affect validity may occur in combination. For example,
one study might suffer from invalidity due to a selection-maturation interac-
tion. Failure to equate experimental and control groups on age might create
problems both of selection and maturation bias, because children at some ages
mature or change more rapidly than do children at other ages. Moreover, the
nature of the changes experienced at one age might be more systematically
related to the experimental treatment than the changes experienced at another
age. Thus, two sources of invalidity can combine to restrict the overall validity
of the experiment.
Instrumentation Bias
Instrumentation is the measurement or observation procedures used during
an experiment. Such procedures typically include tests, mechanical measur-
ing instruments, and judgment by observers or scorers. Although mechanical
measuring instruments seldom undergo changes during the course of a study,
observers and scorers may well change their manner of collecting and record-
ing data as the study proceeds. Because interviewers tend to gain proficiency
(or perhaps become bored) as a study proceeds, they may inadvertently pro-
vide different cues to interviewees, take different amounts and kinds of notes,
or even score or code protocols differently, thus introducing instrumentation
bias into the results.
A related threat to validity results if the observers, scorers, or interviewers
become aware of the purpose of the study. Consciously or unconsciously, they
may attempt to increase the likelihood that results will support the desired
hypotheses. Both the measuring instruments of a study and the data collec-
tors should remain constant across time as well as constant across groups (or
conditions).
1. Establish the reliability or consistency of test scores over items and over
time, thus showing that a test consistently measures some variable.
2. Use the same measure for both pretest and posttest, or use alternate forms
of the same measure.
3. Establish the validity of a test measure, thus showing that it evaluates what
you intended to measure.
4. Establish a relative scoring system for a test (norms) so that scores may be
adapted to a common scale.
5. Gather input from more than one judge or observer, keep the same judges
or observers throughout the course of study, and compare their judgments
to establish an index of interjudge agreement.
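As a sketch of the last precaution in the list, the following example (in Python, with hypothetical ratings) computes a simple percent-agreement index for two judges who coded the same observations; more refined indexes, such as those that correct for chance agreement, build on the same comparison.

judge_a = ["on-task", "off-task", "on-task", "on-task", "off-task", "on-task", "on-task", "off-task"]
judge_b = ["on-task", "off-task", "on-task", "off-task", "off-task", "on-task", "on-task", "on-task"]

# Count the observations on which the two judges gave the same code.
agreements = sum(a == b for a, b in zip(judge_a, judge_b))
percent_agreement = agreements / len(judge_a)
print(f"judges agreed on {agreements} of {len(judge_a)} observations ({percent_agreement:.0%})")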
If the samples drawn for a study are not representative of the larger popu-
lation, a researcher may encounter difficulty generalizing findings from their
results. For instance, an experiment run with students in one part of the coun-
try might not yield results valid for students in another part of the country;
a study run with urban dwellers as subjects might not apply to rural dwell-
ers, if some unique characteristic of the urban population contributes to the
effects found by the experiment. Thus, the desire to maintain external validity
demands samples representative of the broadest possible population. The tech-
niques for accomplishing this are described in Chapter 11.
1. The work of Welch and Walberg (1970) tends to indicate that pretesting may threaten
validity less seriously than previously assumed. The point bears reemphasizing, however:
Pretesting is a costly and time-consuming process, and researchers may prefer to avoid it in
certain situations (see Chapter 8).
the experimenters, “Weren’t you frightened?” the subject calmly replied, “Oh,
no. It was only an experiment.”
Often a curriculum produces results on an experimental basis that differ
from those in general application because of the Hawthorne effect. This effect
was discovered and labeled by Mayo, Roethlisberger, and Dickson during per-
formance studies at the Hawthorne works of the Western Electric Company
in Chicago, Illinois, during the late 1920s (see Brown, 1954). The research-
ers wanted to determine the effects of changes in the physical characteristics
of the work environment as well as in incentive rates and rest periods. They
discovered, however, that production increased regardless of the conditions
imposed, leading them to conclude that the workers were reacting to their role
in the experiment and the importance placed on them by management. The
term Hawthorne effect thus refers to performance increments prompted by
mere inclusion in an experiment. This effect may lead participants, pleased by
having been singled out to participate in an experimental project, to react more
strongly to the pleasure of participation than to the treatment itself. However,
the tested conditions often yield very different results when tried on a nonex-
perimental basis.
Multiple-Treatment Interference
Randomization
Matched-Pair Technique
Matched-Group Technique
A similar but less extensive matching procedure calls for assigning individuals
to groups in a way that gives equal mean scores for the groups on the critical
selection variables. Thus, two groups might be composed to give the same mean
age of 11.5, or between 11 and 12. Individuals might not form equivalent pairs
across groups, but the groups on the average would be equivalent to one another.
Groups can also be matched according to their average scores on a pretest mea-
sure of the dependent variable; this technique guarantees average equivalence of
the groups at the start of the experiment.
Often, researchers must complete statistical comparisons to ensure that
they have produced adequately matched groups. Note, however, that this tech-
nique, like the previous one, may lead to regression effects, as described earlier
in the chapter; except in uncommon circumstances, experimenters should avoid
it in favor of random assignment. (It is, however, appropriate to com-
pare the composition of intact groups after the fact, to determine whether they
match one another.)
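The following minimal sketch (in Python, with hypothetical ages) illustrates the logic of matching: subjects are ordered on the selection variable, each adjacent pair is split between the two groups by a coin flip, and the resulting group means on that variable come out nearly identical.

import random
from statistics import mean

random.seed(7)
ages = [10.6, 11.0, 11.1, 11.3, 11.4, 11.5, 11.6, 11.7, 11.9, 12.0, 12.2, 12.4]

group_1, group_2 = [], []
ordered = sorted(ages)
for i in range(0, len(ordered), 2):
    pair = [ordered[i], ordered[i + 1]]
    random.shuffle(pair)                  # a coin flip decides which member goes to which group
    group_1.append(pair[0])
    group_2.append(pair[1])

print("group 1 mean age:", round(mean(group_1), 2))
print("group 2 mean age:", round(mean(group_2), 2))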
If all subjects serve as members of both the experimental and control groups,
then researchers can usually assume adequate control of selection variables.
However, many situations do not allow this technique, because the experimen-
tal experience will affect performance in the control activity, or vice versa. In
learning and teaching studies, for instance, after completing the experimental
treatment, the subject no longer qualifies as a naive participant; performance on
the control task will reflect the subject’s experience on the experimental task. In
other words, this technique controls adequately for selection bias, but it often
creates insurmountable problems of maturation or history bias. Subjects in the
control and experimental groups may be the same individuals, but the relevant
history of each person, and hence the present level of maturation, differs in
completing the second task because of experience of the first.
In instances where Ss can serve as their own controls, careful researchers
must control for order effects by counterbalancing. Half of the Ss, chosen at
random, should receive the experimental treatment first, while the remainder
first serve as controls. (A later section of this chapter discusses counterbalanc-
ing in more detail.)
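A minimal counterbalancing sketch follows (in Python, with a hypothetical pool of subjects): half of the subjects, chosen at random, receive the experimental treatment before the control task, and the other half receive the two in the reverse order, so that order effects are distributed evenly across conditions.

import random

random.seed(3)
subjects = [f"S{i:02d}" for i in range(1, 17)]   # hypothetical pool of 16 subjects
random.shuffle(subjects)

order_treatment_first = subjects[:8]   # experimental treatment, then control task
order_control_first = subjects[8:]     # control task, then experimental treatment

print("treatment first:", order_treatment_first)
print("control first:  ", order_control_first)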
The population is the entire group that a researcher sets out to study. The sample
is the group of individuals chosen from that population to participate in the study.
Differing only in their experiences of the independent variable, the control and
experimental groups should share as much as possible the same experiences or
history in every other respect. Researchers face serious difficulty ensuring that
the experiences of the two groups will be comparable outside the experimental
setting; realistically, the maximum amount of control they can exercise comes
from simply establishing a control group with members drawn from the same
population as the experimental group. However, within the experiment itself,
control efforts must also target many potentially confounding variables (that
is, sources of internal invalidity due to history). A number of methods of such
control are available.
Method of Removal
Method of Constancy
Experiences other than those resulting from the manipulation of the inde-
pendent variable should remain constant across the experimental and control
groups. If the manipulation includes instructions to subjects, these should be
written in advance and then read to both groups to guarantee constancy across
conditions. Tasks, experiences, or procedures not unique to the treatment
should be identical for experimental and control groups. Experimental settings
should also be the same in both cases. In an experiment that contrasts an expe-
rience with its absence, the researcher must not leave uncontrolled the factors
of time, involvement in the experiment, and exposure to materials. To main-
tain constancy on these factors, the control group should experience an irrel-
evant treatment (rather than none at all) that takes as long as the experimental
treatment and provides the same amount of exposure, thus providing the same
amount of involvement. A design appropriate for controlling the Hawthorne
effect, a risk if experimental Ss are treated and control Ss ignored, is discussed
in Chapter 8.
Researchers encounter difficulty, not in deciding how to provide constancy,
but in determining what experiences require constancy. If they fail to main-
tain constancy on potential confounding variables, their designs lack internal
validity and fail to justify conclusions. Variables such as amount of material
exposure, time spent in the experiment, and attentions from the experimenter
are occasionally overlooked as control variables.
Teacher effect can be controlled by keeping the teacher constant, that is,
by assigning the same teacher to both treatment and control classes. (This
approach does limit the generality of the study, however.)
Method of Counterbalancing
Table 7.2 shows that the study defined four groups and assigned two expe-
riences per group to provide the required controls. Each group experienced
each passage structure once and each passage content once, while each passage
structure was paired with each passage content twice. This method produced
four possible orders of the structure/content combination, and each group
experienced one of these orders.
The experiment could not systematically control some combinations of
experiences, because to do so would have required either too many groups or
too many experiences per group. For instance, unique combinations of struc-
ture and content would ideally require that each group experience each of the
four possible combinations rather than just two. For practical reasons, Taylor
and Samuels limited each group to two of the four possible combinations.
Table 7.3 summarizes procedures for controlling both participant effects and
experience effects as they affect the internal validity (certainty) and external
validity (generality) of a study. To understand applications of these principles,
consider a study of teacher enthusiasm and its effect on student achievement
(McKinney et al., 1983).
The following quotations provide the necessary information:
PRECAUTIONS TO ENSURE:

In dealing with participants:
Certainty: Control all individual differences between groups on IQ, prior achievement, sex, age, etc., by:
1. Random assignment: group/group
2. Matching
3. Establishing equivalence statistically after the fact
Generality: Make the sample as representative as possible of the population from which it is drawn by:
1. Random selection: sample/population
2. Stratified sampling (see Chapter 10)

In dealing with experiences:
Certainty: Control all differences in experiences between groups, other than the independent variable, by:
1. Employing a control group
2. Providing each group with comparable subject matter or tasks
3. Equalizing teacher effects across groups
Generality: Make experimental conditions as representative of real-life experiences as possible by:
1. Remaining as unobtrusive as possible
2. Minimizing the salience of the experiment and experimenter
3. Utilizing a double-blind procedure
The study included six teachers, each one teaching each of the three treat-
ments (high, medium, and low enthusiasm) across three different social studies
topics (cultural diffusion, arable land, and tertiary production), as shown in
Table 7.4 (McKinney et al., 1983, p. 251).
This approach effectively controlled for history bias from sources such
as content of lessons, order of treatment, teacher effect, and time of day. Such
strict controls maximize internal validity, but they also raise other issues, as
noted by the authors:
In the final task, the observers appraised the success of the manipulation,
assuring themselves that teachers indeed manifested high, medium, and low
enthusiasm as required in the design:
Observers were present during each treatment period in the course of the
study to verify that the treatments were followed. They were not told at
which level of enthusiasm the teachers would be teaching, and they were
rotated each day so that each observer rated all of the teachers in the par-
ticular school. (p. 250)
Based on these observer ratings, the researchers then evaluated the success
of the manipulation. As Table 7.5 confirms, it clearly was a success.
■ Summary
of internal invalidity, leaving blank the two chosen as answers for Exercises
1 and 2. (Possible sources of internal invalidity are history, selection, matura-
tion, testing, instrumentation, statistical regression, experimental mortality,
stability, expectancy, and interactive combinations of factors.)
1a: ____________________________________________________________
1b: ____________________________________________________________
1c: ____________________________________________________________
1d: ____________________________________________________________
2a: ____________________________________________________________
2b: ____________________________________________________________
2c: ____________________________________________________________
2d: ____________________________________________________________
■ Recommended References
Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design and analysis
issues for field settings. Chicago, IL: Rand McNally.
Mitchell, M., & Jolley, J. (1992). Research design explained (2nd ed.). Fort Worth, TX:
Harcourt Brace.
Rosenthal, R., & Rosnow, R. L. (1969). Artifact in behavioral research. New York, NY:
Academic Press.
= CHAPTER EIGHT
OBJECTIVES
This section reviews the system of symbols used in the chapter to specify
research designs.1
An X designates a treatment (the presence of an experimental manipula-
tion) and a blank space designates a control (the absence of a manipulation).
When treatments are compared, they are labeled X1, X2, and so on. An O
designates an observation or measurement of the dependent variable.
Unfortunately, all too many research studies employ three common research
procedures that do not qualify as legitimate experimental designs, because they
do not control adequately against the sources of internal invalidity. These are
referred to as preexperimental designs, because they are component pieces or
elements of true experimental designs. Because they are inadequate as they
stand, they are also called non-designs. Because students gain from knowledge
of what they should not do as well as what they should do, this section reviews
these unacceptable designs.
2. These subscripts function solely for differentiation; they have no systematic meaning
regarding sequence.
Intact-Group Comparison
X O1
————
O2
A control group that does not receive the treatment (X) acts as a source of com-
parison for the treatment-receiving group, helping to prevent bias from effects
such as history and, to a lesser extent, maturation. Validity increases, because
some coincidental event that affects the outcome will as likely affect O2 as O1.
However, the control group Ss and the experimental group Ss are neither
selected nor assigned to groups on a random basis (or on any of the bases
required for the control of selection bias). The dashed line between the groups
indicates that they are intact groups. Moreover, by failing to pretest the Ss, the
researchers lose the ability to confirm the essential equivalence of the control
and experimental group Ss. Thus, this approach is an unacceptable method,
because it controls for neither selection invalidity nor invalidity based on exper-
imental mortality. That is, it gives no information about whether one group
was already higher on O (or some related measure) before the treatment, which
may have caused it to outperform the other group on the posttest.
Although differences between O2 and O1 probably do not result from dif-
ferent histories and rates of maturation during such an experiment, researchers
should not simply assume that the observed outcomes are not based on differ-
ences that the Ss bring with them to the experiment. Because the intact-group
comparison does not satisfactorily control for all sources of invalidity, it is
considered a preexperimental design. By virtue of their shortcomings, this and
the other non-designs do not eliminate potential alternative explanations of
their findings, so they are not acceptable or legitimate experimental designs.
This experimental method offers potentially the most useful true design. It can
be diagrammed as:
R X O1
R O2
The posttest-only control group design provides ideal control over all threats
to validity and all sources of bias. The design utilizes two groups, one that
experiences the treatment while the other does not, so it controls for history
and maturation bias. Random assignment to the experimental or control group
prevents problems of selection and mortality. In addition, this design controls
for a simple testing effect and the interaction between testing and treatment by
giving no pretest to either group.
Data analysis for the posttest-only control group design centers on com-
parisons between the mean for O1 and the mean for O2.
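As a rough illustration (hypothetical posttest scores; the SciPy library is assumed), that comparison might be carried out with an independent-samples t test:

```python
# Minimal sketch: analyzing a posttest-only control group design by
# comparing the posttest means of the treatment (O1) and control (O2) groups.
from statistics import mean
from scipy import stats

o1_treatment = [84, 79, 91, 88, 76, 85, 90, 82]   # hypothetical posttest scores
o2_control   = [78, 74, 80, 71, 77, 69, 75, 73]

t, p = stats.ttest_ind(o1_treatment, o2_control)
print(f"Treatment mean = {mean(o1_treatment):.1f}, Control mean = {mean(o2_control):.1f}")
print(f"t = {t:.2f}, p = {p:.4f}")
```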
Recall from the previous chapter the discussion of a study by McKinney
et al. (1983) on the effects of teacher enthusiasm on student achievement. This
research project illustrates the posttest-only control group design. It examined
the independent variable, teacher enthusiasm, by establishing three levels: high,
medium, and low enthusiasm. Subjects were randomly assigned to treatments,
taught a prescribed unit, and tested to determine their achievement on the unit.
The design would be diagrammed as:
R X1 O1 (high enthusiasm)
R X2 O2 (medium enthusiasm)
R X3 O3 (low enthusiasm)
The purpose of this study was to determine if selected students who were
absent from school and who received calls to their homes from the princi-
pal via a computer message device, would have a better school attendance
record than those students whose homes were not called.
This sentence gives the problem statement of the study. Subjects were ran-
domly assigned to either the to-be-called or not-to-be-called conditions, and a
“posttest” evaluated their attendance after the eighth month of the school year.
The researchers designated a control or comparison group and randomly
assigned Ss to conditions, providing suitable control for internal validity. The
posttest-only control group design may be used where such requirements can
be met.
R O1 X O2
R O3 O4
Two groups are employed in this design: The experimental group receives a
treatment (X) while the control group does not. (Random assignment is used
to place Ss in both groups.) Both groups are given a pretest (O1 and O3) and a
posttest (O2 and O4). The use of a pretest is the only difference between this
design and the previously discussed one.
By subjecting a control group to all the same experiences as the experi-
mental group except the experience of the treatment itself, this design con-
trols for history, maturation, and regression effects. By randomizing Ss
across experimental and control conditions, it controls for both selection and
mortality. This design, therefore, controls many threats to validity or sources
of bias.
However, administration of a pretest does introduce slight design difficul-
ties beyond those encountered in the posttest-only control group design. The
pretest-posttest control group design allows the possibility of a testing effect
(that is, a gain on the posttest due to experience on the pretest); this potential
for bias may reduce the internal validity. Also, the design lacks any control for
the possibility that the pretest will sensitize Ss to the treatment, thus affecting
external validity. In other words, the design does not control for a test-treat-
ment interaction. Moreover, it lacks control for the artificiality of an experi-
ment that may well be established through the use of a pretest.
■ Factorial Designs
R O1 X Y1 O2
R O3 Y1 O4
R O5 X Y2 O6
R O7 Y2 O8
In this example, two groups receive the experimental treatment and two
groups do not. One group receiving the treatment and one group not receiving
the treatment are simultaneously categorized as Y1, while the remaining two
groups, one receiving and one not receiving the treatment, are categorized as
Y2. Thus, if Y1 represented a subgroup receiving oral instruction and Y2 a sub-
group receiving written instruction, only half of each subgroup would receive
the treatment. Moreover, random assignment determines the halves of each
subgroup to experience or not experience the treatment.
It is equally possible to create a factorial design by modifying the posttest-
only control group design, as illustrated in the diagram, again for a two-factor
situation:
R X Y1 O1
R Y1 O2
R X Y2 O3
R Y2 O4
FIGURE 8.1 Factorial Design (4x2x2) for an Instructional Study with Two Intact-Group
Moderator Variables
Y variable seems to moderate the X variable. That is, training produces more
pronounced effects for subjects with high memory spans than for those with
low memory spans. Thus, the two independent variables seem to generate a
conjoint effect as well as separate effects. (Of course, these conclusions would
have to be substantiated by an analysis of variance.) The factorial research
design allows the researcher to identify both simultaneous and separate effects
of independent variables. That is, it allows a researcher to include one or more
moderator variables.
A study may incorporate multiple outcome measures but analyze each one
in a separate evaluation. In such a case, time or trials function, not as modera-
tor variables, but merely as multiple dependent variables. The research design
is repeated for each dependent variable.
Sometimes, however, a study bases a moderator variable on repeated or
multiple measurements of a single dependent variable, such as multiple perfor-
mance trials or an immediate retention test followed by a delayed retention test.
A design should explicitly indicate simultaneous analysis of data from multiple
times or trials, but no common notation has been presented for this purpose. To
represent repeated measurement of a dependent variable (also called a within-
subjects variable), notation should include multiple Os following the representa-
tions of independent and moderator variables. For example, in a study, randomly
assigned subjects might experience either real practice (X1) or imaginary practice
(X2) in shooting foul shots. They would then complete five test trials of 20 shots
each, and the researcher would evaluate the outcome as a moderator variable.
The design would look like this:
R X1 O1 O2 O3 O4 O5
R X2 O6 O7 O8 O9 O10
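The sketch below, using hypothetical foul-shot scores, shows one simple way such repeated-measures data might be laid out (one row of five trial scores per subject, grouped by practice condition) and summarized trial by trial.

```python
# Minimal sketch: one row of five trial scores (shots made out of 20) per
# subject, grouped by practice condition. All scores are hypothetical.
data = {
    "X1_real":      [[12, 13, 15, 16, 17], [10, 12, 13, 15, 16], [11, 13, 14, 16, 18]],
    "X2_imaginary": [[ 9, 10, 11, 11, 12], [ 8,  9, 11, 12, 12], [10, 10, 12, 13, 13]],
}

for condition, subjects in data.items():
    # zip(*subjects) regroups the scores by trial across subjects
    trial_means = [sum(trial) / len(trial) for trial in zip(*subjects)]
    print(condition, ["%.1f" % m for m in trial_means])
```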
■ Quasi-Experimental Designs
to carry experimental control to its reasonable limit within the realities of par-
ticular situations.
Time-Series Design
O1 O2 O3 O4 X O5 O6 O7 O8
Like the time-series design, the equivalent time-samples design suits situations
when only a single group is available for study and the group requires a highly
predetermined pattern of experience with the treatment. The second condition
means that the researcher must expose the group to the treatment in some sys-
tematic way:
X1 O1 X0 O2 X1 O3 X0 O4
This design, too, is a form of time-series. Rather than introducing the treatment
(X1) only a single time, however, the researcher introduces and reintroduces it,
making some other experience (X0 ) available in the absence of the treatment.
The equivalent time-samples design satisfies the requirements of internal
validity, including controlling for historical bias. In this regard, it is superior to
the time-series design, since it further reduces the likelihood that a compelling
extraneous event will occur simultaneously with each presentation of the treat-
ment (X1). Thus, a comparison of the average of O1 and O3 with the average
of O2 and O4 will yield a result unlikely to be invalidated by historical bias.
Moreover, the analysis can be set up to determine order effects, as well:
First Second
Administration Administration
X1 O1 O3
X0 O2 O4
4. This procedure for controlling selection threats to validity was described in Chapter 7.
O1 X0 O2 X1 O3 X0 O4 X1 O5
A third illustration of this design is a study by Taylor and Samuels (1983) that
compared children’s recall of normal and scrambled passages. The same children
read both types of passage, but half, chosen at random, read the normal passage
first and the scrambled passage second, while the other half read the passages in
the reverse order: scrambled first and normal second. (Refer back to Table 7.2.)
By reversing the sequence of the experiences, the researchers avoided possible
bias due to order effects. Because the same subjects experienced both treatments,
that is, read both types of passage, the study required two separate passages. If
the same passage content were used twice, students might remember it the sec-
ond time, thereby introducing history bias and reducing internal validity. Sub-
jects read two passages, Passage A (about bird nests) and Passage B (about animal
weapons). Both passages included the same number of words and were evaluated
to be at the same reading level. Half of the subjects, randomly chosen, read a nor-
mal (N) version of Passage A and a scrambled (S) version of Passage B while the
other half reversed this arrangement: scrambled A and normal B. The treatment
therefore required four test booklets, each with two passages as follows: SA/NB,
NA/SB, SB/NA, NB/SA; one-quarter of the subjects received each test booklet.
This precaution allowed researchers to avoid possible threats to internal validity
while using subjects as their own controls.
The equivalent time-samples design shows some weakness in external
validity. If the effect of the treatment differs in continuous application from
its effect when dispersed, the results will not allow generalization beyond the
experimental situation. That is, if the effect of X1 when administered over time
(X1 →) is different from the effect of X1 when introduced and reintroduced (as
it is in an equivalent time-samples design: X1 X0 X1 X0 ), then the results of such
a study would not justify valid conclusions about the continuous effect of X1.
O1 X O2
———————
O3 O4
design, the researcher can compare the intact groups on their pretest scores (O1
versus O3) and on their scores for any control variables that are both appropri-
ate to selection and potentially relevant to the treatment; examples include IQ,
gender, age, and so on.
Note that the pretest is an essential precautionary requirement here as
compared to its optional role in a true experimental design. The recommended
posttest-only control group design includes no pretest. In the pretest-posttest
control group design, the pretest merely establishes a baseline from which to
evaluate changes that occur or incorporates a further safeguard, beyond ran-
dom assignment, to control for selection bias. The nonequivalent control group
design must employ a pretest, however, as its only control for selection bias.
Only through a pretest can such a study demonstrate initial group equivalence
(in the absence of randomized assignment).
Lack of random assignment to groups necessitates some basis for initially
comparing the groups; the pretest provides the basis. Thus, a researcher might
begin working with two intact classes to compare the process approach to
teaching science with the traditional textbook approach. At the outset of the
experiment, she or he would compare the groups primarily on knowledge of
the material they are about to be taught, but also possibly on prior science
achievement, age, and gender to ensure that the groups were equivalent. If the
pretest establishes equivalence, the researcher can continue with diminished
concern about threats to internal validity posed by selection or mortality bias.
If the pretest shows that the groups are not equivalent on relevant measures,
alternative designs must be sought (or special statistical procedures applied).
More often, however, pretesting uncovers (1) bias in group assignment on
irrelevant variables but (2) equivalent pretest means on the study’s dependent
and control variables; such results confirm the nonequivalent control group
design as a reasonable choice. Even so, this design is not as good a choice as the
pretest-posttest control group design, though it is greatly superior to the one-
group pretest-posttest design.
In one situation, however, researchers must exercise caution in implement-
ing the nonequivalent control group design: where the experimental group is
self-selected, that is, the participants are volunteers. Comparing a group of vol-
unteers to a group of non-volunteers does not control for selection bias, because
the two groups differ—at least in their volunteering behavior and all the moti-
vations associated with it. Where the researcher can exercise maximum control
over group assignment, he or she should recruit twice as many volunteer sub-
jects as the treatment can accommodate and randomly divide them into two
groups—an experimental group that receives the treatment and a control group
from which treatment is withheld. Effects of the treatment versus its absence
on some dependent variable would then be evaluated using one of the two true
experimental designs. When the volunteers cannot be split into two groups, the
separate-sample pretest-posttest design (described later in this section) should be
used. The nonequivalent control group design is usually inappropriate, however,
where one intact group is composed of volunteers and the other is not. Although
the design is intended for the situations that cause some suspicion of bias in the
assignment of individuals to intact groups, it gives valid results only when this
bias is not relevant to the dependent variable (demonstrated by equivalence of
group pretest results). The bias imposed by comparing volunteers to non-volun-
teers is typically relevant to almost any dependent variable.
A study conducted by a group of three students in a research methods course
illustrates the use of the nonequivalent control group design. The researchers
undertook the study to assess the effects of a program to improve handwriting
among third graders. The school in their study made available two third-grade
classes, but no evidence suggested that the classes had been composed without
bias. The school principal was not willing to allow the researchers to alter the
composition of the classes, and they could not undertake treatment and con-
trol conditions in the same classroom (that is, the treatment was a class activ-
ity). Consequently, the nonequivalent control group design was employed. A
random choice (flipping a coin) determined which of the two classes would
receive the treatment. A handwriting test to measure the dependent variable
was administered both before and after the subjects experienced either the
treatment or the comparison activity. Moreover, the researchers confirmed
equivalence of the two intact classes on a number of control variables: gen-
der, chronological age, standardized vocabulary test score, standardized read-
ing comprehension test score, involvement in instrumental music lessons, and
presence of physical disorders. Thus, the researchers satisfactorily established
that the possible selection bias due to differences between the groups was not
strongly relevant to the comparisons to be made in the study. The effect of the
treatment was assessed by comparing the gain scores (that is, posttest minus
pretest) of the two groups on the dependent variable. Analysis of covariance is
another commonly used tool in such cases.
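A minimal sketch of the gain-score comparison, using hypothetical handwriting scores and assuming the SciPy library, appears below. An analysis of covariance would instead regress the posttest scores on group membership with the pretest entered as a covariate.

```python
# Minimal sketch: comparing gain scores (posttest minus pretest) for the
# treatment and control classes. All scores are hypothetical.
from statistics import mean
from scipy import stats

pre_t, post_t = [60, 55, 62, 58, 64, 57], [72, 66, 75, 70, 74, 69]   # treatment class
pre_c, post_c = [61, 56, 60, 59, 63, 58], [64, 60, 63, 62, 66, 61]   # control class

gain_t = [post - pre for pre, post in zip(pre_t, post_t)]
gain_c = [post - pre for pre, post in zip(pre_c, post_c)]

t, p = stats.ttest_ind(gain_t, gain_c)
print(f"Mean gain (treatment) = {mean(gain_t):.1f}, mean gain (control) = {mean(gain_c):.1f}")
print(f"t = {t:.2f}, p = {p:.4f}")
```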
A study by Slavin and Karweit (1984) evaluating mastery learning versus
team learning illustrates a factorialized (2 × 2) version of the nonequivalent
control group design:
O1 X O2
———————
———————
O3 O4
gain from its initial position of superiority (or inferiority) should not be nearly
so great as the gain of the experimental group.5
This design allows researchers to compare a treatment to a control when-
ever assignment to groups has resulted from systematic evaluations of test or
performance scores. When treatment Ss have been drawn exclusively from
among those scoring on one side of a cutoff point and control Ss from among
those scoring on the other side, this design can be usefully applied.
O1 X O2
—————————
O3 X O4
O1 X O2
———————————
O5 O3 X O4
Note the addition of O5, which makes this version of the separate-sample
design also a version of the nonequivalent control group design. However,
researchers often cannot make such a change, because one of the conditions
necessitating the use of the separate-sample design is restriction of opportuni-
ties to test subjects only immediately before and after the treatment.
The threat posed by maturation to the validity of a separate-sample pre-
test-posttest design is illustrated in a faculty study (mentioned early in Chapter
1) of the effects of student teaching on students’ perceptions of the teaching
profession. To control for the history bias encountered when this kind of study
lacks a control group, the faculty researcher used this separate-sample design:
O1 O2 (juniors)
———————————
O3 O4 X O5 (seniors)
The Ss experienced student teaching in the spring of the senior year. The
researcher recognized the difficulty of evaluating the effects of student teach-
ing on students’ perceptions of the profession without comparing it to a con-
trol group. Students not in the teacher education program would make poor
control group members, because their perceptions of teaching as a profession
would likely differ from those of students in the program. A longitudinal
study, perhaps involving the time-series design, would have required more
time than most research projects can be expected to take. The researcher solved
Patched-Up Design
X O1
————
O2
It controlled for maturation and history effects but not at all for selection bias.
The patched-up design shown below combines these two preexperimental
designs to merge their strengths and overcome their shortcomings:
Class A X O1
——————
Class B O2 X O3
Single-Subject Design
The single-subject design allows three variants. In the first, the subject
experiences the control or baseline (A) and experimental (B) condition once
each (creating the A-B design). In the second variant, the subject experiences
the baseline condition twice and the experimental condition once (creating
the A-B-A design). The third variant gives each experience twice (A-B-A-B
design). Multiple repetitions enhance the internal validity of the design (which
is already somewhat limited; as described earlier in the chapter) by increasing
the stability of the findings. Results of repeated trials substitute for compari-
sons of results across subjects, an impossibility in the single-subject design.
An example of results using the A-B approach (also represented as X0 O1
X1 O2) is shown in Figure 8.4. The researcher monitored a single subject, a
seventh-grade boy in a class for emotionally handicapped students, on two
measures over a 16-week period: (1) rate of chair time out (the solid line), and
(2) percentage of escalated chair time out (the dashed line). A chair time out
(CTO) served as a punishment for three disruptive behaviors. A student disci-
plined in this way was required to sit in a chair at the back of the classroom for
5 minutes. If the recipient continued disruptive behavior, the consequence was
an escalated CTO, which meant continuing to sit in the back of the classroom
and having the incident treated as a more serious offense. If a
student received more than one escalated CTO, he or she lost the privilege of
participating in a special end-of-week activity.
The researcher observed the subject for 10 weeks under normal conditions,
termed the baseline, before the onset of a 6-week treatment. The treatment
awarded bonus points to the subject for receiving no CTOs and no escalated
CTOs for an entire day. The bonus points could be exchanged for prizes dur-
ing the special end-of-week activity. The results clearly show that the treatment
was accompanied by a reduction in disruptive behavior by the subject.
The single-subject design suffers from weak external validity. Many ques-
tions surround any effort to generalize to others results obtained on a single
subject. To increase external validity, such a design should be repeated or
replicated on other subjects to see if similar results are obtained. The results
of each subject may be presented in individual reports. Adding subjects will
ultimately render the design indistinguishable from the equivalent time-sam-
ples design.
The term ex post facto indicates a study in which the researcher is unable to
cause a variable to occur by creating a treatment and must examine the effects
of a naturalistically occurring treatment after it has occurred. The researcher
attempts to relate this after-the-fact treatment to an outcome or dependent
measure. Although the naturalistic or ex post facto research study may not
always be diagrammed differently from other designs described so far in the
chapter, it differs from them in that the treatment is included by selection
rather than manipulation. For this reason, the researcher cannot always assume
a simple causal relationship between independent and dependent variables. If
observation fails to show a relationship, then probably no causal link joins the
two variables. If a researcher does observe a predicted relationship, however,
he or she cannot necessarily say that the variables studied are causally related.
Chapter 9 covers two types of ex post facto designs—the correlational design
and the criterion-group design.
because they knew that they were being observed and felt increasingly impor-
tant by virtue of their participation in the experiment.
Many school-system studies compare results of experimental treatments
to no-treatment or no-intervention conditions. They risk recording differences
based not on the specifics of the interventions, but on the fact that any inter-
ventions took place. That is, observed benefits may not accrue from the details
of the interventions but from the fact that the subjects experienced some form
of intervention. The effects may not be true but reactive—that is, a function of
the experiment—which reduces external validity.
Similar problems in experiments can result from expectations by certain
key figures in an experiment, such as teachers, about the likely effects of a treat-
ment, creating another reactive effect. Such expectancies operate, for instance,
in drug research. If someone administering an experimental drug knows what
drug a subject receives, he or she may form certain expectancies regarding
its potential effect. For this reason, such a study should ensure that the drug
administrator operates in the blind, that is, remains unaware of the kind of
drug administered to particular subjects in order to avoid the effects of those
expectancies on the outcome of the experiment.
Rather than identifying the two familiar groups to test an intervention (the
experimental group and the control group), researchers may gain validity by
introducing a second control group that specifically controls for the Hawthorne
effect. What is the difference between a no-treatment control and a Hawthorne
control? A no-treatment control group like that typically employed in inter-
vention studies involves no contact at all between the experimenter and the Ss
except for the collection of pretest data (where necessary) and posttest data.
The Hawthorne control group, on the other hand, experiences a systematic
intervention and interaction with the experimenter; this contact introduces
some new procedure that is not anticipated to cause specific effects related to
the effects of the experimental treatment or intervention. That is, the researcher
deliberately introduces an irrelevant, unrelated intervention to the Hawthorne
control group specifically in order to create the Hawthorne effect often associ-
ated with intervention. Thus the experimental and Hawthorne control groups
experience partially comparable interventions, both of which are expected to
produce a Hawthorne or facilitating effect. However, the Hawthorne control
condition is unrelated and irrelevant to the dependent variables, so a compari-
son of its outcome with that for the treatment group indicates the differential
effect of the experimental intervention.
For example, a study might seek to evaluate a technique for teaching first
graders to read, identifying as the dependent variable a measure of reading
FIGURE 8.5 Experimental Design With Controls for Reactive Effects: Hawthorne and
Expectancy Effects
R X Ep O1
R X En O2
R H Ep O3
R H En O4
■ Summary
R X O1
R O2

R O1 X O2
R O3 O4
O1 X O2
————
O3 O4
O1 X O2
——————
O3 X O4
Also, a patched-up design resembles this one, but it omits O1. The last
three designs can be factorialized by the addition of one or more modera-
tor variables.
6. In the single-subject design, a variation on the equivalent time-samples
design, a single subject serves as his or her own control. Variations of base-
line or control (A) and experimental treatment (B) include A-B, A-B-A,
and A-B-A-B, depending on how many times the subject experiences each
level of the independent variable.
7. When the independent variable is not manipulated, ex post facto designs
are used. These designs lack the certainty of experimental designs since
they cannot adequately control for experience bias (see Chapter 9).
8. A research design may need provisions to control for external validity based
on experiences (namely the Hawthorne effect and other reactive effects). In
place of a normal control group, such a study would incorporate a Haw-
thorne control group whose members received a treatment other than the
one being tested. This group would experience the same reactive effects as
the treatment group, but would not experience the experimental treatment.
A second approach, to control particularly for expectations, would include
expectation as a moderator variable, with one level each of the treatment and
control carrying positive expectations, and one level of each carrying neutral
expectations.
1. Rank order the three designs according to their adequacy for controlling
for history bias:
1. Most adequate a. Time-series design
2. Next most adequate b. One-group pretest-posttest design
3. Least adequate c. Pretest-posttest control group design
2. Rank order the four designs according to their adequacy for controlling for
selection bias:
1. Most adequate a. Patched-up design
2. Next most adequate b. Intact-group comparison
3. Next least adequate c. Posttest-only control group design
4. Least adequate d. Nonequivalent control group design
3. Prediction: Student teachers who are randomly assigned to urban schools to
gain experience are more likely to choose urban schools for their first teaching
assignments than student teachers who are randomly assigned to nonurban
schools.
Construct an experimental design to test this prediction.
4. Prediction: Students given programmed math instruction will show greater
gains in math achievement than students not given this instruction, but this
effect will be more pronounced among students with high math aptitude
than among those with low aptitude.
Construct an experimental design to test this prediction.
5. Which of the following circumstances create the need for a quasi-experi-
mental design? (More than one may be a right answer.)
a. Experimenter cannot assign Ss to conditions.
b. Experimenter must employ a pretest.
c. Experimenter must collect the data himself or herself.
d. The program to be evaluated has already begun.
6. Which of the following circumstances create the need for a quasi-experi-
mental design? (More than one may be a right answer.)
a. The study includes more than one independent variable.
b. No control group is available for comparison.
c. The pretest is sensitizing.
d. Every member of the sample must receive the treatment.
7. Prediction: Student teachers who choose urban schools to gain experience
are more likely to choose urban schools for their first teaching assignments
than student teachers who choose nonurban schools.
a. Why must researchers employ a quasi-experimental design to test this
prediction?
b. Construct one.
■ Recommended References
Keppel, G. (1991). Design and analysis: A researcher’s handbook (3rd ed.). Englewood
Cliffs, NJ: Prentice-Hall.
Mitchell, M., & Jolley, J. (1992). Research design explained (2nd ed.). Fort Worth, TX:
Harcourt Brace.
Trochim, W. M. K. (Ed.). (1986). Advances in quasi-experimental design and analysis.
San Francisco, CA: Jossey-Bass.
= CHAPTER NINE
OBJECTIVES
field placements, while his urban students expressed a similar discomfort when
they entered suburban and rural sites.
Dr. Jacobs decided to investigate this issue more closely. He located an
instrument that asked preservice teacher interns to rate their confidence level on
a scale from 1 to 100 with respect to classroom management, planning lessons,
and building relationships with students. The instrument required that interns
evaluate their confidence across these three areas for urban, rural, and suburban
field placements. At the beginning of the semester he administered it to all of his
students who were planning to enter a field placement. Instead of asking them to
provide their names, Dr. Jacobs asked that they indicate the area of the country
where they’d grown up and attended school.
After collecting data across several semesters, Dr. Jacobs then compared
the average confidence rating for each area between interns who lived in rural,
suburban, and urban areas. Just as he suspected, he found that interns were less
confident across the board when they entered a field site that was different from
the area in which they had lived, despite the coursework they’d completed. It
seemed that, despite the university’s claims to the contrary, students who were
not raised in an urban environment still felt unprepared to teach urban students.
Dr. Rebecca Ballard was a new assistant professor of education at the large
urban university in her city. In addition to helping to prepare preservice ele-
mentary school teachers, she conducted research that helped her undergradu-
ate students teach and apply study strategies. Dr. Ballard was quite surprised to
find that just a couple of weeks into her first fall semester at the university, her
own college students were struggling. Specifically, while her undergrads were
quite adept at memorizing key names and terms, they failed to connect what
they were learning with actual classroom situations. As a result, when they
were tested on the application of theoretical concepts, they did quite poorly.
Dr. Ballard came up with an interesting idea. First, she provided her ten
students with a list of terms on which they would be tested later that week.
She then offered extra credit for a short story that described how each term
applied to students’ own everyday lives. On the day of the exam, she col-
lected each student’s story before administering the test. Using a spreadsheet,
she took note of the number of stories each student completed and his or her
grade on the exam:
The situations presented above illustrate two research designs
described in this chapter: correlational design and causal-comparative
design. We will begin by describing the purpose and methodology that
guides correlational research, presenting examples to clarify the specific steps to
conducting and interpreting a correlational study. We will then describe causal-
comparative research, specifically outlining the applications of this methodol-
ogy. We will close by describing a third research methodology—longitudinal
design, drawing comparisons among the three with respect to the strengths and
weaknesses of each design.
■ Correlational Research
In a correlational study a researcher collects two or more sets of data from a group
of subjects for analysis that attempts to determine the relationship between them:
O1 O2
Consider the study by McGarity and Butts (1984) that examined the relationship
between the effectiveness of teacher management behavior and student engage-
ment in 30 high school science classes. Effective teacher management behavior was
operationally defined by observer judgments on a list of 10 behavioral indicators.
Student academic engagement was operationally defined by the number of stu-
dents out of 10 observed per minute who were attending to instruction, working
on seatwork, or interacting on instructional content.
The study found a high correlation between the two measures. Note that
the researchers did not assign teachers to management behavior conditions or
otherwise directly affect any of the targeted variables. Hence, the high correla-
tion does not indicate whether effective management causes student engagement,
whether engaged students make effective management easier, or whether some
third factor produces both. In general, a correlational study can yield one of three
possible outcomes: (1) that there is no relationship between or among the vari-
ables of interest, (2) that there is a positive correlation between or among the
variables of interest, or (3) that there is a negative correlation between or among
the variables of interest.
Some correlational studies will determine that certain variables do not
relate to one another. A practical example of this would be a research study
which investigates the relationship between students’ ratings of the quality of
school lunches and their sophomore social studies grade average. These hypo-
thetical data are displayed in Table 9.1.
One way to represent these data is to use a scatterplot. For our fictional
example, we will plot school lunch ratings on the x-axis and sophomore social
studies grades on the y-axis, yielding representative data points for each stu-
dent. A scatterplot allows researchers to visually represent the direction and
the strength of the relationship between variables. A scatterplot for this data set
is displayed in Figure 9.1.
TABLE 9.1 Students’ Ratings of School Lunch With Sophomore Social Studies Grades
Ratings of School Lunch (0–100) Sophomore Social Studies Grades (0–100)
56 95
76 89
99 60
40 75
71 60
68 90
30 59
91 94
80 40
72 94
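Such a scatterplot could be produced directly from the Table 9.1 values; the following sketch assumes the matplotlib library is available.

```python
# Minimal sketch: scatterplot of lunch ratings (x) against social studies
# grades (y), using the values listed in Table 9.1.
import matplotlib.pyplot as plt

lunch_ratings = [56, 76, 99, 40, 71, 68, 30, 91, 80, 72]
grades        = [95, 89, 60, 75, 60, 90, 59, 94, 40, 94]

plt.scatter(lunch_ratings, grades)
plt.xlabel("Rating of school lunch (0-100)")
plt.ylabel("Sophomore social studies grade (0-100)")
plt.title("No apparent relationship between the two variables")
plt.show()
```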
You will immediately notice that the data points seem to be randomly scat-
tered throughout with no clearly discernible pattern. If the points portrayed
a trend, they would more closely form a line running either from the bottom-
right corner to the upper-left corner of the graph or from the upper-right cor-
ner to the bottom-left, depending on the nature of the relationship. Instead, if
one were to draw a line connecting each point on the graph, it would almost
resemble a circle. This suggests that there is probably a very weak or nonexis-
tent relationship between these two variables.
A more precise way to describe this relationship would be to calculate
the correlation coefficient. The correlation coefficient is a statistical represen-
tation of the relationship between two variables. Researchers who wish to
know the strength of a relationship between two variables will use a bivari-
ate correlation coefficient, while those who wish to explore the relationship
among three or more variables will use a multivariate correlational method.
In both instances, the purpose of a correlational coefficient is to mathemati-
cally express the degree of the relationship along a range from –1.0, which
indicates a perfect negative correlation, to 1.0, which indicates a perfect posi-
tive correlation. A coefficient of 0 indicates that the variables of interest are
not related. In this case, the coefficient is .01. This indicates that there is no
relationship between the two variables. Simply stated, a student’s rating of
the quality of lunch does not vary systematically with his or her social studies
grade.
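The coefficient itself can be computed from the same Table 9.1 values; the sketch below uses SciPy's pearsonr function and should yield a value near zero, consistent with the .01 reported above.

```python
# Minimal sketch: computing the bivariate (Pearson) correlation coefficient
# for the Table 9.1 data. A value near zero indicates no linear relationship.
from scipy import stats

lunch_ratings = [56, 76, 99, 40, 71, 68, 30, 91, 80, 72]
grades        = [95, 89, 60, 75, 60, 90, 59, 94, 40, 94]

r, p = stats.pearsonr(lunch_ratings, grades)
print(f"r = {r:.2f}, p = {p:.3f}")   # r is essentially zero for these data
```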
The story of Dr. Rebecca Ballard that we used to open this chapter is an
excellent example of the second possible outcome of a correlational study. Dr.
Ballard asked her underperforming students to write application-based sto-
ries to help them remember key terms and concepts. She then administered a
weekly examination, recording the score of each student as well as the number
of stories they completed. Table 9.2 summarizes these data.
A preliminary glance at these data suggests a trend: it seems that the greater
the number of stories a student has written, the higher his or her grade on
the examination. A scatterplot (displayed in Figure 9.2) confirms this visually.
Compare this scatterplot to the first example. In that instance, there seemed
to be no clear pattern to the data points; a high social studies grade was just as
likely to correspond with a low lunch rating as it was with a high one. This
example is clearly different. Lower scores on the exam, as measured on the
y-axis, fall on the left side of the x-axis, while higher scores fall on the right side
of the x-axis. Visually, if one were to draw a line connecting every data point,
it would resemble a diagonal line running from the lower left corner to
the upper right.
Thus far, this examination of the data suggests a positive correlation. As
one variable increases, the other variable increases as well. Again, it is
important to note that correlational studies do not establish causation—the fact
that two variables move in tandem suggests a relationship, but does not necessar-
ily mean that one variable is causing a change in the other. In this case, it would
seem that students who wrote more stories also did better on the exam. At
this point, we do not know if writing more stories caused students to perform
better on the exam, only that the number of stories written correlated positively
with exam grades.
Calculating a correlation coefficient for this data set further confirms this
relationship. As we stated earlier, a positive correlation will yield a coefficient
between 0 and 1.0. A weak positive correlation would fall between .1 and .3
(with the corresponding weak negative correlation falling between –.1 and –.3),
a moderate positive correlation would yield a correlation between .3 and .5
(the corresponding negative falling between –.3 and –.5), and a strong positive
correlation would be between .5 and 1.0 (or –.5 and –1.0 for a strong negative
correlation). These data yield a coefficient of .90, which suggests a very strong
TABLE 9.3 Average Class Lab Grade and Weekly Hours Spent Lecturing
Number of Hours Spent Lecturing That Week (Out of 10 Total Class Hours)    Average Class Grade on the Weekly Lab (0–100)
9 79
10 69
2 88
4 84
6 75
10 62
2 92
5 90
3 93
7 74
Again a clear trend is evident in the scatterplot. With respect to the two
extremes, the higher lab grades fall on the left hand, or lower end of the lecture-
hours continuum, while the lower grades fall on the right, or higher end of the
lecture-hours continuum. Mid-range grades correspond to weeks of moderate
lecture. The calculated correlation coefficient is –.88, which indicates a very
strong negative correlation between the number of hours Jessica Ina spent lec-
turing each week and the corresponding week’s lab grade. Specifically, as the
hours spent lecturing increase, lab grades decrease, and as the number of hours
spent lecturing decrease, the class’s average lab grade increases.
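The same calculation applied to the Table 9.3 values reproduces this result; the following sketch assumes the SciPy library is available.

```python
# Minimal sketch: correlation between weekly lecture hours and average lab
# grade, using the values from Table 9.3.
from scipy import stats

lecture_hours = [9, 10, 2, 4, 6, 10, 2, 5, 3, 7]
lab_grades    = [79, 69, 88, 84, 75, 62, 92, 90, 93, 74]

r, p = stats.pearsonr(lecture_hours, lab_grades)
print(f"r = {r:.2f}")   # approximately -0.88, a strong negative correlation
```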
In summation, a simple correlational study focuses on the relationship
between two variables. While correlational studies do not establish causation,
they are useful to either predict future behavior or explain complex phenom-
ena. An investigation of the relationship between two variables will yield one
of three possible outcomes: a finding of no relationship, a positive correlation,
or a negative correlation between the variables.
Thus far we have discussed only studies that focus on the relationship between
two variables. While this is helpful in interpreting the direction and strength of
a relationship, it does not truly represent the majority of correlational research
studies. Many studies are interested in the relationship among several variables.
This is consistent with the two purposes of correlational studies: to explore and
predict. Researchers who conduct explanatory studies may examine the rela-
tionship among multiple variables, ultimately omitting those relationships that
are found to be weak or nonexistent. Stronger relationships between variables
suggest the need for additional research.
Researchers who conduct predictive studies wish to use what is known
about existing relationships to anticipate change in a particular variable. This
may be represented mathematically:
y1 = a + bx1
This is known as a prediction equation. Here, x stands for the predictor vari-
able. The predictor variable is used to make a prediction about the criterion
variable (y). Both a and b in this equation are constants that are calculated
based upon the available data. Assume we wish to predict a basketball player’s
professional scoring average based upon her scoring average in college. If she
averaged 19.4 points per game in college, and we assign a = .25 and b = .72, her
first year scoring average would be .25 + (.72 × 19.4), which equals 14.22 points
per game as a professional. Later, we can compare this predicted value of the
criterion variable with the player's actual performance to determine the accuracy
of the prediction equation.
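The arithmetic of the prediction equation can be expressed in a few lines; the sketch below simply encodes the constants given above.

```python
# Minimal sketch: applying the prediction equation y1 = a + b*x1 to the
# basketball example (a = .25, b = .72, college average = 19.4).
def predict(x1, a=0.25, b=0.72):
    """Predict the criterion variable from the predictor variable."""
    return a + b * x1

print(f"Predicted professional scoring average: {predict(19.4):.2f} points per game")
```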
Many research studies use a technique called multiple regression, which
investigates the relationship between multiple predictor variables on a criterion
variable. This is done using the following equation:
y1 = a + b1x1 + b2x2 + b3x3 + . . .
Again, in this formula, y stands for the criterion variable, or that which the
researcher wishes to predict based upon the existing correlational data. x1, x2,
and x3 represent the multiple predictor variables of interest. One doctoral stu-
dent, for example, wished to investigate the variables that contribute to first-year
teachers’ confidence. She collected a number of different data sources: under-
graduate grade point average, teachers’ ratings of the school, student teaching
evaluations, ratings of teachers’ knowledge of subject matter, and each teacher’s
age. Each of these variables can then be entered into the multiple regression equa-
tion, which will enable the researcher to not only predict changes in the criterion
variable (in this instance, teacher confidence) but also learn the independent con-
tribution of each to changes in a teacher’s confidence level.
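A minimal sketch of such a multiple regression, fit by ordinary least squares to entirely hypothetical data for two of the predictors mentioned (undergraduate grade point average and student teaching evaluation), appears below; the variable names are illustrative only.

```python
# Minimal sketch: fitting a multiple regression of first-year teacher
# confidence on two hypothetical predictors with ordinary least squares.
import numpy as np

gpa        = np.array([3.1, 3.6, 2.9, 3.8, 3.4, 3.0, 3.7, 3.2])
st_eval    = np.array([78, 88, 70, 94, 85, 72, 90, 80])
confidence = np.array([62, 75, 55, 86, 73, 58, 82, 66])   # criterion variable

X = np.column_stack([np.ones(len(gpa)), gpa, st_eval])     # add intercept term
coefs, *_ = np.linalg.lstsq(X, confidence, rcond=None)

a, b1, b2 = coefs
print(f"confidence = {a:.1f} + {b1:.2f}*GPA + {b2:.2f}*evaluation")
```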
2. Defining the sample. Participants are identified and selected from the desired
target population. The population of interest is determined by the problem
statement. Researchers who wish to investigate the relationship between the
use of study strategies and academic achievement would obviously select
participants from among an age-appropriate student population. This sam-
ple should consist of at least 30 participants to ensure the accuracy of the
relationship, though larger numbers of participants are desirable.
3. Choosing an instrument. When considering what instrument to use in
a correlational study, the most important consideration is validity. An
instrument should yield data that accurately measure the variables of
interest. It is also important to recognize what type of data is sought.
Table 9.4 outlines five basic correlational statistics that correspond to
one of four data types: continuous, rank-ordered, dichotomous, and cat-
egorical. Researchers must understand what type of data an instrument
measures to ensure that they are correctly calculating and interpreting the
correlational statistic.
4. Designing the study. Most correlational designs are quite straightforward.
Each variable is measured in each participant, which enables the correla-
tion of the various scores. Data collection usually takes place in a fairly
short period of time, either in one session or in two or more sequential
sessions.
5. Interpreting the data. As explained in our earlier examples, correlation
coefficients are the statistical means of interpreting data. Should a relation-
ship between variables emerge, it is important to note again here that cor-
relation does not imply causation; that is to say, an existing relationship may
be due to the influence of Variable 1, that of Variable 2 or a third, unknown
cause.
■ Causal-Comparative Research
one state with those of its opposite. The causal-comparative design provides a
format for such analysis. A criterion group is composed of people who display
a certain characteristic that differentiates them from others, as determined by
outside observation or judgment, or by self-description.
Suppose, for example, that a researcher is interested in studying the fac-
tors that contribute to teaching competence. Before conducting a true experi-
ment intended to produce competent teachers by design, she or he would need
some ideas about what factors separate competent and incompetent teaching.
The causal-comparative design requires the researcher to identify two criterion
groups: competent teachers and incompetent teachers. This distinction might
reflect students’ or supervisors’ judgments. The researcher would then observe
and contrast the classroom behavior of these two groups of teachers in order
to identify possible outcomes of teacher competence. The researcher could also
examine the backgrounds and skills of these two teacher groups looking for
ideas about factors that give rise to competence in some teachers.
Consider again the correlational study of teacher management behavior and
student engagement by McGarity and Butts (1984). These authors also divided
the students into groups defined as those taught by teachers judged to be “com-
petent” in teacher management behavior and those taught by teachers judged to
be “incompetent” in that behavior. The study then compared these two criterion
groups, competent and incompetent teachers, in terms of the amount of their
engagement with students. The competent teachers’ students were found to be
more engaged in learning than were incompetent teachers’ students.
C O1
—————
O2

or

O1 C O2
—————
O3 O4

or

C1 O1
—————
C2 O2
C1 Y1 O1
———————
C2 Y1 O2
———————
C1 Y2 O3
———————
C2 Y2 O4
To illustrate this approach, return again to the McGarity and Butts (1984)
study of teacher management and student engagement. These authors were
interested in determining whether the effect of competent versus incompetent
teachers on student engagement depended on student aptitude. To accomplish
this analysis, students were divided into three levels of aptitude: high, medium,
and low (Y1, Y2, Y3); they were then compared on engagement for competent
(C1 ) and noncompetent (C2) teachers.
C1 Y1 O1
——————
C2 Y1 O2
——————
C1 Y2 O3
——————
C2 Y2 O4
——————
C1 Y3 O5
——————
C2 Y3 O6
Because the study compares three different groups of students, rather than
evaluating the same individuals at three points in time, this design shows some
weakness in internal validity due to selection bias. This weakness can be some-
what reduced by including control variables such as socioeconomic status in
selecting the samples. However, collecting data on all three groups at the same
point in time would lessen the possible effect of history bias.
A researcher who is interested in academic motivation may question what causes some stu-
dents to be more motivated than others. Why are students selectively motivated
to pursue certain tasks? He or she might then speculate as to possible answers
to this question. Perhaps academic motivation is determined by goal orienta-
tion or attributional tendencies. This leads to the development of a statement
of purpose, which identifies the rationale for the research effort. For example,
a researcher may wish to identify the sources of academic motivation, which
allows for the investigation of multiple hypotheses.
■ Longitudinal Research
the same students each time. If those moving formed a random subset of the
original group, then that activity would not create a problem. But if, for exam-
ple, those who moved disproportionately represented children of military or
upwardly mobile families, or families experiencing divorce, then the loss of
their input might distort results by eliminating their systematic differences
from the rest of the group on the dependent variable: need for independence.
This effect would introduce considerable bias into the findings.
The cohort study offers the most practical way to control for various
selection biases in a longitudinal design, since this method is less susceptible
to selection bias than the trend study and less susceptible to experimental
mortality than the panel study. However, researchers must recognize that all
versions of the longitudinal design are susceptible to history (or experience)
bias, since the passage of time brings not only naturally occurring develop-
mental changes, which are the subject of study, but also changes in external
circumstances for society at large, a source of confounding influences. For
example, the nation could go to war, which might cause profound changes in
independence strivings, particularly among older children. Therefore, as for
all ex post facto designs, users should be careful to avoid making strong con-
clusions about cause (development) and effect (whatever dependent variables
are studied) based on longitudinal studies.
There are a number of ways to address threats to internal and external validity
with respect to the correlational, causal-comparative, and longitudinal designs as
shown in Table 9.5. Inexperienced researchers who use the correlational design
often randomly select a number of variables, collect data, and run multiple corre-
lations. This is problematic for two reasons. First, as the number of correlations run increases, so does the likelihood of committing a Type I error. Second, there should be a theoretical basis for the investigation, which reduces the likelihood that a correlation that emerges between variables is merely a chance finding. Also, poorly selected or improperly administered instruments may compromise the internal validity of a study. As is the case with all research designs, the instrument to be used must be carefully selected to reflect the research question, and all research staff must be carefully trained to ensure that it is administered properly. Finally, although it may be tempting to interpret a relationship causally, the fact that two variables are related to one another does not necessarily mean that one causes the other.
■ Summary
population (called a cohort study), or (c) the same individuals (called a panel
study). These three approaches vary in both practicality and ability to con-
trol for selection.
8. Users should be careful to avoid making strong conclusions about cause
and effect based on longitudinal studies.
a. Plot these data using a scatterplot. How might this help you to under-
stand the relationship between math and language scores?
b. What type of relationship (if any) exists between students’ scores?
For the next series of questions, choose the design that is most appropriate.
■ Recommended References
Bauer, K. (2004). Conducting longitudinal studies. New Directions for Institutional
Research, 121, 75–88.
Jaccard, J., Becker, M., & Wood, G. (1984). Pairwise multiple comparison procedures:
A review. Psychological Bulletin, 96, 589–596.
Lancaster, B. P. (1999). Defining and interpreting suppressor effects: Advantages and
limitations. In B. Thompson (Ed.), Advances in social science methodology (pp.
139–148). Stamford, CT: JAI Press.
= CHAPTER TEN
OBJECTIVES
■ Test Reliability
Test reliability means that a test gives consistent measurements. A ruler made
of rubber would not give reliable measurements, because it could stretch or
contract. Similarly unreliable would be an IQ test on which Johnny or Janie
scored 135 on Monday and 100 on the following Friday, with no significant
event or experience during the week to account for the discrepancy in scores.
A test that does not give reliable measurements is not a good test regardless of
its other characteristics.
Several factors contribute to unreliability in a test: (1) familiarity with the
particular test form (such as multiple-choice questions), (2) subject fatigue, (3)
emotional strain, (4) physical conditions of the room in which the test is given,
(5) subject health, (6) fluctuations of human memory, (7) subject’s practice
or experience in the specific skill being measured, and (8) specific knowledge
gained outside the experience evaluated by the test. A test that is overly sensi-
tive to these unpredictable (and often uncontrollable) sources of error is not a
reliable one. Test unreliability creates instrumentation bias, a source of internal
invalidity in an experiment.
Before drawing any conclusions from a research study, a researcher should
assess the reliability of his or her test instruments. Commercially available stan-
dardized tests have been checked for reliability; test manuals provide data rela-
tive to this evaluation. When using a self-made instrument, a researcher should
assess its reliability either before or during the study. This section briefly dis-
cusses four approaches for determining reliability.
Test-Retest Reliability
One way to measure reliability is to give the same people the same test on more
than one occasion and then compare individual performance on the two admin-
istrations. In this procedure, which measures test-retest reliability, each person’s
score on the first administration of the test is related to his or her score on the
second administration to provide a reliability coefficient.1 This coefficient can
vary from 0.00 (no relationship) to 1.00 (perfect relationship), but real evaluations
rarely produce coefficients near zero. Because the coefficient is an indication of
the extent to which the test measures stable and enduring characteristics of the test
taker rather than variable and temporary ones, researchers hope for reasonably
high coefficients.
The test-retest evaluation offers the advantage of requiring only one form of a test. It brings the disadvantage that later scores show the influence of practice and memory. They can also be influenced by events that occur between testing sessions.
1. This relationship is usually computed by means of a correlation statistic, as described in Chapter 12.
Because the determination of test-retest reliability requires two test administrations, it presents more challenges than the other three reliability testing
procedures (described in later subsections). However, it is the only one of the
four that provides information about a test’s consistency over time. This qual-
ity of a test is often important enough in an experiment to justify the effort to
measure it, particularly when the research design involves both pretesting and
posttesting.
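For researchers who compute the coefficient themselves, a minimal sketch of the calculation follows, using the Pearson correlation as the relationship statistic; the two sets of scores are hypothetical.

# Minimal sketch: test-retest reliability as the Pearson correlation
# between two administrations of the same test (hypothetical scores).

def pearson_r(x, y):
    """Pearson product-moment correlation between two score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dev_x = [xi - mean_x for xi in x]
    dev_y = [yi - mean_y for yi in y]
    num = sum(dx * dy for dx, dy in zip(dev_x, dev_y))
    den = (sum(dx ** 2 for dx in dev_x) * sum(dy ** 2 for dy in dev_y)) ** 0.5
    return num / den

first_administration = [135, 110, 98, 122, 104, 117, 128, 95]
second_administration = [131, 112, 101, 118, 102, 120, 125, 97]

print(f"Test-retest reliability: "
      f"{pearson_r(first_administration, second_administration):.2f}")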
Alternate-Form Reliability
Split-Half Reliability
The two approaches to reliability testing described so far seek to determine the
consistency of a test’s results over time and over forms. A researcher may also
want to make a quick evaluation of a test’s internal consistency. This judgment
involves splitting a test into two halves, usually separating the odd-numbered
items and the even-numbered items, and then correlating the scores obtained
by each person on one half with those obtained by each person on the other.
This procedure, which yields an estimate called split-half reliability, enables a
researcher to determine whether the halves of a test measure the same quality
or characteristic. The obtained correlation coefficient (r1) is then entered into the Spearman-Brown formula to calculate the whole-test reliability (r2):

r2 = 2r1 / (1 + r1)
The actual test scores that will serve as data in a research study are based
on the total test score rather than either half-test score. Therefore, the split-half
reliability measure can be corrected by the formula to reflect the increase in
reliability gained by combining the halves.
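A minimal sketch of the split-half procedure and the Spearman-Brown correction follows; the item responses are hypothetical, and the correlation function from Python’s statistics module (available in Python 3.10 and later) stands in for the correlation procedure described in Chapter 12.

# Minimal sketch: split-half reliability with the Spearman-Brown correction.
from statistics import correlation  # Python 3.10+

# Each row is one hypothetical examinee; each column is one item scored 0 or 1.
items = [
    [1, 1, 0, 1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 0, 1, 1, 1, 0, 1],
]

odd_half = [sum(row[0::2]) for row in items]   # half-test score on items 1, 3, 5, 7
even_half = [sum(row[1::2]) for row in items]  # half-test score on items 2, 4, 6, 8

r1 = correlation(odd_half, even_half)  # split-half (half-test) correlation
r2 = (2 * r1) / (1 + r1)               # Spearman-Brown whole-test estimate

print(f"Half-test correlation r1 = {r1:.2f}")
print(f"Corrected whole-test reliability r2 = {r2:.2f}")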
Kuder-Richardson Reliability
■ Test Validity
The validity of a test is the extent to which the instrument measures what it
purports to measure. In simple words, a researcher asks, “Does the test really
measure the characteristic that I will use it to measure?” For example, a test of
mathematical aptitude must yield a true indication of a student’s mathematical
aptitude. When you use a ruler to measure an object, you do not end up with a
valid indication of that object’s weight.
This section discusses four types of validity. A test’s manual reports on
these forms of validity so that the potential user can assess whether the instru-
ment measures what the title says it measures.
Predictive Validity
where pi and qi refer to the proportions of students responding correctly and incorrectly,
respectively, to item i.
Concurrent Validity
Construct Validity
A test builder might reason that a student with high self-esteem would be more
inclined than one with low self-esteem to speak out when unjustly criticized by
an authority figure; this reasoning suggests that such behavior can be explained
by the construct (or concept) of self-esteem.3 Such a proposed relationship
between a construct and a derivative behavior might provide a basis for deter-
mining the construct validity of a test of self-esteem. Such an evaluation might
seek to demonstrate the relationship of self-esteem test scores to a proposed
derivative behavior (such as speaking out in self-defense).
3. To relate the term construct to familiar language, this validity measure might indicate that some
independent variable causes self-esteem—an intervening variable or construct—which in turn leads
to the speaking-out behavior.
Content Validity
4. A further example of a method for establishing content validity appears in Chapter 14.
Nominal Scales
The term nominal means “named.” Hence, a nominal scale does not measure variables; rather, it names them. In other words, it simply classifies observations into categories with no necessary mathematical relationship between them.
Suppose a researcher were interested in the number of happy and unhappy
students in a class. If an interviewer classified each child as happy or unhappy
based on classroom conversations, this classification system would represent a
nominal scale. No mathematical relationship between happy and unhappy is
implied; they simply are two different categories.5 Thus, the happiness variable
is measured by a nominal method.
When a study’s independent variable includes two levels—a treatment con-
dition and a no-treatment control condition (or two different treatments)—the
independent variable is considered a nominal one, because measurement com-
pares two discrete conditions. For example, splitting IQ scorers into high and
low groups would make IQ into a two-category nominal variable. Although high and low denote an order, and so could be considered ordinal variables (as discussed in the next subsection), they can also be treated simply as category names and handled as nominal data. (For statistical purposes, two-category
“orders” are usually best treated as nominal data, as Chapter 12 discusses.)
The behavioral sampling form later in the chapter in Figure 10.9 gives an
example of a nominal scale. One discrete behavior is checked for each student
observed.
Ordinal Scales
The term ordinal means “ordered.” An ordinal scale rank orders things, cate-
gorizing individuals as more than or less than one another. (For two-level vari-
ables, the distinction between nominal and ordinal measurement is an arbitrary
one, although nominal scaling, especially of independent variables, simplifies
statistical analyses.)
Suppose the observer who wanted to measure student happiness inter-
viewed every child in the class and then rank ordered them from highest to
lowest happiness. Now each child’s happiness level could be specified by the
ranking. By specifying the rank order, the researcher has generated an ordinal
scale. If you were to write down a list of your 10 favorite foods in order of
preference, you would create an ordinal scale.
Although ordinal measurement may require more difficult processes
than nominal measurement, it also gives more informative, precise data.
Interval measurement, in turn, gives more precise results than come from
5. These categories may be scored 0 and 1, implying the simplest sort of mathematical rela-
tionship, that is, presence versus absence.
ordinal measurement, and ratio measurement gives the most precise results
of all.
Interval Scales
Interval scales tell not only the order of evaluative elements but also the inter-
vals or distances between them. For instance, on a classroom test, one student
scores 95 while another scores 85. These measurements indicate not only that
the first has performed better than the second but also that this performance
was better by 10 points. If a third student has scored 80, the second student has
outperformed the third by half as much as the first outperformed the second.
Thus, on an interval scale, a distance stated in points may be considered a rela-
tive constant at any point on the scale where it occurs.
In contrast, on the ordinal measure of happiness, the observer can identify
one child as more or less happy than another, but the data do not indicate how
much more or less happy either child is compared to the other. The difference
from any given child to the next on the rank order (ordinal) scale of happiness
does not allow statements of constant quantities of happiness.
Rating scales and tests are considered to be interval scales. One unit on a rating
scale or test is assumed to equal any other unit. Moreover, raw scores on tests can
be converted to standard scores (as described in a later section) to maintain interval
scale properties. As you will see, most behavioral measurement employs interval
scales. The scales later in the chapter in Figures 10.1, 10.2, 10.3, and 10.7 all illus-
trate interval measurement.
Ratio Scales
Ratio scales are encountered much more frequently in the physical sciences than
in the behavioral sciences. Because a ratio scale includes a true zero value, that is,
a point on the scale that represents the complete absence of the measured char-
acteristic, ratios are comparable at different points on the scale. Thus, 9 ohms
indicates three times the resistance of 3 ohms, while 6 ohms stands in the same
ratio to 2 ohms. On the other hand, because an IQ scale evaluates intelligence
according to an interval scale, someone with an IQ of 120 is more comparable to
someone with an IQ of 100 (they are 20 scale points apart) than is someone with
a 144 IQ to someone with a 120 IQ (they are 24 scale points apart). The intervals
indicate a larger difference in the second case, even though the ratios between the
two sets of scores are equal (120:100 = 144:120 = 6:5). This result occurs because
the IQ scale, as an interval scale, has no true zero point; intervals of equal size
indicate equal differences regardless of where on the scale they occur. Were the
IQ scale a ratio scale (which it is not), the two pairs of scores would be compara-
bly related, because each pair holds the ratio 6:5.
Scale Conversion
If the researcher decides to rate each child on a happiness scale (Choice 3),
and thus collects interval data, later data processing can always convert these
interval data to rank orderings (ordinal data). Alternatively, the researcher
could divide the children into the most happy half and the least happy half,
creating nominal data. Educational researchers typically convert from higher
to lower orders of measurement. They seldom convert from lower to higher
orders of measurement.
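A minimal sketch of such downward conversion follows; the happiness ratings are hypothetical.

# Minimal sketch: converting hypothetical interval data (happiness ratings)
# downward to ordinal (ranks) and nominal (median-split) form.
from statistics import median

ratings = {"Ann": 9, "Ben": 4, "Carla": 7, "Dev": 6, "Eli": 2}  # interval-level ratings

# Ordinal: rank children from highest to lowest rating (1 = happiest).
ranked = sorted(ratings, key=ratings.get, reverse=True)
ranks = {child: rank for rank, child in enumerate(ranked, start=1)}

# Nominal: split into "more happy" / "less happy" halves at the median.
cut = median(ratings.values())
groups = {child: ("more happy" if score >= cut else "less happy")
          for child, score in ratings.items()}

print(ranks)   # {'Ann': 1, 'Carla': 2, 'Dev': 3, 'Ben': 4, 'Eli': 5}
print(groups)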
To select the appropriate statistical tests, a researcher must identify the
measurement scales—nominal, ordinal, or interval—for each of a study’s vari-
ables. Chapter 12 gives a more detailed description of the process of converting
data from one scale of measurement to another under the heading “Coding and
Rostering Data.”
Percentiles
A percentile indicates the relative standing of a score within a group of scores. It is computed by counting the number of obtained scores that fall below
the score in question in a rank order, dividing this number by the total number
of obtained scores, and multiplying by 100. Consider the following 20 test
scores:
95 85 75 70
93 81 75 69
91 81 74 65
90 78 72 64
89 77 71 60

In this group, a score of 89 exceeds 15 of the 20 scores, placing it at the 75th percentile. Now consider a second group of 20 test scores:

98 93 88 80
97 92 86 80
95 91 85 79
94 91 83 77
94 89 81 75
The same score of 89 exceeds only 10 of the 20 scores in this group, plac-
ing it at the 50th percentile. This illustration shows the benefits of interpreting
scores relative to other scores.6
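A minimal sketch of the percentile computation follows, applying the counting rule described above to the two sets of scores shown.

# Minimal sketch: percentile rank by the counting rule described in the text.

def percentile_rank(score, scores):
    """Percentage of obtained scores that fall below the given score."""
    below = sum(1 for s in scores if s < score)
    return 100 * below / len(scores)

group_1 = [95, 93, 91, 90, 89, 85, 81, 81, 78, 77,
           75, 75, 74, 72, 71, 70, 69, 65, 64, 60]
group_2 = [98, 97, 95, 94, 94, 93, 92, 91, 91, 89,
           88, 86, 85, 83, 81, 80, 80, 79, 77, 75]

print(percentile_rank(89, group_1))  # 75.0 -> 75th percentile
print(percentile_rank(89, group_2))  # 50.0 -> 50th percentile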
Standard Scores
6. Technically, a score of 88.5 would fall at the 75th percentile in the first example (with 5
scores above it and 15 below it) and at the 50th percentile in the second example (with 10 scores
both above and below it). The actual score of 89 must be defined as the midpoint of a range of
scores from 88.5 to 89.5.
7. See Chapter 12 for a more complete description of these terms and their determination.
This statistical device allows a researcher to adjust scores from absolute quanti-
ties to relative reflections of the relationship between all the scores in a group.
Moreover, standard scores are interval scores, because the standard deviation
unit establishes a constant interval throughout the scale. An absolute raw score
is converted to a relative standard score by (1) subtracting the group mean on
the test from the raw score, (2) dividing the result by the standard deviation, and
(3) multiplying the result by 10 to avoid decimals and adding a constant (usually 50) to avoid minus signs. (This procedure is described by Thorndike and Hagen, 1991.)
By converting raw test scores into standard scores, a researcher can compare
scores within a group and between groups. She or he can also add the scores
from two or more tests to obtain a single composite score. Standard scores bear a fixed relationship to the normal distribution curve: raw scores falling at
the mean of the distribution are assigned the standard score of 50; scores falling 1
standard deviation above the mean are assigned the score of 50 plus 10, or 60; and
so on. Each standard deviation defines a span of 10 points on the standard scale.
This system gives scores a meaning in terms of their relationship to one another
by fitting them within the distribution described by the total group of scores.
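A minimal sketch of the conversion follows; the raw scores are hypothetical, and the population standard deviation is used for simplicity.

# Minimal sketch: converting raw scores to standard (T-type) scores,
# T = 50 + 10 * (raw score - mean) / standard deviation.
from statistics import mean, pstdev

raw_scores = [95, 85, 80, 70, 60, 75, 90, 65]  # hypothetical raw scores

m = mean(raw_scores)
sd = pstdev(raw_scores)  # population standard deviation, for simplicity

standard_scores = [round(50 + 10 * (x - m) / sd) for x in raw_scores]
print(standard_scores)  # scores at the group mean come out near 50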
Norms
more than 1,000 commercially available mental and educational tests. It has
appeared regularly over the past 30 years, and updated versions continue to be published.
An entry from the Yearbook includes the name of the test, the population
for which it is intended, the publication date, and the test’s acronym. Informa-
tion about norms, forms, prices, and scoring is also given, as well as an estimate
of the time required to complete the test and the names of the test’s author and
publisher. You can order specimen test kits for this and most other tests by
contacting the publishers at the addresses listed in the back of the Yearbook. In
addition, the compendium presents reviews of some of the tests by psychome-
tricians and cites studies using them.
Achievement batteries are sets of tests designed to measure the knowledge that
an individual has acquired in a number of discrete subject matter areas at one
or more discrete grade levels. Widespread use of such standardized batteries
facilitates comparisons of learning progress among students in different parts
of the country. Because such batteries serve an important evaluative func-
tion, many elementary and secondary schools (as well as colleges) adminis-
ter achievement batteries as built-in elements of their educational programs.
The Yearbook describes such batteries as the California, Iowa, and Stanford
achievement tests.
It also describes multi-aptitude batteries intended to measure students’
potential for learning rather than what they have already learned. Whereas
achievement tests measure acquired knowledge in specific areas (such as math-
ematics, science, and reading), aptitude tests measure potential for acquiring
knowledge in broad underlying areas (for example, verbal and quantitative
areas). Among the multi-aptitude batteries described in the Yearbook are the
Differential Aptitude Test and SRA Primary Mental Abilities.
The concept of intelligence or mental ability resembles that of aptitude,
each being a function of learning potential. General intelligence is typically
taken to mean abstract intelligence—the ability to see relations in ideas repre-
sented in symbolic form, make generalizations from them, and relate and orga-
nize them. The Yearbook includes group, individual, and specific intelligence
tests. Among the group tests, that is, those that can be administered to more
than one person at the same time, are the Cognitive Abilities Test, Otis-Lennon
Test of Mental Ability, and Short Form Test of Academic Aptitude. Among the
individually administered intelligence tests are the Peabody Picture Vocabu-
lary Test, Stanford-Binet Intelligence Scale, and Wechsler Intelligence Scale for
Children. So-called specific intelligence tests measure specific traits thought to
■ Criterion-Referenced Tests
When scores on a test are interpreted on the basis of absolute criteria rather
than relative ones, psychometricians refer to the process as criterion referenc-
ing. Rather than converting to standard scores or percentiles on the basis of a comparison group’s performance, criterion referencing interprets each score against a predetermined standard of performance.
Item Analysis
1. Compute a total score on the test for each student in the sample.
2. Divide testing subjects into two groups based on total test scores: (a) an
upper group of those who scored at or above the sample median, and (b) a
lower group of those who scored below the sample median.
3. Array the results into the format shown in Table 10.1 to display the per-
centage of students in each group, upper and lower, who chose each poten-
tial answer on an item.
4. Compute the difficulty index, or the percentage of students who correctly
answered each item.
5. Compute the discrimination index, or the difference between the percent-
age of upper group and lower group students who gave the right answer
for each item.
The results of the item analysis for five items are shown in Table 10.1. Each
item allows five answer choices (A, B, C, D, E), and the correct one is marked
with an asterisk. Separate percentages are reported for the upper half and lower
9. The index of difficulty is computed as the number of subjects who pass an item, divided by the
total number in both groups; computed in this way, it should actually be called the index of easiness.
Item 8            A    *B    C    D    E    Omit
Upper ½           9    91    0    0    0    0
Lower ½           5    95    0    0    0    0
All students      7    93    0    0    0    0
43 students responded to the item; 93 percent (40 of 43) responded correctly; -4 percentage points separate the upper and lower groups on the correct answer.

Item 4            A    B    C    *D    E    Omit
Upper ½           13   8    4    58    17   0
Lower ½           0    26   5    53    16   0
All students      7    16   5    56    16   0
43 students responded to the item; 56 percent (24 of 43) responded correctly; 5 percentage points separate the upper and lower groups on the correct answer.

Item 1            A    B    *C    D    E    Omit
Upper ½           21   0    67    8    4    0
Lower ½           21   16   42    0    21   0
All students      21   7    56    5    12   0
43 students responded to the item; 56 percent (24 of 43) responded correctly; 25 percentage points separate the upper and lower groups on the correct answer.

Item 52           *A   B    C    D    E    Omit
Upper ½           42   21   29   8    0    0
Lower ½           42   21   26   11   0    0
All students      42   21   28   9    0    0
43 students responded to the item; 42 percent (18 of 43) responded correctly; 0 percentage points separate the upper and lower groups on the correct answer.

*Indicates correct answer choice.
half of the subjects taking the test, with those designations based on their overall scores. The analysis presumes that upper-half students are the more knowledgeable ones,
indicated by their high scores on the total test. The table also reports the percentage
of students in each group who omit each item (that is, who do not answer it), as
well as the percentage of all students choosing each option. Alongside data for each
item, the table reports three summary statistics: the number of students responding
to the item, the percentage correct or difficulty index, and the difference in perfor-
mance by upper-half and lower-half students, or the discrimination index. From
this display, judgments about each item’s performance can be made.
Consider the data for Item 16 in Table 10.1. This item had a difficulty index
of 65 and a discrimination index of 32. Despite the fact that two of the distrac-
tors lacked distractibility, the item would be considered a good one. Distractor B
contributes to this quality level by distracting 42 percent of lower-half subjects.
Now look at data for Item 8. It was a very easy item; 93 percent of the sub-
jects gave the correct answer. Such easy items always lack discrimination. This
item will not contribute to the reliability of the test and should be rewritten. A
good place to start would be to attempt to make the distractors more plausible
in order to increase their distractibility.10
Item 4, the third one listed in the table, poses intermediate difficulty (56
percent gave the right answer) which is good, and its distractors all worked,
another good trait. But the item lacked discrimination (only 5 percent), so it
does not contribute to reliability. Choice A distracted upper-half subjects but
not lower-half ones, so rewriting it may improve its discrimination.
Item 1, the fourth item in the table, has the same difficulty index as Item
4 (56) but a much higher discrimination index (25), and all of the distractors
worked. This is another good item. Finally, Item 52, with a difficulty index of
42 and a discrimination index of 0, is the worst item of the five. It should be
discarded and replaced by a new item.
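A minimal sketch of the difficulty and discrimination computations follows, applying the five-step procedure described earlier to a small set of hypothetical right-wrong responses.

# Minimal sketch: difficulty and discrimination indexes for each test item.
# responses[s][i] is 1 if student s answered item i correctly, else 0 (hypothetical data).
from statistics import median

responses = [
    [1, 1, 0, 1], [1, 0, 0, 1], [1, 1, 1, 1], [0, 1, 0, 0],
    [1, 1, 0, 1], [0, 0, 0, 1], [1, 1, 1, 1], [0, 0, 0, 0],
]

totals = [sum(row) for row in responses]                       # step 1: total scores
cut = median(totals)                                           # step 2: median split
upper = [row for row, t in zip(responses, totals) if t >= cut]
lower = [row for row, t in zip(responses, totals) if t < cut]

def pct_correct(group, item):
    return 100 * sum(row[item] for row in group) / len(group)

for item in range(len(responses[0])):
    difficulty = 100 * sum(row[item] for row in responses) / len(responses)   # step 4
    discrimination = pct_correct(upper, item) - pct_correct(lower, item)      # step 5
    print(f"Item {item + 1}: difficulty {difficulty:.0f}, "
          f"discrimination {discrimination:.0f}")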
You may also gather useful data by administering any other comparable or
related test to your pilot group to see how your test relates to the other one. If
this comparison shows a relationship between results from your test and those
from another performance test, you have confirmed concurrent validity. (The
relationship is usually tested by means of a correlation, a measure of the extent
to which two sets of scores vary together. Correlation techniques are described
in Chapter 12.) If your results relate to those of an aptitude test (as an example),
this finding may contribute to construct validity. Finally, if classroom perfor-
mance records are available, you can complete additional validity tests. For a
performance test, however, the establishment of content validity (a nonstatistical
concept) and the use of item analysis are usually sufficient to ensure effective test
construction.
Attempts to establish forms of validity other than content validity are usu-
ally unnecessary for tests of performance, although efforts to establish their
reliability give useful input. Although an item analysis contributes to the estab-
lishment of internal reliability, the Kuder-Richardson formula can also serve
this purpose, as discussed earlier in the chapter.
■ Constructing a Scale
1. Likert scale
2. Semantic differential
3. Thurstone scale
Likert Scale
A Likert scale lays out five points separated by intervals assumed to be equal dis-
tances. Because analyses of data from Likert scales are usually based on sum-
mated scores over multiple items, the equal-interval assumption is a workable
one. In the Thurstone scaling procedure, on the other hand, items are scaled by
Ss and chosen to satisfy the equal-interval requirement. This procedure is con-
siderably more complex than the Likert scale approach. It is formally termed an
equal-appearing interval scale. This scale allows subjects to register the extent of
1. Tendency to delay or put off tasks (e.g., Item 3, When I have a deadline, I
wait until the last minute).
2. Tendency to avoid or circumvent the unpleasantness of some task (e.g.,
Item 31, I look for a loophole or shortcut to get through a tough task).
3. Tendency to blame others for one’s plight (e.g., Item 20, I believe that other
people don’t have the right to give me deadlines).
Following this content analysis, specific items were written for each sub-
topic. Some items were written as positive indications of procrastination (e.g.,
Item 1, I needlessly delay finishing jobs, even when they’re important) so that
agreement with them reflected a tendency toward procrastination. Some items
were written as negative indications (e.g., Item 8, I get right to work, even on
life’s unpleasant chores) so that agreement with them reflected a tendency away
from procrastination. An item phrased as a positive indication of procrastina-
tion was scored by the following key:
SA = 5, A = 4, U = 3, D = 2, SD = 1

An item phrased as a negative indication of procrastination was scored by the reverse key:

SA = 1, A = 2, U = 3, D = 4, SD = 5
The reason for writing items in both directions was to counteract the
tendency for a respondent to automatically and unthinkingly give the same
answer to all questions. By reversing the scoring of negative indications of
Tuckman Procrastination Scale. This scale has been prepared so that you can indicate
how much each statement listed below describes you. Please write the letter(s) SA
(strongly agree), A (agree), U (undecided), D (disagree), or SD (strongly disagree) on
the left of each statement indicating how much each statement describes you. Please
be as frank and honest as possible.
______  1. I needlessly delay finishing jobs, even when they’re important.
______  2. I postpone starting in on things I don’t like to do.
______  3. When I have a deadline, I wait until the last minute.
______  4. I delay making tough decisions.
______  5. I stall on initiating new activities.
______  6. I’m on time for appointments.
______  7. I keep putting off improving my work habits.
______  8. I get right to work, even on life’s unpleasant chores.
______  9. I manage to find an excuse for not doing something.
______ 10. I avoid doing those things that I expect to do poorly.
______ 11. I put the necessary time into even boring tasks, like studying.
______ 12. When I get tired of an unpleasant job, I stop.
______ 13. I believe in “keeping my nose to the grindstone.”
______ 14. When something’s not worth the trouble, I stop.
______ 15. I believe that things I do not like doing should not exist.
______ 16. I consider people who make me do unfair and difficult things to be rotten.
______ 17. When it counts, I can manage to enjoy even studying.
______ 18. I am an incurable time waster.
______ 19. I feel that it’s my absolute right to have other people treat me fairly.
______ 20. I believe that other people don’t have the right to give me deadlines.
______ 21. Studying makes me feel entirely miserable.
______ 22. I’m a time waster now but I can’t seem to do anything about it.
______ 23. When something’s too tough to tackle, I believe in postponing it.
______ 24. I promise myself I’ll do something and then drag my feet.
______ 25. Whenever I make a plan of action, I follow it.
______ 26. I wish I could find an easy way to get myself moving.
______ 27. When I have trouble with a task, it’s usually my own fault.
______ 28. Even though I hate myself if I don’t get started, it doesn’t get me going.
______ 29. I always finish important jobs with time to spare.
______ 30. When I’m done with my work, I check it over.
______ 31. I look for a loophole or shortcut to get through a tough task.
______ 32. I get stuck in neutral even though I know how important it is to get started.
______ 33. I never met a job I couldn’t “lick.”
______ 34. Putting something off until tomorrow is not the way I do it.
______ 35. I feel that work burns me out.
Note that the attitude topic or characteristic should not appear in the heading when
the scale is administered because an awareness of the topic may influence responses.
Items in bold type represent a short form of the scale.
procrastination, the scale provides a total score that reflects the degree of pro-
crastination. A person with a tendency to procrastinate would agree with the
positive indications and disagree with the negative ones, whereas a non-pro-
crastinator would respond in exactly the opposite manner.
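A minimal sketch of this scoring procedure follows; the responses and the set of reverse-keyed item numbers are hypothetical and are not the published key for the scale.

# Minimal sketch: scoring a Likert scale with reverse-keyed (negative) items.
# Positive items: SA=5 ... SD=1. Negative items are scored with the reverse key.
POSITIVE_KEY = {"SA": 5, "A": 4, "U": 3, "D": 2, "SD": 1}

def score_scale(answers, reverse_items):
    """answers maps item number -> letter response; reverse_items lists negative items."""
    total = 0
    for item, letter in answers.items():
        value = POSITIVE_KEY[letter]
        if item in reverse_items:
            value = 6 - value  # reverse the 1-5 key for negative indications
        total += value
    return total

# Hypothetical respondent and hypothetical reverse-keyed item numbers.
answers = {1: "A", 2: "SA", 6: "D", 8: "SD", 11: "D"}
reverse_items = {6, 8, 11}  # illustrative only, not the published key

print(score_scale(answers, reverse_items))  # 22 for this hypothetical respondent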
The total pool of 35 items was then administered to a pilot group of Ss.
The responses they gave to each individual item were correlated (a statistical
procedure described in Chapter 12) with the total scores they obtained on the
whole scale. This item analysis procedure provides an indication of the degree
of agreement or overlap between each individual item and the total test, that
is, the extent to which each item measures what the total test measures. By
identifying items that best agree with the overall scale, the designer achieves
the greatest possible internal consistency. This procedure identified the 16 best
items, that is, those items showing the greatest amount of agreement with the
total score. (The choice to select 16 items was based on a determination that
those items showed high agreement with the total score and would make up a
scale that could be completed in a reasonably short time.)11
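A minimal sketch of the item-selection step follows: each item’s scores across the pilot group are correlated with the total scores, and items are ordered by the size of that item-total correlation. The response matrix is hypothetical, and the correlation function from Python’s statistics module (Python 3.10 and later) stands in for the correlation procedure described in Chapter 12.

# Minimal sketch: item-total correlations for selecting the most internally
# consistent items from a pilot administration (hypothetical data).
from statistics import correlation  # Python 3.10+

# item_scores[i][s] is the keyed 1-5 score of pilot subject s on item i.
item_scores = [
    [5, 4, 2, 5, 3, 1],
    [4, 4, 1, 5, 2, 2],
    [3, 2, 3, 2, 3, 3],
    [5, 5, 2, 4, 3, 1],
]

totals = [sum(subject) for subject in zip(*item_scores)]  # total score per subject

item_total_r = [correlation(item, totals) for item in item_scores]
best_first = sorted(range(len(item_scores)), key=lambda i: item_total_r[i], reverse=True)

for i in best_first:
    print(f"Item {i + 1}: item-total r = {item_total_r[i]:.2f}")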
The same procedure used to develop the Tuckman Procrastination Scale was
used to develop the Likert scale shown in Figure 10.2, which measures students’
attitudes toward mathematics. In this case, the subtopics were (1) emotional
reaction to math, (2) competence in math, and (3) preference for math.
Semantic Differential
11. These 16 items are shown in bold type in the figure; they may replace the 35-item complete scale if time limitations require an adjustment. These 16 items all measure the same topic area, namely the tendency to delay or put off tasks, and their selection as a single, short form of the scale was verified by a statistical procedure called factor analysis.
12. These dimensions are also identified by the statistical process known as factor analysis.
Math Attitude Scale. Each of the statements below expresses a feeling toward math-
ematics. Please indicate the extent of agreement between the feeling expressed in each
statement and your own personal feeling by circling one of the letter choices next to
each statement: SA = strongly agree, A = agree, U = undecided, D = disagree, or SD=
strongly disagree.
the most positive judgment. Adjective pairs are phrased in both directions to
minimize response bias.
Thurstone Scale
The final Thurstone scale presents the selected items in order of their scale
values, and respondents are instructed to check one or more with which they
agree. The scale or point values of those items (if subjects indicate more than
one) can then be averaged to obtain individual attitude scores. An illustration
of a Thurstone scale (actually five scales) appears in Figure 10.5.
Rating Scales
Teacher does not monitor student progress.   1 2 3 4 5 6 7 8 9   Teacher uses a system for monitoring student progress.
A checklist limits an observer to describing what has or has not transpired (pres-
ence or absence of an event) rather than indicating the degree of the behaviors
in question (as a rating scale would allow).
Coding Systems
A coding system offers a means for recording the occurrence of specific, prese-
lected behaviors as they happen. Essentially, it specifies a set of categories into
which an observer classifies ongoing behaviors. Like rating procedures, coding
techniques attempt to quantify behavior. If a researcher wants to determine the
effect of class size on the number of question-asking behaviors in a class, such
a system would code question-asking behavior during a designated block of
time in large and small classes in order to establish a measure of this behavior
as a dependent variable.
Rating and coding schemes convert behavior into measures. Rating scales
are completed in retrospect and represent observers’ memories of overall activ-
ities; coding scales are completed as coders observe (or hear) the behavior. A
coding system records the frequency of specific (usually individual) acts pre-
designated for researcher attention, whereas rating scales summarize the occur-
rence of types of behavior in a more global fashion.
Researchers employ two kinds of coding systems. Sign coding establishes
a set of behavioral categories; each time an observer detects one of these prese-
lected, codeable behaviors, he or she codes the event in the appropriate category.
For example, if a coding system included “suggesting” as a codeable act,
the coder would code an event every time a subject made a suggestion.
An example of such a sign coding system for teacher behavior appears
in Figure 10.8. The scheme lists 37 behaviors. Whenever a trained observer
encounters one of these 37 behaviors, she or he records the occurrence by cat-
egory. The behavior would be coded again only when it occurred again.
The second kind of coding, time coding, involves observer identification of
all preselected behavior categories that occur during a given time period, such
as 5 seconds. An act that occurs once but continues through a number of time
periods is coded anew for each time period, rather than only once as in a sign
coding system.
Compared with rating as a means of quantifying observations, coding has
both advantages and disadvantages. On the negative side, it exposes a study to difficulties in training coders and establishing intercoder reliability; coders must complete a difficult and time-consuming process to carry out coding activities (often based on tape recordings, which introduce other difficulties); and at the completion of coding, a researcher may have little data besides a set of category tallies. On the positive side, however, data yielded from coding come closer than data from other methods to what physical scientists call “hard data.” Coding techniques may generate somewhat more objective pictures of true events than do rating-scale techniques.
Considering both sides of the issue, researchers may prefer to avoid cod-
ing in favor of rating unless well-developed coding systems are available, and
unless they can call on the resources required to hire and train coders who will
listen to lengthy tape recordings.
Behavior Sampling
Child Observed:   1   2   3   4   5   6
Listening to teacher
Listening to peer
Reading assigned material
Reading for pleasure
Writing on programmed materials
Writing in a workbook or worksheet
Writing creatively (or on a report)
Writing a test
Talking (re: work) to teacher
Talking (re: work) to peer
Talking (socially) to teacher
Talking (socially) to peer
Drawing, painting, or coloring
Constructing, experimenting, manipulating
Utilizing or attending to AV equipment
Presenting a play or report to a group
Presenting a play or report individually
Playing or taking a break
Distributing, monitoring, or in class routine
Disturbing, bothering, interrupting
Waiting, daydreaming, meditating
Total
The user checks the box that indicates what each child being observed is doing
Source: Adapted from Tuckman (1985)
Over the period of time that the observations are made (for example,
three 5-minute observations a day, every day for a week), the series of
entries recorded should show a pattern across and within classrooms. Even
though the pattern for each classroom will be unique in some ways, data
may reveal an overall trend across all of the classrooms for a given condition
or treatment.
Techniques for behavior sampling are described in more detail by Tuck-
man (1985). The relationship between behavior sampling, coding, and rating
procedures is illustrated by a chart:
Microscopic Macroscopic
Molecular Compound
Concentrated Diffuse
Exact Impressionistic
Disconnected Inseparable
■ Summary
3. Test validity refers to the extent to which a test measures what it purports
to measure. Invalidity creates instrumentation bias.
4. Researchers can establish validity in four ways: (1) predictive validity,
established by comparing test scores to an actual performance that the test
is supposed to predict; (2) concurrent validity, established by comparing
test scores to scores on another test intended to measure the same char-
acteristic; (3) construct validity, established by comparing test scores to
a behavior to which it bears some hypothesized relationship; (4) content
c. Author(s)
d. Publisher
e. Time to complete the full test
f. Age range
g. Number of scores
h. Number of forms
i. Cost for the complete kit
j. Date of publication
8. Recent Mental Measurements Yearbooks include tests of sex knowledge.
Which of these tests would you use if you were doing a high school study
of sex knowledge? Why?
9. Consider the following test scores of six people on a four-item test
( ✓ = right, X = wrong):
1 2 3 4 5
Item 1 ✓ X X ✓ X
Item 2 ✓ ✓ ✓ ✓ X
Item 3 X X ✓ ✓ ✓
Item 4 ✓ X X ✓ X
Calculate the indexes of difficulty and discrimination for each item. Which
item would you eliminate? (Do your calculations on only the two highest
and two lowest scorers on the total test; eliminate the middle two scorers.)
10. In constructing a paper-and-pencil performance test, a researcher com-
pletes the following steps. List them in their proper order.
a. Perform an item analysis.
b. Eliminate poor items.
c. Develop a content outline.
d. Collect pilot data.
e. Establish content validity.
f. Write test items.
11. To test the items on a Likert-type attitude scale, the items are administered
to a pilot group and then correlations are run between _____________.
12. The semantic differential, when used in a general way, measures the factor
of _____________.
13. Two raters evaluating the same set of behaviors obtained an interra-
ter reliability of 0.88. This can be converted to a corrected reliability of
_____________ by averaging their judgments.
■ Recommended References
Conoley, J. C., & Impara, J. C. (Eds.). (1995). The twelfth mental measurements year-
book. Lincoln, NE: Buros Institute of Mental Measurements.
Oosterhof, A. (1994). Classroom applications of educational measurement (2nd ed.).
New York, NY: Macmillan.
Oosterhof, A. (1996). Developing and using classroom assessments. Englewood Cliffs,
NJ: Prentice-Hall.
Wittrock, M. C., & Baker, E. L. (Eds.). (1991). Testing and cognition. Englewood Cliffs,
NJ: Prentice-Hall.
= CHAPTER ELEVEN
OBJECTIVES
Dr. Candace Flynn looked sadly around the near-empty auditorium. As the
head of the student government council, she was responsible for the plan-
ning of a number of extracurricular activities for undergraduate students.
This particular event, an alumni guest speaker, was poorly attended, as were
all of the events scheduled thus far.
“I just don’t get it,” Dr. Flynn commented to a colleague who had taken
the time to stop by, “I thought more students would be interested.”
“It’s tough to get them to come out,” Dr. McLaughlin replied.
“I’m obviously out of touch with what our students want to see and
do,” Dr. Flynn continued, “Perhaps I need to find out more about how they
spend their time on campus.”
“Sounds like a plan to me,” answered Dr. McLaughlin. “Let me know if
I can do anything to help.”
Survey research is a useful tool when researchers wish to solicit the beliefs
and opinions of large groups of people. Data collection involves introduc-
ing a number of related questions to a target population in order to find
out how a group feels about an issue or event. Our fictional Dr. Flynn, who is concerned about poor student attendance at her scheduled events, would benefit from creating and administering a survey, one that asks students about the types of events they would be likely to attend and enjoy.
This example illustrates a major reason why survey research is conducted:
to describe the beliefs of a population. Obviously, it would be impossible to
gather responses of every member of a population—survey research instead
asks questions of a representative subsample. Through careful selection pro-
cedures put in place to assure that the group reflects the key demograph-
ics of the overall population of interest, researchers who conduct survey
research hope to infer the beliefs of the larger group from the representative
responses of this sample. In this chapter, we will examine the key steps in
conducting survey research, including identifying procedures for creating
and administering research instruments, identifying the target population
and selecting an appropriate sample, coding and scoring data, and following
up with respondents.
Questionnaires and interviews help researchers to convert into data the infor-
mation they receive directly from people (research subjects). By providing
access to what is “inside a person’s head,” these approaches allow investigators
to measure what someone knows (knowledge or information), what someone
likes and dislikes (values and preferences), and what someone thinks (attitudes
and beliefs). Questionnaires and interviews also provide tools for discovering
what experiences have taken place in a person’s life (biography) and what is
occurring at the present. This information can be transformed into quantitative
data by using the attitude or rating scales described in the previous chapter or
by counting the number of respondents who give a particular response, which
generates frequency data.
Questionnaires and interviews provide methods of gathering data about
people by asking them rather than by observing and sampling their behav-
ior. However, the self-report approach incorporated in questionnaires and
interviews does present certain problems: (1) Respondents must cooperate to
complete a questionnaire or interview. (2) They must tell what is rather than
what they think ought to be or what they think the researcher would like to
hear. (3) They must know what they feel and think in order to report it. In
practice, these techniques measure not what people believe but what they say
they believe, not what they like but what they say they like.
In preparing questionnaires and interviews, researchers should exercise
caution. They must constantly consider:
Certain forms of questions and certain response modes are commonly used in
questionnaires and interviews. This section deals with question formats and the
following section addresses response modes.
The difference between direct and indirect questions lies in how obviously the
questions solicit specific information. A direct question, for instance, might
ask someone whether or not she likes her job. An indirect question might ask
what she thinks of her job or selected aspects of it, supporting the research-
er’s attempt to build inferences from patterns of responses. By asking ques-
tions without obvious purposes, the indirect approach is the more likely of
the two to engender frank and open responses. It may take a greater number
of questions to collect information relevant to a single point, though. (Specific
administrative procedures may help a researcher to engender frank responses
to direct questions, as described later in the chapter.)
• Do you think that the school day should be lengthened? YES ____ NO
versus
These two formats are indistinguishable in their potential for eliciting honest
responses. Usually, researchers choose between them on the basis of response
mode, as discussed in the next section.
Unstructured Responses
Fill-In Response
Note that the unstructured response mode differs from the structured, fill-
in mode in degree. The fill-in mode restricts respondents to a single word or
phrase, usually in a request to report factual information (although the third
example elicits a response beyond facts). The very wording of such a question
restricts the number of possible responses the respondent can make and the
number of words that can be used.
Tabular Response
The tabular response mode resembles the fill-in mode, although it imposes
somewhat more structure because respondents must fit their responses into a
table. Here is an example (Figure 11.1):
Response columns: Would stop me | Might stop me from making change | Would be a serious consideration but wouldn’t stop me | Wouldn’t matter at all

Endanger your health
Leave your family for some time
Move around the country a lot
Leave your community
Leave your friends
Give up leisure time
Keep quiet about political views
Learn a new routine
Work harder than you are now
Take on more responsibility
II. Looking at your present situation, what do you expect to be doing 5 years from
now?
____________________________________________________________________
Scaled Response
[Tabular form with rows for Next and Previous job, and columns for Job Title, Specify Type of Work Performed, Name of Employer, Annual Salary, and Dates (From / To)]
The scale is used primarily to assess whether a high school student has
engaged in behaviors intended to learn about careers.
All scaled responses measure degree or frequency of agreement or occur-
rence (although a variety of response words may indicate these quantities).
They all assume that a response on a scale is a quantitative measure of judgment
or feeling. (Recall that Chapter 10 discussed priorities for constructing such
a scale.) Unlike an unstructured response, which requires coding to generate
useful data, a structured, scaled response collects data directly in a usable and
analyzable form. Moreover, in some research situations, scaled responses can
yield interval data.1
For example, the difference in frequency between N and S on the Career
Awareness Scale would be considered equivalent to the differences between
S and O and between O and A. Provided other requirements are met, such
interval data can be analyzed using powerful parametric statistical tests. (These
statistical procedures are described in Chapter 12.)
Ranking Response
1. See the early part of Chapter 10 for a discussion of the types of measurement scales.
Instructions: All of the questions below are about what you actually do. If you “Always”
do what the statement says, circle the 1 for A. If you “Often” do what the statement
says, circle the 2 for O. If you “Seldom” do what the statement says, circle the 3 for S. If
you “Never” do what the statement says, circle the 4 for N.
There are no right or wrong answers for these questions. We are interested only in what
you actually do.
with 5 indicating the most useful activity and 1 indicating the least useful
one. If any activity gave you no help at all, indicate this by a 0.)
___ Initial presentation by consultants
___ Initial small-group activity
___ Weekly faculty sessions
___ Mailed instructions and examples of behavioral objectives
___ Individual sessions with consultant
Checklist Response
• The kind of job that I would most prefer would be (check one):
___ (1) A job where I am almost always certain of my ability to perform
well.
___ (2) A job where I am usually pressed to the limit of my abilities.
• I get most of my professional and intellectual stimulation from (check one
of the following blanks):
___ A. Teachers in the system
___ B. Principal
___ C. Superintendent
___ D. Other professional personnel in the system
___ E. Other professional personnel elsewhere
___ F. Periodicals, books, and other publications
responses. At the same time, those responses yield less information for the
researcher. Nominal data are usually analyzed by means of the chi-square sta-
tistical analysis (described in Chapter 12).
Categorical Response
The categorical response mode, similar to the checklist but simpler, offers a
respondent only two possibilities for each item. (In practice, checklist items
also usually offer only two responses: check or no check on each of a series
of choices, but they may offer more possibilities.) However, the checklist
evokes more complex responses, since the choices cannot be considered inde-
pendently, as can categorical responses. Also, once a respondent checks one choice, the remaining choices in the list are foreclosed.
A yes-no dichotomy is often used in the categorical response mode:
Analysis can render true-false data into interval form by using the number
of true responses (or the number of responses indicating a favorable attitude)
as the respondent’s score. The cumulative number of true responses by an indi-
vidual S on a questionnaire then becomes an indication of the degree (or fre-
quency) of agreement by that S—an interval measure. Counting the number of
Ss who indicate agreement on a single item provides a nominal measure. (See
the section on coding and scoring at the end of this chapter to see how to score
this and the other types of response modes.)
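A minimal sketch of this scoring logic follows, using hypothetical true-false responses: counting an individual’s true (favorable) responses yields an interval-like score, while counting how many respondents agree with a single item yields nominal frequency data.

# Minimal sketch: scoring categorical (true-false) questionnaire responses.
# responses[respondent] is a list of True/False answers (hypothetical data).
responses = {
    "R1": [True, True, False, True, True],
    "R2": [False, True, False, False, True],
    "R3": [True, False, True, True, False],
}

# Interval-like score per respondent: number of "true" (favorable) answers.
scores = {person: sum(answers) for person, answers in responses.items()}
print(scores)  # {'R1': 4, 'R2': 2, 'R3': 3}

# Nominal frequency per item: how many respondents answered "true."
item_counts = [sum(answers[i] for answers in responses.values())
               for i in range(len(next(iter(responses.values()))))]
print(item_counts)  # [2, 2, 1, 2, 2]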
all the variables you are studying. One study might attempt to relate source
of occupational training (that is, high school, junior college, or on-the-job
instruction) to degree of geographic mobility; it would have to measure where
respondents were trained for their jobs and the places where they have lived.
A study might compare 8th graders and 12th graders to determine how favor-
ably they perceive the high school climate; it would have to ask respondents
to indicate their grade levels (8th or 12th) and to react to statements about the
high school climate in a way that indicates whether they see it as favorable or
not. A study concerned with the relative incomes of academic and vocational
high school graduates 5 years after graduation would have to ask respondents
to indicate whether they focused on academic or vocational subjects in high
school and how much money they were presently earning.
Thus, the first step in constructing questionnaire or interview questions is
to specify your variables by name. Your variables designate what you are trying
to measure. They tell you where to begin.
The first decision you must make about question format is whether to pres-
ent items in a written questionnaire or an oral interview. Because it is a more
convenient and economical choice, the questionnaire is more commonly used,
although it does limit the kinds of questions that can be asked and the kinds
of answers that can be obtained. A questionnaire may present difficulties in
obtaining personally sensitive and revealing information. Also, it may not
yield useful answers to indirect, nonspecific questions. Further, preparation of a questionnaire must detail all questions in advance. Despite the possibility
of including some limited response-keyed questions, you must ask all respon-
dents the same questions. Interviews offer the best possibilities for gathering
meaningful data from response-keyed questions.
Table 11.1 summarizes the relative merits of interviews and questionnaires.
Ordinarily, a researcher opts for the additional cost and unreliability of inter-
viewing only when the study addresses sensitive subjects and/or when person-
alized questioning is desired. (Interviews are subject to unreliability, because
the researcher must depend on interviewers to elicit and record the responses
and often to code them, as well.) In general, when a researcher chooses to use
the unstructured response mode, interviewing tends to be the better choice
because people find it easier to talk than write; consequently, interviews gener-
ate more information of this type.
The choice of question format depends on whether you are attempting to
measure facts, attitudes, preferences, and so on. In constructing a question-
naire, use direct, specific, clearly worded questions, and keep response keying to a minimum.
No specific rules govern selection of response modes. In some cases, the kind
of information you seek will determine the most suitable response mode,
but often you must choose between equally acceptable forms. You can, for
instance, provide respondents with a blank space and ask them to fill in their
ages, or you can present a series of age groupings (for example, 20–29, 30–39,
and so on) and ask them to check the one that fits them.
The choice of response mode should be based on the manner in which
the data will be treated; unfortunately, however, researchers do not always
make this decision before collecting data. It is recommended that data analy-
sis decisions be made in conjunction with the selection of response modes.
In this way, the researcher (1) gains assurance that the data will serve the
intended purposes and (2) can begin to construct data rosters and to prepare
for the analyses. (See Chapter 12.) If analytical procedures will group age data
into ranges to provide nominal data for a chi-square statistical analysis, the
researcher would want to design the appropriate questionnaire item to collect
these data in grouped form.
In weighing these options, consider the following factors:
1. Type of data desired for analysis. If you seek interval data to allow
some type of statistical analysis, scaled and checklist responses are the
best choices. (Checklist items must be coded to yield interval data, and
responses must be pooled across items. An individual checklist item yields
only nominal data.) Ranking provides ordinal data, and fill-in and some
checklist responses provide nominal data.
2. Response flexibility. Fill-ins allow respondents the widest range of choice; yes-no
and true-false items, the least.
3. Time to complete. Ranking procedures generally take the most time to
complete, although scaled items may impose equally tedious burdens on
respondents.
4. Potential response bias. Scaled responses and checklist responses offer the
greatest potential for bias. Respondents may be biased not only by social
desirability considerations but also by a variety of other factors, such as the
tendencies to overuse the true or yes answer and to select one point on the
scale as the standard response to every item. Other respondents may avoid
the extremes of a rating scale, thus shrinking its range. These troublesome
tendencies on the part of respondents are strongest on long questionnaires,
which provoke fatigue and annoyance. Ranking and fill-in responses are
less susceptible than other choices to such difficulties. In particular, rank-
ing forces respondents to discriminate between response alternatives.
5. Ease of scoring. Fill-in responses usually must be coded, making them
considerably more difficult than other response types to score. The other
types of responses discussed in this chapter are approximately equally easy
to score.
As pointed out earlier, the first step in preparing items for an interview sched-
ule is to specify the variables that you want to measure and then construct
questions that focus on these variables. If, for example, one variable in a study
is openness of school climate, an obvious question might ask classroom teach-
ers, “How open is the climate here?” Less direct but perhaps more concrete
questions might ask, “Do you feel free to take your problems to the princi-
pal? Do you feel free to adopt new classroom practices and materials?” Note
that the questions are based on the operational definition of the variable open-
ness, which has been operationally defined as freedom to change, freedom to
approach superiors, and so on. In writing questions, make sure they incorpo-
rate the properties set forth in the operational definitions of your variables.
(Recall from Chapter 6 that these properties may be either dynamic or static,
depending on which type of operational definition you employ.)
A single interview schedule or questionnaire may well employ more than
one question format accommodating more than one response mode. The
sample interview schedule in Figure 11.3 seeks to measure the attitudes of the
general public toward some current issues in public education such as cost,
quality, curriculum emphasis, and standards. The interview schedule is highly structured.
FIGURE 11.3. Sample interview schedule measuring public attitudes toward current issues in public education.
The procedures for preparing questionnaire items parallel those for preparing
interview schedule items. Again, maintain the critical relationship between the
items and the study’s operationally defined variables. Constantly ask about
your items: Is this what I want to measure? Three sample questionnaires appear
in Figures 11.4, 11.5, and 11.6.
The questionnaire in Figure 11.4 was used in a follow-up study of com-
munity college graduates and high school graduates who did not attend col-
lege. The researcher was interested in determining whether the community
college graduates subsequently obtained higher socioeconomic status (that is,
earnings and job status) and job satisfaction than a matched group of people
who did not attend college. The items on the questionnaire were designed to
determine (1) earnings, job title, and job satisfaction (the dependent variables
[Items 1–7]); (2) subsequent educational experiences, in order to eliminate or
reclassify subjects pursuing additional education (a control variable) as well
as to verify the educational status distinction of 2-year college students versus
those who completed high school only (the independent variable [Items 8–15,
23]); (3) background characteristics, in order to match samples (Items 16–20,
24, 25); and (4) health, in order to eliminate those whose job success chances
were impaired (Items 21, 22).
The researcher intended for all respondents to complete all of the items
except Item 7, which was response keyed to the preceding item. (Items 12 to
15 also have response-keyed parts.) The result is a reasonably simple, easy-to-
complete instrument.
The sample questionnaire in Figure 11.5 employs scaled responses in an
attempt to measure students’ attitudes toward school achievement based on
the value they place on going to school and on their own achievement. This
questionnaire actually measures the following six topics related to a student’s
perceived importance or value of school achievement:
Note that for each of the 19 items, the questionnaire provides a 4-point
scale for responses employing the statement format. (This sample resem-
bles the standard Likert scale shown in Chapter 10, except that it omits the
middle or “undecided” response.) Note further that some of the items have
been reversed (Items 2, 5, 6, 9, 13, 14). These questions have been written
so that disagreement or strong disagreement indicates an attitude favoring
the importance of school achievement; on all the other items, agreement or
strong agreement indicates such an attitude. Agreement with Item 10, for
example, indicates that the respondent takes pride in school progress and
performance, a reflection of a positive attitude toward school achievement.
Disagreement with Item 9 indicates that the respondent does not feel that
Instructions: All questions are statements to which we seek your agreement or dis-
agreement. If you “Strongly Agree” with any statement, circle the 1. If you “Agree,” but
not strongly, with any statement, circle the 2. If you “Disagree,” but not strongly, circle
the 3. If you “Strongly Disagree” with any statement, circle the 4.
There are no right or wrong answers for these questions. We are interested only in
how you feel about the statements.
Most studies benefit substantially from the precaution of running pilot tests
on their questionnaires, leading to revisions based on the results of the tests. A
pilot test administers a questionnaire to a group of respondents who are part
of the intended test population but who will not be part of the sample. In this
way the researcher attempts to determine whether questionnaire items achieve
the desired qualities of measurement and discrimination.
If a series of items is intended to measure the same variable (as the eight
items in Figure 11.6 are), an evaluation should determine whether these
items are measuring something in common. Such an analysis would require correlating each item with the total score for the set.
Based on these data, the researcher should decide to eliminate Items 3 and 5, which fall below .50, and to place the other eight items in the final scale, con-
fident that the remaining items measure something in common.
Item analysis of questions intended to measure the same variable in
the same way is one important use of the data collected from a pilot test.
However, item analyses are not as critical for refining questionnaires as
they are for refining tests. Responses to questionnaire items are usually
reviewed by eye for clarity and distribution without necessarily running an
item analysis.
A pilot test can uncover a variety of failings in a questionnaire. For
example, if all respondents reply identically to any one item, that item prob-
ably lacks discrimination. If you receive a preponderance of inappropriate
responses to an item, examine it for ambiguity or otherwise poor wording.
Poor instructions and other administration problems become apparent on a
pilot test, as do areas of extreme sensitivity. If respondents refuse to answer
certain items, try to desensitize them by rewording. Thus, pilot tests enable
researchers to debug their questionnaires by diagnosing and correcting these
failings.
■ Sampling Procedures
Random Sampling
The population (or target group) for a questionnaire or interview study is the
group about which the researcher wants to gain information and draw conclu-
sions. A researcher interested in the educational aspirations of teachers, for
example, would focus on teachers as the population of the study. The term
defining the population refers to a process of establishing boundary conditions
that specify who shall be included in or excluded from the population. In the
example study, the population could be defined as elementary school teachers,
or public school teachers, or all teachers, or some other choice.
Specifying the group that will constitute a study’s population is an early
step in the sampling process, and it affects the nature of the conclusions that
may be drawn from a study. A broadly defined population (like “all teachers”)
maximizes external validity or generality, although such a broad definition may
create difficulties in obtaining a representative sample, and it may require a
large sample size. Conversely, defining the population narrowly (for example,
the population into and out of the study and for reducing the variability of the
sample.
The first step in stratified sampling is to identify the stratification param-
eters, or variables. Each stratification parameter represents a control variable,
that is, a potential source of error or extraneous influence that may provide an
alternative explanation for a study’s outcome. Assume that you want to con-
trast the teaching techniques of male and female elementary school teachers.
The study would restrict the population to elementary school teachers, because
that is a specified control variable, and it would sample across male and female
teachers, because gender is the independent variable. You are concerned, how-
ever, that teaching experience may be an extraneous influence on your results.
To offset this potential source of error, first you would determine the distribu-
tion of years of experience for male and for female elementary school teachers;
then you would select the sample in proportion to these distributions. (The
selection of specific subjects within each stratum or proportion would be done
randomly.) The other control variables would be treated in a similar way.
Consider sampling procedures for national political polls. Results are usu-
ally reported separately for different age groups and for different sections of
the country. The studies treat age and geography as moderator variables and
define separate samples according to them. However, within each age and geo-
graphical group, such a study may control for gender, race, religion, socioeco-
nomic status, and specific location by proportional stratification. If half of the
young people in the northeastern United States are male, then males should
constitute half of the sample of northeastern young people. If 65 percent of the
southeastern middle-aged group is poor, then poor people should make up 65
percent of the sample of this group. (Of course, terms like middle-aged and
poor must be operationally defined.) The pollsters then consider these sub-
population differences in evaluating the outcomes of their studies.
Consider the example on sampling 300 presidents of 2-year colleges. Even with random selection, some bias may still affect the results; private colleges, for example, might by chance be overrepresented. To control for this factor, use it as a variable or
parameter for stratified sampling. Suppose one-quarter of the 2-year colleges
are private schools and three-quarters are public institutions. In proportional
stratified sampling, you would embody these percentages in your sample. In a
sample of 300 college presidents, you would want 75 from private, 2-year col-
leges and 225 from public ones (the specific individuals in each stratum being
randomly chosen). These specifications ensure creation of a sample systemati-
cally representative of the population.
To accomplish this stratified sampling method, you would make two sepa-
rate alphabetical lists, one of private colleges, the other of public schools. You
would then use your table of random numbers to select 75 private and 225
public colleges from the two lists, respectively. Of course, you could go further
and control also for factors such as urban versus rural setting or large ver-
sus small colleges. However, in considering stratification, remember that each
additional control variable complicates the sampling procedure and reduces
the population per category from which each part of the sample is drawn. The
sampling plan for this study is shown in Figure 11.7 (in that figure, large = more than 2,000 students; small = fewer than 2,000 students).
Random choice is the key to overcoming selection bias in sampling; strati-
fication adds precision in ensuring that the sample contains the same propor-
tional distribution of respondents on selected parameters as the population.
Where stratified sampling is used, within each stratum, researchers must choose
sample respondents by random methods to increase the likelihood of eliminat-
ing sources of invalidity due to selection other than those controlled through
stratification. The combination of stratification and random selection increases
the likelihood that the sample will be representative of the population. Because
it controls for selection invalidity based on preselected variables in a system-
atic way, stratification is recommended for use with the variables identified as
representing greatest potential sources of selection bias. For information about
determining sample size, see Chapter 12.
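Where the sampling frame is available in machine-readable form, the proportional selection described above can be sketched in Python; the college lists here are hypothetical placeholders for the two alphabetical lists, and the 75/225 split follows the example.

import random

# Hypothetical sampling frames standing in for the alphabetical lists of colleges.
private_colleges = ["Private College %d" % i for i in range(1, 301)]
public_colleges = ["Public College %d" % i for i in range(1, 901)]

# Proportional allocation: one-quarter private, three-quarters public, 300 in all.
sample = random.sample(private_colleges, 75) + random.sample(public_colleges, 225)
print(len(sample))  # 300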
Initial Mailing
Follow-Ups
Sampling Nonrespondents
If fewer than about 80 percent of people who receive the questionnaire com-
plete and return it, the researcher must try to reach a portion of the nonre-
spondents and obtain some data from them. Additional returns of all or critical portions of the questionnaire from 5 or 10 percent of the original nonrespondents are required for this purpose.
This additional procedure is necessary to establish that those who have
not responded are not systematically different from those who have. Failure
to check for potential bias based on nonresponse may introduce both external
and internal invalidity based on experimental mortality (selective, nonrandom
loss of subjects from a random sample) as well as a potential increase in sam-
pling error.
Obtaining data from nonrespondents is not easy, since they have already
ignored two or three attempts to include them in the study. The first step is
to select at random 5 to 10 percent of these people from your list of nonre-
spondents, using the table of random numbers (Appendix A). Using their code
numbers, go through the table of random numbers and pick those whose num-
bers appear first, then write or call them. About a 75–80 percent return from
the nonrespondents’ sample may be all that can be reasonably expected, but
every effort should be made to achieve this goal.
Conducting an Interview
Assuming that the interviewers are both highly trained and experienced, study directors should give them the
names, addresses, and phone numbers of the people to be interviewed, along
with a deadline for completion. The interviewer may then choose the inter-
viewing order, or the researcher may recommend an order.
Typically, an interviewer proceeds by telephoning a potential respondent
and, essentially, presenting a verbal cover letter. However, a phone conversa-
tion gives the interviewer the advantage of opportunities to alter or expand
upon instructions and background information in reaction to specific concerns
raised by potential respondents. During this first conversation, an interview
appointment should also be made.
At the scheduled meeting, the interviewer should once again brief the
respondent about the nature or purpose of the interview (being as candid as
possible without biasing responses) and attempt to make the respondent feel at ease. This session should begin with an explanation of the manner of recording
responses; if the interviewer will record the session, the respondent’s assent
should be obtained. At all times, interviewers must remember that they are
data collection instruments who must try to prevent their own biases, opin-
ions, or curiosity from affecting their behavior. Interviewers must not devi-
ate from their formats and interview schedules, although many schedules will
permit some flexibility in choice of questions. The respondents should be kept
from rambling, but not at the sacrifice of courtesy.
Many questions, such as those presented in the form of rating scales or check-
lists, are precoded; that is, each response can be immediately and directly con-
verted into an objective score. The researcher simply has to assign a score to
each point on the list or scale. However, data obtained from interviews and
questionnaires (often called protocols) may not contribute to the research in the
exact form in which they are collected. Often further processing must convert
them to different forms for analysis. This initial processing of information is
called scoring or coding.
Consider Item 13 from the Career Awareness Scale, the sample question-
naire that appears in Figure 11.2:
You might assign never (N) a score of 1, seldom (S) a score of 2, often (O) a
score of 3, and always (A) a score of 4. You could then add the scores on all the
items to obtain a total score on the scale.
If you score strongly agree for the first item as 4, then you have to score the
strongly agree response for the second item as 1, because strong agreement with
the first item indicates that a respondent likes school whereas strong agreement
with the second item indicates a dislike for school. To produce scores on these
two items that you can sum to get a measure of how much a student likes
school, you have to score them in opposite directions.
Often a questionnaire or overall scale contains a number of subscales, each
of which measures a different aspect of what the total scale measures. In ana-
lyzing subscale scores, a scoring key provides extremely helpful guidance. Typ-
ically such a scoring key is a cardboard sheet or overlay with holes punched
so that when it is placed over an answer sheet, it reveals only the responses to
the items on a single subscale. One scoring key would be required for each
subscale. Using answer sheets that can be read by optical scanners and scored
by computers makes this process much easier.
Thus, in scoring objective items, such as rating scales and checklists, the
first step is identification of the direction of items—separating reversed and
non-reversed ones. The second step is assigning a numerical score to each point
on the scale or list. Finally, subscale items should be grouped and scored.
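These steps can be sketched in a few lines of Python; the responses and the set of reversed items below are hypothetical, and the reversal rule (subtracting the response from one more than the top scale value) simply restates the logic above.

# One respondent's raw answers on a 1-to-4 scale, keyed by item number (hypothetical data).
raw_responses = {1: 1, 2: 4, 3: 2, 4: 1, 5: 3, 6: 4}

# Items assumed to be phrased in the reverse direction.
reversed_items = {2, 5, 6}

def item_score(item, response, scale_max=4):
    # Flip reversed items so every item is scored in the same direction:
    # on a 1-to-4 scale the flipped score is (5 - response).
    return (scale_max + 1 - response) if item in reversed_items else response

total_score = sum(item_score(i, r) for i, r in raw_responses.items())
print(total_score)  # 1 + 1 + 2 + 1 + 2 + 1 = 8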
By their very nature, ranking items carry associated scores, that is, the rank
for each item in the list. To determine the average across respondents for any
particular item in the list, you can sum the ranks and divide by the number of
respondents. All ranking items can be scored in this way. This set of averages
can then be compared to that obtained from another group of respondents
using the Spearman rank-order correlation procedure (described in the next
chapter).
Some scales, such as those using the true-false and yes-no formats, lend
themselves primarily to counting as a scoring procedure. Simply count the
number of “true” or “yes” responses. However, you must still pay attention
to reversed items. A “false” answer on a reversed item must be counted along
with a “true” response on a non-reversed item. On a positively phrased item,
for example, a “yes” would get a score of 1, and a “no” would get a score of 0.
1. Scale scoring. Where the item represents a scale, each point on the scale is
assigned a score. After adjusting for reversal in phrasing, you can add a
respondent’s scores on the items within a total scale (or subscale) to get his
or her overall score.
2. Rank scoring. A respondent assigns a rank to each item in a list. Here, typi-
cally, average ranks across all respondents are calculated for each item in
the list.
3. Response counting. Where categorical or nominal responses are obtained
on a scale (such as true-false), a scorer simply counts the number of agree-
ing responses by a respondent. This count becomes the total score on the
scale for that respondent. Response counting works for a scale made up of
more than one item, all presumably measuring the same thing.
4. Respondent counting. Where a questionnaire elicits categorical or nominal
responses on single items, scoring can count the number of respondents
who give a particular response to that item. By properly setting up the
answer sheet in advance, mechanical procedures can complete respondent
counts. Respondent counting enables a researcher to generate a contingency
table (a four-cell table that displays the number of respondents simulta-
neously marking each of the two possible choices on two items) and to
employ chi-square analysis (described in the next chapter). An example of
a contingency table is provided in Figure 11.10.
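Respondent counting lends itself to a short sketch as well; the yes-no answers to two items below are hypothetical and serve only to show how the four cells of a contingency table are tallied.

from collections import Counter

# Each tuple holds one respondent's answers to two yes-no items (hypothetical data).
responses = [
    ("yes", "no"), ("yes", "yes"), ("no", "no"),
    ("yes", "yes"), ("no", "yes"), ("no", "no"),
]

# Count the respondents falling into each of the four cells.
cells = Counter(responses)
for answer_1 in ("yes", "no"):
    row = [cells[(answer_1, answer_2)] for answer_2 in ("yes", "no")]
    print(answer_1, row)
# yes [2, 1]
# no [1, 2]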
Although a scorer can apply any one of the four techniques described above
to process fill-in and free-response items, the most common is respondent
counting. However, before counting respondents, the scorer must code their
responses. Coding is a procedure for reducing data to a form that allows tabu-
lation of response similarities and differences.
Suppose, for example, that an interviewer asks: Why did you leave school?
Suppose, also, that the following potential responses to this question have been
identified by the researcher:
After listening to the teacher’s free response to this question, the inter-
viewer could summarize his or her opinion by placing a check on the rating
scale. This method is an example of a scale-scoring approach to coding and
scoring an open-ended response. An examination of the ratings indicated by
the responses of the two groups of teachers would provide data to determine
whether tenured or nontenured teachers are more interested in teaching effec-
tiveness. Alternatively, the response could be precoded as simply: ______
seems interested, ______ seems disinterested. This application represents the
respondent-counting approach: Simply count the number of teachers in each
group (tenured and nontenured) who were seen as interested as well as those
seen as disinterested, and place the findings into a contingency table:
                        interested in              disinterested in
                        teaching effectiveness     teaching effectiveness
tenured teachers        ______                     ______
nontenured teachers     ______                     ______
■ Summary
1. Which of the following statements does not describe a purpose for which
researchers use interviews and questionnaires?
a. Finding out what a person thinks and believes
b. Finding out what a person likes and dislikes
c. Finding out how a person behaves
d. Finding out what experiences a person has had
2. Which of the following limitations is not a shortcoming of a questionnaire
or interview?
a. The respondent may not know anything about the interviewer.
b. The respondent may not know the information requested.
c. The respondent may try to show himself or herself in a good light.
d. The respondent may try to help by telling you what you expect to
hear.
3. Match up the question types with the descriptions.
a. Indirect question
b. Specific question
c. Question of opinion
d. Statement
e. Response-keyed question
1. Declarative sentence form
2. Requests reaction to a single object
3. Next question depends on the response to this one
4. Requests information for inferences
5. Asks how the respondent feels about something
4. Match up the response types with the examples.
a. Scaled response
b. Fill-in response
c. Ranking response
d. Tabular response
e. Checklist response
f. Unstructured response
g. Categorical response
1. My favorite subject is (check one): _____ English _____ Chemistry _____ Calculus
2. My favorite subject is calculus. (yes, no)
3. How do you feel about chemistry?
4. English is a subject I (like a lot, like a little, dislike a little, dislike a lot)
5. My favorite subject is _________.
6. (Tabular item: a grid with Like and Dislike response headings.)
11. Which of the following subjects is not ordinarily discussed in a cover letter?
a. Protection afforded the respondent
b. Anticipated outcome of the study
c. Legitimacy of the researcher
d. Purpose of the study
12. You are planning to do a study of the relationship between a teacher’s
length of teaching experience and his or her attitudes toward discipline of
students. You are sending out a questionnaire including an attitude scale
and a biographical information sheet. Construct a sample cover letter to
accompany this mailing.
■ Recommended References
Berdie, D. R., Anderson, J. F., & Niebuhr, M. A. (1986). Questionnaires: Design and use
(2nd ed.). Metuchen, NJ: Scarecrow Press.
Fowler, F. J. (1993). Survey research methods (2nd ed.). Beverly Hills, CA: Sage.
Lavrakas, P. J. (1987). Telephone survey methods: Sampling, selection, and supervision.
Newbury Park, CA: Sage.
PART 4
CONCLUDING STEPS OF RESEARCH
=
= CHAPTER TWELVE
OBJECTIVES
This chapter will respond to Dr. Richards’s problem, namely
the systematic analysis of data. In doing so, we will review the basic
principles that allow us to draw conclusions concerning the issues of
interest in research studies. Research studies consider many different types of
environmental observations; to accurately reflect the goals of our investiga-
tions, it is important to select the proper techniques to interpret these data.
We begin this chapter with a review of measures of central tendency and vari-
ability, then turn to the coding and rostering of data, and close by describ-
ing different statistical tests that may be used to address different types of
research questions.
Mode
Within a particular data set, you may find clusters of observations. For exam-
ple, a high school teacher who collected mathematical efficacy ratings for her
first-period class prior to the marking period’s first big exam gathered the fol-
lowing data:
75, 82, 45, 75, 69, 75, 90, 80, 75, 70, 75, 89, 83, 75, 77
It is useful for this teacher to know which score occurred most frequently
because this value indicates the efficacy level of the largest number of students.
This information suggests a trend within a distribution, namely that a group of
students feel the same way. This measure of central tendency is called the mode.
It is easy to calculate: simply determine which value (or values) appears most
often within a distribution of scores. In this example, the mode or modal value
is 75, as six students report this level of efficacy, far more than any other single
report. It is important to note that though these data provide only one modal
value, it is possible for a distribution of scores to have two modal values; in this
instance, we refer to the data as bimodal.
While the modal value does reveal a vital characteristic of a distribution and is very easy to calculate, other measures of central tendency are needed to provide additional analysis of the data. We will next introduce the most important of these measures, the mean.
Mean
The mean is computed by adding a list of scores and then dividing by the number of scores. It is more commonly known as the average of a set of observations and is often denoted as Mx, signifying the mean of a set of scores represented as X. Its algebraic formula is:

Mx = ∑X / N

where Mx is the mean, ∑X is the sum of the Xs, or individual scores, and N is the number of scores.
Consider an example. Fifteen students took a mathematics exam, earning
the following scores:
98 89 78
97 89 73
95 84 70
93 82 60
90 82 50
To determine the mean score on this math test, add the 15 scores (that is, ∑X =
1,230), then divide that sum by N (15) to give 82.0.
The mean reveals more about a particular distribution of observations than
the modal value because it considers every data point in a set, while the mode
considers only the most common value. Because of this, however, the mean is
subject to the influence of outliers, or scores that are far below or far above a
central point. Let us consider a further example of a set of test scores from our
fictional mathematics class:
98 89 30
97 89 73
95 84 70
12 82 60
90 82 50
Using our formula for the mean, we calculate an average of 73.4 (1,101 divided by 15). This statistic should be interpreted with caution, however; clearly, the average was dramatically influenced by two exceptionally low scores (“12” and “30”). A teacher who relies solely on the mean to con-
clude that her students did not do well on this exam would be in error. There is
a need for an additional central tendency measure that will not be as sensitive
to outlying scores.
Median
The median is the score in the middle of a distribution: 50 percent of the scores
fall above it, and 50 percent fall below it. In the table of 15 scores, the median
score is 84. Seven scores are higher and seven are lower than that one. In a list
containing an even number of scores, the middle two scores would be averaged
to get the median.
The median is not as sensitive to extreme scores as is the mean. In fact, this is the value of the median as a measure of central tendency;
unlikely scores that are unusually high or unusually low in a distribution do
not influence the median to the same extent that they do the mean. The mean
defines the “middle” of a set of scores considered in terms of their values,
whereas the median defines the “middle” of the distribution in terms of the
number of scores.
In our second set of scores, we identify a median of 82. Note that the mean
of the 15 scores is lower than the median, because two or three extremely low
scores reduce the total. In this case, the median score is a more accurate rep-
resentation of the performance of the class as a whole than is the mean, which
was artificially lowered by a couple of lower scores.
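All three measures are easy to verify with Python's standard statistics module; the sketch below uses the second set of exam scores discussed above.

import statistics

# The second set of exam scores (the one containing the two low outliers).
scores = [98, 89, 30, 97, 89, 73, 95, 84, 70, 12, 82, 60, 90, 82, 50]

print(statistics.mean(scores))       # 73.4
print(statistics.median(scores))     # 82
print(statistics.multimode(scores))  # [89, 82] -- this particular set is bimodal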
Researchers should consider the mean, median, and mode in tandem, as the
combination of these statistics reveals vital truths about the distribution of
observations as a whole. In doing so, it will become apparent that the distribu-
tion can be described in one of three ways. In a symmetrical distribution, the
mean, median, and mode are values that occur very close together. In this case,
a researcher should use the mean value to draw inferences about the distribu-
tion of scores. In a positively skewed distribution, the mode is the lowest of
the three statistics and the mean is the highest. This happens when a few high
outliers are within a data set. A practical example may be a set of scores on an examination where most students do poorly but one or two students earn very high scores.
X X–M (X – M)2
9 -3 9
11 -1 1
12 0 0
13 1 1
15 3 9
Based upon a mean of 12, we can now calculate the sum of squared deviations for this data set, which, using the steps above, is 20. Plugging this number into the variance formula, s² = ∑(X – M)² / (N – 1), we find that the variance for this data set is 5, that is: 20 / (5 – 1) = 5.
The variance indicates how spread out a distribution may be. We can see
the usefulness of this measure quite clearly when we compare our two recent
distributions:
At first glance, it would seem that sample (a) is distributed more “widely” than
is sample (b), where scores seem to cluster around the top end of the range.
Though the range statistic for each sample is the same, calculation of the vari-
ance confirms that sample (a) does have a wider distribution, with a variance
statistic of 6, while sample (b) produces a variance statistic of 5.4.
Many times it is helpful also to know where individual scores may fall
within a distribution. This can be discovered by calculating the standard devia-
tion, which is simply the square root of the variance statistic. For samples (a)
and (b), the standard deviation statistics are 2.45 and 2.32, respectively. Typically, a larger standard deviation reflects wider dispersion of the scores around the mean.
The standard deviation statistic is particularly useful when you are working
with a normal distribution of scores. For example, IQ testing in this country
has yielded a mean IQ score of about 100 with a calculated standard deviation
of about 15. For a normal distribution, this means that the majority of the
population (68%) will have IQ scores that fall within one standard deviation
of the mean (between 85 and 115).
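The variance and standard deviation computations can be verified with a short sketch that uses the five-score data set from the worked example and divides by N – 1, as above.

import math

# Data set from the worked example above; its mean is 12.
data = [9, 11, 12, 13, 15]

mean = sum(data) / len(data)
sum_squared_deviations = sum((x - mean) ** 2 for x in data)  # 20
variance = sum_squared_deviations / (len(data) - 1)          # 20 / 4 = 5
standard_deviation = math.sqrt(variance)                     # about 2.24

print(mean, sum_squared_deviations, variance, round(standard_deviation, 2))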
Thus far, we have discussed statistics of central tendency and variance that
are useful in identifying key trends in a data distribution. We now turn our
attention to the process of preparing your data for more advanced analyses—
that of coding and rostering.
Ordinarily, a researcher does not analyze data in word form. For instance, a
data-processing device cannot conveniently record the fact that a subject was
male or female through the use of those words. The solution is to assign a
numerical code: Perhaps each male subject would be coded as 1 and each female
subject as 2. The numerical code gives another name to a datum, but one that
is shorter than its word name and therefore easier to record, store, process, and
retrieve.
Such codes are used regularly in data processing. Similar techniques allow
you to code characteristics like the subject’s name, gender, socioeconomic sta-
tus, years of education completed, and so on. Consider the simple data codes
shown in Table 12.1.
Note that the data are collected in nominal categories designated by word
name (e.g., single, married, divorced) and that the word name of each category
is then replaced by a number. The researcher makes an arbitrary choice of
which number represents which word. Typically, however, consecutive num-
bers are chosen; when appropriate, one number (usually the last in the series) is
reserved for a category labeled “other” or “miscellaneous.”
Numerical data codes are essential for nominal data, which are typically
collected in word form and must therefore be coded to obtain numerical
indicators of categories. These codes can also be assigned to interval data (or
ordinal data) if, to facilitate data storage or analysis, you desire to replace a long
series of numbers with a shorter one, or if you choose to convert these data to
nominal form. (Coding produces nominal data because it groups scores into
categories.) For instance, if your subjects’ ages run from a low of 16 years old
to a high of 60 years old, you can replace an interval scale of 45 possible scores
(individual ages) with a compressed scale of five categories (which would then
be considered nominal categories for computations): Ages 11 to 20 receive the
code 1; ages 21 to 30 receive 2; ages 31 to 40 receive 3; ages 41 to 50 receive 4;
and ages 51 to 60 receive 5.
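Coding of this kind is easily automated; in the sketch below the ages are hypothetical, and the category boundaries are the ones just listed.

# Hypothetical subject ages between 16 and 60.
ages = [16, 23, 37, 41, 59, 30, 44]

def age_code(age):
    # Map an age to the nominal codes above: 11-20 -> 1, 21-30 -> 2, 31-40 -> 3,
    # 41-50 -> 4, 51-60 -> 5.
    return (age - 11) // 10 + 1

print([age_code(a) for a in ages])  # [1, 2, 3, 4, 5, 2, 4]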
To summarize, researchers have several options with interval data. They
can retain them in interval form and use the two-digit numbers collected for
subjects’ ages, or they can treat the data as classes or categories by coding them
into nominal form. If they choose to use a statistic requiring nominal data, such
as chi-square analysis, they would have to adopt the second option.
Coding can also define ordinal categories within a series of scores. Thus, 1
might represent scores in the high 20 percent of a series, 2 those in the second
highest 20 percent, and so on, down to 5 for those in the lowest 20 percent.
Consider an example of scores from a class of 25 students:
98 84 79 70 60
96 84 78 69 58
94 84 77 68 53
92 80 77 68 42
87 80 71 65 30
Code 1 2 3 4 5
This coding system ensures that an equal number of scores will fall into
each coding category; it offers an attractive system for a study that requires
equal numbers of scores in a category. (It can also be used to compare the dis-
tributions of two independent samples using chi-square analysis.)
Some examples of more complex coding systems for converting interval
scores to ordered categories appear in Table 12.2. Computations based on such
categories are still likely to treat them as nominal variables. Note that the cod-
ing schemes avoid labeling any category as 0. It is recommended that a coding
category begin with 1 or 01 and that 0 be used only to designate no data. Note
also that in each case, the number of digits in the code must be the same as the
number of digits in the last category of the coded data. Because the number 10
in Example 2 has two digits, a two-digit code must be used from the beginning.
A coding scheme for 350 categories would require three-digit codes, because the
number 350 has three digits.
EXAMPLE 3 EXAMPLE 4
Grade Grade
1 = 90 and above 1 = top 10% (percentile 91–100)
2 = 80–89 2 = next to top 20% (percentile 71–90)
3 = 70–79 3 = middle 40% (percentile 31–70)
4 = 60–69 4 = next to lowest 20% (percentile 11–30)
5 = 59 and below 5 = lowest 10% (percentile 1–10)
Once you have prepared the coding scheme, the next step is to either pre-
pare a data sheet or enter the data into a computer. It is helpful to indicate on
a separate piece of paper the independent, moderator, control, and dependent
variables for each analysis.
The dependent variables in this study include rate, accuracy, and compre-
hension scores on the school’s own reading test and scores on the Picture Read-
ing Test. In addition, the number of days absent from school were rostered, as
well as the scores on a group IQ test administered at the end of the experiment.
The sample roster sheet appears in Table 12.4.
Note that the first five items on the roster have been designated by codes,
while the remaining six are actual scores. Decimal points can be eliminated
when they are in a constant position for each variable and add nothing to the
data. However, maintaining them, as in the grade equivalency scores on the
Picture Reading Test in Table 12.4 (next to last column), often aids interpreta-
tion of the data roster and subsequent results.
Recall that the purpose of a statistical test is to evaluate the match between
the data collected from two or more samples. Further, statistical tests help to
determine the possibility that chance fluctuations have accounted for any dif-
ferences between results from the samples.
To choose the appropriate statistic, first determine the number of inde-
pendent and dependent variables in your study. (For statistical purposes, con-
sider moderator variables as independent variables.) Next, distinguish nominal,
ordinal, and interval variables. (These terms are explained in Chapter 10.)
Notice in the table that if both independent and dependent variables are
interval measures, correlation techniques (parametric correlations) may be
employed. Ordinal measurement generally calls for nonparametric techniques.
TABLE 12.4. Sample data roster. Column headings (printed vertically in the original): subject number, treatment, sex, IQ, days absent, rate, accuracy, comprehension, and group IQ; the next-to-last column holds Picture Reading Test grade-equivalent scores.
01 1 1 2 00 21 07 18 2.4 115
02 1 1 3 01 19 10 16 3.1 095
03 1 2 3 01 09 04 17 0.9 101
04 1 2 1 05 17 02 10 1.6 097
05 2 1 2 00 22 14 04 1.8 122
06 2 1 2 1 14 06 11 2.0 124
07 2 2 1 08 13 12 18 1.1 110
08 2 2 1 07 16 03 16 1.2 104
09 2 1 1 1 01 11 04 17 0.3 122
10 2 1 1 2 04 18 09 12 0.9 100
11 2 1 2 1 01 25 02 14 2.9 101
12 2 1 2 2 03 23 12 15 1.0 099
13 2 2 1 3 06 11 08 10 2.1 133
14 2 2 1 3 09 17 11 14 2.1 130
15 2 2 2 2 06 29 14 11 1.0 129
16 2 2 2 1 00 13 08 16 2.0 103
17 3 1 1 3 02 15 10 12 3.0 092
18 3 1 1 1 06 17 10 09 2.8 104
19 3 1 2 2 08 27 06 23 2.0 101
20 3 1 2 2 10 25 14 17 1.7 093
21 3 2 1 1 01 13 12 16 2.2 109
22 3 2 1 1 02 24 09 21 2.7 131
23 3 2 2 2 06 31 10 14 2.5 105
24 3 2 2 3 03 15 15 18 2.9 108
25 4 1 1 2 05 19 10 17 1.8 111
26 4 1 1 1 11 13 12 18 0.6 130
27 4 1 2 2 01 25 08 12 1.5 090
28 4 1 2 1 10 30 09 13 1.9 100
29 4 2 1 3 07 19 19 24 2.6 119
30 4 2 1 1 03 11 15 23 2.0 124
31 4 2 2 2 03 18 10 20 3.5 101
32 4 2 2 3 09 24 15 15 2.1 095
Researchers rely on a basic tool kit of six commonly used statistical tests.
If you are dealing with two interval variables, use a parametric correlation
called the Pearson product-moment correlation. When dealing with two ordinal
variables, most researchers use a Spearman rank-order correlation. With two
nominal variables, they use the chi-square statistic. For a study with a nominal
independent variable and an interval dependent variable with only two condi-
tions or levels, use a t-test; use analysis of variance to evaluate more than two
conditions or more than one independent variable. Finally, the combination of
a nominal independent variable and an ordinal dependent variable requires a
Mann-Whitney U-test (a nonparametric version of the t-test).
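These decision rules can be condensed into a small helper function; the sketch below is only an illustration of the guidelines in this paragraph, not an exhaustive decision procedure.

def choose_test(iv_scale, dv_scale, iv_levels=2):
    # Suggest one of the six tests described above from the measurement scales
    # of the independent and dependent variables.
    if iv_scale == "interval" and dv_scale == "interval":
        return "Pearson product-moment correlation"
    if iv_scale == "ordinal" and dv_scale == "ordinal":
        return "Spearman rank-order correlation"
    if iv_scale == "nominal" and dv_scale == "nominal":
        return "chi-square"
    if iv_scale == "nominal" and dv_scale == "interval":
        return "t-test" if iv_levels == 2 else "analysis of variance"
    if iv_scale == "nominal" and dv_scale == "ordinal":
        return "Mann-Whitney U-test"
    return "no simple rule applies; consider transforming a variable"

print(choose_test("nominal", "interval", iv_levels=3))  # analysis of variance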
Researchers often transform variables to render the data they collect
suitable to specific statistical tests (which may differ from those originally
anticipated for the studies). For instance, if interval performance data are
available in a two-condition study, but they do not satisfy the conditions for
a t-test (normal distribution, equal sample variance), you could transform the
interval dependent variable into an ordinal measure and use a Mann-Whitney
U-test.
Consider another example. Suppose you are studying the effect of pro-
grammed science materials on learning, with student IQ as a moderator vari-
able. One of the independent variables is a nominal variable—programmed
learning versus traditional teaching—whereas the second, IQ, is an interval
variable. The dependent variables, you decide, are subsequent performance on
an achievement test (interval) and attitudes as measured by an attitude scale
(interval). How should you proceed? The first step is to convert the second
independent variable (IQ) from an interval variable to a nominal variable.
(Recall from Chapter 10 that you can always convert from a higher order of
measurement to a lower order—from interval to ordinal or nominal, or from
ordinal to nominal—but that converting from a lower to a higher order of mea-
surement is not advised.)
To convert an interval variable to a nominal variable, separate the students
into groups based on their scores on the interval measure. Place the scores on
IQ (or another interval variable) in numerical order (that is, essentially, recast
the interval data in ordinal form) and locate the median score. You can then
label everyone above the median as high IQ and everyone below the median as
low IQ, thus assigning Ss to a high category or a low category. As an alternative,
the students could be broken into three groups—high, medium, and low—by
dividing the total group into equal thirds, or tertiles. Categorical assignment to
groups represents nominal measurement.
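A median split of this kind takes only a few lines; the IQ scores in the sketch are hypothetical, and scores tied with the median fall into the low group here.

import statistics

# Hypothetical IQ scores for ten students.
iq_scores = [98, 104, 110, 115, 121, 95, 132, 108, 101, 126]

median_iq = statistics.median(iq_scores)  # 109.0

# Assign each student to a nominal category by the median split.
iq_groups = ["high IQ" if score > median_iq else "low IQ" for score in iq_scores]
print(iq_groups)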
This chapter will next review five commonly used parametric and non-
parametric statistical tests. Table 12.5 describes the application of each to its
appropriate research investigation.
that the treatment should have an impact on student grades that outweighs the
differences between individual students in the class. He also assumes that his
instructional strategies will have a specific impact on each of his students; that
is to say, that the scores they produce on his exams will be independent of one
another. Finally, our fictional professor assumes that exam scores will be nor-
mally distributed; that is, the scores his students produce will fall on the normal,
bell-shaped distribution curve.
The different instructional strategies produced the following average exam scores for the three classes:
F = MS between / MS within
Here, F represents the ratio between the differences between groups and the
differences within each group (also known as “error”). Simply stated, an F sta-
tistic that is higher than 1 indicates the presence of a treatment effect (though
the statistical significance of that effect is a bit more complicated).
The first step in calculating the ANOVA for a data set is to calculate the total sum of squares (SST), which is the sum of the squared differences between each data point and the grand mean for the data set. This can be defined as follows:

SST = ∑ (X – MG)²

where X is an individual score and MG is the grand mean.
As we stated earlier, the mean scores for the three groups are 72, 79, and
91, respectively (we’ll use these later). The total mean for this data set is 80.67.
Calculating the total sum of squares would simply be a matter of inserting each
data point and mean into the formula.
This statistic represents the total amount of variance that will be explained
by the sum of the treatment (between groups) and error (within groups). We
must next consider the degrees of freedom (df), or the number of options that
will vary within a particular investigation. This is calculated quite easily. The
degrees of freedom are always calculated as the total number of scores used to calculate the sum of squares, minus 1. For the dfm (model degrees of freedom),
simply subtract 1 from the total number of comparison groups. For this study,
we simply subtract 1 from 3 (lecture, cooperative learning, lecture/cooperative
learning), which yields a df of 2. For the dfT (total degrees of freedom) we
simply subtract 1 from 30 (the total number of participant scores), which yields
29. For the dfR (residual degrees of freedom), we subtract the model degrees of
freedom from the total degrees of freedom, which yields 27.
Our next step is to calculate just how much of the total sum of squares (variance) can be explained by the treatment conditions. This statistic, also known as the model sum of squares, can be calculated by using the following formula:

SSM = ∑ n(M – MG)²
Here n equals the total number of subjects in a particular group, M equals the
mean for a particular group, and MG equals the total mean for the sample.
When we consider our individual group means (72, 79, and 91) and our total group mean (80.67), we calculate the following statistic:

SSM = 10(72 – 80.67)² + 10(79 – 80.67)² + 10(91 – 80.67)² = 1,846.67
Our next step is to calculate the residual sum of squares, or the amount
of variance that was not explained by the treatment. This is done by subtract-
ing the model sum of squares from the total sum of squares (in this example,
3404.67 – 1846.67), which for our fictional study gives us 1558.00.
By simply observing these numbers, one would conclude that there is a
greater degree of variability between groups than within each group. We must
next calculate the mean sum of squares (MS) for both the model and residual
sum of squares. This statistic is calculated by simply dividing each of the model
sum of squares and residual sum of squares by their respective degrees of free-
dom. This statistic is calculated as follows:
MSM = 1846.67 / 2 = 923.33
MSR = 1558.00 / 27 = 57.70
F = MS between / MS within = 923.33 / 57.70
F = 16.00
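The arithmetic above can be reproduced with a short sketch that works from the reported group means, group size, and total sum of squares (the raw scores themselves are not listed in this example).

group_means = [72, 79, 91]   # the three class means reported above
n_per_group = 10
ss_total = 3404.67           # total sum of squares reported above

grand_mean = sum(group_means) / len(group_means)                          # 80.67
ss_model = sum(n_per_group * (m - grand_mean) ** 2 for m in group_means)  # ~1846.67
ss_residual = ss_total - ss_model                                         # ~1558.00

df_model = len(group_means) - 1                # 2
df_total = n_per_group * len(group_means) - 1  # 29
df_residual = df_total - df_model              # 27

ms_model = ss_model / df_model                 # ~923.33
ms_residual = ss_residual / df_residual        # ~57.70
f_statistic = ms_model / ms_residual
print(round(f_statistic, 2))                   # ~16.0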
Statistical tests are major tools for data interpretation. By statistical testing,
a researcher can compare groups of data to determine the probability that dif-
ferences between them are based on chance, providing evidence for judging the
validity of a hypothesis or inference. A statistical test can compare the means
in this example relative to the degree of variation among scores in each group
to determine the probability that the calculated differences between the means
reflect real differences between subject groups and not chance occurrences.
By considering the degree of variation within each group, statistical tests yield
estimates of the probability or stability of particular findings. Thus, when a
researcher reports that the difference between two means is significant at the .05
level (usually reported as p < .05), this information implies a probability less than
5 out of 100 that the difference is due to chance. (That is, the likelihood that the
distribution of scores obtained in the study would occur simply as a function of
chance is less than 5 percent.) On this basis, a researcher can conclude that the
differences obtained were most likely the result of the treatment.
rxy = [ΣXY – (ΣX)(ΣY)/n] / √[(SSx)(SSy)]
To calculate the coefficient statistic, she will first need to perform a few simple calculations. For ΣXY, multiply each X value (efficacy) by its corresponding Y value (exam average) and sum the total. For both (ΣX) and (ΣY), simply sum the totals for each variable. For (SSx) and (SSy) it is a bit more complex—you will need to use the following formula:

SSx = ΣX² – (ΣX)²/n (and likewise for SSy)

This requires that we calculate both ΣX² and ΣY² as well as squaring ΣX and ΣY. The resulting table should resemble the following:
With this information, we can now plug the appropriate numbers into the
formula:
rxy = [65936 – (792)(828)/10] / √[(737.6)(1205.6)] = 358.4 / 943.0 = .38
In this instance, the correlation coefficient informs the researcher that there
is a moderately strong positive correlation between efficacy and exam score for
this sample.
Using this information, we can also calculate a line of best fit, that is to say,
a linear regression. This statistic allows us to predict values for one variable
using a fairly simple equation:
Y = mx + b
The slope (m) for the regression line can be calculated using many of the
statistics we calculated for the correlation coefficient:
m = [n(ΣXY) – (ΣX)(ΣY)] / [n(ΣX²) – (ΣX)²]

b = [ΣY – m(ΣX)] / n
By plugging in the data from our efficacy/exam score study, we can easily cal-
culate the regression equation:
m = [10(65936) – (792)(828)] / [10(63464) – (792)²] = .49

b = [828 – .49(792)] / 10 = 43.99
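The same correlation and regression arithmetic can be expressed in a short sketch; the paired X and Y lists below are hypothetical placeholders, since the study's raw efficacy and exam scores are not reproduced in this section.

import math

# Hypothetical paired scores: X = efficacy ratings, Y = exam averages.
x = [70, 75, 80, 85, 90, 72, 78, 88, 74, 80]
y = [72, 78, 81, 88, 93, 70, 80, 90, 75, 84]
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(a * b for a, b in zip(x, y))
sum_x2 = sum(a * a for a in x)
sum_y2 = sum(b * b for b in y)

# Sums of squares, as in the formula above: SS = sum of squared scores minus (sum)**2 / n.
ss_x = sum_x2 - sum_x ** 2 / n
ss_y = sum_y2 - sum_y ** 2 / n

# Pearson product-moment correlation.
r_xy = (sum_xy - sum_x * sum_y / n) / math.sqrt(ss_x * ss_y)

# Least-squares regression line Y = mX + b.
m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b = (sum_y - m * sum_x) / n

print(round(r_xy, 2), round(m, 2), round(b, 2))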
Mann-Whitney U-Test
U1 = R1 – [n1(n1 + 1)] / 2
In this equation, n1 represents the sample size for sample 1, while R1 represents
the sum of ranks for sample 1. When calculating U for multiple samples, the
smallest of the U values is used when consulting significance tables.
An example might consider a physical education teacher who wishes to
investigate whether girls or boys in general post faster 50-yard-dash times.
After testing 20 students, he ranks them in order of finish:
B B B G B B G G G B B G B B G G G B G G
Using this rank-order information, we can calculate the U statistic for both boys (U1) and girls (U2):

U1 = (1 + 2 + 3 + 5 + 6 + 10 + 11 + 13 + 14 + 18) – [10(10 + 1)] / 2 = 83 – 55 = 28

U2 = (4 + 7 + 8 + 9 + 12 + 15 + 16 + 17 + 19 + 20) – [10(10 + 1)] / 2 = 127 – 55 = 72
For purposes of analysis, we would use the smaller U1 statistic (28) when consulting the significance table (see Appendix A). A finding of significance for this investigation would suggest that males tend to post faster 50-yard-dash times than females.
A worksheet for the U-test appears as Figure IV in Appendix B. This
worksheet has been set up for experiments involving fewer than 20 and more
than 8 observations in the larger of two samples. For larger sample sizes, use
techniques described in Siegel (1956).
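The U computation can be carried out directly from the order of finish given above, as the following sketch shows (rank 1 is the fastest time).

# Order of finish from the example above: B = boy, G = girl.
order = list("BBBGBBGGGBBGBBGGGBGG")

boy_ranks = [i + 1 for i, runner in enumerate(order) if runner == "B"]
girl_ranks = [i + 1 for i, runner in enumerate(order) if runner == "G"]

def u_statistic(ranks):
    # U = R - n(n + 1) / 2, where R is the sum of ranks for the sample.
    n = len(ranks)
    return sum(ranks) - n * (n + 1) / 2

u_boys = u_statistic(boy_ranks)    # 28.0
u_girls = u_statistic(girl_ranks)  # 72.0
print(min(u_boys, u_girls))        # the smaller U is taken to the significance table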
Chi-Square Statistic
χ2 = Σ [(observed – expected)² / expected]
Grade    Males    Females
A        4        9
B        8        9
C        7        4
D        6        2
F        5        2
Totals   30       26
((3 – 4)2 / 4) + ((8 – 8)2 /8) + ((7 – 6)2 /6) + ((3 – 8)2 / 8) +((4 – 4)2 / 4))
χ2 = 3.80
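For a contingency table like the one above, the chi-square computation can be sketched as follows; the sketch recomputes expected cell frequencies from the row and column totals and is a generic illustration rather than a reproduction of the partially printed calculation in the text.

# Observed grade distribution from the table above (rows A through F; columns males, females).
observed = {
    "A": (4, 9), "B": (8, 9), "C": (7, 4), "D": (6, 2), "F": (5, 2),
}

column_totals = [sum(row[i] for row in observed.values()) for i in (0, 1)]  # [30, 26]
grand_total = sum(column_totals)                                            # 56

chi_square = 0.0
for males, females in observed.values():
    row_total = males + females
    for observed_count, column_total in zip((males, females), column_totals):
        expected = row_total * column_total / grand_total
        chi_square += (observed_count - expected) ** 2 / expected

print(round(chi_square, 2))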
of a Type I error, it is ordinarily set at 5 percent, or the .05 level, meaning that
researchers accept only 5 chances out of 100 of making such an error. If the
Type I error were considered to be four times more serious than the Type II error, then the acceptable probability of a Type II error would be set at .20.
A two-sided or two-tailed statistical test compares the null hypothesis that the
means of two distributions are equal (H0 ) against the alternative that they are not
equal—meaning that the first may be either larger (H1 ) or smaller (H2) than the
second. This term implies that the two tails or two sides of the normal distribu-
tion contribute to estimates of probabilities. A two-tailed test is concerned with
the absolute magnitude of the difference regardless of sign or direction.
A researcher concerned with the direction of such a difference might
employ a one-tailed or one-sided test. However, the one-tailed approach is
a much less conservative evaluation than the two-tailed approach. A given
degree of difference between means indicates half the likelihood that it resulted
by chance in a one-tailed test as the same difference indicates in a two-tailed
test. In other words, a difference that yields a p value at the .05 level in a one-
tailed test will yield a p value at the .10 level in a two-tailed test. A one-tailed
test thus doubles the probability of a Type I, or false positive, error. For this
reason, two-tailed tests are recommended.
■ Summary
is usually set at .05; power at .80; and effect size at .2 (small), .5 (medium),
or .8 (large). A medium effect size requires two groups of 64 subjects each.
5. Nominal data should be coded numerically and all data rostered, either by
hand or electronically, prior to analysis.
6. Analysis of variance provides a statistical tool for partitioning the variance
within a set of scores according to its various sources in order to determine
the effects of variables individually (main effects).
7. The Spearman rank-order correlation compares two sets of ranks to deter-
mine their degree of equivalence.
8. The chi-square (χ2) test uses contingency tables to identify differences
between two distributions of nominal scores.
36 92 74 85 39
98 41 40 90 45
47 73 58 70 22
49 62 67 71 52
54 68 81 78 50
Group 1 Group 2
a. 75 k. 81 a. 72 k. 79
b. 88 l. 89 b. 81 l. 83
c. 80 m. 77 c. 70 m. 73
d. 85 n. 84 d. 80 n. 85
e. 78 o. 88 e. 70 o. 82
f. 90 p. 93 f. 85 p. 90
g. 82 q. 91 g. 73 q. 80
h. 88 r. 81 h. 82 r. 78
i. 76 s. 85 i. 72 s. 81
j. 82 t. 87 j. 78 t. 74
11. Now use the data given in Exercise 5 in a slightly different way. Instead of
thinking of Groups 1 and 2 as they are labeled, think of them as Tests 1 and
2, administered to a single group of Ss. The letter appearing alongside each
score is now the subject identification code. Compute a Spearman rank-
order correlation between the scores for the 20 Ss on Test 1 (the Group 1
data) and the scores for these same Ss on Test 2 (the Group 2 data). Report
the r, and the level of significance (if any).
12. Again, consider Groups 1 and 2 to be Tests 1 and 2, respectively, as you
did in Exercise 11. But this time only consider the first 10 Ss. (Again use
the letters alongside the scores as identification codes.) Compute a Pearson
product-moment correlation between the Test 1 and Test 2 scores for the
first 10 Ss. Report the r value and the level of significance (if any).
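A hedged sketch for checking Exercises 11 and 12 with SciPy appears below; the scores are those listed above for Groups 1 and 2 (here treated as Tests 1 and 2), entered in the order of the subject codes a through t.

# Minimal sketch (assumes SciPy) for Exercises 11 and 12.
from scipy.stats import spearmanr, pearsonr

# Test 1 and Test 2 scores for subjects a through t.
test_1 = [75, 88, 80, 85, 78, 90, 82, 88, 76, 82,
          81, 89, 77, 84, 88, 93, 91, 81, 85, 87]
test_2 = [72, 81, 70, 80, 70, 85, 73, 82, 72, 78,
          79, 83, 73, 85, 82, 90, 80, 78, 81, 74]

rho, p_rho = spearmanr(test_1, test_2)        # Exercise 11: all 20 Ss
r, p_r = pearsonr(test_1[:10], test_2[:10])   # Exercise 12: first 10 Ss
print(f"Spearman rho = {rho:.2f}, p = {p_rho:.4f}")
print(f"Pearson r    = {r:.2f}, p = {p_r:.4f}")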
■ Recommended References
Glass, G. V., & Hopkins, K. D. (1996). Statistical methods in education and psychology
(3rd ed.). Englewood Cliffs, NJ: Prentice-Hall.
Linn, R. L. (1986). Quantitative methods in research on teaching. In M. C. Wittrock
(Ed.), Handbook of research on teaching (3rd ed.). New York, NY: Macmillan.
Tatsuoka, M. M. (1988). Multivariate analysis: Techniques for educational and psycho-
logical research (2nd ed.). New York, NY: Macmillan.
Winer, B. J., Brown, D. R., & Michels, K. M. (1991). Statistical principles of experimen-
tal design (3rd ed.). New York, NY: McGraw-Hill.
= CHAPTER THIRTEEN
OBJECTIVES
This and the next five sections deal with the preparation of the parts of an edu-
cational research report in the form of either a dissertation or a journal article
manuscript.
This section describes the preparation of the introductory section. Depend-
ing upon the length of the section, subsection headings may or may not appear
(for example, “Context of the Problem,” “Statement of the Problem,” “Review
of the Literature,” “Statement of the Hypotheses,” “Rationale for the Hypoth-
eses,” and so on).
The first paragraph or two of the introduction should acquaint the reader with
the problem addressed by the study. This orientation is best accomplished by
providing its background. One accepted way to establish a frame of refer-
ence for the problem is to quote authoritative sources. Consider the following
opening paragraphs of introductions from each of two articles in the American
Educational Research Journal:
1. Because the use of headings is optional, the original material cited as examples often
lacks these headings. When missing from the original, the appropriate headings have been
added for clarity.
Note that each introduction identifies the area in which the researchers find
their problem. Additionally, they state their reasons for undertaking the projects;
the introductions point out that the problems have not been fully studied or that
the current studies promise useful contributions to understanding. These illus-
trations are short because each was drawn from a journal that emphasizes brevity
of exposition. In other forms of research reports (and in some journal articles, as
well), context statements may run somewhat longer than these examples. How-
ever, a maximum of three paragraphs is recommended.
definitions of the variables, so the problem statement should identify the vari-
ables in their conceptual form rather than in operational form. The variables
should be named, but no description of measurement techniques is necessary
at this point.
One or two sentences will normally suffice to state a research problem.
Often the statement begins: “The purpose of this study was to examine the
relationship between . . .” or “The present study explored . . .”2 Here are prob-
lem statements for the two previously quoted studies:
Problem
The purpose of the present study was to determine if any differential effects
occur in the behavior of mildly mentally retarded children in EMR class-
rooms as a result of variations in class size. Does class size make a difference
in the relative frequency of significant classroom behaviors such as atten-
tion, communication, and disruption? [Forness & Kavale, 1985, p. 404]
Problem
We addressed these issues by asking the following questions:
2. In a research proposal, "is" or "will be" would be substituted for "was," and "explores" or
"will explore" would replace "explored." A proposal is written in the present or future tense
and a final report in the past tense.
Problem
The purpose of this study was threefold. An attempt was made to test the
differential effects of the verbal praise of an adult (the type of reinforce-
ment most often utilized by classroom teachers) (1) on a “culturally disad-
vantaged” as opposed to a white middle-class sample, (2) as a function of
the sex of the agent, and (3) as a function of the race of agent and recipient
of this reinforcement.
Problem
The purpose of this study was to determine whether girls who plan to
pursue careers in science are more aggressive, more domineering, less con-
forming, more independent, and have a greater need for achievement than
girls who do not plan such careers.
Problem
It was the purpose of this study to determine what differences, if any,
existed in the way principals of large and small schools and principals
(collectively) and presidents of teacher organizations viewed the level of
involvement of the principal in a variety of administrative tasks.
The purpose of the literature review is to expand upon the context and back-
ground of the study, to refine the problem definition, and to provide an empiri-
cal basis for subsequent development of hypotheses. The length of the review
may vary depending upon the number of relevant articles available and the
purpose of the research report. Dissertations are usually expected to provide
more exhaustive literature reviews than journal articles. Although some dis-
sertation style manuals recommend devoting an entire chapter (typically the
second) to a review of the literature, building the review into the introductory
chapter has the advantage of forcing the writer to keep the review relevant
to the problem statement and the hypotheses that surround it. This section
examines the task of writing a literature review according to the procedures
described in Chapter 3.
A good guideline for selecting the literature to cover in the review sec-
tion is to cite references dealing with each of the variables in the study, paying
special attention to articles that deal with both variables. Literature concerning
conceptually similar or conceptually related variables should likewise be cited.
Subheadings should reflect the major variables (key words) of the literature
review. The review should expand its descriptions of articles as their relevance
to the study increases. Remember that the purpose of the literature review is to
provide a basis for formulating hypotheses. In other words, review articles not
for their own sake but as a basis for generalizing from them to your own study.
Consider the following organization of subheadings for the literature
review section in a study of the relationship between teacher attitudes and
teaching style:
Teacher Attitudes
Overview and Definitions
Open-Minded versus Closed-Minded: General Studies
Open-Minded versus Closed-Minded: Relation to Teaching
Humanistic versus Custodial: General Studies
Humanistic versus Custodial: Relation to Teaching
Teaching Style
Overview and Definitions
Directive versus Nondirective: General Studies
Directive versus Nondirective: Relation to Teacher Attitudes
Hypotheses
The specific hypotheses investigated were:
Hypotheses
Thus, in the case of computers, we expected individuals’ attributions about
how much they enjoy using computers to decline significantly over time.
We also expected gender to play a significant role with girls’ assessments
Hypotheses
Accordingly, the specific purpose of the present training study was to
examine the role of knowledge of information sources in children’s ques-
tion-answering abilities through the examination of an instructional pro-
gram designed to heighten their awareness of information sources. It was
predicted that as a result of training, (a) students’ awareness of appropri-
ate sources of information for answering comprehension questions would
Rationale
Although the studies by Raphael et al. (1980) and by Wonnacott and
Raphael (1982) suggested that knowledge about the question-answering
process and sources of information for answering comprehension questions
is important, both studies were essentially descriptive. Thus, they cannot
provide causal explanations of the relationship between students’ strategic
(i.e., meta-cognitive) knowledge and actual performance. Belmont and But-
terfield (1977) suggested that training studies can provide such informa-
tion about cognitive processes. They proposed that successful intervention
implies a causal relationship between the means trained and the goal to be
reached; that is, one can learn if a component of a process is related to a
goal, or cognitive outcome, by manipulating the process. Similar sugges-
tions were proposed by Brown, Campione, and Day (1981) in their discus-
sion of “informed” training studies where students are taught about a strat-
egy and induced to use it and are given some indication of the significance
of the strategy. Finally, Sternberg (1981) provided an extensive discussion
of prerequisites for general programs that attempt to train cognitive skills,
including suggestions such as the need to link such training to “real-world
behavior” as well as to theoretical issues. [Raphael & Pearson, 1985, p. 219]
Operational Definitions
Rhetorical Questions: Questions that do not expect some participation
by the reader. Such questions never require the student to do anything,
Operational Definitions
One of the most promising of these methodologies is the method of
repeated readings (Dowhower, 1989; Samuels, 1979). In this approach,
readers practice reading one text until some predetermined level of flu-
ency is achieved. . . .
A related technique used to improve reading fluency is repeated lis-
tening-while-reading texts. This method differs from repeated readings in
that the reader reads the text while simultaneously listening to a fluent
rendition of the same text. [Rasinski, 1990, p. 147]
Predictions
It was therefore predicted that teachers labeled as innovative by virtue
of their applying for funds to develop an innovative classroom program
Predictions
Specifically, teachers displaying more liberal political attitudes as mea-
sured by the Opinionation Scale and the Dogmatism Scale (both devel-
oped by Rokeach) were expected to show more liberal tendencies toward
the treatment of students as evidenced by a greater emphasis on autonomy
for students and allowing students to set their own rules and regulations
and enforce them.
Predictions
Readers of a research proposal or report are usually concerned with the relevance
of the problem it addresses both to practice and to theory. Many people highly
value research that makes primary or secondary contributions to the solutions
of practically oriented educational problems. The field of education also needs
research that establishes and verifies theories or models. To these ends, a report
introduction benefits by indicating the value or potential significance of the prob-
lem area and hypothesized findings to educational practice, educational theory, or
both. Again, some examples are offered:
This section discusses a recommended set of categories for describing the meth-
ods and procedures of a study. Each category may correspond to a subheading
in the research report. Such a high degree of structure for the method section
is recommended, because this section contains detailed statements of the actual
steps undertaken in the research.
Subjects
and IQ (median and/or range). All potential sources of selection bias covered by
control variables should be identified in this section.
Providing such information allows another researcher to select a virtually
identical sample if he or she chooses to replicate the study. In fact, the entire
method section should be written in a way that provides another researcher
with the possibility of replicating your methodology.
Consider some examples:
Participants
The participants were 109 juniors and seniors in college, all preparing to
be teachers. They were enrolled in three sections of an educational psy-
chology course required for teacher certification during the summer term.
The course lasted 15 weeks. All three sections met once a week (on con-
secutive days) at the same time of day, covered the same content (learning
theories), used the same textbook, and were taught by the same instruc-
tor (the researcher). For the students in each section, the average age was
between 20 and 22, the average percentage of females was between 65%
and 70%, and the average score on the reading subtest of the College Level
Academic Skills Test (CLAST) was between 315 and 320. A comparison
of the three classes on age, gender, and CLAST scores showed them to be
equivalent (F < 1 in all three cases), thus satisfying the requirements for a
quasi-experimental design. Correlations between CLAST reading scores
and achievement in this course have been found to be about .5 (Tuckman,
1993). [Tuckman, 1996a, p. 200]
Subjects
The sample consisted of 53 eighth-grade subjects (34 girls, 19 boys) attend-
ing a public middle school. Subjects were judged most likely to be from
middle-class families and were predominantly white (87%). The subjects
were classified into two groups: good and poor readers. Subjects were
primarily classified on the basis of stanine scores obtained on a reading
subtest of the Comprehensive Test of Basic Skills (CTBS), which had been
administered in the spring of the previous school year. Subjects were clas-
sified as poor readers (n = 27) if their stanine score on the vocabulary and
comprehension reading subtest of the CTBS was 3 or below. Subjects were
classified as good readers (n = 26) if they had stanine scores of 6 or above.
Subjects with stanine scores in the 4 to 5 range were thought of as aver-
age readers and unsuited for this investigation. Subjects were also rated
by their teachers as either good or poor readers according to such criteria
as fluency, oral reading errors, and comprehension. Ages for the subjects
ranged from 12 years, 4 months to 14 years, 11 months. No emphasis was
placed on sex because similar studies in the past generally reported a non-
significant difference between the sexes in points gained due to answer
changing. [Casteel, 1991, pp. 301–302]
Some studies (but definitely not all) incorporate certain activities in which all
subjects participate or certain materials that all subjects use. These tasks and
materials represent neither dependent nor independent variables; rather than
being treatments themselves, they are vehicles for introducing treatments. In
a study comparing multiple-choice and completion-type response modes for
a self-instructional learning program, for example, the content of the learning
program would constitute the task, because Ss in both groups would experience
this content. Apart from the content that remained constant across conditions,
one group would experience multiple-choice questions in its program, while the
second would experience completion-type questions. Program question format
is thus the independent variable, and it would be described in the next report
section (on “Independent Variables”). Program content is the task and would
be described in this section. Activities experienced by all groups are described in
this section; thus, if the content of a program or presentation were constant for
both groups, it would be described under “Tasks.” Activities experienced by one
or some but not all of the groups are described in the “Independent Variable”
section. Frequently, however, studies include no common activity or task, and
the report on such a study would entirely omit this section.
Some examples suggest the information appropriate to this section:
In this study, all Ss received the same instructional material; the only differ-
ence consisted in the conditions under which it was used. In the study excerpted
in the next quote, all Ss received the same tasks but with differing instructions,
allowing the researchers to solicit the dependent variable measures:
In another study, each student learned the subject matter described in the
next excerpt. However, in the different conditions of the independent variable, the
instruction was controlled in different ways. This process is described in detail in
the first example in the later section of this chapter on “Independent Variables.”
Independent Variables
In this section, the research report should describe independent (and mod-
erator) variables, each under a separate heading.4 Researchers generally must
explain two types of independent variables—manipulated and measured
variables. The description of a manipulated variable (often referred to as a
treatment) should explain the manipulation or materials that constituted the
treatment (such as, what you did or what you gave). Be specific enough so
that someone else can replicate your manipulation. Identify each level of the
manipulation or treatment, itemizing each for emphasis.
This example describes a manipulated independent variable with three lev-
els or conditions:
Treatments
Incentive motivation condition. One class (n = 36) took a seven-item,
completion-type quiz at the beginning of each class period. A sample
item is: “A consequence of a response that increases the strength of the
response or the probability of the response’s reoccurrence is called a (an)
_____” (ans.: reinforcer). The quiz covered the textbook chapter assigned
for that week. It was projected via an overhead projector, and 15 min were
allowed for its completion. No instruction on the chapter had been pro-
vided before the quiz. The only information resource for the student was
the textbook itself. Following the quiz, students exchanged papers and the
instructor discussed the answers so that students could grade one anoth-
er’s tests. Students were informed that the average of their quiz grades
would count for one half of their grade for that segment, the same as the
end-of-segment achievement test. Each segment of the course involved 5
weeks of instruction and covered from four to five textbook chapters.
Learning strategy condition. One class (n = 35) was given the home-
work assignment of identifying the 21 most important terms in the
assigned chapter and preparing a definition of each term and a one-sen-
tence elaboration of each definition. A list of approximately 28 terms was
predetermined by the instructor for each chapter, and students’ choices
had to fit this list. The text included many signals so that it was not dif-
ficult for students to identify each term and information about it. The
text did not include a list of major terms in each chapter or a glossary,
so students had to identify the important terms on their own. For exam-
ple, in the chapter on reinforcement theory, reinforcer was identified as a
key term. An example of a student definition would be “something that
increases the likelihood of occurrence of the response it follows,” and the
student’s elaboration might be “getting something good to eat after doing
my homework.”
Students were given 1 hr of training, including examples and practice,
before they started and another hour after having done two assignments.
They were also given feedback on all aspects of each assignment so their
proficiency would improve. Each assignment was graded (A, B, or C)
based on number of correct terms included, correctness of definitions, and
appropriateness of elaborations. The grades were averaged and counted
for half of the segment grade, the same as the average of quiz grades in the
incentive motivation condition.
Control condition. One class (n = 38) heard only lectures in class on
the chapters. No quizzes were given, and no homework was assigned. This
is the manner in which the course is typically taught. [Tuckman, 1996a, pp.
200–201]
Moderator Variable
Formal Operational Reasoning Test (FORT). This paper-and-pencil test
constructed by Roberge and Flexer (1982) was used to evaluate subjects’
logical thinking abilities. It contains subtests that can be used to assess
subjects’ level of reasoning for three essential components of formal oper-
ational thought: combinations, propositional logic, and proportionality
(cf. Greenbowe et al., 1981).
Roberge and Flexer illustrated the content validity of the FORT
by describing the relationship between each of the FORT subtests and
the corresponding Inhelder and Piaget (1958) formal operations scheme,
and they presented factor analytic evidence of the construct validity of
the FORT. Furthermore, Roberge and Flexer reported test-retest reli-
ability coefficients (2-week interval) of .81 and .80 for samples of seventh
and eighth graders, respectively, on the combinations subtest. They also
reported internal consistency reliability coefficients (K-R Formula 20) of
.75 and .74 for samples of seventh and eighth graders, respectively, on the
logic subtest; and internal consistency coefficients of .52 and .60 for sev-
enth and eighth graders, respectively, on the proportionality subtest.
Subjects also were classified as high operational or low operational
on the basis of their performance on the FORT. To be classified as high
operational, the subjects had to correctly answer at least 60% of the items
on two (or more) of the FORT subtests. Subjects whose scores did not
meet this criterion were classified as low operational. [Roberge & Flexer,
1984, pp. 230–231]
This example might have given more detail about the measure than was
required. The amount of detail reported varies as a function of the familiarity of
the instrument, the requirements of the readers, and the report’s space allocation.
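The classification rule quoted in the excerpt above is simple enough to express in a few lines of code. The sketch below uses hypothetical subtest scores (the function and variable names are illustrative, not from the original study): a subject is labeled high operational when at least 60 percent of the items are answered correctly on two or more of the three FORT subtests.

# Minimal sketch of the high/low operational classification rule described
# in the excerpt above. Subtest proportions here are hypothetical.
def classify_operational(proportions_correct, cutoff=0.60, subtests_needed=2):
    """Return 'high' if the cutoff is met on enough subtests, else 'low'."""
    met = sum(1 for p in proportions_correct if p >= cutoff)
    return "high" if met >= subtests_needed else "low"

# Proportion correct on the combinations, logic, and proportionality subtests.
subject_a = [0.70, 0.65, 0.40]
subject_b = [0.55, 0.62, 0.45]

print(classify_operational(subject_a))  # high
print(classify_operational(subject_b))  # low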
Dependent Variables
Dependent Variables
1. Mathematics Achievement. The Mathematics Computations and Con-
cepts and Applications subscales of the Comprehensive Test of Basic Skills
(CTBS) were the achievement criterion measures. Fourth graders took
Level 2, Form S, while fifth and sixth graders took Level H, Form U. Stan-
dardized rather than curriculum-specific tests were used to be sure that the
learning of students in all treatments was equally likely to be registered on
the tests. The CTBS Computations scales covered whole number opera-
tions, fractions, and decimals, objectives common to virtually all texts
and school districts, and the Concepts and Applications scales focused on
measurement, geometry, sets, word problems, and concepts, also common
to most texts and school districts.
District-administered California Achievement Test (CAT) scores were
used as covariates for their respective CTBS scores. That is, CAT Computa-
tions was used as a covariate for CTBS Computations, and CAT Concepts
and Applications was a covariate for CTBS Concepts and Applications.
Because of the different tests used at different grade levels, all scores were
transformed to T scores (mean = 50, SD = 10), and then CTBS scores were
adjusted for their corresponding CAT scores using separate linear regres-
sions for each grade. These adjusted scores were used in all subsequent
analyses. Note that this adjustment removes any effect of grade level, as
the mean for all tests was constrained to be 50 at each grade level.
2. Attitudes. Two eight-item attitude scales were given as pre- and
posttests. They were Liking of Math Class (e.g., “this math class is the
best part of my school day”) and Self-Concept in Math (e.g., “I’m proud
of my math work in this class”; “I worry a lot when I have to take a math
test”). For each item, students marked either YES!, yes, no, or NO! Coef-
ficient alpha reliability estimates on these scales were computed in an ear-
lier study (Slavin et al., 1984) and found to be .86 and .77, respectively.
[Slavin & Karweit, 1985, pp. 355–356]
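One plausible reading of the covariate adjustment described in the excerpt above is sketched below using NumPy: raw scores are converted to T scores (mean 50, SD 10), and each CTBS T score is then replaced by its residual from a regression on the corresponding CAT T score, recentered at 50. The data and the exact regression procedure are hypothetical; the original authors may have implemented the adjustment somewhat differently.

# Minimal sketch (assumes NumPy) of a T-score transformation followed by a
# covariate adjustment via linear regression. Scores are hypothetical.
import numpy as np

def t_scores(raw):
    raw = np.asarray(raw, dtype=float)
    return 50 + 10 * (raw - raw.mean()) / raw.std(ddof=1)

ctbs = t_scores([412, 398, 455, 430, 467, 389, 441, 420])
cat = t_scores([405, 401, 450, 422, 470, 395, 433, 428])

# Regress CTBS on CAT and keep the residuals, recentered at the T-score mean.
slope, intercept = np.polyfit(cat, ctbs, 1)
adjusted = ctbs - (slope * cat + intercept) + 50
print(np.round(adjusted, 1))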
Dependent Variable
A 50-multiple-choice-item test, matched to instructional content, was given
to measure end-of-segment achievement. The test had a K-R reliability of
.82. Virtually all the test questions related to key terms, a central feature of
Procedures
The procedures section should describe any operational details that have not
yet been described and that another researcher would need to know to repli-
cate the method. Such details usually include (1) the specific order in which
steps were undertaken, (2) the timing of the study (for example, time allowed
for different procedures and time elapsed between procedures), (3) instructions
given to subjects, and (4) briefings, debriefings, and safeguards.
Consider some illustrations:
Procedures
Standardized mathematics scores were gathered for each student before
the study. The 20th percentile was the median score for the 47 students
and was used to classify students as “below average” or “low” in prior
mathematics achievement. Those students below the 20th percentile were
classified as low achievement, and those above the 20th percentile were
classified as below average achievement for purposes of this study.
The students were randomly assigned to one of the three treatment
groups, stratified to ensure [that] approximately equal numbers of males
and females with low and below average achievement were assigned to
each treatment. Each student received a brief review of computer oper-
ation and was instructed to proceed with the lesson. At the conclusion
of the lesson the elapsed time was noted and the immediate posttest was
administered. One week later students were given the parallel retention
test in their classroom. [Goetzfried & Hannafin, 1985, p. 275]
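The stratified random assignment described in the excerpt can be sketched as follows; the roster, stratum labels, and treatment names here are illustrative only. Students are shuffled within each gender-by-achievement stratum and then dealt out to the three treatments in turn, which keeps the groups approximately balanced on both stratifying variables.

# Minimal sketch of stratified random assignment to three treatments,
# in the spirit of the procedure quoted above. The roster is hypothetical.
import random
from collections import defaultdict

random.seed(1)
treatments = ["linear", "adaptive", "advisement"]

# (student id, gender, achievement level) -- illustrative data only.
roster = [(i, random.choice(["M", "F"]), random.choice(["low", "below avg"]))
          for i in range(47)]

strata = defaultdict(list)
for student in roster:
    strata[(student[1], student[2])].append(student)

assignment = {}
for members in strata.values():
    random.shuffle(members)
    for position, student in enumerate(members):
        assignment[student[0]] = treatments[position % len(treatments)]

print(assignment[0], assignment[1])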
Procedure
Students participated in research sessions lasting 55 minutes a day for 11
days. Each condition was assigned a separate classroom comparable in
size. The curriculum unit used for instruction was a science unit on the
ecology of the wolf. Each day the teachers would explain the day’s task to
the students, distribute the appropriate materials, and review the condi-
tion’s nature. The teachers followed a daily script detailing what they were
to say and do each day. [Johnson & Johnson, 1985, p. 245]
Procedures
After permission to conduct the study was granted, we searched school
records to obtain student-ability scores for each subject. All subjects were
given the IAR questionnaire to measure their beliefs in internal versus
external control over academic responsibility (Crandall et al., 1965). This
measure was given to subjects in their English classes several days prior to
receiving the treatment.
Subjects were randomly assigned to one of the two treatment con-
ditions. One half of the subjects completed the learner-controlled les-
son, and the other one half completed the program-controlled lesson. To
receive the treatment, we brought subjects in groups of 14 to an Apple
computer lab for 1 hour on 3 consecutive days. Seven subjects using the
learner-controlled lesson and 7 using the program-controlled lesson were
represented in each group.
On the 1st day, we told the subjects that they would be using a com-
puter lesson to learn about some ideas used in advertising. On each day,
subjects were asked to work through the lesson until they were finished
and to raise their hands to indicate when they were done. At the end of the
lesson on the 3rd day, all the subjects completed the confidence measure
and then took the posttest. A formative evaluation of these procedures
was conducted prior to the actual study. No problems were found at that
time, and none occurred during the study. [Klein & Keller, 1990, p. 142]
Data Analysis
The data analysis section of a research report describes the statistical design
used and the statistical analyses undertaken. It is usually not necessary to
describe these procedures step-by-step. If the study relied on common sta-
tistical tests (such as analysis of variance, t-tests, chi-square analysis, correla-
tion), the test may simply be named and its source referenced. More unusual
approaches require more detail.
These points are illustrated in some examples:
Analysis
For purposes of interpretive clarity, scores on the memory tests were con-
verted into percentages. Means and standard deviations for each test (i.e.,
FR, FIB, and MC) as a function of role group (teacher vs. learner) and
verbal ability (high vs. low) are presented in Table 1.
To assess performance differences between groups, a 2 × 2 × 3 within-
subjects analysis of variance (ANOVA) was performed on test scores using
type of test (FR, FIB, and MC) as the within-subjects measure. Role condi-
tion (teacher vs. learner) and verbal ability (high vs. low) were the between-
groups factors. [Wiegmann, Dansereau, & Patterson, 1992, pp. 113–114]
examples. (The examples chosen for this text were selected in part for their
brevity. Moreover, the examples were drawn from journal sources, which place
a premium on space, thus resulting in a terse style.) To obtain some idea of
length and level of detail, read research reports of the same form that you are
about to prepare (that is, dissertations, master’s theses, journal articles), paying
particular attention to form. Occasions will undoubtedly arise when a particu-
lar study will require more or fewer categories or a different order of presenta-
tion than that shown in this section.
The purpose of the results section in a research report is to present the out-
comes of the statistical tests that were conducted on the data, particularly as
they relate to the hypotheses tested in the study. However, the results section
omits discussion, explanation, or interpretation of the results. These functions
are carried out in the discussion section, which follows. Tables and figures are
usually essential to a results section, with the text briefly describing the con-
tents of those visual displays in words.
The best structure for the results section relates its information to the
hypotheses the study sets out to test. The first heading announces results for
Hypothesis 1, the second for Hypothesis 2, and so on. (Such subdivisions
would not be necessary, of course, in a study with only a single hypothesis.) In
general, each heading would then be followed by several elements:
Declarative Knowledge
On the pretest, no differences on the declarative knowledge variables were
found between subjects. On the immediate posttest, there was an effect of
the dispersion of the examples on the number of characteristics of wind-
flowers mentioned on the declarative knowledge test, F(1, 45) = 4.62, p <
.03. Subjects in the narrow-dispersion conditions remembered more char-
acteristics than did subjects in the wide-dispersion conditions (see Table
4). On the delayed posttest, no effects were noticed. [Ranzijn, 1991, p. 326]
Results
Table I shows the means and standard deviations for rule recall and appli-
cation. A significant difference for prior achievement was found, F(1, 34)
= 16.74, p < .0005. A prior achievement-by-scale interaction was also
detected, F(1, 34) = 6.63, p < .01. Below average students scored higher
across both the rule and application scales, but proportionally higher on
application items.
A significant difference in instructional time was found for CAI strat-
egy, F(2, 38) = 15.80, p < .001. As shown in Table II, the linear strategy
averaged less time to complete than both the externally controlled adap-
tive strategy, p < .05, and the learner advisement strategy, p < .01. The time
differences between the adaptive and advisement strategies were also sig-
nificant, p < .01. A significant effect was again detected for prior achieve-
ment, F(1, 38) = 4.88, p < .05. Below average students used less time to
complete treatments than low achievement students. [Goetzfried & Han-
nafin, 1985, p. 276]
Results
Table II presents the mean learning scores, rote and conceptual, for the
experimental and control groups (maximum scores on each part were 24).
Subjects who learned in order to teach evidenced significantly greater con-
ceptual learning than subjects who learned in order to be tested (t = 5.42;
df = 38; p < .001), although the two groups did not differ on rote learning
(t = 1.39).
As indicated earlier, subjects were asked to keep track of how long
they spent learning the material, after it was suggested that they spend
approximately 3 hours. Results revealed no difference in the amount of
time spent (t = .69); the experimental group reported spending an average
of 2.55 hours working on the material, and the control group reported
spending an average of 2.71 hours. [Benware & Deci, 1984, p. 762]
Procedural Knowledge
A multivariate analysis of variance (MANOVA) and subsequent univari-
ate analysis of variance (ANOVA) revealed no significant effects on the
pretest nor on the immediate posttest. On the delayed posttest, there was
a significant effect of the dispersion of the examples on the number of cor-
rectly classified color pictures of windflowers (COL.WIND), F(1, 45) =
8.14, p < .006, on the number of correctly classified windflowers (WIND.
TOT), F(1, 45) = 4.81, p < .03, and on the total number of correctly clas-
sified flowers (TOTAL), F(1, 45) = 3.52, p < .06. This means that subjects
who were presented with the widely dispersed video examples classified
more flowers correctly than did subjects who were presented with the
narrowly dispersed examples (see Tables 2 and 3).
The analysis also showed an interaction between the number and dis-
persion of the examples, F(1, 45) = 3.86, p < .05. This means that subjects in
the Narrow-4 condition performed less well than did subjects in the Nar-
row-1 condition on the delayed posttest. However, subjects in the Wide-4
condition performed better than did subjects in the Wide-1 condition.
[Ranzijn, 1991, pp. 325–326]
Results
The results of the analysis revealed a significant three-way interaction, F(2,
72) = 3.48, p ≤ .05. Analyses of simple effects (Kirk, 1982) revealed that
high verbal ability participants in the learner-role condition outperformed
both high ability participants in the teacher-role condition, FR: F(1, 72)
= 4.76, p ≤ .05; FIB: F(1, 72) = 16.90, p ≤ .01; MC: F(1, 72) = 6.33, p ≤ .05.
High verbal ability participants also outperformed low ability participants
in the learner-role condition, FR: F(1, 72) = 7.29, p ≤ .05; FIB: F(1, 72) =
45.07, p ≤ .05; MC: F(1, 72) = 17.05, p ≤ .05. In contrast, low verbal ability
participants in the teacher-role condition outperformed low verbal par-
ticipants in the learner-role condition on the FR test, F(1, 72) = 3.37, p =
.07, and the FIB test, F(1, 72) = 19.30, p ≤ .01, but not on the MC test. (MSe
= 60.81 for all interactions.) There were no significant differences between
high and low ability participants in the teacher-role condition. No other
comparisons were made. [Wiegmann et al., 1992, p. 114]
Results
The achievement test performance of each experimental group was as fol-
lows: (a) incentive motivation group mean = 82.8% (SD = 9.3), (b) learn-
ing strategy group mean = 71.6% (SD = 9.4), and (c) control group mean =
66.9% (SD = 12.6). The analysis of variance (ANOVA) for the difference
between the three group means yielded F(2, 106) = 21.69, p < .001. The
Newman-Keuls test revealed that the incentive motivation group earned
a significantly higher test score (p < .001) than did either of the groups
in the other two conditions. The effect size was near or above 1.00 for
each comparison with the incentive motivation group. The mean score of
the learning strategies group exceeded that of the control group (p < .10).
[Tuckman, 1996a, p. 202]
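The effect sizes reported in the excerpt above can be verified from the means and standard deviations it provides. The sketch below computes Cohen's d using the pooled standard deviation of the two groups being compared; it is offered only as a check on the "near or above 1.00" claim, not as the original authors' computation.

# Minimal sketch: Cohen's d from reported means and standard deviations.
import math

def cohens_d(mean_1, sd_1, mean_2, sd_2):
    pooled_sd = math.sqrt((sd_1 ** 2 + sd_2 ** 2) / 2)
    return (mean_1 - mean_2) / pooled_sd

# Incentive motivation vs. learning strategy, then vs. control.
print(round(cohens_d(82.8, 9.3, 71.6, 9.4), 2))   # about 1.2
print(round(cohens_d(82.8, 9.3, 66.9, 12.6), 2))  # about 1.4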
Additional Results
The number, percentage, and type of revised answers made by the two lev-
els of readers on the multiple-choice test are presented in Table 4. Among
the 53 students in the investigation, 652 revisions were made in answers on
the 76-item test, of which 415 or 64% represented changes from wrong to
right answers, resulting in a net gain in the scores. On the other hand, 139
or 19% of those revisions made were from right to wrong answers, which
lowered scores accordingly. There were 4,028 total responses. Group dif-
ferences with respect to the three types of response changes were small.
Poor readers were more likely than good readers to make a revision of
wrong to right.
Ninety-eight percent of all subjects (96% good and 100% poor read-
ers) changed at least 1 answer; almost two-thirds of the subjects changed
responses to at least 11% of the 76 items. The ratio of subjects gaining to
subjects losing points was 10:3 for poor readers and 5:1 for good readers.
Moreover, the ratio of changes for gains to changes for losses was about
2:1 for both good and poor readers. Simply stated, when subjects of both
groups made revisions, their changes resulted in a net gain in points twice
that of points lost through revision. [Casteel, 1991, p. 306]
The discussion section of a research report considers the nuances and shades of
the findings; it is here, finally, that the researcher and writer have scope to display
their perceptiveness and creativity. A critical part of the research report,
this section is often the most difficult to write, because it is the least structured.
The details of the research dictate content in the introduction, method, and
results sections, but not in the discussion section.
The discussion section, however, does have a frame of reference: It follows
the introduction section. Elements of this discussion must address the points
raised in the introduction. But within this frame of reference, the writer is free
to use whatever art and imagination he or she commands to show the range and
depth of significance of the study. The discussion section ties the results of the
study to both theory and application by pulling together the theoretical back-
ground, literature reviews, potential significance for application, and results of
the study.
Because a research report’s discussion section is such a personalized
expression of a particular study by a particular researcher, it would be unwise
to recommend definite categories for this section like those provided for previ-
ous sections. It may be helpful, however, to identify and describe the various
functions of the discussion section.
orient them to the discussion that follows. Three examples of conclusion sum-
maries appear below:
Conclusions
Classroom behavior appears to differ as a function of size of EMR class-
rooms. These differences are most apparent in communication of EMR
pupils and are furthermore in the direction that might be expected, that
is, more verbalization or gestures in smaller classrooms, less in medium-
size classrooms, and least in the largest classrooms. In attending or nonat-
tending behavior, subjects apparently tended to be more attentive in either
large or small classrooms, as compared to medium-size classes. In class-
room disruption, post hoc differences suggested significantly less misbe-
havior by subjects in smaller classes compared to subjects in medium-size
classrooms, when such behavior involves teachers; but, when it involves
peers, misbehavior appeared significantly less often in medium-size class-
rooms but only when these are compared to larger classrooms. [Forness
& Kavale, 1985, p. 409]
Conclusions
The results of this study show, as expected, that presenting subjects
broadly dispersed examples in an instruction for natural concepts had
a positive effect on the development of procedural knowledge. Subjects
who received broadly dispersed examples classified more objects correctly
on a delayed posttest than did subjects who received examples that were
centered around the prototype. Further, it was shown that the number of
visually presented examples in the instruction did not significantly influ-
ence classification skill. [Ranzijn, 1991, pp. 326–327]
Conclusions
The results indicate that enhancing incentive motivation by giving quizzes
helps students, primarily those with low GPAs, perform better on regu-
lar achievement tests than a prescribed learning strategy that is aimed at
improving text processing. This finding suggests that poorly performing
students do not necessarily lack text-processing skills. Rather, they lack
the motivation to process the text. [Tuckman, 1996a, p. 206]
Discussion to Interpret
What do the study’s findings mean? What might have been happening within
the conduct of the study to account for the findings? Why did the results turn
out differently from those hypothesized or expected? What circumstances
accounted for the unexpected outcomes? What were some of the shortcomings
of the study? What were some of its limitations? The discussion section must
address these kinds of questions.
It must offer reasoned speculation. It may even include additional analy-
ses of the data, referred to as post-hoc analyses, because they are done after
the main findings are seen. Such analyses are introduced into the discussion
section to support the interpretation and to account further for findings that
on the surface appear inconsistent or negative with respect to the researcher’s
intentions.
For example, in one study, a researcher hypothesized that, among students
whose parents work in scientific occupations, males are more likely to choose
an elective science course than females. No mention was made of students
whose parents do not work in scientific occupations. The hypothesis was not
supported, however; students whose parents worked in scientific occupations
were as likely to choose a science elective whether they were male or female.
This finding contradicted other researchers’ prior findings of gender differ-
ences. To try for clarification, the researcher ran a post-hoc analysis of students
whose parents were not employed in science fields and presented the results in
the discussion section. As in the prior studies, the researcher found that males
chose science more frequently than did females. Because this analysis was not
planned in advance but occurred after and as a result of seeing the planned
analyses, it was considered post hoc and placed in the discussion section.
Review some further examples:
The results of this study showed that working in groups increased the per-
formance of middle self-efficacy subjects whereas the performance of high
and low self-efficacy subjects decreased. Why should shared outcomes or
shared fate have the effect of helping those subjects average in self-efficacy
while hindering those either high or low in self-efficacy? It is not uncom-
mon for groups to have an averaging or leveling effect on the performance
of individual group members. High believers are discouraged from work-
ing up to their level of expectation because their extra effort may bear no
fruit if their teammates perform at a lower level. Low believers may feel
that their effort is unnecessary because their teammates will “carry them.”
Only those in the middle may see the benefits of having partners and the
benefits of performing at a level higher than they might if they were work-
ing alone. [Tuckman, 1990a, pp. 295–296]
An alternative explanation for the present findings is that low ability stu-
dents benefited from assuming the teacher role simply because the role
allowed them to be more of a leader or more in charge of the sequence of
activities during the interaction. Therefore, the teacher role simply served
to disinhibit low ability students from contributing to the interaction and
motivated them to learn the material. On the other hand, high ability stu-
dents benefited more from assuming the learner role because the role not
only removed the burden of teaching lower ability students but also pro-
vided them with a partner who was more likely to contribute to the inter-
action. [Wiegmann et al., 1992, p. 114]
Another issue that affects the interpretation of the results is whether the
quizzes functioned as a direct training aid or targeted study guide for the
achievement tests rather than as an incentive to study. Unavoidably, some
quiz items covered content also covered on one of the exams, as did terms
that were studied in the homework assignments in the learning strategy
condition. However, the nature of the items was quite different in the two
experimental conditions. In the quizzes, students were given the definition
of a conditioned stimulus and asked to supply the term conditioned stimu-
lus as the correct answer. In the homework assignments, the students were
asked to both define and elaborate on the term conditioned stimulus. On
the achievement test, students were given one actual, unfamiliar example
of conditioning in a natural environment and asked to identify one of its
elements, namely, the conditioned stimulus. If the quizzes were merely
study guides, they should have helped students at all three GPA levels,
particularly those at the middle and low levels. Although they did help the
low-GPA students substantially, they had no effect at all on the middle-
GPA students. [Tuckman, 1996a, pp. 208–209]
How can the differences between these findings and previous research
be explained? It might be that at least some of the differences lie in the
Discussion to Integrate
Not only must the discussion section of a research report unravel findings and
inconsistencies, as part of its interpretation function, but it must also attempt to
put the pieces together to achieve meaningful conclusions and generalizations.
Studies often generate disparate results that do not seem to “hang together.”
A report’s discussion section should include an attempt to bring together the
findings—expected and unexpected, major and ancillary—to extract meaning
and principles. Some brief examples illustrate this kind of material:
ability of students to influence one another may have been weaker than in
past cooperative learning studies. [Tuckman, 1990a, p. 296]
Discussion to Theorize
More theoretically, results seriously question the fundamental and still pop-
ular Galtonian view that a test is no more or less than a sample of the exam-
inee’s responses to a standard nonpersonal stimulus. Although the present
study required examiners to administer the CELF in accordance with the
user’s manual, the dissimilar performance of handicapped, but not non-
handicapped, children across familiar and unfamiliar tester conditions sug-
gests that the two groups attributed different meanings to the tester and test
situation. This suggestion that the “standard” test condition was perceived
differently by the handicapped and nonhandicapped seems reasonable if it
is appreciated that, by requiring the speech- and/or language-impaired chil-
dren to respond to the CELF, handicapped subjects were asked to reveal
their disabilities. In contrast, nonhandicapped subjects, by definition, were
presented with an opportunity to demonstrate competence. Such a concep-
tualization is consonant with Cole and Bruner’s (1972) theoretical work,
which argued that, despite efforts to objectify tests, select subgroups of the
population will subjectivize them in ways that reflect their unique experien-
tial backgrounds. [Fuchs et al., 1985, pp. 195–196]
If poorly performing students lack the motivation to process the text, why
would regular quizzes activate motivation? Overmier and Lawry’s (1979)
theory of incentive motivation states that incentives can motivate perfor-
mance by mediating between a stimulus situation and a specific response.
Assuming that students in the incentive motivation condition value doing
well on quizzes (and based on informal discussions with students follow-
ing Experiment 1, it would appear that they do), they would be motivated
to apply their existing text-processing skills and thereby learn more. The
text-processing homework assignments, although performed well by the
students in the learning strategy condition, apparently had less incentive
value to motivate students to achieve success or avoid failure. The goal of
completing homework was primarily its completion. It was not associated
with the same consequences for success and failure as quizzes. [Tuckman,
1996a, pp. 206–207]
5. Dissertation writers often choose to follow the “Discussion” section with a sepa-
rate “Conclusions and Recommendations” section, highlighting their conclusions and
recommendations based on the results their studies produced.
discussion section, typically toward the end, you should examine your findings
in the light of suggested applications, as these examples illustrate:
From an applied perspective, the findings of this study leave the potential
motivator of students in a quandary. How does one tailor-make or customize
motivational conditions to the needs of students differing in levels of self-
efficacy? If using groups helps those in the middle self-efficacy level, using
goal-setting helps those at the low level, and leaving them to their own devices
helps those at the top, how then can all three techniques be employed at the
same time? The answer may lie in not trying to affect all students in the same
manner.
One suggestion is to identify those students who are low in academic
self-efficacy based on their past lack of self-regulated performance and
work with them separately, possibly after class, to engage them in the goal-
setting process. Such efforts should focus on helping these students to set
attainable goals and to specify when and where they will engage in neces-
sary goal-related performance. As the students least likely to perform on
their own, these would be the ones on which to expend one’s primary
effort.
Cooperative group assignments on self-regulated tasks should per-
haps be used on a voluntary basis so students can choose whether or not
they wish to bind their fate to others or to work alone. This would enable
students of average self-efficacy to gain the support of group members
without simultaneously hampering those at either high or low self-efficacy
levels. The recommendation, therefore, is to personalize or individualize
motivational enhancement efforts to the greatest degree possible. [Tuck-
man, 1990a, pp. 297–298]
The results of the two experiments suggest that achievement among stu-
dents of college age, particularly those who tend to perform poorly, can
be enhanced by increasing their incentive motivation to study the text on
a regular basis and that frequently occurring quizzes, as used in this study,
may be an effective technique for enhancing incentive motivation. Because
quiz grades appear to constitute a strong study incentive for college stu-
dents, frequent testing may be a better inducement for effective and timely
processing of textbook content than using homework assignments as a
required strategy for this purpose. [Tuckman, 1996a, p. 209]
Finally, further work may help in understanding the areas discussed here.
One important project would be applying the analyses used here to data
from earlier years. It could also be important to look at underachievement
in specific courses, to determine other variables that influence variations
in underachievement, and to examine the long-range implications of ado-
lescent underachievement, especially its relation to educational and occu-
pational attainment. [Stockard & Wood, 1984, pp. 835–836]
■ The References
Every item in the reference list must be specifically cited in the text and vice
versa. To see how this format treats other types of references (for example,
dissertations, government reports, convention papers), obtain a copy of The
Publication Manual of the APA or examine references for articles appearing in
the Journal of Educational Psychology. Many journals outside of psychology
now use The Publication Manual of the APA as a stylistic guide (for example,
American Educational Research Journal).6
■ The Abstract
■ Preparing Tables
Tables are extremely useful tools for presenting the results of statistical tests as
well as mean and standard deviation data. The results of analyses of variance
and correlations (when sufficient in number) are typically reported in tabular
form.
Table 13.1, an analysis of variance table, indicates the source of variance
along with many supporting data: degrees of freedom (df) associated with each
source, sums of squares of the variance, mean squares of the variance for effects
and error terms, F ratios for both main effects and interactions, and p values.
The study from which this table came evaluated two treatment methods (reader
type and response type) as they affected the number of test answers revised
by students. The same author also prepared Table 13.2, an example of a table
of means, and a statistical comparison of results combined into a single table.
Similarly, Table 13.3 combines means and statistical results but also includes
standard deviations. Table 13.4 displays analysis of variance results, while Table
13.5 provides accompanying means and standard deviations. Table 13.6 is an
example of a tabular display of correlations.
Two additional examples illustrate tabular presentations of other kinds
of results. (The samples shown in this chapter do not necessarily illustrate all
TABLE 13.4 ANOVA Results for Each Achievement Test and the Combined
Achievement Tests, by Condition and GPA level
Test         Condition (df = 1)    GPA Level (df = 2)    Interaction (df = 2)    Error (df = 109)
Test 1
MS 4.82 56.23 41.31 11.95
F 0.04 4.71b 3.46b
Test 2
MS 32.54 82.94 24.61 7.58
F 4.29b 10.94c 3.25b
Test 3
MS 51.91 91.40 17.21 12.88
F 4.03b 7.10c 1.34
Combined
MS 185.05 667.29 213.65 65.64
F 2.82a 10.16c 3.25b
ᵃp < .10; ᵇp < .01; ᶜp < .001
Source: Tuckman (1996a).
possible kinds of tables.) Table 13.7 shows a contingency table used in con-
junction with a chi-square analysis, and Table 13.8 shows a table of frequency
counts.
For further information on preparing tables, see The Publication Manual
of the American Psychological Association (6th ed., 2009). This source also gives
instructive guidance for preparing figures. For further input, examine tables
and figures that appear in journal articles.
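Analysis-of-variance tables such as Table 13.1 (reader type by response type, with number of revised answers as the dependent variable) are typically produced directly by statistical software. The sketch below, which assumes the pandas and statsmodels packages and uses fabricated data, shows one way to generate the source, df, sum-of-squares, mean-square, F, and p columns for a two-way design.

# Minimal sketch (assumes pandas and statsmodels): a two-way ANOVA table
# for a reader-type by response-type design. The data are fabricated for
# illustration only.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

data = pd.DataFrame({
    "reader": ["good"] * 6 + ["poor"] * 6,
    "response": (["multiple choice", "completion"] * 3) * 2,
    "revisions": [12, 9, 14, 8, 11, 10, 15, 6, 13, 7, 16, 5],
})

model = ols("revisions ~ C(reader) * C(response)", data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)      # sum_sq, df, F, PR(>F)
anova_table["mean_sq"] = anova_table["sum_sq"] / anova_table["df"]
print(anova_table)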
Tables often play a useful role in a research report’s method section by
depicting a complex arrangement among conditions, an experimental design,
a sequence of procedures, or numbers and characteristics of subjects in a com-
plex study. Table 13.9 illustrates an application of a table to display the number
TABLE 13.5 Means and Standard Deviations for the Incentive Motivation and
Learning Strategy Groups on the Three Achievement Tests and the Combined
Achievement Tests
Condition               Test 1    Test 2    Test 3    Combined Achievement Tests
Incentive motivation
(n = 56)
M 73.0 79.6 76.5 76.5
SD 11.0 8.5 9.7 9.7
Learning strategy
(n = 59)
M 73.0 75.7 72.0 73.6
SD 11.9 10.5 13.9 12.1
Source: Tuckman (1996a).
[Note and source for the correlation table (Table 13.6) referenced above: N = 173; ᵃp < .05; ᵇp < .01; ᶜp < .001. Source: Adapted from Pintrich and DeGroot (1990).]
of subjects by grade taking part in a large, complex study as well as the means
and standard deviations for the separate groups on a number of control vari-
ables. This type of table should appear in the method section under the heading
“Subjects” to help clarify the details of complex studies.
Figures often provide useful tools for presenting research results, as well. Data
collected over time are often amenable to graphic presentation, as are data dis-
playing statistical interactions, means, and so on.
TABLE 13.8 Number of Pupils and Teachers in the St. Louis Metropolitan Area, 1969–1982
Figure 13.1 illustrates the use of a bar graph to display means in a way that
highlights an interaction between an independent variable (condition: incen-
tive motivation versus learning strategy) and a moderator variable (grade point
average: high versus medium versus low). The figure shows how these vari-
ables affected the study’s dependent variables (scores on three achievement
tests). Representing three variables simultaneously can be a difficult job, but it
is done with great clarity in this illustration.
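A bar graph of the kind shown in Figure 13.1 can be produced with a few lines of plotting code. The sketch below assumes the matplotlib library; the means plotted are invented placeholders, not the values from the original figure, and only one of the three tests is shown.

# Minimal sketch (assumes matplotlib): a grouped bar graph showing a
# condition-by-GPA interaction. The means below are invented placeholders.
import numpy as np
import matplotlib.pyplot as plt

gpa_levels = ["High", "Medium", "Low"]
incentive_means = [84, 82, 81]       # hypothetical
strategy_means = [83, 75, 66]        # hypothetical

x = np.arange(len(gpa_levels))
width = 0.35

fig, ax = plt.subplots()
ax.bar(x - width / 2, incentive_means, width, label="Incentive motivation")
ax.bar(x + width / 2, strategy_means, width, label="Learning strategy")
ax.set_xticks(x)
ax.set_xticklabels(gpa_levels)
ax.set_xlabel("Grade point average")
ax.set_ylabel("Mean test score")
ax.legend()
plt.show()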
Figure 13.2 shows another complex interaction of variables. The graph
clearly illustrates the interaction between the two variables in question (that
is, handicapped versus nonhandicapped and familiar versus unfamiliar testing
conditions) on the dependent measure (CELF score). Certainly, a report writer
might struggle to say in words what this graph depicts with relative ease.
Figures can also be used effectively in the method section to illustrate
tasks or other aspects of methodology. Figure 13.3 illustrates the types of test
items used to measure each of two dependent variables, high-level mathematics
achievement and low-level mathematics achievement.
FIGURE 13.1 Mean Test Scores for the Two Treatment Groups on Each of the Three
Tests Across Three Levels of Grade Point Average (GPA)
Source: From Tuckman (1996a). Reprinted by permission of the author and publisher.
Source: From Peterson and Fennema (1985). Reprinted by permission from the publisher.
■ Summary
■ Recommended References
American Psychological Association. (2009). Publication manual (6th ed.). Washing-
ton, DC: American Psychological Association.
Calfee, R. C., & Valencia, R. R. (1991). APA guide to preparing manuscripts for journal
publication. Washington, DC: American Psychological Association.
Campbell, W. G., & Ballou, S. V. (1990). Form and style: Theses, reports, term papers
(8th ed.). Boston, MA: Houghton Mifflin.
Dees, R. (1993). Writing the modern research paper. Boston, MA: Allyn & Bacon.
Henson, K. T. (1995). The art of writing for publication. Boston, MA: Allyn & Bacon.
Locke, L. F., Spirduso, W. W., & Silverman, S. J. (1993). Proposals that work: A guide
for planning dissertations and grant proposals (3rd ed.). Newbury Park, CA: Sage.
Turabian, K. L., & Honigsblum, B. B. (1987). A manual for writers of term papers, the-
ses, and dissertations (5th ed.). Chicago, IL: University of Chicago Press.
PART 5
ADDITIONAL APPROACHES

CHAPTER FOURTEEN
Conducting Evaluation Studies

OBJECTIVES
The labels formative and summative describe two types of evaluation. Forma-
tive evaluation refers to an internal evaluation of a program, usually under-
taken as part of the development process, that compares the performance of
participating students to the objectives of the program. Such analysis attempts
to debug learning materials or some other form of program under develop-
ment by trying them out on a test group. Such tryouts enable the developers to
tell whether the materials work as expected and to suggest changes. Formative
evaluation often leads a program developer “back to the drawing board.”
Summative evaluation, or demonstration,1 is a systematic attempt to determine
whether a fully developed program is meeting its objectives more successfully
than alternative programs (or no program at all) would. Summative evaluation
uses the comparison process to evaluate a fully implemented program.
1. When Chapter 1 linked the terms evaluation and demonstration, it was referring to
summative evaluation.
2. This chapter makes no attempt to survey the literature on evaluation and describe all
possible evaluation models. Rather, it explains one model for evaluating educational pro-
grams. Of course, a researcher might choose to implement alternative models described by
Tuckman (1985).
Step 1 Identification of the program’s aims and objectives (the dependent variable)
Step 2 Restatement of these aims and objectives in behavioral terms (an opera-
tional definition)
Step 3 Construction of a content valid (or appropriate) test to measure the behav-
iorally stated aims and objectives (measurement of the dependent variable)
Step 4 Identification and selection of a control, comparison, or criterion group
against which to contrast the test group (establishing the independent
variable)
Step 5 Data collection and analysis
3. The terms program and intervention are used interchangeably, although an educational
program is only one form of intervention.
expect of students who have completed the experience that we do not expect of
students who have not?” They may look to the developer of the intervention
to help them answer this question.
Thus, the first step in the summative evaluation process is to approach the
people who will implement the intervention and ask, “What aims and objec-
tives should this intervention accomplish? What abilities do you expect stu-
dents to gain by experiencing the program?” In response to such questions,
they may respond in several ways; examples include: (1) The program will help
the students develop an appreciation of art. (2) It will help them to enhance
their understanding of themselves. (3) It will provide them with the skills they
need to enter the carpentry trade. (4) It will increase their chances of becoming
constructive citizens. (5) They will know more American history than they did
before they started. (6) They will increase their interest in science.4
Each of these statements is an example of the kinds of aims that program
implementers identify and their likely ways of expressing their intentions.
Thus, Step 1 identifies the dependent variable of the evaluation, but largely in
conceptual (vague and ambiguous) terms that are difficult to measure.
In completing the first step, the researcher has identified the dependent variable
for the evaluation. He or she has also made substantial progress toward formu-
lating a hypothesis about the dependent variable, stating that it should attain a
certain magnitude after the subjects experience the intervention exceeding that
for comparable subjects who have experienced some other or no other inter-
vention. The next step is to produce an operational definition of this dependent
variable, which will move the evaluator one step closer to the concrete terms
and dimensions on which he or she can base the development or selection of
valid measures.
In completing this second step, the evaluator asks some questions of himself
or herself and the program’s implementers (and occasionally of its developers):
How can we tell whether the aims and objectives of the intervention, outlined
previously, have been achieved? What observable and measurable behaviors
will the students exhibit if these aims and objectives have been achieved that
they will not exhibit if these aims and objectives have not been achieved? At
this stage, the evaluator does not ask, “How will they be different after the
intervention?” Instead, she or he asks, “What difference can we see in them?”
4. The subsequent measurement stage should also look for unintended or unanticipated
outcomes, because these often occur, and information about them helps in the evaluation
process.
Unfortunately, no one can look inside the heads of the students to deter-
mine whether they appreciate, understand, are interested in, or are motivated
by the program under evaluation. Judgments are limited to their overt actions
and self-reports—that is, an evaluator can only study their behavior. Any con-
clusions about thoughts, fears, and the like can only be inferred from some
study of behavior. Thus, the aims and objectives of the intervention must be
operationally defined in behavioral terms. The conceptual (vague and ambig-
uous) statements of aims and objectives must be replaced by statements of
behavior.
In practice, an intervention of any size will likely have many aims or
objectives, rather than just one. Moreover, in transforming these objectives
into statements of behaviors that define them or imply their presence, evalua-
tors often must deal with a number of behaviors associated with each aim and
objective rather than with one behavior per objective. For this reason, evalua-
tion requires that they articulate a series of behavioral objectives that will rep-
resent the identified dependent variables.
The first criterion for such an operational definition requires an explicit
statement in specific behavioral terms. That is, the definition must include an
action verb, as in the following example: “Upon completion of the program,
the student will be able to (1) identify or point to something with specified
properties; (2) describe or tell about those properties; (3) construct or make
something with those properties; or (4) demonstrate or use a procedure of a
particular nature.” Words like identify, describe, construct, demonstrate, and so
on are action verbs that indicate behavior. They are required elements of oper-
ationally defined behavioral objectives. To specify something in behavioral
terms, use behavioral words that specify doing rather than knowing. Words
such as know, appreciate, and understand are not action verbs for behaviors, so
they should not appear in operational definitions.
Figure 14.2 lists some suggested action verbs originally compiled by the
American Association for the Advancement of Science. (The specific illustra-
tions of the use of each one have been added by the authors.) By basing opera-
tional definitions on these action verbs, researchers can be sure they are writing
behavioral objectives. In addition, this standardization enables researchers to
compare objectives with a degree of certainty that a specific word has the same
meaning in various experimental situations.
The second element of a behavioral objective is the specific content in
which students will show mastery or competence. What should a student be
able to identify after completing the program under evaluation? What should a
student be able to describe? What should a student be able to construct?
The third element of the objective is a specification of the exact conditions
under which the student will exhibit the expected behavior: “Given a list of 20
FIGURE 14.2 Suggested Action Verbs with Illustrations of Their Use

Identify: Given a list of eight statements, the student shall identify all that are instances of hypotheses.
Name: The student shall name four statistical tests for comparing two treatments with small n's and outcomes that are not normally distributed.
State a Rule: The student shall state a rule limiting the transformation of interval, ordinal, and nominal measurements, one to the other.
Order: Given a list of ten statements, the student shall order them in the correct sequence to represent the research process.
Demonstrate: Given a set of data, the student shall demonstrate the procedure for their analysis using analysis of variance procedures and a worksheet.
Construct: Given the following set of data for processing by analysis of variance, the student shall construct computer instructions for an ANOVA program.
Apply a Rule: Given the following set of interval data, the student shall convert them to nominal measures (high, middle, and low) using the rule of the tertile split.
Interpret: Given the following set of analyzed data and hypothesis, the student shall interpret the outcome of the experiment and the support it provides for the hypothesis.
items, the student shall identify . . .” or “Using the following pieces of equip-
ment, the student shall construct or demonstrate . . . .” These examples illus-
trate how an operational definition must specify conditions.
Finally, if possible, a behavioral objective should specify the criterion for
judging satisfactory performance, such as the amount of time allowed for a stu-
dent to complete a task and how many correct responses he or she should make
in that amount of time. However, at this stage of behavioral objectification, an
acceptable operational definition may include only an action verb, a statement
of content, and any specific conditions.
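The four elements just described lend themselves to a simple structured record. The sketch below is illustrative rather than prescriptive: the verb list draws on the action verbs in Figure 14.2, the rejected verbs come from the discussion above, and the sample criterion is hypothetical.

```python
# Minimal sketch: a behavioral objective as a structured record with a check
# that the action verb names observable behavior.
from dataclasses import dataclass
from typing import Optional

ACTION_VERBS = {"identify", "name", "state", "order", "demonstrate",
                "construct", "apply", "interpret", "describe"}
NON_BEHAVIORAL = {"know", "understand", "appreciate"}

@dataclass
class BehavioralObjective:
    action_verb: str                  # e.g., "identify"
    content: str                      # what the student must act on
    conditions: str                   # circumstances of performance
    criterion: Optional[str] = None   # standard for satisfactory performance

    def is_behavioral(self) -> bool:
        verb = self.action_verb.lower()
        return verb in ACTION_VERBS and verb not in NON_BEHAVIORAL

objective = BehavioralObjective(
    action_verb="identify",
    content="all statements that are instances of hypotheses",
    conditions="given a list of eight statements",
    criterion="at least seven of eight correct",  # hypothetical criterion
)
print(objective.is_behavioral())  # True
```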
Evaluators should not discourage those who implement a program from
stating creative and imaginative goals for the evaluation simply to avoid the
difficulty in restating them in behavioral terms. For instance, an objective for a
program intended to heighten students’ awareness of form in art should emerge
from identified behaviors that would indicate attainment of this goal. (“Given a
painting, a student will describe it in part by identifying its form.”) Because the
program implementers often look for subjective evidence of the attainment of
these creative or imaginative goals, the evaluator must work with them or other
experts to identify behaviors associated with these outcomes.
Thus, the second step in the suggested evaluation model is to convert the
aims and objectives that represent the dependent variable into concrete and
observable statements of behavior—that is, to transform the dependent vari-
able statement into operational definitions or behavioral objectives.
FIGURE 14.3 Sample Behavioral Objectives and Content-Valid Test Items for Each
5. Researchers ordinarily think of tests as tools for evaluating individuals and their per-
formance. However, when a group of individuals who have commonly experienced an inter-
vention or training program take a test, one can pool their test data and examine them as
group indicators. Analysis with proper comparisons (as discussed in the next section) allows
such test data to contribute to an evaluation of the intervention or program.
Up to this point, the chapter has explained a process that in and of itself could
serve as a formative evaluation. The next step, the comparison process, truly
distinguishes formative from summative evaluation. The important differ-
ence is that summative evaluation is more than an attempt to describe behav-
iors that a student has acquired as a result of specific program experiences;
it goes further and judges the level of performance of these acquired behav-
iors against some standard of success or effective performance. Thus, unlike
formative evaluation, summative evaluation distinctly implies comparison of
some sort.
Evaluators can contrast results for three kinds of groups with those of the
experimental group to assess the effect of the treatment or intervention: con-
trol, comparison, and criterion groups. A control group is a group of Ss who
have not experienced the treatment or any similar or related treatment. A con-
trast between a treatment and a control group attempts to answer questions
like, “Would the behavioral objectives of the program have been met even if the
program had not occurred? Can these objectives be expected to occur sponta-
neously or to be produced by some unspecified means other than the program
under evaluation?” Students who complete a program may show more abili-
ties than they showed before they completed the program due to the effects of
history or maturation—sources of internal invalidity. To ensure that neither
history nor maturation is responsible for the change and pin down responsi-
bility to the intervention or program, an evaluator can compare results for an
equivalent group of individuals who have not experienced the program against
results for those who have experienced it. This is control group contrast.
Very often, however, problems in evaluation take a somewhat different
form, posing the question, “Is the treatment or program producing stronger
behaviors or the same behaviors more efficiently than would be possible with
alternative programs or interventions?” When stated in this way, the prob-
lem moves beyond control to involve comparison. Thus, the evaluator could
compare results for an intervention or program group to those for a group
of students who have presumably been trained to attain the same behavioral
objectives in a different (and in many cases more traditional) way. A compari-
son of performance results for the two groups would answer the question, “Is
the new way better than the old way?”
Occasionally, evaluation questions take even a third form in which the
standard for judgment refers to some ideal state that students should attain.
Such a question might ask, “Have vocational students developed job skills suf-
ficient for reasonable success in the occupations for which they were trained?”
If the objective of a program is to develop enough competence in calculus to
allow students to solve specific problems in physics and aerodynamics, an eval-
uator might ask, “How does the students’ knowledge of calculus compare to
that of people who succeed at solving physics and aerodynamics problems?”
Questions like these ask for contrasts, not with results for a comparison group
that has completed an alternative experience, but with results for a criterion
group that displays the behavior in another context, namely, applications of
the knowledge to be acquired in the treatment (that is, calculus) to physics and
aerodynamics problems.
Very often evaluations of vocational or professional programs seek to eval-
uate progress toward objectives of preparing individuals for on-the-job com-
petence. To make this judgment, the evaluator chooses a criterion group from
among workers who demonstrate such competence in practice. Of course, he
or she must identify these individuals as a criterion group using a measuring
instrument other than the one developed to evaluate the intervention. Typi-
cally this group is chosen on the basis of criteria such as supervisors’ judg-
ments, promotion rates, salaries, or indications of mastery other than direct
measurement of competence and skill.
Questions of Certainty
An evaluator cannot assume that one or more classes received a set of experi-
ences while others did not. Simply because teachers are told to teach in a certain
way, for example, or are even trained to teach that way, one cannot automatically
conclude that they did in fact teach that way. Nor can one assume that teachers
not so trained will not themselves display the same teaching behaviors as the
trained teachers, out of habit or previous experience.
Evaluators must assure themselves that the independent variable has indeed
been fully implemented. To accomplish this goal, they must observe or oth-
erwise measure the characteristics that represent the essentials of the inter-
vention or treatment to ensure that those characteristics were always present
in the treatment condition and always absent in the control or comparison
condition. (Refer again to the last section of Chapter 7 for a discussion of this
procedure.)
Sampling
A researcher who has access to a single class often chooses to test an instruc-
tional intervention on that class. This situation would be a convenient setting
for naturalistic observation and exploratory work, but summative evaluation
requires the opportunity to control variables, which is difficult with a single
class. Some researchers may identify two sections of the same class and apply
the intervention in one while teaching the second by conventional methods.
This is another difficult situation to treat fairly; the researcher’s biases may be
showing by the time he or she gathers final results. Comparing one’s own class
taught experimentally to a colleague’s taught conventionally does not permit
the separation of treatment effects from teacher effects or student selection
effects. A better evaluation method would randomly assign two pairs of classes
to the experimental and control conditions:
However, this procedure for assigning classes to conditions is not the best
to control for invalidity due to student selection. In effect, it uses the class as
the unit of analysis, because that is the unit of assignment, and reduces the total
number of observations to four. A better procedure would pool all the students
and then randomly assign each to one of the four groups. This random assign-
ment adequately controls for student selection effects:
A compromise between the two methods starts with intact classes, but it
randomly divides each class in half and then exposes one-half of each to the
control condition. Normal classroom circumstances often create difficulties,
however, for teaching each half of a class in a different way.
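The assignment strategies discussed above can be sketched in a few lines. The class rosters below are hypothetical placeholders; the point is only to show random assignment from a pooled roster and the compromise of splitting an intact class in half.

```python
# Minimal sketch: (1) pool students and randomly assign each to one of four
# groups; (2) split an intact class in half at random.
import random

random.seed(1)  # reproducible illustration

class_a = [f"A{i}" for i in range(1, 26)]   # hypothetical rosters
class_b = [f"B{i}" for i in range(1, 26)]
class_c = [f"C{i}" for i in range(1, 26)]
class_d = [f"D{i}" for i in range(1, 26)]

# Preferred procedure: pool all students, then assign each at random
# to one of four groups (two experimental, two control).
pool = class_a + class_b + class_c + class_d
random.shuffle(pool)
groups = [pool[i::4] for i in range(4)]

# Compromise procedure: keep the class intact but randomly split it in half,
# exposing one half to the experimental condition.
def split_in_half(roster):
    shuffled = roster[:]
    random.shuffle(shuffled)
    midpoint = len(shuffled) // 2
    return shuffled[:midpoint], shuffled[midpoint:]

experimental_half, control_half = split_in_half(class_a)
print(len(groups[0]), len(experimental_half), len(control_half))
```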
Reliability
All measuring instruments contain errors that affect their accuracy. Error
is quantified as a reliability coefficient, as explained in Chapter 10. Evalu-
ation, in particular, involves observational variables and instruments. To
establish the reliability of judgments made with these observational instruments,
researchers check the degree of agreement among independent observers who
code or rate the same events.
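One common way to quantify such agreement is to compute percent agreement and Cohen's kappa for two observers who coded the same sequence of events. The sketch below uses hypothetical codings.

```python
# Minimal sketch: percent agreement and Cohen's kappa for two observers
# coding the same ten classroom events. The codes are hypothetical.
from collections import Counter

observer_1 = ["on-task", "on-task", "off-task", "on-task", "off-task",
              "on-task", "on-task", "off-task", "on-task", "on-task"]
observer_2 = ["on-task", "off-task", "off-task", "on-task", "off-task",
              "on-task", "on-task", "on-task", "on-task", "on-task"]

n = len(observer_1)
observed_agreement = sum(a == b for a, b in zip(observer_1, observer_2)) / n

# Agreement expected by chance, from each observer's marginal proportions.
counts_1, counts_2 = Counter(observer_1), Counter(observer_2)
categories = set(observer_1) | set(observer_2)
expected_agreement = sum((counts_1[c] / n) * (counts_2[c] / n) for c in categories)

kappa = (observed_agreement - expected_agreement) / (1 - expected_agreement)
print(f"percent agreement = {observed_agreement:.2f}, kappa = {kappa:.2f}")
```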
FIGURE 14.5 Sample Designs for Evaluation Studies: (A) Posttest-only control group
design; (B) Nonequivalent control group design
Statistical Analysis
This analysis would yield information on three effects: (1) the main effect
of X—that is, whether the innovation (X2) was more effective overall than the
comparison (X1); (2) the main effect of Y—that is, whether the high group on
the moderator variable (Y2) overall outperformed the low group (Y1); and (3)
the interaction of X and Y—that is, whether the group high in the moderator
variable experiencing the innovation (X2Y2) differed as much from the group
high in the moderator variable receiving the comparison condition (X1Y2) as
the group low in the moderator variable experiencing the innovation (X2Y1) dif-
fered from the group low in the moderator variable experiencing the compari-
son (X1Y1). When an interaction effect occurs, the result looks like Graph A or
B in Figure 14.6; when it does not occur, it looks like Graph C.7
6. Where after-the-fact comparisons show nonequivalence between the groups on rel-
evant selection factors, a researcher can adjust somewhat for differences by analysis of
covariance procedures.
7. Following the analysis of variance, it would be possible to do multiple range tests such
as the Newman-Keuls Multiple Range test or the Scheffé test in order to compare the three
means simultaneously using the error term from the analysis of variance. These techniques
are aptly described in Winer, Brown, and Michels (1991) and other statistics books. Where
pretest data are available, a researcher may conduct analysis of covariance of the posttest
scores with pretest scores as the covariate.
                         Type of Instruction
Content Area          Individualized        Conventional
Physics
Chemistry
Suppose your results revealed that the main effect for type of instruc-
tion was significant for both knowledge and attitude, and that it was based
on superior performance following individualized instruction in contrast to
conventional instruction. You could then conclude that the individualized
basic science course was more effective in improving science knowledge and
attitudes in community-college students than was the conventionally taught
version.
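The two-way analysis implied by this design can be sketched with standard statistical libraries. The sketch below assumes pandas and statsmodels are available; the knowledge scores are hypothetical, and only one of the two dependent variables (knowledge) is shown.

```python
# Minimal sketch: 2 x 2 analysis of variance for type of instruction
# (individualized vs. conventional) crossed with content area (physics vs.
# chemistry), with hypothetical knowledge scores as the dependent variable.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

data = pd.DataFrame({
    "instruction": ["individualized"] * 8 + ["conventional"] * 8,
    "content":     (["physics"] * 4 + ["chemistry"] * 4) * 2,
    "knowledge":   [78, 82, 75, 80, 74, 79, 77, 81,
                    70, 68, 72, 74, 69, 71, 66, 73],
})

model = smf.ols("knowledge ~ C(instruction) * C(content)", data=data).fit()
print(sm.stats.anova_lm(model, typ=2))   # main effects and the interaction
```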
8. This illustration represents an actual evaluation. (See Tuckman and Waheed, 1981.)
■ Summary
All but the last question below are to be answered on the basis of the sample
evaluation report, “Evaluating Developmental Instruction,” which appears
below.
1. What dependent variables did the evaluation include, and how closely did
they fit the goals of the program under evaluation?
2. How accurately were the dependent variables measured?
3. What treatment was evaluated, and to what was it compared?
4. What evidence, if any, was offered that the treatment operated as intended?
5. What experimental design did the evaluation implement? How well did it
suit the situation, that is, how adequate were the controls?
6. Did the evaluation include a moderator variable? If so, name it.
7. What statistical test would you have used for this design?
8. Design an evaluation study to evaluate this book. Describe each step in the
process, being as concrete as possible.
The project, modeled on the British infant school approach, was tested in two
elementary schools and included Grades 1 through 3 in one and Grades 1
through 5 in the other. For comparison purposes, the evaluator identified regu-
lar classrooms in Grades 1 through 5 of a matched control school in the same
community. The study was aimed at comparing developmental classrooms to
regular classrooms in terms of both process, that is, the behavior of teachers
presumably resulting from training, and product, the behavior of students pre-
sumably resulting from the behavior of teachers.
The study attempted “conversion” of teachers to the developmental
approach by means of in-service training and ongoing supervision. An initial
trip to England was followed up by visitations, and regular evening programs
throughout the year. Teachers so trained were expected to foster more diver-
sity and flexibility in their classrooms than would regular classroom teachers.
Hence, their teaching was expected to yield more positive student attitudes and
higher achievement.
Two classrooms at each grade level (1 through 5) in each school were ran-
domly selected from among available classrooms. Subsequently, comparisons
were made for grade levels 1 through 3, with two experimental schools and
one control, and grade levels 4 and 5, with one experimental school and two
controls. (Grades 4 and 5 in one of the experimental schools served as a second
control in the Grade 4 and 5 comparison.) The table summarizes the experi-
mental design:
■ Recommended References
Gredler, M. E. (1996). Program evaluation. Englewood Cliffs, NJ: Prentice-Hall.
Sanders, J. R. (1992). Evaluating school programs: An educator’s guide. Newbury Park,
CA: Corwin.
Worthen, B. R., & Sanders, J. R. (1987). Educational evaluation: Alternative approaches
and practical guidelines. New York, NY: Longman.
CHAPTER FIFTEEN
Qualitative Research
Concepts and Analyses
OBJECTIVES
This book has so far focused on methods for systematic, objective, and
quantitative measurement of variables and their relationships. Although
no researcher can ever carry out a totally systematic or totally objective
study, the procedures described have aimed at mirroring variables as objectively
as possible by representing them as numbers or quantities. In some situations,
however, researchers choose to rely on their own judgment rather than quantita-
tive measuring instruments to accurately identify and depict existing variables
and their relationships. This chapter discusses such qualitative research.
Bogdan and Biklen (2006) ascribe five features to qualitative research: (1) The
natural setting is the data source, and the researcher is the key data-collection
instrument. (2) Such a study attempts primarily to describe and only second-
arily to analyze. (3) Researchers concern themselves with process, that is, with
events that transpire, as much as with product or outcome. (4) Data analysis
Phenomenological Emphasis
1. Both labels aptly describe the case study or ethnographic research methodology
described in detail later in this chapter.
Naturalistic Setting
Emergent Theory
Ethnographic research does not set out to test hypotheses. Rather than formu-
lating specific hypotheses on the basis of prior research or preconceived theo-
ries, the ethnographic approach calls for theories and explanations to emerge
from, and therefore remain grounded in, the data themselves (hence the term
grounded theory). Data, taken in context, come first; then the explanations
emerge from intensive examination of the data, providing a natural basis for
interpretation rather than an a priori one. Such an approach is also termed
holistic research, since the data are examined as a whole to find a basis for expla-
nation for observed phenomena. To support appropriate explanations, data
must incorporate “heavy” or detailed description of observations and events
from multiple perspectives so that situations can be reconstructed and reexam-
ined by the researcher after they have occurred.
■ Research Methodology
Consider, for example, the culture of the college classroom and, in par-
ticular, the behavior and performance of the professor toward the students. In
the description step, an ethnoscience researcher would ask students to describe
their teachers: how they teach, how they react to students, what they are like.
Students are also asked to give their opinions of their teachers. The researcher
also attempts, through interviews, to discover what categories students use
in determining what teachers are like and in formulating opinions of them.
For example, students may describe their teachers as using handouts, course
outlines, and schedules, leading to the discovery that students categorize the
behavior of teachers as organized or disorganized; a professor described as
soft-spoken and noncritical may be categorized by students as accepting, and
so on.
In the classification step, the researcher would refer to these categories in
drafting direct, probing questions to determine all the cues and characteristics
that students consider in deciding whether a particular professor is or is not
organized, is or is not accepting, is or is not flexible, and so on. Students would
be asked to classify their professors in terms of the categories discovered in the
previous step.
Finally, the researcher would make comparisons between classifications.
For example, connections might become evident between how organized a
professor is and how much he is liked by his students, or between how dynamic
a professor is and how much students feel they have learned from her or him.
Professors who differ in popularity can be compared in terms of students’ clas-
sifications of them on dimensions such as organization, flexibility, and so on.
In this way, the study would seek to learn not only how students think about
or categorize their professors but also about the connections between the qual-
ities that students perceive in their professors. Thus, it would reveal the mental
representations or maps that college students form of professors.
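The comparison step can be sketched as a simple cross-tabulation of students' classifications. The professors and classifications below are hypothetical, and the sketch assumes pandas is available.

```python
# Minimal sketch: cross-tabulating two categories that students might use
# to classify professors. All names and classifications are hypothetical.
import pandas as pd

classifications = pd.DataFrame({
    "professor":  ["Allen", "Baker", "Chen", "Diaz", "Evans", "Ford"],
    "organized":  ["yes", "yes", "no", "yes", "no", "no"],
    "well_liked": ["yes", "yes", "no", "yes", "yes", "no"],
})

# Does being classified as organized tend to go with being well liked?
print(pd.crosstab(classifications["organized"], classifications["well_liked"]))
```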
Bogdan and Biklen (2006) describe the constant comparative method as a
search by a researcher for key issues, recurrent events, or activities that then
become categories of focus. Further observation looks for incidents that reflect
the categories of focus to determine the diversity of the dimensions under the
categories. Such incidents are continually sought and described, guided by the
categories, in an effort to discover basic social processes and relationships.
In the example of the college classroom, the categories of focus may be dif-
ferent kinds of interactions between professor and students, such as asking ques-
tions, offering explanations, building relationships, and maintaining barriers.
Of course, not all qualitative designs are so open-ended that the problem
of study emerges entirely from the data. Many researchers identify problems
they want to study and seek to obtain qualitative data that bear on those issues.
■ Data Sources
Case study research usually gathers data from three types of sources: (1) inter-
views of various people or participants in the setting who are involved in the
phenomena of study; (2) documents such as minutes of meetings, newspaper
accounts, autobiographies, and depositions; and (3) direct observation of the
phenomena in action. The researcher collects data in any of these three ways to
acquire information related to the phenomena. This section discusses each of
these data sources in turn.
Interviews
One direct way to find out about a phenomenon is to ask questions of the
people who are involved in it in some way. Each person’s answers will reflect
his or her perceptions and interests. Because different people experience situ-
ations from different perspectives, a reasonably representative picture of the
occurrence and absence of a phenomenon may emerge and provide a basis for
interpreting it.
Types of Interviews
Researchers conduct four types of interviews as described by Patton (1990).
They range from totally informal, conversational exchanges to highly struc-
tured sessions asking closed-end, fixed-response questions. The type of inter-
view chosen depends on the context of the study and the kinds of questions to
be asked.
The first of these types is the informal interview.
They are characterized by questioning strategies that are not predetermined;
that is to say, interview questions emerge in the natural flow of interpersonal
interaction. Though this methodology is less systematic, which may lend itself
to difficulties with analysis, it is particularly useful in that the interview itself
can be designed to fit the demands of an individual or particular circumstance.
The second type of interview is described as a guided approach, in which
the interviewer resolves issues of questioning, sequence, and topic coverage in
advance of the actual interview in the form of an outline. This approach lends
itself to somewhat systematic data collection because it does utilize a format,
though it may occasionally result in the unintentional omission of important
interview topics since that format is not strictly standardized.
The third type of interview is described as a standardized, open-ended
approach, which differs from the guided approach in that the specific interview
issues (questioning, sequence, topic coverage, etc.) are definitively worked out
in advance (rather than simply considered and described in outline form). This
methodology is highly systematic, which leads to uniformity in data collection
and ease of analysis, though it may be somewhat rigid.
A final type of interview methodology is described as closed and fixed.
Here, interview questions are worked out in advance and structured so that
interviewees respond to questions from among a set of predetermined alterna-
tives. Data collection using this method is fairly simple and straightforward,
though the validity of interviewee responses may be somewhat compromised
due to the limitation of response choices.
Specific Questions
In selecting questions, an interviewer should ask not only about intentions
but about actual occurrences. Information about occurrences and outcomes
can also be obtained from source documents, as described later in this section.
However, interviews often prove the major sources of information about peo-
ple’s intentions and other subjective elements of observed phenomena. Con-
sider a sample list of such interview questions and directives:
1. Describe the behavior that is going on here. Describe your own behavior.
Describe the behavior of other participants.
2. Describe the reasons behind the behavior that is going on here. Why are
you behaving the way you are? Why do you suppose others are behav-
ing the way they are? How are these reasons interrelated? How are they
affected by the setting?
3. Describe the effects of the behavior that is going on here. Describe the
effects of your behavior. Describe the effects of the behavior of other par-
ticipants. Are these effects interrelated?
actual events from their own perspectives, allowing the researcher to make the
contrasts in later analysis. Moreover, although outside interpretations of intent
give only speculative suggestions, and evaluation must treat them as such, the
interpretations of other observers might help a researcher formulate an under-
standing of the intentions that underlie the observed behaviors of participants.
Again, a list of examples must state them in extremely general language. Real
situations call for the most specific possible questions.
1. How did you come to observe the phenomena in question? What is your
role in these events? Under what conditions and circumstances did you
observe the phenomena?
2. Describe what is going on here. Identify the participants and the behavior
of each one.
3. Why do the participants behave as they do?
4. What effect does the behavior of the participants produce on one another
and on future events or outcomes?
1. Describe your experiences. Tell about what actually happened. How did
the teacher behave? How did you behave? How did the other students
behave? What was the sequence of events?
2. What caused things to happen as they did? Why did you behave the way
you did? Why did the teacher behave the way he or she did? Why did the
other students behave the way they did?
3. Did any incidents, either good or bad, occur that stand out in your mind?
4. Did you enjoy the experience? Was it an interesting one? Did you learn
from it?
The first set of questions deals with actual behavior, so it parallels the
questions asked of other participants and observers. The second set of ques-
tions concerns the reasons behind behavior, again in a parallel pattern. The
third question represents the critical incident technique, in which respondents
are asked to recall critical or outstanding incidents that can guide researchers,
either in forming hypotheses about outcomes or in illustrating generalizations
of results or conclusions.
A fourth set of questions represents a way of evaluating outcomes or phe-
nomena based on the subjective reactions of the participants. This set of ques-
tions aims to identify three levels of evaluation from their perspective:
I. Did the intended or expected experience occur? (Did you get what you
expected?)
II. Were you satisfied with what you received? (Was it what you wanted?)
III. Did you change as a result of the experience? (For example, did your
knowledge and/or competence improve?)
This fourth set of questions attempts to reveal whether the participants felt
satisfied with their experiences. Even when an interviewer asks participants
whether they have learned or improved, these questions really only ask for
their opinions of an experience’s worth, which essentially reflect their satis-
faction. Any attempt to actually test whether their knowledge and/or com-
petence improved would require some measurement of their level of relevant
knowledge or competence at the conclusion of the experience. The researcher
would then have to compare current levels to the levels prior to the experi-
ence, and to contrast the difference with that of a group who did not have
the same experiences. But that analysis is a quantitative approach quite differ-
ent from qualitative research. Satisfaction and self-estimates of change are by
definition subjective evaluations “measured” by asking participants for their
self-assessments.
Finally, useful further interviews might seek input from people who are nei-
ther participants nor direct observers, but who are aware of a set of experiences
through secondhand information. In school research, such secondary sources
could be parents, for example. If a phenomenon is having an effect of great
enough magnitude, parents will be aware of it. Their impressions are worth
gathering, because subsequent experiences (that is, whether or not a program is
continued) may depend on them. Also, some studies may lack opportunities to
locate people who either participated in or observed events. Those researchers
must rely on secondary sources for answers to certain questions:
3. What are your impressions of why events happened as they did? How did
you arrive at these judgments (stated in the most specific possible terms)?
4. What was the result or outcome of the event? How did you determine that
this was, in fact, the result?
Interviewing Children
Qualitative researchers often must implement some special procedures for con-
ducting successful interviews with children. Questioning must accommodate
the limited verbal repertoires of children from preschool age through adoles-
cence. It must also anticipate the paradox that children seldom give responses as
socially controlled as the statements of adults, but on occasion they do strictly
censor their responses according to rigid rules. Moreover, because other, more
structured approaches like questionnaires are often impractical for research
with children, the interview becomes the data collection device of choice.
A primary goal in interviewing a child is to establish rapport, that is, a
positive relationship between the interviewer and the child. Exchanges based
on feelings of warmth, trust, and safety help to increase both the amount and
accuracy of the information that young subjects provide. Boggs and Eyberg
(1990) provide a comprehensive list of communication skills to guide the
adult interviewer.
The purpose of acknowledgement is to provide feedback to assure the
subject that the interviewer is listening and understanding. This input influ-
ences children to continue talking. The level of subtlety of the acknowledging
response must be matched to the child’s social development.
Descriptive statements of what the child is doing show the child that the
interviewer is accepting of the child’s behavior. This input also helps to focus
the child’s attention on the current topic and encourages the child to offer
further elaboration. Descriptive statements such as “That’s a hard question to
answer, isn’t it,” can be particularly helpful when a child responds to a question
only with silence. Reflective statements, when delivered with the proper inflec-
tion, demonstrate acceptance and interest in what the child says and convey
understanding. They can also prompt additional, clarifying statements by the
child. To avoid being seen as insincere, especially with adolescents, praise state-
ments should be offered only after establishing rapport, and then they should
be “labeled” to specify exactly what the interviewer is encouraging. Properly
introduced, especially in age-appropriate language, praise can greatly increase
a child’s information-giving on a particular topic.
Questions make explicit demands upon a child. Interviewers may ask open-
ended or closed-ended questions, but open-ended ones are preferred, because
they yield more information than the typical “yes” or “no” response to a
closed-ended question. Children often respond especially readily to indirect
must arrange for separation between the child interviewee and classmates. At
the onset of the meeting, the interviewer should practice the kinds of behaviors
described earlier to establish rapport with the child.
The interview is likely to be a new experience for the child. At the outset,
the purpose should be explained and the child given an opportunity to ask
questions. Any rules to be followed during the interview (e.g., no playing with
toys) should be made explicit at this time, and confidentiality should be assured.
The interview itself should move from least to most potentially distressing or
difficult topics; when and if the child shows resistance, the interviewer should
move to another topic and attempt to return to the more difficult one later in
the interview. The interviewer should respect the child’s ultimate decision not
to answer a particular question.
When the interview is complete, the interviewer should express apprecia-
tion for the child’s cooperation and give the child an opportunity to add any
unsolicited information. Successfully engaging a child in the interview pro-
cess requires good planning and skillful communication of an interviewer. The
interviewer must often be prepared to follow a less direct route in acquiring
information from a child than from an adult.
Documents
Observations
Observations, the third qualitative data source, can also provide quantita-
tive data, depending upon the techniques for recording observational data. If
observers record events on formal instruments such as coding or counting sys-
tems or rating scales, the observation will generate numerical data; hence they
form part of quantitative research. If an observer simply watches, guided only
by a general scheme, then the product of such observation is field notes, and
the research is a qualitative study.
The target for observation is the event or phenomenon in action. In quali-
tative educational research, this process often means sitting in classrooms in the
most unobtrusive manner possible and watching teachers deliver instructional
programs to students. Such an observer does not ask questions as part of this
role, because that is interviewing. (Questions can be asked either before or after
observing.) An observer just watches. But the watching need not totally lack
structure. She or he usually watches for something, primarily (1) relationships
between the behaviors of the various participants (Do students work together
or alone?); (2) motives or intentions behind the behavior (Is the behavior spon-
taneous or directed by the teacher?); and (3) the effect of the behavior on out-
comes or subsequent events (Do students play together later on the playground
or work together in other classes?). Observers may also watch to confirm or
disconfirm various interpretations that have emerged from the interviews or
reports and to identify striking occurrences about which to ask questions dur-
ing subsequent interviewing.
The critical aspect of observation is watching, taking in as much as you can
without influencing what you watch. Be forewarned, however, that what goes
on in front of you, the researcher, will represent—at least in part—a performance
Transcribed Conversations
The next section of this chapter deals with specific procedures for conducting
a qualitative case study.
Source: This protocol was derived from work initially done by Gail Jefferson and reported in
“Explanation of Transcript Notation,” in Studies in the Organization of Conversational Interac-
tion, ed. J. Schenkein (New York: Academic Press, 1978), xi–xvi.
1 Vern: um, but once again if you were going to have them up
2 there, you might’ve taken a more proactive role in seating
3 them. (0.8) I don’t know if y- a boy girl boy girl
4 pattern will be better, or the ones who you know are going
5 to interact here, you do that. It’s like a seating chart=
6 Doug: um hum um [hum yeah
7 Vern: you kn]ow? AND um, um (1.0) I did it with
8 ninth graders so the likelihood that you’d have to do it
9 with first graders would be great.
10 Doug: um hum
11 Vern: OK?
12 Doug: yeah, that would be a good idea hhh ((nervous laugh))
13 Vern: WHAT, WHAT YOU NEED is to expand the repertoire of
14 skills that you can use to ensure classroom management.
15 and [whatchu h ]ad going on
16 Doug: um hum
17 Vern: up front was less than productive classroom management
18 because there were a number of times you had to
19 go Tim (0.8), you know, Zack, um m-m-m, you know,
20 whatever the names were or wha- whatever. u- w-
21 yo[u ha ]d to go on with
22 Doug: um:
23 Vern: that a few times. So that w- would be of something
24 you really need to focus on. The second thing that I
25 would mention here is is (3.0) °and in an art lesson, I
26 might add, there there isn't an easy way of doing this,
27 but it's something for you to think about.° (0.8) UM (2.3)
28 THE OLD, we’ve talked about this before, the old (0.7)
29 never give more than three directions to k- anybody at one
30 time=
31 Doug: =um hum=
The first step in conducting a qualitative study is to obtain copies of all available
documents describing the event or phenomenon (or its background) and carefully
study them. This preparation is the best and most objective way to orient yourself
to the situation that you are about to research. In reading the documents, take
particular note of (1) the setting, (2) the participants and their respective roles,
(3) the behaviors displayed by the various participants, (4) your perceptions of
the participants’ motivations or intentions, (5) the relationships between inten-
tions and behaviors, and (6) the results or consequences of the behavior.
The information that you glean from background documents will help you to
prepare your own plan for direct information gathering as part of your case study.
To collect some data for a qualitative or case study, you will have to accomplish
fieldwork during a site visit. This is ordinarily a period of time during which
the researcher enters the setting in which the event under study has occurred or
is occurring. Of course, a particular study may incorporate more than a single
visit, and the research may be conducted by more than a single researcher. To
use the time on site most efficiently and effectively, a researcher should plan as
specifically as possible how the time there will be spent. This planning should
include developing a visitation schedule and interview instruments.
A visitation schedule includes a list of all the people the researcher wants
to see and the amount of time intended to spend with each. Efficient use of
limited time calls for a visitation schedule made up of specific arrangements to
see specific people, such as teachers involved in a particular project. It should
also set aside time to make observations or to see people without specific
appointments.
A visitation schedule also helps with advance preparation of interview
questions, although you need not attempt to write in advance every question
you might ask.
After reading the documents and reviewing this chapter’s earlier discus-
sion about interview questions, you should be able to prepare a general line
of interview questions. Each scheduled interview may require a separate set of
questions, in which case each session should be sketched out in advance.
Preparation should also include development of a mechanism for recording
responses to interview questions. You may want to tape record each interview
to prevent the need for taking notes. If you choose to tape record, you must
request in advance permission from each interviewee, and you may record only
when this permission is granted. In place of or in addition to tape recording,
you should prepare a notebook for taking fieldnotes. Systematic prior mark-
ing of the notebook pages with interview questions or question numbers will
aid in taking and interpreting fieldnotes. A notebook should allow a page for
each observation’s fieldnotes by listing the date, time, teacher, and other, more
specific entries for the phenomena you will be observing. A sample page of
fieldnotes appears in Figure 15.4.
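One way to keep such notebook pages consistent is to give each one a fixed structure. The sketch below is only illustrative; the field names and sample values are hypothetical, not a prescribed format.

```python
# Minimal sketch: a fieldnote page with date, time, teacher, numbered entries,
# and bracketed observer comments, mirroring the layout described above.
from dataclasses import dataclass, field
from typing import List

@dataclass
class FieldnotePage:
    date: str
    time: str
    teacher: str
    entries: List[str] = field(default_factory=list)
    observer_comments: List[str] = field(default_factory=list)

page = FieldnotePage(date="(date)", time="(time)", teacher="Mrs. H.")
page.entries.append("Class returned from P.E.; children began project work.")
page.observer_comments.append("Orderly activity without overt teacher management.")
print(page)
```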
Good planning also should allow for observation. This preparation may
include a set of questions that you hope to answer as a result of the observation,
or it may list critical incidents. You may simply prepare to describe the activi-
ties of students and teachers during your visit.
1. Class returned to classroom after P.E. Mrs. H. had set up room for project work
and greeted children upon return by reminding them that they would now
work on their projects.
2. Without further direction, children dispersed to desks after taking project
“books” out of their regular storage area. (Projects were nature “books” done
on an individual basis, to relate personal experience, interest, and natural sci-
ence theme.)
3. Some children work alone, quietly, drawing or writing or pasting. Some talk in
pairs about project work. Some show work to teacher and ask for help. Some
scurry around looking for materials. [It is amazing to see how many things are
going on at once in an orderly yet comfortable fashion without the teacher
exerting overt management behavior. It is a stark contrast to children seated
in rows listening—or at least being quiet.]
4. At 10:55 teacher interrupts by striking gong and without any further instruc-
tions children gather around her for reading. Teacher proceeded to read a
story to entire class punctuated often by teacher asking questions and stu-
dents answering enthusiastically.
An Illustration
The data for the qualitative research project include the fieldnotes that you
bring back in your notebook and in your head, interview transcripts, plus any
information gleaned from program documents. Analysis of these data means
using the data to answer the questions the research set out to answer.
1. Review the data you have collected and develop category labels for classify-
ing them.
2. Identify enough specific examples of each category in the data to com-
pletely define or saturate each category, clearly indicating how to classify
future instances into the same categories (a minimal sketch follows).
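A minimal sketch of these first two steps follows; the fieldnote excerpts and category labels are hypothetical.

```python
# Minimal sketch: assigning category labels to data excerpts and counting
# instances per category. Excerpts and labels are hypothetical.
from collections import defaultdict

coded_excerpts = [
    ("Children dispersed to desks without further direction", "self-direction"),
    ("Some talk in pairs about project work", "peer collaboration"),
    ("Some show work to teacher and ask for help", "teacher support"),
    ("Children gather around teacher for reading after the gong", "self-direction"),
]

instances_by_category = defaultdict(list)
for excerpt, category in coded_excerpts:
    instances_by_category[category].append(excerpt)

for category, instances in instances_by_category.items():
    print(f"{category}: {len(instances)} instance(s)")
```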
Figure 15.5 illustrates the process of analyzing qualitative data. The researchers
were attempting to study the role of school peer groups in the transmission
of gender identities. As a data-generating device, they interviewed 10 fifth-
and sixth-grade students in a single public school using an approach called the
“talking diary,” and tape recorded the responses.
From the stream of responses, they identified a number of “facts” (some
of which have been italicized in Figure 15.5 for illustrative purposes) as well as
some of the data on which these facts are based. These facts are conclusions or
generalizations based on the specific answers students gave to the researchers’
questions. Based on these facts and others, the authors concluded that “gen-
der identities and relations were the primary focus of the peer groups. . . . In
a sense, a world of female students and a world of male students existed [and]
cross-gender contact . . . was interpreted in romantic terms only” (Eisenhart &
Holland, 1985, p. 329).
In an ethnographic study of teacher-supervisor conferences, Waite (1993)
relied most heavily on analysis of transcriptions of a number of such confer-
ences with four teachers. An example of one transcript fragment appeared
earlier in Figure 15.3. In this fragment, the teacher (Doug) meets with his
supervisor (Vern). As the transcript shows, he always agreed with, or at least
never disagreed with, the advice he received. (See Lines 6, 10, 12, 16, 22, and 31
of the figure.)
Based on his analysis, Waite (1993, p. 696) summarized his findings as
follows:
Data from the talking diary interviews revealed that in addition to engaging in
gender-segregated activities and indicating friendship depending on gender,
boys and girls made differing judgments in their normative statements. Girls,
especially, were prone to comment on the interpersonal styles of other girls. For
white girls, it was important to be seen as “nice,” “cute,” “sweet,” and “popular.”
Positive remarks referred to a girl who did not act stuck up, overevaluate her
assets, or flaunt her attractive features in front of her friends. Among black girls,
it was also important to be “nice” and “popular,” although the meanings of these
terms were somewhat different. For blacks, these terms referred to a girl who
demonstrated the ability to stand up for herself and who assisted others when
they were having difficulty or were in trouble. For them, girls who did not demon-
strate an intention to stand firm and help friends in the face of verbal or nonver-
bal challenges were criticized.
Girls, both black and white, also spent a great deal of time talking about their
appearance. They advised each other on such things as how often to wash one’s
hair, how to get rid of pimples, and how to dress in order to look good and be in
style.
Especially by the sixth grade, a large proportion of what girls talked about to
each other concerned romantic relationships with boys. Almost every lunch and
breakfast conversation included at least some mention of who was “going with”
whom, how to get boys to like you, how to get someone to “go with” you, what
to do if someone was trying to break up with you, how to steal someone else’s
boyfriend, what to wear on a date, where to go and how to get there, and who
was attractive or ugly and why. In the following examples taken from the notes,
girls advise one another on their romantic ventures. In the first example, Tricia
describes how she coaches one of her girlfriends to get a boy to take her to the
end-of-the-year banquet held for sixth graders.
I tell her what to say, how to do, how to dress . . . like how to fix her hair in the
morning, how to talk, how to laugh . . . just culture . . . everything culture tells
you to do, you do it. Then she’ll turn around and coach me back, with Jackson.
Another example concerns Jackie and her efforts to get Bob to take her to the
banquet.
Jackie frequently called Bob at night, arranged to run into him in the hall, and
sent notes to him via her friends. Bob was not particularly responsive to these
overtures. Finally, in desperation when he appeared on the verge of asking
someone else, Jackie discussed her problem with some of her girlfriends. One
friend suggested that Jackie had been too pushy and, as a result, Bob did not
like her anymore, though he had at one time. The friend suggested that the
way to catch a boy was to let him think that you are shy. The friend pointed
out that all the girls who had steady boyfriends were shy at school (Clement
et al. 1978: 191).
In this context, girls who were not interested in romantic relationships were con-
sidered strange. Ruth, for example, expressed her feelings of being “weird” as
follows:
I like boys, but I don’t like to go with anybody. Most girls are crazy about boys,
but I’m a little on the funny side. I like this boy who lives near me: I like to play
with him, but I don’t like to do anything with him [i.e., I don’t want him to be
my boyfriend].
Boys’ talk also revealed a concern with being liked by girls. For example, at a
skating party given by the researchers for the students, Joseph was overheard
telling Edward how to be successful in dealing with women: “You have to be
cool.”
Although boys gave attention to their interpersonal styles, boys also talked fre-
quently about their abilities in sports and in getting away with things at school.
Boys also wished to be seen as strong. As with some of the girls, a boy’s abil-
ity to defend himself, especially in contests with equals, was highly valued. One
of the girls, for example, made the following criticism of a male classmate:
I don’t want nobody to see me if I was a boy . . . wouldn’t want nobody seeing
me fighting a girl, but won’t fight a boy.
Another expression of this value came from a boy who was shorter than most of
his classmates:
Like if a big boy comes messing with me, Vernon’ll take him, but if a little
shrimp-o comes messing with Vernon, I take care of ’em. If one a little taller
comes messing with me, Joseph takes care of ’em. . . .
I like ’em, but not as much as I like boys. . . . I just can’t talk to them the way I
can boys . . . they don’t like sports, they don’t like to do nothing fun. . . . I be
nice to ’em because my mom says you’re spose to be polite.
These differences in the valued identities of girls and boys are reminiscent of the
findings of Coleman (1965). He reports that in the context of adolescent peer
groups, girls learn to want to make themselves attractive to others, especially
boys. Boys, on the other hand, develop interests in task-oriented activities, such
as sports, as well as learning the importance of being well liked.
Protocol Analysis
In order to study the thinking process students apply to learn or solve prob-
lems, researchers have developed a technique that asks students engaged in the
learning or problem-solving process to think aloud, that is, to say out loud
what they are thinking as they progress. These statements are tape recorded
and transcribed, representing what is called a protocol. These protocols are then
examined and coded to identify characteristics of the thinking process. This
technique, called protocol analysis, has been described in detail by Ericsson and
Simon (1993).
The coding scheme for a set of protocols can vary, depending upon what a
particular researcher is interested in determining. For example, Chi, DeLeeuw,
Chiu, and LaVancher (1994) were interested in finding out whether a special
form of mental construction, called a self-explanation, would improve acquisi-
tion of problem-solving skills. Self-explanations were defined as spontaneously
generated explanations that one makes to oneself as one studies worked-out
examples from a text. While such an example provides a sequence of action
statements, it lacks explanations or justifications for the actions chosen.
The researchers were interested in determining whether the number of self-
explanations students generated while studying the examples would be related
to the amounts they learned.
After reading each sentence of a 101-sentence text passage about the func-
tions of the human circulatory system, students were asked to think aloud
about the meaning of the particular sentence. Students’ statements about their
thinking were coded using protocol analysis. A statement was coded as a self-
explanation if it went beyond the information given in the sentence, that is,
if it inferred new knowledge. For example, students read, “These substances
(including vitamins, minerals, amino acids, and glucose) are absorbed from the
digestive system and transported to the cells”; expressions of thoughts like “the
purpose of hepatic portal circulation is to pick up nutrients from the digestive
system” or “eating a balanced diet is important for your cells” would be coded
as self-explanations. For another example, students read the sentence, “Dur-
ing strenuous exercise, tissues need more oxygen”; a self-explanatory thought
might be “the purpose of the blood is to transport oxygen and nutrients to the
tissues.”
The findings of the study showed that generating self-explanations did
indeed contribute to superior learning.
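The quantitative side of such an analysis is easy to sketch. The snippet below is purely illustrative (the coded protocols, the posttest scores, and the use of scipy are assumptions, not data or procedures from Chi et al.); it tallies statements coded as self-explanations for each student and correlates the tallies with learning scores.

```python
from scipy.stats import pearsonr

# Hypothetical coded protocols: one list of codes per student, where "SE" marks a
# statement coded as a self-explanation and "OTHER" marks any other statement.
coded_protocols = [
    ["SE", "OTHER", "SE", "SE", "OTHER"],
    ["OTHER", "OTHER", "SE", "OTHER"],
    ["SE", "SE", "SE", "SE", "OTHER", "SE"],
    ["OTHER", "OTHER", "OTHER"],
]
posttest_scores = [78, 65, 90, 58]  # hypothetical learning outcomes, one per student

se_counts = [codes.count("SE") for codes in coded_protocols]
r, p = pearsonr(se_counts, posttest_scores)
print(f"self-explanations per student: {se_counts}")
print(f"correlation with learning: r = {r:.2f}, p = {p:.3f}")
```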
Another study using protocol analysis was done by Wineburg (1991). He
was interested in identifying differences, if any, in the ways that experts and
novices reasoned about historical evidence. He asked a group of working his-
torians and a group of high school seniors to think aloud as they reviewed a
series of written and pictorial documents about the Battle of Lexington. He
analyzed their protocols of the pictures using the following coding categories:
For the protocol analysis based on the subjects’ processing of the documents,
Wineburg applied three “heuristics” for coding: (a) corroboration—comparing
documents with one another; (b) sourcing—looking at the document’s source
before reading it; and (c) contextualization—placing the document in a con-
crete context with regard to time and place. Wineburg found that historians
employed much more sophisticated thinking processes than those of students,
making much greater use, in general, of qualification and contextualization.
These historians also used the other coding categories differently than did the
students. Protocol analysis enabled Wineburg to get a picture of the differences
in information processing by the two groups of people he studied.
■ Summary
1. Qualitative research takes place in the natural setting with the researcher as
the data-collection instrument. It attempts primarily to describe, focuses
on process, analyzes its data inductively, and seeks the meanings in events.
2. This category of research methods includes ethnography, responsive or
naturalistic evaluation, and case study research, to cover a variety of themes
dealing with unique, whole events in context. The researcher’s experience
and insight form part of the data.
3. Qualitative research (a) displays a phenomenological emphasis, focusing
on how people who are experiencing an event perceive it; (b) occurs in a
naturalistic setting based on field observations and interviews; (c) accom-
modates emergent theory, since explanations come from the observations
themselves.
4. Research questions typical in qualitative studies focus on culture, experi-
ence, symbols, understandings, systems, underlying order, meaning, and
ideological perspective. Problems studied in this way include plans, inten-
tions, roles, behaviors, and relationships of participants.
5. Qualitative methodology involves a set of research questions, a natu-
ral setting, and people behaving in that setting. Data collection focuses
on describing, discovering, classifying, and comparing through a process
often referred to as the constant comparative method.
6. Data sources include interviews, documents, observations, and transcribed
conversations. Interviews range from highly informal and conversational
exchanges to highly structured sessions that elicit fixed responses. Partici-
pants may be asked to describe behavior (their own and others’), reasons
behind or causes of behavior, and effects of behavior on subsequent events.
Participants or direct observers can also report on critical incidents, as well
as offer their opinions. Secondhand information can also be solicited.
7. Interviews with children require special skills regarding the following com-
munication actions: acknowledgment; descriptive, reflective, and praise
statements; questions; commands; and summary and critical statements.
The child interview itself includes the following stages: preparation, initial
meeting, beginning, obtaining the child’s report, and closing.
8. Transcribed conversations are transcripts of tape-recorded interviews or conferences.
9. Qualitative researchers often review documents, including minutes,
reports, autobiographies, and depositions. Formal or informal observa-
tions provide additional data.
10. A qualitative study involves obtaining documents, conducting interviews,
and making observations. The data typically take the form of fieldnotes, or
transcripts of interviews, made during site visits. While on site, a qualita-
tive researcher should be prepared to answer questions (honestly) about
■ Recommended References
Bogdan, R. C., & Biklen, S. K. (2006). Qualitative research for education: An introduc-
tion to theory and methods (5th ed.). Boston, MA: Allyn & Bacon.
Eisner, E. W. (1991). The enlightened eye: Qualitative inquiry and the enhancement of
educational practice. New York, NY: Macmillan.
Fontana, A., & Frey, J. H. (1994). Interviewing: The art of science. In N. K. Denzin &
Y. S. Lincoln (Eds.), Handbook of qualitative research (pp. 361–376). Thousand
Oaks, CA: Sage.
Glesne, C., & Peshkin, A. (1992). Becoming qualitative researchers. New York, NY:
Longman.
= CHAPTER SIXTEEN
Action Research
OBJECTIVES
Katie is a first-year art teacher in a local high school. As a new faculty mem-
ber, she is eager to get to know her students and nurture their talent and
creativity. On her very first day she is called in to the principal’s office where
she is told how important it is that her 10th-grade students do well on the
national standardized exam, especially in math, where last year’s students
struggled. In response, Katie works with the math department to design les-
sons that incorporate important mathematical concepts in her lessons on
form, function, and the creative process. She is confident that her students
will not only come to understand the relationship of math to art, but also that
they will perform well on the upcoming standardized test. How might Katie
research the effectiveness of this approach?
Action research rests on the premise that those within an organization are most
qualified not only to understand its inner workings, but also to offer insight into
ways to improve it. The research
effort may originate in a single classroom or unit, but ultimately its impact may
extend to entire buildings and organizations. Conducted and applied correctly,
it inspires a reflective cycle in which an approach is considered, amended, and
then reconsidered with respect to data. Ultimately, all stakeholders will come
to understand ways to refine practices.
Regardless of the number of participants involved, all action research stud-
ies are guided by three principles:
Action research proceeds in two phases: Arm, the preparatory process, and Act, the execution of the research
agenda. It is a cyclical process—the application of data-driven strategies should
lead to further investigation to refine the process in question. Action research-
ers arm themselves through observation and reflection and then subsequently
act on what they have seen, heard and read.
Phase 1: Arm
After careful reflection, this particular action research study addressed the
question: “How might small group instruction influence elementary students’
willingness to sing, dance, and play during music class?” It identifies both the
treatment/intervention (group size) and end state (student willingness to par-
ticipate in class).
Consider a seventh-grade teacher who is concerned about her students’
transition from elementary to middle school. Over several years, she has
noticed young people struggling to cope with the increasing academic and
self-regulatory demands of the new school year. The general question that will
guide her research might be:
Construct questions that focus on an area of interest rather than questions that are
not relevant for a particular environment. Construct questions that can be researched
rather than questions that cannot be researched in a particular environment.
Preparing the literature review requires that the researcher search for existing studies, read the studies critically, and organize
existing research into coherent subsections. These steps allow the researcher to
convey expertise.
Searching for existing studies. Research articles, reports, and books can all
be sources for the literature review. Sources that are selected for the literature
review should reflect what is known in a particular area. Relevant research is
that which examines the key concepts of the proposed action research study.
It introduces existing theories and their relationship to current practice. Katie,
our fictional art teacher, is interested in the integration of math and art and the
way it may improve student performance on standardized examinations. She
can begin by searching for data on high-stakes testing or curriculum integra-
tion. Investigating pedagogical strategies in both mathematics and art may also
be helpful. Simply stated, articles that are ultimately accepted to be a part of the
literature review should clearly correspond to the variables of interest for the
research study.
Reading critically. Once potential sources have been identified, they must
be evaluated with respect to veracity and relevance. The reader must first iden-
tify the purpose of the study. What theory undergirds the research? How is it
relevant to the authors’ action research study? What particular insights does
it suggest with respect to variables of interest? Next, the reader should ana-
lyze the author’s methodology. What research questions were addressed? How
were data collected? From whom? What steps did the author take to assure
validity and reliability and to limit bias? Finally, the reader should address the
author’s interpretation of the data. Did the author draw appropriate and rea-
sonable conclusions? How do they apply to the variables that will be examined
in the action research study? While not every article is suitable for inclusion in
the literature review, a critical review of the literature will certainly inform the
research design.
Organizing research into coherent sections. Your literature review is
driven by the ideas that inspire the research study. These issues provide the
basis for the organizational framework for the literature review. Literature
reviews usually begin with an introduction, which reveals the key topic
and organizational pattern for the review. Next, the body of the literature
review outlines relevant reading. Many literature reviews are arranged the-
matically, categorizing research into groups that correspond to important
study variables. They may also be arranged sequentially, which entails out-
lining research studies chronologically. This approach is often used when it is
essential to understand a historical description of a particular field. Lastly, the
literature review usually concludes with a summary statement, followed by
research questions and hypotheses. Regardless of the organizational pattern,
action research study literature reviews are usually written for an audience of
AC T I O N R E S E A R C H ■ 4 2 5
1. What type of data will be collected? Researchers may wish to collect quan-
titative (surveys, questionnaires, school records, scaled data, etc.) or quali-
tative data (observations, journals, reports, etc.).
2. How often will data be collected? Studies may utilize a single data collec-
tion point, to be compared to a baseline (as may be the case with art teacher
Katie, who will use standardized test scores for her students as the data col-
lection point and compare them to previous years’ scores). They may also
use multiple data collection points, to establish change over time.
3. What is the timeline for data collection? Data collection may take place
over days, weeks, months, or even years. The timeline will be determined
by the goals of the study.
4. From whom will data be collected? Data may be collected from an entire
unit classroom or section, or only from a small subgroup within the larger
unit.
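These four decisions can be written down as a simple plan before any data are gathered. The sketch below imagines what such a plan might look like for Katie's study; every entry is hypothetical and is offered only to show the shape a plan could take.

```python
# A hypothetical data-collection plan for Katie's math-in-art study (illustrative only).
data_collection_plan = {
    "data_type": "quantitative",                      # standardized test scores
    "sources": ["national standardized exam (math section)"],
    "collection_points": 1,                           # a single point, compared to a baseline
    "baseline": "previous year's 10th-grade scores",
    "timeline": "end of the school year",
    "participants": "Katie's 10th-grade art students",
}

for decision, choice in data_collection_plan.items():
    print(f"{decision}: {choice}")
```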
Phase 2: Act
We close this chapter by reviewing a sample study (Harper, 2006), which will
illustrate the process by providing examples of the specific elements of action
research that we have introduced throughout the chapter.
Problem Statement
As we introduced earlier in this chapter, there are two purposes of action
research: to understand classroom phenomena and to improve current prac-
tice. Specifically, this study examines why many students avoid challenging or
difficult tasks. As an active member in the research environment, the author of
this study understands the importance of helping his students improve current
practice.
Abstract
For prospective teachers, the development of self-regulatory behaviors—those
which embody an incremental framework—is vital. This study examines the self-
beliefs and academic behaviors of pre-service teachers. The results of this inves-
tigation suggest that high-achieving pre-service teachers endorse more strongly
held incremental views and are more likely to exhibit academic self-regulatory
behaviors in the face of challenge than are their lower-achieving counterparts.
Introduction
A mastery-oriented motivational pattern is a key component of academic
success. Such a perspective incorporates self-regulatory strategies towards
confronting and overcoming task-related setbacks. For more than a decade,
numerous research studies have lauded the effectiveness of this approach,
which leads to both greater persistence and greater performance (e.g., Pintrich
& Garcia, 1991). Unfortunately, even high-achieving students often retreat in the
face of challenges and obstacles. In spite of a substantial list of efficacy-building
successes, many students quickly withdraw from difficult, high-level tasks. Why
might this be so?
Dweck and her colleagues have introduced a framework which helps to
explain this conundrum. In this model, self-beliefs and goals create a motivational
framework which shapes the manner in which an individual will consider and
approach various tasks (Dweck & Leggett, 1988). Specifically, this theory identi-
fies two opposing ways in which an individual may consider a personal attribute:
from the perspective of an entity theorist, who holds that the attribute is rela-
tively fixed, or that of an incremental theorist, who holds that the attribute is
adaptable (Dweck & Leggett, 1988; Dweck et al., 1995; Hong et al., 1999; Dweck,
2000). The adoption of either perspective holds important ramifications for aca-
demic self-regulation (Dweck, 2000).
Those who express views consistent with that of an entity theorist are likely
to set different goals in achievement situations than those who embrace the
perspective of an incremental theorist. In a study of college students’ theories of
intelligence, Hong and her colleagues (Hong et al., 1998) discovered that students
who hold a fixed view of intelligence (entity theorists) were more likely to express
a performance-goal orientation and less likely to exhibit effortful, self-regulatory
behaviors in instances in which there was a threat of exposing their shortcomings
than students with a malleable view of intelligence (incremental theorists). Since
self-regulated behavior is predicated upon the strategic, goal-directed effort one
puts forth in a given situation, entity theorists who are faced with complex tasks
certainly face a higher level of risk for learned-helplessness and failure than do
incremental theorists (Dweck, 2000).
Additionally, in a study of Norwegian student teachers, Bråten and Strømsø
(2004) suggested that those who believed intelligence to be a fixed attribute were
less likely to adopt mastery goals and more likely to adopt performance-avoidance
goals than were incremental theorists. Further, Sinatra and Cardash (2004) report
that those teachers who endorsed incremental views—specifically, those who
believed that knowledge evolves—were more likely to embrace new ideas and
pedagogical strategies than were those who expressed more static views of intel-
ligence. This study will extend the literature by investigating pre-service teachers’
epistemological beliefs and patterns of specific self-regulatory behaviors on highly
self-determined and more challenging academic tasks.
Problem
Specifically, this empirical investigation sought to answer the following questions:
Methodology
Participants
Participants in this study were those who voluntarily elected to participate from
among all students enrolled in two sections of an undergraduate educational
psychology course at a medium-sized Midwestern state university. The two
sections were taught at different times on the same day; they were otherwise
identical with respect to content and instruction, using the same textbook, syl-
labus, and PowerPoint-driven lectures. The course, which introduced theories of
motivation and learning to pre-service teachers, was regarded as a general edu-
cation requirement for undergraduate education majors. Students indicated their
desire to participate by signing and returning a consent form that outlined the
objectives for this study. Of the original target group of 48 students, 39 student-
participants were identified. This final group comprised 21 males and 18
females. Twenty students (10 males and 10 females) from the morning section
of the course chose to take part in the study, while 19 students (11 males and 8
females) self-identified as participants from the evening section. All were classi-
fied by the university as education degree–seeking, undergraduate students. The
mean age of the student participants was 26.61. The mean grade point average
of the student participants was 2.85.
Instrument
In week one of the semester, students were administered the Theories of Intel-
ligence Scale (Dweck, 2000). This is a four-item instrument designed to inves-
tigate perceptions of the malleability of intelligence. Student-participants
completed this instrument by responding to four items on a 6-point Likert scale
which ranged from strongly agree (1) to strongly disagree (6). The four items of
this measure depict intelligence as a fixed entity (i.e., “You have a certain amount
of intelligence and you can’t really do much to change it”); confirmation and vali-
dation studies suggest that disagreement with these items reflects agreement
with incremental theory. Previous data suggest that, with respect to construct
validity, this measure is distinct from those of cognitive ability, self-esteem, and
self-efficacy (Dweck et al., 1995). Cronbach alpha reliability for this version of
the scale was established as .80 (Hong et al., 1999).
Task
As a regular feature of the educational psychology course, four objective exami-
nations were administered. These examinations consisted of 50 multiple-choice
Praxis-type items which were electronically scored. For comparative purposes,
the mean of these four examination scores was utilized to create the
independent variable, enabling the comparison of the highest-performing stu-
dents in the course (those who scored at the 75th percentile or above) with their
relatively lower-scoring peers. Students were also made aware of a feature of the
course through which each student was given an opportunity to write and sub-
mit short-answer and multiple-choice questions for textbook chapters covered in
each week’s instruction. This methodology served two purposes for students in
the course: (1) this self-regulatory strategy helped them to learn the material and
prepare for the upcoming Praxis examination and (2) students were able to earn
extra-credit points towards their final grade in the course. The points earned for
writing a question and supplying the corresponding answer could then be used
to bolster a student’s mean grade in the course. Points earned were based upon
the cognitive complexity of the question: completion items were worth one point
each, multiple choice items measuring knowledge were worth two points each
and multiple-choice items measuring comprehension were worth three points
each. At the beginning of each week, students were also asked to indicate how
many items they expected to write for a particular week and, on a 10-point scale,
both how important it was for them to obtain bonus points and how confident
they were in their ability to complete this self-regulatory task.
Initially, students were informed that they were free to select from any of the
three question formats when composing their questions. For the final half of
the course (after the administration of exam 2), students were informed that
only multiple-choice items measuring comprehension (3-point items) would be
accepted.
Results
Results from tests 1 through 4 were recorded and averaged for each student,
yielding a mean score. This score was then used to classify students into one of
two groups: those who scored at the 75th percentile or above (n=11) and those
who scored below the 75th percentile (n=28). For this sample, the mean score
for the four objective examinations was 78.42; those scoring at or above the 75th
percentile achieved a mean score of 86.00 or higher. See website
http://rapidintellect.com/AEQweb/win2006.htm.
Table 1 displays mean values for the Theories of Intelligence Scale. For each
item, those students whose average score was at or above the 75th percen-
tile expressed more strongly held incremental views (as evidenced in a higher
score for each item, which expresses a higher level of disagreement with entity theory).
Theoretical Framework
The author uses the work of Carol Dweck as a framework to guide the research
plan. By focusing on epistemological beliefs, as defined in Dweck’s work, the
author has specified what type of data will be collected. In this section, the
reader is introduced to key terms (i.e., the distinction between entity and incre-
mental thinkers) and the ways in which this may help to address the problem
at hand.
Research Questions
Here, the researcher specifies the goals of the investigation. While directional
hypotheses are not provided, they are implied in the theoretical framework;
namely that students who internalize and express an incremental view of intel-
ligence will exhibit more self-regulatory behavior, which in this instance is
manifested in differences in the academic performance and frequency of self-
regulatory behaviors of his students.
Sample
As is true of most action research studies, the author focuses on a small sample
size, in this case, students enrolled in two sections of an educational psychol-
ogy course. This limits the ability to generalize the study's results
to other student populations. Generalization, however, is not the goal of the study. The
author wishes to evaluate the impact of self-beliefs and self-regulatory behav-
iors on academic achievement.
Methodology
The methodology for this study is informed by both the problem statement
and theoretical framework. Using Dweck’s framework, students’ beliefs about
intelligence are recorded. This allowed for the classification of students into
one of two groups: high- and low-scoring students. This classification then
allowed students to be compared on a self-regulatory task (the creation of
study-guide questions) and academic performance (exam scores). This allows
the author to investigate at least one possible explanation behind the problem
statement (why students avoid challenging tasks).
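The grouping step described here is straightforward to reproduce. The following sketch uses invented scores (not the study's data) and assumes numpy is available; it averages four exam scores per student and splits the group at the 75th percentile of the means.

```python
import numpy as np

rng = np.random.default_rng(42)
exam_scores = rng.normal(78, 8, size=(39, 4))   # hypothetical: 39 students x 4 exams
mean_scores = exam_scores.mean(axis=1)          # one mean examination score per student

cutoff = np.percentile(mean_scores, 75)
high_group = mean_scores >= cutoff
print(f"75th-percentile cutoff: {cutoff:.1f}")
print(f"high scorers: {high_group.sum()}, lower scorers: {(~high_group).sum()}")
```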
Data Collection
Data collection is a systematic process informed by the research design. This
researcher collected three sources of quantitative data: examination scores, num-
ber of questions written, and scores on the Theories of Intelligence Scale for
each student-participant.
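A researcher replicating this data collection might also check the internal consistency of the Theories of Intelligence Scale responses against the published value of .80. The computation below is a generic Cronbach's alpha run on invented responses; it is not the study's data or analysis.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for a (respondents x items) matrix of scale responses."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical responses: 39 students x 4 items rated 1-6, where higher ratings
# mean stronger disagreement with the entity-worded items.
rng = np.random.default_rng(7)
base = rng.integers(2, 6, size=(39, 1))
responses = np.clip(base + rng.integers(-1, 2, size=(39, 4)), 1, 6)

print(f"Cronbach's alpha = {cronbach_alpha(responses):.2f}")
```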
Interpretation
As with all action research studies, the interpretation of results is done with an eye
toward improving the research environment itself. These data suggest that
students who express a more incremental view of intelligence relative to their
classmates tend to exhibit more self-regulatory behaviors and score better
on classroom assessments. The author suggests two concrete changes to the
classroom environment in light of these findings: (1) to work toward helping
students develop a more malleable view of intelligence and learning and (2)
to teach students specific self-regulatory skills to aid in the retention of new
material.
■ Summary
■ Recommended References
Armstrong, J. (2001). Collaborative learning from the participants’ perspective. Paper
presented at the 42nd Annual Adult Education Research Conference, June 1–3,
2001, Michigan State University, East Lansing, Michigan.
Briscoe, C., & Peters, J. (1996). Teacher collaboration across and within schools: Sup-
porting individual change in elementary science teaching. Science Education, 81,
51–65.
Calabrese, R. L., Hester, M., Friesen, S., & Burkhalter, K. (2010). Using appreciative
inquiry to create a sustainable rural school district and community. International
Journal of Educational Management, 24(3), 250–265.
Harper, B. (2006). Epistemology, self-regulation and challenge. Academic Exchange
Quarterly, 10, 121–125.
Tabachnick, B., & Zeichner, K. (1998). Idea and action: Action research and the devel-
opment of conceptual change teaching of science. Idea and Action, 14, 309–322.
PART 6
THE "CONSUMER" OF RESEARCH
=
= CHAPTER SEVENTEEN
Analyzing and Critiquing a Research Study
OBJECTIVES
ALL OF THE chapters preceding this one have discussed designing
and conducting a research study as preparation for carrying out that
activity. However, researchers (and nonresearchers, as well) are also
“consumers” of research when they read and attempt to understand research
articles appearing in journals. In fact, even a dedicated researcher typically
spends more time reading studies done by others than designing and conduct-
ing new research. Particularly in planning a research study, one needs to find
and read relevant literature.
When reading a research study, it is necessary to understand it, compre-
hending its problem, methodology, and results, in order to interpret and use its
findings. This understanding requires knowledge of what problem the study
investigated, what variables and operational definitions it articulated, how it
controlled for potentially confounding variables and measured or manipu-
lated variables of interest, what research design it employed, how it analyzed
data, what those data indicated, and what meaning the researcher found in the
results. These determinations require analysis of the study into all of its com-
ponent parts and elements.
1. (a) Does the research report articulate a problem statement? If so, (b) what does it
say? (c) Is it a clear statement? (d) Is it introduced prior to the literature review?
2. Does the problem statement give a complete and accurate statement of the prob-
lem actually studied, or does it leave out something?
3. Does the study’s problem offer sufficient (a) workability, (b) critical mass, and (c)
interest?
4. (a) Does the problem studied offer theoretical and practical value? (b) Does the
report establish these criteria?
5. Does the literature review present a high-quality overview? Does it achieve ade-
quate (a) clarity, (b) flow, (c) relevance, (d) recency, (e) empirical focus, and (f)
independence?
6. Does the literature review include technically accurate citations and references?
7. What actual variables does the study examine? Identify: (a) independent, (b) moderator
(if any), (c) dependent, and (d) control variables (only the most important two or three).
8. (a) What intervening variable might the study be evaluating? (b) Was it suggested
in the research report?
9. (a) Does the introduction offer hypotheses? If so, (b) what are they? Are they (c)
directional, (d) clear, (e) consistent with the problem, and (f) supported by effective
arguments?
10. What operational definitions did the researcher develop for the variables listed in
answering Question 7?
11. (a) What type of operational definition was used for each variable? (b) Was each
definition sufficiently exclusive to the corresponding variable?
12. In controlling for extraneous effects, (a) how did the study prevent possible bias to
certainty introduced by the participants it employed, and (b) did these precautions
completely and adequately control for those effects?
13. In controlling for extraneous effects, (a) how did the study prevent possible bias to
certainty introduced by the experiences it presented, and (b) did these precautions
completely and adequately control for those effects?
14. In controlling for extraneous effects, (a) how did the study prevent possible bias to
generality introduced by the participants it employed, and (b) did these precautions
completely and adequately control for those effects?
15. In controlling for extraneous effects, (a) how did the study prevent possible bias to
generality introduced by the experiences it presented, and (b) did these precau-
tions completely and adequately control for those effects?
16. (a) Which variables did the study manipulate? (b) How successfully did the
researcher carry out the manipulation?
17. (a) What design did the study employ, and (b) how adequately did it ensure
certainty?
18. For each measurement procedure in the study, (a) what evidence of validity does
the research report provide, and (b) does this information indicate adequate
validity?
19. For each measurement procedure (including observation) in the study, (a) what evi-
dence of reliability does the research report provide, and (b) does this information
indicate adequate reliability?
20. (a) Which statistics did the study employ, (b) were they the right choices (or should
it have used different ones or additional ones), and (c) were the procedures and
calculations correctly completed?
21. (a) What findings did the study produce, and (b) do they fit the problem
statement?
22. Did the research report adequately support the study’s findings with text, tables,
and figures?
23. How significant and important were the study’s findings?
24. Did the discussion section of the research report draw conclusions, and were they
consistent with the study’s results?
25. (a) Did the discussion section offer reasonable interpretations of why results did
and did not match expectations, and (b) did it suggest reasonable implications
about what readers should do with the results?
The remainder of this chapter analyzes a published study, reprinted later in the chapter,
using the questions in Figure 17.1 to illustrate the process. Readers are encour-
aged to attempt an analysis and critical evaluation of the study prior to reading
the explanations that follow.
The introductory section is the first section of an article. It usually does not
follow a heading because none is needed to tell the reader where the section
starts. The abstract, which precedes the introduction, is a condensed version
of the entire article rather than a part of any of its sections. The introductory
section typically introduces the reader to the problem and presents a literature
review. It may also offer hypotheses.
The problem is the question that the study seeks to answer. The introduction
presents or communicates it as a problem statement. The two are separated
here, because the statement of the problem does not always correspond to the
problem actually studied. Four criteria govern an evaluation of the problem
statement: location, clarity, completeness, and accuracy. The first step in ana-
lyzing and evaluating a problem statement is to read through the entire article
and locate every version of this statement. Such a sentence generally begins:
“The purpose of this study was . . .” or words to that effect. After identifying
the problem statement, analysis seeks to answer some questions about it.
1. (a) Does the research report articulate a problem statement? If so, (b)
what does it say? (c) Is it a clear statement? (d) Is it introduced prior to
the literature review?
If the problem statement refers to the variables in a way that does not represent the variables actually studied, then the
problem statement is not an accurate one. Occasionally, researchers name their
independent variable in a way that better represents a possible intervening vari-
able (for example, calling “exercise” by the name “physical fitness” or “choos-
ing to continue a task” by the name “motivation”). This practice diminishes the
accuracy of the problem statement. Sometimes researchers talk about the mod-
erating effect of a variable and then do not apply statistical analysis that tests
for that effect, creating problems with inaccuracy of the problem statement.
Often, the true problem of a study is revealed only in the description
of data analysis techniques and results. These representations list the actual
variables and reveal the relationships actually tested. Tables, in particular, can
reveal a great deal about the problem actually studied.
The criteria for evaluating a research problem itself have already been men-
tioned in Chapter 2, so this discussion will only briefly review them.
3. Does the study’s problem offer sufficient (a) workability, (b) critical
mass, and (c) interest?
4. (a) Does the problem studied offer theoretical and practical value? (b)
Does the report establish these criteria?
The two most important criteria for evaluating a problem from the read-
er’s or reviewer’s perspective are its theoretical and practical value. Theoretical
value reflects a study’s contribution to a field’s understanding of a phenom-
enon. It addresses the question: Why did it happen? Then it attempts to answer
this question by articulating a theoretically based intervening variable. If no
one has studied the problem before or others have recommended that someone
study it, these facts do not provide a study with theoretical value. This value
comes from the study’s contribution to efforts to choose between alternative
explanations or to settle on one developed on the basis of prior theoretical
work. References to a theory in the study’s literature review and citations often
indicate that it builds on an established theoretical base. The study’s author
should explicitly demonstrate that base, rather than expecting readers to do
this on their own, to establish the study’s theoretical value. This background
should preferably be laid down in the introductory section of the article.
Practical value reflects the study’s contribution to subsequent practical
applications. Do the results of this study have the potential to change practice?
In an applied field like education, this value may result in potential changes in
the way people teach or study or administer institutions or counsel.
Theoretical and practical value represent the significance of a study,
the justification for undertaking it prior to seeing the results. Therefore,
the author should explicitly establish a study’s anticipated value or signifi-
cance in the introductory section so that the reader need not guess at or
imagine potential benefits. Studies in education and other applied fields
often promise practical value, but considerably fewer aspire to theoretical
value. However, the link between research and theory gives research its
explanatory power. Therefore, theoretical value should not be overlooked.
Literature Review
The quality of a literature review depends primarily on its clarity and flow.
It should lead the reader through relevant prior research by first establishing
the context of the problem, then reviewing studies that bear on the problem,
and ultimately providing a rationale for any hypotheses that the new study
might offer. One way to determine the sequence and flow of the literature
review, indeed of the entire introductory section, is to number its paragraphs
and then, on a separate sheet of paper, write a single summary sentence that
states the essence of each paragraph’s content. If a researcher can adequately
summarize information in this way, the process provides evidence of the clar-
ity of the literature review. Then by reading over these summary sentences in
order, she or he can evaluate the degree to which the literature review presents
a reasonable, logical, and convincing flow.
The quality of the literature review also depends on the degree of relevance
of all studies reviewed or cited, that is, how closely their topics are related
to the current study’s problem. “Name dropping” or “padding” by including
irrelevant citations reduces quality rather than enhancing it. Another useful
analysis tries to determine whether a literature review omits any relevant work,
but this determination of omissions is challenging for any reader not intimately
familiar with the field.
Finally, quality also depends on the recency of the work cited. Except for
“classic” studies in the field, a literature review should cite research completed
within 10 years of the study itself. An analyst evaluates the empirical focus of
the work cited in a literature review by determining whether most of it is data-
based as opposed to discursive. Judgments of the independence of the work
cited reflect the common expectation that a substantial portion of the citations
refer to studies by researchers other than the current study’s own author(s).
The evaluation of the technical accuracy of citations and references is based on three considerations: (1) that the reference list
includes all articles cited in the text, (2) that the text cites all articles in the ref-
erence list, and (3) that all text citations and references display proper forms
according to some accepted format such as that in the Publication Manual of
the American Psychological Association (APA, 2009). Despite the usual editing,
surprising numbers of errors in these three areas come to the attention of care-
ful readers. These errors may cause difficulties in following up on particular
studies cited in an article’s literature review.
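Part of this three-way check can be automated once the citation keys have been listed. The sketch below simply compares two sets of (author, year) keys; the keys are made up, and extracting them from a real manuscript would be a separate step.

```python
# Hypothetical citation keys, written as (first author, year) pairs.
in_text_citations = {("Smith", 2015), ("Lee", 2018), ("Garcia", 2012)}
reference_list = {("Smith", 2015), ("Lee", 2018), ("Nguyen", 2010)}

print("Cited in text but missing from the references:",
      in_text_citations - reference_list)
print("Listed in the references but never cited:",
      reference_list - in_text_citations)
```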
Variables
7. What actual variables does the study examine? Identify: (a) indepen-
dent, (b) moderator (if any), (c) dependent, and (d) control variables
(only the most important two or three).
judge their individual status (secondary to the major independent variable) and
the researcher’s reasons for including them (to see if they mediate or moder-
ate the relationship between the main independent variable and the dependent
variable).
Inspection of the method of analysis, tables, and results can often help a
reader to identify variables. The number of variables (or factors) in an analysis
of variance, for example, reveals the number of independent plus moderator
variables, while the numerical value of each factor reveals the number of lev-
els it contains. Thus, a 2 × 3 analysis of variance would include a two-level
independent variable and a three-level moderator variable. Variable names can
often be determined from analysis of variance source tables (when they are
provided) or from tables of means. The number of analyses that a study runs
often provides a clue to the number of dependent variables.
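To make the 2 × 3 case concrete, the sketch below fits a two-way analysis of variance on invented data with a two-level independent variable (treatment) and a three-level moderator (ability). It assumes the pandas and statsmodels libraries and is not an analysis from any study discussed in this chapter; the printed source table lists each factor, their interaction, and the residual.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

rng = np.random.default_rng(1)
ability_effect = {"low": 0, "average": 1, "high": 2}
rows = []
for treatment in ("control", "experimental"):      # two-level independent variable
    for ability in ability_effect:                  # three-level moderator variable
        for _ in range(10):                         # hypothetical: 10 subjects per cell
            score = rng.normal(70 + 6 * (treatment == "experimental")
                               + 4 * ability_effect[ability], 8)
            rows.append({"treatment": treatment, "ability": ability, "score": score})
data = pd.DataFrame(rows)

model = ols("score ~ C(treatment) * C(ability)", data=data).fit()
print(sm.stats.anova_lm(model, typ=2))  # 2 x 3 source table with the interaction term
```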
Analysis usually requires information from the method section of a research
report to determine the important control variables. A baseline or pretest score
on the dependent variable is often a study’s most important control variable.
Also, pay particular attention to gender, grade level, socioeconomic status, race
or ethnicity, intelligence or ability, time, and order.
Analysis to identify variables must avoid confusing variables and levels. A
categorical or discrete variable divides conditions or characteristics into levels.
A treatment versus a control condition would be two levels of a single variable
rather than two variables. High expectations versus low expectations would be
two levels of a single variable rather than two variables. Encouraging feedback
versus neutral feedback represent two levels of a single variable, as do good
versus poor readers. In order to vary, a variable must contain at least two
levels.
Continuous variables are not divided into levels; they take on numerical values or
scores. Most studies include continuous dependent variables, while many (but
not all) include categorical or discrete independent and moderator variables.
A reader must also recognize the distinction between variables a study
measures and those it manipulates, a distinction clarified in the operational
definitions. Dependent variables are never manipulated, while other types can
be either measured or manipulated.
8. (a) What intervening variable might the study be evaluating? (b) Was
it suggested in the research report?
Hypotheses
9. (a) Does the introduction offer hypotheses? If so, (b) what are they?
Are they (c) directional, (d) clear, (e) consistent with the problem, and
(f) supported by effective arguments?
Operational Definitions
10. What operational definitions did the researcher develop for the vari-
ables listed in answering Question 7?
11. (a) What type of operational definition was used for each variable?
(b) Was each definition sufficiently exclusive to the corresponding
variable?
stated in words the specific aspect of the performance that made it an outstand-
ing example? As an operational definition becomes more exclusive, it supports
a stronger conclusion that a researcher studied the intended phenomenon and
not something else.
The method section is the middle section of a research article, typically pre-
ceded by the heading “Method.” It describes subjects, subject selection pro-
cesses, methods for manipulating or measuring variables, procedures followed,
and the research design. (This section sometimes describes statistical tests, but
we will defer the discussion of evaluating statistical procedures until the next
section.) Rather than analyzing and evaluating each of the topics described in
this section, a more meaningful analysis and evaluation would judge methodol-
ogy in terms of its adequacy for controlling any sources of bias that threaten
internal validity (certainty) and external validity (generality). Hence, the sub-
sequent presentation will be organized along these lines.
Remember that a study can either manipulate or measure its variables. Con-
sider, first, the issue of variable manipulation. Researchers manipulate variables
for two purposes: (1) to control extraneous variables (influences called control
variables in this book) in order to maximize certainty and generality, or (2) to
create independent variables that represent the results of a manipulation. An
evaluation of the first purpose for manipulation follows the model defined in
the four windows shown in Table 7.3. The first two questions asked in analyz-
ing this aspect of a study deal with certainty.
12. In controlling for extraneous effects, (a) how did the study prevent
possible bias to certainty introduced by the participants it employed,
and (b) did these precautions completely and adequately control for
those effects?
13. In controlling for extraneous effects, (a) how did the study prevent
possible bias to certainty introduced by the experiences it presented,
and (b) did these precautions completely and adequately control for
those effects?
14. In controlling for extraneous effects, (a) how did the study prevent
possible bias to generality introduced by the participants it employed
and, (b) did these precautions completely and adequately control for
those effects?
15. In controlling for extraneous effects, (a) how did the study prevent
possible bias to generality introduced by the experiences it presented,
and (b) did these precautions completely and adequately control for
those effects?
16. (a) Which variables did the study manipulate? (b) How successfully
did the researcher carry out the manipulation?
The final issue under manipulation and control deals with the effectiveness
of manipulations in creating the states or conditions required by the variables
being manipulated. Such a so-called manipulation check is a recommended part
of conducting a research study that employs a manipulation-based variable. In
the study of teacher enthusiasm described near the end of Chapter 7, teach-
ers were trained to display three different levels of enthusiasm. To determine
whether they did, in fact, create the intended levels as instructed, the research-
ers observed and rated the levels of teaching enthusiasm. (Table 7.5 reported
the results.) Data like this help readers to appraise the success of the manipula-
tion. Without any such evidence, readers are left guessing about the manipula-
tion’s success; they can only form critical opinions regarding the absence of
such important information.
Research Design
Identifying the research design implemented in a study helps with the task of
evaluating its certainty, while also clarifying its procedures. This analytical task
resembles that of identifying the different types of variables in a study in that
both help to clarify a study’s subject and how the researcher proceeded with the
investigation. Since the research report seldom names the design model imple-
mented in a study, the reader must figure it out from available information.
17. (a) What design did the study employ, and (b) how adequately did it
ensure certainty?
The first clue for identifying the design is whether the independent variable
was manipulated or measured. If the independent variable was manipulated, then
the study employed an experimental design; if the independent variable was mea-
sured, then it implemented an ex post facto design. Assume that the independent
variable was manipulated. Then the next clue is whether or not the researcher
presented a pretest, followed by whether subjects were randomly assigned to
levels of the independent variable, intact groups served as samples, or subjects
served as their own controls. The combination of these determinations differ-
entiates preexperimental, true experimental, and quasi-experimental designs and
their specific variations. A final determination checks for inclusion of moderator
variables, and if so, how many the design accommodated and how many levels
each one contained. This information helps a reader to identify factorial designs.
If the independent variable was measured, then the analytic reader seeks
to distinguish between correlational and criterion-group designs by determin-
ing whether or not this variable was divided into levels. (Only criterion-group
designs divide independent variables into levels.) If a perceptive reader finds
levels, then she or he must search further for moderator variables (a step pre-
sumably completed earlier) to determine whether the criterion-group design
was factorialized.
After determining all of these details, a useful additional step is to attempt
to draw the design, representing its exact particulars using the notation pro-
vided in Chapter 8.
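Read as a decision procedure, these clues can be combined in a few lines. The function below is a simplified sketch of that logic, not the book's formal taxonomy: it ignores own-control designs and several other variations mentioned above, and the flag names are invented for illustration.

```python
def identify_design(iv_manipulated: bool,
                    random_assignment: bool = False,
                    intact_groups: bool = False,
                    iv_divided_into_levels: bool = False) -> str:
    """Rough classification of a study's design from the clues described above."""
    if iv_manipulated:
        if random_assignment:
            return "true experimental design"
        if intact_groups:
            return "quasi-experimental design"
        return "preexperimental design"
    # A measured independent variable signals the ex post facto family.
    return ("criterion-group design" if iv_divided_into_levels
            else "correlational design")

print(identify_design(iv_manipulated=True, random_assignment=True))         # true experimental design
print(identify_design(iv_manipulated=False, iv_divided_into_levels=False))  # correlational design
```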
As a way of evaluating a design, a reader can judge the degree to which
it contributes to or detracts from the study’s certainty. Additional speculation
could ask how the design might be improved. Occasionally, adding a pretest or
a moderator variable could provide such an improvement. For example, a study
that treated procrastination tendency as a control variable could have divided
up that variable into two or three levels and used it as a moderator variable. In
other words, the experimental design could have been factorialized. (Chapter 4
discusses considerations affecting the choice of whether to simply control a vari-
able or to study it as a moderator variable.)
Measurement of Variables
After considering manipulated variables, should a study include any, the next
step of the analysis considers measured variables, examining the quality of the measurement procedures used.
18. For each measurement procedure in the study, (a) what evidence of
validity does the research report provide, and (b) does this informa-
tion indicate adequate validity?
Concerns in these sections will focus on statistical tests, the nature and presen-
tation of results, and functions of discussion.
Statistical Tests
20. (a) Which statistics did the study employ, (b) were they the right
choices (or should it have used different ones or additional ones), and
(c) were the procedures and calculations correctly completed?
The first part (a) is the analysis element of Question 20. It simply deter-
mines the statistical tests that the study applied to its data. This information
usually appears in the results section of a research report. For example, a study
may detail the choice of analysis of covariance (ANCOVA) in both the meth-
ods and results sections.
To answer the second part (b), first refer to the problem statement to see
if it specifies statistical tests to answer the question or questions it poses. If
this statement specifies a moderator variable, for example, then statistical tests
should have analyzed that variable together with the independent variable. As a
further step, look for some typical examples of questionable practices: continu-
ing to perform additional, subordinate statistical tests despite failure by initial,
superordinate ones to yield significant results; using parametric tests without
evidence of normally distributed data or with categorical data; not including
both levels of a moderator variable in the same analysis.
Check to see whether data bearing on all of the problems posed were actu-
ally subjected to statistical analysis in order to provide adequate answers. See
whether statistical tests actually completed the comparisons specified in the
problem statement. A researcher may, for example, test whether differences
in the independent variable affect the dependent variable but fail to follow up
this analysis by making direct comparisons between levels of the independent
variable.
The question of whether the study correctly carried out its statistical
tests (part c) requires a difficult judgment; often a reader cannot answer this
question from available information. A definitive answer requires sufficient
tables to allow confirmation of various aspects of the statistical approach.
This evaluative judgment also requires a reasonably strong background
in statistics. When a study provides analysis of variance source tables, for
example, check the entries to confirm that sources have not been overlooked
(that is, determine that the variance has been correctly partitioned), and that
mistakes have not been made in computing degrees of freedom and sums of
squares.
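The partitioning check amounts to confirming that the component degrees of freedom and sums of squares add up to the totals. A minimal sketch, using an invented 2 × 3 source table:

```python
def source_table_adds_up(sources: dict) -> bool:
    """sources maps each source of variance to (df, sum of squares); 'Total' holds the totals."""
    df_sum = sum(df for name, (df, ss) in sources.items() if name != "Total")
    ss_sum = sum(ss for name, (df, ss) in sources.items() if name != "Total")
    total_df, total_ss = sources["Total"]
    return df_sum == total_df and abs(ss_sum - total_ss) < 1e-6

# Hypothetical source table for a 2 x 3 design with 60 subjects (total df = 59).
table = {
    "Treatment": (1, 412.0),
    "Ability": (2, 655.0),
    "Treatment x Ability": (2, 98.0),
    "Within (error)": (54, 3540.0),
    "Total": (59, 4705.0),
}
print(source_table_adds_up(table))  # True
```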
A critical reader must answer a number of questions about the results of a study.
21. (a) What findings did the study produce, and (b) do they fit the prob-
lem statement?
22. Did the research report adequately support the study’s findings with
text, tables, and figures?
23. How significant and important were the study's findings?
This is the key question. Did the research discover anything of substance?
Did it reveal any significant differences? To find nothing is to prove nothing.
24. Did the discussion section of the research report draw conclusions,
and were they consistent with the study’s results?
25. (a) Did the discussion section offer reasonable interpretations of why
results did and did not match expectations, and (b) did it suggest rea-
sonable implications about what readers should do with the results?
Read the study reprinted in this section (Tuckman 1992b). The remainder of
this chapter will analyze and critique this study as an example.
=
The Effect of Student Planning and
Self-Competence on Self-Motivated Performance
Bruce W. Tuckman
Florida State University
The purpose of this study was to determine the effect of planning on stu-
dent motivation—in this case, the amount of effort put forth by college
students on a voluntary, course-related task. A second purpose was to
determine whether planning effects varied for students whose beliefs in
their own level of initial competence at the task varied from high to low.
Student motivation becomes an increasingly important influence on
teaching and learning outcomes as students progress through the grades.
As students get older, school requires them to exercise far greater control
over their own learning and performance if they are to succeed. Tuckman
and Sexton (1990) labeled the amount or level of performance self-regu-
lated performance and used it as an indication of motivation. In contrast to
quality of performance, which depends on ability, quantity of performance
can be assumed to represent the amount of effort students are willing to
apply to their assignments and school responsibilities. Because students
are not able to modify their ability levels in the short term, effort becomes
the prime causal attribute to modify if school outcomes are to be success-
ful (Weiner, 1980).
Self-regulated performance is posited to be different from self-regu-
lated learning, the latter being comprised of both competence and choice,
whereas the former primarily reflects choice. McCombs and Marzano
(1990) define self-regulated learning as a combination of will—a state of
motivation based on “an internal self-generated desire resulting in an inten-
tional choice,” which is primary in initiating self-regulation—and skill—“an
acquired cognitive or metacognitive competency” (p. 52). Self-regulated
performance is viewed as the result of a state of motivation that is based
considerably more on will than on skill (Tuckman & Sexton, 1990).
Self-regulated performance is largely a function of students’ self-beliefs
in their competence to perform (Bandura, 1986). However, it is affected
by external forces such as informational feedback and working in groups,
both of which tend to enhance self-regulated performance but primarily
among persons who believe themselves to be average in self-competence
(Tuckman & Sexton, 1989). Goal setting tends to motivate those low in
perceived self-competence (Tuckman, 1990), whereas encouragement
influences students at all self-competency levels to perform (Tuckman &
Sexton, 1991).
An important variable that would seem to affect self-regulated per-
formance is planning, yet little attention has been paid to it in research on
student motivation. The focus instead has been on goal setting, which is
only one aspect of planning. Other features of planning for goal-directed
performance include (a) where and how performing will be done, (b) the
reasons for performing, (c) the manner of dealing with potential obstacles
and unexpected events, and (d) the specification of incentives, if any, that
one will provide oneself for goal attainment (Tuckman, 1991a).
Some studies of goal setting have included some of the above features
(Gaa, 1979; Locke, Shaw, Saari, & Latham, 1981; Mento, Steel, & Karren, 1987;
Schunk, 1990; Tuckman, 1990), but not systematically, and have found pos-
itive effects on self-regulated performance. There have been, however, no
findings about the combined effect of all of the above aspects of planning
on motivation. There has also been no research on whether planning is dif-
ferentially effective for students varying in perceived ability.
Because research has shown a positive effect of goal setting on per-
formance, planning, which incorporates goal setting, was expected to
enhance the amount of performance of students in this experiment. This
enhancement effect was expected primarily for students whose percep-
tions of their own competence to perform the task was low to average.
This prediction is based on prior findings that external influences minimally
affect students who perceive themselves to be high in competence (Tuck-
man, 1991a).
Method
Subjects were 130 junior and senior teacher education majors in a large
state university. The majority were women, and the mean age was 21 years.
All were enrolled in one of four sections of a required course in educational
psychology, which covered the topics of test construction and learning
theory. All sections were taught by the same instructor.
The course included a procedure for allowing students to earn extra
credit toward their final grade. The procedure was called the Voluntary
Homework System or VHS (Tuckman & Sexton, 1989, 1990), and it served
as the performance task for this study. Subjects were given the opportunity
to write test items on work covered in that week’s instruction. Completion
items were worth 1 point each; multiple-choice items, 2 points each; and
multiple-choice comprehension items, 3 points each. Point values reflected
the effort required to produce each type of item. Submitted items were
loosely screened for quality and returned for corrections where necessary.
VHS extended over the first 4 weeks of a 15-week course, and the
points earned each week were cumulative. Subjects who earned 350
points or more received a full-grade bonus for the first third of the course
(e.g., a B became an A); subjects who earned between 225 and 349 points
received a two-thirds grade bonus (e.g., a B became an A-); subjects who
earned between 112 and 224 points received a one-third grade bonus (e.g.,
a B became a B+); and subjects who earned fewer than 112 points earned
no bonus (nor were they penalized). Bonus received, a reflection of the
Results
The initial equivalence of the four classes was determined by comparing
them on self-competence level, outcome importance, procrastination ten-
dency, and advanced vocabulary. None of the differences approached sig-
nificance (F = 0.34, 1.00, 0.17, 0.33; df = 3/129, respectively), leading to the
conclusion that the classes were equivalent. Hence, the design of the study
can be regarded as quasi-experimental with adequate control for potential
selection bias.
Eighteen subjects from the two sections that were given the planning
forms either failed to return them at the end of the 4 weeks or returned
them with little or nothing written on them. Their reasons for not doing the
planning, when asked, ranged from “no time” to “not necessary.” Therefore,
they were not included in the data analysis for the planning form group.
They were compared on perceived self-competence and on all of the control
variables with subjects who used the planning form and with subjects who
were not given forms; they were found not to differ significantly from either.
None of these 18 subjects earned double or triple performance bonuses.
Of the 54 subjects who used the planning form, 27 (50%) earned either
double or triple performance bonuses, compared with 16 (27.5%) of the 58
subjects not given the planning form. This comparison yielded a chi-square
value of 5.03 (df = 1, p < .05).
The effects of planning forms for subjects at two levels of perceived
self-competence (high and medium + low) were compared. Slightly more
than one-third of the high self-competence subjects in the planning form
group and the no planning form group earned double and triple bonuses.
Among the medium + low self-competence subjects, 21 of 37 subjects
(57%) in the group that used the planning form received double or triple
bonuses, compared with 9 of 38 subjects (24%) who did not use planning
forms. This comparison yielded a chi-square value of 7.24 (df = 1, p < .01).
Discussion
It can be concluded from the results of this study that using the planning
form had a strong positive effect on the self-regulated performance of
students to perform, particularly among students who believed that their
own performance capability was low to average. In fact, the use of the
planning form resulted in a greater percentage of students of medium and
low self-competence obtaining item-writing bonuses than students high in
self-competence.
Even when the planning form was used, it was surprising to see stu-
dents with low and medium perceived item-writing self-competence
[Sentence stems from the planning form: “To do so I will have to . . .”; “overcome . . .”; “because . . .”; “The responsibility for writing items is . . .”]
Author’s Note
This study was reported on at the annual meeting of the American Educa-
tional Research Association, Chicago, IL, 1991.
References
Bandura, A. (1977). Self-efficacy: Toward a unifying theory of behavior change. Psycho-
logical Review, 84, 191–215.
Bandura, A. (1986). Social foundations of thought and action: A social cognitive theory.
Englewood Cliffs, NJ: Prentice-Hall.
Bandura, A. (1989). Human agency in social cognitive theory. American Psychologist, 44,
1175–1184.
French, E. P. F. (1963). Test kit of cognitive factors. Princeton, NJ: Educational Testing
Service.
Gaa, J. P. (1979). The effect of individual goal-setting conferences on academic achieve-
ment and modification of locus of control orientation. Psychology in the Schools, 16,
591–597.
Locke, E. A., Shaw, K. N., Saari, L. M., & Latham, G. P. (1981). Goal setting and task perfor-
mance: 1969–1980. Psychological Bulletin, 90, 125–152.
McCombs, B. L., & Marzano, R. J. (1990). Putting the self in self-regulated learning: The
self as agent in integrating will and skill. Educational Psychologist, 25, 51–69.
Mento, A. J., Steel, R. P., & Karren, R. J. (1987). A meta-analytic study of the effects of
goal-setting on task performance: 1966–1984. Organizational Behavior and Human
Decision Processes, 39, 52–83.
Schunk, D. H. (1990). Goal setting and self-efficacy during self-regulated learning. Edu-
cational Psychologist, 25, 71–86.
Tuckman, B. W. (1990). Group versus goal-setting effects on the self-regulated perfor-
mance of students differing in self-efficacy. Journal of Experimental Education, 58,
291–298.
Tuckman, B. W. (1991a). Motivating college students: A model based on empirical evi-
dence. Innovative Higher Education, 15, 167–176.
Tuckman, B. W. (1991b). The development and concurrent validity of the Procrastination
Scale. Educational and Psychological Measurement, 51, 473–480.
Tuckman, B. W., & Sexton, T. L. (1989). The effects of relative feedback and self-efficacy
in overcoming procrastination on an academic task. Paper presented at the meeting
of the American Psychological Association, New Orleans, LA.
Tuckman, B. W., & Sexton, T. L. (1990). The relation between self-beliefs and self-regu-
lated performance. Journal of Social Behavior and Personality, 5, 465–472.
Tuckman, B. W., & Sexton, T. L. (1991). The effect of teacher encouragement on stu-
dent self-efficacy and motivation for self-regulated performance. Journal of Social
Behavior and Personality, 6, 137–146.
Weiner, B. (1980). Human motivation. New York: Holt, Rinehart and Winston.
Reprinted from the Journal of Experimental Education, 60 (1992), 119–127, by permission.
■ An Example Evaluation
This section presents an analysis and evaluation of the preceding study in order
to illustrate the process. Readers should attempt the process themselves before
reading further, and then check their “answers” against the information in this
section. Each of the 25 questions in Figure 17.1 will be considered in turn.
1. (a) Does the research report articulate a problem statement? If so, (b)
what does it say? (c) Is it a clear statement? (d) Is it introduced prior to the
literature review? Yes, the article gives a clear problem statement in the first
paragraph.
The purpose of this study was to determine the effect of planning on stu-
dent motivation—in this case, the amount of effort put forth by college
students on a voluntary, course-related task. A second purpose was to
determine whether planning effects varied for students whose beliefs in
their own level of initial competence at the task varied from high to low.
number of classes from which to draw subjects. (Without this access, such a
study would be a relatively unworkable project.) The topic studied is also of
considerable current interest. The study is satisfactory on this question.
4. (a) Does the problem studied offer theoretical and practical value? (b)
Does the report establish these criteria? The study promises some theoretical
value, which the article attempts to establish in the fourth paragraph of the dis-
cussion section by relating planning to self-beliefs, self-system, and metacog-
nitive strategies. A better arrangement would have mentioned this theoretical
base in the introduction. The study has obvious practical value, demonstrating
that planning is a highly practical strategy, but no mention of this value appears
until the last paragraph of the discussion section. The study seems to offer sat-
isfactory value, but the report does not clearly establish its importance in the
introduction.
5. Does the literature review present a high-quality summary? Does it
achieve adequate (a) clarity, (b) flow, (c) relevance, (d) recency, (e) empiri-
cal focus, and (f) independence? Of the introduction’s seven paragraphs, the
middle five constitute the literature review. (Literature is also cited in the dis-
cussion section, but these references do not constitute part of the literature
review.) The first paragraph of the literature review (the second paragraph of
the introduction, the first being the problem statement) attempts to show that
the study focused on motivation, as reflected in self-regulated performance.
The next paragraph distinguishes between self-regulated learning and perfor-
mance, while the following one cites studies relating various external condi-
tions and one internal propensity to self-regulated performance. The next
paragraph introduces and describes planning, the focus of the study, but cites
no reference directly related to it. The final paragraph of the literature review
discusses the impact of goal setting on motivation to perform.
An evaluation detects a striking absence of cited research on planning, a
failure to establish a theoretical base for the study, and a preponderance of
citations of work done by the study’s author. While the literature review is
reasonably clear and flows well, it shows substantial weakness in its omissions.
It must be evaluated as a considerably less than satisfactory element.
6. Does the literature include technically accurate citations and references?
The literature review follows a technically accurate format, except perhaps for
citing and referencing a paper presented at a professional meeting by year as
opposed to year and month (Tuckman & Sexton, 1989, August) as required
by the American Psychological Association (2009) reference format. However,
the procedure used was appropriate at the time the article was published, under
guidelines provided in an earlier edition of the publication manual.
7. (a) Does the introduction offer hypotheses? If so, (b) what are they? Are
they (c) directional, (d) clear, (e) consistent with the problem, and (f) supported
Paraphrased, the first hypothesis expects that students who plan will outperform those who do not. The second hypothesis asserts that students low to average in self-competence will gain a greater benefit from planning
than will students high in self-competence. The introduction supports the first
hypothesis with statements that “research has shown a positive effect of goal
setting on performance,” and “planning . . . incorporates goal setting.” It sup-
ports the second hypothesis by noting “prior findings that external influences
minimally affect students who perceive themselves to be high in competence”;
planning is one such external influence.
The study rates as excellent on this question.
8. What actual variables does the study examine? Identify: (a) independent,
(b) moderator (if any), (c) dependent, and (d) control variables (only the most
important two or three). The independent variable is planning versus no plan-
ning, a manipulated variable with two discrete levels.
The moderator variable is self-competence, a measured variable initially
continuous that is subsequently divided into two levels: high and medium
plus low. Note, however, that the levels changed from the original formulation
(from three to two), and that the levels were not compared in the same analysis,
a requirement for testing a variable’s moderating effect.
The dependent variable is performance, a measured variable, but one that
is cast into discrete categories as the number of bonuses a student earns.
One major control variable is initial self-competence level. (In addition to
serving as a moderator variable, it served as a control variable to establish the
equivalence of the classes.) Other control variables included outcome impor-
tance, procrastination tendency, and mental ability. All are measured, continu-
ous variables.
9. (a) What intervening variable might the study be evaluating? (b) Was it
suggested in the research report? The use of a system or metacognitive strat-
egy, particularly one that might not otherwise have been self-initiated or self-
regulated, is a possible intervening variable. In other words, the planning form
may have caused students to use a strategy they may not have implemented on
their own. This possibility is suggested in the discussion section.
10. What operational definitions did the researcher develop for the variables
listed in answering Question 8?
11. (a) What type of operational definition was used for each variable? (b)
Was each definition sufficiently exclusive to the corresponding variable? Plan-
ning versus no planning was operationally defined as receiving and using a plan-
ning form versus not receiving the form. It was a manipulation-based variable.
Self-competence level was operationally defined as subjects’ judgments of their
own task performance capability times their self-ratings of confidence in those
judgments. It was a static variable. The dependent variable, performance, was
operationally defined dynamically as number of performance bonuses earned
on an item-writing task. Outcome importance and procrastination tendency
were static judgments without explicit operational definitions. Mental ability
was a dynamic measure of vocabulary skill.
In evaluating exclusivity, one might question whether not receiving a plan-
ning form equates to not planning, and whether vocabulary skill equates to
mental ability. However, the use of a manipulation-based independent variable
and dynamic dependent variable gives the study a strong operational base.
12. In controlling for extraneous effects, (a) how did the study prevent pos-
sible bias to certainty introduced by the subjects it employed, and (b) did these
precautions completely and adequately control for these effects? The study con-
trolled possible subject bias affecting certainty by establishing the equivalence
of the four intact classes on initial self-competence level, outcome importance,
procrastination tendency, and mental ability, all of which were presumably
related to performance on the task, the dependent variable. The best possible
design would have randomly assigned students to planning and no-planning
conditions, but apparently the researcher was not in a position to alter the com-
position of the classes, and both treatment levels could not possibly have been
carried out in the same class. Pretesting students on actual task performance
would have given stronger control than pretesting them on the other control
variables, but the researcher could not have done this, since the planning forms
were given out prior to the introduction of the task. A better procedure would
have started the task 1 week prior to the introduction of the forms in order to
obtain a pretest measure of task performance. Lacking a direct pretest measure
of the dependent variable, the author resorted to establishing initial equiva-
lence on possible performance-related measures, but this precaution does not
ensure equivalence between the groups. The control efforts gave better assur-
ance than doing nothing, however.
A second problem is introduced by the elimination of subjects, creating
possible mortality bias. Data analysis eliminated 18 students from the plan-
ning group, because they did not fill out and return the forms as instructed.
To assess possible mortality bias, the researcher determined their reasons for
failing to comply and compared them to the students who did comply on the
four control variables, finding no differences. (See the second paragraph of the
results section and the sixth paragraph of the discussion section.) However,
none of those 18 subjects earned double or triple performance bonuses, while
50 percent of those that complied with instructions did earn such bonuses. In
eliminating those 18 noncompliers, did the researcher introduce a major bias
into the results, because they were motivationally different from the remain-
ing 54 students who actually filled out and returned the forms? Does their
equivalence with the larger group on the four measures ensure that they were
motivationally the same? The study gives no way to answer these questions
with certainty.
A third possible certainty/participant bias is introduced by possible dif-
ferences in the academic capabilities and related grade expectations of the
participating students, particularly as related to the moderator variable, task
self-competence. The motivation for writing test items was to obtain grade
bonuses. Students who expected, based on their past academic performance,
to get high grades would be less motivated to write items than students who
expected lower grades. If self-competence for item writing equated to self-
competence for grades, then the difference in performance of self-competence
groups on the task would relate less to planning than to characteristic motiva-
tion. The author speaks to this issue in the second paragraph of the discussion
section, indicating recognition of a relationship between self-competence for
item writing and self-competence for grades. Because of this link, a critical
reader must seriously question the certainty of the moderating effect being
a function of item-writing self-competence, as opposed to a function of task
motivation.
Finally, an evaluation must consider the question of students’ actual item-
writing capability (in contrast to their self-judged capability), which the study
did not assess as a control variable. Without random assignment, the researcher
cannot assure equal distribution of this capacity across groups. (The author
raises this question himself in the next-to-last paragraph of the discussion
section.)
These criticisms reveal a weakness in the study in its control for threats
to certainty, because it failed to control adequately for participant or subject
bias.
13. In controlling for extraneous effects, (a) how did the study prevent possi-
ble bias to certainty introduced by the experiences it presented, and (b) did these
precautions completely and adequately control for those effects? The researcher
controlled for experience bias as a threat to certainty by including a control
group whose members did not receive the planning form in order to assess the
16. (a) Which variables did the study manipulate? (b) How successfully did
the researcher carry out the manipulation? The study manipulated its inde-
pendent variable, planning versus no planning, by giving a planning form to
one group but not the other. It confirmed that the planning group actually
used the form by collecting returned, completed forms at the conclusion of the
study period. The researcher apparently made a reasonable assumption that
filling out the planning form constituted planning. (This variable was further
operationalized by eliminating data for students who did not return filled-out
forms.) For this level of the independent variable, the manipulation must be
considered a success.
The control group members received no planning forms, but the study
made no attempt to determine whether and to what extent they engaged in
planning without the forms. It might have gained useful insight by surveying
students at the end of the study period to determine the degree to which they
planned their item-writing activity.
17. (a) What design did the study employ, and (b) how adequately did it
ensure certainty? Based on the author’s description, this study followed the
nonequivalent control group design, diagrammed as:
O1 X1 O3
——————
O2 X2 O4
However, since O1 and O2 were not pretest measures in the strict sense of
the term (that is, they measured, not the dependent variable, but other, presum-
ably performance-related variables), then the design can be considered to be an
intact group comparison:
X1 O1
————
X2 O2
If the reader as evaluator settles on the second design, then he or she attri-
butes low certainty to the study; if the reader accepts the first design as the
correct one, then it achieves adequate certainty. The decision hinges on the
question of threats to certainty imposed by subjects. Clearly, this question
points out a major weakness of the study.
The statistical test was run three times: (1) for all students, (2) for students high in self-competence, and (3) for students medium and low in self-competence.
Since no single analysis compared students at the high versus medium plus low
self-competence levels, the design cannot be considered a factorial one.
18. For each measurement procedure in the study, (a) what evidence of
validity does the research report provide, and (b) does this information indicate
adequate validity? The dynamic dependent variable, level of bonus earned, was
based on performance on the item-writing task. The measure, number of points
earned, translated directly into bonus level based on preset criteria. This mea-
sure must be considered a highly valid indicator of performance, since it varies
directly with behavior and requires relatively little judgment for its assessment.
(The only judgment concerns the acceptability of the written items, which, the
author explains, were “loosely screened for quality and returned for correc-
tions where necessary.”)
Self-competence and outcome importance, both static variables, were mea-
sured by answers to direct questions, ensuring validity as long as students give
reasonably frank and self-aware responses. Procrastination (another static vari-
able) was measured by a scale that the author confirms as valid by quoting
a reference. Mental ability (a dynamic variable) was measured by a vocabu-
lary test to which it has been “linked,” according to the author, citing another
reference.
The measures are judged to be valid ones, with possible questions about
aspects of mental ability other than just vocabulary.
19. For each measurement procedure (including observation) in the study,
(a) what evidence of reliability does the research report provide, and (b) does
this information indicate adequate reliability? The research report gives no
specific evidence of the reliability of any measure. However, the dependent
measure of bonus earned is based on an objective measure of points that stu-
dents earned by writing items, so its reliability is not at issue (except perhaps
if judgment influenced item screening). The report characterizes the procras-
tination measure as a reliable measure without giving any specific numbers.
That leaves the measures of self-competence and outcome importance, the first
apparently based on two items, the second on one item, without any mention
of reliability. The article should have provided some indication of reliability for
these two measures.
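As an illustration of the kind of evidence that could have been supplied, the following Python sketch computes coefficient alpha, an internal-consistency estimate, for a hypothetical two-item scale; the responses are invented and are not data from the study:

import numpy as np

def cronbach_alpha(item_scores):
    """Internal-consistency estimate; rows are respondents, columns are items."""
    x = np.asarray(item_scores, dtype=float)
    k = x.shape[1]
    item_variances = x.var(axis=0, ddof=1)
    total_variance = x.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Invented responses to a hypothetical two-item measure scored on a 1-7 scale.
responses = [[6, 5], [4, 4], [7, 6], [3, 4], [5, 6], [2, 3], [6, 6], [4, 5]]
print(round(cronbach_alpha(responses), 2))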
20. (a) Which statistics did the study employ, (b) were they the right choices
(or should it have used different ones or additional ones), and (c) were the
procedures and calculations correctly completed? Initial class equivalence was
established by four one-way ANOVAs across the four classes, one for each
control variable. The study tested the overall effect of planning versus no plan-
ning by means of a chi-square test, the same test it applied to gauge the effect of
planning versus no planning on subjects low and medium in self-competence.
The researcher completed no statistical comparison among students high in self-competence, presumably because approximately the same number of students in the planning and no-planning conditions earned double or triple bonuses.
was significant at the .05 level. The difference for medium plus low self-com-
petence students was significant at the .01 level. Two facts suggest that the find-
ings were not only significant but important, as well: Almost twice as many
planning-form students earned big bonuses as those not given the form, and
among subjects medium and low in self-competence, more than twice as many
form users as nonusers earned big bonuses.
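Readers who wish to verify such results can recompute them from the frequencies reported in the results section. The Python sketch below does so; applying the continuity (Yates) correction is an assumption, since the article does not state which form of the test was used, but it yields values close to the reported statistics of 5.03 and 7.24:

from scipy.stats import chi2_contingency

# Rows: planning form used / no form; columns: earned a double or triple bonus / did not.
overall = [[27, 27], [16, 42]]        # all 54 form users versus all 58 nonusers
medium_low = [[21, 16], [9, 29]]      # medium + low self-competence subjects only

for label, table in (("overall", overall), ("medium + low", medium_low)):
    chi2, p, dof, expected = chi2_contingency(table, correction=True)
    print(f"{label}: chi-square = {chi2:.2f}, df = {dof}, p = {p:.3f}")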
24. Did the discussion section of the research report draw conclusions, and
were they consistent with the study’s results? The article presents the research-
er’s conclusions in the first paragraph of the discussion section. This mate-
rial restates the findings more than articulating broad conclusions. The author
restricted his conclusions to the effect of the planning form rather than extend-
ing them to the effect of planning. These conclusions were indeed consistent with the study's results, but the author might have drawn somewhat broader inferences from those results.
25. (a) Did the discussion section offer reasonable interpretations of why
results did and did not match expectations, and (b) did it suggest reasonable
implications about what readers should do with the results? Much of the dis-
cussion was devoted to interpretations, with particular emphasis on meth-
odological issues. (This evaluation has already referred to most of the issues,
particularly in answering the questions on control.) All interpretations seemed
to offer quite reasonable suggestions.
Although it is identified only as a “methodological issue” in the fifth paragraph of the discussion, a substantial and important question arises about the specific way in which the planning form affected performance. Because using the form apparently involved subjects in a number of processes, among them goal setting, the article leaves open difficult, perhaps unanswerable, questions about which of those activities actually affected performance. Perhaps more interpretation could have been provided on this point.
Only one clear implication was offered in the discussion section’s last para-
graph. More detailed treatment of that one reasonable implication would have
improved the article.
Despite the potential for improvement, the discussion section seemed ade-
quate for the study’s results.
Overall
Evaluation of the study has found the most important weaknesses in two areas:
the accuracy and completeness of the problem statement (based on the con-
fusion between planning and using the planning form), and methods of con-
trolling for participant or subject bias, which might have affected certainty.
Changes in statistical methods might have strengthened the study (although
many readers value simplicity). The reader must consider the serious possibil-
ity that the results may have been unduly affected by subject bias.
On the positive side, the study used a manipulation-based independent
variable and a dynamic dependent variable, both contributing to its validity.
It also included a moderator variable, offered hypotheses, and obtained strong
findings. Unfortunately, stronger steps might have more effectively controlled
for potential subject bias.
■ Summary
12. After determining what findings the study produced, an evaluating reader
compares them to the problem statement, checks for needed tables and
figures, and assesses their significance and importance.
13. Finally, the discussion section of the report must be evaluated for reason-
able conclusions, interpretations, and implications.
14. A study on planning provided a sample for evaluation. Consideration of
this study revealed principal weaknesses related to the accuracy and com-
pleteness of the problem statement and the effectiveness of controls for
potential subject bias affecting certainty.
■ Recommended References
Hittleman, D. R., & Simon, A. J. (1997). Interpreting educational research (2nd ed.).
Columbus, OH: Merrill.
Katzer, J., Cook, J. H., & Crouch, W. W. (1991). Evaluating information: A guide for
users of social science research (3rd ed.). New York, NY: McGraw-Hill.
PART
APPENDIXES
=
= APPENDIX A
Tables
TABLE I Random Numbers
22 17 68 65 84 68 95 23 92 35 87 02 22 57 51 61 09 43 95 06 58 24 82 03 47
19 36 37 59 46 13 79 93 37 55 39 77 32 77 09 85 52 05 30 62 47 83 51 62 74
16 77 33 02 77 09 61 87 25 31 28 06 24 25 93 16 71 13 59 78 23 05 47 47 25
78 43 76 71 61 20 44 90 32 64 97 67 63 99 61 46 38 03 93 22 69 81 21 99 21
03 28 28 26 08 73 37 32 04 05 69 30 16 09 05 88 69 58 28 99 35 07 44 75 47
93 22 53 64 39 07 10 63 76 35 87 03 04 79 88 08 13 13 85 51 55 34 57 72 69
78 76 58 54 74 93 38 70 96 92 53 06 79 79 45 82 63 18 27 44 69 66 92 19 09
23 68 35 26 00 99 53 93 61 28 53 70 05 48 34 56 65 05 61 86 90 92 10 70 80
15 39 25 70 99 93 86 52 77 65 15 33 59 05 28 22 87 26 07 47 86 96 98 29 06
58 71 96 30 34 18 46 33 34 37 85 13 99 24 44 49 18 09 79 49 74 16 32 23 03
57 35 27 33 72 24 53 63 94 09 41 10 76 47 91 44 04 95 49 66 39 60 04 59 81
48 50 86 54 48 22 06 34 73 52 83 21 15 65 20 33 29 94 71 11 15 91 29 12 03
61 96 48 95 03 07 06 34 33 66 98 56 10 56 79 77 21 30 27 12 90 49 22 23 62
36 93 89 41 26 29 70 83 63 51 99 74 20 52 36 87 09 41 15 09 98 60 16 03 03
18 87 00 43 31 57 90 12 02 07 23 47 37 17 31 54 08 01 88 63 39 41 88 92 10
88 56 53 27 59 33 35 72 67 47 77 34 55 45 70 08 18 27 38 90 16 95 86 70 75
09 72 95 84 29 49 41 31 06 70 42 38 06 45 18 64 84 73 31 65 52 53 37 97 15
12 96 88 17 31 65 19 69 02 83 60 75 86 90 68 24 64 19 35 51 56 61 87 39 12
85 94 57 24 16 92 09 84 38 76 22 00 27 69 85 29 81 94 78 70 21 94 47 90 12
38 64 43 59 98 98 77 87 68 07 91 51 67 62 44 40 98 05 93 78 23 32 65 41 18
53 44 09 43 72 00 41 86 79 79 68 47 22 00 20 35 55 31 51 51 00 83 63 22 55
40 76 66 26 84 57 99 99 90 37 36 63 32 08 58 37 40 13 68 97 87 64 81 07 83
02 17 79 18 05 12 59 52 57 02 23 07 90 47 03 28 14 11 30 79 20 69 22 40 98
95 17 82 06 53 31 51 10 96 46 93 06 88 07 77 56 11 50 81 69 40 23 72 51 39
35 76 22 42 93 96 11 83 44 80 34 68 35 48 77 33 42 40 90 60 73 96 53 97 86
26 29 13 56 41 85 47 04 66 08 34 72 57 59 13 82 43 80 46 15 38 26 61 70 04
77 80 20 75 82 73 82 32 99 90 63 95 73 76 63 89 73 44 99 05 48 67 26 43 18
46 40 66 44 52 91 36 74 43 53 30 82 13 54 00 78 45 63 98 35 55 03 36 67 68
37 56 08 18 09 77 53 84 46 47 31 91 18 95 58 24 16 74 11 53 44 10 13 85 57
61 65 61 68 66 37 27 47 39 19 84 83 70 07 48 53 21 40 06 71 95 06 79 88 54
93 43 69 64 07 34 18 04 52 35 56 27 09 24 86 61 85 53 83 45 19 90 70 99 00
21 96 60 12 99 11 20 99 45 18 48 13 93 55 34 18 37 79 49 90 65 97 38 20 46
95 20 47 97 97 27 37 83 28 71 00 06 41 41 74 45 89 09 39 84 51 67 11 52 49
97 86 21 78 73 10 65 81 93 59 58 76 17 14 97 04 76 62 16 17 17 95 70 45 80
69 92 06 34 13 59 71 74 17 32 27 55 10 34 19 23 71 82 13 74 63 52 52 01 41
04 31 17 21 56 33 73 99 19 87 26 72 39 27 67 53 77 57 68 93 60 61 97 33 61
61 06 98 03 91 87 14 77 43 96 43 00 65 98 50 45 60 33 01 07 98 99 46 50 47
85 93 85 86 88 72 87 08 62 40 16 06 10 89 20 23 21 34 74 97 76 38 03 29 63
21 74 32 47 45 73 96 07 94 52 09 65 90 77 47 25 76 16 19 33 53 05 70 53 30
15 69 53 83 80 79 96 23 53 10 65 39 07 16 29 45 33 02 43 70 03 87 40 41 45
02 89 08 04 49 20 21 14 68 86 87 63 93 95 17 11 29 01 95 80 35 14 97 35 33
87 18 15 89 79 85 43 01 72 73 08 61 74 51 69 89 74 39 82 15 94 51 33 41 67
98 83 71 94 22 59 97 50 99 52 08 52 85 08 40 87 80 61 65 31 91 51 80 33 44
10 08 58 21 66 72 68 49 29 31 89 85 84 46 06 59 73 19 85 23 65 09 29 75 63
47 90 56 10 08 88 02 84 27 83 42 29 72 23 19 66 56 45 65 79 20 71 53 20 25
22 85 61 68 90 49 64 93 85 44 16 40 12 89 88 50 14 49 81 06 01 82 77 45 12
67 80 43 79 33 12 83 11 41 16 25 58 19 68 70 77 02 54 00 53 53 43 37 15 26
27 62 50 96 72 79 44 61 40 15 14 53 40 65 39 27 31 58 50 28 11 39 03 34 25
33 78 80 87 15 38 30 06 38 31 14 47 47 07 26 54 96 87 53 32 40 36 40 96 76
13 13 92 66 99 47 24 49 57 74 32 25 43 62 17 10 97 11 69 84 99 63 22 32 98
10 27 53 96 23 71 50 54 36 23 54 31 04 82 98 04 14 12 15 09 26 78 25 47 47
28 41 50 61 88 64 85 27 20 18 83 36 36 05 56 39 71 65 09 62 94 76 62 11 89
34 21 42 57 02 59 19 18 97 48 80 30 03 30 98 05 24 67 70 07 84 97 50 87 46
61 81 77 23 23 82 82 11 54 08 53 28 70 58 96 44 07 39 55 43 42 34 43 39 28
61 15 18 13 54 16 86 20 26 88 90 74 80 55 09 14 53 90 51 17 52 01 63 01 59
91 76 21 64 64 44 91 13 32 97 75 31 62 66 54 84 80 32 75 77 56 08 25 70 29
00 97 79 08 06 37 30 28 59 85 53 56 68 53 40 01 74 39 59 73 30 19 99 85 48
36 46 18 34 94 75 20 80 27 77 78 91 69 16 00 08 43 18 73 68 67 69 61 34 25
88 98 99 60 50 65 95 79 42 94 93 62 40 89 96 43 56 47 71 66 46 76 29 67 02
04 37 59 87 21 05 02 03 24 17 47 97 81 56 51 92 34 86 01 82 55 51 33 12 91
63 62 06 34 41 94 21 78 55 09 72 76 45 16 94 29 95 81 83 83 79 88 01 97 30
78 47 23 53 90 34 41 92 45 71 09 23 70 70 07 12 38 92 79 43 14 85 11 47 23
87 68 62 15 43 53 14 36 59 25 54 47 33 70 15 59 24 48 40 35 50 03 42 99 36
47 60 92 10 77 88 59 53 11 52 66 25 69 07 04 48 68 64 71 06 61 65 70 22 12
56 88 87 59 41 65 28 04 67 53 95 79 88 37 31 50 41 06 94 76 81 83 17 16 33
02 57 45 86 67 73 43 07 34 48 44 26 87 93 29 77 09 61 67 84 06 69 44 77 75
31 54 14 13 17 48 62 11 90 60 68 12 93 64 28 46 24 79 16 76 14 60 25 51 01
28 50 16 43 36 28 97 85 58 99 67 22 52 76 23 24 70 36 54 54 59 28 61 71 96
63 29 62 66 50 02 63 45 52 38 67 63 47 54 75 83 24 78 43 20 92 63 13 47 48
45 65 58 26 51 76 96 59 38 72 86 57 45 71 46 44 67 76 14 55 44 88 01 62 12
39 65 36 63 70 77 45 85 50 51 74 13 39 35 22 30 53 36 02 95 49 34 88 73 61
73 71 98 16 04 29 18 94 51 23 76 51 94 84 86 79 93 96 38 63 08 58 25 58 94
72 20 56 20 11 72 65 71 08 86 79 57 95 13 91 97 48 72 66 48 09 71 17 24 89
75 17 26 99 76 89 37 20 70 01 77 31 61 95 46 26 97 05 73 51 53 33 18 72 87
37 48 60 82 29 81 30 15 39 14 48 38 75 93 29 06 87 37 78 48 45 56 00 84 47
68 08 02 80 72 83 71 46 30 49 89 17 95 88 29 02 39 56 03 46 97 74 06 56 17
14 23 98 61 67 70 52 85 01 50 01 84 02 78 43 10 62 98 19 41 18 83 99 47 99
49 08 96 21 44 25 27 99 41 28 07 41 08 34 66 19 42 74 39 91 41 96 53 78 72
78 37 06 08 43 63 61 62 42 29 39 68 95 10 96 09 24 23 00 62 56 12 80 73 16
37 21 34 17 68 68 96 83 23 56 32 84 60 15 31 44 73 67 34 77 91 15 79 74 58
14 29 09 34 04 87 83 07 55 07 76 58 30 83 64 87 29 25 58 84 86 50 60 00 25
58 43 28 06 36 49 52 83 51 14 47 56 91 29 34 05 87 31 06 95 12 45 57 09 09
10 43 67 29 70 80 62 80 03 42 10 80 21 38 84 90 56 35 03 09 43 12 74 49 14
44 38 88 39 54 86 97 37 44 22 00 95 01 31 76 17 16 29 56 63 38 78 94 49 81
90 69 59 19 51 85 39 52 85 13 07 28 37 07 61 11 16 36 27 03 78 86 72 04 95
41 47 10 25 62 97 05 31 03 61 20 26 36 31 62 68 69 86 95 44 84 95 48 46 45
91 94 14 63 19 75 89 11 47 11 31 56 34 19 09 79 57 92 36 59 14 93 87 81 40
80 06 54 18 66 09 18 94 06 19 98 40 07 17 81 22 45 44 84 11 24 62 20 42 31
67 72 77 63 48 84 08 31 55 58 24 33 45 77 58 80 45 67 93 82 75 70 16 08 24
59 40 24 13 27 79 26 88 86 30 01 31 60 10 39 53 58 47 70 93 85 81 56 39 38
05 90 35 89 95 01 61 16 96 94 50 78 13 69 36 37 68 53 37 31 71 26 35 03 71
44 43 80 69 98 46 68 05 14 82 90 78 50 05 62 77 79 13 57 44 59 60 10 39 66
61 81 31 96 82 00 57 25 60 59 46 72 60 18 77 55 66 12 62 11 08 99 55 64 57
42 88 07 10 05 24 98 65 63 21 47 21 61 88 32 27 80 30 21 60 10 92 35 36 12
77 94 30 05 39 28 10 99 00 27 12 73 73 99 12 49 99 57 94 82 96 88 57 17 91
78 83 19 76 16 94 11 68 84 26 23 54 20 86 85 23 86 66 99 07 36 37 34 92 09
87 76 59 61 81 43 63 64 61 61 65 76 36 95 90 18 48 27 45 68 27 23 65 30 72
91 43 05 96 47 55 78 99 95 24 37 55 85 78 78 01 48 41 19 10 35 19 54 07 73
84 97 77 72 73 09 62 06 65 72 87 12 49 03 60 41 15 20 76 27 50 47 02 29 16
87 41 60 76 83 44 88 96 07 80 83 05 83 38 96 73 70 66 81 90 30 56 10 48 59
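For most modern purposes a printed table such as Table I can be replaced by a pseudorandom number generator. A minimal Python sketch (the seed, the block size, and the group sizes are arbitrary choices) draws a row of two-digit numbers and then randomly assigns twenty hypothetical subjects to two groups:

import random

random.seed(1)   # any seed will do; fixing it makes the draw reproducible

# A row of 25 two-digit random numbers, comparable to one line of Table I.
row = [random.randint(0, 99) for _ in range(25)]
print(" ".join(f"{n:02d}" for n in row))

# Random assignment of 20 hypothetical subject IDs to two equal groups.
ids = list(range(1, 21))
random.shuffle(ids)
treatment, control = ids[:10], ids[10:]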
TABLE IV Critical values of F at the .05 and .01 levels, for n1 degrees of freedom (greater mean square, columns 1 through ∞) by n2 degrees of freedom (lesser mean square, rows 27 through ∞). [Values not reproduced here.]
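Critical values such as those in Table IV can also be obtained directly from statistical software rather than read from a printed table. A minimal sketch using scipy's F distribution (the degrees of freedom shown are arbitrary examples):

from scipy.stats import f

dfn, dfd = 3, 126   # df for the greater and lesser mean squares (arbitrary example)
for alpha in (0.05, 0.01):
    critical = f.ppf(1 - alpha, dfn, dfd)
    print(f"F({dfn}, {dfd}) critical value at p = {alpha}: {critical:.2f}")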
TABLE V Critical Values of U
n2 \ n1 9 10 11 12 13 14 15 16 17 18 19 20
2 0 0 0 1 1 1 1 1 2 2 2 2
3 2 3 3 4 4 5 6 6 6 7 7 8
4 4 5 6 7 8 9 10 11 11 12 13 13
5 7 8 9 11 12 13 14 15 17 18 19 20
6 10 11 13 14 16 17 19 21 22 24 25 27
7 12 14 16 18 20 22 24 26 28 30 32 34
8 15 17 19 22 24 26 29 31 34 36 38 41
9 17 20 23 26 28 31 34 37 39 42 45 48
10 20 23 26 29 33 36 39 42 45 48 52 55
11 23 26 30 33 37 40 44 47 51 55 58 62
12 26 29 33 37 41 45 49 53 57 61 65 69
13 28 33 37 41 45 50 54 59 63 67 72 76
14 31 36 40 45 50 55 59 64 67 74 78 83
15 34 39 44 49 54 59 64 70 75 80 85 90
16 37 42 47 53 59 64 70 75 81 86 92 98
17 39 45 51 57 63 67 75 81 87 93 99 105
18 42 48 55 61 67 74 80 86 93 99 106 112
19 45 52 58 65 72 78 85 92 99 106 113 119
20 48 55 62 69 76 83 90 98 105 112 119 127
Source: Adapted and abridged from Tables 1, 3, 5, and 7 of Auble (1953). For additional Mann-
Whitney U tables for values corresponding to other ns and other α (p) levels, see Siegel (1956).
Group 1 2
N=
ΣX =
ΣX² =
X̄ =
r=
2. Calculation of t-value.
Steps
1. = -------------------------
2. = -------------------------
6. t = ------------------- df = N1 + N2 – 2 = -------------
* If the t-value in Step 6 exceeds the table value at a specific p level, then the null hypothesis (i.e., that the means are equal) can be rejected at that p level.
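The same test is available in standard statistical software. A minimal Python sketch, using invented scores for two independent groups:

import numpy as np
from scipy import stats

group1 = np.array([12, 15, 11, 14, 13, 16, 12, 15])   # hypothetical scores
group2 = np.array([10, 11, 9, 12, 10, 13, 11, 10])

t, p = stats.ttest_ind(group1, group2)   # pooled-variance t test for independent groups
df = len(group1) + len(group2) - 2       # N1 + N2 - 2, as in Step 6
print(f"t = {t:.2f}, df = {df}, p = {p:.3f}")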
r=
*If r obtained in Step 12 exceeds the r given in Table III, Appendix A, for df (Step 13) at a specific p level, then the null hypothesis that the variables are unrelated may be rejected at that p level.
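A corresponding sketch for the Pearson product-moment correlation, again with invented paired scores:

from scipy import stats

x = [2, 4, 5, 7, 8, 10]   # hypothetical scores on the first variable
y = [1, 3, 6, 6, 9, 11]   # hypothetical scores on the second variable

r, p = stats.pearsonr(x, y)
df = len(x) - 2           # df for testing r, as used with Table III
print(f"r = {r:.2f}, df = {df}, p = {p:.3f}")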
Columns p1, p2, … pi; rows q1, … qj. For each cell, enter:
n =
ΣX =
(ΣX)² =
ΣX² =
X̄ =
SS* =
*SS = ΣX² – (ΣX)²/n (This and the above terms should be calculated for each cell.)
A1, 2, i = sum of means in columns 1, 2, i, respectively.
B1, i = sum of means in rows 1, i, respectively
G = sum of A’s = sum of B’s = __________
p = number of columns =_______________
q = number of rows = _________________
Steps
1. Add together all the SS _____________ = SSw
2. Add together all the 1/n values (one for each cell) = ______________
3. pq/Step 2 = ______________ = ñ
4. G²/pq = ___________________
5. Square each Ai and add the squares together = _____________ = ΣA²
6. Step 5/q = _______________
7. Square each Bj and add the squares together = _____________ = ΣB²
8. Step 7/p = _____________
9. Square every X̄ and add the squares together = ____________ = ΣX̄²
MSB = SSB/dfB = _______________
MSAB = SSAB/dfAB = ______________
MSW = SSW/dfW = ___________
FA = MSA/MSW = ___________
FB = MSB/MSW = ___________
FAB = MSAB/MSW = ___________
* If an obtained F value exceeds the value given in Table IV, Appendix A (for the appropriate df's) at a specific p level, then the null hypothesis that the variables are not related can be rejected at that p level.
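For a balanced design (equal n in every cell), the sums of squares and F ratios outlined in this worksheet can be computed as in the Python sketch below; the cell scores are invented, and the worksheet's unweighted-means adjustment for unequal cell sizes is not shown:

import numpy as np

# cells[i][j] holds the invented scores for row (B) level i and column (A) level j.
cells = [
    [[5, 7, 6, 8], [9, 11, 10, 12]],
    [[4, 6, 5, 5], [6, 7, 8, 7]],
]
q, p = len(cells), len(cells[0])     # q rows, p columns
n = len(cells[0][0])                 # scores per cell (balanced design assumed)
data = np.array(cells, dtype=float)  # shape (q, p, n)

grand = data.mean()
cell_means = data.mean(axis=2)
row_means = data.mean(axis=(1, 2))   # factor B means
col_means = data.mean(axis=(0, 2))   # factor A means

ss_total = ((data - grand) ** 2).sum()
ss_a = q * n * ((col_means - grand) ** 2).sum()
ss_b = p * n * ((row_means - grand) ** 2).sum()
ss_cells = n * ((cell_means - grand) ** 2).sum()
ss_ab = ss_cells - ss_a - ss_b
ss_w = ss_total - ss_cells

df_a, df_b, df_ab, df_w = p - 1, q - 1, (p - 1) * (q - 1), p * q * (n - 1)
ms_a, ms_b, ms_ab, ms_w = ss_a / df_a, ss_b / df_b, ss_ab / df_ab, ss_w / df_w
print(f"F_A = {ms_a / ms_w:.2f}, F_B = {ms_b / ms_w:.2f}, F_AxB = {ms_ab / ms_w:.2f}")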
If two or more scores are tied, assign each the same rank—that being the aver-
age of the ranks for the tied scores.
Σ R1 =_____________ Σ R2 = ______________
n1 = _______________________ n2 = _____________
U = n1n2 + n1(n1 + 1)/2 – R1 = ________________
p = __________________†
U = n1n2 + n2(n2 + 1)/2 – R2 = _______________
†Rule: Use as U whichever of the two computed U values is smaller. Look up this value in the table
of critical values of U (Table V, Appendix A) to determine significance. If the smaller obtained
U value is smaller than the table value at a given p level, then the difference is significant at that
p level.
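The same steps translate directly into code. In the Python sketch below the scores are invented; scipy's mannwhitneyu is shown only as a check on the hand computation:

from scipy import stats

group1 = [3, 4, 2, 6, 2, 5]     # hypothetical scores
group2 = [9, 7, 5, 10, 6, 8]

n1, n2 = len(group1), len(group2)
ranks = stats.rankdata(group1 + group2)   # tied scores receive the average rank
r1 = ranks[:n1].sum()
u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1
u2 = n1 * n2 - u1
u = min(u1, u2)                           # use the smaller U, as the rule states
print(f"U = {u}")

# scipy reports the U for group1 and a two-sided p value;
# min(u_scipy, n1 * n2 - u_scipy) equals the hand-computed U above.
u_scipy, p = stats.mannwhitneyu(group1, group2, alternative="two-sided")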
rs = 1 – 6Σd²/(N³ – N)
1. Σd² =_________
2. 6 × Step 1 = _____________
3. N = number of subjects or objects = ___________
4. N³ – N =____________
5. Step 2 ÷ Step 4 = _____________
6. rs = 1 – Step 5 = ______________
7. p (from Table VI, Appendix A) = __________†
*This technique can be used for any number of subjects or objects. For this illustration, N = 12.
†If rs exceeds the table value at a given p level, then rs is significant at that p level.
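The rank-difference correlation is also available in standard software. In the sketch below the two sets of ranks are invented; the formula value agrees with scipy's when there are no tied ranks:

import numpy as np
from scipy import stats

x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]   # hypothetical ranks on one measure
y = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9, 12, 11]   # hypothetical ranks on a second measure

d = np.subtract(x, y)
n = len(x)
rs_formula = 1 - 6 * (d ** 2).sum() / (n ** 3 - n)   # the formula above
rs_scipy, p = stats.spearmanr(x, y)
print(f"rs = {rs_formula:.3f}, p = {p:.3f}")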
Steps
1. (A+B) (C+D) (A+C) (B+D) = _________
2. A × D= _________
3. B × C = _________
4. Step 2 – Step 3 = __________
5. Step 4 (ignoring sign) – N/2 = ___________
6. (Step 5)² =_____________
7. N × Step 6 = _________
8. Step 7 ÷ Step 1 = χ2 = ____________
df = (number of rows – 1) (number of columns – 1) = (2 – 1) (2 – 1) = 1
p (from Table VII, Appendix A) = ____________
*If the obtained χ2 value exceeds the value given in Table VII, Appendix A, at a given p level, then
the obtained χ2 value can be considered significant at that p level.
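The worksheet's steps translate directly into code as well. The cell frequencies below are invented; the N/2 term in Step 5 is the correction for continuity:

# 2 x 2 frequency table: row 1 holds cells A and B, row 2 holds cells C and D.
A, B, C, D = 18, 12, 9, 21      # hypothetical cell frequencies
N = A + B + C + D

step1 = (A + B) * (C + D) * (A + C) * (B + D)
step4 = A * D - B * C
step5 = abs(step4) - N / 2      # correction for continuity (the N/2 term in Step 5)
chi_square = N * step5 ** 2 / step1
print(f"chi-square = {chi_square:.2f}, df = 1")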
References
Friedman, G. H., Lehrer, B. E., & Stevens, J. P. (1983). The effectiveness of self-directed
and lecture/discussion stress management approaches and the locus of control of
teachers. American Educational Research Journal, 20, 563–580.
Fuchs, D., Fuchs, L. S., Power, M. H., & Dailey, A. M. (1985). Bias in the assessment of
handicapped children. American Educational Research Journal, 22, 185–198.
Gagné, R. M., & Medsker, K. L. (1995). Conditions of learning (4th ed.). New York,
NY: Holt, Rinehart and Winston.
Ghatala, E. S., Levin, J. R., Pressley, M., & Lodico, M. G. (1985). Training cognitive
strategy-monitoring in children. American Educational Research Journal, 22,
199–215.
Glaser, B. (1978). Theoretical sensitivity: Advances in the methodology of grounded
theory. Mill Valley, CA: Sociology Press.
Glass, G. V. (1977). Integrating findings: The meta-analysis of research. Review of
Research in Education, 5, 351–379.
Glass, G. V., McGaw, B., & Smith, M. L. (1981). Meta-analysis in social research. Bev-
erly Hills, CA: Sage.
Goetzfried, L., & Hannafin, M. J. (1985). The effect of the locus on CAI control strate-
gies on the learning of mathematics rules. American Educational Research Journal,
22, 273–278.
Guba, E. G., & Lincoln, Y. S. (1981). Effective evaluation. San Francisco, CA: Jossey-Bass.
Harper, B. (2006). Epistemology, self-regulation and challenge. Academic Exchange
Quarterly, 10, 121–125.
Helm, C. M. (1989). Effect of computer-assisted telecommunications on school atten-
dance. Journal of Educational Research, 82, 362–365.
Henry, J. (1960). A cross-cultural outline of education. Current Anthropology, 1,
267–304.
Johnson, D. W., & Johnson, R. (1985). Classroom conflict: Controversy versus debate
in learning groups. American Educational Research Journal, 22, 237–256.
King, A. (1990). Enhancing peer interaction and learning in the classroom through
reciprocal questioning. American Educational Research Journal, 27, 664–687.
Klein, J. D., & Keller, J. M. (1990). Influence of student ability, locus of control, and
type of instructional control on performance and confidence. Journal of Educa-
tional Research, 83, 140–145.
Krendl, K. A., & Broihier, M. (1992). Student responses to computers: A longitudinal
study. Journal of Educational Computing Research, 8, 215–227.
Leonard, W. H., & Lowery, L. E. (1984). The effects of question types in textual reading
upon retention of biology concepts. Journal of Research in Science Teaching, 21,
377–384.
Lepper, M. R., Keavney, M., & Drake, M. (1996). Intrinsic motivation and extrinsic
rewards: A commentary on Cameron and Pierce’s meta-analysis. Review of Edu-
cational Research, 66, 5–32.
McGarity, J. R., & Butts, D. P. (1984). The relationship among teacher classroom man-
agement behavior, student engagement, and student achievement of middle and
high school science students of varying aptitude. Journal of Research in Science
Teaching, 21, 55–61.
McKinney, C. W., et al. (1983). Some effects of teacher enthusiasm on student achieve-
ment in fourth-grade social studies. Journal of Educational Research, 76, 249–253.
Mahn, C. S., & Greenwood, G. E. (1990). Cognitive behavior modification: Use of self-
instruction strategies by first graders on academic tasks. Journal of Educational
Research, 83, 158–161.
Makuch, J. R., Robillard, P. D., & Yoder, E. R. (1992). Effects of individual versus paired/
cooperative computer-assisted instruction on the effectiveness and efficiency of an
in-service training lesson. Journal of Educational Technology Systems, 20, 199–208.
Mark, J. H., & Anderson, B. D. (1985). Teacher survival rates in St. Louis, 1969–1982.
American Educational Research Journal, 22, 413–421.
Marsh, H. W., Parker, J., & Barnes, J. (1985). Multidimensional adolescent self-con-
cepts: Their relationship to age, sex, and academic measures. American Educational
Research Journal, 22, 422–444.
O’Connor, J. F. (1995). The differential effectiveness of coding, elaborating, and outlin-
ing for learning from text. Unpublished doctoral dissertation, Florida State Uni-
versity, Tallahassee.
Olds, E. G. (1938). Distributions of sums of squares of rank differences for small numbers of individuals. Annals of Mathematical Statistics, 9, 133–148.
Olds, E. G. (1949). The 5% significance levels for sums of squares of rank differences
and correction. Annals of Mathematical Statistics, 20, 117–118.
Osgood, C. E., Suci, G. J., & Tannenbaum, P. H. (1957). The measurement of meaning.
Urbana: University of Illinois Press.
Patton, M. Q. (1990). Qualitative evaluation and research methods. Newbury Park,
CA: Sage.
Peterson, P. L., & Fennema, E. (1985). Effective teaching, student engagement in class-
room activities, and sex-related differences in learning mathematics. American
Educational Research Journal, 22, 309–335.
Pintrich, P. R., & De Groot, E. V. (1990). Motivational and self-regulated learning com-
ponents of classroom academic performance. Journal of Educational Psychology,
82, 33–40.
Prater, D., & Padia, W. (1983). Effects of modes of discourse on writing performance in
grades four and six. Research in the Teaching of English, 17, 127–134.
Ranzijn, F. J. A. (1991). The number of video examples and the dispersion of examples
as instructional design variables in teaching concepts. Journal of Experimental
Education, 59, 320–330.
Raphael, T. E., & Pearson, P. D. (1985). Increasing students’ awareness of sources of
information for answering questions. American Educational Research Journal, 22,
217–235.
Rasinski, T. V. (1990). Effects of repeated reading and listening-while-reading on read-
ing fluency. Journal of Educational Research, 83, 147–150.
Rayman, J. R., Bernard, C. B., Holland, J. L., & Barnett, D. C. (1983). The effects of
a career course on undecided college students. Journal of Vocational Behavior, 23,
346–355.
Reiser, R. A., Tessmer, M. A., & Phelps, P. C. (1984). Adult-child interaction in chil-
dren’s learning from “Sesame Street.” Educational Communications and Technol-
ogy, 32, 217–223.
Roberge, J. J., & Flexner, B. K. (1984). Cognitive style, operativity and reading achieve-
ment. American Educational Research Journal, 21, 227–236.
Roe, A. (1966). Psychology of occupations. New York, NY: Wiley.
Friedman’s two-way analysis of variance, interaction, statistical, 31, 62, 73, 81, 130–
335 131, 136, 152–154, 161, 169, 173–175,
335–343, 351, 453, 476
gender, 67, 175, 295 interest areas, 17, 27, 42–43, 54–57
general hypothesis, 15–60, 85–92, interjudge agreement. See interrater
generality, 6, 8, 60, 130, 138, 140, 144, reliability
450, 452–453, 459, 473 internal validity, 5–8, 11, 19, 60, 112, 125–
graphs. See figures 130, 136–141, 161–164, 169, 171, 184,
195–196, 200, 230, 268, 375, 425, 450;
halo effect, 230 designs to control for, 158–175; factors
Hawthorne effect, 132, 137, 172–177, affecting (see certainty; history bias,
382; design to control for, 172–174 instrumentation bias; mortality bias;
history bias, 125–126, 137, 141, 151–153, selection bias; statistical regression
159, 161–163, 167–169, 195–198, 382 toward the mean; testing bias)
hypothesis, 3, 10, 15–16, 75, 77, 85–88, Internet, 55
100–102, 107, 114, 281, 302, 310, 322, interrater reliability, 230, 232, 425, 427
335–342, 366, 368, 370, 440, 448, 470, interval scale, 212, 222
476, 496, 498; alternatives, 88–90; intervening variable, 76–80, 209, 441, 444,
classroom, 100–101; directional, 447–448, 470
94, 101–102; evaluation, 366, 368; interview, 9–10, 16, 17, 18, 129, 150, 196,
examples, 116–117; formulation, 3, 10, 201, 211, 243–24, 245–247, 254–258,
15–16; general, 86–88; null (see null 260, 267, 272, 274, 277–285, 388, 390,
hypothesis); operational restatement, 393–410; child, 400–402; coding,
336; positive, 102–104; rationale, 277–285; construction of, 16, 254–255;
322; specific, 85–88; testing, 101–102, example, 393–410; question formats,
114–116 245–247, 255–258; response modes,
247
implementation check, 142–143. See also interviewers, 129, 201, 255–256, 276–282,
manipulation, success of 400, 427
independent variable: definition, 67–68; introduction, 316–317, 322–325, 427, 429,
in evaluation, 142–143; examples, 442, 469–470
68–71; identification, 15; in problem item analysis, 218–222, 225, 266
statement, 76–80; relationship to
dependent, 68–69; use in statistics, journals and books, 49–54, 58–59, 358–
260, 269, 297, 300, 308–311; writing 359, 425, 439
up, 329–331
index sources, 46–52, 58, 60 knowledge, 11–12, 33, 216–218
induction, 85–90 Kuder-Richardson (test) reliability, 208;
informed consent form, 12–13, 199 formula, 208
instructional materials, 24, 30–32,
328–329 learning: activity, 31; environment, 31;
instructional program, 4, 24, 29–35, materials, 32, 365
328–329, 373, 403 level, 62, 69–82, 90–92, 109, 113, 124,
instrumentation bias, 125, 129–130, 206, 134–136, 142, 152, 155–156, 170, 193,
230, 276, 282, 455; controlling for, 130 195, 211, 304–305, 310–352
intact group, 157, 165, 375, 474 Likert scale, 192, 222–226, 262, 264,
intact-group comparison, 176, 474 305, 403. See also rating scale; scaled
intelligence tests, 109. 209, 216–217 response mode
posttest only control group design, 152–154, 164
predetermined questions, 247
prediction, 116–117, 324–325
predictive (test) validity, 208–209
pre-experimental designs, 150–151; intact group comparison, 157, 165, 375, 474; one-group pretest-posttest design, 151, 159, 167, 169–170; one-shot case study, 150–151, 159
pretest-posttest group design, 151–154, 163–170
privacy, 13–14
probability level, 271, 401
problem, 3, 7–12, 15, 23–27, 29–38, 44–46, 85–92, 126–130, 195, 316–320, 350, 374, 393, 395, 427–428, 439–446, 456, 468–470; characteristics, 23–24; classroom research, 29–35; considerations in choosing, 36–38; context, 46, 316–317; evaluation, 412–413; hypotheses, 85–87; identification, 3, 15, 195; statement, 317–318
procedures, 11–12, 15, 126–129, 140, 150, 163–164, 205, 207, 267–282, 333–335, 387, 392, 400, 404
programmatic research, 36–37
proposal. See research proposal
protocol analysis, 412–413
psychometrics, 130
publishing (an article), 358–359
qualitative research, 17, 387–413; analyzing data, 408–413; characteristics of, 387–388; conducting, 387–413; data sources, 395–404; methodology, 393–395; problems, 390; themes, 389
quasi-experimental designs, 158–172; equivalent time-samples design, 160–163; nonequivalent control group design, 163–166; patched-up design, 169–170; separate sample pretest-posttest design, 167–169; single subject design, 170–172; systematically assigned control group design, 166–167; time-series design, 159–160
question format, 255–258; choice of, 255–256. See also direct-indirect questions; fact-opinion questions; predetermined questions; response-keyed questions; specific-nonspecific questions
questionnaires, 243–282; administering, 243–247, 271–272; choice of, 243–247; coding, 277–282; construction of, 254–266; examples, 249, 254, 261–263; pilot testing, 265–266; scoring, 277–282. See also cover letter; question format; response mode
random assignment, 127, 133–135, 144, 153–155, 164, 375, 451, 472
randomization. See random assignment
random sampling, 267–268
random selection. See random sampling
ranking response mode, 257–258
rating scale, 111, 205, 212, 228–233, 277–278, 403. See also Likert scale; scaled response mode
ratio scale, 212
reactive effects, 130–131, 172, 174–177; design to control for, 172, 174–177; of experimental arrangements, 131–132; of teacher expectancy, 176–177
recommendations, 17, 321, 346, 359, 402
references, 41, 46, 49–51, 58, 60–61, 217–218, 322, 331, 336, 349, 446, 469, 475
regression analysis, 72, 128, 151, 153, 167, 190, 301, 305–308, 346, 451
reliability. See test reliability
repeated measure, 170–172, 335
reports, 54, 59, 172, 315–359
research: applied, 3–4; basic, 3; characteristics of, 10–11; definition of, 3–4; ethics, 12–15; steps in, 15–17; survey, 9–10, 243–244
research proposal: introduction section, 316–326; method section, 326–336
research report: abstract, 350; discussion section, 340–349; of evaluation study, 365–366, 375, 378, 381, 384–385; figures (graphs), 353–358; introduction section, 316–326; method section, 326–336; references, 349; results section, 336–359; tables, 351