Research in Language Testing

John W. Oller, Jr.
University of New Mexico

Kyle Perkins
Southern Illinois University

Language Science
Language Teaching
Language Learning
Copyright © 1980 by Newbury House Publishers, Inc. All rights reserved. No part of this book
may be reproduced or transmitted in any form or by any means, electronic or mechanical, including
photocopying, recording, or by any information storage and retrieval system, without permission in
writing from the Publisher.
Mary Anne
and
Jani
Contents

Preface ix
An Overview 1
Part VI Native versus nonnative performance: What’s the difference?
References 253
Appendix 261
About the Authors 306
Index 311
Preface
This is a book reporting practical research into the nature of language proficiency.
It deals primarily with second or foreign language learners—some in classroom
settings, others in more natural surroundings. Secondarily it addresses the
question of the comparability of first and second language proficiency. Is learning
the mother tongue similar to the process of adding another language? The focus is
on the discourse processing skills exhibited by language users.
This book is a complementary sequel to Language in Education: Testing the
Tests (Oller and Perkins, 1978). Whereas that volume addressed the broader issue
of language proficiency in relation to educational testing in general for both
monolingual English speakers and bilingual children, this volume concentrates
more specifically on the composition of language proficiency especially for second
language learners. Several chapters reach beyond this central question to assess,
for instance, the relatedness of first and second language learning, and a whole
section is devoted to aptitude, attitude, and behavioral variables believed to play a
role in facilitating or inhibiting language acquisition.
There is probably no other area of educational endeavor about which there
are so many conflicting opinions and where, at the same time, it is possible to do
appropriate empirical research to obtain answers by refining some hunches and
ruling out others. It seems to us that the time has come for the theorists and
researchers alike to turn attention to the explanation of how people produce and
comprehend meanings in the ordinary contexts of human experience. We find it
encouraging that one of the most persistent prods in this direction is coming from
the investigation of educational tests which have for so long been aimed at dividing
things up into long lists of supposedly unrelated components.
We cordially invite the reader to challenge, test, refine, reformulate, and
apply the findings reported here.
Acknowledgments
Having grown out of a joint effort of the Center for English as a Second Language
and the Department of Linguistics at Southern Illinois University during academic
1976-1977, this volume is so much the result of multiple contributors that it
would be impossible to acknowledge all of those who deserve credit for the fact
that it has at last materialized as a physical entity. Some of the prime movers for its
completion are unfortunately not represented among the co-authors. Both
Richard Daesch and Charles Parish (the co-directors of the Center for English as a
Second Language) provided much of the inspiration for the work as well as a good
deal of the hard labor. They, along with a team of other workers, helped in
preparing tests, scoring them, and coding the data.
James E. Redden, Professor of Linguistics, helped to press much of the work to
completion indirectly by organizing (along with Kyle Perkins) the First International Conference on Frontiers in Language Proficiency and Dominance Testing
held at SIU in April 1977. Many of the contributions included here were either
presented or discussed at that conference, and several of them appeared in the
Proceedings edited by Dr. Redden. Student workers who helped in a variety of
capacities ranging from insight to keypunching included Damian Kodgis, Joseph
Repka, William Flick, Steve Spurling, Linda Thornberg, Sally Chai, and many
others. A special debt of gratitude is due Sheila Brutten and all the other members of the CESL faculty who generously gave class time and a good deal of leisure time to various phases of the project. Sheila Brutten and Dick Daesch graciously helped with much of the taping necessary for some of the oral testing.
Administrative assistant to the Department of Linguistics, Lillian Higgerson, and
CESL’s right hand, Cathy Merriman especially, helped in too many ways to be
counted. We thank them particularly along with all the CESL staff.
Finally, the completion of the work would not have been possible except for a
CESL research and teaching grant to the first editor and co-author during
academic 1976-1977 while he was on leave without pay from the University of
New Mexico. The enthusiastic involvement of students, staff, and faculty was, to
us, an encouragement and an inspiration in days when research in higher
education often seems to require the lavish expenditure of federal tax money. We
are glad to say that in this case the costs were sustained by normal university
outlays and by pooling and coordinating existing resources of faculty, staff, and
student time.
An Overview
The truly startling possibility that the previous research raises is that many of the bits and pieces of skill and knowledge posited as separate components of language proficiency, intelligence, reading ability, and other aptitudes may be so thoroughly rooted in a single intelligence base that they are indistinguishable from it. Further, there may be little or no profit at all in trying to make many of the distinctions common to tests and curricula in education. It is just possible that, for instance, vocabulary and syntax (as components of language proficiency) are essentially the same, psychologically and perhaps even neurologically.
Educational tests and curricula, of course, usually reflect a very different set
of assumptions. They treat the learning of arithmetic, literature, history, science,
etc., as distinct and in some cases even unrelated endeavors. They are apt to
assume (probably incorrectly) that language ability has little or nothing to do with
the acquisition of computational skills, the development of motoric abilities, and
so on. Moreover, even the language curricula intended for the teaching of native
language skills or foreign languages are commonly based on the assumption that
several distinct skills exist (e.g., listening, speaking, reading, and writing) and that
each of these can be divided up into multiple subcomponents (e.g., vocabulary,
syntax, phonology/graphology), and into receptive and productive repertoires,
and so forth. Sometimes it is even insisted that the separate skills and components
must be taught in separate classes, often by different instructors using unrelated
curricula. In many of our schools and colleges there are separate language courses
for conversation, pronunciation, reading, composition, and so on. In many
programs separate course sequences exist for oral skills as opposed to reading and
writing, or for productive skills in contrast to receptive skills, and so on. It now
seems likely that the assumptions on which such distinctions are based have been
misguided from the outset, founded as they were on untested analytical theories
that neglected fundamental properties of human intelligence and discourse
processing skills.
The present volume extends the research base offered in the earlier volume
primarily with reference to the nature of language proficiency in foreign and
second language learners. A second language population is defined as one that
learns the target language in a setting where that language is used for everyday
communication in the surrounding community. A foreign language population, on
the other hand, is one that studies, but usually fails to really learn the language, in a
classroom context exclusively. That is, in the case of foreign language instruction
the target language is not a commonly used medium of communication outside the
classroom. The prime question is: How many components and what sort of
components are present in the language proficiency of these groups? Subsidiary
questions relate to the efficiency of tests requiring listening, speaking, reading,
and writing performances; the validity and reliability of subjective judgments of all
the foregoing sorts of performances; the relative efficiency of various placement
procedures for foreign students; the effects of attitudes and beliefs on the ability of
subjects to fill in blanks in passages of prose dealing with controversial topics; and
the relative information yield of different scoring methods for essays and other
tasks. A brief section is devoted to a comparison of native and nonnative
performances on various tasks including taking dictation, judging the truth value
of propositions, and filling in blanks in prose (cloze procedure). Finally, a
substantial section is addressed to various issues of background, aptitude,
behavior, and attitude as correlates of language proficiency.
The ramifications of the studies included are too numerous to be explored fully in this introduction. However, it is possible to highlight some of the surprising results and perhaps to provide a sense of some remarkable trends toward common conclusions. These trends are especially noteworthy because in most cases the projects approached similar questions quite unintentionally from very different vantage points and with completely independent empirical methods. Nevertheless, in many cases similar results were obtained and similar if not identical conclusions were supported. Some of the findings and conclusions are regarded, therefore, as particularly secure.
For instance, evidence from many of the studies included sustains the
conclusion that a single unitary factor may underlie all (or nearly all) of the more
than sixty processing tasks investigated.
In Chap. 1, Oller and Hinofotis, for example, found that a single global factor of language proficiency accounted for no less than 65% of the total variance in several batteries of proficiency tests. Their study included measures aimed at listening comprehension, grammatical structure, reading vocabulary, reading comprehension, writing ability, oral fluency, accent, oral grammatical accuracy, oral vocabulary usage, and conversational comprehension, as well as ability to take dictation and fill in blanks in prose. In Chap. 2, Scholz, Hendricks, Spurling, Johnson, and Vandenburg have extended the question to a wide variety of other discourse processing tasks. In all, twenty-two English language proficiency tests roughly divisible into listening, speaking, reading, writing, and grammar tests were investigated. In spite of the considerable unreliability in some of the tasks because of a variety of uncontrollable factors, a single general factor accounted for over half the variance in all the tests. Furthermore, at least one of the tests in each of the five major categories (listening, speaking, etc.) shared 67% or more of its variance with a single global factor.
In Chap. 3, Flahive discusses an attempt to separate a reading ability factor from the general factor posited by Oller and Hinofotis and by Scholz et al. He, too, however, found substantial support for a single general factor. The correlations of various reading tests and language proficiency measures with Raven’s Progressive Matrices (commonly believed to be a nonverbal measure of intelligence) ranged from .61 to .84. The subject sample consisted of a group of relatively advanced ESL students at Southern Illinois University (Center for English as a Second Language). Similarly, in Chap. 4, Hisama showed that a single factor could account for no less than 82.4% of the total variance in four tests: syntax (English structure), listening comprehension, reading comprehension, and a fill-in-blank procedure. Her sample consisted of 136 CESL students at SIU.
In Chap. 5, Benson and Hjelt cite evidence from several empirical studies demonstrating the importance of listening comprehension to reading and other skills. Bacheller (Chap. 6) argues that because the goal of the language learner is
the tested subjects, but in all cases, the correlations with the TOEFL total score
were equal to or greater than the correlation between the reading tests per se. The
correlation between the TOEFL and the McGraw-Hill was a whopping .91 (N =
40).
Doerr’s study (Chap. 13) constitutes an interesting departure from the preceding studies in both method and theory. She wanted to know whether agreement or disagreement of test takers with the subject matter in prose tests would affect their ability to fill in blanks in those texts. She selected (as suggested by Dr. Charles Parish) three controversial topics—the desirability of the decriminalization of marijuana, the possibility of abolishing capital punishment, and the immorality of aborting unwanted children. She wondered if agreement with the opinions expressed by the author of a passage would result in higher scores while disagreement with the views of the author might result in lower scores. Previous research by Manis and Dawes (1961) had shown a significant effect of agreement/disagreement for native speakers. However, Doerr found no significant effect for nonnative speakers and showed a .90 correlation between the scores on texts with which subjects indicated agreement and the scores on texts with which they indicated disagreement. Her results suggest that the validity of the cloze procedure is relatively unaffected by the possible controversiality of selected topics. At least this appears to be so for nonnative speakers.
Part V includes four chapters examining approaches to the measurement of writing ability. Kaczmarek (Chap. 14) offers some redeeming evidence in favor of the use of essay tasks as measures of writing ability. Furthermore, her work demonstrates that both subjective ratings and objective scores of essays are substantially correlated with a wide variety of scores on other discourse processing tasks. Mullen’s second contribution (Chap. 15), on subjective evaluations of essays, demonstrates the essential unity of ratings of structure, organization, quantity, and vocabulary. Correlations among four scales aimed at the just-mentioned theoretical components of subjective judgments of compositions ranged from .67 to .84.
Flahive and Snow (Chap. 16) and Evola, Mamer, and Lentz (Chap. 17) show that discrete point approaches to scoring of essays are, on the whole, considerably less promising than holistic evaluations of communicative effect. Compare, for instance, the correlations found by Flahive and Snow (see their Table 16-6, column 4) between a complexity index and a holistic evaluation with the correlations reported by Bacheller (Table 6-1). Although the data in the two studies are not strictly comparable, the greater variability among subjects in Bacheller’s data (which includes the whole range of abilities represented at CESL) is compensated for in part by the greater homogeneity of the test constructs posited by Flahive and Snow (that is, all their measures were supposed to be measures of essay writing ability or some component of that ability, while in Bacheller’s study listening, speaking, reading, writing, and grammatical knowledge were all included). The comparison seems to suggest that holistic ratings are more informative than discrete point scoring methods. This conclusion is further substantiated by the
findings of Evola, Mamer, and Lentz. Examine, for instance, their Table 17-3, where none of the discrete point measures used accounted for more than about 10% of the variance in the essay ratings and the essay scores, while the latter two (see Kaczmarek, Chap. 14, Table 14-3) share no less than 59 and 64%, respectively, of their variances with the same common factor. Thus the holistic types of scores would seem to be about six times as informative concerning global language proficiency as the discrete point measures are. Further evidence in support of this same conclusion is offered incidentally by Johnson and Krug, Chap. 24. They found that a discrete point redundancy index was a moderate to weak predictor of scores on other measures of ESL proficiency. They found correlations ranging from .30 to .64 (see their Table 24-5).
Although the three chapters in Part VI all refer to substantially different methods and distinct subject populations, all are concerned with the important question of how similar or different first and second language acquisition are. Fishman (Chap. 18) boldly asserts that “we all make the same mistakes”; at least this appears to be true for dictation tasks attempted by native and nonnative students at Southern Illinois University. Two convincing proofs that natives and nonnatives share a number of aspects of discourse processing are offered by Fishman. She shows that the rank ordering of segments in a text is substantially similar for the two subject populations (rho = .68). Two Q-type factor analyses also support her conclusions.
Carrell (Chap. 19) takes a basically different tack. Her research questions are mainly concerned with the processing of two kinds of indirectly conveyed meanings. The first type is presuppositional meaning, where a given statement or assertion requires that some other proposition be taken for granted in order for the immediate assertion to be understood. For instance, if someone says, “Close the door,” unless the door is open, the statement does not make sense. We must presuppose that the door is open. A second type of indirectly conveyed meaning is what may be termed implicational. In this case, the suggested meaning is implied or entailed by the asserted meaning. For instance, if someone says “George had to leave town,” this implies that in fact he did leave. It seems, in a nontechnical sense, that the difference between the two types of meanings is dependent on the implicit time sequence of the events referred to or suggested in relation to the time of utterance.
In any case, Carrell hypothesized that both natives and nonnatives would have more trouble in identifying false presuppositions (in relation to pictured information) than false implications when the problem was to judge the truth or falsehood of given statements. Half of the statements actually required false presuppositions and half of them entailed false implications. The results sustained Carrell’s hypothesis and revealed a similar pattern for both natives and nonnatives, indicating a possible underlying similarity of processing strategies. This result accords well with Fishman’s findings.
In Chap. 20, Wilson takes yet another approach to a comparison of native and
nonnative performance on language tests. He asked whether it was possible or not
to deliberately create a cloze test that would be more difficult for ESL learners who were native speakers of Vietnamese than for learners with other language backgrounds. He employed 72 subjects from 12 different language backgrounds including 9 native speakers of English. On the basis of a contrastive analysis of English and Vietnamese structures, three types of cloze tests were constructed which were expected to become progressively more difficult only for the Vietnamese subjects (37 of the 72 subjects tested). One cloze test was constructed by placing blanks at points believed to be difficult for Vietnamese learning English. Another was deliberately loaded with structures believed difficult for Vietnamese on the basis of the contrastive analysis, but deletion points were selected randomly on an every-fifth-word basis. A third text was doubly biased by salting it with structures believed difficult for Vietnamese and by also selecting the most difficult points for inserting blanks in the text. Another regular cloze test over a passage judged similar in difficulty level to the other three was used as a control test.
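As an editorial aside, the following is a minimal sketch of standard fixed-ratio cloze construction (every nth word replaced by a numbered blank), the procedure underlying the every-fifth-word and control tests just described; the function, parameters, and sample passage are invented, not taken from Wilson's materials.

```python
# Editorial sketch: fixed-ratio cloze construction. Every nth word after
# a short intact lead-in is replaced by a numbered blank; the deleted
# words form the exact-word scoring key.

def make_cloze(text: str, n: int = 5, lead_in: int = 1) -> tuple[str, list[str]]:
    """Delete every nth word after `lead_in` intact words; return the
    mutilated text and the list of deleted words."""
    words = text.split()
    deletions: list[str] = []
    out: list[str] = []
    for i, w in enumerate(words):
        if i >= lead_in and (i - lead_in) % n == n - 1:
            deletions.append(w)
            out.append("(%d)_____" % len(deletions))
        else:
            out.append(w)
    return " ".join(out), deletions

passage = ("The quick brown fox jumps over the lazy dog while the "
           "farmer watches from the old wooden gate near the barn")
mutilated, key = make_cloze(passage, n=5)
print(mutilated)
print(key)   # ['over', 'the', 'old', 'barn'] -- the exact-word answer key
```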
Contrary to the initial prediction, all the biased tests were more difficult for all the subjects. Again, in this respect, natives and nonnatives performed similarly. Language background did not appear to be a significant factor in differentiating the relative difficulties of the four tests. In sum, it is apparently possible to make a test more or less difficult on the basis of a contrastive analysis, but it seems to be difficult to do so in a way that will deliberately discriminate only against a particular language group. Again, native and nonnative performances revealed fundamental similarities.
Part VII contains four papers that address different aspects of the problem of explaining the variability in language proficiency exhibited by foreign and second language learners. Clarke (Chap. 21) presents data that may be surprising to some of the advocates of aptitude testing as a basis for predicting success in foreign language learning. Her data, and the data she cites from the manual for the Modern Language Aptitude Test, suggest that it is a weak predictor at best of foreign language attainment. Clarke is mystified by the peculiar discrepancies in the MLAT as a predictor of success in German as compared against success in Japanese. In neither case are the obtained correlations consistently strong. For instance, for the Japanese learners correlations range from .23 to .74, and for the German learners from .07 to .48.
Although more research is clearly needed, it seems to us that the unreliability of the MLAT as a predictor of attainment may be due to a variety of factors. One of the important possibilities that we feel should be studied is that the MLAT may measure primarily performance on the sort of discrete point tasks frequently characteristic of discrete point foreign language teaching—a variety of teaching that fails so completely that even its best products ordinarily cannot engage in normal conversation or letter writing or even reading a newspaper in the target language with reasonable fluency and comprehension. Many students of foreign languages would be helpless if asked to make simple introductions in the language or follow a set of directions to find an address in a city where the target language is spoken.
Therefore, if the “aptitude” that the MLAT measures is the ability to deal
with the sorts of irrelevant skills that are characteristically attended to in many ineffective foreign language programs, it should be no great surprise that its correlations with such unreliable attainments should reveal substantial randomness. Perhaps it will be possible to devise more pragmatically oriented aptitude tests that will produce more reliable predictive relationships with actual language processing tasks.
One of the most surprising results reported in any of the papers included in this volume appears in Chap. 22 by Murakami. He was interested in a number of behavioral, demographic, and attitudinal variables which he thought might be causally related to the learning of English as a second language. His subject population consisted of 30 native speakers of Japanese studying in various academic areas at Southern Illinois University. The best overall predictor of language proficiency for both cloze scores and dictation scores was the student’s status at time of testing. The graduate students consistently did better than the undergraduates, who in turn did better than those enrolled in classes in English as a second language.
The surprise came with respect to two of the behavioral variables. Oddly, the number of close English-speaking friends subjects indicated having was more strongly correlated with cloze scores than with dictation (.39 and .28, respectively), and the number of pages subjects indicated writing in the target language each semester was more strongly correlated with the dictation score than with the cloze score (.64 and .32, respectively). This is surprising because we should expect English-speaking friends to make more of an impact on ability to perform a listening task (namely, the dictation) rather than a reading and writing task (namely, the cloze test), while the amount of writing in English each semester might be expected to relate more strongly to scores on the reading and writing task (cloze). In both cases, the findings were exactly reversed. We will leave it to the reader to explore Murakami’s proposed explanation, but perhaps we should note that his results are radically opposed to the separation of skills in language teaching.
Chapter 23, jointly authored by the co-editors of this volume and Murakami, investigates seven types of variables that might be expected to be moderate causal factors in the attainment of second language proficiency. Contrary to many popular predictions, the data do not fit the traditional assumptions concerning integrative and instrumental motives. These and other variables frequently believed to contribute to success in learning a second language are no more strongly related to measures of ESL proficiency than are the three extraneous questions regarding agreement/disagreement with controversial statements about marijuana, capital punishment, and abortion (included relative to the Doerr study, Chap. 13 above). Several explanations are considered. It is not possible to rule out the alternative that the so-called attitude questions, and in fact all the rest, might be indirect though unintentional measures of language proficiency to start with. This explanation would wipe out a great deal of the empirical support claimed by attitude theorists, in spite of the fact that it leaves the theoretical arguments for
attitudes and motivations unscathed. In fact, we are inclined to believe that some
of those theoretical arguments are correct, but the empirical foundation is shaky;
see Oller and Perkins (1978, Chap. 5).
The last chapter in the volume, Johnson and Krug (Chap. 24), considers the possibility of using a redundancy index as a measure of the degree of integrativeness of second language learners. It was reasoned that the tendency of second language learners to appropriately use functors such as the plural morpheme, the copula, the possessive inflection, and the third person singular habitual present marker might be taken as a measure of the learner’s desire to be like valued members of the target language community. While the redundancy index did in fact prove to be a better predictor of attained language proficiency than more traditional attitude questions, it can be argued that the redundancy index is a more or less direct measure of language proficiency. Therefore, what has to be demonstrated is that it is also a measure of an “integrative” motive. The latter relationship cannot be clearly inferred from the Johnson and Krug study. While a modest correlation of .32 (significant at p < .05) was found with one of the so-called “integrative” reasons for learning English, the redundancy index was also correlated at .23 (p < .10) with an instrumental motive (see Table 24-1). However, it failed to correlate significantly with any of the other integrative or instrumental motives.
We conclude our introduction by noting that all the research recorded here points unmistakably to the fact that language skills of all the traditionally posited sorts are fundamentally related. This appears to be true for natives and nonnatives alike. It seems that all human beings rather naturally attend to meaning in comprehending and producing discourse, and they are either incapable of or at least not good at attending to much of anything else.
Part I
Factors in Second Language Skill
measure different aspects of that mental ability. Factor analysis is one of the statistical procedures for studying the tendency of measures to produce meaningful variances, that is, variances which are either unique to a particular test or common to two or more tests. All factoring methods aim to simplify the data available in a correlation matrix—the main question is how many factors and what sorts are required to explain essentially all of the variance in a given matrix? By variance we mean the algebraic quantity used in statistics to characterize the dispersion of scores about a mean score for a certain population of subjects on a certain test or battery of tests. By correlation we mean a similar quantity used to characterize the degree of overlap in variance, or the tendency for scores on separate tests to covary proportionately about their respective means.
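Purely as an editorial illustration of these two definitions, the following minimal sketch computes both quantities for a pair of invented score vectors (Python with numpy; nothing here comes from the studies reported in this volume):

```python
# Editorial illustration of variance and correlation; scores are invented.
import numpy as np

listening = np.array([12.0, 15.0, 9.0, 18.0, 14.0])   # hypothetical test scores
cloze     = np.array([20.0, 24.0, 15.0, 30.0, 22.0])

# Variance: dispersion of scores about their mean.
print(listening.var(ddof=1))

# Correlation: the tendency of two sets of scores to covary
# proportionately about their respective means.
print(np.corrcoef(listening, cloze)[0, 1])
```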
The particular question investigated here is whether there is any unique variance associated with certain language processing tasks. For instance, is there any unique variance associated with tests that purport to measure vocabulary knowledge, as opposed to tests that purport to measure, say, syntactic knowledge? Or is there any unique variance associated with, say, listening comprehension as opposed to speaking ability, as judged by tests with those respective labels? In short, can language skill be partitioned into meaningful components which can be tested separately? Or, viewed the other way around, does variance in the performance of different language tasks support the componential theory of language competence?
Two mutually exclusive hypotheses have been offered. First there is what we will refer to as the divisible competence hypothesis: it has been argued by many linguists and pedagogues that language proficiency can be divided into separate components and separate skills or aspects of them. The components usually singled out include phonology, syntax, and lexicon, and the skills listening, speaking, reading, and writing. Some have argued further that it is necessary to distinguish between receptive versus productive repertoires (that is, listening/speaking versus reading/writing). It was even contended by Lado (1961) that the grammatical components posited for one skill or modality may be different from those functional in a different skill or modality. In a similar vein, Clark (1972) spoke of separate “grammars” for speaking and listening.
A second major hypothesis is that language proficiency may be functionally rather unitary. The components of language competence, whatever they may be, may function more or less similarly in any language-based task. If this were the case, high correlations would be expected between valid language tests of all sorts. Seemingly contradictory results, such as the fact that listening comprehension usually exceeds speaking proficiency in either first or second language speakers, would have to be explained on some basis other than the postulation of separate grammars or components of competence. For instance, one might appeal to the load on attention and short-term memory that is exerted by different language-processing tasks. It may require more mental energy to speak than to listen, or to write than to read, and so forth.
If the variance associated with language tests which are aimed at separate components or skills were substantially overlapping (that is, if the tests were
Table 1-1
Test                                     Loading on g factor*    h²
Listening Comprehension                        .87              .76
English Structure                              .82              .67
Vocabulary                                     .67              .45
Reading Ability                                .73              .53
Writing Ability                                .78              .61
Cloze (any appropriate word scoring)           .87              .76
Dictation                                      .76              .58
Eigenvalue                                    4.36

*Accounts for 100% of the total variance in the factor matrix (using an iterative procedure with communality estimates in the diagonal less than unity).
Table 1-2 Observed Correlations (above Diagonal) and Predicted Correlations, Products of Loadings on g (below Diagonal)

Test                           1     2     3     4     5     6     7
1 Listening Comprehension           .69   .56   .64   .68   .76   .69
2 English Structure           .71         .64   .57   .65   .68   .63
3 Vocabulary                  .58   .55         .49   .60   .51   .47
4 Reading Ability             .64   .60   .49         .58   .65   .53
5 Writing Ability             .68   .64   .52   .57         .67   .52
6 Cloze                       .76   .71   .58   .64   .68         .75
7 Dictation                   .66   .62   .51   .55   .59   .66
Table 1-3 Residual Matrix with g Loadings Partialed Out (Mean of Absolute Values = .026, Range = .08): Observed r minus Product of Loadings on g

Test                           2     3     4     5     6     7
1 Listening Comprehension   -.02  -.02   .00   .00   .00   .03
2 English Structure               -.08  -.03   .01  -.03   .01
3 Vocabulary                            .00    .07  -.01  -.04
4 Reading Ability                             .01    .01  -.02
5 Writing Ability                                   -.01  -.07
6 Cloze                                                   -.08
7 Dictation
correlations between the various test scores and the hypothetical variable which may be taken as an empirical estimate of a unitary language proficiency factor. It is in fact a linear combination of the original variables. The squared loadings indicate the proportion of variance overlap between the hypothetical factor defined by the principal components analysis and any particular test variable. For instance, the Listening Comprehension subtest of the TOEFL correlates at .87 with the hypothetical g factor, thus accounting for .76 (or 76%) of the variance in g; or alternatively, we may say that g accounts for 76% of the total variance in the Listening Comprehension section of the TOEFL.
The next step is to determine how well the general factor or unitary competence hypothesis accounts for the observed correlations between the various subtests used in the study. In other words, once the variance that can be attributed to g is partialed out, how much variance will remain? Will it be necessary to posit other factors in addition to g, or will the g factor suffice to explain essentially all of the nonerror variance?
Table 1-2 presents in the upper half correlations between test scores, and in the lower half the predicted correlations based on the respective products of loadings on g. Table 1-3 then presents the residuals—that is, what is left over after the products of loadings on g are subtracted (that is, partialed out). For instance, the product of loadings of Listening Comprehension and English Structure on the g factor is .71, while the actual correlation between the Listening Comprehension test and the English Structure test is .69. This leaves a residual of -.02. Proceeding in similar fashion for all variables, it soon becomes apparent from Table 1-3 that once the g factor is partialed out, practically no variance remains to be explained.
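As an editorial aside, the partialing arithmetic just described can be reproduced in a few lines, using the loadings of Table 1-1 and the observed .69 correlation from Table 1-2 (Python with numpy assumed):

```python
# Editorial sketch of the partialing arithmetic described in the text.
import numpy as np

g = np.array([.87, .82, .67, .73, .78, .87, .76])  # Table 1-1 loadings
h2 = g ** 2                    # squared loadings: variance shared with g
predicted = np.outer(g, g)     # predicted r(i, j) = product of loadings on g

observed_r = .69               # Listening Comprehension x English Structure
residual = observed_r - predicted[0, 1]
print(round(predicted[0, 1], 2), round(residual, 2))   # 0.71, -0.02
```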
Allowing for even a small percentage of error variance attributable to the
unreliability and less than perfect validity of each of the various measures,
essentially no variance is left once g is removed. This is noteworthy for several
reasons. In spite of the fact that there are two tasks that require listening—namely,
the Dictation and the Listening Comprehension subsection of the TOEFL—no
separate listening factor emerges. Similarly, in spite of the fact that there are
several tests that require reading comprehension, vocabulary, and structure, no
unique factors are needed to account for the variance in those tests, and neither do
they produce any unique variances that can be associated with anything different
from what is measured by the Cloze test or the Dictation.
A second set of data comes from foreign students at Southern Illinois
University. No productive oral task was included in the immediately foregoing
study or in the earlier work with the UCLA ESL Placement Exam (Oller, 1976).
Hinofotis (1976), however, collected data from 106 subjects at SIU using the FSI
oral interview with its five subscales along with a cloze test, and the three subparts
of the Placement Examination used there by the Center for English as a Second
Language. (See also Hinofotis, Chap. 10, this volume.) Results parallel to the ones
given in Tables 1-1 through 1-3 above are presented in Tables 1-4 through 1-6 for
the latter group of subjects and for the respective set of tests.
Table 1-4
Tests                             Loading on g factor*    h²
Cloze                                   .81              .66
FSI Accent                              .72              .52
FSI Grammar                             .89              .79
FSI Vocabulary                          .87              .76
FSI Fluency                             .87              .76
FSI Comprehension                       .86              .74
CESL Listening Comprehension            .78              .61
CESL Structure                          .69              .48
CESL Reading                            .76              .58
Eigenvalue                             5.90
Table 1-5 Observed Correlations (above Diagonal) and Predicted Correlations, Products of Loadings on g (below Diagonal)

Test                              1     2     3     4     5     6     7     8     9
1 Cloze                                .51   .62   .55   .58   .58   .74   .69   .80
2 FSI Accent                     .58         .67   .65   .66   .68   .48   .55   .48
3 FSI Grammar                    .72   .64         .87   .85   .82   .64   .59   .53
4 FSI Vocabulary                 .70   .63   .77         .85   .84   .60   .48   .55
5 FSI Fluency                    .70   .63   .77   .76         .83   .63   .48   .51
6 FSI Comprehension              .70   .62   .77   .75   .75         .58   .49   .53
7 CESL Listening Comprehension   .70   .56   .69   .68   .68   .67         .61   .74
8 CESL Structure                 .56   .50   .61   .60   .60   .59   .54         .63
9 CESL Reading                   .62   .54   .68   .66   .66   .65   .59   .52
Table 1-6 Residual Matrix with g Loadings Partialed Out (Mean of Absolute Values = .091, Range = .17): Observed r minus Product of Loadings on g [table body not recoverable in this copy]

Table 1-7 Varimax Rotated Two-Factor Solution [table body not recoverable in this copy]
*Factors 1 and 2 account for 56 and 44%, respectively, of the total variance in the factor matrix.
The first of two factors in the principal components analysis accounts for
87% of the total variance in the factor matrix and receives no loading less than .69 from any single test. The residuals in Table 1-6 are never as high as .20 and
are always small in proportion to the observed correlations and the respective
products of factor loadings.
The existence of a substantial general factor seems to be demonstrated, though the possibility remains that there is some unique variance that is associated with the FSI Oral Interview which is not also associated with the other tests used. A two-factor explanation is supported by a varimax rotated orthogonal solution derived from the principal components analysis. The orthogonal solution is displayed in Table 1-7. The heaviest loadings on Factor 1 in Table 1-7 are from the subscales of the FSI Oral Interview while the heaviest loadings on Factor 2 are from the cloze and CESL placement subtests. An oblique two-factor solution (not displayed), however, revealed a .71 correlation between two similarly differentiated factors. Hence the evidence for clearly distinct variance associated with a speaking factor is not completely convincing, but neither can it be ruled out. By comparing the eigenvalue associated with the two-factor solution in Table 1-7 with the eigenvalue associated with the one-factor solution in Table 1-4, it is possible to form an impression of the advantage gained by the two-factor solution over the one-factor—about 13% of the variance in the two-factor solution is not accounted for by the g factor.
A third and final set of data comes from 51 of the above-mentioned subjects who also took the TOEFL. The data from these subjects with the five TOEFL subtests included are given in Tables 1-8 to 1-10. In this case, the g factor accounts for only .65 of the total variance in the principal components matrix, while two additional factors are required to account for the remaining .35. The absolute mean of the residuals is .155 and has a range of .36, which is considerably larger than for either of the two previous populations. However, there is considerably less variance in the latter population on all tests. This is because the procedure for selecting the subjects to take the TOEFL eliminated roughly the bottom half of the distribution—i.e., no subject who placed below the middle of the distribution also took the TOEFL. Hence the correlations in Table 1-9 for the 51 subjects are depressed as compared with the correlations in Table 1-5 for the full 106 subjects. For instance, whereas in Table 1-5 hardly any of the correlations are below .5, in Table 1-9 many are below .3.

Table 1-8
Test                             Loading on g factor    h²
Cloze                                  .80             .64
FSI Accent                             .29             .08
FSI Grammar                            .68             .46
FSI Vocabulary                         .66             .44
FSI Fluency                            .64             .41
FSI Comprehension                      .65             .42
CESL Listening Comprehension           .76             .58
CESL Structure                         .45             .20
CESL Reading Comprehension             .58             .34
TOEFL Listening Comprehension          .67             .45
TOEFL English Structure                .73             .53
TOEFL Vocabulary                       .57             .32
TOEFL Reading Ability                  .78             .61
TOEFL Writing Ability                  .68             .46
Eigenvalue                            5.94

Table 1-9 Correlation Matrix (above Diagonal) and ... [remainder of caption and table body not recoverable in this copy]

Table 1-10 Residual Matrix with g Loadings Partialed Out [table body not recoverable in this copy; as noted above, the mean of the absolute values is .155 and the range is .36]
Table 1-11 gives a varimax rotated solution for the 51 subjects over the 14 tests, indicating three orthogonal factors which may tentatively be labeled “reading/graphic” (Factor 1, with .39 of the variance), “oral interview” (Factor 2, with .38 of the variance), and “listening” (Factor 3, with .23 of the variance). The total eigenvalue for these three factors is 9.20 as compared with 5.94 in Table 1-8. Hence the three-factor solution accounts for 35% more variance than the single-factor solution.
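As a quick editorial check, the 35% appears to be the share of the three-factor solution's variance that the single g factor does not capture:

```python
# Editorial arithmetic check on the 35% figure cited above.
one_factor, three_factor = 5.94, 9.20   # eigenvalue totals from the text
print(f"{(three_factor - one_factor) / three_factor:.0%}")   # 35%
```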
Table 1-11 Varimax Rotated Solution (with Iterations) for a Cloze Test, the Five Subscales of the FSI Oral Interview, the Three Subtests of the SIU CESL Placement Examination, and the Five Subtests of the TOEFL (N = 51) [table body not recoverable in this copy]
*Factors 1, 2, and 3 account for 39, 38, and 23%, respectively, of the total variance in the matrix.

The results of this last analysis demonstrate the existence of a substantial g factor but do not rule out the possibility of unique variances associated with subtests aimed at separate skills (though unique variances associated with separate components of skills are ruled out). Further research will be necessary to determine whether Factor 1 in Table 1-7 and Factor 2 in Table 1-11 indeed constitute a “speaking” factor in the most general sense—i.e., whether such factors will have variances in common with other tests aimed at speaking ability (e.g., oral cloze, reading aloud, sentence repetition) which are not also common to tasks relying on other skills. Similarly, further research will be required to see if other tests that require listening comprehension will load on a factor such as Factor 3 in Table 1-11, which is actually distinct from the possible speaking and graphic factors.
Considering the results of all three sets of data, the notion of separate components of structure, vocabulary, and phonology finds no support. There is substantial evidence that the five subscales on the FSI Oral Interview, for instance, are equivalent. The choice between the unitary competence hypothesis and the possibility of separate skills is less clear. There is some evidence to suggest that (excluding the oral interview data) if the data represent the whole range of subject variability, the unitary competence hypothesis may be the best explanation, but if the variability is somewhat less, a moderate version of a separate skills hypothesis would be preferred. Regarding the oral interview data, there seems to be some unique variance associated either with a separate speaking factor or with a consistency factor—the tendency of judges simply to rate subjects similarly on all of the FSI scales. Certainly there is substantial evidence that a general factor exists which accounts for .65 or more of the total variance in the several batteries of tests investigated.
Note
1. A version of this paper was presented at the winter meeting of the Linguistic Society of America in
Philadelphia, Dec. 30, 1976.
Chapter 2
Language Ability—Divisible or Unitary?
This paper reports a partial replication of Oller and Hinofotis (Chap. 1). Two hypotheses concerning the nature of language skill are evaluated. The divisible competence hypothesis considers language proficiency a composite of various skills and components. It is related to the discrete point approach to teaching and testing. Although each of the components and/or skills is expected to have a certain amount of overlapping variance with other components and/or skills, it is assumed that tests of each separable skill or component will have a certain amount of unique variance not associated with tests designed for the other language components or skills. The second hypothesis, which claims that language ability is essentially unitary, predicts that once the common variance in a variety of language tasks has been explained, there will be no leftover unique variances which can be attributed to separate skills or components. A factor analysis of the data from 182 subjects who participated in an experimental English language testing project at the Center for English as a Second Language at Southern Illinois University in the spring of 1977 tends to support the unitary competence hypothesis.
Two mutually exclusive hypotheses have been offered to explain the nature of language competence. The first has been called the divisible competence hypothesis (Chap. 1). According to this hypothesis, language ability can be separated into a number of relatively independent parts. Thus linguists traditionally divide language ability into the areas of phonology, syntax, morphology, semantics, and so on. This schema is reflected in discrete point theories of language testing, where it is assumed that an efficient language test must be aimed at only one
skill (i.e., oral, reading, or writing) or only one modality of a skill (listening versus
speaking).
The appropriateness of the unitary competence hypothesis can be determined by factoring a number of language tests to a principal components solution. Then the existence of a general factor can be tested by using the products of the first factor to predict the original correlation matrix. If the difference between the factor loading products and correlations is relatively small, this would indicate that one general factor accounts for the majority of the common variance among all the language tests.
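As an editorial illustration of this test procedure, the sketch below extracts the first principal component of a small invented correlation matrix and inspects the residuals; the function name, the toy matrix, and the interpretation in the comments are ours, not the authors' (numpy assumed):

```python
# Editorial sketch: first principal component of a correlation matrix R,
# and what the products of the first-factor loadings leave unexplained.
import numpy as np

def first_factor_residuals(R: np.ndarray) -> np.ndarray:
    """Return R minus the rank-1 reconstruction from the first factor."""
    eigvals, eigvecs = np.linalg.eigh(R)      # eigenvalues in ascending order
    lam, v = eigvals[-1], eigvecs[:, -1]      # largest principal component
    loadings = np.sqrt(lam) * v               # first-factor loadings
    if loadings.sum() < 0:
        loadings = -loadings                  # resolve the arbitrary sign
    residuals = R - np.outer(loadings, loadings)
    np.fill_diagonal(residuals, 0.0)          # diagonal holds communalities
    return residuals

R = np.array([[1.00, 0.69, 0.76],
              [0.69, 1.00, 0.71],
              [0.76, 0.71, 1.00]])            # invented three-test matrix
print(np.round(first_factor_residuals(R), 2)) # small residuals favor one g factor
```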
Method
Subjects. All 182 students enrolled at the Center for English as a Second Language (Southern Illinois University) during the second term of the spring semester of 1977 were tested. The largest language backgrounds represented were Farsi, Arabic, Spanish, and Japanese. Only subjects actually enrolled in CESL classes were tested. Those who passed the normal placement tests at sufficiently high levels to be exempted, of course, were not included in the testing.
Procedure. Twenty-two test scores were obtained. Since nearly all the testing had to be done during class time and because the courses at CESL are limited to about 15 students each (spaced over five levels ranging from beginning to advanced), only 27 of the students completed all the tests. This was largely due to absences over the several weeks of testing. The smallest number of subjects completing any pair of tests, however, was 65 and the largest was 165.
The test data in all cases were keypunched, and two factoring procedures were applied—a principal components analysis was followed by a varimax rotation. The main issue was to find the factor solution that best explained the maximum amount of variance in the data. More specifically, the problem was to choose between the multiple-factor solution (the varimax rotation) and the single-factor solution (the first factor of the principal components analysis). Choosing the latter solution would eliminate the divisible competence hypothesis, and choosing the former would eliminate the unitary competence hypothesis.
In order to maximize the possibility of the multiple-factor result, multiple tasks requiring listening, speaking, reading, writing, and grammatical decisions were included among the tests. Here we concentrated primarily on the differentiation of skills rather than components of skills because of the previous results of Oller and Hinofotis which tended to show that the frequently posited components of phonology, vocabulary, and syntax probably do not constitute distinct sources of test variance. We thought perhaps it would be possible to sort out the skills of listening, speaking, reading, and writing, or possibly some combination of subsets of these skills such as listening/speaking versus reading/writing or perhaps listening/reading versus speaking/writing.
Tests.* The measures used will be discussed briefly here as they fall into the categories of listening, speaking, reading, writing, and grammar tasks. Excluding the scales from the Foreign Service Institute Oral Interview, two of the subtests from Harris’s Comprehensive English Language Test, and the CESL Reading Test, all the tests used are reproduced in the Appendix at the end of this volume. (Some of them are also used in several of the subsequent chapters—especially Chaps. 6, 7, 12, 14, 17, 23, and 24.) All that is given here is a brief description of each measure.
*5. Dictation. A standard dictation was administered and scored in the traditional way (one point for each correct word). For a recent discussion of the technique, see Oller (1979, Chap. 10), also his references.
consisting of two raters each. In all, it was possible to interview only 70 of the total
182 subjects because of scheduling conflicts and attrition. The interviews were
conducted throughout the period of testing between February and April 1977.
*11. Repetition (Elicited Imitation). Each of three taped texts was first read in its entirety and then repeated in briefer segments while students were asked to repeat in pauses provided. Responses were taped at the CESL/SIU language laboratory and later were scored by exact and acceptable word methods (see Chap. 7).
*12. Oral Cloze Test (with Spoken Responses). Each of five passages was first read without deletions, then read again with every seventh word deleted. Students were asked to provide the missing words at pause points. Responses were taped and scored by exact and acceptable word methods.
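For readers unfamiliar with these scoring conventions, here is a minimal editorial sketch of the exact versus acceptable word methods; the function names, answer key, and lists of acceptable alternatives are invented for illustration:

```python
# Editorial sketch of the two cloze scoring conventions named above.

def score_exact(responses: list[str], key: list[str]) -> int:
    """Credit a blank only when the response is the deleted word itself."""
    return sum(r.strip().lower() == k.lower() for r, k in zip(responses, key))

def score_acceptable(responses: list[str], acceptable: list[set[str]]) -> int:
    """Credit any response on a judge-approved list of contextually
    acceptable alternatives (which includes the exact word)."""
    return sum(r.strip().lower() in alts for r, alts in zip(responses, acceptable))

key = ["over", "watches", "near"]
acceptable = [{"over", "across"}, {"watches", "looks"}, {"near", "beside"}]
answers = ["across", "watches", "by"]
print(score_exact(answers, key))              # 1
print(score_acceptable(answers, acceptable))  # 2
```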
*13. Reading Aloud. Students were asked to read three short paragraphs
aloud. Responses were taped and scored by the exact word method.
14. CESL Reading Test. This test is a modified version of the Science Research Associates Reading for Understanding Placement Test (Thurstone, 1963). It is used as part of the CESL placement battery (also including tests 1 and 21 of the tests described in this section).

Table 2-1
Test                                          Loading on g factor    h²
CELT Listening Comprehension                        .64             .41
Listening Cloze (Open-Ended)                        .65             .42
Listening Cloze (Multiple-Choice Format)            .38             .14
Multiple-Choice Listening Comprehension             .46             .21
Dictation                                           .74             .55
Oral Interview—Accent                               .51             .26
Oral Interview—Grammar                              .83             .69
Oral Interview—Vocabulary                           .77             .59
Oral Interview—Fluency                              .75             .56
Oral Interview—Comprehension                        .84             .71
Repetition                                          .71             .50
Oral Cloze (Spoken Responses)                       .75             .56
Reading Aloud                                       .67             .45
CESL Reading Test                                   .70             .49
Multiple-Choice Reading Match                       .74             .55
Standard Cloze                                      .77             .59
Essay Ratings (by Teachers)                         .77             .59
Essay Score                                         .74             .55
Multiple-Choice Writing                             .85             .72
Recall Rating                                       .80             .64
CELT Structure                                      .67             .45
Grammar (Parish Test)                               .82             .67
*16. Standard Cloze. This score was based on the sum of scores on eight
cloze tests in a traditional format. Six of the passages used here are discussed in
greater detail in Chap. 12. The remaining texts are given in the Appendix.
17. Essay Rating (by Teachers). The students wrote an essay about an
accident, who was involved, who was at fault, etc. It was scored on a six-point scale
discussed in Chap. 14.
*19. Multiple-Choice Writing Task. This task required three different performances on three multiple-choice tests over three passages each (nine in all). The first subtest required selecting an appropriate word, phrase, or clause to form a continuation of the text. The second set of passages required editing of errors, and the third required correct ordering of words, phrases, or clauses.

*20. Rating of Recall Task. Three short paragraphs were each displayed for one minute, using an overhead projector. Students were instructed to write down what they had read, paying special attention to the meaning. Responses were subjectively rated on a six-point scale (see Chap. 14 for amplification).

Table 2-2 Observed Correlations (above Diagonal) and Predicted Correlations, Products of Loadings on g (below Diagonal). [Columns 1-7 are largely unrecoverable in this copy; only the row for Test 3, Listening Cloze (Multiple-Choice Format), survives: .24 .24 -- .77 .24 .06 .39. Columns 8-22 follow; -- marks the diagonal.]

Test                                           8    9   10   11   12   13   14   15   16   17   18   19   20   21   22
 1 CELT Listening Comprehension              .48  .41  .46  .43  .58  .36  .61  .37  .33  .46  .43  .52  .49  .53  .45
 2 Listening Cloze (Open-Ended)              .46  .61  .68  .57  .58  .41  .32  .40  .52  .38  .37  .54  .31  .24  .44
 3 Listening Cloze (Multiple-Choice Format)  .36  .23  .23  .26  .41  .12  .30  .08  .14  .09  .28  .22  .15  .48  .26
 4 Multiple-Choice Listening Comprehension   .37  .25  .36  .34  .38  .27  .50  .20  .23  .25  .33  .26  .23  .47  .36
 5 Dictation                                 .39  .50  .50  .51  .61  .41  .39  .66  .73  .53  .66  .70  .51  .48  .69
 6 Oral Interview—Accent                     .39  .54  .44  .42  .43  .34  .25  .17  .38  .30  .39  .38  .28  .32  .37
 7 Oral Interview—Grammar                    .89  .78  .87  .54  .50  .52  .79  .60  .51  .64  .47  .63  .64  .47  .57
 8 Oral Interview—Vocabulary                  --  .77  .84  .50  .47  .45  .46  .52  .39  .57  .50  .61  .61  .44  .52
 9 Oral Interview—Fluency                    .58   --  .82  .55  .50  .47  .37  .50  .43  .54  .52  .58  .52  .29  .49
10 Oral Interview—Comprehension              .65  .63   --  .62  .58  .51  .40  .66  .50  .61  .62  .69  .61  .34  .59
11 Repetition                                .55  .53  .60   --  .68  .51  .37  .41  .39  .43  .50  .60  .52  .32  .56
12 Oral Cloze (Spoken Responses)             .58  .56  .63  .53   --  .41  .45  .48  .49  .54  .62  .57  .49  .49  .59
13 Reading Aloud                             .52  .50  .56  .48  .50   --  .56  .50  .58  .47  .38  .54  .62  .37  .57
14 CESL Reading Test                         .54  .53  .59  .50  .53  .47   --  .49  .51  .60  .41  .61  .63  .70  .62
15 Multiple-Choice Reading Match             .57  .56  .62  .53  .56  .50  .52   --  .73  .61  .55  .69  .62  .38  .64
16 Standard Cloze                            .59  .58  .65  .55  .58  .52  .54  .57   --  .61  .61  .73  .71  .56  .74
17 Essay Ratings (by Teachers)               .59  .58  .65  .55  .58  .52  .54  .57  .59   --  .65  .67  .69  .54  .60
18 Essay Score                               .57  .56  .62  .53  .56  .50  .52  .55  .57  .57   --  .66  .54  .43  .55
19 Multiple-Choice Writing                   .65  .64  .71  .60  .64  .57  .60  .63  .66  .65  .63   --  .63  .59  .68
20 Recall Rating                             .62  .60  .67  .57  .60  .54  .56  .59  .62  .62  .59  .68   --  .59  .70
21 CELT Structure                            .52  .50  .56  .48  .50  .45  .47  .50  .52  .52  .50  .57  .54   --  .65
22 Grammar (Parish Test)                     .63  .62  .69  .58  .62  .55  .57  .61  .63  .63  .61  .70  .66  .55   --
Table 2-3 Residual Matrix with g Loadings Partialed Out (Mean of Absolute Values = .082, SD = .067): Observed r minus Product of Loadings on g. [Row 1 of the first block is incomplete in this copy; -- marks the diagonal.]

Test                                           2     3     4     5     6     7
 1 CELT Listening Comprehension             (four of six values legible: -.03 .01 .00 -.10)
 2 Listening Cloze (Open-Ended)                  -.11  -.19   .08   .09   .08
 3 Listening Cloze (Multiple-Choice Format)            .60  -.04  -.13   .07
 4 Multiple-Choice Listening Comprehension                  -.05  -.25   .07
 5 Dictation                                                      -.13  -.23
 6 Oral Interview—Accent                                                 .04
 7 Oral Interview—Grammar

Test                                           8    9   10   11   12   13   14   15   16   17   18   19   20   21   22
 1 CELT Listening Comprehension             -.01 -.07 -.08 -.02  .10 -.07  .16 -.10 -.16 -.03 -.04 -.02 -.02  .10 -.07
 2 Listening Cloze (Open-Ended)             -.04  .12  .16  .11  .09 -.03 -.14 -.08  .02 -.12 -.11 -.01 -.15 -.20 -.09
 3 Listening Cloze (Multiple-Choice Format)  .07 -.06  .01 -.01  .12 -.13  .03 -.20 -.15 -.20  .00 -.10 -.15  .23 -.05
 4 Multiple-Choice Listening Comprehension   .02 -.10 -.03  .01  .03 -.04  .18 -.14 -.12 -.10 -.01 -.13 -.08  .16 -.02
 5 Dictation                                -.18 -.06 -.12 -.02  .05  .09 -.13  .11  .16 -.04  .11  .07 -.08 -.02  .08
 6 Oral Interview—Accent                     .00  .16  .01  .06  .05  .00 -.11 -.21 -.01 -.09  .01 -.05 -.03 -.02 -.05
 7 Oral Interview—Grammar                    .25  .16  .17 -.05 -.12 -.04 -.09 -.01 -.13  .00 -.14 -.08 -.02 -.09 -.11
 8 Oral Interview—Vocabulary                  --  .19  .19 -.05 -.11 -.07 -.08 -.05 -.20 -.02 -.07 -.04 -.01 -.08 -.11
 9 Oral Interview—Fluency                          --  .00  .02 -.06 -.03 -.16 -.06 -.15 -.04 -.04 -.06 -.08 -.21 -.13
10 Oral Interview—Comprehension                         --  .02 -.05  .03 -.19  .04 -.15 -.04  .00 -.02 -.06 -.22 -.10
11 Repetition                                                --  .15  .03 -.13 -.12 -.16 -.12 -.03  .00 -.05 -.16 -.02
12 Oral Cloze (Spoken Responses)                                  -- -.09 -.08 -.08 -.09 -.04  .06 -.07 -.11 -.01 -.03
13 Reading Aloud                                                       --  .09  .00  .06 -.05 -.12 -.03  .08 -.08  .02
14 CESL Reading Test                                                        -- -.03 -.03  .06 -.11  .01  .07  .23  .05
15 Multiple-Choice Reading Match                                                 --  .16 -.06  .00  .06  .03 -.12  .03
16 Standard Cloze                                                                     --  .02  .04  .07  .09  .04  .11
17 Essay Ratings (by Teachers)                                                             -- -.08  .02  .07  .02 -.03
18 Essay Score                                                                                  --  .03 -.05 -.07 -.06
19 Multiple-Choice Writing                                                                           -- -.05  .02 -.02
20 Recall Rating                                                                                          --  .05  .04
21 CELT Structure                                                                                              --  .10
22 Grammar (Parish Test)
.36, and h²'s of .21 and .14, respectively), and the FSI Oral Interview Accent scale
(with a correlation of .51, h² = .26).
Following the method recommended by Nunnally (1967) and applied by
Oller and Hinofotis (Chap. 1), Table 2-2 presents the predicted correlations
between the tests, based on the products of their g loadings, above the diagonal,
while the actual correlations between subtests are given below the diagonal. The
Residual Coefficient Matrix in Table 2-3 presents the portion of each correlation
which remains after the product of the loadings on g is subtracted from the
original correlation. Thus the product of the loadings for the Essay Rating and the
CESL Reading Test is .54, while the actual correlation between these two tests is
.60. A positive residual of .06 remains after the variance which can be attributed to
g has been removed. This operation is performed for all the products of loadings on g.
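The same computation can be stated compactly in matrix terms. What follows is a
minimal sketch in Python with numpy, offered purely as an illustration; the loadings
and correlations shown are made-up values, not those of the present study.

    import numpy as np

    # Made-up g loadings for three hypothetical tests
    g = np.array([0.82, 0.80, 0.67])

    # Made-up observed correlations among the same three tests
    observed = np.array([[1.00, 0.70, 0.55],
                         [0.70, 1.00, 0.58],
                         [0.55, 0.58, 1.00]])

    predicted = np.outer(g, g)        # product of g loadings for each pair of tests
    residual = observed - predicted   # observed r minus predicted r
    np.fill_diagonal(residual, 0.0)   # the diagonal is not examined here
    print(residual.round(2))          # residual[0, 1] = .70 - (.82)(.80) = .04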
An examination of Table 2-3 reveals that relatively small residuals remain
once the g factor has been extracted. The most notable exception is the residual for
the Listening Cloze (in multiple-choice format) and the Multiple-Choice Listening
Comprehension subtests. However, these high residuals may be due to unreliability
in these tests. They were administered near the end of the testing period
when student and staff fatigue was near its limits.

Table 2-4 Varimax Rotated Solution (without Iterations) for the Five
Subscales of the FSI Oral Interview, the Three Subtests of the CESL
Placement Test, and the Eighteen Subtests of the CESL Testing Project
(N = 65 to 162 subjects)*

*Only factor loadings above .32 (p < .05, with 65 df) are reported. The significant
loadings on all four factors account for 57% of the total variance in all the tests.
Allowing a certain degree of error variance due to unreliability in each test,
there is surprisingly little residual variance once the g factor has been accounted
for. Despite the fact that the tests can be grouped into the traditional areas of
listening, speaking, reading, and writing, separate factors for these skill areas do
not emerge. The results fail to reveal any unique variance which can be associated
with separate language skills or modalities. The large amount of overlapping
variance between the tests tends to support the unitary competence hypothesis.
Another way of factoring the data is to apply a varimax rotation to the principal
components solution. This technique sets up uncorrelated reference vectors
and differentiates the mathematical factors thus defined by maximizing the
loading of each contributing test on one and only one factor. Hence, if the divisible
competence hypothesis were correct, we might expect listening, speaking, reading,
and writing tasks to load on different factors, or we might expect measures of
phonology, syntax, and vocabulary to load on separate factors. As shown in
Table 2-4, all the scales of the FSI-type Oral Interview, with the exception of the
accent rating, load heavily on Factor 2. The three subparts of the CESL Placement
Test load mostly on Factor 4. However, the eighteen experimental tests are scattered
over all four factors in no discernible pattern which could be associated with
the divisible competence hypothesis. For instance, Factor 1 receives significant
loadings from two listening tasks (Listening Cloze, Open-Ended, .42; Dictation
.84), four speaking tasks (Oral Interview Comprehension .39, Repetition .38, Oral
Cloze .46, Reading Aloud .34), two reading tasks (Multiple-Choice Reading
Match .74, Standard Cloze .77), all four writing scores (Essay Ratings .50, Essay
Score .63, Multiple-Choice Writing .63, and Recall Rating .43), and finally from
Parish's Grammar Test (.60). The remaining three factors shown in Table 2-4 are
similarly complex. The significant loadings on the four factors taken altogether
account for 57% of all the variance in all the tests. This represents only a 6% gain
over the single-factor solution.
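Readers who wish to experiment with this kind of rotation on their own data may
find a sketch useful. The following minimal varimax routine, written in Python with
numpy, illustrates the criterion described above; it is not the SPSS procedure (Nie
et al., 1975) on which the reported solution was actually based.

    import numpy as np

    def varimax(loadings, tol=1e-6, max_iter=100):
        # Rotate a loading matrix (tests x factors) by the varimax criterion,
        # maximizing the variance of the squared loadings within each factor.
        p, k = loadings.shape
        rotation = np.eye(k)
        criterion = 0.0
        for _ in range(max_iter):
            rotated = loadings @ rotation
            u, s, vt = np.linalg.svd(
                loadings.T @ (rotated ** 3
                              - rotated @ np.diag((rotated ** 2).sum(axis=0)) / p))
            rotation = u @ vt
            previous, criterion = criterion, s.sum()
            if previous != 0 and criterion / previous < 1 + tol:
                break
        return loadings @ rotation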
In conclusion, no clear pattern appears in which tests are grouped according
to the posited skills of listening, speaking, reading, and writing, or components of
phonology, lexicon, or grammar. The data seem to fit best with the unitary competence
hypothesis, and the divisible competence hypothesis is thus rejected.
Note
1. Editors' note: The factor solutions presented in Tables 2-1 and 2-4 are based on pairwise deletion
of missing cases (Nie, Hull, Jenkins, Steinbrenner, and Bent, 1975). They differ somewhat, therefore,
from those presented in Oller (1979, Appendix, Tables 1 and 2), which were based on listwise deletion
of missing cases. However, the unitary factor solution is preferred in both cases. For further discussion,
see Oller (1979), especially the section of the Appendix entitled "The Carbondale Project."
Chapter 3
Douglas E. Flahive
This chapter addresses topics that have been focal points of extensive educational
research since the beginning of this century: intelligence and language comprehension.
Despite this extensive research, little agreement exists among
researchers concerning even the basic definitions of the terms. Is intelligence a
static construct measured by performance on standardized IQ tests, as Jensen
claims, or is it a dynamic construct, as Piaget and his followers claim? And what
about language comprehension? What is it, and how is it measured? If one reads
and attempts to synthesize the numerous studies that have addressed the second
question, the frustrating conclusion is that there is no universally accepted definition
of language comprehension, nor is there a universally valid technique for
assessing it.
*Editors' note: TOEFL scores are converted to a standardized scale with a mean of 500 and a standard
deviation of 100, based on scores of the original reference population tested in 1964 (ETS, 1973).
However, the average score for all candidates tested between October 1977 and June 1971 was 492,
with a standard deviation of 80.
the results of the McGraw-Hill test, which ranks these students from the 9th to the
85th percentile when compared against a reference group of college freshmen and
sophomores. In fact, the mean of the latter group of native speakers of English falls
at the 22nd percentile of Raven's reference group. Performance on the Perkins-
Yorio test was only slightly lower than scores achieved by college freshmen who
were native speakers of English.
The correlations among the five measures are seen in Table 3-2. The fact that
all the measures are highly correlated is not surprising, since three of the tests are
purportedly testing the same skill, reading. Nor are the correlations between the
Perkins-Yorio test and the TOEFL, or between the cloze and the TOEFL,
surprising. They are all measures of language proficiency. What is surprising is the high correlation
between the Raven’s and the McGraw-Hill test. Over 70% of the variance in the
McGraw-Hill test scores is accounted for by scores on the nonverbal intelligence
test. To contrast the power of the Raven's as a predictor of scores on the three
reading tests with that of the TOEFL as predictor of the reading scores, three
separate regression analyses were run with the Raven's and the TOEFL as
predictors in the full models and the Raven's alone in the restricted models.
Results of these analyses are given in Table 3-3. Almost as much variance in the
McGraw-Hill test was accounted for by the Raven's alone as by the Raven's and
TOEFL combined.

Table 3-3 (in part)
                                                R²
Raven's and TOEFL to predict McGraw-Hill      .716
Raven's alone to predict McGraw-Hill          .707
However, in attempting to predict scores on the Perkins-Yorio and the cloze,
significantly less variance was accounted for by the Raven’s without the TOEFL.
From the description of the tests given above, the reasons for these results seem
rather clear. Neither the cloze nor the Perkins-Yorio requires the complex
reasoning involved in the McGraw-Hill multiple-choice reading test.
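The full-versus-restricted comparison can be reproduced with ordinary least
squares. A minimal sketch in Python with numpy follows; the array names (ravens,
toefl, mcgraw) are hypothetical placeholders for score vectors that are not
reproduced in this volume.

    import numpy as np

    def r_squared(predictors, criterion):
        # R-squared for an ordinary least-squares model with an intercept.
        X = np.column_stack([np.ones(len(criterion)), predictors])
        beta, *_ = np.linalg.lstsq(X, criterion, rcond=None)
        residuals = criterion - X @ beta
        deviations = criterion - criterion.mean()
        return 1.0 - (residuals @ residuals) / (deviations @ deviations)

    # Full model: Raven's and TOEFL together; restricted model: Raven's alone.
    # r2_full = r_squared(np.column_stack([ravens, toefl]), mcgraw)
    # r2_restricted = r_squared(ravens.reshape(-1, 1), mcgraw)
    # The difference r2_full - r2_restricted is the variance in the criterion
    # uniquely attributable to the TOEFL.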
Conclusion
With the exception of the relatively high correlation between reading scores and
the nonverbal test of intelligence, the results of this study are, for the most part, not
surprising.
Note
1. The author of the Raven's claims that it is not a true test of intelligence but rather a "test of
observation and clear thinking." Then he goes on to say that it correlates at the .86 level with the
Terman-Merrill scale and has a g saturation of .82.
Appendix 3A
Perkins-Yorio Test
1. The athlete got plenty of rest after the race.
a. The athlete received a lot of money after the race.
b. Someone gave the athlete an important gift after the race.
c. The athlete enjoyed plenty of relaxation after the race.
d. The athlete became very famous after the race.
3. Tom was told to turn the lights out before closing the lab.
a. Somebody told Tom to turn the lights out before closing the lab.
b. Tom decided to turn the lights out before closing the lab.
c. Turning the lights out before closing the lab was Tom’s idea.
d. Tom had told somebody to turn the lights out before closing the lab.
8. If they had had more time, Bob and David would have visited some of their
friends in Chicago.
a. Bob and David didn’t visit their friends.
b. Bob and David visited all of their friends.
c. Bob and David visited some of their friends.
d. Bob and David don't have any friends in Chicago.
10. Sam gave John the book his brother bought him several years ago. Who has
the book?
a. John. c. Sam’s brother.
b. Sam. d. John’s brother.
11. The new textbook has been written by one of the teachers at the Institute.
a. The Institute wrote the new textbook.
b. No teacher at the Institute has ever written a textbook.
c. There is a new textbook and a teacher at the Institute wrote it.
d. Teachers at the Institute have never written textbooks.
12. That Carlos and Juan got 80 on the proficiency test without studying very
much surprised the rest of the class.
a. Nobody was surprised that Carlos and Juan got 80 on the proficiency
test because they had studied a lot.
b. Carlos and Juan were surprised that they got 80 on the proficiency test
because they hadn’t studied very hard.
c. Although Carlos and Juan studied very hard, they were surprised that
the rest of the class got 80 on the proficiency test.
d. Everybody in the class was surprised that Carlos and Juan got 80 on the
proficiency test because they had not studied very hard.
13. Our visit to the country was very restful because we had been working very
hard.
a. We were never able to relax in the country.
b. Our visit to the country was bad for our nerves.
c. We were very relaxed after our visit to the country because we didn’t
have to work there.
d. We were very tired in the country because we had to work there.
14. The house near the shore of the lake was destroyed by the storm last night.
a. The storm destroyed the shore of the lake last night.
b. The storm destroyed the house near the lake last night.
c. The lake destroyed the house near the shore last night.
d. There was a storm last night and it destroyed the shore of the lake.
15. The judge determined that the report contained an obvious untruth.
a. The judge showed that the material in the report was clearly correct.
b. The judge didn’t prove that the report was true.
c. The judge showed the error that the report contained.
d. Everybody knew that the report was full of lies except the judge.
16. “Had Tom known they were coming, would he have waited for them?” asked
Mary.
a. Tom knew they were coming.
b. Tom didn’t know they were coming.
c. Mary is asking if Tom knew they were coming.
d. Tom is asking if they were coming.
Perkins-Yorio test 41
17. Tom met Jane at his best friend's brother's house. Where did Tom meet
Jane?
a. at Tom’s friend’s house.
b. at Jane’s house.
c. at the house of a brother of Tom’s friend.
d. at Tom’s brother's house.
18. The mailman carried the bag which contained the letters.
a. The mailman had a bag; there were no letters in it.
b. The mailman carried the letters in a bag.
c. The mailman carried an empty bag.
d. The mailman carried the letters in his hand.
20. “Never will I repeat a thing like that,” said Alice to her parents.
a. Alice is making a promise to her parents.
b. Alice is asking her parents a question.
c. Alice’s parents are asking her a question.
d. Alice’s parents are making her a promise.
21. I didn’t know that Mac hadn’t been killed after all.
a. Mac was killed but I didn't know it.
b. Mac wasn't killed and I knew it.
c. I knew that Mac was dead.
d. I didn’t know that Mac was alive.
27. The man who won the contest in 1971 married the girl who won the contest
in 1972.
a. The man won the contest in 1972 and then married the girl.
b. The girl won the contest in 1972 and then married the man.
c. The man married the girl and she won the contest in 1971.
d. The man won the contest in 1971; the girl won the contest in 1972, and
now they are married.
28. The joke wasn’t very funny; yet, the audience laughed unendingly.
a. There was no end to the audience’s laughter.
b. The audience didn’t laugh at the end of the joke, because it wasn’t very
funny.
c. The audience stopped laughing when they realized the joke wasn't very
funny.
d. The audience started laughing before the bad joke ended.
29. Paul asked his brother’s advice because he couldn’t decide which car to buy.
a. Paul had difficulty deciding which car to buy.
b. Paul knew which car to buy.
c. Paul asked his brother to buy the car.
d. Paul’s brother bought a car following his advice.
31. Tom hadn’t intended on staying up so late but his friends didn’t leave until
midnight.
a. Tom wanted to stay up until midnight.
b. Tom wanted to go to bed early.
c. Tom didn’t want to go to bed early.
d. Tom’s friends left early so he went to bed.
Perkins-Yorio Test 43
36. Jim was said to have been seen leaving the scene of the crime accompanied
by a blonde woman.
a. Somebody said that Jim had seen the blonde woman leaving the scene of
the crime.
b. Jim and a blonde woman said that they had seen someone leaving the
scene of the crime.
c. A blonde woman said that she had seen Jim accompanied by someone
leaving the scene of the crime.
d. Somebody said that they had seen Jim and a blonde woman leaving the
scene of the crime.
37. Neither her daughter nor her son was at home when Mrs. Wilson returned.
Who was at home?
a. Mrs. Wilson’s son. c. No one.
b. Mrs. Wilson’s daughter. d. Both son and daughter.
39. The John Smith who introduced the speaker who received the award is not
the same John Smith who received the first award ever given two years ago.
a. John Smith received the first award two years ago and he introduced the
speaker.
b. John Smith introduced the speaker; the speaker received the first award
two years ago.
c. John Smith introduced the speaker and another John Smith received
the first award two years ago.
d. The speaker introduced John Smith and he had received the award two
years ago.
40. Although John hates vegetables, he ate the beans because Mary made them.
a. John likes beans.
b. John likes Mary.
c. John doesn't like Mary.
d. John likes vegetables, except for beans.
49. That important steps should be taken to solve the pollution problems that
affect our cities is clear to everyone.
a. Nobody thinks that the pollution problems that affect our cities are
really serious.
b. Everybody thinks it is clear that something ought to be done to stop
pollution in our cities.
c. The pollution problems that affect our cities are not very serious, and
everybody thinks that’s very clear.
d. Not everyone is sure that there is a pollution problem in the cities.
Appendix 3B
Cloze Test
The Capital Crisis
Economists have long agreed that the basis of the capitalistic system is, quite
simply, privately owned capital. Capital is simply money that comes from savings
and is used for investment. Now, however, economists have begun to __1__ the
causes and implications of a unique __2__. For the first time in more __3__ half
a century, the United States, __4__ some other industrial economies, faces a
shortage __5__ the capital that is needed to __6__ new goods, profits, and jobs.

The shortage __7__ capital poses a threat to __8__ survival of the country's
economic system. __9__ immediately, it raises doubts about the __10__ ability to
recover from the latest recession __11__ push unemployment down to about 5%
__12__ the work force and keep it __13__ that level. Ultimately the formation of
__14__ is the formation of jobs. __15__ down of the pace at which __16__
nation saves its earnings and invests __17__ resources results in reduced
economic activity __18__ fewer jobs. Yet the amount of __19__ devoted to
investment in the United States is __20__ that found in other major industrialized
__21__. Furthermore, the need for additional goods __22__ and technology has
never been greater. In the __23__ few years, enormous amounts of money __24__
be required to develop alternative sources __25__ energy such as nuclear power,
improve pollution __26__, erect more efficient factories, and __27__ rebuilding
the decaying cities.

But __28__ the funds be available? Although economists __29__ been
debating the question for some __30__, the issue has only recently attracted the
__31__ of the federal government. Some government __32__ have begun to
recognize that the current __33__ of capital could cause an economic __34__ far
more serious than ever before __35__ the history of the country.

Basically, __36__ are two ways of approaching the __37__. Some liberal
economists advocate a system __38__ would encourage investors to invest their
money __39__ high priority projects such __40__ energy development. These
economists say that with such __41__ system available capital could be __42__
more effectively than it is today. Conservative economists, __43__, maintain that
such a system is unworkable and __44__ seriously harm the capitalistic system.
__45__ solution to the problem is __46__ lower the tax on profits earned from
__47__ investments. In this way they hope to __48__ investment in all areas of the
economy. __49__ this time it is impossible to __50__ which of the two approaches
will work. The one fact that is certain is that some type of action is necessary soon if
an economic crisis is to be averted.
Chapter 4
An Analysis of
Various ESL Proficiency Tests
Kay K. Hisama
Two different methods of analysis were applied to the results of four ESL
proficiency tests. The profile method showed that the tests produced
somewhat different score patterns in relation to proficiency levels and native
language backgrounds of the subjects tested. On the other hand, a principal
components analysis revealed a single source of variance across all the tests
in spite of their different formats. It was concluded that score pattern
differences across the different tests may be partly attributable to minimal
sampling “biases” and unreliabilities in the tests. Nevertheless, all four tests
investigated reveal substantial reliability and validity as measures of a single
global proficiency factor.
example, the results of several tests are likely to give a better prognosis of college
achievement than any single test. The most obvious rationale for multiple
measurement is that any human ability is complex and thus requires a wide range
of item samples. It is believed, therefore, that a single set of test items or a single
test may fail to measure a very complex human ability such as language profi¬
ciency.
In the case of tests of English as a second language, however, there seems to
be a large amount of overlap in the information given by the subtests. Such overlap
is demonstrated by the rather high intercorrelations usually observed among them.
In the Seventh Mental Measurements Yearbook, for example, Chase (1972) noted
the substantial intercorrelations among the subtests of the TOEFL. Two possible
explanations were given: (1) there is an obvious overlap in format from test to test,
or (2) many common behaviors are required by the various subtests in the battery.
The same explanations can be offered for the overlap in variance on the parts of the
other two tests mentioned above, the CELT and the Michigan test.
These empirical facts as well as recent theoretical developments in language
and communication have generated much controversy among test specialists as to
how ESL proficiency should be measured. In response to the lead of Carroll
(1961), a substantial controversy has developed in terms of the contrast between
the discrete point and the integrative approaches to language testing. Certainly,
there is room for more empirical study. This investigation attempts to shed some
light on the multiple measurement of ESL proficiency based on empirical data ob¬
tained from the Center for English as a Second Language (CESL) at Southern
Illinois University, Carbondale.
Method
Subjects. A total of 136 nonnative students at CESL served as subjects. They
were tested during three separate six-week terms in October and November 1975
and January 1976. Two criteria for selection were imposed: subjects had to be new
entering students at CESL, and they had to be recent arrivals to the United States
with no more than one month in residence at the time of testing. The test scores of
subjects who met these criteria were retained for analysis.
Language groups included: Arabic (N = 29), Farsi-Persian (N = 49), Spanish
(N = 27), African (N = 10), and Asian (N = 18). The latter group included speakers
of Japanese, Chinese, Vietnamese, and Thai. The remaining three subjects were
all from different language backgrounds.
Test Instruments. Four tests were administered upon entrance to CESL: the
first and second tests were the Structure and Listening subtests of the CELT; the
third was a remedial reading test entitled Reading for Understanding Placement
Test (RFUPT); and the fourth, designed by the author, was a modified cloze test
referred to as the New Cloze Test (NCT).
There were several contrasts between the tests. For one, the first three were
in a multiple-choice format. Examinees were required to choose one of four
alternatives in response to each test question. However, the fourth test was an
open-ended fill-in-the-blank cloze procedure. A second contrast was that the two
CELT tests were specially developed and validated with reference to nonnative
speakers of English, while the RFUPT was constructed with reference to a
remedial reading program designed for students at intermediate elementary
school levels through college.* The NCT, of course, was developed specifically for
testing the nonnative speakers at CESL.1
used for the two CELT tests was conversational. These tests contained many
colloquial expressions which foreign students might not have encountered
previously. On the other hand, the register for the RFUPT was appropriate to
written materials likely to be used at school. The bulk of the questions in the
RFUPT required that the subjects have a good conceptual understanding of
various school subjects such as general science and language arts or social studies.
Fourth, the total number of items for each of the CELT tests and the RFUPT
differed considerably, ranging from a low of 50 for the Listening test to a high of
100 for the RFUPT. Accordingly, the timing of the tests varied. The Listening test
required all examinees to proceed at the same speed while the RFUPT and the
Structure test had fixed time limits. These differences in the manner of timing, in
the total number of test items, together with differences in difficulty level of each
test, tended to change the effects of guessing from test to test. Finally, the NCT
substantially differed from the other three tests in that there was little or no chance
of subjects making correct random responses. For the multiple-choice tests the
chance of a correct guess is one in four, but on the open-ended cloze test the
odds against a correct guess are much stronger.
Method of Analysis. The scores of the four tests were analyzed by applying
two different techniques: a profile method and factor analysis. A profile is a
graphic summary of the results of multiple measurement. The principal advantage
of the profile is its simplicity—it presents test results in a nontechnical way.
The method is applicable, however, only to the extent that individual test scores
are reliable. Fortunately, all tests used in this study demonstrated substantial
reliability and concurrent validity, as can easily be inferred from the correlations and
the factor analysis discussed below. (For additional statistical data, see Hisama,
1977a.)
Since the four tests contained different numbers of items, the raw scores are
not directly comparable. Therefore, they were transformed to standard scores by
the following formula:
T = 50 + 10(X - X̄)/Sx

where T = standard score; X = raw score; X̄ = mean of the raw scores obtained by
the group; Sx = standard deviation of the raw scores.
*Editors' note: This is the same test referred to by Scholz et al. in Chap. 2 as the CESL Reading Test.
It is a modified version of the Science Research Associates Reading for Understanding Placement Test.
By transforming each score in this way, all tests were converted to a common
scale of measurement with a mean of 50 and a standard deviation of 10. By using
T-scores in the construction of profiles, direct comparisons may be made among
the various patterns of variability thus displayed. The T-scores on the four tests
used in this study are given in profile form, first by level of placement at CESL and
then by language group.
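For readers who wish to apply the same transformation to their own data, a
minimal sketch in Python with numpy follows; it is offered only as an illustration
of the formula given above.

    import numpy as np

    def t_scores(raw_scores):
        # Linear transformation of raw scores to mean 50, standard deviation 10.
        x = np.asarray(raw_scores, dtype=float)
        return 50.0 + 10.0 * (x - x.mean()) / x.std()

    print(t_scores([12, 15, 20, 25, 28]).round(1))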
Subsequently, the interrelationships among the four tests are explored by
using a factoring method. Factor analysis is not a single method. It subsumes a
wide variety of procedures. The most important characteristic of factor analysis is
its data-reduction capability. Given an array of correlation coefficients (a
correlation matrix) for a set of variables, factor analysis techniques enable us to see
whether some underlying pattern of relationships exists such that the data may be
“reduced” to a smaller set of factors (or components) that may be taken as source
variables accounting for the observed interrelations in the data. The method used
for this study was the principal component solution. Factors were extracted from
the correlation matrix with unities in the main diagonal.
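The extraction just described amounts to an eigendecomposition of the correlation
matrix. The sketch below, in Python with numpy, illustrates the technique; it is not
necessarily the routine used for the analyses reported here.

    import numpy as np

    def principal_components(corr):
        # Principal components of a correlation matrix with unities
        # in the main diagonal.
        eigvals, eigvecs = np.linalg.eigh(corr)
        order = np.argsort(eigvals)[::-1]      # largest component first
        eigvals, eigvecs = eigvals[order], eigvecs[:, order]
        # Guard against tiny negative roundoff before taking square roots.
        loadings = eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))
        explained = eigvals / eigvals.sum()    # proportion of total variance
        return loadings, explained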
setting up of levels in the first place, along with the Listening and Structure
portions of the CELT. Therefore, the fact that the NCT produces nearly as much
spread as the RFUPT is evidence of its concurrent validity.
Profile of ESL Test Scores by Native Language. Figure 4-2 shows the
profiles based on mean T-scores obtained on the four tests by five different language
groups. A host of factors may account for observed group differences: availability
of college education in their native countries, availability of scholarships or
private funds, geographical and sociocultural distance from English-speaking
societies and political entities, political and military relationships, or a combina¬
tion of these and other factors. Rather than speculate on these questions, let us
simply examine the pattern of variability as we did in Fig. 4-1. Here it is the NCT
that seems to afford the greatest discrimination across groups. The Structure test is
almost as good a discriminator, followed by the RFUPT and the Listening test,
both of which produce considerably less spread among groups.
Factor Analysis of the Four Tests. Means and standard deviations are given in
Table 4-1, followed by correlations in Table 4-2 and the principal component
factor solution in Table 4-3. The correlation coefficients reveal that the NCT
accounts for the greatest amount of the total variance in all the tests. This can be
seen from the fact that the correlation of the NCT with each other test is always the
highest correlation in the matrix (Table 4-2). Also, in the factor analysis, the NCT
produces the strongest loading on g, followed closely by the RFUPT. However, the
loadings of Listening and Structure on g are also quite strong. Indeed, the g factor
accounts for over 82% of the variance in all four tests. The remaining variance is
too small to be subjected to further analysis (the eigenvalue of the second component
extracted is less than 1). The nearly identical factor loadings of the NCT and
RFUPT are noteworthy inasmuch as the formats and reference populations for the
two tests were radically different. This seems to indicate that common mental
processing strategies are required by all four of the ESL tests examined here in
spite of the fact that they appear to be quite different at the surface. An important
task, then, will be to characterize these common strategies as demonstrated by
speakers of English as a second language.

Table 4-3 Factor Matrix Based on Principal Component Analysis*

Test          g     h²
Structure    .90   .81
Listening    .87   .76
RFUPT        .92   .85
NCT          .93   .86

*Values for the second factor are not reported because its eigenvalue is less than 1.
Score pattern differences among the different subgroups may be largely due
to unreliabilities or to sampling “biases” in the tests—the number of items, the
effect of random guessing, the register of test items, accent of the speakers on the
listening test, and so on. For this reason alone, multiple measures of ESL
proficiency may be defended by arguing that their use will tend to cancel out or nullify
such biases. However, there is a large element of luck here in selecting the
measures to be used unless serious research is conducted to justify the choices and
to deliberately minimize the unreliability and biases.
Note
1. The NCT is being prepared for publication. Interested parties may write the author at
Department of Special Education, Southern Illinois University, Carbondale, Ill. 62901, for more
information.
Part I Discussion Questions
3. If the two-factor solution for the various tests used by Oller and Hinofotis in
their second study (see Table 1-7) were correct, how would we explain the
three-factor solution in their third study (see Table 1-11) or the four-factor
solution of Scholz et al. (Table 2-4)? In other words, if the separable
variances were reliable, why would they not tend to sort out similarly on
different occasions if the same type of analysis is applied?
Therefore, we should seek a solution that is consistent (i.e., one that contains
no self-contradictions), exhaustive (i.e., one that explains all the data that
need to be explained), and simple (i.e., one that is as uncomplicated as
possible).
8. When significant correlations are found between all the various language
tests and the so-called nonverbal measure of intelligence (see Table 3-2,
where the correlations in question range from .61 to .84), how is it that the
validity of the language tests is questioned rather than the validity of the so-
called nonverbal intelligence test? Fully 71% of the variance in the McGraw-
Hill reading test is also present in the Raven’s Progressive Matrices. No less
than 37% of the variance in the so-called nonverbal IQ test is common to
every one of the language measures used (the TOEFL, Perkins-Yorio, and
the cloze). Further, this is true for a group of nonnative speakers of English.
How then is it possible to reason that the Raven’s test is pure while only the
McGraw-Hill reading test is confounded by an extraneous variable, namely,
nonverbal intelligence? Would it not be just as reasonable to assume that the
Raven’s is impure, i.e., actually a measure of language proficiency? Indeed,
is the case for the latter claim not stronger than that for the former? Notice,
for instance, that the correlations between all the language tests are as we
might expect; it is the relationship of all of them to the nonverbal IQ test
which demands explanation.
9. Note also that the correlation between the McGraw-Hill reading test and the
nonverbal IQ test is equal to the correlation between the Perkins-Yorio and
the TOEFL (Table 3-2). Further, these are the highest correlations in the
entire matrix (nine points higher than the next in line). The Perkins-Yorio
correlation with TOEFL would normally be interpreted to mean that they
are measuring the same thing (namely, ESL proficiency). Is it fair then to
take a completely different approach to the explanation of the correlation of
the McGraw-Hill with the Raven's?
10. Compare the results of Flahive in Chap. 3 with those of Stump (in Oller and
Perkins, 1978). Stump's results showed that a widely used, group-administered
IQ test (containing both verbal and nonverbal sections) generated little
(if any) reliable variance that was not also present in cloze and dictation
tasks. Stump used native speakers of English at the 4th and 7th grades
(about 100 subjects at each level). How do these results square with
Flahive's? Moreover, what does the comparison imply for the relation
between first and second language learning? (Stump used natives, Flahive
nonnatives.) See also Part VI in this volume. The chapters there rely on task
performance of natives and nonnatives as viewed through types of errors,
patterns of difficulty of various structures, and the like.
11. If the relationship among skills were perfectly similar across levels at CESL
(Southern Illinois University), and across language groups as displayed in
the profile analyses of Hisama, what would the displays look like?* Are there
any reasons not to expect skills to be quite similar across language groups or
levels at CESL? What factors would be expected to contribute to differ¬
ences? What factors might contribute to the leveling of differences?
12. Compare the correlation matrix of Hisama in Table 4-2 with that of Flahive
in Table 3-2. If the labels were not given for column and row headings, how
would one know that different tests had been employed to derive the two
tables? In other words, is the pattern of test relationships sufficient to show
that a nonverbal IQ test was used by Flahive but not by Hisama? If the reader
does compute the principal components solution for the data in Table 3-2
(see Question 7), the factor solutions may be compared.
*Actually, there is more than one possibility, but a simple theoretical ideal would be for them to fall
into a pattern of perfectly straight parallel lines. However, owing to practical imperfections in tests and
test performances, we should expect only approximately straight and approximately parallel lines.
Curvilinear solutions are theoretically possible but do not seem likely on the basis of Hisama’s data.
Part II
Listening Competence:
A Prerequisite to Communication1
the very lesson format which they propose, seem to stress the need for immediate
oral production, beginning with tightly controlled mechanical drills:
The focus of the lesson is on grammar and structural pattern drills; we are not
concerned here with the teaching of pronunciation, listening comprehension, reading or
writing. It may well be argued that it is impossible to learn only discrete items in a
language lesson and that a lesson of grammar also involves elements of pronunciation
and listening comprehension. This is undoubtedly so, and there is always incidental
learning taking place. The teacher, however, in making up his plans, must focus on
teaching one thing at a time so that he can concentrate on reaching a predetermined
degree of student proficiency, know what to correct (and just as important, what not to
correct), and aid the students with prepared explanations (p. 23).
It [the cognitive-code version] should be a four-skills approach, but not in the manner of
audio-lingual habit theory. The four skills should be practiced simultaneously after the
presentation of explicit grammatical rules. Practice of all skills—but practice based
upon study and analysis—is a prime objective (p. 132).
the evidence from natural learning suggests that manifest speech is largely secondary.
That is, as long as the learner orients to speech, interprets it and learns the form or
arrangement that represents the meaning, he learns speech as fast as someone speaking
[i.e., someone practicing speech] (p. 339).
down the paper, and sit on the chair!” (Asher, 1969, p. 5). Students imitate the
physical execution of the commands as demonstrated by the instructor. Asher
found that students working under his method achieved a high degree of
“listening fluency” which transferred directly to speaking as well as reading and writing.
In one test, Asher demonstrated that when students were required to do both
listening and speaking, their comprehension of the target language was decreased
(Asher, 1969, p. 13).
Asher (1974) reports on a Spanish teaching experiment with twenty-seven
college students who had no prior knowledge of Spanish. This group was divided
in half, and each section met with an instructor for three hours one evening per
week for two semesters. The students followed Asher's method by simply sitting
around the instructor and responding physically to his commands. As each student
felt able, he or she would volunteer to respond physically (not verbally) without
the instructor's modeling of the required bodily movements. Students progressed
from one-word commands to commands like, “When Henry runs to the blackboard
and draws a funny picture of Molly, Molly will throw her purse at Henry” (Asher,
1974, p. 27). After ten hours of instruction they were “invited but not pressured”
to change roles with the instructor and give commands for the others. Still later the
students produced skits and worked on role-playing situations. Reading and
writing were not formally dealt with. If the students requested it, the instructor might
write a new vocabulary item on the board at the end of the class, but this casual
procedure required only a few minutes of class time.
After 45 hours of instruction, of which 70 percent was spent on listening
comprehension, 20 percent on speaking, and 10 percent on reading and writing
tasks (with no homework assignments at all!), the experimental group was tested
against the three control groups. One group consisted of high school students who
had taken one year of Spanish, a second group consisted of college students just
finishing their first semester of Spanish, and a third group was made up of college
students who were just completing their second semester of Spanish. Measured
against the group of high school students with approximately 200 hours of class
work on a test of listening and reading comprehension, the experimental group
with only 45 hours of training had a mean score of 16.63, while the high school
group had a mean score of 14.63. On similar tests, the experimental group also
scored significantly higher on the listening comprehension task than either of the
college control groups (Asher, 1974, p. 28).
At the end of 90 hours of instruction using Asher’s method, the experimental
group took the Pimsleur Spanish Proficiency Tests, Form C. This test was
designed for students who had completed 150 hours of intensive audio-lingual
training. The experimental group performed beyond the 50th percentile in most
skills (Asher, 1974, p. 30).
In evaluating this study, Asher notes that perhaps his most important finding
was the extent to which listening comprehension transferred to other skills. On the
success of his method he writes:
When language input is organized to synchronize with the student’s body movement,
the second language can be internalized in chunks rather than word by word. The
chunking phenomenon means more rapid assimilation of a cognitive map about the
linguistic code of the target language (pp. 30-31).
The findings of both Asher and Postovsky challenge the first hypothesis
mentioned at the beginning of the chapter, which claims that speaking performance
must be emphasized in the initial phases of language instruction. Indeed
Postovsky’s work demonstrates that an immediate emphasis on speaking may even
hinder the learner’s capacity to process (decode) second language data. The
second hypothesis, which states that language learning is an integrative process
and that all language skills can therefore be introduced simultaneously with each
skill reinforcing the others, must also be questioned since both studies presented
skills sequentially, with listening skills preceding speaking skills.
Postovsky's somewhat modest findings show an improvement in all skills
when speech is delayed four weeks in what is otherwise a fairly traditional
behaviorist approach to language learning. The strengthening of listening skills
definitely seemed to benefit his students.
It is Asher’s findings, however, that provide the more dramatic evidence. By
allowing processing prior to speech, the subjects in Asher’s experimental group
were able to develop a listening competence more quickly than with traditional
classroom methods. In this approach, Asher delays speaking only ten instructional
hours. At this point students are encouraged to assume the teacher’s speaking role
for the language which has been conceptualized and is ready for production. It is
also important to note that the Pimsleur test, which assesses the full range of
language skills, shows unusual competence in reading and writing skills even though
little direct instruction was given in these areas.
It would seem that both these studies of foreign or second language learning
support the third hypothesis—namely, that language learning is a highly organized
integrative process initially requiring contextual decoding of the meanings of new
utterances before meaningful and creative encoding can take place.
Support for this hypothesis can also be found in two additional studies using
subjects from radically different backgrounds. The first project, by Naiman
(1974), employed 112 first and second graders in a Canadian bilingual school. He
compared a comprehension task (translating from the target language to the native
language) with a production task, and elicited oral imitation in the target language.
In all five syntactic structures used in the experiment, performance on the
comprehension task exceeded performance on the imitation task (as cited by
Swain, Dumas, and Naiman, 1974), indicating that comprehension may well be
prerequisite to production skills.
The second study (Sticht, 1972), used 96 men in the U.S. Army. In Sticht’s
experiment, learning strategies necessary to reading ability were also shown to be
necessary to listening comprehension. He used a test consisting of three brief
prose passages at the 6.5, 7.5, and 14.5 grade levels as judged by the Flesch
readability scale. The passages were presented alternately as listening and reading tests
to 40 men classified on the basis of an IQ test as having Low Mental Aptitude
(LMA) and to 56 men grouped as having Average Mental Aptitude (AMA) (Sticht,
1972, p. 287). The results for both groups showed a high interrelatedness between
listening scores and reading scores. The most striking contrast was at the 7.5 level
among the LMA subjects. At this level, mean reading scores were in the 52nd
percentile. Although the differences at the other levels were smaller, the mean
listening scores were, with only one exception, equal to or greater than the mean
reading scores (Sticht, 1972, p. 288). This supported Sticht’s hypothesis that
“developmentally, skill in learning by listening precedes and actually forms the
basis for the acquisition of skill in learning by reading” (p. 286).
On the basis of all the foregoing studies, we can assert with confidence that
the development of listening ability demands the integration of phonological,
grammatical, and lexical data into a relatively unitary competence or expectancy
grammar (Oller, 1974). Listening skill therefore seems to be an essential
prerequisite to oral communication and appears to be tightly integrated with the other
traditionally recognized language skills, both receptive and productive.
A final source of evidence for the hypothesis which we are positing is the
evidence for the statistical correlation between listening comprehension tasks and
tests requiring speaking, reading, writing, and specific grammatical decisions. For
evidence of this sort, the reader is referred especially to the earlier chapters of this
volume.
To cite a few specific instances, Irvine, Atai, and Oller (1975) found that the
Listening Comprehension subtest of the TOEFL and scores on a cloze test were
correlated at .69; Listening Comprehension was correlated with dictation at .76
for a group of 159 Iranians. The correlation between the cloze and the dictation
scores was .75. These three correlations were higher than any other correlations
among the TOEFL subtests. A factor analysis over the same data reported by Oller
and Hinofotis (Chap. 1) showed that all the reliable variance in all of the TOEFL
subtests plus the cloze and dictation was accounted for by a single factor (substan¬
tially common to all the tests). See Table 1-1.
Further evidence comes from Johansson (1973b), who also found a high
correlation between a multiple-choice listening comprehension task and
dictation. In a test administered to 26 students at Lund University in Sweden,
Johansson found a correlation of .83 between listening comprehension and dictation.
This correlation was higher than any of the other correlations between the subtests
(grammar, vocabulary, and dictation with noise, pp. 108-109).*

*Editors' note: also relevant are the results of Scholz et al. (Chap. 2), especially the tests numbered 1 to
5 (Table 2-1). Furthermore, a factor analysis of the 16 separate subscores on listening tasks from the
Carbondale project (Oller, 1979, Appendix) revealed a single general component accounting for 40%
of the variance in all the tasks. Since the estimated reliabilities on the same tasks averaged about
√.40 (or .63), it is concluded that the single-factor solution explains nearly all the reliable variance in
all the tasks.
Two important findings seem to emerge from these studies. First, when
listening practice precedes work on oral skills, the development of an appropriate
expectancy grammar seems to be enhanced. Our personal preference is a cyclical
learning model in which comprehension skills precede production skills in small
learning cycles—as in the Asher approach, for instance. This is not to suggest,
however, that learning can take place only through acoustic channels. In fact, in
the studies of Postovsky and Asher, learning was apparently enhanced by physical
response—either bodily action or writing. Studies with deaf children also show
that hearing can be bypassed altogether if other sensory receptors are involved.
However, we suggest that deliberately bypassing listening, or failing to give
adequate practice in listening, is probably less efficient and is likely to be a frustrating
way to teach a language.
The second important finding is the reassertion of listening comprehension
as an integrative or global skill which, as its name implies, entails comprehension
or conceptualization and organization of new language data. Only when this
“comprehending” process is functional can the learner begin to manipulate his new
language in meaningful and creative ways. The same process expands the expectancy
grammar in the new language and thus affects development in all areas of language
learning.
Our concern in this chapter has been with the early stages of language
learning. Most teachers would accept the proposition that listening comprehension is an
important aspect of language learning not only in the beginning but in all stages of
development. Since it is an important skill, testing procedures may be weighted
toward aural proficiency. However, teaching listening comprehension, particu¬
larly to beginners, is a process that is not well conceived in most curricula. The
tendency is to rely on pattern drills, contrived dialogues, or grammatical
presentations in the textbook and to simply assume that comprehension will follow. Often it
doesn't.
There are two areas of failure in these early stages. First, when beginners are
asked to use words and structures too soon, they are forced to say what they do not
know how to process in the target language, because the target language vocabulary
and structures are incomplete in their developing grammatical systems.
Second, and perhaps this is the most critical factor, the context provided is often
insufficient, and the beginning student cannot possibly succeed in conceptualizing
the sense of utterances of the new language. In many of the approaches which
we have been arguing against, the conceptualization of the relation between
utterance and extralinguistic context seems to be thought of as a kind of vague eventual
goal, perhaps the ultimate product of the learning cycle. In our view it becomes the
crucial prerequisite—the best foundation upon which the language learning
process can be established.
Note
1. Editor s note: this paper has appeared in a slightly different form in the Modern Language Journal
62, March 1978, pp. 85-90. It is reprinted here by permission.
Chapter 6
Frank Bacheller
The Goal of the Language Learner Is Communication. Anyone who has been
in a situation where he had to use a language other than his native language for
everyday purposes probably found that when speaking, emphasis was placed on
getting ideas across more than on grammatical details, and also when listening, the
focus was on the overall meaning rather than on trying to catch every single sound,
morpheme, or word. Whether listening, speaking, reading, or writing, the problem
is relating the surface forms of utterances to extralinguistic context in systematic
ways.
Richards (1971) points out that, when placed in a situation where he must
communicate, the learner controls the language to suit his intentions. For
example, he may simplify the syntax or find a circumlocution to express
something he hasn't learned to say in a more idiomatic way.
How well a learner communicates depends on his ability to supply enough
information in the linguistic surface form to enable an audience to relate that form
listen, the second time to write, and the third time to make corrections (see the
complete directions in the Appendix). The second time, pauses were inserted
between phrases to allow time for writing. Marks of punctuation were given on the
second pass. For purposes of grading with the SCE, each segment between pauses
(marked by slashes in the Appendix) was considered to be an “item.”
This resulted in 33 items. Each item was then graded on the SCE and total
scores for each student on each passage were correlated with scores obtained by
grading the same dictations in the conventional way (one point for each correct
word in the sequence dictated). Severity scores were also correlated with scores
obtained on other tasks included in the CESL testing project.
In addition to the above, the number of spelling errors made per item by each
student was correlated with other scores. It was decided not to penalize for
spelling errors because (1) spelling errors don't show up in oral production, (2) native
speakers often make spelling errors, and (3) other researchers (notably Johansson,
1973b, Whitaker, 1976, and Oller, 1979) recommend not counting spelling
errors. Oller (1979) cites empirical evidence from a study at UCLA showing that
spelling errors in dictations tend to be uncorrelated or even negatively correlated
with conventional scores on the same dictations, as well as with a variety of other
language processing tasks. Examples of mere spelling errors are vaccene (vaccine),
developped (developed), and includ (include).
Table 6-3 Sample Errors as Classified on the SCE (Original Dictated Segments
in Italics)

Level 1
could be a contribution of great merit—could be a coatripution of married
spilled hot coffee at breakfast—spell hot coffee breakfast
for so many children to suffer—four seventy children to supper, for some children to suppert
tripped and fell down the stairs—trip and down stairs, to downstairs
We learn by what we reject—we learn by reject, we learn both we jack
as well as by what we accept—as well as my accept
Now there is no longer any need—now there is no longer
on the nonappreciation of literature—a leadership, on the appreciate
All in all, it was a bad day to get out of bed—all of the day a bad day is get up
it doesn't pay to get up—it doesn't paid get up, it doesn't lake get up

Level 2
I have always believed—I never always believed, I never believe youth
One day I woke up—Monday, I will come
Some days—Sunday, sun days
All in all, it was a bad day to get out of bed—all of them was bad days, all of us have one bad
day when we got up
as well as by what we accept—as well as what I decide to accept
spilled hot coffee at breakfast—spilled up hot coffee and breakfast, for breakfast we have a
cup of coffee
that every college curriculum—that every college
it doesn't pay to get up—be doesn't try to get up
the student would decide—the students decided
it used to strike—I used to strike

Level 3
tripped and fell down the stairs—tup and fell down stair
about four million children a year—about four million child a year
All in all, it was a bad day to get out of bed—all of it was a bad day to get up
in 1963 a vaccine was developed—in 1963 a vaccine is developed
the student would decide—the student will decide
One day I woke up—one day I wake up
Now there is no longer any need—there is know longer any need
and the sun was shining—and the sun shined

Level 4
It used to strike—it use to strike
I have always believed—i always believed
Birds were singing—beards were singing
that every college curriculum—that every coleege corecloum
An item analysis supported scale reliability and validity by showing that 23 of the 33 items
correlated with the total severity score (a kind of discrimination index) at .60 or
above, and of those 23 items, 11 correlated at .70 or above. The highest correlation
achieved was .77, and the lowest was .31. The mean was .62.
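The item-total correlation used here as a discrimination index is straightforward
to compute. A minimal sketch in Python with numpy follows; the score matrix is a
hypothetical stand-in, with one row per examinee and one column per item.

    import numpy as np

    def item_total_correlations(item_scores):
        # Correlate each item's scores with the total score over all items.
        total = item_scores.sum(axis=1)
        return np.array([np.corrcoef(item_scores[:, j], total)[0, 1]
                         for j in range(item_scores.shape[1])])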
All of the foregoing supports the notion that the more proficient a learner is,
the less severe his errors are and the higher his SCE scores. This corroborates
results obtained by Olssen (1973), who found that a more proficient group of
learners made fewer severe errors than did a less proficient group.
Since the degree of error severity is strongly correlated with proficiency,
SCE is shown to be a sensitive index of the overall communicative effectiveness of
second language learners. Further evidence of this sensitivity is found by looking
at how the scale actually judges learner errors and seeing if those judgments agree
with native speaker intuitions. Table 6-3 gives samples of actual errors as
classified on the scale.
Correlations between spelling errors, dictation, and other global proficiency
scores showed no consistent relationship. For the most part they were very low, or
even negative in some cases. The mean correlation was .08, and the range was from
-.02 to .26. These correlations indicate that spelling errors are unrelated to
overall ESL proficiency. Thus, as others have recommended, it seems best to avoid
counting spelling errors when scoring dictations in either the conventional way or
with the SCE.
In conclusion, this research suggests that the SCE is a reliable measure of language proficiency; that severity of errors is inversely related to the degree of proficiency or communicative effectiveness; and that the SCE is a valid index of the latter. An important implication is that subjective scoring methods may be just about as reliable as more quantitative objective methods. This may be true not only in scoring dictations but possibly also for other pragmatic tasks such as elicited imitation and composition. It seems that it would be a relatively straightforward matter to generalize the SCE (with modifications) to many other pragmatic tasks.
An implication for teaching is that languages might be better learned if
students were systematically and consistently pointed to the meaning of what is
being said. This would rule out, it seems, much discrete point teaching that emphasizes surface details while casually forgetting about meaning.
Part II Discussion Questions
1. What sequence of skills is implied or possibly overtly planned into the materials you use as a teacher or those you may have been exposed to as a student?
Is a particular skill or modality of processing afforded a privileged status?
2. Consider the relationship that Benson and Hjelt posit in Chap. 5 between
listening and reading. Would you expect good readers to be good listeners on
the whole and poor readers to be poor listeners? What about the reverse? In
other words, should good listeners tend to be good readers, and poor ones
poor readers? Why so? Does improvement of skill in reading necessarily
carry over into listening and vice versa? Reflect on Sticht’s findings. Better
yet, read the original study for yourself and reevaluate the conclusions of
Benson and Hjelt.
3. In what ways might practicing speaking too early tend to reduce the efficiency of learning? Consider the learner's forming of an utterance as a step-by-step process. What steps must be executed in speaking that are not necessary to auditory decoding? What steps (if any) are necessary to listening with comprehension that may not be necessary to producing sensible utterances orally?
blackboard.” In what sense must the learner map the command onto the
stream of experience? Consider the delicate sequencing of motor activities
in relation to the syntactic ordering of elements within the utterance offered
by the instructor.
7. Do you agree with the stress Bacheller (Chap. 6) places on the global communicative effect of utterances rather than the bits and pieces that go to make up their surface form? Reflect on your own experience while watching television or listening to the news on the radio or simply having a conversation. Do you attend to the morphemes? How many times, for instance, did the plural morpheme occur in Question 6? What does all this imply for teaching procedures that deliberately exclude meaning while forcing the learner to attend meticulously to surface form? Can you think of ways of getting students to attend to surface form without losing cognizance of overall
communicative intention and effects? Can you conceive of methods of
teaching languages that require the learner to manipulate surface forms from
the phoneme to full-fledged discourse without ever disregarding meaning?
(What we are suggesting here cannot be done, incidentally, with the
usual pattern drill methods, but it can be and is done with a variety of other
methods.)
A principal question dealt with extensively in the three chapters in this part
is whether or not speaking proficiency can be broken down into smaller
components or separable contributing characteristics. Can judges reliably differentiate pronunciation, fluency, vocabulary, grammar, pleasantness, nativeness, intelligibility, or comprehension as aspects of speaking skill? Can they reliably distinguish some subset of these possible dimensions of
speech? What sorts of oral testing procedures seem to provide the greatest
yield of individual differences (variance) among learners at various levels of
development? How reliable are the subjective evaluations of interviews,
speech samples, or other oral production performances? Are naive judges as
good as trained judges? Do they use different criteria? How good are judges
of speech samples at guessing the native language background of nonnative
speakers of English? Does knowledge of the language supposed to be the
source of the accent help the judge to guess the source language correctly?
These are just some of the important questions dealt with.
Chapter 7
Oral Proficiency Testing
Several oral testing techniques are investigated. A variant of the FSI oral
interview technique is compared against three pragmatic testing methods
that are somewhat simpler to apply. All the oral testing techniques studied
are correlated with the CESL Placement battery. Reliability indices for the
various subscales on the FSI Oral Interview are reported. Scores on the
various oral testing techniques (twenty-seven separate scores in all) are
factor analyzed to see if oral language proficiency can be broken down into
component parts. The results seem to support a single-factor solution. A multiple-factor alternative is examined but discarded as probably unreliable. In both factor solutions, the FSI scales of Accent, Grammar, Vocabulary, Fluency, and Comprehension all load on a common factor, suggesting that they are unitary. There is no evidence to support the claim that separate aspects of oral proficiency can be clearly distinguished.
The need for developing efficient valid methods of testing oral proficiency
has long been recognized. The most obvious approach to oral testing, and the one
presumed to be most valid, is the oral interview. An interview provides a very
direct method of challenging someone to speak; and it offers a realistic situation in
which to assess overall oral mastery of a particular language. Alternate methods, especially less direct approaches, have been criticized for not really putting genuine language ability to the test. However, this may be more the fault of the influence of the typical discrete point orientation of language testers in recent years than a real difficulty in devising feasible alternatives to oral interview.
The discrete point approach to language testing fails to provide either a contextualized sociolinguistic setting or a sufficiently detailed sample of the language.
As Spolsky et al. (1972) have indicated, discrete point tests may either be based on
a sample of some inventory of linguistic elements of a particular language, or be
derived from some notion of functional needs. However, if we look beyond the
level of words, the total number of elements in any language is not finite or well
defined; therefore, sampling techniques can scarcely be applied at all, and never
very systematically. Further, a selection based on some idea of functional necessity can never be quite certain that a specific element or set of elements is in fact really necessary. The natural redundancy of any language allows one to use it without having fully mastered every structure of the language. In fact, the creative and nonrepetitive nature of language systems in use makes it impossible for anyone to “know” all the structures in the sense of separate items of knowledge. Finally, neither of the commonly used methods of discrete point item selection seems likely to provide an adequate assessment of a person's ability to function in a realistic language setting.
To date, the most widely used technique for evaluating oral proficiency is the
Foreign Service Institute (FSI) Oral Interview (Spolsky et al., 1972, Wilds, 1975,
and Oiler, 1979). The FSI interview has high reliability, and because it takes place
in a more or less natural setting, it is believed to be a good determinant of a person’s
true language competence. Its major drawback is that it is time-consuming and
expensive to administer and score. Altogether, about a half hour is required on the
average for administering and scoring each interview (Wilds, 1975).
One alternative to oral interviewing would be to develop other pragmatic tests
of oral proficiency which (though they may be less direct than the oral interview)
are not tied to discrete point methods of test construction, scoring, and interpretation. It has been suggested that reduced redundancy tests, such as cloze tests, may
provide valid replacements for the oral interview (Clark, 1975). Oiler and
Hinofotis (Chap. 1) found correlations ranging from .51 to .62 between a cloze test
and the various subscales of the FSI oral interview. This encourages us to believe
that some type of cloze test (written or oral) might be refined to provide some of the
information available through the more expensive and time-consuming oral interview techniques.
This chapter evaluates four possible procedures for testing oral proficiency.
Three pragmatic speaking tasks, repetition (elicited imitation), oral cloze, and
reading aloud, are used along with a traditional FSI-type oral interview. The interview technique used here, however, does not fully meet FSI requirements—in particular, the teams of examiners did not include a trained linguist. Results show that this fact apparently did not reduce reliability, but for this reason we refer to the technique here as an FSI type of oral interview. The results of all four proficiency tests were intercorrelated and examined in relation to the factor results of
Scholz et al. (Chap. 2). The experimental tests were also correlated with the scores
of the Center for English as a Second Language (CESL) placement test consisting
of tests 1, 14, and 21 described in Chap. 2.
In addition to the factoring already reported in Chap. 2, further factoring was
done (both a principal component solution and a varimax rotation) to determine
which of the specifically oral tasks and which scores based on those tasks
generated the most meaningful variance in relation to the other oral tasks.
Experiment
Subjects. Seventy of the 182 students enrolled at all levels of proficiency at the
Center for English as a Second Language at Southern Illinois University at
Carbondale served as subjects. Since all 182 CESL students were invited to
participate, the ones who actually elected to do so were probably not completely
representative of the whole group. In fact, they tended to be students who were
more confident about their skill in English. Nevertheless, their overall level of
performance was not particularly high on the FSI scales. The mean for Accent was
1.68 (out of a possible 4 points); for Grammar it was 12.69 (possible, 36); for
Vocabulary it was 8.98 (possible, 24); for Fluency, 5.58 (12); and for Comprehension 10.38 (23). Thus on the scale in Appendix 7C this group would get a mean
proficiency rating of 1 + (based on a score of 39.31), and according to Appendix
7A, which gives rough verbal descriptions of the five proficiency levels, they
would be able to do little more than fulfill minimum travel needs.
Test Materials. The CESL Placement Test, comprising the three parts described in Chap. 2, was used for some of the correlations reported below. Subtests were aimed at Listening Comprehension, Structure, and Reading. The five proficiency levels of the FSI Oral Interview are defined roughly in Appendix 7A (taken from the Manual for Peace Corps Language Testers prepared by Educational Testing Service). The five subscales are similarly described in Appendix 7B. Finally, the recommended weighting and conversion tables for interpreting scores on the five scales to assign a proficiency level rating are given as Appendix 7C.
In cases where the subject had very little or no command of the language, the
interview time was about 15 minutes or even less, but normally the interviews
lasted between 20 and 30 minutes. Each exchange was tape-recorded and
subjects were rated by two interviewers. In the weightings assigned to the various
subscales, Grammar received the heaviest weight followed by Vocabulary, Comprehension, Fluency, and Accent, respectively, as shown in Appendix 7C.
Interviewers. Prospective FSI interviewers normally must participate in a training program which for Peace Corps Language Testers required four to five days (ETS, 1970). On the surface, the FSI Oral Interview appears to be a normal, everyday conversation, but it is supposed to be “a specialized procedure which uses the relatively brief testing period to explore many aspects of the student’s language competence in order to place him in one of the categories (levels) described” (ETS, 1970, p. 11). In order to reasonably duplicate the FSI requirements, two teams of two interviewers each were made up from a pool of one
instructor and three graduate assistants at CESL. Before collection of the speech
data, all interviewers read the Manual for Peace Corps Language Testers (ETS,
1970) and listened to the fifteen training tapes of sample FSI Oral Interviews
provided with the manual. A list of possible interview topics was given to each
team (see Appendix 7D).
Pragmatic Speaking Tests. All three of the pragmatic tests, repetition, oral
cloze, and reading aloud, contained texts intended to range in difficulty from easy
to difficult. All passages were approximately seventy words in length. Each easy text was selected from a 4th to 5th grade reader, the intermediate texts were taken from a junior high school reader, and the difficult passages were excerpted from a 12th grade reader. The texts of all three pragmatic tasks are given as entries 11, 12, and 13 in the Appendix at the end of this volume.
Procedure. Subjects were interviewed within a six-week period in the spring of 1977. Oral interviews were scheduled during the subjects’ free time. During the experiment all interviews were tape-recorded. To establish interrater reliability, the first twelve subjects scored by each interviewing team were rated again by the other team, producing twenty-four interviews rated by both teams.
The pragmatic speaking tests (repetition, oral cloze, and reading aloud) were administered at the CESL/SIU Language Laboratory during the same six-week period as the oral interviews. (For detailed instructions provided to the subjects, see the Appendix, entries 11, 12, and 13.) For the pragmatic tasks each subject was individually tape-recorded. Actual testing time was approximately 40 minutes, with an additional 15 minutes for seating and laboratory adjustments.
Scoring. The oral interviews were scored by the usual FSI procedures. Both the repetition and oral cloze tests were scored by exact and acceptable word methods. The reading aloud tests were scored in three ways: a first score was arrived at by counting exact word renditions (one point for each word); a second score was derived by counting all the appropriate words that were included in subjects’ responses but not in the original text; and a third score was simply the amount of time (in seconds) required for each reading. Gross mispronunciations were counted as errors in all cases. To obtain an overall score for the reading aloud tasks, the number of correct words in each passage (exact plus acceptable) was divided by the number of seconds required to complete each passage. This was the score used by Scholz et al. in their factor analyses (see Tables 2-1 and 2-4).
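As an illustration, the three Reading Aloud scores and the overall score just defined can be written out as a short computation. This is only a sketch under stated assumptions: in the study the word matching was done by human raters, whereas the naive multiset overlap and the function name below are ours.

    from collections import Counter

    def score_reading_aloud(original, rendition, seconds):
        """Sketch of the Reading Aloud scores described above: exact
        words, acceptable extra words, time, and the overall score
        (exact plus acceptable, divided by seconds)."""
        orig = Counter(original.lower().split())
        rend = Counter(rendition.lower().split())
        exact = sum(min(orig[w], n) for w, n in rend.items())
        # In the study, "acceptable" additions were judged by raters;
        # here every extra word is (naively) counted as acceptable.
        acceptable = sum((rend - orig).values())
        overall = (exact + acceptable) / seconds
        return exact, acceptable, seconds, overall

    print(score_reading_aloud("and the sun was shining",
                              "and the sun was shining brightly", 2.5))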
As Table 7-1 indicates, the correlation between the two interviewing teams for the
overall interview score was .90. The only low correlation between the ratings of
the teams was for the Accent Subscale (.43). However, the high intercorrelations
for the other scales show that the two interview teams agreed substantially in their
assessments of oral proficiency.
Table 7-2 indicates that the pragmatic task which correlates most highly with
the overall FSI score is Repetition (.70); Oral Cloze follows (.63), and then the
overall Reading Aloud score (.51). Each of these correlations is significant at the
Table 7-1 Interrater Reliabilities for the Subscales of the FSI Oral Interview (N = 24)

Measure                            Group 1 x Group 2 correlation
1 Oral Interview—Accent                       .43
2 Oral Interview—Grammar                      .83
3 Oral Interview—Vocabulary                   .85
4 Oral Interview—Fluency                      .80
5 Oral Interview—Comprehension                .89
6 Oral Interview—Total Score                  .91
7 Converted Total Score                       .85
8 Overall Scores                              .90
.001 level. In terms of common variance, the Repetition task accounts for 49% of
the variance in the overall FSI rating; the Oral Cloze has a common variance of
40% with the overall FSI score, and the Reading Aloud 26%.
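In other words, the common-variance figures are simply the squared correlations:

\[
r^2_{\text{Repetition}} = .70^2 \approx .49, \qquad
r^2_{\text{Oral Cloze}} = .63^2 \approx .40, \qquad
r^2_{\text{Reading Aloud}} = .51^2 \approx .26 .
\]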
These facts can probably best be accounted for on the basis of the respective
reliabilities of the various measures. Repetition appears to be a promising measure
of oral proficiency. It correlates significantly (p < .001) with all the subscales of
the Oral Interview, and with the other pragmatic tasks. In addition, Repetition may
be useful as a diagnostic test. Subjects tended to make consistent grammatical
errors which could provide invaluable diagnostic data in defining the learner's
interlanguage system (cf. Selinker, 1972). Oral Cloze also looks promising.
However, we believe its utility could possibly be enhanced by altering the format
to require phrases expressing the next idea in the text rather than single words as in the
written cloze formats. Reading Aloud is the least effective measure, but this may
Table 7-2 Correlations between the Subscores on the Four Oral Proficiency Tests (N = 62 to 70)*

[Only one row of the correlation matrix is recoverable from the source:]
2 Oral Interview—Grammar  1.00 .89 .23 .53 .81 .44 .32 .06
be due to the scoring method used. We return to the latter problem below in the
discussion of the factor analyses in Tables 7-4 and 7-5. There is a hint in Table 7-2
that the Reading Aloud task measures Fluency (r=.91), Comprehension (r=.79),
and to a lesser extent Vocabulary (r=.44) more than whatever is measured by the
other scales. However, as we will see below, this interpretation does not fit the
factor results.
Table 7-3 shows the relationship between the oral proficiency tests and the
CESL Placement battery. The CESL battery consistently correlates best with the
Oral Interview Grammar and Vocabulary scores. They share a maximum common
variance of about 28%. On the whole, the Oral Interview is roughly equivalent to
Repetition and Oral Cloze as a predictor of the CESL Placement battery.
The factor analysis results reported in Table 7-4 are excerpted from Oiler
(1979, Appendix Table 5). Here it should be remembered that only the FSI scales
are really full-blown testing procedures. The other scores included are actually
subscores from the Repetition, Oral Cloze, and Reading Aloud tasks defined
above. Nevertheless, in spite of the reduced reliability that should be expected
when scores are based on such shortened subtests, the loadings on a common
principal factor (g) are nearly all significant at the .01 level. Only five of the subscores included fail to load significantly on g. At least one of the subscores in each set accounts for .45 or more of the variance in the common factor.
The fact that the Acceptable word scores for Repetition tasks A, B, and C load
inconsistently (as well as lightly or insignificantly) on g is easily understood. If
subjects rendered the texts correctly by the exact word criterion (i.e., a verbatim
repetition) there was little likelihood of their getting any additional points for
Acceptable words not in the original text. The same explanation applies even more dramatically to the Reading Aloud Acceptable scores, which all tend to load negatively on g. This is due to the fact that the more the subject tends to render the texts
[The body of Table 7-4 is not recoverable from the source; only its eigenvalue and footnote survive.]
Eigenvalue = 9.39
*Loadings above .33 are significant at p < .01, those above .25 at p < .05; in this case and in Table 7-5, the deletion of missing cases was listwise (Nie et al., 1975), and therefore the subject population was the same for each and every test score.
exactly, the less opportunity they have for creative innovations that would gain
points under the Acceptable word scoring method. The Oral Cloze tasks, however,
were sufficiently challenging that Acceptable scores account for about as much of the variance in g as do the Exact scores. As would be expected, the Reading Aloud Time scores are negatively related to the global oral proficiency factor. The more proficient subjects required less time; hence the longer the time the lower the proficiency.
Moreover, by studying the loadings on g for the Reading Aloud scores, it is
apparent that by including the Acceptable scores in the computation of the overall Reading Aloud scores, we necessarily depressed the effectiveness of Reading
[Displaced factor-table fragment, apparently a loading and its square for each FSI score:]
Oral Interview—Accent          .48  .23
Oral Interview—Grammar         .89  .79
Oral Interview—Vocabulary      .90  .81
Oral Interview—Fluency         .83  .69
Oral Interview—Comprehension   .87  .76
FSI Oral Proficiency Level     .92  .85
Aloud as a measure of global proficiency. This fact probably explains (at least in
part) the results obtained in Tables 7-2 and 7-3 where the Reading Aloud overall
score (defined as the quantity, Exact plus Acceptable score, divided by Time)
performed less well than Repetition and Oral Cloze in relation to the FSI and
CESL subscores. All the loadings on g in Table 2-1 of Repetition, Oral Cloze, and Reading Aloud could probably be improved by taking advantage of what can be learned from Table 7-4 concerning optimum scoring methods.
In Table 7-5 a varimax rotated solution is presented over the same data.
While the single-factor solution accounts for only 35% of the total variance in all the scores and the rotated solution (with seven significant factors) accounts for an additional 21% (for a total of 56% of the variance), the rotated solution is hardly parsimonious in interpretation.
All the FSI scores load on Factor 1 (Table 7-5). We might have expected the
Fluency scale to load on a factor common to the Time scores for the Reading Aloud
tasks. However, the Time scores all load by themselves on Factor 3. The Repetition scores scatter their variances over five of the seven orthogonal (i.e., uncorrelated) factors. In particular, the Exact scores tend to load on Factor 2 along with
the Acceptable scores for the Oral Cloze tasks, while the Acceptable scores for the
Repetition tasks fall out on Factor 6. These clusters of loadings are probably due
more to unreliabilities in the tasks (probably because of their brevity) than to
reliable differences in the nature of the processing skills exercised.
Clearly, there is no evidence that the several FSI scales are measuring different skills or components of oral proficiency. Further, in view of the fact that the loadings on the varimax rotated factors are scarcely any higher in Table 7-5 than in Table 2-1 (which offers a single-factor solution for an even wider range of tasks), it is assumed that the FSI scales are unitary and that they measure the same basic skill that underlies performance on the three oral pragmatic tasks (especially Repetition scored by the Exact word method, Oral Cloze scored by the appropriate word method, and Reading Aloud scored simply in terms of the number of seconds it takes subjects to complete the task).
Finally, in view of the estimated reliabilities of the various scores, it seems
reasonable to conclude that the unitary factor solution presented in Table 7-4
explains the bulk of the reliable variance in all the tasks considered and that the
additional variance accounted for in the multiple-factor solution shown in Table
7-5 is indeed unreliable variance. The latter claim will require further empirical
substantiation, but in the meantime it seems safe to say that Repetition, Oral
Cloze, and Reading Aloud all offer some promise as substitutes for oral interview
testing.
Appendix 7A
The Five Levels of Overall Proficiency
of the FSI Oral Interview
Level 1
Able to satisfy routine travel needs and minimum courtesy requirements. The
student can answer questions on topics very familiar to him; within the scope of his
very limited language experience he can understand simple questions and
statements, allowing for slowed speech, repetition, or paraphrase. His speaking
vocabulary is inadequate to express anything but the most elementary needs;
errors in pronunciation and grammar are frequent, but he can be understood by a
native speaker used to dealing with foreigners attempting to speak his language.
Level 2
Able to satisfy routine social demands and limited work requirements. The
student can handle with confidence but not with facility most social situations
Level 3
Able to speak the language with sufficient structural accuracy and vocabu¬
lary to participate effectively in most formal and informal conversations on
practical, social, and professional topics. The student can discuss particular
interests and special fields of competence with reasonable ease. His comprehension is quite complete for a normal rate of speech. His vocabulary is broad enough
that he rarely has to grope for a word. His accent may be obviously foreign. His
control of grammar is good, and his errors never interfere with understanding and
rarely disturb the native speaker.
Level 4
Able to use the language fluently and accurately on all levels normally pertinent to professional needs. The student can understand and participate in any
conversation within the range of his experience with a high degree of fluency and
precision of vocabulary. He would rarely be taken for a native speaker, but he can
respond appropriately even in unfamiliar situations. His errors of pronunciation
and grammar are quite rare, and he can handle informal interpreting from and into
the language.
Level 5
Speaking proficiency equivalent to that of an educated native speaker. The
student has complete fluency in the language such that his speech on all levels is
fully accepted by educated native speakers in all its features, including breadth of
vocabulary and idiom, colloquialisms, and pertinent cultural references.
Appendix 7B
Proficiency Descriptions
Accent
1. Pronunciation frequently unintelligible.
2. Frequent gross errors and a very heavy accent make understanding difficult and require frequent repetition.
Grammar
1. Grammar almost entirely inaccurate except in stock phrases.
2. Constant errors showing control of very few major patterns and frequently
preventing communication.
3. Frequent errors showing some major patterns uncontrolled and causing
occasional irritation and misunderstanding.
4. Occasional errors showing imperfect control of some patterns but no weakness that causes misunderstanding.
5. Few errors, with few patterns of failure.
6. No more than two errors during the interview.
Vocabulary
1. Vocabulary inadequate for even the simplest conversation.
2. Vocabulary limited to basic personal and survival areas (time, food, transportation, family, etc.).
3. Choice of words sometimes inaccurate; limitations of vocabulary prevent
discussion of some common professional and social topics.
4. Professional vocabulary adequate to discuss special interests; general vocabulary permits discussion of any nontechnical subject with some circumlocutions.
5. Professional vocabulary broad and precise; general vocabulary adequate to
cope with complex practical problems and varied social situations.
6. Vocabulary apparently as accurate and extensive as that of an educated
native speaker.
Fluency
1. Speech is so halting and fragmentary that conversation is virtually impossible.
2. Speech is very slow and uneven except for short or routine sentences.
3. Speech is frequently hesitant and jerky; sentences may be left incomplete.
4. Speech is occasionally hesitant, with some unevenness caused by rephrasing
and groping for words.
5. Speech is effortless and smooth, but perceptibly nonnative in speed and
evenness.
6. Speech on all professional and general topics as effortless and smooth as a native speaker's.
Comprehension
1. Understands too little for the simplest kind of conversation.
2. Understands only slow, simple speech on common social and touristic
topics; requires constant repetition and rephrasing.
3. Understands careful, simplified speech directed to him, but requires occasional repetition or rephrasing.
4. Understands quite well normal educated speech directed to him, but
requires occasional repetition or rephrasing.
5. Understands everything in normal educated conversation except for very
colloquial or low-frequency items, or exceptionally rapid or slurred speech.
6. Understands everything in very formal and colloquial speech to be expected
of an educated native speaker.
Appendix 7C
FSI Score Weighting Table
Proficiency description 1 2 3 4 5 6
Accent 0 1 2 2 3 4
Grammar 6 12 18 24 30 36
Vocabulary 4 8 12 16 20 24
Fluency 2 4 6 8 10 12
Comprehension 4 8 12 15 19 23
Total
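The weighting table lends itself to mechanical application: each 1-6 subscale rating is replaced by its weighted points, and the points are summed. The sketch below simply encodes the rows above (the function name and example ratings are ours); the conversion from the weighted total to a proficiency level would then use the conversion table, which has not survived in this copy.

    # Weighted points for ratings 1-6 on each FSI subscale, from the
    # weighting table above (Appendix 7C).
    FSI_WEIGHTS = {
        "Accent":        [0, 1, 2, 2, 3, 4],
        "Grammar":       [6, 12, 18, 24, 30, 36],
        "Vocabulary":    [4, 8, 12, 16, 20, 24],
        "Fluency":       [2, 4, 6, 8, 10, 12],
        "Comprehension": [4, 8, 12, 15, 19, 23],
    }

    def fsi_weighted_total(ratings):
        """Sum the weighted points for a dict of subscale ratings (1-6)."""
        return sum(FSI_WEIGHTS[s][r - 1] for s, r in ratings.items())

    print(fsi_weighted_total({"Accent": 2, "Grammar": 3, "Vocabulary": 3,
                              "Fluency": 3, "Comprehension": 4}))  # 52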
Conversion Table
[The conversion values, mapping weighted totals to proficiency levels, are not recoverable from the source.]
Appendix 7D
Possible Interview Topics

Past Tense
1. A frightening experience
2. Your most embarrassing moment
3. Your biggest surprise
4. What you did last weekend
5. An untrue story you told
6. Your most interesting trip
7. Your first long trip
8. An important event in your life
9. One time when you were misunderstood
10. Why you decided to come to this school or city
Future Tense
Should or Imperative
1. How to be a good tourist
2. Travel tips
3. How to bake a cake, a pie (recipes and instructions)
4. Should married women work outside the home?
Conditional
1. If you had a million dollars
2. If you had three wishes
3. If you governed the world
4. If you had not come to this school
5. If you knew you had only two weeks to live
6. How would you teach English?
7. The changes you would make in this city
8. If you were the last person alive
Chapter 8
Rater Reliability and Oral Proficiency Evaluations
Karen A. Mullen
auditory comprehension. Of these five, fluency has been said to be the easiest to
assess since it focuses only on pauses, backtracking, and fillers in the speech flow.
On the other hand, pronunciation has been said to be difficult to judge since criteria for judgment may vary from one person to another: some listeners may be able to readily decode foreign accents, while others may be judging comprehensibility rather than phonemic or allophonic accuracy. Moreover, three of these components (grammar, vocabulary, and auditory comprehension) can presumably be
tested without asking a subject to speak. However, evidence of a formal knowledge of grammar and vocabulary does not guarantee that it will be applied in actual
speech production.
The question of how to test for speaking proficiency has usually been decided
in favor of the interview. It has certain advantages: it can be conducted rather
quickly, it most resembles real-life speaking situations, and it can be adjusted up or down depending upon the speaker's demonstrated proficiency. It, however, has
been open to the criticism that the measures derived from such a test tend to have
rather low reliability. Some have suggested that a tolerable degree of reliability can
be achieved if behavioral statements are used as a standard for each scalar judgment, if judges are trained for their task, and if at least two judgments are pooled
for each interview.
The purpose of this chapter is to report a study designed to determine if
experienced ESL teachers, working in pairs, can reach the same judgments
regarding the oral proficiency of nonnative speakers of English, i.e., to determine
the degree of reliability of such judgments. In addition, the question of whether
different sets of judges will rate the same subjects differently is also posed. Finally,
the study was designed to determine the relative weight given to each component
category in predicting the overall proficiency score. Specifically, the hypotheses were:
1. There is no significant difference between the ratings assigned by the two judges in a pair.
2. There is no significant difference between the ratings assigned by one pair of judges to one group of speakers and those assigned by another pair to another group.
3. The regression coefficient of each component scale (listening comprehension, pronunciation, fluency, and structure) in predicting the overall proficiency rating is zero.
Method
To test hypothesis 1, a single-factor experimental design having repeated measures was chosen. The F-statistic based upon the mean square* between judges
*Editors’ note: For the non-statistically-trained reader, this chapter is a bit more technical than most of
the others in this volume (also, see Chap. 15). Throughout this paper, the term mean square refers to a
quantity obtained by adding up squares of deviations from some average score (or quantity) and
dividing by the number of scores entering the computation. It is always an index of some sort of
variance.
divided by the mean square of the judge-subject interaction was computed to test the hypothesis of no significant differences between judges.1 Reliability coefficients (unbiased) were calculated based upon the number of subjects in the sample, the number of judges (2), the mean square between subjects, and the mean square within subjects (Winer, 1971, p. 287). To test hypothesis 2, a two-factor experimental design having repeated measures on one factor was chosen. The F-statistic based upon the mean square between groups divided by the mean square of subjects within groups was computed to test the hypothesis of no significant differences among groups.2 To test hypothesis 3, the regression coefficients of a multiple regression equation were calculated and an F-test for a coefficient of zero was performed.3
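For concreteness, the quantities named in this paragraph can be sketched as follows for an n-subjects-by-k-judges matrix of ratings. The F-ratio follows the text (mean square between judges over the judge-subject interaction); the reliability formula shown is the usual intraclass estimate for the mean of k ratings and is assumed here to correspond to Winer (1971, p. 287), which the chapter cites but does not reproduce.

    import numpy as np

    def judge_analysis(X):
        """X: n x k matrix of ratings (n subjects, k judges).
        Returns the F-statistic for judge differences and an estimate
        of the reliability of the mean of the k ratings."""
        n, k = X.shape
        grand = X.mean()
        ss_total = ((X - grand) ** 2).sum()
        ss_subj  = k * ((X.mean(axis=1) - grand) ** 2).sum()
        ss_judge = n * ((X.mean(axis=0) - grand) ** 2).sum()
        ss_inter = ss_total - ss_subj - ss_judge
        ms_subj   = ss_subj / (n - 1)
        ms_judge  = ss_judge / (k - 1)
        ms_inter  = ss_inter / ((n - 1) * (k - 1))
        ms_within = (ss_judge + ss_inter) / (n * (k - 1))
        F_judges = ms_judge / ms_inter
        reliability = (ms_subj - ms_within) / ms_subj
        return F_judges, reliability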
Judges. Five judges participated in this study. They were randomly paired to
form six groups. All judges were graduate students in linguistics. They had
completed courses in phonetics, syntactic and phonological analysis, and TESL
methodology. They all had taught ESL for at least one year. All had been
instructed on how to use the rating form and guidelines, and all had participated in
such interviewing before. None of the judges had previously met the subjects they
interviewed.
Subjects. Ninety-eight subjects were referred to the University of Iowa
Department of Linguistics for a proficiency evaluation by either the foreign admissions officer, the foreign student advisers, or the student’s academic adviser. Most
of the subjects were new to the university and were referred because their TOEFL
scores were below 550 (an arbitrary cutoff point). Some were evaluated because
the foreign student advisers had noted a lack of facility in spoken English although
their TOEFL scores were not below 550. The purpose of the evaluation was to
determine whether additional work in English and a reduced academic program
might be recommended.
Procedure. Judges were required to rate speakers on five scales: listening
comprehension, pronunciation, fluency, control over English structure, and
overall speaking proficiency. These scales were labeled vertically on a rating form. Beside each of the five scales there was a double horizontal line equally divided into five contiguous compartments labeled from left to right: poor, fair, good, above average, excellent. The judges were instructed to put an X in the box best characterizing the speaker's proficiency with regard to each of the five scales or to put an X on the line between boxes if the subject’s proficiency seemed to be between the two labeled areas. A set of guidelines for deciding what level of proficiency to assign was explained to the judges beforehand and was available for
reference after each interview.
Each pair of judges was instructed to ask the subject a few questions in order
to get the interview started—what the subject’s name was, where he was from, how
long he had been in the United States, where he had studied, what his major field of
interest was, etc. These questions were intended to put the subject at ease as
quickly as possible. The interviewers were instructed to begin by speaking at a
normal rate and to slow down later if it became apparent that the subject could not
follow. Rather than grossly distorting their speech, the interviewers rephrased their questions by simplifying the sentence structure and vocabulary.
If it became evident that the subject was able to do well in a rather informal
type of interview, the judges were prepared to shift the questioning to a level more
like that which subjects would encounter in an informal academic setting. They
would then ask such questions as: What interests you most about your field of
study? What are some important questions yet to be answered? What special subfields exist? What are the relationships between them? The intent was to simulate a conversation, and the next question in a given sequence, of course, tended to follow from what had preceded. One of the aims was to vary the questioning from subject to subject so that later subjects would not know what questions to expect.
Raters were instructed to contribute to the conversation occasionally but to
control their speaking in such a way that it could be used to assess the subject’s
listening comprehension. They were encouraged to give the subject every opportunity to demonstrate his speaking proficiency. After each interview, lasting approximately fifteen minutes, each judge evaluated the subject without consulting the other. Evaluations made by the individual judges were later converted to a numerical value of 1 = poor, 2 = between poor and fair, 3 = fair, 4 = between fair and good, 5 = good, 6 = between good and above average, 7 = above average, 8 = between above average and excellent, and 9 = excellent.
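The conversion amounts to a lookup table; in this sketch the slash keys stand for X's placed on the line between two boxes (the naming is ours, not the chapter's).

    RATING_VALUES = {
        "poor": 1, "poor/fair": 2, "fair": 3, "fair/good": 4,
        "good": 5, "good/above average": 6, "above average": 7,
        "above average/excellent": 8, "excellent": 9,
    }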
[Analysis-of-variance summary; only the values for the Overall scale are recoverable from the source:]

Source of variation                         df    SS       MS     F
Between groups                               5    46.64    9.33   2.95*
Between subjects within groups (pooled)     92   291.04    3.16
Between judges within groups (pooled)        6     7.68    1.30   4.15†
Subjects x judges (pooled)                  92    28.32     .31

*Significant at p < .05.  †Significant at p < .01.
Table 8-3 shows the differences in judges’ mean ratings for each group. It is
apparent from the graphs that some lines rise more sharply than others and that the
distance between the lowest and the highest line is great enough to result in a
significant difference in groups at the .05 level. Group 2 stands out as the most
deviant with regard to differences between judges and with regard to mean ratings
when compared to other groups.
[A multi-page rotated table appears here in the original, apparently giving analysis-of-variance results (sums of squares, mean squares, and F-ratios) for the Listening, Pronunciation, Fluency, Grammar, and Overall scales in each group; its values are not recoverable from the source. *Significant at p < .05. †Significant at p < .01.]
Table 8-4 shows the reliability coefficients (unbiased) for each pair of judges
for all scales of oral proficiency. The coefficients range from a low of .43 to a high
of .93. Table 8-4 also shows the relationship between the F-statistic and the reliability coefficients. Some cases show a significant difference in judges’ ratings and a high reliability coefficient (for example, listening ratings in group 4). The reverse is also apparent (for example, fluency ratings in group 3). In addition, some cases show no significant difference in judges’ ratings and a high reliability coefficient (for example, listening ratings in group 1) as well as the reverse (for example, grammar ratings in group 2). Since the reliability coefficient is a measure of the degree to which the average error of measurement is zero and indicates how good an average of the two judges’ ratings is as an estimate of the subjects’ true rating, in cases where there is a significant difference in judges’ ratings and a high reliability (above .70), one may conclude that an average of the two ratings is a better estimate of the subject’s true rating than either by itself. In cases where there is a high reliability and no significant difference in judges’ ratings, one may conclude that either of the two ratings is as good an estimate of the subject’s true rating as an average is. Where there is a low reliability and a significant difference in judges’ ratings, we may conclude that an average of the two judges’ ratings is not a good estimate of the true rating. Likewise, where there is a low reliability and no significant difference in judges’ ratings, it is evident that the average error of measurement is sufficient to reduce the extent to which an average of the two ratings is a good estimate of the subject’s true rating. The overall scale appears to provide the most uniform reliability coefficients across groups of subjects. Since a significant difference in judges’ ratings is evident, it is best to use an average of the two judges’ ratings as the best estimate.

[Displaced table rows, apparently group means and standard deviations by scale (group, N, then mean-SD pairs):]
1  15  6.06 1.75  6.06 1.98  5.46 1.55  6.13 1.35  5.73 1.79  6.20 1.47
2  17  7.47 1.46  7.76 1.09  6.41 1.37  7.11 1.26  6.11 1.49  7.41 1.12
3  10  6.10 2.02  6.70  .82  5.30 1.25  5.70  .82  5.60 1.17  5.70 1.41
4  17  5.52 1.69  6.52 2.00  5.41 1.37  5.47 1.50  5.64 1.65  6.00 1.83
5  25  6.04 1.45  6.60 1.75  5.68 1.21  5.72 1.02  5.72 1.51  6.08 1.77
6  14  6.28 1.58  7.00 1.56  5.50 1.22  5.85 1.09  6.07 1.32  6.21 1.42
[Table 8-4, reconstructed. F-statistics and reliability coefficients (r2) for each pair of judges; the column order (Listening, Pronunciation, Fluency, Grammar, Overall) is inferred from the examples discussed in the text.]

Group  N   Listening      Pronunciation   Fluency        Grammar        Overall
           F       r2     F       r2      F       r2     F       r2     F       r2
1      15    .00   .93     4.83   .72      1.78   .75    10.42†  .76     2.50   .88
2      17   1.74   .82    11.76†  .79     39.51†  .57     9.30†  .57    19.36†  .77
3      10   1.23   .43     2.25   .74       .05   .50     7.36*  .74     7.36*  .75
4      17  15.11†  .82      .11   .92      2.85   .92      .03   .66     1.66   .93
5      25   6.24*  .82      .03   .67      3.27   .88      .88   .86     2.24   .83
6      14   5.51*  .77     2.52   .81       .32   .85      .10   .88      .13   .92

*Significant at p < .05.  †Significant at p < .01.
the equation significantly reduces the amount of unpredicted variance. Table 8-6
shows these results. The beta coefficients in the multiple regression equation
which are used to predict the overall rating from the four other scales indicate that
if the listening rating were increased by one unit and the other ratings remained
constant, the expected change in the overall proficiency rating would be .23. If the
grammar rating were increased by one (and the other ratings remained constant),
the change in the overall proficiency rating would be .29. Parallel conclusions may be drawn for the remaining scales; the hypothesis of a zero regression coefficient is rejected in each case. All are about equally important contributors to the prediction of scores on the overall proficiency scale.

Table 8-6 A Stepwise Regression Analysis with the Overall Proficiency Scale as the Dependent Variable
[The values of the table are not recoverable from the source. *Significant at p < .01.]
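The interpretation of the beta coefficients above can be made concrete with an ordinary least-squares sketch. This is a plain regression, not the stepwise procedure with per-coefficient F-tests that the chapter actually used, and the names are ours.

    import numpy as np

    def ols_betas(X, y):
        """Least-squares coefficients for predicting the overall rating
        (y) from the four component ratings (columns of X: listening,
        pronunciation, fluency, structure)."""
        A = np.column_stack([np.ones(len(y)), X])  # prepend an intercept
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        return coef[1:]                            # drop the intercept

    # A beta of .23 on the listening column would mean: one extra unit
    # on the listening scale, others held constant, predicts a .23-unit
    # rise in the overall proficiency rating.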
Conclusions
Two of the hypotheses outlined at the beginning of the chapter are rejected by the
results of the analysis of variance on performance ratings by pairs of judges in each
of the six groups. Ratings assigned by one judge are significantly different from
ratings assigned by the other in six out of thirty cells at the .01 level and eleven out
of thirty cells at the .05 level. However, the reliability coefficients for four out of the six significant-difference cases at the .01 level are above .70, and the reliability coefficients for nine out of the eleven significant-difference cases at the .05 level are above .70. Therefore, an average of the two judges’ ratings will serve as a good estimate of the true rating. This means that it is best if there is more than one judge per subject.
There is a significant difference at the .05 level in the ratings assigned by one
pair of judges to a group of speakers when compared to the ratings assigned by
another pair of judges to another group of speakers. This may be due to a
difference in judgments made by the raters or a difference in the groups themselves. Since the design of the experiment did not control for homogeneity of
1. In other words, if the variance across judges is small in relation to the variance across subjects (and
if the F-ratio is therefore high), it is assumed that the judges agree on where the differences between
subjects lie. This would be an indication of interrater reliability.
2. In other words, if the differences (variance) among subjects in the same group are about equal to or greater than the differences (variance) between subjects in different groups, it will be assumed that the ratings of different pairs of judges do not differ much across groups. If this is so, and if the groups are really similar in ability to start with, this finding would also support the conclusion that the ratings of judges (in this case pairs of them) are reliable.
3. Here the question is how much each of the four separate scales of listening comprehension, pronunciation, fluency, and structure contributes to the explanation of variance in overall speaking proficiency.
Chapter 9
ESL Oral Proficiency
Donn R. Callaway
This study investigated the reactions of naive native raters and ESL teachers
to various samples of differently accented speech. Two panels of judges, 35
naive natives and 35 experienced teachers, evaluated 15 samples of
nonnative speech from three source language backgrounds (Arabic, Persian,
and Spanish). Evaluations for both groups were in terms of bipolar scales of
intelligibility, pleasantness, acceptability, nativeness, and overall profi¬
ciency of the speaker. In addition, the ability of judges to guess the speaker s
native language background was studied. Results indicated that both groups
can distinguish degrees of proficiency with substantial reliability, that they
cannot distinguish the posited components of speech, and that some judges
are fairly good at guessing language background of ESL learners based on
samples of their speech.
to link whatever language and speech features serve as salient cues in this judgmental
process with whatever kinds of evaluation or stereotypes are of interest to us in the
behavior of listeners.
He showed that social class and judgment ratings can be predicted from the
presence, absence, or strength of certain language features, among them length of
pauses and verb constructions (p. 477). However, since his study was limited to
the effect of black dialect on black and white elementary school teachers and did
not include other dialects and other evaluators, further research is needed.
As a result of the concern of sociolinguists that accented speech can cause
alienation and discrimination in educational and occupational opportunities
(Ortego, 1970, Ryan, 1973), most studies have dealt only with the language varieties of a few rather large minority groups: French Canadians (Lambert et al., 1960,
Anisfeld and Lambert, 1964, Webster and Kramer, 1968), Black Americans
(Harms, 1963, Shuy, Baratz, and Wolfram, 1969, Tucker and Lambert, 1969,
Williams, 1970, Williams, Whitehead, and Miller, 1971), Mexican Americans
(Ortego, 1970, Williams, Whitehead, and Miller, 1971, Ryan, 1973), and British
regionals (Strongman and Woosley, 1967, Giles, 1972). Richards (1971, p. 21)
points out that any “deviancy from grammatical or phonological norms of a speech
community elicits evaluational reactions that may classify a person unfavorably.” If this divergence from the standard language has an effect on the social relationships of these minorities, it could have an even more pronounced effect on the degree of success attained by the second language learner, not to mention his likelihood of being socially accepted.
In oral evaluation, a general assumption is that any native speaker can assess
the proficiency of a nonnative. In fact, a generally used measure, the American Language Institute Oral Rating Form, describes a speaker of minimum proficiency as having “pronunciation... virtually unintelligible to ‘the man in the
street’ [my italics]” (1962). However, usually, it is the second language teacher,
rather than “the man in the street” who makes the evaluation. Is it safe to assume
that there is no significant difference between a trained rater and an untrained
one? Cartier (1968, p. 21) does not think it is. He says that judgments of proficiency “are made by the wrong people, they are made by sophisticated language instructors who have become quite skilled at understanding heavily dialectal English rather than the student’s eventual instructors, classmates, and job supervisors.” He seems to be implying that naive judges might be better. According to
Jakobovits (1970, p. 85), however, naive judges are apt to attribute too much
importance to “accent, pronunciation, and fluency” and too little to the weightier
matters of grammar and vocabulary.
A number of research studies have examined the effect of accent in
bidialectal and bilingual speech. These projects have dealt with the reliability of
judgments, the ability of judges to specify certain speech characteristics, degree of
accentedness, and overall proficiency.
over three tasks: reading, story retelling, and narration. The judges showed substantial reliability across language backgrounds and across tasks. However, they were not very good at identifying the source language background of the speaker. It would seem from Palmer's results that particular foreign accents, e.g., Spanish, may not be as easily recognized as is popularly believed.
This study attempts to address a number of remaining questions. Here, as in Palmer's study, the focus is on learners of ESL, but unlike Palmer's study, this one compares interrater reliability among naive raters with interrater reliability among experienced ESL teachers. Further, evaluations by both groups are validated against the independent placement of the speakers in one of five proficiency levels, via a separate testing procedure. Whereas some of the cited studies were concerned with the effect of a speaker's accent on the listener's judgment (without knowing or caring what characteristics of speech contributed to that evaluation), this experiment will attempt to separate accentedness into distinguishable dimensions.
In particular, the following questions are addressed:
Method
Speech Samples. During the fall semester of 1975, the experimenter asked
instructors from each of the five proficiency levels at the Center for English as a
Second Language (CESL), Southern Illinois University, to recommend students
of average speaking ability from Arabic-, Persian-, and Spanish-speaking language
backgrounds who would be willing to participate in a short recording session. Two
native speakers who were graduate students in the Department of Linguistics,
Southern Illinois University, were also taped. Originally 25 ESL students, along
with the two American students, were recorded in a laboratory setting reading one
of twelve 100-word passages in English. (The paragraphs are given in Appendix
9A.) Each speaker was allowed to practice the passage twice before he was
recorded. Eleven tape samples were eliminated because of technical recording
problems or to avoid including too many samples from the same source language background. Three nonnatives from each of the five levels of CESL (15 ESL
students) were finally selected along with one of the American students to form
exactly 16 criterion samples of speech.*
Questionnaire: Scales of Accentedness and Overall Proficiency. The scales were constructed in a semantic differential form similar to the scales used by Lambert et al., Galvan et al., Palmer, and others. The first four scales consisted of four pairs of bipolar adjectival descriptors (“not very intelligible” to “intelligible,” “unpleasant” to “pleasant,” “unacceptable” to “acceptable,” and “nonnative” to “native”). In addition, there was an overall proficiency scale (OPS). Each of these
scales was in a six-point Likert-type format. Also, a multiple-choice question
about the language background of the speaker was included (see Appendix 9B).
Raters. The tape and questionnaire were administered to 70 raters (Rs). Half
of them were enrolled in undergraduate English composition courses and had
neither linguistic nor teaching experience (the naive group), while the other half
were instructors or teaching assistants in ESL (the experienced group). Before
taking the test, each R completed a form on biographical data. The naive Rs were
tested during their regular class meetings, while the teachers were tested either
individually or in small groups.
Rating Procedure. Each sample was rated by the naive native Rs and the
experienced ESL teachers on each of the six scales. The order of the six scales was
the same for all samples of speech. The order of the 16 speakers, however, was
randomized for the first tape and was given in reverse order on a second tape.
Detailed instructions for the use of the protocols were presented orally with an
example using a male Spanish speaker. Form A was used with 15 of the naive Rs
and 20 of the ESL teachers. In sum, each of the Rs heard the same directions,
either orally or taped, and the same example, and they listened to one of the two
tapes of the 16 speakers. They evaluated each speaker on each of the six scales.
Results
Rater Agreement. In order to see whether or not the raters agreed in their evaluations of accentedness (see Question 1 above), the Rs were treated as variables in a
Q-type factor analysis. Normally, of course, the variables input to a factor analysis
are test scores, scales, or other measures. In this case, the raters were treated as
variables and the various accent scales were treated as subfiles (each containing
16 cases, i.e., the 16 speech samples). Owing to computer space limitations, only
60 Rs could be included. Therefore, five were excluded from the naive group and
five from the group of experienced ESL teachers on a random basis. In the first
factor analysis, the first four scales were treated together without distinction. We
will return shortly to the justification for this. (See the discussion of Question 4,
below.) In the principal components analysis, Factor 1 accounted for 48% of the variance among the individual Rs. All of the Rs loaded positively on this factor and
•Editors’ note: These speech samples were collected prior to the major testing project discussed in
other chapters (especially 2, 6, 7, 14, 17, 18, 23, and 24).
[Table 9-1. Loadings of the 60 raters on Factor 1 for the four accentedness scales combined; rater number, then loading:]
6* .862 36 .548
7 .500 37 .811
8 .466 38 .691
9 .840 39 .505
10 .849 40 .810
11 .805 41 .778
12 .718 42 .754
13 .729 43 .833
14 .664 44 .666
15 .767 45 .748
16 .580 46 .700
17 .698 47 .664
18 .424 48 .693
19 .721 49 .778
20 .589 50 .476
21 .539 51 .596
22 .791 52 .809
23 .373 53 .739
24 .492 54 .791
25 .694 55 .762
26 .777 56 .804
27 .525 57 .703
28 .444 58 .812
29 .655 59 .741
30 .719 60 .714
31 .687 61 .785
32 .703 62 .681
33 .406 63 .837
34 .547 64 .734
35 .616 65* .714
♦The first and the last five raters were eliminated so as not to exceed
the space limitation of the SPSS factor program (PA1, Nie, Hull,
Jenkins, Steinbrenner, and Bent, 1975, pp. 479-480).
Fifty-six Rs showed a correlation of greater than .50 with this factor, while 12 of these 56 loaded at .80 or higher. The overall mean loading was .69. The mean loading for the 30 naive Rs was .645, and for the 30 ESL teachers it was .735. On a similar principal components analysis for the OPS, the first factor accounted for 56% of the total variance. The average loading for the naive Rs was .676; for the ESL teachers, it was .816 (Table 9-2). From these analyses, it can be concluded that there is very substantial agreement among the Rs, regardless of whether they are naive or experienced.
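A brief illustrative sketch of the Q-type analysis follows. It is hypothetical: the ratings are simulated, and a plain principal-components extraction in Python stands in for the SPSS PA1 routine the author used; only the logic of treating raters as variables and speech samples as cases follows the description above.

import numpy as np

# Simulated ratings: 16 speech samples (cases) by 60 raters (variables),
# driven by one underlying proficiency dimension plus rater noise.
rng = np.random.default_rng(0)
proficiency = rng.normal(size=(16, 1))
ratings = proficiency + 0.5 * rng.normal(size=(16, 60))

# Q-type analysis: correlate raters with one another across the 16 samples,
# then take principal components of the 60 x 60 rater correlation matrix.
R = np.corrcoef(ratings, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

loadings = eigvecs[:, 0] * np.sqrt(eigvals[0])  # rater loadings on Factor 1
loadings *= np.sign(loadings.sum())             # orient the factor positively
print(f"Factor 1 share of variance: {eigvals[0] / eigvals.sum():.0%}")
print(f"Mean rater loading on Factor 1: {loadings.mean():.2f}")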
Table 9-2 Loadings of the Individual Raters on Factor 1 (Overall Proficiency Scale)

Rater   Loading     Rater   Loading
6*      .952        36      .620
7       .673        37      .731
8       .516        38      .729
9       .797        39      .834
10      .895        40      .863
11      .727        41      .715
12      .848        42      .717
13      .778        43      .914
14      .758        44      .777
15      .830        45      .827
16      .846        46      .804
17      .660        47      .783
18      .361        48      .824
19      .698        49      .850
20      .598        50      .881
21      .642        51      .633
22      .915        52      .873
23      .394        53      .723
24      .437        54      .842
25      .917        55      .792
26      .658        56      .808
27      .615        57      .796
28      .357        58      .841
29      .784        59      .762
30      .570        60      .786
31      .766        61      .801
32      .729        62      .810
33      .326        63      .760
34      .613        64      .828
35      .654        65*     .708

*See the note to Table 9-1.
[Tables 9-3 and 9-4: correlations among Scales 1 through 5; the table bodies were not recovered.]
The experienced Rs identified the Spanish speakers most accurately, then the Persians, and then the Arabs. For the naive Rs, the order was Spanish, Arabs, and Persians (Table 9-5).

Table 9-5 The Percentages of Correctly Identified Language Backgrounds

Source language     By experienced     By naive
of speaker          raters             raters
Arabic              66.3               31.4
Spanish             84.6               32.6
Persian             68.6               26.9
Combined            73.0               30.0

Of these three languages, only Spanish was studied by more than one R (6 naive Rs and 20 experienced Rs). For the naive Rs, it is interesting to note that those who had not studied Spanish identified the speakers with 16% greater accuracy than the naive judges who had studied Spanish (Table 9-6).
In looking at the ratings of the experienced Rs, we discover that with or without
studying Spanish, Rs could identify the speakers equally well. It seems, therefore,
that the ability to correctly identify the language background of a speaker may be
due more to mere contact with speakers of the language in question than to formal
study of the language supposed to be the source of the accent.
Table 9-6 Percentage of Correct Identifications of the Spanish Speakers

                                       By experienced     By naive
                                       raters             raters
Those who had studied Spanish          85                 20
Those who had not studied Spanish      84                 36
Discussion
With little or no research basis, Cartier speculated that teachers are not the most
reliable judges of oral proficiency. With equally little empirical study, Jakobovits
suggested that naive natives are also not the best judges. Obviously, Cartier and Jakobovits cannot both be right, but neither recommended appropriate research to test their claims. With unabashed certainty, language testers have categorized oral performance into the separate components of accent, grammar, vocabulary, fluency, and comprehension (Valette, 1967, 1977; Harris, 1969; Clark, 1972; Heaton, 1975; Davies, 1977). These categories have become the sanctioned
criteria for the evaluation of oral proficiency by teachers. For the most part, their
empirical necessity has gone unquestioned.
The data from this experiment dispute both the speculation of Cartier and the opposite claim of Jakobovits; for if both claims were true, we would expect to find little reliability in the evaluations of either of the two groups of raters studied here. However, the results indicate that both groups can distinguish degrees of proficiency with substantial reliability, although the teachers are somewhat more reliable than the naive judges (but contrary to what some might have hoped, the latter fact does not support Jakobovits’ claim; it refutes his claim because of the demonstrated unity of the various scales of accentedness and the overall proficiency rating). Apparently, all the raters tended to make holistic, unidimensional evaluations rather than multidimensional judgments, i.e., separate evaluations of the presumed components. The unity of the scales suggests that dividing oral performance into components is superfluous at best and artifactual at worst. According to the available empirical evidence, a listener apparently does not and perhaps cannot componentialize the characteristics of speech. Rather, it would appear that overall comprehensibility is what motivates the evaluation.
The inability of a naive judge to identify the source language of accented speech substantiates Palmer’s findings (1973). The data also show that ESL teachers (the experienced group) are quite successful (73%) in identifying source language backgrounds. Further, the data indicate, contrary to Palmer (1973), that accents are substantially distinctive.
More research is required before it will be possible to relate degrees of accentedness to points on a well-defined, although subjective, continuum. In addition, experiments should be conducted to see if the ability to identify a speaker’s first language affects the reliability and validity of a given judge’s evaluations. Since studies have shown that speech characteristics may affect personality assessments, the converse influence of personality assessments on perceived speech characteristics may also affect the reliability and validity of oral proficiency evaluations and should be investigated.
A final area of research would be the reliability of evaluations of accentedness by nonnatives whose own speech reveals varying degrees of accentedness. This study used only native speakers of English as raters: what if we wanted to generalize to nonnatives from different language backgrounds? Would Arabs, for example, tend to rate Arabic speakers as reliably as native speakers of Spanish or Persian? Would they be more lenient? Own-accentedness and native language, therefore, are factors that might well be studied in relation to the reliability of nonnatives as evaluators of nonnative speech.
Note
Appendix 9A
[The texts used in this study were selected from several sources. Each passage was rewritten to conform to a 100-word limit on length. Texts 1, 2, 3, and 4 were each read by two different speakers. The remaining texts were read by only one speaker each.]
1. Ben got off the bus and then the bus drove away. He forgot about the
tickets because it was raining. The road was wet and there was a very big hole in his
shoe. Then a second bus stopped and he got on. This time there was a seat. He paid
a dime for his ticket and then shut his eyes. When he opened them again, the bus
was past the theater. He rang the bell and the bus stopped suddenly. It was still
raining as he walked back to the theater and went in through the door. He saw
many photographs of the actors just before he saw the stage. (Adapted from Baird,
Broughton, Cartwright, and Roberts, 1972, p. 44.)
2. I hope to learn several foreign languages but English is the one I want to
study first. To begin with, I hope to get a good position with one of the big companies in the capital and it will be an advantage for me to have an understanding of
English. If my work should ever require my traveling outside of the country, it
would be helpful if I knew English. It is used in carrying on business in almost
every part of the world. My brothers and sisters, already skillful in English, are
eager to practice it with me; so I will have many opportunities when I am ready to
speak English. (Adapted from Van Syoc and Van Syoc, 1971, p. 89.)
3. On Saturday mornings the big public library opens at half past nine. A lot
of the people go into the library on Saturday because this is the time when they go
shopping, they take their books into the library, and go home with new ones. Susan
and Mary, the two girl librarians, were standing behind the desk. They took the
books from the people who came in and gave them their tickets. It was a warm
Saturday, and a lot of people were in the streets and in the stores, and many were
coming into the library too. (Adapted from Baird, Broughton, Cartwright, and
Roberts, 1972, p. 47.)
4. As was expected, the favorites had gotten well out in front with the
remaining horses grouped together some way behind. On a dangerous bend, three
of the horses leading the group fell, throwing the riders into great confusion. As the
race progressed, the track became full of horses without riders. Toward the end,
there were only three horses left. College Joy and Sweet Seventeen were still leading the race with an unknown horse far behind. The crowd was very disappointed
when on the last jump in the race, the riders of both favorites failed to keep in the
saddle. The crowd cheered and applauded as the unknown horse crossed the
finishing line. (Adapted from Alexander, 1974, p. 60.)
5. Moving the pilot aside, the man took his seat and listened carefully to the
urgent instructions that were being sent by radio from the airport below. The plane
was now dangerously close to the ground, but it soon began to climb. The man had
to circle the airport several times in order to become familiar with the controls.
The terrible moment came when he had to land the plane. Following the
instructions, the man guided the plane toward the airfield. It shook violently as it
touched the ground and then moved rapidly across the field, but after a long run it
stopped safely. (Adapted from Alexander, 1974, p. 61.)
6. The following Sunday we stayed at home, even though it was a fine day.
About noon a large and very expensive car stopped outside our house. We were
astonished when we saw several people preparing to have a picnic in our small
garden. Father got very angry and went out to ask them what they thought they
were doing. You can imagine his surprise when he recognized the man who had
taken our address the week before. Both men burst out laughing and Father welcomed the strangers into the house. In time, we became friends, but we had
learned a lesson we have never forgotten. (Adapted from Alexander, 1974, p. 63.)
7. It was a very dark and stormy night. Two men were walking slowly down
the road. Snow was covering the ground and a cold wind was blowing. They
noticed a light behind some trees and soon arrived at a house. A poor old man
immediately invited them into a clean room. He seemed a strange fellow, but he
spoke kindly and offered them milk and fresh fruit. The men remained there until
morning. Then the man led them to the nearest town, but he would not accept any
money for his help. (Adapted from Alexander, 1974, p. 18.)
8. Science has told us so much about the moon that it is fairly easy to
imagine what it would be like to go there. It is certainly not a friendly place. As
there is no air or water, there can be no life of any kind. Also for mile after mile,
there are only flat plains of dust with mountains around them. Above, the sun and
stars shine in a black sky. The moon is very silent. But beyond the horizon, our
earth is shining more brightly than the stars. It looks like an immense ball, colored
blue, green, and brown. (Adapted from Alexander, 1974, p. 35.)
9. The store was empty and very peaceful. We sat down in the main hall and
listened to the rain beating against the windows. Suddenly there was a loud noise at
the door. Then a large party of boys were led in by a teacher. The poor man was try¬
ing to keep them cpiiet, but they were not paying any attention to him. The boys ran
here and there. The teacher explained that the boys were rather excited. But the
noise proved too much for us; so we decided to leave. After all, the boys had more
right to be in the store than we did. (Adapted from Alexander, 1974, p. 54.)
10. Driving along a highway one dark night, Tom suddenly had a flat tire.
Even worse, he discovered that he did not have a spare tire in the back of his car.
Tom waved to passing cars and trucks, but not one of them stopped. At last, he
waved to a car like his own. To his surprise, the car actually stopped and a well-
dressed woman got out. The woman offered him her spare tire, but Tom had never
changed a tire in his life. So she set to work at once and changed the tire in a few
minutes while Tom looked on. (Adapted from Alexander, 1974, p. 27.)
11. Dan found the school work easy. He read widely both at school and in
the branch library. After the third year of high school, he left to take a job with a
glass firm. Art work had always been a major interest, and he did so well with the
firm that he was promised rapid advancement. But then the depression came, the
business failed, and Dan was without a job. At first, he went out looking for a job
and continued his art work at home, but when all his efforts brought no results, he
stopped looking for work and even lost interest in art. (Adapted from Whyte, 1955,
pp. 8-9.)
12. Tony came into the club to talk the situation over with John. He was trying to get transportation, he said, but even if he could arrange it in the next few
minutes it was so late that the boys would miss a large part of the evening. If any¬
one wanted his money back or a ticket for the next football game, he could have it.
John explained the situation to the boys and then said that he thought it would be
better if we went another time. Tony agreed. He said that John could collect the
tickets later. (Adapted from Whyte, 1955, p. 182.)
Appendix 9B
Questionnaire
Name_ Sex F M
State or country-- ESL Experience yrs. mos.
Native language__
Other languages___
Age___
In this experiment, you will rate how well some nonnative speakers read a short prose
passage. In addition, you are asked to identify their native language.
EXAMPLE
NVI (Not very intelligible) 1 2 3 4 5 6 VI (Very intelligible)
UNP (Not pleasant) 1 2 3 4 5 6 P (Pleasant)
UNA (Not acceptable) 1 2 3 4 5 6 A (Acceptable)
NN (Nonnative) 1 2 3 4 5 6 N (Native)
OPS 1 2 3 4 5 6
Language background Ar. Sp. Pr. Am. X
(Ar.--Arabic, Sp.--Spanish, Pr.--Persian, Am.--American, X--Unknown)
Name
1. NVI 1 2 3 4 5 6 VI
UNP 1 2 3 4 5 6 P
UNA 1 2 3 4 5 6 A
NN 1 2 3 4 5 6 N
OPS 1 2 3 4 5 6
LgB Ar. Sp. Pr. Am. X
[Speakers 2 through 15 were rated on the same scales.]
16. NVI 1 2 3 4 5 6 VI
UNP 1 2 3 4 5 6 P
UNA 1 2 3 4 5 6 A
NN 1 2 3 4 5 6 N
OPS 1 2 3 4 5 6
LgB Ar. Sp. Pr. Am. X
Part III Discussion Questions
2. Why might direct tests be considered to have superior validity (even before
any experimental or empirical studies are done)? Are the criteria associated
with judging the directness of a measure more important than its reliability?
How do those criteria relate to other considerations that enter into the empirically demonstrable validity of a proposed test?
4. What kinds of motivations can you posit for the inclusion of topics to elicit
certain tenses and structures in instructions to teams of oral interviewers?
See Appendix 7D.
5. Consider the factor analysis shown in Table 7-5. Try to find a simple expla¬
nation for each of the loadings recorded. Remember that the factors
displayed are mathematically uncorrelated (the technical term is orthogonal).
8. On the basis of the correlations displayed in Table 8-4, compute the average variance overlap among scales. This can be done by computing the squares of the correlations displayed in the table (the squares can easily be displayed in the same table in the lower half of the matrix, below the diagonal). Divide the sum of the squares by the number of entries. The result is the average variance overlap across all five scales. (There is a simpler way to do the computation. Can you see how? Consider the method used in Question 9.)
9. Now, to estimate how much of the reliable variance on the average is common variance, average the reliabilities displayed in Table 8-4 for each of the scales separately. Then average the reliabilities across all the scales. Compare the square of the average reliability against the square of the average common variance. If they are approximately equal, it can be concluded that nearly all the reliable variance is common variance. In other words, to the extent that any single scale is reliably measuring anything, the other scales are measuring the same thing. If you subtract the square of the average reliability from the square of the average variance overlap, the result will be a good estimate of the average variance in all the scales that is unique to each one. (A computational sketch for Questions 8 and 9 follows Question 13 below.)
11. Are there some questions that cannot be answered except on the basis of
untested opinion? If so, what is the difference between empirically vulnerable and nonexperimental issues?
12. Can you think of any explanation for the fact that judges who lack knowledge
of the source language underlying an accent should be better at identifying
the source language for the accent than judges who have studied the source
language in question (see Table 9-6, in particular the figures given for naive
raters)?
13. Compare the correlations reported by Callaway in Table 9-3 with those
reported in Table 8-5 by Mullen. What differences or similarities do you see?
Consider the nature of the scalar judgments required of raters as well as the
pattern of relationships between scales.
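For Questions 8 and 9, the following minimal sketch illustrates the computation; the correlation matrix and the reliability values are hypothetical stand-ins for Table 8-4, which is not reproduced in this excerpt, and the final comparison follows one reading of the procedure described in Question 9.

import numpy as np

# Hypothetical stand-in for Table 8-4: correlations among five scales.
R = np.array([
    [1.00, 0.75, 0.70, 0.72, 0.68],
    [0.75, 1.00, 0.74, 0.71, 0.69],
    [0.70, 0.74, 1.00, 0.73, 0.70],
    [0.72, 0.71, 0.73, 1.00, 0.66],
    [0.68, 0.69, 0.70, 0.66, 1.00],
])
reliabilities = np.array([0.85, 0.82, 0.88, 0.80, 0.84])  # hypothetical

# Question 8: average variance overlap = mean of the squared
# off-diagonal correlations.
iu = np.triu_indices_from(R, k=1)
avg_overlap = (R[iu] ** 2).mean()

# Question 9: compare the squared average reliability with the average
# overlap; if the two are close, nearly all reliable variance is common.
avg_rel_sq = reliabilities.mean() ** 2
print(f"Average variance overlap:    {avg_overlap:.2f}")
print(f"Squared average reliability: {avg_rel_sq:.2f}")
print(f"Estimated unique variance:   {avg_rel_sq - avg_overlap:.2f}")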
Part IV
Can a cloze test be substituted for more complicated ESL testing proce¬
dures without significant loss of information? Which scoring method for the
cloze test will yield the most information? A cloze test in open-ended form
was administered to over 100 incoming foreign students at the Center for
English as a Second Language (CESL) at Southern Illinois University
during the summer of 1976. The Test of English as a Foreign Language
(TOEFL) and the placement examination used at CESL were the written
criterion measures against which the cloze test was evaluated. It was scored
twice: first, responses corresponding exactly to the deleted words were
counted as correct (cloze-exact), and second, responses that were grammati¬
cal and contextually appropriate were counted as correct (cloze-accept¬
able). The data indicate that cloze testing may indeed be a viable alternative
procedure for placement and proficiency testing. It appears, however, that
the two cloze scoring methods are not equally reliable. Apparently the
cloze-acceptable method yields a more accurate assessment of the student’s
ESL ability than the cloze-exact method, particularly at more advanced
levels. This study supports previous research in language testing which
suggests that the cloze procedure is a useful evaluative tool for ESL
specialists.
For nonnative speakers, however, the problem is more complex. The examination procedures that are used to determine
overall ESL proficiency levels are, for the most part, quite involved. The tests are
most often written and are usually composed of a number of subtests, each one
supposedly assessing a different facet of language ability. In general, a composite
score on such an examination is taken as an indicator of the examinee’s overall
ESL proficiency.
The various tests now employed seem to provide a reasonably accurate
indication of English language ability. However, because of the length of the tests
and problems of scheduling, it is often necessary to set aside two or three days at
the beginning of each term of instruction for the purpose of screening new stu¬
dents. If testing time could be reduced and if comparable results could be
obtained, the considerable time and effort saved could be put to better use in the
classroom.
A testing method originally used as a readability index has recently been
looked at seriously as a possible measure of second language proficiency. The
“cloze” procedure, as the method was called by its originator Wilson Taylor,
involves deleting every nth word from a prose passage and asking the person tested
to supply the missing words in the blanks. The procedure is justified on the
assumption that a person who is either a native speaker of the language or a reason¬
ably proficient nonnative speaker should be able to anticipate what words belong
in the blanks given the contextual clues of the passage. A cloze test is easy to
administer and can be scored quickly. Studies to date indicate high correlations between cloze test results and total scores on established ESL proficiency measures. The work of Oller at UCLA and the University of New Mexico, Stubbs and Tucker in Beirut, Krzyzanowski with secondary students in Poland, Hisama at Southern Illinois University, and others has shown positive correlations in the .70s and .80s between cloze test scores and total or composite scores on a variety
of ESL examinations. On the basis of such findings, the possibility of streamlining
both placement procedures at language centers and ESL proficiency evaluation in
general is suggested. The present study was conducted to confirm and extend
earlier results which seem to indicate that cloze testing can be used in lieu of more
complicated testing procedures for measuring overall ESL proficiency.
The two research questions addressed in this paper are: (1) Can a cloze test
be substituted for more complicated ESL testing procedures with substantial
savings of time and effort and without significant loss of information? and
(2) Which cloze test scoring method provides the most information about ESL
ability?
Method
Subjects. Foreign students studying ESL at the Center for English as a Second
Language (henceforth CESL) at Southern Illinois University comprised the population sampled for this study. The subjects included 107 incoming foreign students from a variety of native language backgrounds and with varying degrees of
competence in English. The subjects were the students who arrived at CESL for
two consecutive six-week terms during the summer of 1976.
Testing. Two ESL proficiency tests, the Test of English as a Foreign Language (TOEFL, Educational Testing Service) and the CESL Placement battery, were the criterion measures against which the cloze test was evaluated. Since both of these test batteries are described in earlier chapters (the TOEFL in Chap. 1 and the CESL Placement in Chap. 2), their content is not described further here.
The CESL Placement examination is given to all incoming foreign students.
On the basis of the CESL scores, it is determined which of the students will be
given the TOEFL. Because of the greater difficulty of the TOEFL, only those students scoring within the intermediate and advanced ranges on the CESL Placement are given the TOEFL.
The cloze test constructed for this study was selected on the basis of the
results of pretesting with both native speakers of English and students at each of
the six proficiency levels at CESL. (See Chap. 4 for a description of the proficiency levels.) The passage chosen was 427 words in length. Every seventh word was deleted, up to a total of fifty blanks. The passage was adapted from an intermediate ESL text. It was about different forms of transportation for long-distance traveling, a familiar topic to foreign students studying in the United States. As is customary, a few sentences were left intact at the beginning and at the end of the passage to provide context. The length of the blanks was kept uniform throughout.
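A minimal sketch of this deletion procedure follows. The function name and parameters are illustrative rather than taken from the chapter, and the intact lead-in is counted in words here, whereas the study left whole sentences intact.

def make_cloze(text, nth=7, max_blanks=50, intact_lead_in=25):
    """Delete every nth word after an intact opening stretch, up to
    max_blanks deletions; returns the mutilated passage and the key."""
    words = text.split()
    out, key, since_last = [], [], 0
    for i, word in enumerate(words):
        if i >= intact_lead_in and len(key) < max_blanks:
            since_last += 1
            if since_last == nth:
                key.append(word)
                out.append("(%d) ______" % len(key))  # uniform blank length
                since_last = 0
                continue
        out.append(word)
    return " ".join(out), key  # words after the last blank remain intact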
Students were allowed thirty minutes to complete the test, which was scored twice: first, responses corresponding exactly to the deleted words were counted as correct (cloze-exact), and second, responses that were grammatical and contextually appropriate were counted as correct (cloze-acceptable). Previous research with cloze testing does not indicate definitely which scoring procedure is preferable. Using the exact-word scoring method eliminates the element of subjectivity and should therefore yield a more reliable score. In addition, tests can be scored more quickly using the exact-word method, and this is an important consideration when large numbers of students are being tested. However, whether there is substantial loss of information in substituting the exact-word method for the acceptable-word method is not clear. For this reason, the question concerning scoring procedure was raised.
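The two scoring methods can be sketched as follows; the `acceptable` sets are a stand-in for the human judgment of grammatical and contextual appropriateness described above, and the helper name is hypothetical.

def score_cloze(responses, key, acceptable):
    """Score one student's cloze responses two ways.
    responses: the student's answer for each blank.
    key: the exact deleted words.
    acceptable: per blank, the set of responses judged grammatical and
    contextually appropriate (each set includes the exact word)."""
    exact = sum(r.strip().lower() == k.lower()
                for r, k in zip(responses, key))
    accept = sum(r.strip().lower() in {a.lower() for a in accs}
                 for r, accs in zip(responses, acceptable))
    return exact, accept

# e.g., score_cloze(["car", "go"], ["automobile", "go"],
#                   [{"automobile", "car"}, {"go", "travel"}]) -> (1, 2)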
Table 10-1 Summary Statistics for the Cloze Tests and the Criterion Measures

Measure                          Possible score   Mean    SD      N     KR 20 reliability
Cloze-exact                      50               11.9    2.08    107   .61
Cloze-acceptable                 50               15.3    7.30    107   .85
Total CESL Placement             300/3 = 100      50.8    16.23   107   --
CESL Listening Comprehension     100              50.4    18.50   107   .70-.87
CESL Structure                   100              50.4    20.80   107   .79-.92
CESL Reading                     100              51.3    16.01   107   --
Total TOEFL                      ca. 700          422.1   56.06   52    .965
TOEFL Listening Comprehension    70               43.7    6.88    52    .899
TOEFL Structure                  70               44.8    6.94    52    .864
TOEFL Vocabulary                 70               39.8    7.27    52    .892
TOEFL Reading                    70               42.9    7.52    52    .841
TOEFL Writing                    70               39.8    6.92    52    .855
Table 10-2 Correlations for Cloze Tests with Total and Subtests
of the Criterion Measures
Cloze-exact Cloze-acceptable N
Total CESL Placement .80 .84 107
CESL Listening Comprehension .71 .73 107
CESL Structure .63 .69 107
CESL Reading .80 .80 107
Total TOEFL .71 .79 52
TOEFL Listening Comprehension .47 .51 52
TOEFL Structure .51 .58 52
TOEFL Vocabulary .59 .62 52
TOEFL Reading .68 .77 52
TOEFL Writing .55 .64 52
The cloze means suggest that the group as a whole was performing at about an intermediate level. Thirty-one of the 107 subjects tested actually were placed in classes at that level. The reliability for the CELT subtests on the CESL Placement is quite good. The range of the reliability coefficient on the Listening Comprehension test is .70 to .87, and the range for the Structure test is .79 to .92. No reliability statistics are available for the Reading for Understanding (CESL Reading) test, but based on the observed reliability coefficients for the other CESL subtests, a conservative estimate of the reliability of the CESL Total score would be about .80.
The mean for the total TOEFL was 422.1, which falls about one standard
deviation below the average for the reference population which took that test in
1964. It should be pointed out that the statistics for the TOEFL scores were
computed for only 52 subjects (since only those who placed at level three or above
were given the TOEFL), while the computations for the other tests included all
107 subjects. The reliability for the total TOEFL is .965.
To check the validity of the cloze scores against the established criterion measures, simple correlations were computed for all the scores on the TOEFL and CESL Placement with the cloze-exact and cloze-acceptable scores. Table 10-2 provides these figures.
The cloze scores and total scores on the TOEFL and CESL Placement tests were strongly correlated regardless of the method used to score the cloze test. Cloze-exact correlated with the total TOEFL at .71 and with the total score on the
CESL Placement battery at .80. Cloze-acceptable correlated with the same two
measures at .79 and .84, respectively. The correlations were all significant beyond
the .005 probability level, and indicate substantial variance overlap among the
tests. Indeed, it seems safe to conclude that probably all the reliable variance in
the cloze scores is present in the total scores on both the proficiency batteries.
However, the converse is not true. Some of the reliable variance in the TOEFL is
not generated by the cloze scores. The best explanation for this is probably the
difficulty of the cloze text, which would tend to depress its overall variance and
hence its reliable variance. This is a problem that can be corrected by lengthening
the cloze test and/or making it easier. Therefore, it is concluded that cloze testing
may indeed be a viable alternative procedure for ESL placement and proficiency
testing.
Previous research with cloze testing has shown that cloze tests tend to correlate best with those subtests of ESL proficiency measures that are highly
integrative in nature—subtests aimed at reading comprehension, listening skill,
and dictation. It is suggested that this trend may indicate that cloze testing is an
effective procedure for measuring overall ESL proficiency rather than some more
narrowly defined facet of language ability. This claim is supported repeatedly in
this volume (especially in Chaps. 1, 2, 4, 7, 13, and 20; also see Oller and Perkins,
1978). In the present study, there were moderate to high correlations between
cloze test scores and scores on all the subtests. The highest correlations were with
the subtests of the CESL Placement battery. The correlation of cloze-exact with
the CESL reading test was as high as with the total CESL Placement score. Cloze-
exact correlated with both the total CESL Placement and its Reading subtest at
.80. These correlations are both at about the maximum level that could be
expected in view of the estimated reliabilities of the tests. Taken together with the results of Hisama (Chap. 4; see especially her factor results), these findings indicate that the CESL Reading subtest may be a satisfactory placement test by
itself. Of course, if sufficient reliability is obtained, the same could be said for
nearly any one of the subtests studied. The advantage of the cloze procedure is the
relative ease of test construction.
It is interesting that both the cloze-exact scores and the cloze-acceptable
scores correlated more highly with the reading subtest of the TOEFL and CESL
Placement examination than with any of the other subtests. This is not surprising
in light of earlier work with native speakers where cloze tests have consistently
been found to be highly reliable and valid measures of reading ability. Indeed, the
high correlations between the cloze test scores and reading subtests of the TOEFL
and CESL Placement examination could be a function of method. That is, cloze
tests may correlate highly with reading tests because basically the same method of
measurement is used. However, this explanation would not fit recent findings
showing that cloze scores are also excellent predictors of variance in nonverbal IQ
tests (see Chap. 3 and Stump, 1978). Nor does it fit the fact that cloze scores are
known to be very good predictors of achievement scores on a wide range of other
tests (Stump, 1978, and Streiff, 1978). In terms of validity, the cloze task and the
reading subtests all involve reading performance. Nevertheless, taking a reading
comprehension test is an integrative task, and therefore, the conclusion of previ¬
ous research that cloze tests tend to correlate highly with tests that are integrative
in nature is also supported.
Which scoring method for the cloze test yields the most accurate informa¬
tion about the student’s ESL ability? As one would expect, the mean was higher on
the cloze tests for the acceptable-word scoring method than for the exact-word
method. The means were 15.3 (cloze-acceptable) and 11.9 (cloze-exact) out of a
possible 50. The ranges were 0 to 45 for the acceptable-word scoring method and
0 to 32 for the exact-word scoring method. With the acceptable-word method, the
average percent of correct responses was 31% and with the exact-word method
24%. While the scores obtained by the acceptable-word method were higher, the
performance pattern on the test was virtually the same. That is, if the subjects were
rank ordered according to both sets of scores, there would be few differences in
rankings. Furthermore, the cloze-exact responses and the cloze-acceptable
responses correlated at .97 (p < .001), indicating a nearly total overlap in variance
(94%). These data suggest that there is little appreciable difference in informa¬
tion provided by the different scoring methods.
However, a rather different conclusion is suggested by the standard deviations and the reliability coefficients. The standard deviations for the two scoring methods differed by several points: 2.08 for exact-word and 7.30 for acceptable-word. If we square these values in order to compare them, the acceptable-word scoring method generates about 13 times as much raw variance across learners as the exact-word method (53/4 ≈ 13). This is not to say that it contains 13 times as
much information, however, as other factors enter in.
When subjects are homogeneous in ability for the skill being tested, a small amount of variance is desirable and would indicate precision in the testing instrument. However, in a situation where students with widely varying language ability are tested, the amount of variance (the square root of which is the standard deviation) must be viewed differently. A low standard deviation in this case might indicate that the testing instrument does not discriminate levels because it is either so difficult that the beginning, intermediate, and some of the advanced students all perform poorly, or so easy that the majority get almost everything correct. When a wide range of performance is expected, the standard deviation should generally be higher than with a homogeneous group of subjects.
In the present study, the subjects represented the full range of ESL proficiency tested at CESL, from those considered beginners to those considered proficient enough to enter the university. Apparently the exact-word scoring method does not discriminate among levels to the extent the acceptable-word method does.
The reliability coefficients obtained for the two scoring procedures suggest that the acceptable-word scoring method yields more reliable scores. The coefficients are .61 for the exact-word scoring method and .85 for the acceptable-word scoring method. The differences in the coefficients are largely a function of the amount of variability in the test scores (and of the way the reliabilities are estimated). The difference in the standard deviations given above illustrates the considerable gain in variance with the acceptable-word scoring method. By the KR 20 estimate, the difference in reliability between the two scoring methods is about 35% (.85² - .61² ≈ .35). However, the difference in actuality is probably not so great, since the correlation between the two scoring methods (another way of estimating reliability) indicates that only 6% of the variance in the cloze-acceptable and cloze-exact scores is not common variance. Further, the observed correlations in Table 10-2 also suggest that the KR 20 estimate of reliability for cloze-exact is a bit too conservative.
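For reference, here is a minimal sketch of the KR 20 computation behind these coefficients; the simulated response matrix is hypothetical, built so that the items hang together and the coefficient comes out well above zero.

import numpy as np

def kr20(responses):
    """KR 20 reliability for a subjects-by-items matrix of 0/1 scores."""
    k = responses.shape[1]
    p = responses.mean(axis=0)                     # item difficulties
    total_var = responses.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - (p * (1 - p)).sum() / total_var)

# Simulated data: 107 subjects x 50 items with a common ability factor.
rng = np.random.default_rng(0)
ability = rng.normal(size=(107, 1))
difficulty = rng.normal(size=(1, 50))
scores = (ability - difficulty + rng.normal(size=(107, 50)) > 0).astype(int)
print(f"KR 20 = {kr20(scores):.2f}")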
Nevertheless, since the exact-word scoring method does not allow for alternative answers, the observed variance is somewhat suppressed, and the full range of ability levels of the students may therefore not surface in the test results as well as it does with the acceptable-word method. It would appear, then, on the basis of the standard deviations and reliability coefficients, that the acceptable-word scoring method provides more accurate information about ESL proficiency levels.
In an attempt to help clarify the issue, the correlations obtained for the two scoring methods with the total CESL Placement and the total TOEFL were tested to see if they were significantly different. The correlations for cloze-exact and cloze-acceptable with CESL Placement were .80 and .84, respectively. These correlations did not prove to be significantly different (by a t-test). This indicates that the scores obtained by both scoring methods provide essentially the same information about student performance on the CESL Placement examination.
The correlations obtained between cloze-exact and cloze-acceptable with the total TOEFL were .71 and .79, respectively. These correlations were significantly different (p < .05). Apparently, with regard to performance on the TOEFL, more information is provided by the acceptable-word scoring method. One possible explanation may be that a high-level test such as the TOEFL makes a somewhat finer discrimination at the upper end of the proficiency continuum than the CESL Placement; so with the cloze procedure, the scoring method that allows for finer discrimination among the more advanced students will provide more information about the language ability of those students.
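The chapter does not name the t-test variant used; one standard choice for two correlations that share a variable and a sample is Hotelling's (1940) test, sketched here with the TOEFL figures above and the .97 correlation between the two scoring methods reported earlier. The function is an assumption, not the authors' documented procedure.

import math

def hotelling_t(r12, r13, r23, n):
    """Hotelling's t for the difference between dependent correlations
    r12 and r13 (both involving variable 1), with n - 3 degrees of freedom."""
    det = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23
    return (r12 - r13) * math.sqrt((n - 3) * (1 + r23) / (2 * det))

# cloze-acceptable (r = .79) vs. cloze-exact (r = .71) with total TOEFL, n = 52:
t = hotelling_t(0.79, 0.71, 0.97, 52)
print(f"t = {t:.2f} with {52 - 3} df")  # about 4.0, p < .05, as reported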
In conclusion, the high positive correlations between cloze test scores and scores on established criterion measures suggest that the concurrent validity of the cloze test used in this study would warrant cautious application of the cloze procedure for placement purposes. The results indicate that, for CESL students at SIU, a somewhat easier text should be selected than the one used in this study. The data show that the acceptable-word method is probably the more reliable of the two scoring methods studied. The exact-word method is the preferred grading procedure in practical terms, but the information that it yields about the student’s ESL ability does not seem to reflect the student’s language competence as well as the information yielded by the acceptable-word method.
Note
1. This is a revised and slightly expanded version of a paper presented at the annual meeting of the
Linguistic Society of America in Philadelphia, Dec. 30, 1976. The study was part of the author’s
doctoral dissertation at Southern Illinois University in Carbondale. The author is indebted to Richard
Daesch and the staff at the Center for English as a Second Language, Southern Illinois University, for
permission to conduct the research there. Special thanks are expressed to Dr. Dorothy Higginbotham and Dr. Paula Woehlke for helpful comments and suggestions at every stage of the study.
Chapter
Hinofotis and Snow
In the majority of cloze testing research to date, two different scoring methods have been employed—the exact-word method and the acceptable-word
method. The exact-word method involves counting as correct only the words
which were actually deleted. The acceptable-word method allows any response
that is grammatically and contextually appropriate. Using the acceptable-word
scoring method can be time-consuming, however, and an element of subjectivity is
introduced. Sometimes a word may make sense in the sentence in which it appears
but not in the passage as a whole. While scores in general with the acceptable-word
method tend to be higher, the performance pattern on the test is usually the same.
Method
Subjects. Sixty-six incoming foreign students for the Oct. 4, 1976, term at CESL took the cloze test as a part of their placement examination. Language backgrounds of subjects included Arabic, Japanese, Farsi, French, Spanish, Chinese,
and Vietnamese.
Testing. The same 427-word passage used by Hinofotis (Chap. 10) was converted into a 50-item MC test. Each MC item was based largely on responses observed in the previous administration of the same text in open-ended form to 107 incoming foreign students during the summer of 1976 at the Center for English as a Second Language, Southern Illinois University (Chap. 10). The most
frequent incorrect responses obtained from the open-ended version of the test
were used as distractors for the MC tests. High-frequency responses which were
grammatical in terms of short-range phrase structure but which were not
contextually acceptable in relation to the full text were not used.
Two test forms were then constructed as follows: on form A, the first 25 items were MC and the second 25 were open-ended. On form B, the order was reversed. Tests were distributed so that every other student took form A while the rest were taking form B. This procedure had three desirable effects: first, it counterbalanced any order effect of MC versus open-ended format; second, it systematically randomized the selection of students who took form A or form B; and third, it tended to reduce the possibility of test compromise.
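A minimal sketch of the distractor-selection step described above follows; the function, the sample responses, and the set of context-inappropriate words are hypothetical.

from collections import Counter

def build_mc_options(correct, open_ended_responses, contextually_bad,
                     n_distractors=3):
    """Choose the most frequent wrong open-ended responses as distractors,
    skipping ones that fit the sentence but not the passage as a whole."""
    counts = Counter(r.lower() for r in open_ended_responses)
    distractors = []
    for word, _ in counts.most_common():
        if word != correct.lower() and word not in contextually_bad:
            distractors.append(word)
            if len(distractors) == n_distractors:
                break
    return [correct] + distractors  # shuffle before printing the test

# e.g., build_mc_options("train", ["bus", "train", "car", "it", "bus", "plane"],
#                        contextually_bad={"it"}) -> ["train", "bus", "car", "plane"]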
Because the open-ended and MC items were part of the same passage, we
made the directions for the two parts of the passage as similar as possible. For the
open-ended items the student was asked to follow the standard fill-in-the-blank
procedure. For the MC items, the four alternatives were listed in the right-hand margin, and the student was instructed to write the best choice in the blank.
Subjects were allowed 30 minutes to complete either form A or B.
The CESL Placement battery (described in Chap. 2) was the criterion against
which the cloze tests were evaluated. Statistics were computed over the entire
group of 66 subjects and separately for the 33 subjects taking form A and the 33
subjects taking form B.
[Table 11-1: means and SDs for form A (multiple choice followed by open-ended), form B (open-ended followed by multiple choice), and all subjects combined; the table body was not recovered.]
The differences are not, in fact, statistically significant at the .05 probability level. This suggests that the open-ended and MC tests are providing similar information.
In the study by Hinofotis (Chap. 10), the correlations between the cloze
scores and the CESL Placement Total were substantially higher (.80 exact, .84
acceptable) than those shown for cloze-exact and cloze-acceptable with the CESL
Placement Total in Table 11-2 (.71 and .74). However, the corresponding correlations obtained for the subjects taking form A of the present test were .83 and .80,
as shown in Table 11-3. The corresponding correlations for the subjects taking
form B were considerably lower, as shown in Table 11-4 (.61 and .69). However,
the correlations between MC form A and MC form B with the CESL Placement
Total were nearly identical (.63 and .65, respectively).
The item facility indices (not tabulated here) revealed that in general the MC items were easier than the open-ended items, as should be expected. Few of the items on either the open-ended or MC tests, however, approached the 85% facility level, but a number of them came close to the 15% facility level, indicating that they were quite difficult for this sample of students. This is consistent with the previously reported findings of Hinofotis (Chap. 10). The discrimination was better
with the open-ended tasks than with the MC items. This may be due to the fact that
in taking an open-ended cloze test the student has to generate his own alternatives
much as he would in normal communication (allowing, as Hisama noted in Chap.
4, less chance for correct guessing).
In conclusion, the MC cloze technique seems to have good potential as an ESL proficiency measure. A caution to keep in mind is that constructing an MC cloze test is a considerably more complicated procedure than constructing an open-ended cloze test. An MC version requires pretesting with an open-ended task (or some other system) to obtain distractors. It is then necessary to pretest the MC version to check item facility and discrimination. Items which do not discriminate well should be eliminated or modified. However, once a reliable MC format is obtained, the time involved in administering and scoring the test should be less. An MC cloze test can be easily hand-checked or computer-scored. The
results of this study, though not conclusive, suggest that MC cloze tests have some
promise. To compensate for the noise factor created by the greater likelihood of
correct guesses on MC cloze tests than on open-ended versions, test length will
probably have to be increased in order to achieve the desired reliability levels,
and/or item facility and item discrimination requirements will have to be more
rigorously monitored during test development phases.2
Notes
1. This is a revised version of a paper presented at the Midwestern Regional NAFSA Meeting in
Chicago in November 1976. The authors wish to express their thanks to William Flick for reading and
commenting on an earlier draft of the paper.
2. Further research with a 50-item MC cloze test is presently underway at UCLA and Southern Illinois University with native and nonnative speakers of English. The depressed variability in the sample used in this study prevents us from concluding that an MC cloze test cannot be substituted for a
more complicated testing procedure. Indeed, in spite of the generally low overall proficiency in our
sample of subjects, there is substantial reason for optimism concerning MC cloze testing.
Chapter
Naomi Doerr
In order to test the hypothesis that the foreign language learner will tend to
block out or alter material with which he disagrees, passages which represented positive and negative sides of three controversial issues were converted into cloze tests and were administered to the same 182 subjects referred to in Chap. 2 as part of the CESL testing project. About 100 subjects also participated in an oral interview where they were asked to indicate their views (pro, con, or neutral) on each of the three issues. Subjects indicating neutrality were not included in the study. These self-expressed attitudes were used to determine composite agree and disagree scores for the 88 nonneutral subjects who completed cloze tests over pro and con texts on each of the three topics. The agree score for each subject was the sum of scores on texts with which the subject had indirectly indicated agreement, and the disagree score was a similar sum over the remaining texts. The composite agree and disagree scores as well as composite pro and con scores (i.e., the sum of scores on pro and the sum of scores on con texts, respectively) were contrasted by two-tailed t-tests and were also tested for correlation. Correlations were .91 for agree and disagree scores and .90 for pro and con. The contrasts were significant in both cases (p < .03), but the agree scores, contrary to prediction, were lower rather than higher; and although no contrast was predicted on pro and con scores, the pro texts proved to be easier. From these data it was deduced that if attitudes are a contributing factor in the foreign language learner’s comprehension of a message with which he disagrees, their effects are probably not significant and almost certainly not very strong.
Method
Subjects. Initially all 182 subjects enrolled at CESL during the first term in the spring of 1977 were tested (see Chap. 2 for more descriptive data). However, a
much smaller number completed the interviewing and all six of the tests described
below. In all, 88 subjects (mostly drawn from the higher end of the proficiency
range) completed the interview and the tests, and also indicated nonneutral
feelings on the three controversial topics selected for study.
Testing. For each of three controversial topics (the legalization of marijuana, the retention of capital punishment, and the morality of abortion), two 100-word passages, a pro passage (supporting the positive side of the issue) and a con passage (supporting the negative side), were selected for use in the experiment. An attempt was made to structure pro and con passages on the same topic to be of equivalent difficulty. To ensure maximum spread of subjects, each pair of texts became progressively more difficult: the marijuana texts were easy; the capital punishment texts, somewhat more difficult; and the abortion texts, more difficult still (see Appendix 12A for complete versions of the texts).
The first sentence of each passage was left intact; every fifth word was then
deleted until a total of twelve blanks were inserted. Following the twelfth blank,
the remainder of the passage remained unmutilated. The position of each deletion
was signified by a numbered blank of standard length. The subjects were
instructed to fill in the missing words. Cloze scores were determined by counting
the number of words restored exactly as they had appeared in the original text plus
the number of other acceptable words inserted, e.g., synonyms.
Approximately one week before the written testing was done, oral interviews
assessing attitudes toward marijuana, capital punishment, and abortion were
completed. Each subject indicated how much he agreed or disagreed with the
following statements (see Appendix 12B for a complete description of the Oral Interview Attitude Survey, where the following questions appear as items 7a, 7b, and 7c):
Score       Mean      SD
agree       7.6818    6.634
disagree    8.3750    7.300
t value = -2.18; two-tailed probability = 0.032
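A minimal sketch of the contrast reported above follows; the score vectors are simulated stand-ins, since the raw data are not reproduced here, and they are generated to be correlated with the disagree scores slightly higher, as observed.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
agree = rng.normal(7.7, 6.6, size=88)             # hypothetical composite scores
disagree = agree + rng.normal(0.7, 2.8, size=88)  # correlated, slightly higher

t, p = stats.ttest_rel(agree, disagree)           # two-tailed paired t-test
print(f"t = {t:.2f}, two-tailed p = {p:.3f}")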
Note
1. This control could be effective only to the extent that the various pairs of texts (pro and con) were
actually of equivalent difficulty. In fact, they turned out not to be of quite equivalent difficulty. This is a
confounding factor further complicated by the fact that the composite agree and disagree scores could
be constituted by any combination of pro and con scores over the pairs of texts.
Appendix 12A
Name_Instructor_
Course Level_Section
Form A
DIRECTIONS: You will read six short paragraphs. Some of the words have
been left out See if you can fill them in. Read all of each para¬
graph before you try to fill in the missing words.
You should fill the blank with the word tree. Other words that
also fit are fence, wall, door, street, branch, and so on. It is all
right to guess if you are not sure. Try to use only one word for
each blank.
[Pro Marijuana]
Mr. Sam Brown says “millions of people use and enjoy marijuana. Marijuana helps people forget (1) ______ problems of everyday life. (2) ______ in a modern society (3) ______ sometimes very difficult. People (4) ______ often unhappy with life. (5) ______ when people smoke marijuana (6) ______ feel very calm and (7) ______. They think that life (8) ______ pleasant. Marijuana is being (9) ______ by many groups. Some (10) ______ these groups say that (11) ______ is not dangerous to (12) ______. Marijuana is becoming more and more common. People of all ages, professions, and nationalities use marijuana to make their lives a little more enjoyable.”
[Con Marijuana]
Dr. Jane Wilson says “marijuana is a very dangerous drug. Scientists have studied
marijuana (13) ______ many years. Their studies (14) ______ us that marijuana has (15) ______ effects on people. For (16) ______, marijuana causes changes in (17) ______ emotions. It can make (18) ______ person feel afraid, or (19) ______, or very sad. Marijuana (20) ______ causes people to forget (21) ______. Because of this, smokers (22) ______ marijuana often have problems (23) ______ their jobs or at (24) ______. Finally, marijuana makes people feel very tired. They have no energy to do even the most important jobs.”
[Con Capital Punishment]
According to lawyer Tom Jones, “the death penalty is not justice; it is more an act of hate. Punishment cannot be considered (37) ______ real goal of criminal (38) ______. Yet simple punishment is (39) ______ of the most often (40) ______ arguments for the death (41) ______. A responsible society wants (42) ______, not revenge. It does (43) ______ need to get revenge (44) ______ itself, nor should it (45) ______ to do so. Punishment (46) ______ punishment’s own sake is (47) ______ justice. The death penalty (48) ______ not only unnecessary and useless, but it is also crude and brutal. Capital punishment has no place in a civilized society that does not require the taking of a human life for its safety and welfare.”
[Pro Abortion]
Dr. George Smith of the Malaga Medical School says that “abortion is one of the
most beneficial developments of modern medicine. A simple and safe (49) ______, abortion is recognized all (50) ______ the world for its (51) ______ on the lives of (52) ______ of individuals, as well (53) ______ its sociological implications. Abortion (54) ______ freed women from unwanted (55) ______. People who cannot (56) ______ support a child, or (57) ______ are not psychologically prepared (58) ______ become parents can turn (59) ______ abortion as a practical (60) ______ realistic solution to their dilemma. At the sociological level, abortion is useful as a population control device.”
[Con Abortion]
Dr. Harry White asks and answers the question: “What is an abortion? It is the killing of a distinct, irreplaceable, unique human being. At best, it is (61) ______
Appendix 12B
Oral Interview Attitude Survey
Demographic information:
1. Name_
Last (family name), First (given name)
2. Native language_
(the language you spoke most in your home country)
3. Country of origin_
(where you grew up)
7. Most people would disagree strongly with the statement that “Being
unfriendly is a desirable human quality.” But they would agree strongly
Doerr: Effects of agreement/disagreement 141
with the statement “Being friendly is a desirable trait.” How do you feel
about the following statements? Indicate how much you agree or disagree
by saying (1) strongly agree, (2) agree mostly, (3) don’t care one way or the
other, (4) disagree mostly, or (5) disagree strongly. (Interviewer should
prompt to try to get the subject’s actual opinion—but the interviewer
should not try to influence the subject’s opinion at all!)
Behavior Patterns:
8. Who do you live with? People who speak your native language or English-
speaking people? In other words, what language do you speak most of the
time when you are at your dormitory or apartment?
(Interviewer should place the subject on the following scale.)
9. About how many pages of text (typewritten pages as a basis for estimating
length—double-spaced) do you write each semester?
Chapter
Perkins and Pharis
Method
Subjects. We tested students at levels 5 and 6 in three different batches (Group 1 contained 23 subjects, Group 2 contained 47, and Group 3 contained 40) at the Center for English as a Second Language (CESL) at Southern Illinois University—Carbondale in the fall of 1976 and spring of 1977. At CESL, the TOEFL is given to all students at level 3 on up (see Chap. 4 for a description of the levels and how they are set up). On the basis of TOEFL, Michigan, and CESL Placement scores, students are eventually recommended for full- or part-time university study at the graduate or undergraduate level. Groups 1, 2, and 3 each took different standardized reading tests, as is indicated below.
Testing. The procedures for gathering the data were as follows. We administered each standardized reading test to the foreign students enrolled in levels 5
and 6 at CESL approximately one week before the TOEFL was administered in
each of the three cases.
In the first phase of our data gathering (with Group 1, N = 23), we used the
Nelson-Denny Reading Test, Form D (ND). This test was designed for use in
grades 9 through 16, and it contains subtests aimed at Vocabulary, Comprehension, and Reading Rate. In our studies, we did not use the Reading Rate data. There are 100 Vocabulary items and 36 Comprehension items. The Comprehension score is given double weight, thus giving 172 total points. The normal working time is 30 minutes. The ND was administered to Group 1 in mid-December, 1976.
In the second phase of our data gathering (with Group 2, N = 47), we used the Iowa Silent Reading Tests (ISRT). We chose Level 2 of the ISRT, which is
intended for grades 9 through 14, with norms differentiated according to post-
high-school plans, and we administered only two of the subtests—Vocabulary and
Comprehension. The Vocabulary section consists of 50 items which are said to
survey the depth, breadth, and precision of the student’s general reading vocabu¬
lary. There are four multiple-choice answers to each question, from which the
student selects the nearest synonym for a given word.
The Comprehension section consists of 50 items which are supposed to
measure the student’s ability to comprehend literal detail, to reason in reading,
and to evaluate what he has read. Part A includes 38 questions based on six short
passages. In this section, the reader is allowed to look back at the passages while
answering the questions. Part B consists of 12 passages which are supposed to test
the student’s short-term recall. The student is required to answer one question
about each passage without looking back. The normal working time is 54 minutes.
The ISRT was administered to Group 2 from February to May 1977. Summary test
statistics for the ISRT and TOEFL are reported below in Table 13-2.
Our colleague, Douglas Flahive (1977), collected the data for the third phase
of our research (with Group 3). He used the McGraw-Hill Basic Skills System Reading Test. This battery includes a Reading Comprehension section (MHRC) with 10 short passages followed by a total of 30 questions testing for main ideas and supporting details. The MHRC allows 40 minutes of working time. He correlated the MHRC with the TOEFL Reading subscore. In the two earlier phases, the
TOEFL total score was used in addition to the Reading subscore.
What we were interested in throughout was the power of the TOEFL scores as
predictors of whatever is measured in standardized reading tests intended for
native speakers of English. In the single predictor case, such as the three cases we
are interested in here, the regression analysis is a simple correlation problem.
Hence, for all three groups of subjects, this is the statistical method we applied.
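In the single-predictor case the computation reduces to the Pearson product-moment correlation, with r² giving the variance overlap discussed below. A minimal sketch follows; the scores are hypothetical.

import numpy as np

def pearson_r(x, y):
    """Pearson product-moment correlation between two score lists."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xd, yd = x - x.mean(), y - y.mean()
    return (xd * yd).sum() / np.sqrt((xd**2).sum() * (yd**2).sum())

# Hypothetical TOEFL totals and standardized reading scores:
toefl = [410, 455, 392, 470, 430, 505]
reading = [22, 31, 18, 35, 24, 40]
r = pearson_r(toefl, reading)
print(f"r = {r:.2f}; variance overlap = {r*r:.0%}")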
Summary test statistics for the ND and TOEFL (Group 1) are given in Table 13-1.
The correlation of .49 between the TOEFL and the ND is significant at p < .05.
Oddly, the TOEFL Reading subscore correlates with the ND at only .15 (p > .05).
Thus the overlap in variance between the ND and TOEFL Total score is only about
24% (somewhat less than we might have expected). Perhaps the explanation has to
do with the difficulty of the ND for the nonnative subjects, which tends to depress
the variance in their ND scores. They ranged from the 1st to the 27th percentile
when compared against the norms for English-speaking 12th graders on the ND.
Summary data for the second phase (Group 2) are given in Table 13-2. Again
the correlation between the TOEFL Total score and the reading measure (ISRT in
this case) is positive and significant (p < .01). The variance overlap is slightly less
(21%). As before, however, the total possible variance is restricted by the low
scores obtained by nonnatives. They ranged from the 1st to the 16th percentile
when compared against a population of 12th grade English-speaking natives
bound for four-year colleges. Again the correlation between the subscores aimed specifically at reading comprehension in the two test batteries is somewhat lower than for the total scores (.23, p > .05). We will return below to the question of why
this anomalous result repeats itself.
Data from the third phase (Group 3) are reported in Table 13-3. Here, Flahive (our collaborator; see his write-up, 1977) used only the subscore of the TOEFL aimed at Reading, and the MHRC score. The correlation was surprisingly high, .91, revealing a variance overlap of 83%. His students seem to have been more advanced, ranking in the 10th percentile for English-speaking freshmen at four-year colleges. Correspondingly, his students also achieved substantially higher scores on the TOEFL than those in Groups 1 and 2.
As expected, the three studies taken together seem to suggest a moderate to
strong relationship between TOEFL scores and scores on standardized reading
Table 13-1  Summary Test Statistics for the ND and TOEFL (Group 1)

                         Mean       SD       Range
TOEFL                     426      40.60    344-520
ND (172)                   26      10.85       9-48

KR-20 reliability (for the standardized reading test): .87
Pearson product-moment correlations:
    ND and TOEFL Total     .49*
    ND and TOEFL Reading   .15
Percentile rank with norm reference group, Grade 12: 1st to 27th percentile
*p < .05.
Table 13-2  Summary Test Statistics for the ISRT and TOEFL (Group 2)

                         Mean       SD       Range
TOEFL                     461      43.72    398-550
ISRT (100)                 35      10.48       9-65

KR-20 reliability (for the standardized reading test): .82
Pearson product-moment correlations:
    ISRT and TOEFL                                   .46*
    ISRT Reading Comprehension and TOEFL Reading     .23
Percentile rank with norm reference group, Grade 12 bound for 4-year colleges: 1st to 16th percentile
*p < .01.
Table 13-3  Summary Test Statistics for the MHRC and TOEFL (Group 3)

                         Mean       SD       Range
TOEFL                     489      43.70    366-578
M-H (40)                 12.5       4.70       4-24

KR-20 reliability (for the standardized reading test): .87
Spearman rank-order correlation:
    M-H with TOEFL Reading     .91*
Percentile rank with norm reference group: mean score at the 10th percentile of four-year college freshmen
*p < .01.
tests intended for native-speaking populations. We can conceive of two plausible explanations for the fact that in two cases the subscores aimed at reading comprehension from the TOEFL and the corresponding standardized test failed to correlate significantly. The first explanation has to do with the fact that the shorter test provides less possibility of variance in scores, hence the lower correlation for the subtests than for the total scores. The second possibility is that the standardized reading tests are actually measuring what is termed “power” for natives and “speed” for nonnatives. However, positing separate constructs does not fit well with the data from the third phase (Table 13-3), where it is clear that the standardized reading measure and the TOEFL Reading subtest are measuring substantially the same thing (nor would this second interpretation fit with many of the earlier findings in this volume).
Finally, we pose two questions: What can be said about grade level equivalences for nonnative college populations in comparison with their native English-speaking competitors? And how is it that so many foreign students are able to succeed with such minimal reading ability in English? By all accounts it seems that our advanced ESL students at CESL/SIU (and probably at centers like ours all over the United States) are well below average college freshmen in reading ability. The standardized tests used here could certainly be applied to get an idea of just how far below our students are (this in spite of the fact that the tests are too difficult for many of our students). How then do foreign students ever make it through courses of study that most assuredly require a great deal of reading with comprehension in English? We can offer two tentative reasons. First, we know that university students will be taking courses that require reading; therefore, practice may compensate for the lack of requisite skills at the beginning of university studies. Second, we believe that although the ESL student’s surface English machinery (grammar, in the more traditional sense of the term) may not be as well developed as that of a native speaker, the deep cognitive machinery is probably as well developed as that of English-speaking competitors. This deep conceptual ability may help to compensate for the lack of surface skill in English.
Note
1. A version of this paper was presented at the NAFSA meeting in New Orleans, 1977, and also appeared in B. J. Robinett (ed.), 1976-77 Papers in ESL: Selected Conference Papers of the Association of Teachers of English as a Second Language. Washington, D.C.: NAFSA, 13-15.
Part IV Discussion Questions
1. Consider examination procedures used in any program with which you may
be familiar. Is multiple measurement with several subtests depended on?
How much time in person-hours is spent in preparing, administering, and
scoring the tests? Discuss the possibility of using a simpler testing procedure.
How could you determine whether there was any significant loss of informa¬
tion if a simpler procedure were proposed?
3. On the basis of what sort of theoretical reasoning are multiple tests proposed as a basis for ESL placement and proficiency testing? What kinds of assumptions are made concerning the nature of language proficiency? Compare the
tions are made concerning the nature of language proficiency? Compare the
CESL Placement battery (see the description of subparts in Chap. 2) with
the TOEFL (see Chap. 1 for a list of the subparts). Better yet, take two such
tests (or more) and compare them point by point to see how much agreement
exists between the experts who make up such tests.
5. Reflect over your own personal experience with statements that you find
particularly agreeable or disagreeable. Is it sometimes hard to understand a
view that you disagree with? Easier to misunderstand a view that is radically
opposed to your own? Discuss these reflections in relation to the findings of
Doerr. What would all this imply for the selection of ESL or FL teaching
materials? Testing?
7. Discuss the problem of determining the difficulty levels of texts. Can you
conceive of a way to counterbalance Doerr's design more effectively to
eliminate any possibility of contamination due to differences in difficulty
levels across pro and con texts?
10. What is the usual effect of lengthening a test (say, doubling the number of items) on its total variance? Its reliability? If the total variance in scores on the TOEFL is best explained by a single factor (i.e., global language proficiency, as suggested in Chap. 1), then what would be the effect on test reliability if the length of the test were cut to one-fifth of its total length? This is essentially what is done if the Reading subtest is used by itself instead of the total score in at least one of the correlations discussed in Tables 13-1, 13-2, and 13-3. What would be the predictable effect on correlations with standardized reading tests? (A worked projection is sketched below.)
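One standard way to reason about question 10 is the Spearman-Brown projection. The formula is not named in the question itself, and the full-length reliability of .90 below is a hypothetical value chosen only to make the arithmetic concrete:

    # Spearman-Brown projection: reliability of a test whose length is
    # changed by a factor k (k = 1/5 for a Reading subtest used alone).
    # rho = .90 is a hypothetical full-length reliability.
    def spearman_brown(rho, k):
        return (k * rho) / (1 + (k - 1) * rho)

    print(round(spearman_brown(0.90, 1 / 5), 2))  # 0.64: much less reliable

Since correlations are attenuated by unreliability, the projected drop in reliability would also depress correlations with the standardized reading tests.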
Part V

Chapter 14
Celeste M. Kaczmarek
There has been much debate over the utility of essays as tests of language
proficiency and even as tests of writing skill per se. This chapter reports a
study of two writing tasks and two scoring methods for each. The results
show that subjective methods of evaluating essays work about as well as
objective scoring techniques and are strongly correlated with other
measures of ESL proficiency which have independent claims to validity.
Furthermore, it is demonstrated that teacher judgments of a subjective sort have substantial reliability and are strongly correlated with similar judgments by independent raters and with objective scores computed over the
same essays. It is argued, therefore, that essays may reasonably be used for a
wide variety of assessment purposes.
In the past many critics have opposed the use of composition as a measure of
language proficiency. They have reasoned that (1) students are apt to perform
differently on different occasions and when writing on different topics; (2) the
scoring of essays is highly subjective; and (3) students can easily avoid problems
and mask their weaknesses (see Harris, 1969, p. 69). Furthermore, scoring essays has been considered to be too time-consuming in large-scale testing situations. It
has been argued that other methods might work just as well and be much more
economical to use.
On the other side, teachers have long felt that essays are reasonable sorts of
tasks to require of students who will have to do a great deal of writing in order to
complete just about any educational program. Some of them have argued that the scores assigned by instructors on essays written by their own students should be quite reliable and indeed valid. They argued that there would be considerable
agreement among raters about how well a given essay expresses the author’s intended meaning.
To assess the reliability and validity of two sorts of essay writing tasks, the
following study was designed. Both subjective rating techniques and objective
scoring methods are examined.
Method
Subjects. One hundred and thirty-seven of the subjects who participated in the
Center for English as a Second Language testing project completed all the tests in
this study. They ranged in ability from beginners to fairly advanced students. (See
Chap. 2 for more details.)
Tests. Two essay tasks were used. The first was a rewriting or recall task.
Subjects viewed a paragraph on a screen at the front of the room via an overhead
projector. The text was displayed for exactly one minute. Then the projector was
turned off and the students were instructed to write down what they had read using
the original words of the text as much as possible, but being sure to convey the
intended meaning. Five minutes were allowed for the rewriting. This procedure
was repeated three times—once for each of three different texts. The texts were
selected from reading materials that were believed to range from easy to difficult
for the test population. (See entry 20 in the Appendix at the end of this volume for
texts used in the Recall tasks.)
The second type of writing task was a more traditional essay. Subjects were
allowed twenty minutes to write about an accident according to the following
instructions: “Imagine you are a witness to an auto accident. Tell who was involved; tell who was at fault. Describe what happened, where you were, when it
happened, how long before the police arrived, whether or not there were injuries,
and so on.”
Three objective tests aimed at writing skill were also included. (These are
given in their entirety as entry 19 in the Appendix at the end of this volume.) They
consisted of multiple-choice items of the following types:
A farmer’s daughter had been out to milk the cows and was returning home,
carrying her pail of milk on her head. As she walked along she
(1) ________ (A)
(A) started thinking:
(B) had to
(C) prepared
(D) began to be
“The milk in this pail will provide me with cream, . . .”
Scoring Procedure. The essay recall task was scored in two ways. First, the
instructors who taught the writing classes and who were most familiar with the
subject’s writing skills were asked to rate each protocol on the following six-point
scale:
Scale 1
Second, five raters were instructed in a method of scoring each of the three paragraphs objectively. All sequences that were unintelligible or that contained information not actually stated or clearly implied in the original text were disregarded. Then a count was made of the error-free words in the correct meaningful sequences that remained. Thus the total possible score was roughly equal to the number of words in the original text.
The essay task was scored in three ways. First, each instructor assigned a point score to each protocol according to the six-point scale defined above. Second, a separate team of five raters was instructed to score the essay objectively. They were instructed to read the entire essay trying to understand the intended meaning, then to rewrite it so that it expressed the apparently intended meaning clearly and correctly; a score was then computed by counting the error-free words in the subject’s original rendition minus the number of errors in the same rendition. Errors included: (1) words that subjects wrote which were not necessary, and which were therefore deleted by the rater; (2) words that the subject did not use but which were necessary to convey the apparently intended meaning, and which had to be added by the rater; and (3) nonidiomatic sequences which had to be rewritten by the rater. In the latter instance, a count of whichever sequence contained the greater number of words (that is, the rater’s sequence or the subject’s original wording) determined the number of errors. Phonetically correct misspellings were not counted as errors (see Chap. 6). An example illustrating the scoring method follows:
[The sample protocol is reproduced in the original with the rater’s deletions, insertions, and rewritten sequences marked interlinearly; those markings cannot be rendered here. The subject’s essay described a collision with another car, and the rater’s corrections yielded 76 error-free words and 21 errors, giving]

    (76 - 21) / 76 = .72

where 76 is the number of error-free words and 21 is the number of errors.
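The arithmetic of the objective score is a simple proportion. In this sketch the two counts merely repeat the worked example above, since the tallying of error-free words and errors was done by the human rater, not by code:

    # Objective essay score: proportion of error-free words remaining after
    # subtracting the rater's error count. Counts repeat the example above.
    def objective_score(error_free_words, errors):
        return (error_free_words - errors) / error_free_words

    print(round(objective_score(76, 21), 2))  # 0.72, as in the sample protocol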
The third scoring method for the essays involved a subjective rating of content and organization employing a somewhat different scale from the one used for the teachers’ ratings. The following six-point scale was used by the same team of five raters who did the objective scoring:
Scale 2
1. Incomprehensible or no attempt
2. Most points are hard to understand and/or there are serious contradictions or inconsistencies.
3. The meaning is sometimes hard to follow and often awkwardly expressed.
There also may be occasional inconsistencies.
4. Organization and transitions are occasionally difficult to follow, but there are
no glaring inconsistencies or incomprehensible sequences.
5. Meaning is easy to follow throughout and there are no obvious weak transitions or awkward sequences.
6. Well-written native-like composition with practically no errors.
The question was whether a subjective rating would yield as much meaningful variance as the objective scoring and whether either method would produce new information not available in the other.
Table 14-1  Correlations of Essay Tasks and Multiple-Choice Cloze Tasks

[The correlation matrix relating the Recall tasks (Passages A, B, and C, scored by teachers and by trained raters), the Essay scores, and the multiple-choice cloze tasks (Select, Edit, and Order over Passages A, B, and C, with Total = Select + Edit + Order) could not be recovered from the source.]
Table 14-2  Means and Standard Deviations for the Essay Tasks and Multiple-Choice Cloze Tasks

Tests                                Mean      SD
Recall—Teacher’s Ratings
Passage A 2.00 1.41
Passage B 1.83 0.99
Passage C 1.75 1.02
Recall—Trained Raters
Passage A 13.32 13.58
Passage B 7.88 10.04
Passage C 10.45 11.91
Essay—Teacher’s Ratings
Evaluation on Scale 1 2.67 1.17
Essay—Trained Raters
Evaluation on Scale 2 2.57 1.13
Objective Score of Essay 45.50 29.67
Multiple-Choice Cloze
Select
Passage A 2.81 1.41
Passage B 2.80 1.48
Passage C 2.45 1.75
Edit
Passage A 1.48 1.28
Passage B 1.75 1.29
Passage C 1.31 1.21
Order
Passage A 12.26 4.75
Passage B 7.54 4.58
Passage C 7.51 6.18
Table 14-3  Loadings on the g Factor

Tests                                 g†       h²
Recall—Teacher’s Ratings
Passage A .78 .61
Passage B .69 .48
Passage C .78 .61
Recall—Trained Raters
Passage A .74 .55
Passage B .66 .44
Passage C .62 .38
Essay—Teacher’s Ratings
Evaluation on Scale 1 .80 .64
Essay—Trained Raters
Evaluation on Scale 2 .77 .59
Objective Score of Essay .80 .64
Multiple-Choice Tasks
Select
Passage A .63 .40
Passage B .74 .55
Passage C .71 .50
Edit
Passage A .67 .45
Passage B .62 .38
Passage C .45 .20
Order
Passage A .75 .56
Passage B .69 .48
Passage C .62 .38
Eigenvalue 8.84
*The data in this table and Table 14-4 are also discussed by Oller (1979, Appendix).
†Accounts for 49% of the variance in the total factor matrix.
Table 14-4  Varimax Rotated Solution for the Essay Tasks and Multiple-Choice Tasks (N = 137)*

Tests                          Factor 1   Factor 2   Factor 3    h²
Recall—Teacher’s Ratings
Passage A .57 .65 .05 .65
Passage B .74 .40 -.01 .71
Passage C .81 .30 .21 .79
Recall—Trained Raters
Passage A .40 .69 .10 .65
Passage B .64 .25 .24 .53
Passage C .71 .09 .30 .59
Essay—Teacher’s Ratings
Evaluation on Scale 1 .45 .58 .33 .68
Essay—Trained Raters
Evaluation on Scale 2 .34 .70 .29 .68
Objective Score of Essay .24 .77 .25 .71
Multiple-Choice Tasks
Select
Passage A .49 .41 .15 .43
Passage B .18 .70 .37
Passage C .20 .43 .65 .64
Edit
Passage A .45 .25 .52 .53
Passage B .49 .12 .53 .30
Passage C .11 .08 .71 .51
Order
Passage A .18 .68 .40 .65
Passage B .28 .40 .55 .54
Passage C .06 .46 .59 .56
Eigenvalue 10.81
*Factor 1, Factor 2, and Factor 3 account for 19.5%, 24.4%, and 16% of the total variance, respectively.
robust measure of global language proficiency. Further, Stump (1978) shows this
factor is apparently indistinguishable from factors of intelligence and school
achievement in native speakers.
Chapter 15
Karen A. Mullen
In the field of second language testing, one of the skills commonly tested is that of writing ability, either objectively in the context of multiple-choice questions or productively in the framework of a writing task. Objective writing tests, most notably of the type included in the TOEFL (see Chap. 1 for description), have been constructed to meet the requirements of reliability and validity. Yet one of the major criticisms is that they do not allow one to see how a second language learner organizes his thoughts on paper or applies his known vocabulary, nor do they indicate how well a learner uses in extended, unified prose the formal grammatical rules he has been taught chapter by chapter in his language texts. On the other hand, productive writing tests have been criticized for their failure to produce reliable measurements of writing skill. This unreliability has been attributed to two sources: the topics on which the learner writes and the judges who evaluate the result. These criticisms have been leveled against such tests given to native speakers of English. However, for nonnative speakers, the case may be different, particularly if the criteria to be measured are those of sentence structure, vocabulary usage, fluency of writing, and coherence of ideas. If the purpose of having a learner of English write is to test his ability to put sentences together while appropriately selecting from his store of vocabulary and to apply within a set period of time the grammatical rules he knows, then the question of whether the judges of such writing can assess this and produce parallel assessments is an important one.
The purpose of this chapter is to report the results of a study designed to
determine if experienced ESL teachers, working in pairs, can come to a mutual
agreement concerning the writing proficiency of nonnative speakers of English
and to determine the reliability of such judgments. In addition, the question of
whether different sets of judges rate differently is posed. Finally, the role each scale plays in the evaluation of overall writing proficiency is examined. Specifically, the hypotheses are as follows:
Method
To test hypothesis 1, a single-factor experimental design having repeated
measures was chosen.* The F-statistic based upon the mean sum of squares
between judges divided by the mean sum of squares of the residual variance was
computed to test the hypothesis of no significant difference between judges.
Unbiased reliability coefficients were calculated based upon the number of
subjects in the sample, the number of judges within the group, the mean square
between subjects, and the mean square within subjects (Winer, 1971, p. 287). To
test hypothesis 2, a two-factor experimental design having repeated measures on
one factor was chosen. The F-statistic based upon the mean square between
groups divided by the mean square of subjects within groups was computed to test
the hypothesis of no significant difference among groups. To test hypothesis 3, the
F-statistic based upon the increment in the sum of squares due to the addition of a
scale variable divided by the residual variance was calculated from a stepwise
regression analysis.
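As a rough sketch of the first analysis (the ratings below are hypothetical values for one pair of judges; the final line uses a common ANOVA-based intraclass estimate, which may differ in detail from Winer's unbiased formula cited above):

    # Single-factor repeated-measures ANOVA: two judges rate the same
    # subjects; F tests for a judge effect. Ratings are hypothetical.
    ratings = [[5, 6], [3, 4], [7, 7], [4, 5], [6, 6], [2, 3]]

    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(r) for r in ratings) / (n * k)
    ss_total = sum((x - grand) ** 2 for r in ratings for x in r)
    ss_subj = k * sum((sum(r) / k - grand) ** 2 for r in ratings)
    ss_judge = n * sum(
        (sum(r[j] for r in ratings) / n - grand) ** 2 for j in range(k)
    )
    ss_resid = ss_total - ss_subj - ss_judge

    ms_judge = ss_judge / (k - 1)
    ms_resid = ss_resid / ((n - 1) * (k - 1))
    ms_subj = ss_subj / (n - 1)

    print("F(judges) =", round(ms_judge / ms_resid, 2))
    print("reliability =", round((ms_subj - ms_resid) / ms_subj, 2))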
Procedure. The judges were required to rate the subjects on five scales of
writing proficiency: Control over English Structure, Organization of Material,
Appropriateness of Vocabulary, Quantity of Writing, and Overall Writing Proficiency. The scales were labeled vertically on a rating form. Each scale was presented in the form of a double horizontal line equally divided into five contiguous
compartments labeled from left to right: poor, fair, good, above average, and
excellent (see Appendix 15A). The judges were instructed to put an X in the box
best characterizing the learner’s proficiency with regard to each of the five scales
or to put an X on the line between boxes if the evidence warranted. A set of
guidelines for deciding what level of proficiency to assign was explained to the
judges before they read the composition, and it was at hand for reference during
the evaluation (see Appendix 15B). For analysis, judgments were later converted
to a numerical value of 1 = poor, 2 = between poor and fair, 3 = fair, 4 = between
fair and good, 5 = good, 6 = between good and above average, 7 = above
average, 8 = between above average and excellent, and 9 = excellent.
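The conversion is a straightforward lookup. As a sketch (the string keys are hypothetical labels for the nine possible marks on the form):

    # Numeric coding of the rating form, as described above: boxes take the
    # odd values; marks on the line between boxes take the even values.
    SCALE_VALUES = {
        "poor": 1, "poor/fair": 2, "fair": 3, "fair/good": 4, "good": 5,
        "good/above average": 6, "above average": 7,
        "above average/excellent": 8, "excellent": 9,
    }

    print(SCALE_VALUES["fair/good"])  # 4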
The subjects were given a composition booklet with instructions inside
directing them to choose a topic from one of four choices, to plan their ideas for ten
or fifteen minutes, to develop their ideas using details and examples, and to
consider that their writing would be evaluated for grammar, vocabulary, paragraph
organization, logical development, and quantity of writing. Subjects were allowed
an hour for the task.
Judges. Five judges participated in this study. They were randomly paired to
form eight groups. All judges were graduate students in linguistics. They had completed courses in phonetics, syntactic and phonological analysis, and TESL
methodology. They had all taught ESL for at least a year. All had been instructed
on how to use the rating form and the guidelines, and they had participated in such
composition evaluation before. None of the judges had had a previous acquaintance with the subjects whose compositions they read.
Subjects. The 117 subjects in this study had been referred to the University of Iowa Linguistics Department for a proficiency evaluation by either the foreign admissions officer, the foreign student adviser, or the student’s academic adviser.
Most of the subjects were new to the university. Most had been referred because
their TOEFL scores were below 550. Some appeared for an evaluation because in
the course of their few days on campus, the foreign student advisers had noted a
lack of facility in English, although the TOEFL scores were not below 550. The
purpose of the evaluation was to determine whether additional instruction in
English and a reduced academic program might be recommended for the student.
[An analysis-of-variance table appears here in the original; its recovered caption reads “Analyses of Variance on Performance Scores (One for Each of Eight Pairs of Subjects Tested by Different Pairs of Judges) on Five Scales of Writing Proficiency” (Table 15-2, as cited below). The source-of-variance, df, SS, MS, and F entries for the Overall, Vocabulary, Quantity, Organization, and Structure scales could not be recovered from the source. *Significant at p < .05. †Significant at p < .01.]
Results are reported for each scale. The F-statistic for a difference between groups is not
significant at the .05 level for four of the scales—Structure, Organization,
Vocabulary, and Overall Proficiency. It is not significant at the .01 level for any of
the five scales. The F-statistic for a difference between judges within at least one
pair is significant at the .01 level for every scale except Structure.
In order to ascertain which pairs of judges might be responsible for producing significant differences, we must consult Table 15-2, showing the results of a single-factor analysis of variance having repeated measures for each of the five scales and for each of the eight groups of subjects. The F-statistic for a difference between judges’ Structure scores is not significant at either the .05 or the .01 level. It is significant at the .05 level for a difference between judges’ Organization scores for four pairs of judges (groups 2, 4, 7, and 8), between judges’ Quantity scores for four pairs (groups 1, 2, 3, and 8), between Vocabulary scores for two pairs (groups 2 and 3), and for Overall scores for two pairs (groups 1 and 2). For six
out of the eight groups, there is a significant difference in ratings on at least one
scale. At the .01 level, there is a significant difference between judges’ Organization scores for group 2, between Vocabulary scores for groups 2 and 3, and for Overall scores for group 2. The F-statistic shows no significant difference between judges’ Quantity scores at this level. Additionally, there is no significant difference between judges across rating scales for six out of the eight pairs. Pair 3 shows
a significant difference on one scale (Vocabulary). Pair 2 is the most deviant,
showing a significant difference at the .01 level for three out of the five scales. This
pair performed similarly on an identically designed experiment involving scales of
speaking proficiency on a randomly selected group of subjects from the same pool.
The reliability coefficient is a measure of the degree to which an average of the raters’ scores on a given scale is a good estimate of the subjects’ true scores, and as such it indicates the percentage of the obtained variance in the distribution of scores which may be regarded as variance not attributable to errors of measurement. Squaring the reliability coefficient will indicate the accuracy of the prediction when the score assigned by one judge in a pair is used to predict the score given by
the other judge. Table 15-3 shows the unbiased reliability coefficients for each
pair of judges for all scales of writing proficiency. The coefficients range from a low
of -.34 to a high of .99. Table 15-3 also shows the relationship between the F-statistic for a difference between judges and the reliability coefficients. Some cases show a significant difference and a high reliability (Vocabulary scores in group 3, for example), indicating that the judges are not interpreting the scale in
There are also cases in which there is no significant difference between judges’
scores but a rather low reliability (Structure scores for group 5, for example). This
indicates that the judges, though showing no overall difference in assigning scores, do not interpret the scale consistently with one another across all the subjects in the sample. Where there is no significant difference in judges’ scores and a high
reliability (group 6 for all scales), we may infer that both judges are interpreting the
scales similarly and reach agreement in their evaluation of individual subjects. In
Table 15-4  Intercorrelations of the Five Scales

Scale                        1      2      3      4      5
1 Structure                        .79    .67    .84    .89
2 Organization                            .82    .82    .90
3 Quantity                                       .72    .83
4 Vocabulary                                            .92
5 Overall Proficiency
cases where there is a significant difference between judges and a low reliability
(as for pair 2 on all scales), one may conclude that the judges disagree and that the
error of measurement on the scales for these raters is sufficient to reduce the
extent to which an average of the two scores could be used as an estimate of the
subject’s true score. Reliability coefficients for four of the eight pairs of judges are
within the acceptable limits of .70 or above on all scales. A fifth pair is within
acceptable limits on four out of five scales. Three sets of judges did not produce
acceptable reliability coefficients on the five scales. The Quantity scale appears to provide the most uniform reliability coefficients across all groups of subjects and
raters.
As shown in Table 15-4, the correlation between scale ratings is high. However, a stepwise multiple regression analysis reveals that as the variables are added one by one to the equation for predicting the Overall Proficiency rating, each addition to the equation significantly reduces the amount of unpredicted variance. The variable accounting for the most variance is added first, and subsequent variables are then added one by one depending on which accounts for the most remaining variance. Table 15-5 shows these results. The beta coefficients in
the equation indicate that if the Vocabulary rating were increased by one unit and
the other ratings remained constant, the expected change in the overall
Table 15-5 A Stepwise Regression Analysis with the Overall Proficiency Scale
as the Dependent Variable
proficiency rating would be .36. If the Quantity rating were increased by one (and the other ratings remained constant), the change in the Overall rating would be .23. Parallel conclusions may be reached for the influence of an increase in the Structure or Organization rating on the Overall score. A unitary increase in the Vocabulary rating would produce the most change (by about a third), and such an increase in the Organization rating would produce the least (by about a fifth). Unitary increases in the Quantity or Structure scores would change the Overall rating by about a fourth. If all four ratings were increased by one unit, the change in the
indicates that although the scales are unequally weighted, they function together
in such a way that they produce a unitary change in the Overall score. In addition,
the R-square coefficient indicates that the four variables account for .943 of the
variance in the Overall scores and that the relationship between the four scales and
the Overall scale is very nearly linear.
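In rough terms, the additive logic is as follows. The .25 and .20 weights below are read off the chapter's "about a fourth" and "about a fifth" descriptions rather than from Table 15-5, whose values do not survive here:

    # Predicted change in the Overall rating from unit increases on each
    # scale, using beta weights as described in the text; .25 and .20 are
    # approximations recovered from the prose, not exact table values.
    betas = {"Vocabulary": 0.36, "Quantity": 0.23,
             "Structure": 0.25, "Organization": 0.20}

    # ~1.04: unit gains on all four scales move Overall by about one unit
    print(sum(betas.values()))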
Conclusions. It is apparent from the results of this study that some pairs of
judges achieve fairly high reliability and show no significant difference in scoring
on all scales. It is also apparent that some pairs of judges are highly reliable in their
rating, although one may be calibrating judgments consistently higher than the
other. It is also clear that some pairs of judges cannot produce reliable judgments.
Moreover, each of the scales plays a role in the determination of the Overall score.
In relation to the hypotheses stated at the beginning, we may conclude that
Note
1. This is an expanded version of a paper presented at the AILA/TESOL Convention in Miami, Fla., on Apr. 27, 1977. Another version of this paper appeared in H. Douglas Brown, Carlos A. Yorio, and Ruth H. Crymes (eds.), On TESOL '77: Teaching and Learning English as a Second Language: Trends in Research and Practice. Washington, D.C.: TESOL, 309-320. It is reprinted here by permission.
Appendix 15A
Composition Evaluation
Name_ Date_
Evaluator_
                                   Poor   Fair   Good   Above Average   Excellent
Control over English Structure
Compositional Organization
Quantity of Writing
Appropriateness of Vocabulary
Overall Writing Proficiency
Appendix 15B
Guidelines for Evaluation of Compositions
Compositional Organization
Excellent: Well-developed introduction which engages concern of the reader.
Use of internal divisions and transitions. Substantial paragraphs to
develop ideas. Conclusion suggests larger significance of central
idea.
Very good: Obvious inclusion of an introduction, though not smoothly developed. Division of central idea into smaller parts, though paragraphs are lean on detail. Conclusion restates the central idea.
Good: Intent to develop central idea is evidenced, but only a few points are
mentioned. The introduction or conclusion is very simply stated or
may be missing. Occasional wandering from topic.
Fair: Limited organization. Thoughts are written down as they come to mind. No introduction or conclusion.
Poor: No organization. No focus. No development. No major consideration
of topic.
Quantity of Writing
Excellent: Writing is an easy task. Quantity seems to be no problem.
Very good: Reasonable quantity for the hour. Writing flows without much hesitation.
Good: Enough writing to develop the topic somewhat. Evidence of having stopped writing at times.
Fair: Much time spent struggling with the task of putting down thoughts on paper.
Poor: Very little writing during the hour-long assignment.
Appropriateness of Vocabulary
Excellent: Precise and accurate word choice. Obvious knowledge of idioms. Aware of word connotations. No translation from native language apparent. May have attempted a metaphoric use of words.
Very good: Occasional misuse of idioms, but little difficulty in choosing appropriate forms of words. Uses synonyms to avoid repetition. Some vocabulary problems may be due to translations.
Good: Use of the most frequently occurring words in English. Does not use synonyms to avoid repetition. Some inappropriate word choices. Uses circumlocutions or rephrasing when the right word is not available.
Fair: Depends upon a very small vocabulary to convey thoughts. Repetition of words is frequent. Appears to be translating. Great difficulty in choosing appropriate word forms.
Poor: Vocabulary is extremely limited.
Chapter 16
Until recently the ESL skill areas of reading and writing have suffered from neglect by researchers and teachers alike. The historical reasons behind this past neglect have been discussed elsewhere by others and will not be reviewed here (Saville-Troike, 1973; Wilson, 1973). Now that the need has been recognized,
researchers are faced with the task of formulating and testing hypotheses whose
acceptance or rejection they hope will lead to an increased understanding of
second language (L2) reading and writing processes and ultimately to better
methods of teaching.
Conducting empirical research in these two areas is not without problems.
Despite the wide-ranging theorizing and research into first language reading and
writing processes, little is agreed upon as certain. There is as yet no widely
accepted model of the reading process. Empirical studies of writing present
perhaps even more uncertainty, since writing is both a skill and an art. However, because of the vast amount of research done in these areas with native speakers, the L2 researcher would probably be missing a good bet not to formulate hypotheses based on the successful studies of L1 researchers. Because of the relatively crude state of the L2 research art, the hypotheses formulated will be general in nature.
Method
Objective Measures. The study reported below concerning the use of objective measures of syntactic complexity in the evaluation of ESL compositions is based, to a large extent, on the work of Hunt (1965). Hunt, in looking for an objective measure of syntactic maturity, formulated the T-unit. The T-unit, as defined by Hunt, is a “minimal terminable unit . . . minimal as to length, and each would be grammatically capable of being terminated with a capital letter [at one end] and a period [at the other]” (1965, p. 21). The unit preserves not only subordination but also all of the coordination between words and phrases and subordinate clauses. In Hunt’s study it was demonstrated that as students become older and their writing becomes more complex, the mean length of the T-units in their compositions also increases.
In addition to length of T-unit, Hunt developed a subordination ratio. The subordination ratio is determined by summing the number of clauses across all T-units and dividing by the number of T-units. The subordination ratio, like the length of T-units, increases as students increase in age and syntactic maturity. Table 16-1 displays the results obtained by Hunt for the three grade levels which his study examined.

Table 16-1  Hunt's (1965) Results for Three Grade Levels

Grade level            4       8       12
T-unit length         8.6    11.5    14.4
Clause/T-unit ratio   1.30    1.42    1.68
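As a sketch of the arithmetic behind Hunt's two indices (the per-T-unit counts are hypothetical; segmenting a text into T-units and clauses remains the analyst's job, not this code's):

    # Mean T-unit length (words per T-unit) and subordination ratio
    # (clauses per T-unit), computed from hypothetical per-T-unit counts.
    t_units = [
        {"words": 9, "clauses": 1},
        {"words": 14, "clauses": 2},
        {"words": 12, "clauses": 1},
    ]

    mean_length = sum(t["words"] for t in t_units) / len(t_units)
    subordination = sum(t["clauses"] for t in t_units) / len(t_units)
    print(round(mean_length, 1), round(subordination, 2))  # 11.7 1.33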
While we recognize the usefulness of Hunt’s measures, we also see some limitations on applying them to the writing of L2 learners. Hunt’s measures do not take errors into account. Nor do the measures take into account morphological and transformational complexity. As a possible supplement to the T-unit for purposes of analyzing the development of the writing of L2 learners, we developed two additional measures, both based on the T-unit. The first of these is an errors-per-T-unit measure, computed by simply adding up the number of errors found in each T-unit. The second measure is one of morphological and transformational complexity loosely based on the work of Endicott (1973). In our adaptation
T-unit length =7
If, however, a T-unit contains embedding and one or several complex, derived morphological forms, the unit is scored in the following manner:

    John carelessly hit the red ball which his father bought him over the neighbor's fence.

[In the original, point values (1s and 2s) are printed interlinearly with the words; they cannot be reproduced here. The scoring totals 22 points (7 + 15) over a T-unit length of 15:]

    Complexity / T-unit length = (7 + 15) / 15 = 22 / 15 = 1.47
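The index itself is simple arithmetic once the analyst has assigned the extra points; this sketch just repeats the counts of the worked example:

    # Complexity Index sketch: one base point per word plus analyst-assigned
    # extra points for derived and embedded forms, divided by T-unit length.
    def complexity_index(n_words, extra_points):
        return (n_words + extra_points) / n_words

    print(round(complexity_index(15, 7), 2))  # 1.47, as in the example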
The means for each of the measures across the six levels are found in Table 16-2.
The general developmental trend observed by Hunt for native speakers is also
seen in this study in the writing of nonnative speakers. This trend was observed
Table 16-2  Results Obtained from ESL Compositions over Six Levels of Proficiency (Total of 300 Compositions, 50 at Each Level)

Proficiency level    1    2    3    4    5    6
[The mean values of the four measures at each proficiency level could not be recovered from the source.]
over the two measures developed by Hunt as well as with the Complexity Index
developed by us. The only exception to the progression Hunt observed is seen in
the clause/T-unit ratio at levels 5 and 6. This is not surprising, since the writers at
both these levels are all fairly advanced in their composition skills. Perhaps they are making more precise lexical choices and thereby reducing the length of clauses through greater clause density. Another possibility is that some of the level 5 students were actually more proficient writers than the level 6 students.
Using the four measures described above, the authors attempted to determine
(1) how accurately objective measures alone could discriminate among the
different levels of ESL placement and (2) what the relative power of each of the
objective measures is. To answer these questions, discriminant analysis using both
direct and stepwise procedures was employed. Since there are only four measures
and six groups, it was necessary to collapse the six groups to three: levels 1 and 2
became Group 1; 3 and 4 became Group 2; 5 and 6 became Group 3.
Only two discriminant functions were found—length of T-unit and clause/T-unit ratio. Table 16-3 contains the four variables together with the four Wilks’ lambdas and univariate F-ratios with 2 and 297 degrees of freedom. Related eigenvalues and canonical correlations for the two discriminant functions are found in Table 16-4. Together the two measures accounted for 56% of the total variance
across groups. The errors-per-T-unit and the Index of Complexity (Endicott
index) turned out to be totally lacking in discriminatory power (see Wilks’ lambdas
in Table 16-3). The summary table (Table 16-5) reveals that 64% of the grouped
cases were correctly classified.
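A present-day analogue of the direct procedure might look as follows. This is a sketch only; the feature values and group labels are fabricated placeholders, and the chapter's own analysis long predates this library:

    # Linear discriminant analysis over the four objective measures:
    # [T-unit length, clause/T-unit, errors/T-unit, Complexity Index].
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    X = [[8.5, 1.2, 2.1, 1.1], [11.0, 1.4, 1.5, 1.2], [14.2, 1.7, 0.8, 1.4],
         [9.0, 1.3, 2.0, 1.1], [11.5, 1.5, 1.4, 1.3], [13.8, 1.6, 0.9, 1.5]]
    y = [1, 2, 3, 1, 2, 3]  # collapsed placement groups (placeholders)

    lda = LinearDiscriminantAnalysis().fit(X, y)
    print(lda.predict([[12.0, 1.5, 1.2, 1.3]]))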
Table 16-5  Classification Results for the Three Collapsed Groups (percent)

Actual group      n     Group 1    Group 2    Group 3
1                100       83         12          5
2                100       27         46         27
3                100        6         31         63

While the results of this preliminary study are somewhat encouraging, it is also clear that more precise measures are needed to discriminate better among the
intermediate-level students. One possibility is to adjust the weightings of the Complexity Index to reflect more accurately the relative complexity of the various structures. However, given the current state of linguistic theory, many questions concerning what is or is not a complex structure have not been resolved. Another possibility is to develop a measure which would more accurately assess the task demands of the writing process, a process which involves the logical chaining of one sentence to another. This measure of cohesion could possibly serve as a useful complement to the length of T-unit and the clause/T-unit ratio. Nonetheless it seems safe to conclude that the sentences of ESL students grow in complexity in ways similar to the sentences of native speakers. Further, the objective measures employed discriminate reasonably well among the various ability levels.
A final question was whether there would be significant correlations between
objective measures of complexity and holistic evaluations of compositions. For
this purpose, each composition was evaluated by experienced ESL teachers on a 1
to 5 scale: 5 represented outstanding, 4 above average, 3 average, 2 below average, and 1 inferior. The scale was relative to each level. For example, a 3 for Course 3 indicated that the composition was “average” for students writing at that level. To
ensure reliability, both of the authors reevaluated the compositions. The interrater
reliability exceeded .90. Pearson product moment correlations were computed
between the holistic evaluations and the objective measures. Results are
presented in Table 16-6. For the lower three levels, the highest correlations were
obtained between the clause/T-unit ratio and the holistic evaluation. Progressing
up the levels, the correlations between length of T-unit and holistic evaluation
continued to increase. At the most advanced level, 50% of the variance in the
holistic evaluations could be accounted for on the basis of length of T-unit alone.
[Table 16-6, giving these correlations, could not be recovered from the source. *p < .01.]
While the authors concede that there is far more to writing than length-of-T-unit or clause/T-unit ratios, this study has demonstrated that these measures are relatively useful in determining levels of overall ESL proficiency and in predicting the overall effectiveness of writing ability.
Chapter 17
Table 17-1  Distribution of Subjects across CESL Levels

Level     1     2     3     4     5    Total
Arabic    2    15     4    11     8     40
Farsi     2    13    23     6    10     54
Total     4    28    27    17    18     94
(Lee and Canter, 1971, and Warden, 1976). Assuming that these hierarchies also
exist for second language learners, it was hypothesized that correct usage within
each of these three categories would increase from elementary to advanced levels,
and that there would be no significant difference in usage by speakers from two
diverse language groups (Arabic and Farsi). We used three different scoring
methods and correlated each with two measures of global proficiency, subjective essay ratings and objective essay scores, to discover to what extent our discrete-point scoring methods were valid as indicators of language proficiency.
Subjects. Our sample was taken from a population of 182 foreign students
studying English at CESL. We selected the two largest homogeneous groups
according to sex and language. The subjects used for analysis were 94 males: 54
native speakers of Arabic and 40 native speakers of Farsi. They were distributed
across the levels at CESL as shown in Table 17-1.
Elicitation Instrument. The data were based on essays written as part of the
spring term testing project (see Chap. 14). The students were to write about an
imaginary accident to which they were witnesses and were asked to report who was
injured, when the police arrived, and other relevant facts about the incident. This
task was selected because it drew on experiences outside the language classroom.
Although none of the instructors who administered the test gave any clues as to
paragraph construction or the use of introductions and conclusions, all but five
teachers helped their students with the vocabulary in the directions.
Scoring. The hierarchies of difficulty used for the analyses of conjunction and pronoun usages were adapted¹ from the scales reported by Lee and Canter (1971) for language-delayed children acquiring English as their native language. The hierarchies used in this study are as follows, in ascending order of hypothesized difficulty (a toy tally over the conjunction hierarchy is sketched after these lists):

Conjunctions: (1) and; (2) but; (3) because; (4) so, so that, if; (5) or, except, only; (6) where, when, while, why, how, whether (or not), for, till, until, since, before, after, unless, as, as + adjective + as, as if, like, that, than, therefore, however, whenever.
Pronouns: (1) I, me, my, mine, you (subject and object), your, yours (no need for referent); (2) he, him, his (adjective and nominal), she, her (object and possessive), hers; (3) we, us, our, ours; (4) they, them, their, theirs; (5) all reflexive pronouns.
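A toy tally over the conjunction hierarchy might look like the following; the sentence is invented, and deciding whether a given usage is correct remained a human judgment in the study:

    # Tally conjunction tokens by position in the hypothesized hierarchy.
    CONJUNCTIONS = {
        1: {"and"}, 2: {"but"}, 3: {"because"},
        4: {"so", "if"}, 5: {"or", "except", "only"},
        6: {"where", "when", "while", "why", "how", "whether", "for",
            "till", "until", "since", "before", "after", "unless", "as",
            "like", "that", "than", "therefore", "however", "whenever"},
    }

    def tally(tokens):
        counts = {level: 0 for level in CONJUNCTIONS}
        for tok in tokens:
            for level, words in CONJUNCTIONS.items():
                if tok.lower() in words:
                    counts[level] += 1
        return counts

    print(tally("I stopped because the light was red and he hit me".split()))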
Category                        df        F        p      % Variance explained
Conjunctions
Position in hierarchy 5,504 122.78 .001 .48
Level at CESL 4,504 11.12 .001 .04
Native language 1,504 2.95 .082 .00
Pronouns
Position in hierarchy 4,420 92.81 .001 .38
Level at CESL 4,420 10.68 .001 .04
Native language 1,420 .03 .999 .00
Articles
Position in hierarchy 2,102 38.93 .001 .34
Level at CESL 4,102 3.23 .015 .03
Native Language 1,102 18.38 .001 .05
Position in hierarchy
by language 2,102 9.99 .001 .04
Essay rating by teachers    .37*   .01   .04   .36*   -.02   .05   .18†   .10   .26*   .18*
Objective essay score       .26†   .01   .04   .37*    .06   .13   .26*   .19‡   .28*   .24*
*p < .001.  †p < .01.  ‡p < .05.  §Full credit.  ¶Half credit.
only 14%. For articles, several of the scoring methods are significant, but none of
the correlations exceed .28 (p < .01).
Our findings indicate that skills in the usage of cohesive devices are indeed
minimal indicators of overall language proficiency. A student’s ability to use
conjunctions, pronouns, and articles correctly cannot be expected to reflect his
communicative ability, although it must contribute to finer aspects of that skill.
Surprisingly, our second hypothesis was weakened: Level at CESL accounts for
far less variance than does an item’s hierarchical position within any grammatical
category. In other words, a student’s competence in using these three structures
has little or no bearing on his level at CESL. Furthermore, native language seems
to play no role at all in ability to use cohesive devices except possibly articles.
When global proficiency measures are used as validating criteria, the scoring
system which totals correct usages seems to be a better indicator of language
ability than systems which take into account the number of words produced, errors
made, or obligatory contexts. Discrete-point analyses seem to reveal only narrow
descriptions of potential communicative capacity and as a result do not appear to
be comprehensive indices of language proficiency.
Note
1. Pronouns used as referents to persons were the only pronouns scored; those referring to objects or
situations were omitted. First person plural and third person plural pronouns were tabulated
separately. Sentence points and weighted scores (see Lee and Canter, 1971) were not used.
Part V Discussion Questions
3. What hypothetical factor do you believe can best explain the loadings on g in Table 14-3? Also consider the loadings on g in Table 2-2. Compare the single-factor solution of Table 14-3 with the three-factor solution of Table 14-4. Try to find a consistent (non-self-contradictory) explanation for the factors in Table 14-4.
5. In Chap. 15, what does a significant difference between judges indicate concerning interrater reliability? Note that contrasts sometimes co-occur with high reliabilities (see Table 15-3).
6. Why do some pairs of judges achieve so much more reliability than others? What factors might enter in? How many of the reliability coefficients displayed in Table 15-3 do you consider to be acceptable? How many, for instance, are above .80? Below .80?
9. Is it possible to write long awkward sentences? What effect would the tendency to do so have on T-unit scores (see Chap. 16)? What effect would such a tendency have on essay ratings? Or, consider sentences that are very short, pithy, and clear.
10. Are the measures used in Chap. 16 sensitive to errors or organizational problems involving constraints that go beyond the level of a single T-unit or clause? What about holistic ratings? Are they sensitive to such constraints?
14. What factors are overlooked in discrete-point scoring methods but included (at least implicitly) in the more holistic scoring techniques?
Part VI
Do native speakers of English tend to make errors that are different in type from those made by nonnatives who are learning English as a second language? Are the structures that are difficult for one group also difficult for the other? What is the strength of the similarity (if any)? Do children learning English as their first language and adults learning it as their second language both find direct assertions easier to process than indirectly conveyed meanings (e.g., presuppositions or implications)? Can it be demonstrated that learners of ESL from a particular language background find specific structures in English more difficult than others because of interference from their native language? In other words, will a group of ESL learners from a particular native language background find certain structures in English to be significantly more difficult in relation to other English structures than those same structures might be for native speakers of English or for learners of ESL from other nonnative backgrounds? These questions and others are dealt with in Chaps. 18 to 20.
Chapter 18
Michelle Fishman
Second language learner errors have often been attributed to the developmental cognitive strategies that the learner uses while learning a second language. This study attempts to find out whether, as is usually supposed, the kinds of errors made by native speakers of English are actually very different from the kinds of errors that second language learners make. The data were gathered from a dictation task administered to native and nonnative speakers. The same passage of prose was recorded twice. The natives listened to a tape with white background noise while the nonnatives heard a tape without noise. The dictation consisted of nine segments with pauses to allow both groups to attempt to write verbatim what they had heard. We hoped to learn whether processing difficulties are distributed similarly over segments of text for natives and nonnatives, and whether the two groups would tend to make the same types of errors and in roughly the same proportions. The results reveal substantial similarities in dictation processing by natives and nonnatives. Although natives made fewer errors in spite of the noise factor, the native and nonnative speakers tended to agree on what they found difficult and what they found easy. The data show that in a Q-type factor analysis all the natives and the more proficient nonnatives loaded on the same factor rank ordering the difficulty of segments, and a second factor analysis of the same type showed that a single component accounted for .86 of the total variance between subjects in both groups. The foregoing would suggest that when pushed to the limits of their ability, both native and nonnative speakers seem to make the same kinds of errors.
more frequently, but that is not the issue. The issue is whether or not natives and nonnatives tend to make the same types of errors, and in roughly the same proportions. Put differently, do nonnatives use different processing strategies and therefore make different kinds of errors, or do they use similar processing strategies but simply use them less efficiently? And does the same hierarchy of difficulty for segments of discourse hold for both natives and nonnatives?
Related to these basic questions is the whole issue of contrastive analysis and the often postulated influences of the learner’s native language on target language processing. Naive contrastive theories predicted that types of errors would necessarily be different for natives and nonnatives (also see Chap. 20). Depending on the nonnative’s first language, learning strategies would differ. However,
Richards (1971) showed that errors made by second language learners are not very
different from errors children make while learning their native language. On the
basis of such reasoning, learner outputs (especially errors) have come to be
regarded as evidence of the strategies the learner uses to test hypotheses about the
structure of the language—i.e., the route he follows when developing proficiency
in the target language.
It is widely accepted that both first and second language learners must abstract similar rules in order to achieve mature or native-like competence. To test the degree of similarity in native and nonnative processing strategies, a dictation was used as an elicitation device. In many previous studies tasks have been used in which oral or written responses were elicited from the learner (Selinker, 1972, Dulay and Burt, 1974, Schumann and Stenson, 1974, Johansson, 1975, and Taylor, 1975). All these studies have assumed that it is possible to infer something of the nature of the underlying mental processes from learner performance.
In order to tax the speech perception of the native speakers in this study (see below), and thus make the dictation task more nearly equivalent in difficulty for natives and nonnatives, white noise was imposed on the signal presented to the natives. A similar technique had been used by Gradman and Spolsky (1975) to measure second language proficiency. Long before that, the technique was used with native speakers by Miller, Heise, and Lichten (1951) in a study of voice communication systems.
Method
Subjects. Thirty-two native speakers of English attending Southern Illinois University and thirty-two nonnatives served as subjects. The former group consisted of randomly selected freshmen, sophomores, juniors, and seniors. Major fields of study were Business Administration (5 students), Teacher Education (4), Marketing (3), Health Education (3), Early Childhood Education (3), Sociology (3), Social Welfare (2), and one each in Biology, Physical Therapy, Public Relations, Clothing and Textiles, Political Science, Law, and Physiology. Two were undecided. The foreign students were among those enrolled in Southern Illinois University's Center for English as a Second Language (CESL) during the fall semester of 1976. The thirty-two nonnative subjects were chosen from level 5 at CESL. They included native speakers of Farsi, Arabic, Spanish, Japanese, Turkish, and French. According to CESL's placement procedure (see Chap. 4) this group was supposed to be relatively homogeneous in ESL ability.
Testing. The same test was tape-recorded twice. The following passage of prose from Mark Twain's Huckleberry Finn was used:
(1) The river life was very leisurely. / (2) The meals were mostly fish caught while traveling. / (3) The boat he possessed had a raised section for living quarters. / (4) In normal weather he could lie around / (5) and fish or sleep without ever getting wet. / (6) About the only work involved was tying up the boat at night / (7) and fixing meals, neither of which required much work. / (8) Overall, it is easy to see why / (9) this life can be called a relaxing one (Hardison, 1966, pp. 111-113).
The main difference from ordinary speech was that pauses were inserted at phrase and clause boundaries at the same places in both readings. Pauses (indicated by slashes) were sufficiently long for the hearer to have ample time to write each dictated segment verbatim. (The numbers given in parentheses were not dictated and are used only for the sake of convenient reference to the segments later in this chapter.)
The dictation contained 76 running words of text broken into 9 segments as shown above. Two types of scores were computed. First, scores were tabulated for each segment by dividing the number of correct words written by each subject on each segment by the actual number of words in that segment. The level of segment difficulty could then be calculated from these scores.
Second, ten categories of errors were differentiated. A percentage score was calculated for each subject on each category by dividing the number of errors in that category by the total number of errors. Thus the native and nonnative responses could be compared for relative frequency of errors on the various segments and for the relative frequency of occurrence of the various types of errors. The categories and examples of errors of each type are given in Table 18-1.
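Both scoring procedures are simple proportions. A minimal sketch follows, assuming hypothetical per-subject tallies; only the segment word counts are taken from the passage itself, and the function names are illustrative, not the study's.

    # Word counts of the nine dictated segments (76 running words in all).
    SEGMENT_LENGTHS = [6, 8, 11, 7, 8, 12, 9, 7, 8]

    def segment_scores(correct_words):
        """Proportion of each segment's words a subject wrote correctly.

        correct_words: list of nine per-segment tallies (hypothetical data).
        """
        return [c / n for c, n in zip(correct_words, SEGMENT_LENGTHS)]

    def error_category_proportions(category_counts):
        """Each of the ten error categories as a share of one subject's total errors."""
        total = sum(category_counts)
        return [c / total for c in category_counts]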
Spelling errors were not included as part of the scoring process, except those which seemed to indicate difficulties in perception of distinct sounds, as in "rever" for "river," or those which affected the lexical identity of the word, as in "whether" for "weather" or "possede" for "possessed." Therefore, spellings like "travelling," "tieing," and "leasurly" were not counted as incorrect. (See Oller, 1979, and Chap. 6.)
Overall, in spite of the noise factor, native speakers made only about two-fifths as many errors as the nonnatives. However, both the natives and nonnatives tended to rank the segments similarly. That is, what the natives found difficult the nonnatives also found difficult, and what the natives found easy, the nonnatives also found easy. There appeared to be a similar hierarchy of difficulty in segments for the two groups. This can be seen in Table 18-2 by comparing the rank order of segments by the natives with the rank order by the nonnatives.
Table 18-1

1. Morphological Changes
   Natives:
   (1) The river life is (was) very leisurely.
   (4) In normal weather he could lay (lie) around . . .
   (9) This life could (can) be called a relaxing one.
   Nonnatives:
   (1) The river life is (was) very leasurly.
   (4) In normal weather he could lay (lie) around . . .
   (9) This life may (can) be called a relaxing one.

6. Substitution Attributable to Phonological Similarities (between [b] and [v])
   Natives:
   (6) . . . was time devoted at night (was tying up the boat at night).
   Nonnatives:
   (6) . . . was time enough to vote at night (was tying up the boat at night).

8. Inflectional Deletions
   Natives:
   (7) . . . and fixing meals, neither which require (neither of which required) much work.
       . . . and fixing meals, this require (neither of which required) much work.
   Nonnatives:
   (7) . . . and fixing meal (meals) . . .
       . . . and fixing meals, neither which require (neither of which required) much work.

*Parenthesized numbers refer to the specific segments at issue. Errors shown in italics have the correct wording given in the immediately following parentheses.
the first principal component, which accounted for 40% of the total variance. All the natives loaded on that factor at levels above .50. Their mean loading was .78. Eleven of the nonnatives also loaded on that factor at levels above .50. The mean loading for nonnatives on the first factor, however, was .19. The second principal component or factor accounted for .24 of the total variance and received loadings above .50 from the remaining twenty nonnatives. The mean loading of nonnatives on the second factor was .60, and for the natives it was .12. The Spearman rank order correlation for the ranking of segments by natives and nonnatives was .60 (Pearson, r = .58, p < .001). From all these data it is possible to conclude that what is difficult for natives tends to be difficult for nonnatives.
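In a Q-type (inverse) factor analysis, the correlations are computed between subjects rather than between test segments. A sketch of that computation and of the rank-order comparison, using a hypothetical 64-subject score matrix (the variable names and data are illustrative only):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    scores = rng.random((64, 9))        # 64 subjects x 9 segments (hypothetical)

    r_between_subjects = np.corrcoef(scores)      # 64 x 64 inter-subject correlations
    eigvals, eigvecs = np.linalg.eigh(r_between_subjects)
    order = np.argsort(eigvals)[::-1]             # largest component first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    loadings = eigvecs * np.sqrt(np.clip(eigvals, 0, None))   # subject loadings
    proportion_of_variance = eigvals / eigvals.sum()          # e.g., .40, .24, ...

    # Agreement between the two groups' difficulty rankings of the nine segments:
    native_difficulty = scores[:32].mean(axis=0)
    nonnative_difficulty = scores[32:].mean(axis=0)
    rho, p_rho = stats.spearmanr(native_difficulty, nonnative_difficulty)
    r, p_r = stats.pearsonr(native_difficulty, nonnative_difficulty)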
A similar Q-type factor analysis was also applied to the native and nonnative
ranking of error categories, i.e., to the ranking of categories by proportion of errors
in each one. The ranks, percentages, means, and standard deviations for natives
and nonnatives are given in Table 18-4. A single principal component accounted
for .866 of the total available variance, as shown in Table 18-5. The mean loading
for the natives was .84 and for the nonnatives was .90. Only two native speakers
loaded on that factor below .91 and only five nonnatives loaded on the same factor
below .90. No other interpretable factors emerged. The Spearman correlation
between ranks (as displayed in Table 18-4) was .68 (Pearson, r = .71). Clearly,
natives and nonnatives perform very similarly in terms of overall proportions of
errors of the ten types studied.
As a further check and to investigate specific error types, ten t-tests were computed contrasting natives and nonnatives in terms of the proportion of errors of each type made by each group. Only two of the contrasts were significant. Nonnatives made proportionately more errors of type 4, distortion or deletion of weakly stressed syllables (t = 4.27, df = 62, p < .001). Also, nonnatives made proportionately more errors of type 8, inflectional deletions (t = 3.24, df = 62, p < .002). We might have predicted that the natives would make fewer errors of each type, in comparison with the nonnatives. From all the foregoing we can say that when pushed toward the limits of their ability, natives and nonnatives seem to make the same kinds of errors in taking dictation and in roughly the same proportions.
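For reference, a sketch of one such contrast; the error-proportion arrays below are hypothetical, but with 32 subjects per group the degrees of freedom come out at 32 + 32 - 2 = 62, as reported:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    native = rng.random(32)        # each subject's proportion of type-4 errors
    nonnative = rng.random(32)     # (hypothetical data)

    t, p = stats.ttest_ind(nonnative, native)   # independent-samples t, df = 62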
Chapter 19
Processing of Indirectly Conveyed Meaning:
Assertion versus Presupposition in
First and Second Language Acquisition¹
Patricia L. Carrell

not only directly conveys its basic assertion, but also indirectly conveys a presupposition² and an implication³:
There are several linguistic properties that play a role in the interpretation of indirectly conveyed meaning. One is lexical class; another is grammatical structure. The example above illustrates the role played by lexical class membership (in that case, a verb, manage, of the class of implicative verbs) in the interpretation of indirectly conveyed meaning (in that case, presupposition and implication).
A classic case of lexical presupposition is illustrated by the sentence Max has stopped beating his wife. In addition to the directly conveyed assertion, this statement indirectly conveys the presupposition that Max did at one time beat his wife. The presupposition in this example is lexical because it depends on the lexical item stop; something cannot be stopped (or not stopped) unless it has been happening.
Illustrative of the role played by grammatical structure in indirectly conveyed meanings are the following cleft sentences:

(4) It is a bird that is eating the worm.
(5) It is a worm that the bird is eating.

Sentence (4) conveys a different meaning from sentence (5), owing primarily to differences in the assertions and presuppositions of these sentences (Akmajian, 1969, Muraki, 1970):

(6) a. Something is eating the worm. (presupposition)
    b. That something is a bird. (assertion)
(7) a. The bird is eating something. (presupposition)
    b. That something is a worm. (assertion)
Sentence (4) directly conveys the assertion (6b) and also indirectly conveys the presupposition (6a); sentence (5) directly conveys the assertion (7b) and also indirectly conveys the presupposition (7a). The type of presupposition illustrated here, known as grammatical presupposition, does not appear to be attributable to the presence of any particular lexical item (note that the sentences employ exactly the same lexical items). Rather the presupposition is due to the particular grammatical structures, namely, the cleft constructions. The related pseudo-cleft sentences exhibit the same properties (Muraki, 1970):

(8) What is eating the worm is a bird.
(9) What the bird is eating is a worm.

Sentence (8) directly conveys the assertion (6b) and also indirectly conveys the presupposition (6a); sentence (9) directly conveys the assertion (7b) and also indirectly conveys the presupposition (7a).
Although presupposition has been traditionally dealt with by logicians as a strictly logical relation between statements, several philosophers and linguists have recently come to regard it as a pragmatic notion relative to the belief structures of the speaker and the hearer (Sellars, 1954, Hutchinson, 1971). Hutchinson continues:
Under this analysis presupposition involves a belief on the part of the speaker that the addressee is aware of some fact which the speaker believes to be true, and assertion involves a belief on the part of the speaker that the addressee is ignorant of some fact which the speaker believes to be true. In no case are the facts themselves relevant, that is, it doesn't matter for considerations of communication whether A and B are in actuality true. What is relevant is whether x believes them to be true or not and whether y believes them or not, for it is in these terms that we may characterize the appropriateness or inappropriateness of speech acts. The speaker must believe both A and B if he is to make a legitimate assertion, and the addressee must have the beliefs the speaker believes he has if the assertion is to be appropriate to this addressee (1971, p. 136).
Obviously, then, the difference in the appropriate use of sentences like (4) and (5)
or (8) and (9) depends on what the speaker believes and what he believes his
listener believes.
Hutchinson also discusses what may happen if the belief inferences fail to hold. A speaker might, for example, employ a construction involving a presupposition the speaker believes to be false. In this case, the speaker may be said to be intentionally misleading the listener. Since a speaker may also employ a false assertion in order to intentionally mislead the listener, the question arises as to whether a listener is more likely to be deceived by a false presupposition than by a false assertion.
The cleft and pseudo-cleft sentence constructions described above are
particularly well suited for empirically testing that question. First, we may note the
neatly “reversative” relationship which holds between the noun phrases of the
presupposition and assertion of each pair of cleft sentences. In a cleft sentence
like (4) It is a bird that is eating the worm, the presupposition is about something
eating the worm, the assertion is that the bird is the thing doing it. In the related
cleft sentence (5) It is a worm that the bird is eating, the presupposition is about
the bird eating something, the assertion is that the worm is the thing it is eating.
That is, the asserted and presupposed noun phrases are reversed in the two related
types of cleft sentences. In the first type of cleft sentence, (4), we may say that the
agent noun phrase is asserted, the object noun phrase is presupposed. In the
second type of cleft sentence, (5), we may say that the object noun phrase is
asserted, the agent noun phrase is presupposed. The same relationship holds for
the two types of pseudo-cleft sentences, like (8) and (9). (See Chart 19-1.)
Chart 19-1

                    Cleft                      Pseudo-cleft
Sentence type       Type I       Type II       Type I       Type II
Assertion           (4)          (5)           (8)          (9)
                    bird         worm          bird         worm
                    agent        object        agent        object
                    1st NP       1st NP        2nd NP       2nd NP
Second, we may note the order relationship which holds between the noun phrases of cleft and pseudo-cleft sentences making identical assertions and presuppositions. Sentences (4) and (8) are alike in their assertions and presuppositions; sentences (5) and (9) are alike in their assertions and presuppositions. In the cleft sentences, the first noun phrase is asserted, the second is presupposed. In the pseudo-cleft sentences, the second noun phrase is asserted, the first is presupposed. (See Chart 19-1.)
This study makes the assumption that the theoretical linguistic distinction
between assertion and presupposition described above is psychologically “real”
and is therefore empirically measurable. In particular, by comparing subjects’
responses to these two types of cleft sentences, as well as their pseudo-cleft
counterparts, it should be possible to measure empirically the differences in the
effects of assertion and presupposition. Previous work by Hornby (1974) has
shown this to be a valid assumption for adult native speakers of English. Further,
this study assumes that if the distinction between assertion and presupposition is
present in the competence/performance of adult native speakers of English, it
should be detectable in preadult stages of children acquiring English as their first
language and in the acquisition process of nonnative adults acquiring English as a
second language. This study is an attempt to extend Hornby’s findings to children
acquiring English as their first language and to adults acquiring English as a
second language, and to compare the results of the two in terms of the relationship
between first and second language acquisition.
Method
weeks. The L1 subjects were 20 four- and five-year-old children attending nursery school. Their ages ranged from 4.3 to 5.4 (M = 4.45). Half of these subjects were female, half male.
Procedure. The subjects were presented with a series of prerecorded cleft and pseudo-cleft sentences (see Appendix 19A). Each sentence was followed almost immediately (one second delay) by the presentation of a slide picture (one second duration). The duration of the slide presentation was arrived at through a pilot experiment to be long enough to allow the subjects to form an impression of the picture but also short enough to keep them from making out all the details of the picture. Each picture involved only a simple three-element event: an agent, an object, and a simple relationship between the agent and object; for example, a girl riding a bicycle (see Appendix 19B). The task was for the subjects to decide whether the sentence did or did not correctly describe the picture. If the sentence correctly described the picture, they were to respond "true," but if they noted a discrepancy between the picture and the sentence, they were to respond "false." Their answers were later transcribed onto computer-scored answer sheets. Prior to actual testing, each group of subjects was given four example test items to ensure comprehension of the task.
The only differences in the procedures for the two groups were that the L2 adults, who were literate, were tested in groups of 10 to 15 and gave their responses in writing. In other words, they simply had to circle "true" or "false" on a prepared dittoed answer sheet. The L1 children, who were not yet literate, had to give their responses verbally—responding simply "yes" rather than "true," or "no" rather than "false." The necessity of verbal rather than written responses dictated that the children be tested individually.
Half of the test items involved misrepresentation in the picture of the asserted information (the asserted noun phrase) in the related sentence; the other half of the test items involved misrepresentation in the picture of the presupposed information (the presupposed noun phrase) in the related sentence. In no case was more than one of the two noun phrases misrepresented, and in no case was the action or verb relationship between the two noun phrases misrepresented. Additional test items in which the sentence correctly represented the picture were introduced as control items to break up any set toward negative responses that might develop, but these were not scored.
In addition to systematically varying the test items so that half the misrepresentations involved the asserted noun phrase and the other half involved the presupposed noun phrase, it was very important to control and systematically vary two other factors in order to rule out other possible alternative explanations for the expected results. In order to rule out the possibility that any detected differences might be due to a difference between first noun phrase and second noun phrase (rather than due to assertion-presupposition differences), both cleft and pseudo-cleft sentences were included. If only cleft sentences had been included, for example, one might be able to explain any significant differences between assertion and presupposition as due to differences between the first noun phrase and
the second noun phrase. (See Chart 19-1.) Therefore, half the test items involved misrepresentations in cleft sentences, the other half in pseudo-clefts. In order to rule out the possibility that any detected differences might be due to a difference between agent and object (rather than to assertion-presupposition differences), both types of cleft and both types of pseudo-cleft sentences were included. (See Chart 19-1.) Half the test items involved misrepresentations in agent noun phrases, the other half in object noun phrases. Thus the 28-item test was constructed as follows: 4 control items involving no misrepresentation between sentence and picture, not scored; 24 scored items involving misrepresentation between sentence and picture. (See Fig. 19-1.)
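The design amounts to a fully crossed set of three binary factors. Reading the half/half splits just described as a balanced 2 x 2 x 2 layout is an assumption (the item-by-item coding is not reproduced here), but it yields exactly three items per cell:

    from itertools import product

    cells = list(product(("assertion", "presupposition"),   # misrepresented NP
                         ("cleft", "pseudo-cleft"),         # construction
                         ("agent", "object")))              # misrepresented role

    items = [(kind, construction, role, i)
             for (kind, construction, role) in cells
             for i in range(1, 4)]                          # 3 items per cell

    assert len(items) == 24    # the 24 scored items; the 4 controls are separate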
Null Hypothesis. There is no significant difference in effect between assertion items and presupposition items. In mathematical terms, we would state this null hypothesis as H₀: μₐ = μₚ, where μₐ is the mean of the population on the assertion items and μₚ is the mean of the population on the presupposition items.
Research Hypothesis (preferred alternative hypothesis). There is a significant difference in effects between assertion items and presupposition items; in fact, subjects score better on assertion items (making significantly fewer errors) than on presupposition items (making significantly more errors). In mathematical terms, we would state this alternative hypothesis as Hₐ: μₐ > μₚ.
Results
The principal measure was the number of times each subject correctly reported that the sentence was not a correct representation of the picture. The important comparison was whether the correct responses occurred significantly more frequently when the misrepresentation involved the asserted noun phrase than when it involved the presupposed noun phrase. Said another way, did the subjects make significantly more errors with the presupposition items than with the assertion items? The results are presented in Table 19-1.
It can readily be seen that both groups of subjects performed better with the assertion noun phrases than with the presupposition noun phrases. Out of a maximum possible score of 12, there is over a full point difference (actually 1.12) between the mean performance on assertion items (10.04) and the mean performance on presupposition items (8.92) for the L2 subjects. For the L1 subjects, there is almost a full point difference (actually .85) between the two mean scores. When the presupposed noun phrase was misrepresented, the subjects tended to make more errors, i.e., failed more often to notice the discrepancy, than when the asserted noun phrase was misrepresented. For the presupposed noun phrases, the L2 subjects overlooked the misrepresentation an average of 3.08 times out of 12, but when the asserted noun phrases were misrepresented, the average number of errors was only 1.96. The L1 subjects overlooked the misrepresentation an average of 2.95 times out of 12 for the presupposed noun phrases, but the average number of errors for the asserted noun phrases was lower at 2.10.
We may also note that for the L2 subjects the range of correct responses was higher for the assertion items (8 to 12) than it was for the presupposition items (5 to 12). For the L1 subjects the range of correct responses was the same for both assertion and presupposition items (6 to 12). There was also relatively greater variability among the responses to the presupposition items than there was to the assertion items: s² = 2.857 versus s² = 1.567 for the L2 subjects, and s² = 4.471 versus s² = 3.568 for the L1 subjects. In other words, there was relatively greater homogeneity in the responses to the assertion items than to the presupposition items.
The relatively high mean scores (10.04, 8.92 and 9.90, 9.05) indicate that the test was not too difficult for either group of subjects and that they were generally able to carry out the assigned task;⁴ they were able to detect both the directly conveyed assertion and the indirectly conveyed presupposition. It should also be clear that these mean scores near 9.0 and above are extremely unlikely to have occurred by chance; that is, it is extremely improbable that these scores could have resulted from chance guessing on the part of the subjects. The probability of a score of 9 or above on a 12-item true-false test is only .07. Clearly, then, the subjects were not merely guessing but were attending as well as they could to the task at hand. They performed well at detecting both the directly conveyed assertions and the indirectly conveyed presuppositions.
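The .07 figure can be checked directly as a binomial tail probability for a single subject guessing on all twelve true-false items:

    from math import comb

    p = sum(comb(12, k) for k in range(9, 13)) / 2 ** 12
    print(round(p, 3))    # 0.073 -- the ".07" cited in the text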
Discussion
These results demonstrate a number of things about the competence/performance with assertions and presuppositions of both adults acquiring English as a second language and young children acquiring English as their first language. First, they show that both groups of subjects are able to detect directly conveyed assertions and indirectly conveyed presuppositions. Both groups performed better than chance in recognizing either aspect of the meaning of the sentence/picture pairs they were presented with.
Second, the results show that there is a certain degree of correlation between the two variables. That is, the processing of asserted information and the processing of presupposed information are not totally unrelated. Similar mental strategies appear to be involved. What isn't clear from these results is the magnitude of the correlation. The results from the L2 subjects appear to indicate only minimal overlap—only about 17% of the variance in one variable is accounted for by the variance in the other variable. The results from the L1 subjects appear to suggest greater overlap—as much as 45% of the variance overlapping. One possible explanation might be that the correlation coefficient for the L1 subjects is artificially high because of the relatively small sample size. However, this area of correlation and overlap between the two variables requires further investigation.
Third, these results show that, although the two variables are related, they are significantly different in the level of competence/performance. Both groups of subjects attended better to assertions than to presuppositions; they are more likely to be deceived by misrepresented presuppositions than by misrepresented assertions. The fact that the subjects in all these studies failed more often to notice the discrepancy between the presupposed proposition and the picture suggests that they tended to take for granted that this part of the picture was correct and focused their attention on those parts of the picture relevant to determining the correctness of the asserted proposition.
These findings might best be accounted for in terms of Hutchinson's (1971) notion of pragmatic presupposition and his so-called "inference schema" described earlier. Sentences like clefts and pseudo-clefts, with overt grammatically signaled distinctions between asserted and presupposed information, are generally used in contexts where the speaker assumes that the listener already knows the presupposed information. It follows then that the presupposed part of the sentence is not usually providing the listener with any new information. In judging whether or not such statements were true, the listener would be concerned primarily with the new information that the speaker is asserting. Under time pressures such as those in the present study, the subject would presumably first try to confirm this aspect of the meaning by rapid visual analysis of the assertion-related portion of the picture. Based on that confirmation, and lacking additional time to further check the presupposition-related portion of the picture, he would conclude that the statement was a correct description of the picture.
Hutchinson describes this phenomenon as one of presupposition "swallowing" (1971, p. 137). If the hearer has no prior beliefs about the presupposition, as would be the case in the present study, and does not have the time or the inclination to check the presupposition, as would also be the case in this study, a hearer has two courses of action open to him:

(i) he can express surprise over the new (putative) fact brought to his attention, or
(ii) he can "swallow" the presupposition and come to believe it on the basis of his respect for the "expertise" of the speaker (Hutchinson, 1971, p. 137).
In some instances, Hutchinson maintains, the latter course of action is the more natural of the two. Most listeners, most of the time, do not go around expressing surprise at the presuppositional beliefs attributed to them by speakers; rather, they tend to go along with the speaker. A convincing example is cited by Hutchinson: If someone said to a listener, "The present shaman of the Chippewa is a friend of mine," the listener would most likely conclude that there exists such a person rather than question or express surprise over the existence of such a person. Hutchinson says:

It is through this propensity to adopt the beliefs of others when we do not hold counterbeliefs that one can inform (or misinform) through presuppositions. Presuppositional lying can be extremely effective (1971, p. 138).
Notes
1. The author gratefully acknowledges the assistance provided by the following graduate students who worked on various segments of the project which yielded these studies: Joan Jamieson, Maureen Garry, Jonas Nartney, and Pamela Benson. Thanks are also due to the Center for English as a Second Language (CESL) at SIU-C, which kindly granted the author access to the L2 subjects who participated in the study, to the Child Study Cooperative Nursery School at SIU-C, which kindly granted the author access to the L1 subjects who participated in the study, and to SIU-C's Office of Research Development and Administration, which provided financial assistance for this project. This paper appeared in Language Learning under the title "Empirical Investigations of Indirectly Conveyed Meaning: Assertion versus Presupposition in First and Second Language Acquisition" (1977, 27, 353-369). It is reprinted here, with minor editorial changes, by permission.
2. Strawson (1952) defines presupposition as the relation between two statements, A and B (read "A presupposes B"), when the truth of B is a necessary condition for the truth or falsity of A.
3. Classical logical implication is a relation between two statements, A and B (read "A implies B"), when the truth of B follows from the truth of A and the falsity of A follows from the falsity of B. I am following Austin (1962) in using the term "imply" in its ordinary weaker sense: "A implies B" means only that asserting A commits the speaker to B; asserting ~B need not commit the speaker to ~A.
4. As an indication of the overall reliability of the instrument, the internal consistency reliability coefficient of the 24-item test was computed to be .55. If the test were lengthened to a total of 100 items, the estimated internal consistency reliability coefficient would be .84.
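The projection in footnote 4 follows from the Spearman-Brown prophecy formula, r' = kr / (1 + (k - 1)r), where k is the lengthening factor. A quick check:

    def spearman_brown(r, k):
        """Projected reliability of a test lengthened by factor k."""
        return k * r / (1 + (k - 1) * r)

    print(round(spearman_brown(0.55, 100 / 24), 2))   # 0.84, as in footnote 4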
5. A one-tailed test of significance is appropriate because the research hypothesis claims that assertion items will be less difficult than presupposition items.
Appendix 19A
Test Items by Groups
The following is the 28-item test used in the reported studies. Items are listed in the same random order in which they were administered. If the item involved a misrepresentation between the sentence and the picture, that misrepresented noun phrase is underlined in the sentence and the object actually pictured is indicated on the right. Each item is coded as to the type of item it is:
EXAMPLE:
TEST ITEMS:
Cleft-As-Obj 20. It is a gun that the lady is holding. bow and arrow
Appendix 19B
Sample Pictures
The following are eight sample pictures used in the reported studies.* Pictures are
numbered to correspond to test-item numbers.
*The pictures are taken from Peabody Picture Vocabulary Test: Series of Plates by Lloyd M. Dunn, Ph.D., published by American Guidance Service, Inc. They are reproduced here by permission.
Chapter 20
Craig B. Wilson

The view of grammar as grammatical structure opens the way to a comparison of the grammatical structure of the foreign language with that of the native language to discover the problems of the student in learning the foreign language. The results of such a comparison tell us what we should test and what we should not test . . . (1957, p. . . .)
The experiment reported here was intended to measure the degree to which cloze
tests that are deliberately biased on the basis of contrastive analysis would actually
be harder for Vietnamese than for speakers of other languages. It tests the
approach to ESL for Vietnamese described in a guide to teachers of Vietnamese
refugees published by the Center for Applied Linguistics:
The teacher of Vietnamese students can tell in advance which lessons will be difficult
for his students by comparing the structure taught in a lesson with the parallel structure
in Vietnamese... (National Indochinese Clearinghouse, 1975, p. 7).
Method
Subjects. Subjects came from seven ESL programs with Vietnamese students.² Students were judged to be at the intermediate stage of ESL learning or higher. The 72 subjects from 12 language backgrounds were grouped according to their native languages into three main categories: Vietnamese (37), other nonnatives (26), and native speakers of English (9). For the sake of judging degrees of interference, the group designated "other nonnatives" was divided further into Southeast Asians: Hmong (5), Laotian (4), and Cambodian (1); other Asians: Chinese (5) and Japanese (1); and Indo-Europeans: Spanish (6), and Farsi, Hindi, Armenian, and Russian (1 each). In a few cases, respondents indicated two languages as their first. The language used by the minority in the country in question was chosen.³
Tests. The cloze tests were built from four Reader's Digest articles which were reduced to about 200 words each (see Appendix 20A). A contrastive analysis of Vietnamese and English grammatical structures served as the basis for biasing three of the passages. The author of that contrastive analysis shares Lado's approach to learning difficulties:

The fundamental principle guiding the writing of this contrastive grammatical analysis of English and Vietnamese is the conviction held by many linguists and foreign language teaching specialists that one of the major problems in learning a foreign language is the interference caused by the structural difference between the native language of the learner and the foreign language to be learned (Nguyen Dang Liem, 1967, p. xii).
Four cloze tests were constructed: (1) Every fifth word was deleted from the unmodified passage, which served as the control test; (2) a selected deletion test was constructed over the second passage by choosing to place cloze deletions in structures predicted by the contrastive analysis to be difficult for Vietnamese; (3) a third, salted test was constructed by loading the text with structures predicted to be hard for Vietnamese by contrastive analysis and simply deleting every fifth word on a random basis; (4) the remaining passage was used to construct a double-biased test by salting and by carefully selecting difficult structures for deletion points.
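A minimal sketch of the fixed-ratio deletion used for the control test; the selected-deletion and salted variants would instead target, or add, the structures flagged by the contrastive analysis. The function name is illustrative and the passage fragment comes from Appendix 20A:

    def make_cloze(text, n=5):
        """Replace every nth word with a blank; return mutilated text and key."""
        words = text.split()
        key = []
        for i in range(n - 1, len(words), n):
            key.append(words[i])
            words[i] = "_____"
        return " ".join(words), key

    passage, answers = make_cloze(
        "Arthur Mitchell was one of the first black ballet dancers "
        "to be in a major American ballet company.")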
The Flesch readability formula showed the modified texts (Appendix 20A) to be of nearly identical difficulty and of similar interest levels. However, the Fry (1968) readability index ranked the control test and the selected deletion test as appropriate for readers between grades six and seven. The salted test was rated at seventh grade reading level and the double-biased test at between grades seven and eight. A similar ranking using the SMOG index (McLaughlin, 1969) rated the first three tests at eighth grade level and the fourth at ninth grade level.
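For comparison, a rough sketch of the Flesch Reading Ease computation; the vowel-group syllable counter is a crude stand-in for hand counts, so scores are only approximate:

    import re

    def count_syllables(word):
        # crude heuristic: each run of vowels counts as one syllable
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    def flesch_reading_ease(text):
        sentences = max(1, len(re.findall(r"[.!?]+", text)))
        words = re.findall(r"[A-Za-z']+", text)
        syllables = sum(count_syllables(w) for w in words)
        return (206.835
                - 1.015 * (len(words) / sentences)
                - 84.6 * (syllables / len(words)))

    print(round(flesch_reading_ease(
        "Arthur Mitchell was one of the first black ballet dancers."), 1))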
Table 20-1 Means and Standard Deviations on the Four Cloze Tests for Each Language Group

                                       Selected
Language group              Control    deletion     Salted       Double-biased
                          Mean   SD    Mean   SD    Mean   SD    Mean   SD
Vietnamese                15.89  7.85  11.27  6.61   9.76  7.11   7.41  5.14
Other SE Asian languages  12.20  6.51   7.40  5.30   6.00  2.62   5.80  2.82
Chinese and Japanese      23.33  5.92  15.50  7.82  14.83  7.22  11.33  5.28
Indo-European             20.70  7.04  17.30  7.21  14.80  6.34  11.40  6.80
Native English            26.89  5.37  23.44  4.93  24.00  5.77  21.00  7.57
                      Vietnamese    Other nonnatives
Test                  mean          mean                df      F*
Selected deletion     11.99         12.06               1,60    .006
Salted                10.43         10.45               1,60    .001
Double-biased          7.91          8.51               1,60    .611
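The F ratios in the table above are equivalent to one-way analyses of variance on two groups. A sketch with hypothetical score vectors, sized so that the degrees of freedom come out at 1 and 60 as reported (the means and SDs below only loosely echo Table 20-1):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    vietnamese = rng.normal(11.3, 6.6, size=37)        # hypothetical scores
    other_nonnatives = rng.normal(12.1, 6.5, size=25)  # sized for df = 1, 60

    F, p = stats.f_oneway(vietnamese, other_nonnatives)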
The mean product moment correlation (not tabled) between scores on the biased tests and scores on the control test was .88 for the Vietnamese and .85 for the other nonnatives taken together. The correlations between item ranks on the several tests also led us to reject the hypothesis that the Vietnamese performance is in any way due specifically to their language background. The average Spearman rank-order correlation between the item ranking by the Vietnamese and the ranking by the several other language groups taken together (including the native English speakers) was .70 on the selected deletion test, .78 on the salted test, and .67 on the double-biased test. The corresponding Pearson product moment correlations of the average scores on test items over all four tests were .85 for the Vietnamese with the other Southeast Asians; .83 for the Vietnamese with the Chinese and Japanese Asians; .81 for the Vietnamese with the Indo-Europeans; and .64 for the Vietnamese with the native speakers of English.
In conclusion, there is no evidence of a difference in performance due to native language interference for the Vietnamese group. These findings, though they do not constitute a final answer to the contrastive analysis issue, are nonetheless convincing enough to suggest that native language interference may be a less important factor than it has often been claimed to be. The author believes, however, that interference models built on foundations other than structuralist models also need to be tested. Ross (1976), for example, suggests that contrastive analysis should be directed to the use of structures rather than to their surface forms. The question is whether experimentally testable hypotheses can be formulated.
Notes
Appendix 20A
Cloze Tests
NAME ______ AGE ______

Arthur Mitchell was one of the first black ballet dancers to be in a major American ballet company. Now, after doing well as a dancer, he is teaching ballet to black boys and girls.

When he was a __boy__, Arthur never dreamed of __being__ a famous dancer. But one __day__ a teacher noticed him __dancing__ at a party __and__ encouraged him to study ballet. __He__ tried out and __gained__ admission into a high __school__ for the arts. When he graduated, some people gave him __enough__ money to study ballet all the time.

At the __arts__ school, Arthur had to __work__ very hard. "I'm __a__ fighter. I was determined to succeed," he says of those days. He began to perform in the school's troupe and stayed with it for fourteen years. He became one of its best dancers.

The __murder__ of Martin Luther King in __1968__ changed Mitchell's life. With the death of a __man__ loved by so many __people__, Arthur asked himself: What can __I__ do in his __memory__? He decided to teach __ballet__ to black children, who usually __have__ little chance to __study__ that art. Mitchell chose to open a school in Harlem, __in__ New York City. He
NAME ______ AGE ______

When Pele was a __boy__, he, too, played soccer with his friends. At first, __the__ ball was only made __of__ rags, and the boys played in the street. But by the time he turned __fifteen__, Pele was able to play for money. As he matured, he became the __best__ player in Brazil. He helped __his__ country win three world ______.

Pele's playing might be useful in another way. Once, while the Biafran war was raging, he went to that area to play. On the two days he was there, the fighting actually stopped so that people could go to see him on the field.
NAME ______ AGE ______

Jean Eymere is a champion skier who lost his sight to disease when he was thirty-three. But he has not let blindness defeat him. Instead, he has learned to use his knowledge to help other sightless people to live fuller lives.

Jean was a skiing teacher until 1969. That was the __year__ when disease blinded __him__. Though encouragement was offered __by__ friends, Jean himself felt that his life had become __empty__. However, he was finally persuaded by another skier to __try__ his sport again. When Jean attempted to ski, he found he could hardly even __stand__ up. Yet, there was the realization that he could __learn__ to ski again, despite __being__ blind.

It was this __experience__ that gave Eymere the __idea__ for his life's work. Skiing __was__ dangerous for the __sighted__. Jean dared to believe that blind people could learn. Although __some__ did not agree __with__ him, Jean began encouraging __the__ blind to ski. "I want to make other blind __people__ laugh and feel the wind on their faces, as __I__ have," he would say.
NAME ______ AGE ______

Loretta Lynn is the most popular woman country music singer in the United States. Housewives especially like her. Her songs are full of the wisdom she herself gained as a wife and mother.

Loretta was __born__ the second of eight children to a __poor__ family living in Kentucky. Whenever he __could__, her father mined coal or worked on federal projects for those without __jobs__. He __was__ helped by a wise wife. But they remained poor because saving money was __hard__ in those days. Then, at only thirteen, Loretta chose to __marry__, though few __would__ marry at that age. ______ from Washington State, her __husband__ decided to move his family back there. __This__ meant that __young__ Loretta had to __carry__ the load of raising a __growing__ family far away from her parents.

__There__ was always a lot of work for the new __wife__ to do. She found that singing __would__ make her happy. __It__ was this ______ that showed her husband that Loretta __had__ much talent
Discussion Questions

1. Fishman boldly asserts in her title that "we all make the same mistakes." Discuss the evidence that she presents along with that presented in Chaps. 19 and 20 by Carrell and Wilson. To what extent does her observation seem to be true? How much of the total variability across test items in Fishman's study, Carrell's, and Wilson's appears to be common to native and nonnative speakers?
2. Discuss the specific errors committed by natives and nonnatives which are used to illustrate the various error categories employed by Fishman (see Table 18-1). Consider the errors that seem to be identical for natives and nonnatives. Also consider any that seem to be different. What strategies seem to be common to both groups, and what strategies seem to be unique to one or the other?
5. The reliability of Carrell's 24-item test was .55 according to her footnote 4. How does this reliability compare with some of those observed in relation to other testing procedures discussed in earlier chapters? What possible explanations might be offered to explain the contrast? Can Carrell's test be classed as integrative or discrete-point? Is it a pragmatic procedure? Bear in mind the fact that her test does require the mapping of utterances onto contexts.
7. Note that Wilson remains (see his concluding remarks) unconvinced of the
failure of contrastive analysis to provide a suitable basis for explaining the
performance of the Vietnamese learners on the ESL tests. He seems to believe
that a different sort of contrastive analysis might get better results. Discuss
some of the options for using a functional contrastive analysis rather than a
structural one. (See Ross, 1976.)
Part VII
How strongly are measures of aptitude, attitude, and other variables related to second language proficiency? Is the Modern Language Aptitude Test, for instance, a good predictor of success in university level foreign language study? Does motivation really seem to influence the degree of success attained in learning ESL? Are self-reported differences in amounts of practice in using the target language significantly related to its acquisition? Are variables that should be expected to correlate with ESL proficiency more strongly related to it than variables which have no theoretical relation to ESL learning? Can a measure of redundancy utilization (e.g., the tendency to use correctly morphemes which are largely redundant) be shown to be a good index of the degree of integrativeness of ESL learners? These questions and related issues are dealt with in Part VII.
Chapter 21
Sadako O. Clarke

Aptitude tests have been used for a variety of purposes by educators, e.g., for grouping within classes and determining whether a student has the potential for future language study (a use made by the Foreign Language Institute). This study was designed to determine the strength of the Modern Language Aptitude Test as a predictor of foreign language achievement scores. More specifically, we wanted to know whether scores on the MLAT are as highly correlated with language achievement for Japanese as for Indo-European languages. German was selected as a representative of the latter language family. We were also concerned to discover what effect course status (whether elective or required) had on aptitude-achievement correlations and whether these correlations were higher for some language skills than for others. Finally, we attempted to discover whether previous language training had any effect on aptitude scores and what effect this training had on achievement.
Method
Materials. Two major types of instruments were used: the Modern Language Aptitude Test (MLAT) and various achievement tests given by the language instructors. A questionnaire was also used to elicit information about behavior and experience outside the classroom.
MLAT. The short form of the MLAT developed by Carroll and Sapon (1959) was used. It consists of Parts III, IV, and V from the longer version. In Part III, Spelling Clues, students select the correct meaning of disguised spellings of English words. In Part IV, Words in Sentences, students respond to various aspects of English grammar but without having to use specific grammatical terminology. This part purports to measure sensitivity to grammatical structure. In Part V, Paired Associates, students memorize pairs of words. This supposedly measures their ability to learn rapidly by rote. This short form of the MLAT requires approximately 30 minutes of testing time and was given at the beginning of the spring semester.
Table 21-1 Mean Scores and Standard Deviations for the Short Form of the MLAT (Total and Parts) and for Fall and Spring Achievement Scores (Total and Grammar Subscore)

                                  Japanese         German           German           German
                                  Total N = 22     Total N = 69     Elective N = 25  Required N = 44
Variables (possible score)        Mean     SD      Mean     SD      Mean     SD      Mean     SD
MLAT Part III (50)                23.32    9.70    19.94    8.12    24.72    7.48    17.23    7.22
     Part IV (45)                 21.23    6.74    25.33    7.48    27.00    7.12    24.39    7.59
     Part V (24)                  17.45    5.30    18.86    4.52    19.84    4.20    18.30    4.64
MLAT Total (119)                  62.00   18.19    64.13   15.23    71.56   13.62    59.91   14.60
Achievement, fall 1976, %         90.75    5.44    86.09    7.95    87.27    8.89    85.42    7.38
     Spring 1977, %               85.53    8.07    84.61   10.09    87.12    9.49    83.19   10.24
Grammar subscore, spring 1977, %  85.11    8.00    81.61   13.95    84.81   13.03    79.80   14.27
Table 21-2 Correlations between the MLAT Scores and Fall and Spring
Achievement Scores in Japanese and German
*p < .05.
                             N    Criterion                      r     MLAT mean   SD
Present study (German only)  69   Spring semester achievement    .29   64.1        15.2
Criterion samples
  H                          24   Course grade                   .47   61.0        14.2
  K                          22   Course grade                   .29   61.7        13.0
  K                          21   Course grade                   .36   70.8        12.6
  J                          37   Final exam grade               .40   76.6         9.8
  J                          38   Course grade                   .30   76.6         9.7
Table 21-4 Correlations between the MLAT Scores and Fall and Spring Achievement Scores for German Students Broken Down into Elective and Required Groups

            Elective N = 25                      Required N = 44
            Fall 1976  Spring 1977  Grammar      Fall 1976  Spring 1977  Grammar
Part IV     .35*       .30          .29          .55*       .32*         —
Part V      .04        -.07         -.29         .11        .09          .14
Total       .24        .15          .01          .49*       .29*         .31*

*p < .05.
also required to take German. Table 21-1 shows that the German students who took the courses as electives obtained the highest mean score (71.56) on the MLAT. German students who took the courses as requirements, on the other hand, obtained a mean of 59.91, and the Japanese group obtained a mean of 62.00. However, the German elective group did not get the highest score for academic achievement (87.27); the Japanese group did (90.75). However, as shown in Table 21-4, there were no significant aptitude-achievement correlations for the elective German students except between Part IV of the MLAT score and the fall achievement score (r = .35). All the other correlations in Table 21-4 for the elective German group were insignificant. This confirms Cooper's finding (1964) that only grammar tests seem to predict first semester grades of students in German. Our study also supports Carroll's (1958) observation that English grammar tests frequently have been highly predictive of success in foreign language classes (though not necessarily of success in actual foreign language acquisition; see Krashen, 1977).
Table 21-5 shows the correlations between previous language training and MLAT scores.* Carroll and Sapon (1959) note that very little direct evidence has been obtained concerning the relation between previous language study and MLAT scores. But here students in the German group who had studied Latin in the past showed a high correlation with Part III (r = .89, p < .001, N = 9) and the total score of the MLAT (r = .73, p < .01, N = 9). Could prior study of Latin have an effect on linguistic aptitude as measured by the MLAT? Does the MLAT measure what Latin courses teach? There certainly appears to be a substantial amount of shared variance between the MLAT total and especially the Part III subscores and previous study of Latin. Table 21-5 also shows that Latin is the only language for which prior study is correlated with aptitude. Could this be because the methods of instruction used in Latin (e.g., teaching a good deal about English word roots, etc.) contrast markedly with those used in modern languages?
*In spite of the fact that many of the correlations in this table appear to be quite strong (both positive and negative ones), the numbers of subjects are so small in most cases that the correlations (even the seemingly large ones) rarely achieve significance.
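The footnote's caution can be made concrete: a correlation's significance is assessed by converting r to t with N - 2 degrees of freedom, so a very large r can still be significant at N = 9, while a moderate r at a smaller N falls short. A sketch:

    from math import sqrt
    from scipy import stats

    def p_for_r(r, n):
        """Two-tailed p for a Pearson r based on n cases (t with n - 2 df)."""
        t = r * sqrt((n - 2) / (1 - r * r))
        return 2 * stats.t.sf(abs(t), df=n - 2)

    print(p_for_r(0.89, 9))   # Latin with MLAT Part III: significant despite N = 9
    print(p_for_r(0.48, 7))   # French with Part III: sizable r, yet not significant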
Table 21-5 Correlations between Previous Language Training and MLAT Scores

German group
                 MLAT      MLAT      MLAT
Language    N    III       IV        V         Total
German      69   .15       -.02      .0015     .07
French       7   .48       .23       .08       .34
Spanish     22   -.10      .04       -.29      -.12
Latin        9   .89*      .06       .46       .74*

Japanese group
                 MLAT      MLAT      MLAT
Language    N    III       IV        V         Total
Japanese    22   .17       -.004     .08       .12
Spanish      9   .26       -.17      -.03      -.26
Chinese      3   .79       .13       .28       .39
French       4   -.77      -.34      .32       -.57

*p < .01.
Amount of time studying Japanese     .12    .47*   .53*   .62*   .63*
Parents' use of Japanese             .21    .40*   .32    .41    .71*

Amount of time studying Japanese     .64*   .70*   .67*   .75*
Parents' use of Japanese             .53    .47    .49    .33
Time abroad                          .62*   .57*   .60*   .79*

*p < .05.
It should be noted that Japanese achievement may have been affected by the smaller number of students in Japanese classes and by the fact that they were encouraged to work with native-speaking Japanese students on a tutorial basis. This sort of experience was not available to the German students.
In summary, it would appear that achievement in foreign language study is significantly (though weakly) related to scores on the MLAT. Contrary to the expectation of Carroll and Sapon (1959), the MLAT predicted achievement in Japanese (not an Indo-European language) better than achievement in German (an Indo-European language with Roman script). Further, for the Japanese subjects, time spent in studying the language, parents' or spouses' use of the language, and time in a country where the language was spoken all had significant impacts on achievement. The possibility that the MLAT may measure rather well whatever is taught in Latin courses is fairly strongly suggested.
Further study is needed on all the foregoing issues, but it seems clear that the MLAT by itself leaves a rather large margin of error as a predictor of foreign language achievement. Between 45 and 90% of the variance in foreign language achievement in this study, for both the German and the Japanese students, was not predicted by the MLAT. Furthermore, and much more importantly, the results obtained here are if anything more encouraging than is typical of previous research with the MLAT (Carroll and Sapon, Manual, 1959; also see Carroll, 1967). Indeed, the predictions for the Japanese group were considerably better here than those reported for the German reference populations in the MLAT test manual.
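The "45 and 90%" figure corresponds to 1 - r² for the strongest and weakest of the relevant validity coefficients; the two r values below are illustrative endpoints consistent with that range, not quotations from the tables:

    for r in (0.74, 0.29):
        print(f"r = {r:.2f}: {1 - r * r:.0%} of achievement variance unpredicted")
    # 0.74 -> 45%; 0.29 -> 92%, bracketing the range cited above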
Chapter 22
Mitsuhisa Murakami
Method
Materials. In order to find out whether or not various predictor variables were significantly correlated with proficiency scores in English, two types of tests were constructed—a modified dictation procedure and a cloze test, accompanied by a self-report questionnaire. Ten items were made for the dictation by taping various portions of radio broadcasts. Parts of the news, dramas, narratives, reports, and interviews were used. Test items were arranged so that subjects first listened to an entire recorded selection and then wrote a selected sentence from it. The selected sentences were copied on the tape and thus were repeated several times, always in exactly the same way that they were originally said. There were no unnatural pauses between words or phrases in the criterion sentences. The text of each item is given in Appendix 22A.
For the cloze test, two passages were taken from college freshman textbooks, one from Pickett and Laster, Writing and Reading in Technical English, p. 218, and the other from Anne Free, Social Usage, p. 117. The usual cloze test construction procedure was followed, with every seventh word suppressed and the first two sentences in each passage left intact. The total number of blanks was twenty-five. These tests are reproduced below as Appendix 22B.
On the questionnaire, the subjects were asked to supply demographic information and self-ratings of their speaking and listening skills, as well as information about their social life in the United States.
*Not significant (p > .05). All nonasterisked correlations are significant at p <.05 or better.
friends the ESL student had, the better was his reading skill as measured by the Cloze test. The reason for this is unclear. Perhaps even more curious are the statistics in row (v). The number of pages the subjects wrote in English was more strongly related to their score on the Dictation than on the Cloze task; that is, the number of pages written in English is more closely related to scores on the listening comprehension task than to scores on the reading comprehension task. There is no obvious explanation, but the following speculation is offered for consideration.
Since writing requires intense mental and linguistic work for nonnative speakers, it may orient ESL students to be attentive to the less salient function words of speech in addition to the more obviously meaningful content words. Thus writing might exercise their ability to catch the function words of speech, which otherwise flow too fast and are almost inaudible. In fact, the average subject had the greatest difficulty with the function words on the Dictation test. In the Cloze test, in contrast, function words are shown as boldly as content words, which naturally requires less analysis-by-synthesis guesswork on the part of the nonnative speakers. Hence, perception of grammatical points learned through writing might prove to be very helpful in clarifying the fuzzy segments of actual speech.
Table 22-2, row (vi), shows a negative correlation between the integrative motive and the subjects' test performance, which is partly in agreement with the finding of Spolsky (1969) that integrativeness is not a significant factor for Japanese learning ESL. The integrative motive, which has been summarized from the reasons the subjects gave for their stay in the United States, consisted of five statements as follows:

(1) "I like the United States as a nation." (2) "I like Americans." (3) "I want to marry an American." (4) "I want to make as many English-speaking friends as possible." (5) "I want to see real American life and people."
Subjects rated their agreement with each statement on a five-point scale ranging from "indifferent" to "very much so." The results show that, the more subjects indicated that they were integratively motivated, the less proficient they were likely to be.
As is seen in Table 22-2 (vii) and (viii), the correlation between the self-rating
of speaking skill and test performance, especially on dictation, was fairly high,
though not high enough to be taken as an adequate substitute for the proficiency
tests used in this study. The rating of ability to speak English was somewhat more
highly correlated with the ESL test scores than the rating of listening skill.
In conclusion, the fact that Japanese students show rather remarkable
progress in their aural comprehension skills and less marked improvement in their
reading skill according to their length of stay in the United States presumably
reflects the bias of English education in Japan. Their reading skill may have been
developed nearer to its maximum while there was greater room for improvement in
listening comprehension.
Notes
1. I am grateful for the comments and suggestions of John Oller on an earlier draft of this paper. I would also like to thank Takeshi Ohara for help with the statistics and computer programming for the
present study. Any errors, of course, are my own.
2. For the 17 cases where the TOEFL test was taken between November 1974 and November 1976,
the Dictation, the Cloze, and the Total scores of the present test given during October and November
1976 correlated with the total TOEFL scores at .59, .68, and .76, respectively.
Appendix 22A
Part I. Dictation.
(2) The Democratic Vice Presidential nominee says the Peace Corps repre¬
sents a classic example of the dividends that flow from idealism and that its
spiritual commitment may be more important than what its projects actu¬
ally accomplish. At its peak in the mid-sixties, the Corps had more than
15,000 volunteers in 48 countries; now it has less than 7,000 in 58 coun¬
tries. —News and Commentary
(3) Everyone wants more out of life. Did you ever meet someone totally satis¬
fied? The more we have, the more we want, it seems. But, then, there are
those who resign themselves to mediocrity. They know they will go no
further and settle down to what Henry David Thoreau calls “lives of quiet desperation.” —Narrative
(4) Consider primitive man who cowered in a cave, who literally trembled at
every falling leaf, who saw himself surrounded on every side by malevolent
spirits. Even the gods to whom he prayed were capricious and fickle. That
was the primitive, ignorant, undeveloped man. But why do we have similar
problems, we sophisticated, modern, civilized beings? —Narrative
(5) “Fred?”
“Yeah, honey, it’s me.”
“Well, what’s the matter?”
“Uh, you better forget about going out to dinner tonight. I just got fired.”
“What?”
“A nice anniversary present for us, huh?” —Drama
(8) “Well, lately we’ve been hearing a lot about hyperactive children, Dennis. These kids are restless, irritable, excitable and impulsive, and naturally this behavior pattern causes all kinds of problems both at home and at school.” —Science Report
(9) “You mentioned they were looking at the brain of the ant. Now let me see.
That’s about the size of a speck. What can they possibly learn from that?”
“. . . His experiments have shown that the individual ants with bigger brains are able to perform better in intelligence tests.”
“Hum, that’s fascinating! . . .” —Science Report
(10) “Speaking of the sun,... all over the country millions of Americans are
stretched out on beaches, back yards and swimming pools trying to get a
good, healthy sun tan. Only, as most of us ought to know by now, the sun tan
is not too healthy.”
“.. . Every year we pass along the same warnings, and I have a feeling that
people aren't really listening.” —Science Report
Appendix 22B
Part II. Cloze Test
1. The letter seeking adjustments is in some ways the most difficult of all
letters to write. Frequently the writer is angry or annoyed or extremely dissat¬
isfied and his first impulse is to express his feelings in a harsh, angry, sarcastic
letter. But the purpose of the adjustment letter is to bring about positive
action that satisfies the complaint. A rude letter that antagonizes the reader is not likely to result in such a positive action. Thus, above all,
in writing an adjustment letter, be calm, courteous, and businesslike.
Assume that the reader is fair and reasonable. Include only factual
information, not opinions; and keep the focus on the real issue, not on
personalities.
Generally, the adjustment letter includes these three points:
(1) identification of the transaction, (2) statement of the problem, and
(3) desired action.
Chapter 23

What factors are important to learning English as a second language for adult
foreign students in the United States? Seven types of self-reported data were
investigated: (1) descriptive/demographic variables such as length of study
of EFL before coming to the United States; (2) expressed attitudes toward
instruction such as the extent to which learners enjoy or feel they benefit
from ESL classes; (3) reported behavior in using and studying the target
language including whether the learner thinks or dreams in English and the
amount of time spent studying English each day; (4) attitudes toward
Americans including the extent to which they are viewed as truthful, kind,
friendly, rich, powerful, etc.; (5) attitudes toward self including the extent to
which the learner sees himself as reserved, talkative, calm, carefree, etc.;
(6) reasons for studying English including integrative motives such as getting
to understand Americans and instrumental motives such as getting a good
job; and (7) opinions on controversial topics such as the legalization of
marijuana or the abolition of capital punishment. Subjects were tested on a
battery of language proficiency tests including a conventional dictation and
a modified cloze test focused on grammatical functors, modals, conjunctions, prepositions, and the like. Moderate to low correlations (never above
.46) were observed between the dependent variables (scores on the
dictation and the grammar test) and the predictor variables (the seven types
of self-reported variables) for the 45 to 77 subjects who took the tests and
completed the questionnaires. None of the attitude variables accounted for
more than .16 of the variance in either of the language proficiency measures
used. Further, the variables of type (7), which were originally considered
extraneous to the study, accounted for as much variance as any of the non-
extraneous variables. Three alternative explanations are considered. The alternative that the attitude questionnaire is a kind of unintentional language and intelligence test cannot be ruled out.
In the last couple of decades, the hypothesis that certain attitudes and
motivations are apt to lead to higher levels of attainment in second language learn¬
ing than others has grown in popularity. It has come to be widely accepted that a
so-called integrative orientation toward the target language culture—that is, a
desire to become like valued members of that community—is apt to be a more
effective basis for second language learning than an instrumental orientation—the
mere desire to acquire some material or utilitarian advantage (Gardner and
Lambert, 1972). Recently, however, another plausible alternative has been pro¬
posed to explain the possible superiority of the so-called integrative motive.
Savignon (1972) has produced evidence suggesting that an integrative orientation
may be the result rather than the cause of a superior performance in language
learning.
More recently still, a third possibility has been proposed: the popular
measures of attitudes and motivations may themselves be surreptitious measures
of language proficiency (Oller and Perkins, 1978, Chap. 5). If the last alternative
were correct, it would not invalidate the theories, but it would invalidate tests of
them based on the popular questionnaire formats. Further, if the latter alternative
were correct, it should be possible to show that the correlation between attained
target language proficiency and affective variables such as integrative or instru¬
mental reasons for learning ESL should be no stronger than the correlations
between attained language proficiency and attitudes toward extraneous variables
such as the legalization of marijuana, the abolition of capital punishment, and
whether or not abortion is an immoral act.
A disturbing fact about much of the previous research is that the posited rela¬
tionships between motives and learning are sometimes sustained and sometimes
not, yet the popularity of the theories seems to increase in spite of the evidence.
For instance, it is generally assumed that an integrative motive is superior to an
instrumental motive. Gardner and Lambert (1959) found evidence that this is so.
However, Anisfeld and Lambert (1961) found no contrast between the two types
of motivation and neither did Lambert, Gardner, Barik, and Tunstall (1962).
Lukmani (1972), in fact, found the opposite—that the instrumentally motivated
learners tended to outperform integratively motivated learners. It would seem that
the real strength of the theories resides in their intuitive appeal rather than in the
available empirical data. Perhaps as Oiler (1977) has argued, the final arbiter of
questions concerning attitudes and affective variables in general will have to be
subjective judgment rather than empirical tests employing psychometric measures of affect.
In the meantime, certain empirical questions remain to be answered:
(1) What is the relation between language proficiency and a wide range of self-
reported variables? (2) Can language proficiency be predicted more accurately on
the basis of variables that are expected to be causally related to its attainment than
on the basis of extraneous variables? (3) Can the possibility that affective
measures are surreptitious measures of language proficiency be ruled out? To
obtain answers to the foregoing questions, the following study was designed.
Method
Subjects. In all, 182 foreign students at the Center for English as a Second Lan¬
guage (Southern Illinois University, Carbondale, Ill.) were tested as part of the
spring testing project in 1977. Owing to absenteeism and the voluntary nature of
participation in the attitude part of the study, between 45 and 101 students com¬
pleted relevant portions of the questionnaires, the oral interview, and the language
tests. There was some selectivity favoring the better students because the weaker
ones tended to complete fewer language tests and fewer attitude questions, but all
levels of CESL were represented. Practically all the subjects were males between the ages of 19 and 30, and the largest language backgrounds represented were
Arabic, Persian, and Spanish.
Dependent Measures. The Dictation score used in this study was the
composite (sum) of the three dictations discussed in greater detail by Bacheller
(Chap. 6, this volume); also see entry 5 in the Appendix to this volume. The
Grammar test was prepared for CESL by its Academic Director, Dr. Charles
Parish. It is given in its entirety as entry 22 in the Appendix. Other scores could
have been used, but the Dictation and Grammar tests were among the best
predictors of global proficiency as defined in the factor analytic studies of Scholz
et al. (Chap. 2, this volume), and they were among the tests completed by the
largest numbers of subjects, thus maximizing the number of valid cases for the
correlations and regression analyses reported below.
(1) Descriptive/demographic:
i. time in the U.S.;
ii. length of study of EFL before coming to the U.S.;
iii. language spoken by EFL teacher back home;
iv. highest educational level attained by either parent;
v. father's occupation;
vi. whether the subject had visited the U.S. or Britain before coming to
the U.S. to study;
vii. for how long;
viii. and for what purpose.
Data Analysis. The question was whether or not the predictor variables
would account for a significant (and/or substantial) portion of the variance in
either of the dependent measures of language proficiency. Hence a multiple re¬
gression technique was used. Each of the seven clusters of predictor variables was
regressed onto the Dictation and Grammar scores separately. Since there was no a
priori basis for positing a particular hierarchy among the predictor variables in any
set, each set was dealt with in a stepwise fashion, selecting the best predictor from
a given set, and then the next best (having partialed out the portion of variance
already accounted for in the dependent variable by the first selected predictor),
and so forth until all variables in a given set had been exhausted. Since this
procedure increases the possibility of chance relations in rough proportion to the
number of variables entered into any given regression equation, we imposed three stringent statistical constraints: first, to be considered a significant predictor of
the criterion (dependent variable), the predictor had to be significantly correlated
with the criterion at p < .05; second, the predictor had to enter the regression
equation at a significant F-ratio (p < .05); and third, the predictor had to
significantly increase the amount of variance accounted for in the dependent
variable (again at p < .05). The first constraint was suggested by Gardner (personal communication) and the second and third are taken from Kerlinger and Pedhazur (1973).
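The stepwise procedure with these three entry constraints can be sketched as follows. The sketch is in Python with the scipy and statsmodels libraries and is not the program actually used (the analyses were presumably run with SPSS; see Nie et al., 1975). Note that when a single predictor is entered at a time, the second and third constraints reduce to the same partial F test:

import statsmodels.api as sm
from scipy import stats

def forward_stepwise(X, y, alpha=0.05):
    # X: subjects-by-predictors array; y: criterion (Dictation or Grammar).
    selected, remaining = [], list(range(X.shape[1]))
    while remaining:
        best = None
        for j in remaining:
            r, p = stats.pearsonr(X[:, j], y)
            if p >= alpha:          # constraint 1: significant zero-order r
                continue
            fit = sm.OLS(y, sm.add_constant(X[:, selected + [j]])).fit()
            # Constraints 2 and 3: the entering predictor's partial F (the
            # square of its t ratio) must be significant, which is the same
            # as a significant increase in variance accounted for.
            if fit.pvalues[-1] < alpha and (best is None or fit.rsquared > best[1]):
                best = (j, fit.rsquared)
        if best is None:
            break
        selected.append(best[0])
        remaining.remove(best[0])
    return selected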
Results
We will report the results of the regression analyses in the order of the question
types as presented under Predictor Variables above. Only one of the descriptive/
demographic variables proved to be a significant predictor of the Dictation scores.
It was time spent in the United States. It entered the regression equation at an F-
ratio of 12.42 with 1,47 degrees of freedom. The raw correlation between this
variable and the Dictation score was .46, which proved to be the strongest for any
of the 108 correlations computed between the independent and dependent
variables. None of the predictor variables of the descriptive/demographic type,
however, was significantly correlated with the Grammar score. It is interesting to
note that although the amount of time spent in the United States before the testing
seemed to improve scores on the Dictation, it had no effect on scores on the
Grammar test. Also, it may be worth noting that even though the amount of time
spent in the United States predicted more of the variance in the Dictation than any
other predictor accounted for in either of the dependent variables, the amount of
explained variance is not great (.46² = 21%).
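The arithmetic behind that figure is simply the squared correlation; in Python:

r = 0.46                    # time in the U.S. x Dictation, as reported above
print(round(100 * r ** 2))  # 21, i.e., about 21% of the variance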
Among the variables classed under Attitudes toward Instruction, only one
accounted for a significant amount of variance in the two dependent variables,
namely, the extent to which subjects considered the language tests to be difficult. The latter variable entered the regression equation as a predictor of the Dictation score at an F-ratio of 7.51 with 1,43 degrees of freedom and correlated at -.38
with that score. It entered the regression equation with the Grammar score as
dependent variable at an F-ratio of 12.34 with 1,64 degrees of freedom and corre¬
lated at -.40 with the latter variable. From these facts we may conclude that
subjects who thought the tests were difficult tended to do more poorly than those
who thought that the tests were easy.
Of the variables classed under reported use of the target language, only the
tendency to think or dream in English correlated significantly with either of the
dependent variables. In fact, this variable correlated only with the Dictation score
and inversely at that. It entered the regression equation at an F-ratio of 4.92 with
1,49 degrees of freedom and correlated at .30 with the criterion. This correlation
is problematic, however, because it should be negative if we assume that a greater
tendency to think or dream in English is indicative of greater skill in the language.
In fact, the variable was coded such that a high score indicates a lesser tendency to
think or dream in English. We will return to this problem below.
What about attitudes toward Americans? None of the predictor variables was
significantly correlated with the dependent variables. Although the predictor
“truthful” barely failed to achieve a significant positive relation with the dictation
and the variable “technologically retarded” barely failed to achieve a significant
negative relation with the Grammar score, contrary to expectations, no clear
interpretable pattern emerged. Attitudes toward self were also disappointing in
this regard. None of the predictor variables was significantly correlated with either
of the dependent criteria.
Among the nine reasons for studying English, one correlated with the Dicta¬
tion score, and a different one with the Grammar score. The desire to understand
the American people and their way of life was inversely related to the Dictation
score (F = 6.71, df = 1,45). The correlation was .36 but should have been
negative to support the usual interpretation of the integrative motive because the
scale was scored high for disagreement and low for agreement. The variable
intended to measure the subject’s desire to stay in the United States was positively
correlated with the Grammar score at .24 (F = 4.06, df = 1,64).
The last three variables included in the regression analyses were extraneous ones—presumably unrelated to language proficiency. However, contrary to popular theories, the extraneous variables proved to be as strongly related to the language proficiency criteria as any of the other predictor variables.
Discussion
It is immediately apparent that the theoretical views concerning integrative and
instrumental motives for learning second languages do not serve very well as
explanations of the data reviewed above. In no case does an attitudinal or affective
variable account for more than .16 of the variance in a criterion variable, and
surprisingly, one of the extraneous variables accounts for as much variance as any
other single affective variable or set of variables. An exceedingly difficult finding
for the popular theories to explain is the fact that the degree of integrativeness of
subjects is inconsistently related to scores on the language proficiency tests. In
one case it appears to be negatively related—e.g., the fact that the desire to behave
as Americans do is negatively correlated with the Dictation score.
What possible explanation can be offered that is not inconsistent with some of
the data? To suggest that this group of subjects becomes less integrative as they
become more proficient in the language fails as an explanation of the data. For one
thing, it fails to explain the positive relation between desire to stay in the United
States and the Grammar score.
Of the three hypotheses considered at the outset, only one is consistent with
the data. Unfortunately, it is the least attractive of the three hypotheses and it
encourages far-reaching skepticism for the future of attitude measurement. It is
the view that the nonrandom variance in attitude measures may have little to do
with actual attitudes and motivations. It may be largely the result of extraneous
nonrandom sources such as response set, the approval motive, and self-flattery
(see Oller and Perkins, 1978, Chap. 5). Certainly if a person does not understand
the question, he cannot give the desired response (or the one that he perceives to
be the desired response) or the self-flattering response (the one that will make him
look good in his own eyes and the eyes of others); neither can he give a consistent
response (one that merely keeps him from contradicting himself). Hence the
ability to give consistent, socially appropriate, and self-flattering responses hinges
on the ability both to understand the language of the questions and to infer the
socially appropriate, flattering, and consistent response. Thus modest correla¬
tions with language proficiency measures would be predicted independent of the
content of the questions. This interpretation is the only one that can explain the
fact that the extraneous questions correlated as strongly with the language profi¬
ciency criteria as did any of the other predictor variables.
It would seem that attitude theorists will have to find better measures, or a
different basis for testing their theories. Substantial evidence to date suggests that
the extant questionnaire techniques, regardless of the language of presentation (whether in the native or target language) and regardless of the format (oral or written), may be inadequate to test the theories they are constructed to evaluate.
Note
1. This paper was originally presented under the title Four Clusters of Learner Variables in Relation
to Attained Proficiency in ESL at the TESOL Convention in Miami, Fla., Apr. 29, 1977. We are grateful for the comments of Paul Holtzman and others who reacted to the paper at the meeting. The present
version is much revised and more complete than that report, however.
Chapter 24
The affective domain has turned out to be a Pandora’s box for second
language acquisition researchers. The more research that is done, the more
complex the relation between affect and second language acquisition seems to
become. Study follows study, but there is little agreement on how basic constructs
are to be conceptualized or measured. Perhaps a large part of the problem is that
the social sciences may not be amenable to the same methodological procedures as
the hard sciences (Cicourel, 1964). In any case, thoroughly adequate instruments
have yet to be designed and a highly elegant model relating attitudes and
motivations of second language learners to attained language proficiency has yet to
be developed.
The central research issue discussed here is the notion of integrative and
instrumental motivations. The first of these has been variously defined in terms of
the social distance of the second language learner from the target culture, the
learner’s desire to become a member of the target culture, and the ego-
Method
There were also one or two speakers each from a variety of other languages, includ¬
ing Greek, Turkish, and Vietnamese.
Materials. A motivational and attitudinal measure constructed by Gardner
and Lambert (1972, p. 148) is given as Appendix 24A. It contains reasons
(according to Gardner and Lambert) often given by students in the Louisiana and
Maine communities for studying French. The measure actually used in this study,
however, is given in the Appendix to this volume as the last nine questions of entry
23. The questions used have been slightly modified to fit more closely the social
situation of the ESL learners studied. For instance, “French people” and “French-speaking” were changed to “American people” and “English-speaking”; “to finish high school” was changed to “to fulfill my educational goals”; and “it will allow me to meet and converse with more and varied people” was changed to a maximally integrative type of reason, “it will enable me to marry an American.”
Question 9 concerning desire to stay in the United States was added as a further
indicator of integrative or instrumental motivation (Schumann, 1975b, p. 74).
This modified Gardner and Lambert measure was translated into the native
language of the subjects—Spanish, Persian, Arabic, and Japanese. This was done
in an attempt to minimize variance attributable directly to ESL proficiency (Oller and Perkins, 1978, Chap. 5). A questionnaire in English was also provided for the
few subjects who spoke languages other than Spanish, Persian, Arabic, and
Japanese.
An oral interview, roughly in the FSI format (see Chap. 7), was used to obtain a
measure of redundancy in the subject’s interlanguage according to the procedure
described below.
Procedure. The modified Gardner and Lambert questionnaire (Appendix
entry 23, last nine questions) was handed out to all students enrolled at CESL
during April 1977 (182 in all). The questionnaire in the five languages was
attached to the back of another attitude questionnaire which was part of a separate
study (see Chap. 23). The students were asked to take the questionnaire home, fill
it out, and return it. Seventy-two students returned the questionnaire. It was scored according to the scale which appears at the bottom of Table 24-1; a value of 1 was assigned for “strongly agree” and 7 for “strongly disagree.”
A redundancy measure was also obtained for each subject by scoring the
taped interviews. Five obligatory functors whose semantic function in English is
largely redundant were counted:
1. the plural morpheme, as in cats, shoes, or boxes (/s/, /z/, or /əz/);
2. the copula, as in He is sick;
3. the copula in progressive constructions, such as He is going (is + -ing);
4. the possessive inflection, as in John's bike; and
5. the third person singular habitual present, as in She sends him money every month (Brown, 1973, pp. 253-254).
Each interview was played for approximately five minutes. Each time the subject
produced one of the functors, a variable a was incremented by one point. Each
time a functor was required but was not produced, a variable p was incremented by one point. Then RI was computed as follows:

RI = a / (a + p)

Thus the redundancy score for each subject is equal to the number of correct usages of the above-listed functors divided by the number of obligatory occasions.

Table 24-1

Reasons for studying English                                          Mean   SD   Redundancy index
*1. It will help me to understand the American people and
    their way of life.                                                 3.2   1.4    .32
2.  I think it will someday be useful in getting a job.                3.0   1.5    .11
3.  It will enable me to gain good friends more easily among
    English-speaking people.                                           2.8   1.1    .10
4.  One needs a good knowledge of at least one foreign
    language to merit social recognition.                              3.6   1.9   -.21
5.  It should enable me to begin to think and behave as
    Americans do.                                                      5.1   1.5    .11
6.  I feel that no one is really educated unless he is fluent
    in the English language.                                           5.8   1.4   -.14
7.  It will enable me to marry an American.                            5.6   1.5    .00
8.  I need English in order to fulfill my educational goals.           2.4   1.4    .23
9.  If you could stay in the United States for as long as you
    wanted after you finish your studies, how long would you
    stay? (A. Leave as soon as my studies were finished;
    B. Stay 3 months; C. Stay 6 months; D. Stay 1 year;
    E. Stay 5 years; F. Stay permanently, i.e., immigrate.)                  1.7    .15

Scale: 1 = strongly agree, 2 = agree . . . 6 = disagree, 7 = strongly disagree.
Table 24-3 Rotated Varimax Factor Solution over Integrative and Instrumental Reasons for Studying English and Desire to Stay in the United States

*Factor loadings of .50 or less are not listed in the table but are entered into the computation of percentages of explained variance and the eigenvalue.

No significant correlations between the integrative and instrumental measures and the language tests were observed (p > .05). Moreover, on five of the integrative correlations and four of the instrumental correlations, the tendency is opposite to the one expected.
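A rotated solution of the kind summarized in Table 24-3 can be recomputed along the following lines (a sketch assuming the scikit-learn library; the response matrix here is randomly generated for illustration, not the authors' data):

import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
# 72 returned questionnaires by 9 reasons, each rated on the 7-point scale
responses = rng.integers(1, 8, size=(72, 9)).astype(float)

fa = FactorAnalysis(n_components=2, rotation='varimax').fit(responses)
loadings = fa.components_.T                    # one row of loadings per reason
loadings[np.abs(loadings) <= 0.50] = np.nan    # suppress loadings of .50 or less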
The redundancy index (Table 24-5) is significantly correlated with all the
language tests. In addition, several of the questions from the larger questionnaire
discussed by Oiler, Perkins, and Murakami (Chap. 23) correlated significantly
with the redundancy index. As we might have expected, the harder the student
thought the language tests were, the lower was the score on the redundancy index
(r = .38). The larger the percent of leisure time the students said they spent speak¬
ing or listening to English, the higher was the score on the redundancy index
(r = .34). The more technically and scientifically advanced the students felt
Americans to be, the higher the score on the redundancy index (r = -.46 and -.35, respectively).
None of the correlations of the integrative and instrumental measures from the modified Gardner and Lambert questionnaire with the several language tests achieved significance (p > .05). The redundancy index, however, showed significant correlations in all instances with the same tests.
Several alternative conclusions are possible from our study: (1) Attitude and
motivation measures of the type developed by Gardner and Lambert may not be
sensitive enough indicators of learners’ feelings toward the target culture to be
good predictors of attained language proficiency. At least for this type of subject
population, this conclusion cannot be excluded by the evidence we have
presented above. (2) The redundancy index appears to be significantly correlated
with attained language proficiency, but it cannot be linked with factors related to
the affective domain on the basis of present results. Further study is needed.
(3) The relationship between affect and cognition may be more complex than such
constructs as integrativeness and instrumentality would indicate; i.e., the con¬
structs themselves may need revision. A possible conceptual revision that we
believe deserves consideration is that there may exist an “intrinsic” kind of motivation. If this is so, an activity such as second language learning might be undertaken for its own sake, independent of such “extrinsic” motivations as integrativeness or instrumentality (Conway, 1975, p. 259). (4) The methods of
empirical science are insufficient to capture the interrelationships that are
displayed by everyday situations. Whereas science assumes discreteness, perma¬
nence, and stability as features of a real world, the reality of everyday life in which language learning takes place may not be so constituted (Mehan and Wood, 1975,
p. 64). Maureen Concannon-O’Brien (1977) addresses this possibility when she
observes that “psychologists may have to accept what is obvious to the teacher—
that the structures of no two personalities are the same, and that each individual
has a different set of habits, drives, needs, and impulses” (p. 196).
In the final analysis, whatever conclusions our results suggest, we still believe
that an awareness of learners’ feelings is an important factor in effective language
teaching, regardless of whether or not these feelings can be conceptualized and
measured.
Appendix 24A
Gardner and Lambert Motivational Measure
(1972, p. 148)
a. It will help me to understand better the French people and their way of life.
b. It will enable me to gain good friends more easily among French-speaking
people.
c. It should enable me to begin to think and behave as the French do.
d. It will allow me to meet and converse with more and varied people.
The rating scale below each question had the following form:
Discussion Questions

1. Clarke’s study (Chap. 21) seems to indicate that language aptitude is only
one predictor of achievement in language classes. Compile a list of other
factors that you think might be significant predictors of achievement in lan¬
guage classes. How could the factors in your list be measured?
2. Clarke showed (Table 21-1) that the German elective group (N = 25) had a
mean of 71.56 on the MLAT total while the German required group (N = 44)
had a mean of 59.91. There is a significant difference between the means:
t = 3.27 (df = 67), p < .001. Does this finding lend support for the motiva¬
tion hypothesis: students will learn more if they want to take the language
class? What other selective factors might enter in?
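For question 2, the t statistic can be recomputed from summary statistics alone; the sketch below uses scipy, with invented standard deviations, since Clarke's are not reproduced in the question:

from scipy.stats import ttest_ind_from_stats

# Means and Ns come from the question; the standard deviations are placeholders.
t, p = ttest_ind_from_stats(mean1=71.56, std1=14.0, nobs1=25,
                            mean2=59.91, std2=14.0, nobs2=44)
print(t, p)   # df = 25 + 44 - 2 = 67, as stated in the question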
5. Clarke suggested that the MLAT by itself would be a rather weak predictor
of foreign language achievement. What do you consider to be some valid,
reliable predictors of attained language proficiency to be used in conjunc¬
tion with the MLAT?
6. Clarke indicated that the MLAT may measure rather well whatever is taught
in Latin courses. What does this imply for the validity of the MLAT? How
would one go about validating a test of this type?
7. According to Clarke’s Table 21-2 for the Japanese students, the total MLAT
score is a better predictor of Grammar scores (r = .74) than Part IV of the
MLAT, which purports to measure sensitivity to grammar (r = .50). For the
spring 1977 Japanese, Spelling Clues, Part III of the MLAT was the best
predictor of Achievement scores (r = .58), yet for the fall 1976 Japanese, Part V, Paired Associates, was the best predictor (r = .56). For the German
groups, Part IV was the best predictor of Achievement and Grammar (.48,
.34, and .33). What conclusions can you draw about the construct validity of
the MLAT subtests? (cf. Chaps. 1 and 13.)
9. Why would academic status at the time of testing correlate very highly with
test performance for Murakami’s subjects?
11. Murakami found that length of stay in the United States correlated signifi¬
cantly with listening comprehension. Can we say that length of stay can be
equated with a practice effect?
12. Murakami does not specifically discuss the reliability and/or validity of his
elicitation instruments. Can they be inferred from any of the data he does
present? How? (See other chapters in this volume which use similar tech¬
niques.)
13. Oller, Perkins, and Murakami (Chap. 23) suggested that “the popular measures of attitudes and motivations may themselves be surreptitious measures of language proficiency.” Discuss the feasibility of developing an elicitation
instrument for attitudes and motivations that is not unintentionally a test of
language proficiency and/or intelligence.
14. What statistical procedures could one use to extract the extraneous non¬
random variance in self-reported attitude and motivation data? Consider the
quantity of the reliable variance which may be attributable to such factors.
15. Given the wide range of discussion and conflicting data on instrumental and
integrative orientation, one can conclude that the role of affective variables
and aptitude factors in language acquisition has not been settled. Consider
some research questions which Chap. 23 suggests for affective and aptitude
factors in language proficiency. What measures need to be developed or vali¬
dated? What kind of research designs need to be utilized to carry out this
research?
16. Oller, Perkins, and Murakami stated that the ability to give consistent,
socially acceptable, and self-flattering responses hinges on the ability to
understand the language of the questions and to infer the appropriate
response. Can one conclude that these abilities are acquired? If so, are
these abilities examples of rule-governed behavior? Can they logically be
distinguished from language ability?
17. Johnson and Krug (Chap. 24) concluded that attitude and motivation
measures of the type developed by Gardner and Lambert may not be sensi¬
tive enough to be good predictors of attained language proficiency. What
measures would be sensitive enough? How could one test the sensitivity of
proposed instruments?
18. Attitude and motivation studies have implicitly treated the integrative and
instrumental orientation constructs as static entities. Do you think these
constructs might be variable? Is that why they are so elusive and hard to mea¬
sure? Along this line, discuss the claim of Cicourel (1964) that this field may
not lend itself to measurement.
19. Johnson and Krug found that none of the correlations between the integrative
and instrumental measures and the language tests achieved significance. On
the other hand, correlations between integrative and instrumental motives
did tend to be significant. What plausible explanation(s) can be offered for
the raw correlation of .43 between reasons 2 and 3 (Tables 24-1 and 24-2)?
What about the correlation of .43 between reasons 5 and 6, another integrative and instrumental pair? Shouldn’t one expect “integrative” and
“instrumental” motivational factors rather than several factors which are
both “integrative” and “instrumental”?
References
Aborn, M., H. Rubenstein, and T. D. Sterling. 1959. Sources of contextual constraints upon words in
sentences. Journal of Experimental Psychology 57. 171-180.
Adorno, T. W., Else Frenkel-Brunswik, D. J. Levinson, and R. N. Sanford. 1950. The Authoritarian
Personality. New York: Harper.
Aiken, Lewis, Jr. 1971. Psychological and Educational Testing. Boston: Allyn & Bacon.
Akmajian, A. 1969. On deriving cleft sentences from pseudocleft sentences. Unpublished manuscript,
MIT.
Alexander, L. G. 1974. A First Book in Comprehension, Precis, and Composition. Rowley, Mass.:
Longman-Newbury House.
Allen, J. P. B., and A. Davies (eds.). 1977. Testing and Experimental Methods. London: Oxford Uni¬
versity Press.
American Language Institute. 1962. Oral Rating Form. Washington, D.C.: Georgetown University
Press.
Anisfeld, M., and W. E. Lambert. 1961. Social and psychological variables in learning Hebrew.
Journal of Abnormal and Social Psychology 63. 524-529. Also in Gardner and Lambert (1972),
217-227.
-. 1964. Evaluational reactions of bilingual and monolingual children to spoken languages.
Journal of Abnormal and Social Psychology 69. 89-97.
Asher, J. J. 1969. The total physical response approach to second language learning. Modern Language
Journal 53. 3-17.
Asher, J. J., J. A. Kusudo, and R. De La Torre. 1974. Learning a second language through commands:
the second field test. Modern Language Journal 58. 24-32.
Austin, J. L. 1962. How to Do Things with Words. New York: Oxford University Press.
Baird, A., G. Broughton, D. Cartwright, and G. Roberts. 1972. Success with English: A First Reader.
Baltimore, Md.: Penguin Books.
Bezanson, K. A., and N. Hawkes. 1976. Bilingual reading skills of primary school children in Ghana.
Working Papers on Bilingualism 11. 44-73.
Bondaruk, J., J. Child, and E. Tetrault. 1975. Contextual testing. In Jones and Spolsky (1975), 89-
104.
Brennan, E. M., E. B. Ryan, and W. E. Dawson. 1975. Scaling of apparent accentedness by magnitude
estimation and sensory modality matching. Journal of Psycholinguistic Research 4. 27-36.
Brodkey, D., and H. Shore. 1976. Student personality and success in an English language program.
Brown, Roger. 1973. A First Language. Cambridge, Mass.: Harvard University Press.
Burt, M. K. 1975. Error analysis in the adult EFL classroom. TESOL Quarterly 9. 1, 53-56.
Carroll, J. B. 1958. A factor analysis of two foreign language aptitude batteries. Journal of General Psychology 59. 3-19.
-. 1967. Foreign language proficiency levels attained by language majors near graduation from
college. Foreign Language Annals 1. 131-151.
Carroll, J. B. 1972. Defining language comprehension: some speculations. Paper presented at the
research workshop on language comprehension and the acquisition of knowledge. Durham, N.C.,
Mar. 31-Apr. 3. Also in R. O. Freedle and J. B. Carroll (eds.), Language Comprehension and the
Acquisition of Knowledge. New York: Wiley, 1-29.
Carroll, J. B., and S. Sapon. 1959. Modern Language Aptitude Test Manual. New York: The Psycho-
logical Corporation.
Cartier, F. A. 1968. Criterion-referenced testing of language skills. TESOL Quarterly 2. Reprinted in
Palmer and Spolsky (1975), 19-24.
Chafe, Wallace. 1972. Discourse structure and human knowledge. In Roy O. Freedle and John B.
Carroll (eds.), Language Comprehension and the Acquisition of Knowledge. New York: Winston,
41-69.
Chase, C. I. 1972. Test of English as a foreign language. A review in O. K. Buros (ed.), Seventh Mental
Measurements Yearbook. Highland Park, N.J.: The Gryphon Press.
Chastain, Kenneth. 1975. Affective and ability factors in second language acquisition. Language
Learning 25. 153-161.
Christie, R., and F. Geis. 1970. Studies in Machiavellianism. New York: Academic Press.
Cicourel, Aaron. 1964. Method and Measurement in Sociology. New York: Free Press.
Clark, J. L. D. 1972. Foreign Language Testing: Theory and Practice. Philadelphia, Pa.:
Center for Curriculum Development.
-. 1975. Theoretical and technical considerations in oral proficiency testing. In Jones and
Spolsky (1975), 10-24.
Concannon-O’Brien, M. 1977. Motivation: an historical perspective. In M. K. Burt. H. Dulay, and M.
Finocchiaro (eds.), Viewpoints on English as a Second Language. New York: Regents, 185-197.
Conway, Patrick. 1975. Volitional competence and the process curriculum of the ANISA model. In
P. Conway (ed.), Development of Volitional Competence. New York: MSS Information Corpora¬
tion.
Cooper, Carl J. 1964. Some relationships between paired-associates learning and foreign-language
aptitude. Journal of Educational Psychology 55. 132-138.
Coulthard, Malcolm. 1975. Discourse analysis in English: a short review of the literature. Language
Teaching and Linguistics Abstracts 8. 73-89.
Cranney, A. Garr. 1973. The construction of two types of cloze reading tests for college students.
Journal of Reading Behavior 5. 60-64.
Crowne, D. P., and D. Marlow. 1964. The Approval Motive. New York: Wiley.
Darnell, D. 1970. The development of an English language proficiency test of foreign students using
the clozentropy procedure. Speech Monographs 37. 36-46.
Davies, A. 1977. The construction of language tests. In J.P.B. Allen and A. Davies (eds.), Testing and
Experimental Methods. London: Oxford University Press.
Davis, F. B. 1944. Fundamental factors of comprehension in reading. Psychometrika 9. 185-197.
Donaldson, Weber D., Jr. 1971. Code-cognition approaches to language learning. In Robert C. Lugton
(ed.), Towards a Cognitive Approach to Second Language Acquisition. Philadelphia: Center for
Curriculum Development.
Dulay, H. C., and M. K. Burt. 1974. Errors and strategies in child second language acquisition. TESOL
Quarterly 8. 129-136.
Endicott, A. L. 1973. A proposed scale of syntactic density. Research in the Teaching of English 7. 5-
12.
Enkvist, N. 1973. Should we count errors or measure success? In Jan Svartvik (ed.), Errata. Papers in
Error Analysis. Lund, Sweden: CWK Gleerup.
Ervin-Tripp, Susan M. 1970. Structure and process in language acquisition. Georgetown University
Monograph. 21st Annual Round Table No. 23.
Flahive, Douglas. 1977. The reading proficiency of ESL/EFL learners. Paper presented at the Spring
1977 KICL meeting, Eastern Kentucky University, Richmond, Ky.
Flesch, Rudolf. 1948. A new readability yardstick. Journal of Applied Psychology 32. 221-233.
Frederiksen, C. H. 1975a. Effects of context induced processing operations in semantic information
acquired from discourse. Cognitive Psychology 7. 139-166.
Frederiksen, C.H. 1975b. Representing logical and semantic structure of knowledge acquired from
discourse. Cognitive Psychology 7. 431-458.
Fry, Edward. 1968. A readability formula that saves time. Journal of Reading 11. 513-516, 575-578.
Galvan, J.. J. A. Pierce, and G. N. Underwood. 1975. The relevance of selected educational variables
of teachers to their attitudes toward Mexican-American English. Paper presented at the 1975
annual meeting of the Linguistics Association of the Southwest in San Antonio, Tex.
Gardner, R. C. 1975. Social factors in second language acquisition and bilinguality. Paper presented at
the invitation of the Canada Council’s Consultative Committee on the Individual, Language,
and Society at a conference in Kingston, Ontario, November, December 1975.
Gardner, R. C., and W. E. Lambert. 1959. Motivational variables in second language acquisition.
Canadian Journal of Psychology 13. 24-44.
-. 1972. Attitudes and Motivation in Second Language Learning. Rowley, Mass.: Newbury
House.
Gardner, R. C., P. C. Smythe, R. Clement, and L. Gliksman. 1976. Second language learning: a social psychological perspective. Canadian Modern Language Review 32. 198-213.
Genesee, F. 1976. The role of intelligence in second language learning. Language Learning 26. 267-
280.
Giles, H. 1970. Evaluative reactions to accents. Educational Review 22. 211-227.
-. 1972. The effect of stimulus mildness-broadness in the evaluation of accent. Language and
Speech 15. 262-266.
Giles, H., and P. F. Powesland. 1975. Speech Style and Social Evaluation. London: Academic Press.
Gorosch, M. 1973. Assessment intervariability in testing oral performance of adult students. In Jan
Svartvik (ed.), Errata: Papers in Error Analysis. Lund, Sweden: CWK Gleerup.
Gradman, H. L., and B. Spolsky. 1975. Reduced redundancy testing: a progress report. In Jones and
Spolsky (1975), 59-70.
Guiora, A. Z., M. Paluszny, B. Beit-Hallahmi, J. C. Catford, R. E. Cooley, and C. Y. Dull. 1975. Language and person: studies in language behavior. Language Learning 25. 43-62.
Gunnarsson, Bjarni. 1978. A look at the content similarities between intelligence, achievement, personality, and language tests. In Oller and Perkins (1978), 18-35.
Haggard, L. A., and R. S. Isaacs. 1966. Micro-momentary expressions as indicators of ego mechanisms in psychotherapy. In L. A. Gottschalk and A. H. Auerbach (eds.), Methods of Research in Psychotherapy. New York: Appleton-Century-Crofts, 154-165.
Hardison, O. B., Jr. 1966. Practical Rhetoric. Chapter by M. Twain, The Mississippi River: A Symbol
of Freedom. New York: Appleton-Century-Crofts, 111-113.
Harms, L. S. 1963. Status cues in speech: extra-race and extra-region identification. Lingua 12. 300-
306.
Harris, D. P. 1969. Testing English as a Second Language. New York: McGraw-Hill.
Harris, D. P., and L. A. Palmer. 1970. CELT Technical Manual: Preliminary Edition. New York: McGraw-Hill.
Heaton, J. B. 1975. Writing English Language Tests. London: Longman.
Hinofotis, F. B. 1976. An investigation of the concurrent validity of cloze testing as a measure of over¬
all proficiency in English as a second language. Unpublished doctoral dissertation. Carbondale,
Ill.: Southern Illinois University.
Hisama, K. 1976. Design and empirical validation of the cloze procedure for measuring language pro¬
ficiency of non-native speakers. Unpublished doctoral dissertation. Carbondale, Ill.: Southern
Illinois University.
-. 1977a. A new direction in measuring proficiency in English as a second language. Paper presented at the annual meeting of the American Educational Research Association, New York.
-. 1977b. Predictive validity of short-form placement tests under two scoring systems. Paper
presented at the National Council on Measurement in Education, New York.
Hornby, P. A. 1974. Surface structure and presupposition. Journal of Verbal Learning and Verbal Behavior 13.
Hutchinson, L. G. 1971. Presupposition and belief-inferences. In Papers from the Seventh Regional
Meeting, Chicago Linguistic Society. Chicago: Chicago Linguistic Society, 134-141.
Indochinese Refugee Education Guides, No. 11, 1975. Arlington, Va.: National Indochinese Clearing-
house, Center for Applied Linguistics.
Iowa Silent Reading Tests. Level 2, Form E. 1973. New York: Harcourt Brace Jovanovich.
Irvine, Patricia, Parvin Atai, and John W. Oller, Jr. 1974. Cloze, dictation, and the Test of English as a
Foreign Language. Language Learning 24. 245-252.
Jakobovits, L. A. 1970. Foreign Language Learning: A Psycholinguistic Analysis of the Issues. Rowley,
Mass.: Newbury House.
Jensen, Arthur R. 1972. The nature of intelligence. In G. H. Bracht, Kenneth D. Hopkins, and Julian
C. Stanley (eds.), Perspectives in Educational and Psychological Measurement. Englewood Cliffs,
N.J.: Prentice-Hall, 191-213. Reprinted from Harvard Educational Review (1969), 39. 5-28.
Johansson, S. 1973a. An evaluation of the noise test. IRAL 11. 107-133.
-. 1973 b. Partial dictation as a test of foreign language proficiency. Swedish-English Contrastive
Studies, Report No. 3, Department of English, Lund University, Sweden.
-. 1975. Papers in contrastive linguistics and language testing. Lund Studies in English. Lund
University, Sweden.
Johnson, D. 1977. The TOEFL and domestic students: conclusively inappropriate. TESOL Quarterly
11. 79-86.
Jones, E. E., and F. Kohler. 1958. The effects of plausibility on the learning of controversial statements. Journal of Abnormal and Social Psychology 58. 315-320.
Jones, R. L., and B. Spolsky (eds.). 1975. Testing Language Proficiency. Arlington, Va.: Center for
Applied Linguistics.
Jonz, J. 1976. Improving on the basic egg: the M-C cloze. Language Learning 26. 255-265.
Karttunen, L. 1971. Implicative verbs. Language 47. 340-358.
-. 1974. Presupposition and linguistic context. Theoretical Linguistics 1. 181-194.
-. 1975. On pragmatic and semantic aspects of meaning. Texas Linguistic Forum 1.
Kerlinger, F. N. 1973. Foundations of Behavioral Research, 2d ed. New York: Holt, Rinehart and Winston.
Kerlinger, F. N., and E. J. Pedhazur. 1973. Multiple Regression in Behavioral Research. New York: Holt, Rinehart and Winston.
Krashen, Stephen. 1977. Language testing: current research. Paper presented at the ACTFL meeting
in San Francisco, November.
Krzyzanowski, H. 1976. Cloze tests as indicators of general language proficiency. Studia Anglica
Posnaniensia 7. 29-43.
Labov, W. 1966. The Social Stratification of English in New York City. Washington, D.C.: Center for
Applied Linguistics.
Lado, R. 1957. Linguistics across Cultures: Applied Linguistics for Language Teachers. Ann Arbor:
University of Michigan Press.
-. 1961. Language Testing: The Construction and Use of Foreign Language Tests. New York:
McGraw-Hill.
Lambert, W. E., R. C. Hodgson, R. C. Gardner, and S. Fillenbaum. 1960. Evaluational reactions to
spoken languages. Journal of Abnormal and Social Psychology 60. 44-51.
Lambert, W. E., R. C. Gardner, H. Barik, and K. Tunstall. 1962. Attitudinal and cognitive aspects of
intensive study of a second language. Journal of Abnormal and Social Psychology 66. 358-368.
Reprinted in Gardner and Lambert (1972), 288-245.
Lee, Laura L., and Susan M. Canter. 1971. Developmental sentence scoring: a clinical procedure for
estimating syntactic development in children’s spontaneous speech. Journal of Speech and Hear¬
ing Disorders 36. 315-340.
Lett, John. 1977. Assessing attitudinal outcomes. In June K. Phillips (ed.), The Language Connec¬
tion: From the Classroom to the World. ACTFL Foreign Language Education Series 9.
Levenston, E. A. 1975. Aspects of testing the oral proficiency of adult immigrants to Canada. In Palmer and Spolsky (1975), 67-74.
Liebert, R. M., and M. D. Spiegler. 1974. Personality: Strategies for the Study of Man. Rev. ed. Home-
wood, Ill.: Dorsey Press.
Lorge, Irving, Robert L. Thorndike, and Elizabeth Hagen. 1964. The Lorge-Thorndike Intelligence
Tests. Boston: Houghton Mifflin.
Lukmani, Yasmeen. 1972. Motivation to learn and language proficiency. Language Learning 22. 261-
274.
Manis, Melvin, and Robyn M. Dawes. 1961. Cloze score as a function of attitude. Psychological Reports
9. 79-84.
McLaughlin, G. H. 1969. SMOG grading: a new readability formula. Journal of Reading 12.639-645.
Mehan, H., and H. Wood. 1975. The Reality of Ethnomethodology. New York: Wiley-Interscience.
Mehrens, William A., and Irvin J. Lehmann. 1969. Standardized Tests in Education. New York: Holt,
Rinehart and Winston.
Miller, G. A. 1956. The perception of speech. In M. Halle (ed.), For Roman Jakobson. The Hague:
Mouton.
Miller, G. A., G. A. Heise, and W. Lichten. 1951. The intelligibility of speech as a function of the con¬
text of the test materials. Journal of Experimental Psychology 41. 81-97.
Miller, G. A., and P. N. Johnson-Laird. 1976. Language and Perception. Cambridge, Mass.: Belknap
Press of Harvard University Press.
Muraki, M. 1970. Presupposition and pseudoclefting. In Papers from the Sixth Regional Meeting,
Chicago Linguistic Society. Chicago: Chicago Linguistic Society, 390-399.
Naiman, N. 1974. The use of elicited imitation in second language acquisition research. Working
Papers on Bilingualism 2. 1-37.
Nelson, M. J., and E. C. Denny. 1973. The Nelson-Denny Reading Test. Boston: Houghton Mifflin.
Nguyen Dang Liem. 1967. A Contrastive Analysis of English and Vietnamese. Canberra: Australian
National University.
Nie, N. H., C. H. Hull, J. G. Jenkins, K. Steinbrenner, and D. H. Bent. 1975. SPSS: Statistical Package for the Social Sciences. New York: McGraw-Hill.
Nunnally, Jum C. 1964. Educational Measurement and Evaluation. New York: McGraw-Hill.
-. 1967. Psychometric Theory. New York: McGraw-Hill.
-. 1975. Introduction to Statistics for Psychology and Education. New York: McGraw-Hill.
O’Donnell, Roy C., W. J. Griffin, and R. C. Norris. 1967. Syntax of kindergarten and elementary school
children: a transformational analysis. NCTE Research Report No. 8. Champaign-Urbana, Ill.:
National Council of Teachers of English.
Oller, John W., Jr. 1972. Contrastive analysis, difficulty, and predictability. Foreign Language Annals
6. 95-106.
-. 1973a. Cloze tests of second language proficiency and what they measure. Language Learning
23. 105-118.
-. 1973b. Discrete-point tests versus tests of integrative skills. In Oller and Richards (1973),
184-199.
-. 1974. Expectancy for successive elements: key ingredient to language use. Foreign Language
Annals 7. 443-452.
-. 1976. Interlanguage and fossilization. Paper presented at Modern Language Association Con-
vention, New York. (See also Rule fossilization: a tentative model, with N. Vigil. Language Learn¬
ing 26. 281-295.)
-. 1977. Affective variables in second language acquisition: how important are they? Paper pre-
sented at the NAFSA meeting in New Orleans, May 27, 1977. In Betty Wallace Robinett (ed.),
1976-77 Papers in ESL: Selected Conference Papers of ATESL. Washington, D.C.: NAFSA.
-. 1979. Language Tests at School: A Pragmatic Approach. London: Longman.
Oller, John W., Jr., and J. C. Richards (eds.). 1973. Focus on the Learner: Pragmatic Perspectives for the Language Teacher. Rowley, Mass.: Newbury House.
Oller, John W., Jr., and Virginia Streiff. 1975. Dictation: a test of grammar based expectancies. In
Jones and Spolsky (1975), 71-88. Also in English Language Teaching 30. 1975. 25-36.
Oller, John W., Jr., and F. B. Hinofotis. 1976. Two mutually exclusive hypotheses about second language ability: factor analytic studies of a variety of language tests. Unpublished paper delivered at
the Winter meeting of the Linguistic Society of America. Dec. 30, 1976. Also in this volume as
Chap. 1.
Oller, John W., Jr., Alan J. Hudson, and Phyllis Fei Liu. 1977. Attitudes and attained proficiency in ESL: a sociolinguistic study of native speakers of Chinese in the United States. Language Learning 27. 1-27.
Oller, John W., Jr., Lori Baca, and Fred Vigil. 1977. Attitudes and attained proficiency in ESL: a sociolinguistic study of Mexican-Americans in the Southwest. TESOL Quarterly 11. 173-183.
Oller, John W., Jr., and K. Perkins. 1978. Language in Education: Testing the Tests. Rowley, Mass.:
Newbury House.
Olsson, M. 1973. The effect of different types of errors in the communication situation. In Svartvik
(1973).
Ortego, P. D. 1970. Some cultural implications of a Mexican-American border dialect of American
English. Studies in Linguistics 21. 77-84.
Palmer, L. A. 1973. A preliminary report on a study of the linguistic correlates of raters’ subjective
judgments of non-native speech. In Shuy and Fasold (1973), 41-59.
Palmer, L. A., and B. Spolsky (eds.). 1975. Papers on Language Testing 1967-1974. Washington,
D.C.: Teachers of English to Speakers of Other Languages.
Paulston, Christina Bratt, and Mary Newton Bruder. 1975. From Substitution to Substance: A Hand¬
book of Structural Pattern Drills. Rowley, Mass.: Newbury House.
Perkins, K. 1976. Hierarchies of syntactic complexity of adult ESL learners. In Robert St. Clair and
Beverly Hartford (eds.), LEKTOS: Interdisciplinary Working Papers in Language Sciences.
Louisville: University of Louisville.
Perren, G. E. 1968. Testing spoken language: some unsolved problems. In A. Davies (ed.), Language Testing Symposium: A Psycholinguistic Approach. London: Oxford University Press, 107-116.
Postovsky, Valerian A. 1974. Effects of delay in oral practice at the beginning of second language
learning. Modern Language Journal 58. 229-239.
-. 1976. Individual differences in acquisition of receptive and productive language skills. Paper
presented at the Kentucky Foreign Language Conference, Lexington, Ky.
Praninskas, J. 1959. Rapid Review of English Grammar. Englewood Cliffs, N.J.: Prentice-Hall.
Raven, J. C. 1960. Guide to the Standard Progressive Matrices. London: H.K. Lewis and Co. Ltd.
Raygor, Alton. 1970. McGraw-Hill Basic Skills System Test Manual. Del Monte Research Park,
Monterey: McGraw-Hill.
Richards, Jack C. 1970. A non-contrastive approach to error analysis. English Language Teaching 25. 115-135. Also in Oller and Richards (1973), 96-113.
-. 1971. Error analysis and second language strategies. Language Sciences 17. 12-22. Also in Oller and Richards (1973), 114-135.
-. 1974. Error Analysis: Perspectives on Second Language Acquisition. London: Longman.
Roscoe, John T. 1975. Fundamental Research Statistics for the Behavioral Sciences. 2d ed. New York:
Holt, Rinehart and Winston.
Ross, Janet. 1976. The habit of perception in foreign language learning: insights into error from
contrastive analysis. TESOL Quarterly 10. 169-175.
Ryan, E. B. 1973. Subjective reactions toward accented speech. In Shuy and Fasold (1973), 60-73.
Sattler, J. M. 1974. Assessment of Children’s Intelligence. Philadelphia: W.B. Saunders.
Savignon, Sandra. 1972. Communicative Competence: An Experiment in Foreign Language Teaching.
Montreal: Marcel Didier.
Saville-Troike, M. 1973. Reading and the audio-lingual method. TESOL Quarterly 7. 395-405.
Schumann, John. 1974. The implications of interlanguage, pidginization, creolization for the study of
adult second language learning. TESOL Quarterly 8, 2. 145-152.
_____. 1975a. Second language acquisition: the pidginization hypothesis. Unpublished doctoral
dissertation, Harvard University.
_____. 1975b. Affective factors and the problem of age in second language acquisition. Language
Learning 25. 209-235.
_____. 1976. Social distance as a factor in second language acquisition. Language Learning 26.
135-143.
Schumann, John, and Nancy Stenson. 1974. New Frontiers in Second Language Learning. Rowley,
Mass.: Newbury House.
Selinker, L. 1972. Interlanguage. International Review of Applied Linguistics 10, 3. 209-231. Also in
Schumann and Stenson (1974), 114-136.
Sellars, W. 1954. Presupposing. The Philosophical Review 63. 197-215.
Shuy, R. W., J. C. Baratz, and W. A. Wolfram. 1969. Sociolinguistic Factors in Speech Identification.
Research Project No. MH 1504801. Arlington, Va.: Center for Applied Linguistics.
Shuy, R. W., and R. W. Fasold (eds.). 1973. Language Attitudes: Current Trends and Prospects.
Washington, D.C.: Georgetown University Press.
Spolsky, Bernard. 1969. Attitudinal aspects of second language learning. Language Learning 19.
271-283.
Spolsky, Bernard, Penny Murphy, Wayne Holm, and Allen Ferrel. 1972. Three functional tests of oral
proficiency. TESOL Quarterly 6. 221-235. Also in Palmer and Spolsky (1975), 76-90.
State University of Iowa. 1964. The Iowa Tests of Basic Skills. Boston: Houghton Mifflin.
Stendahl, Christina. 1972. The relative proficiency in their native language and in English as shown
by Swedish students of English at university level. Projektet sprakfardighet: engelska (SPRENG),
Rapport 6. Engelska institutionen, Goteborgs Universitet.
Stevenson, D. 1974. Construct validity and the test of English as a foreign language. Unpublished
doctoral dissertation, University of New Mexico, Albuquerque, N.M.
Sticht, Thomas G. 1972. Learning by listening. In R. Freedle and J. B. Carroll (eds.), Language
Comprehension and the Acquisition of Knowledge. Washington, D.C.: V.H. Winston.
Strawson, P. F. 1952. Introduction to Logical Theory. New York: Wiley.
Streiff, Virginia. 1978. Relationships among oral and written cloze scores and achievement test scores
in a bilingual setting. In Oller and Perkins (1978), 65-100.
Strongman, K., and J. Woosley. 1967. Stereotyped reactions to regional accents. British Journal of
Social and Clinical Psychology 6. 164-167.
Stubbs, J. B., and G. R. Tucker. 1974. The cloze test as a measure of English proficiency. Modern
Language Journal 58. 239-241.
Stump, Thomas A. 1978. Cloze and dictation as predictors of intelligence and achievement scores. In
Oller and Perkins (1978), 36-63.
Svartvik, J. (ed.). 1973. Errata: Papers in Error Analysis. Lund, Sweden: CWK Gleerup.
Swain, M., G. Dumas, and N. Naiman. 1974. Alternatives to spontaneous speech: elicited imitation and
translation as indicators of second language competence. Working Papers in Bilingualism 3.
68-79.
Taylor, B. 1975. The use of overgeneralization and transfer learning strategies by elementary and
intermediate students in ESL. Language Learning 25. 73-108.
Taylor, W. L. 1953. Cloze procedure: a new tool for measuring readability. Journalism Quarterly 30.
415-433.
_____. 1954. Application of cloze and entropy measures to the study of contextual constraint in
samples of continuous prose. Unpublished doctoral dissertation, University of Illinois,
Champaign-Urbana, Ill.
_____. 1956. Recent developments in the use of the cloze procedure. Journalism Quarterly 33. 42-48.
Thurstone, T. G. 1959. General Reading for Understanding: Teacher's Handbook. North Carolina:
Science Research Associates.
_____. 1963. Reading for Understanding Placement Test. Chicago: Science Research Associates.
Truus, S. 1972. Sentence construction in English and Swedish in the writings of Swedish students of
English at university level: a pilot study. Projektet sprakfardighet: engelska (SPRENG),
Rapport 7. Engelska institutionen, Goteborgs Universitet.
Tucker, G. R., and W. E. Lambert. 1969. White and Negro listeners’ reactions to various American-
English dialects. Social Forces 47. 463-468.
Tucker, G. R., E. Hamayan, and F. H. Genesee. 1976. Affective, cognitive, and social factors in second
language acquisition. Canadian Modern Language Review 32. 214-226.
Upshur, John A. 1971. Objective evaluation of oral proficiency in the ESOL classroom. TESOL
Quarterly 5. 47-60. Also in Palmer and Spolsky (1975), 53-65.
Valette, R. M. 1967. Modern Language Testing: A Handbook. New York: Harcourt, Brace and World
(2d ed., 1977).
Van Syoc, W. B., and F. S. Van Syoc. 1971. Let’s Learn English: Advanced Course, Book 5. New York:
American Book Company.
Vineyard, Edwin, and Robert B. Bailey. 1966. Interrelationships of reading ability, listening skill,
intelligence, and scholastic achievement. Journal of Developmental Reading 3. 174-178.
Warden, David A. 1976. The influence of context on children’s use of identifying expressions and
references. British Journal of Psychology 67. 101-112.
Webster, W. G., and E. Kramer. 1968. Attitudes and evaluational reactions to accented English speech.
Journal of Social Psychology 75. 231-240.
Whitaker, S. F. 1976. What is the status of dictation? Audio-Visual Journal 14. 87-93.
Whyte, W. F. 1955. Street Corner Society: The Social Structure of an Italian Slum. Chicago:
University of Chicago Press.
Wijnstra, Johan M., and Nico van Wageningen. The cloze procedure as a measure of first and second
language proficiency. Unpublished manuscript.
Wilds, Claudia P. 1975. The oral interview test. In Jones and Spolsky (1975), 29-44.
Williams, F. 1970. Psychological correlates of speech characteristics: on sounding “disadvantaged.”
Journal of Speech and Hearing Research 13. 472-488.
Williams, F., J. L. Whitehead, and L. M. Miller. 1971. Ethnic stereotyping and judgments of children’s
speech. Speech Monographs 38. 166-170.
Wilson, L. I. 1973. Reading in the ESOL classroom: a technique for teaching syntactic meaning.
TESOL Quarterly 7. 259-267.
Winer, B. J. 1971. Statistical Principles in Experimental Design. 2d ed. New York: McGraw-Hill.
Appendix
*The numbering system used here corresponds to the numbering used in the main body of the text
referring to these tests. It does not refer to the order of testing, and there are some numbers missing
because they refer to published tests. See the brief descriptions of tests used by Scholz et al., Chap. 2.
**All the directions were presented both in written form and on tape (in English). The written
form and the answer sheet, consisting of numbered blanks, are not given here. Instructors were advised
to make certain students understood thoroughly before the actual testing began. On the whole, the task
proved to be quite difficult for all but the most advanced students.
Passage A
(1) At the airport they saw a helicopter. (2) Neither Tom nor David nor Annabel
had ever seen one close up before. (3) “It looks funny without wings,” Annabel
said. (4) “How long does it take you to fly downtown from here?” (5) Tom asked
the pilot. (6) “Six minutes,” he answered. (7) “Wow! By car it takes almost an
hour,” Tom said. (8) “And it can take even longer if there’s lots of traffic,” (9) his
father added. (10) “I’d like to go for a ride more than anything,” David said. (11)
“I’m sorry,” said the pilot. (12) “I can’t take you downtown now, son. (13) We
have a full load, (14) and everybody is in a hurry.” (15) Looking at his watch, he
added, “We have to take off now. So long.”
Passage B
(16) The United States can try to find more oil. (17) Big deposits of oil have
recently been discovered (18) in Alaska, in Mexico, and under the waters of the
North Sea. (19) We already get a lot of oil from underneath the water along the Gulf
Coast and the Pacific Coast of the United States. (20) Scientists believe there may
be large oil deposits off the Atlantic Coast as well.
(21) There is another source of oil, (22) but it will be hard to get. (23) Hundreds of
millions of barrels of oil are mixed with sandy rock called shale. (24) There is
enough shale oil to keep us going for many years. (25) But at present, it is very
costly to remove, (26) and mining it may leave the countryside ugly.
(27) The United States has always been rich in oil. (28) We still produce a great
deal of it. (29) But we use a great deal of oil, (30) so someday we will run out of it.
Passage C
(31) Anybody who has traveled much on any continent (32) knows that there are
many land surfaces (33) that are too flat to be called mountains, (34) and too rough
to be called plains. (35) We can all agree to call such land surfaces hills, (36) with
the understanding that in some regions high hills are difficult to distinguish from
low mountains, (37) and low hills are difficult to distinguish from plains. (38) Hill
country includes regions with an average altitude change of more than 500 to
1,000 feet, (39) and with many more sloping surfaces than flat ones. (40) By this
definition, (41) most of the Illinois mountains are not mountains at all; (42) rather
they are hills. (43) So are many of the so-called mountains of New England. (44)
They may be called mountains by New Englanders, (45) but to a geographer they
are low hills.
3. Listening cloze
NAME: ______ INSTRUCTOR: ______
DIRECTIONS: You will hear three short paragraphs. Each one will be read
once; then it will be read a second time with some of the
words left out. Try the example to see if you understand.
EXAMPLE: 1. A. library 2. A. friend 3. A. jerk
B. bookstore B. janitor B. fellow
C. drugstore C. manager C. story
D. bar D. owner D. clerk
The correct choices are B for 1, C for 2, and D for 3. They
work at the bookstore. Sarah is the manager, and Harry is a
clerk.
Passage A
(1) Two weeks before school started, Henry Higgins was in the kitchen one
evening feeding Fala (2) while Mr. Higgins washed the dinner dishes and Mrs.
Higgins dried them. (3) Henry took some hamburger and half a can of Alpo Dog
Food out of the refrigerator. (4) Henry broke up the hamburger and put it on Fala’s
dish. (5) “Why don’t you chew it?” he asked, (6) when Fala began to gulp it down in
large chunks. (7) Henry spooned the last of the can of Alpo into the plastic dish. (8)
Fala sniffed at the food. (9) Then he wagged his tail and looked hopefully at Henry,
(10) who knew this meant that Fala would eat the dog food (11) only when he was
sure (12) he was not going to get any more hamburger.
Passage A
Passage B
(13) You know how fast a jet goes. (14) It is hard for the pilot to get out of it (15)
when it is going so fast. (16) It is hard to get away from it too. (17) We had to work
out a way of escape. (18) A pilot must not be caught in a jet when there is trouble.
(19) We made a seat that would shoot down out of the jet. (20) It was automatic.
(21) A pilot could just press a button! (22) Out he would go, seat and all. (23) We
tried out the automatic seat with dummies. It seemed to work. (24) Now I was to be
the first live man to try it.
Passage B
(13) A. foggy (19) A. jet
B. super B. pilot
C. fast C. button
D. far D. seat
(14) A. might (20) A. went
B. should B. was
C. is C. has
D. could D. did
(15) A. like (21) A. pilot
B. actually B. driver
C. also C. rider
D. so D. passenger
(16) A. get (22) A. have
B. walk B. g°
C. drive C. be
D. fly D. land
(17) A. it (23) A. wild animals
B. we B. pictures
C. you C. dummies
D. he D. guinea pigs
(18) A. surely (24) A. honest
B. too B. ordinary
C. not C. great
D. try to D. live
Passage C
(25) All Aztec boys had to attend school when they reached fifteen years of age.
(26) They were trained chiefly in making war. (27) That was because Aztecs
valued war. (28) They believed wars made them strong and great, (29) so of course
they taught their children the same values. (30) The Aztec child learned that it was
important to be a brave soldier and (31) that it was an honor to die in battle. (32)
That pleased the gods, who rewarded the soldier with a long and happy life after
death. (33) The Aztecs were also trading people. (34) At their markets they
exchanged food for clay pots, tools, and other things they needed. (35) The Aztecs
traded with other Indians also. (36) Aztec merchants visited villages hundreds of
miles away.
Passage C
(25) A. had (31) A. for
B. did B. like
C. must C. and
D. ought D. to
(26) A. and (32) A. angered
B. to B. worshipped
C. in C. destroyed
D. of D. pleased
(27) A. so (33) A. Amazons
B. because B. Amish
C. however C. Aztecs
D. like D. Asians
(28) A. wanted (34) A. because
B. tried B. since
C. needed C. with
D. believed D. and
(29) A. his (35) A. traded
B. your B. sold
C. their C. bought
D. its D. cashed
(30) A. funny (36) A. villages
B. brave B. miles
C. happy C. others
D. real D. them
*These directions were presented both in writing and on tape. The text, questions, and choices
were all taped for each passage. However, the students’ answer sheets only contained printed versions
of the alternatives to each question.
4. Listening comprehension
Passage A*
In 1911 people had something new to talk about. A New York newspaper said
it would give 50,000 dollars to the first man to fly across the country. The flight
had to be made within 60 days, however. There weren’t many planes or fliers in
those days. The first American flight had been made only eight years before. Few
people had ever seen a plane flying. The New York newspaper’s story made most
people laugh. No one thought it was possible to fly a plane clear across the country.
Besides, who would be foolish enough to try?
The first man to start was Robert Fowler. He was only able to travel a couple
of hundred miles a day. Everywhere people stopped what they were doing to watch
him. Most of them had never seen a plane before. After 49 days and many troubles
he finally made it from San Francisco to New York. He didn’t collect the 50,000
dollars though because he had waited too long to start his trip and the 60-day time
limit had elapsed.
* Adapted from Science Research Associates Reading Lab Orange 5, Secondary Edition.
Passage B*
Little things mean a lot. They may make a lot of money for the man who
invents them. They may save other people a lot of time. Every day you use many of
these little things called “gadgets.” One of these little things made Hyman Lipman
a fortune. He was the man who thought of putting an eraser on a pencil. Ever use
one? He took out a patent on this idea and sold it for $10,000. That may not sound
like much today, but it was ten years’ wages in 1908. Another useful little gadget is
the paper clip. Hortense Jones was responsible for its patent. It netted him
$25,000. Another patent was for screw-top jars. It seems simple now, but it was a
real contribution to the average home in the early 19th century. Bottle tops too
were invented by a resourceful man. His name was Pointer. Have you ever thought
of inventing something and taking out a patent on it? Who knows, you might get
rich. While you’re thinking you can scratch off barbed wire invented by Joseph
Glidden in 1827, and zippers made by a man named Judson in 1830. But it is a safe
bet that many gadgets are yet to be invented.
A. 1920.
B. 1801.
C. 1800.
D. 1755.
(17) The passage invites the listener to consider
A. becoming a railroad worker.
B. joining the Air Force.
C. flying crop dusters.
D. trying to invent something.
(18) The incentive is that you might
A. become famous.
B. get an education.
C. get rich.
D. become a businessman.
(19) The following gadgets are already patented according to the article
A. pens, glasses, and envelopes.
B. paper clips, zippers, and barbed wire.
C. rubber-soled shoes, hairpins, and watch bands.
D. levis, harnesses, and wagon wheels.
Passage C*
One summer day in 1842 a man appeared on Broadway in New York City
with five bricks under his arm. Solemnly, he placed one of the bricks at a busy
corner. Then he walked three blocks down Broadway, placing a brick at each
corner. With the fifth brick under his arm, he walked to the American Museum.
There he presented a ticket for admission, walked through the building, and went
out the rear door.
In a few minutes, he was back on Broadway, where he picked up each brick
and put down in its place the brick he had been carrying. Then he went back to the
Museum and through its doors without a word to the curious people who followed
in his path.
Each time he entered the Museum, a crowd paid the admission charge of 25
cents, hoping to find the answer to the puzzle of the walking bricklayer inside. Half
an hour after this strange business began, 500 people were following the man.
After several days, the police put a stop to the performance because the crowds
were interfering with street traffic. But P. T. Barnum was pleased. He paid off his
walking bricklayer and chuckled as he counted up his increase in profits from the
Museum. It was known as Barnum’s American Museum.
(25) As he walked toward the American Museum he laid all the bricks but one
down
A. each on a different corner one at a time.
B. in the gutter at the corner of 25th and Vine.
C. and smashed them one by one.
D. and never picked them up again.
(26) Then he carried the last brick into
A. Madison Square Garden.
B. the Astrodome.
C. the Ford factory.
D. the American Museum.
(27) He paid the admission fee and walked right on
A. into the inside.
B. out the back door.
C. up the stairs.
D. down the street.
(28) A few minutes later he was back on Broadway where he picked up each brick
he laid down and
A. picked up three passengers and drove across town.
B. put the one he was carrying in its place.
C. carried them all back home.
D. put them back in his suitcase.
(29) Then he went back to the
A. Astrodome.
B. movies.
C. museum.
D. fair.
(30) He kept doing this over and over, and after a while people began to follow
him, each time paying admission as they went through the
A. tunnel.
B. museum.
C. theater.
D. school.
(31) P. T. Barnum had paid the man to behave strangely in order to get enough
people to go to the
A. theater.
B. Brooklyn bridge.
C. museum.
D. desk clerk.
5. Dictation
DIRECTIONS: This is a dictation test. You will hear three paragraphs.
Each one will be read three times. The first time, listen.
The second time, write what you hear, during the pauses.
The third time, you may try to correct any mistakes.
Punctuation marks will be given the second time. Do not
spell out the punctuation marks.
Taped If the speaker says, “This is a book, period,” you will
Instructions write: This is a book.
If the speaker says, “He has a pen, comma, a pencil,
comma, and a notebook, period,” you will write: He
has a pen, a pencil, and a notebook.
If the speaker says, “Those classes, dash, English and
history, dash, were interesting, period,” you will
write: Those classes—English and history—were
interesting.
Remember, you will hear each paragraph three times.
The first time listen; the second time, write; and the third
time, correct mistakes.
Are there any questions?
Now we will begin with paragraph 1.
DICTATION A*
Some days / it doesn’t pay to get up. / Some days / you can’t get anything right. /
One day I woke up, / and the sun was shining. / Birds were singing. / I got up, /
spilled hot coffee at breakfast, / tripped and fell down the stairs, / and got to class
just as it was ending. / All in all, it was a bad day to get out of bed. /
DICTATION B
DICTATION C
I have always believed / that every college curriculum / should include a course /
on the nonappreciation of literature. / In such a course / the student would decide /
what books he doesn’t like and why. / A well-reasoned term paper / on why you
don’t like Major Barbara / could be a contribution of great merit. / We learn by
what we reject / as well as by what we accept. /
*Slashes indicate the points at which pauses were inserted in the text on the second reading.
11. Repetition
Part I:
The first exercise is sentence repetition. We will say a
sentence, and you will repeat exactly what you hear. For example, I
will say:
Taped Last summer I saw mountains for the first time.
Instructions Then I stop, and you repeat.
Then I continue:
In the distance the mountains looked very blue.
And again you repeat. Do you understand? I say the sentence, and
you repeat it when I stop. If you have any questions, raise your hand
now. [Pause for the lab instructor to answer questions.]
OK, let’s begin.
Passage A
Once there was a good man and his good wife.
They lived in a beautiful house.
The man and his wife were very happy except that he could never
find things.
He would look for his shoes, and he couldn’t find them.
He would look for a book, and he couldn’t find it.
But it was not his fault.
He couldn’t find anything because his good wife had moved everything.
Passage B
China was one of the oldest and largest kingdoms in the world.
Why then was it so easily divided by foreign imperialists?
There were many reasons.
The government was weak.
The people wanted to keep their old customs and ways of doing
things.
They refused to use western methods of farming and manufacturing.
China in the 19th century was a backward country.
Passage C
Passage B
Passage C
Passage D
Passage E
Besides changing the environment to suit his own needs, man
also makes it uninhabitable for many wildlife species by polluting
and poisoning it. Chimneys, factories, and exhaust pipes belch
deadly debris into the atmosphere. Many streams and rivers run
thick with human and industrial refuse, making them death traps
for fish and other forms of aquatic life. A number of lakes are dying,
because man uses them as dumping grounds for wastes of every
kind.
Now we will begin part III. This last exercise is very easy. You
will be given 3 paragraphs to read. All you have to do is read them as
naturally as possible. When you finish the first paragraph, go on to
Taped the rest of the paragraphs. When you finish, turn off your tape
Instructions player and stay in your seats until everyone is finished. You will
have a few minutes to read over the paragraphs first.
Thank you again for coming. We really appreciate your help.
Passage A
Way out at the end of a tiny little town was an old overgrown
garden, and in the garden was an old house. In the house lived Mary
Mullins. She was nine years old, and she lived there all alone. She
had no mother and no father. That was, of course, very nice because
there was no one to tell her to go to bed just when she was having the
most fun.
Passage B
When students and an instructor walk into class the first day of
the semester, they know without thinking what to expect of each
other and how each will behave. The students know that the
instructor will stand in front of the class behind a lectern, probably
call the roll, assign reading, and dismiss them early on the first day.
The instructor knows that, unless the class is required, students will
be shopping around.
Passage C
For long-distance travel, the airplane has replaced the railroad
and the ship as the principal carrier. The airplane has become so
commonplace that we often fail to realize what a recent development
in transportation it really is. The first transatlantic passenger
flights were made only a few years before World War II. Frequent
service came into being only after the war, and it was not until jets
were introduced that passenger capacity began to expand.
15. Reading match test
DIRECTIONS: You will read three short paragraphs. In each one, some of the
words or phrases are underlined. For each underlined word
or phrase, several choices are given inside square brackets as
shown in the example below. YOU ARE TO SELECT THE
CHOICE THAT MEANS THE SAME OR NEARLY THE
SAME AS THE UNDERLINED WORD OR PHRASE.
The best choice is B. little, because the words small and little
mean almost the same thing in this context. So, you would
circle the letter B.
You should read the entire passage before you try to mark the
correct choices. Mark your answers right on the test booklet.
Passage A
the United States for several (2) reasons A. hopes . First of all, he
B. wonders
C. purposes
D. desires
wanted to (3) learn A. become able to understand and speak English and
B. try to forget all about
C. be able to hear
D. have a chance to use his limited
to really be able to talk with and understand the people who (4) call
A. The number three thing he wanted to do was to finish his degree in engineer-
B. The last thing on his mind
C. The one thing he didn’t want to do
D. The item second to none
ing so that (8) he would be able to A. the engineers would help build much
B. he could
C. the degree would
D. it would be able to
needed dams and other water projects in his homeland back in Africa. There were
not to
(9) canals and reservoirs A. canaries and elephants
B. channels and highways
C. waterways and manmade lakes
D. government buildings and reserves
mention the many other projects that were underway. (10) Joe wanted to contrib-
knew how important the projects were to the survival of his people. For all of these
reasons, (11) learning English for Joe A. teaching Joe to speak English
B. Joe’s learning of English
C. to learn Joe’s English
D. learning about himself in English
was not just a school task. In fact, (12) it was a necessary step toward a goal he had
Passage B
Nicholas Rizos had come to Athens to work. He was not merely visiting
Greece as a (13) tourist A. friendly laborer. He left America and (14) arrived
B. book author.
C. sailor.
D. casual traveler.
Nicholas to come and help him in his garage. Nicholas had (17) accepted
(18) thought A. believed the work would do him good. (19) At least
B. wondered if
C. wished that
D. promised
A. On the other hand the experience would be good for him. He would have
B. In any case
C. For the best
D. For the time being
country (21) of his parents’ birth A. where his father and mother were born
B. where his parents used to live
C. of his own birth and his youth
D. of his heritage and grandfathers
For several months he had studied Greek. (22) He wanted to speak it as well as
and to be able to read signs along the streets of Athens. Now (23) he wished he
could have practiced more A. Nicholas was sorry he had not spoken more Greek
B. He wished he had practiced his English more
C. Nicholas wanted to practice sailing more
D. He was sorry he hadn’t practiced reading signs
with the Greek sailors on board the cargo ship. That first morning in Athens was
some experience! Finally, (24) Nicholas was really there—in the land where his
Passage C
This book is about power and the (25) means A. agencies of producing
B. methods
C. theories
D. memories
it. The word “energy” (26) has A. repairs more than one meaning in
B. restores
C. clarifies
D. possesses
energy,” “electrical energy,” and so on. (29) Usually A. In the long run
B. In summary
C. In general
D. In this analysis
tion. The word “power” will be (33) used to mean A. limited to control
B. asked to reflect
C. established to produce
D. utilized to signify
energy that has been converted into a form that can (34) readily be applied
A. vigorously be denied for many practical purposes. The first section of this
B. correctly be appointed
C. quickly be chosen
D. easily be used
book deals with (35) the sources of energy that can be turned into power.
The second section presents the scientific principles that lie behind the conversion
of energy into power. This section also shows (36) how and why different
DIRECTIONS: You will read 2 short paragraphs. Some of the words in the
paragraphs have been left out. Try to guess the missing words.
For example, you might read:
EXAMPLE: The cat ran up the_.
You should fill in the blank with the word that seems to fit the
context best. If you do not know you should guess. In the
example you might answer
ANSWER: The cat ran up the tree.
Other words that would also fit in the blank are wall, branch,
street, and so on. Remember that when there is more context,
the choices are more limited. Make sure your answers fit. Be
sure to read the whole text before you try to fill in the missing
words. It is all right to guess if you are not sure. Try to use only
one word for each blank. Do not turn page until the instruc¬
tor tells you to begin. You will have 20 minutes to complete
this exercise.
Man has always made music. His (1) (voice) is a natural musical
instrument. From (2) ______ times, music in some form has (3) (always)
been with him. Man made music (4) (with) his voice long before he ever
(5) (created) a musical instrument like a guitar (6) (or) a flute. For
thousands of years man’s (7) (music) was the sound of his own (8)
(voice), the sound of animals and the (9) (singing) of birds. There
was also the (10) (sound) of streams, and other natural things (11)
(around) him. Today composers write difficult music (12) (in)
symbols that are learned by other people. (13) (A) performer can sing
or play a (14) (song) that he has never heard before (15) ______ he
has learned to read those symbols.
NAME: ______ INSTRUCTOR: ______
Passage A
A farmer’s daughter had been out to milk cows and was returning home, carry-
(A) started
(B) had to
ing her pail of milk on her head. As she walked along she (1). (C) prepared
(D) began to
Then I will buy eggs and these will produce chickens. Soon I will have
(A) a large
(B) a hard
(3)- (C) an expensive poultry-yard. After that I will sell some of
(D) a high
which I will wear when I go shopping. All the young men will love me, but I will
which a little more than one and a quarter acres was available per person for the
(A) on
(B) at
world as a whole (2) (C) in 1955. If the world population in
(D) upon
(A) for
(B) to
creases by the year 2000 (3) (C) around about seven billion
(D) at
persons, which some scientists now predict, the amount of arable land per person
will decrease to just over one-half acre. There will be some new land under culti¬
the factors that slow down the rate of population growth. As the population
increases, therefore, the size of the piece of land from which each person
Passage C
Economics of Development
of trading profits. The colonial power was primarily interested in supplies and
(A) is meaning
(B) had meant
(1) (C) was meaning it was primarily interested in the colony’s
(D) meant
stuck to such an extent that even the Pearson Report considers the expansion of
(A) someone
(B) no one
oping countries. But of course, (4) (C) people do not live by
(D) country
(A) herself
(B) itself
exporting, and what they produce for (5) (C) themselves
(D) them
DIRECTIONS: You will read three more short passages. The first is about
Gifted Children; the second is about the Migration of Birds;
the third is about the Origins of Language. In each passage
some of the words or groups of words are underlined. Next to
(A) NO CHANGE
(B) go
John (1) goed (C) will go to town yesterday.
(D) went
(E) has went
Passage A
Gifted Children
(A) NO CHANGE
(B) with
Most people have misconceptions (1) for (C) toward
(D) through
(E) about
(A) NO CHANGE
(B) that
(2) which (C) because these qualities reflect heredity or
(D) whether
(E) unless
environment. Parents and the home atmosphere (3) count heavily with
(A) NO CHANGE
(B) count heavily in stimulating
stimulating (C) count heavily for stimulate giftedness. What’s
(D) count heavy in stimulating
(E) count heavily for stimulating
(A) NO CHANGE
(B) need
(4) needing (C) needed is a strong mother and father who
(D) for need
(E) to need
love, appreciate, respect and trust their children and a home (5) where there is
(A) NO CHANGE
(B) that there is harmony and sharing experiences.
(C) where there are harmony and sharing experiences.
(D) where there is harmony and to share experiences.
(E) where there is harmony and share experiences.
(A) NO CHANGE
(B) to intellectual development
(6) of intellectual development (C) to intellectual develop.
(D) to intellectually development.
(E) for intellectually developing.
Passage B
With the coming of autumn, many species of birds living in northern lati-
(A) NO CHANGE
(B) to south.
tudes migrate (1) toward southward. (C) southward. They do
(D) southern.
(E) to the southward.
(A) NO CHANGE
(B) since
(C) because they have “thought out” the
not migrate (2) as
(D) on account of
(E) why _
situation and have planned ahead for winter. This is a human way
(A) NO CHANGE
(B) in seeing
(C) of seeing the problem. For the migrating
(3) be seeing
(D) in thinking
(E) to think
(A) NO CHANGE
(B) nothing
birds there is no “problem.” There is (4) anything (C) not everything
(D) nobody
(E) not all of it
(A) NO CHANGE
(B) It is seemed that the gradual shortening of daylight hours triggers certain
glands to acting,
(C) The gradual shortening of daylight hours has to be triggering certain glands
into acting,
(D) The gradual shortening of daylight hours seems to triggering certain glands
into action,
(E) It has been that the gradual shortening of daylight hours has triggered
certain glandular reactions,
Passage C
(A) NO CHANGE
(B) to be contented
origins of language, and have (2) to content (C) been contented
(D) contently been
(E) to contently
(A) NO CHANGE
(B) even do not know
(3) never know (C) do not knowing for certain whether language
(D) do not even know
(E) do even not know
(A) NO CHANGE
(B) at
arose (4) with (C) by the same time as tool making
(D) on
(E) as
and the earliest forms of specifically human cooperation. In the great Ice Ages of
the Pleistocene period people made fire and cooked their food. (5) They
(A) NO CHANGE
(B) They had big hunting games,
hunted big game, (C) They hunted with big game, often by
(D) The games were largely hunted,
(E) They were hunted by big game.
to believe that (6) speech makers lacked the power in this culture.
(A) NO CHANGE
(B) the power of speech made a lack in the culture.
(C) making speeches was lacked in this culture but power was not
(D) the power of the culture was in the speech makers who lacked it.
(E) the makers of this culture lacked the power of speech.
DIRECTIONS: In this part, again, you will read three short passages. The first
passage is about Growing Up; the second is about Modern
Man; and the third is about Driver Education and Traffic
Safety. In each passage there are blanks where some words or
groups of words have been left out. The words that are
missing are given, but they are not given in the correct order.
You must decide what order is the correct order.
In this part you use all of the possible choices. Because the
sentence should say “George really loves Sarah, (1) but (2) Sarah
(3) doesn’t (4) love him,” you should write the letter B in blank (1),
C in blank (2), A in blank (3), and D in blank (4). Do not turn the
page until your instructor tells you to begin.
Passage A
Growing Up
of the future. The year between one Christmas and the next Christmas
(A) seems
(B) eternity
(A) shorter
(B) day
(15) (16) (C) each and each week seemed to
(D) seemed
have fewer days in it. Things were changing so fast for Francie that (17)
(A) she
(B) mixed
(18) -(19) _(20) (C) got
(D) up
Passage B
Modern Man
One finds that progress can also have its drawbacks. It is true that today
man moves more swiftly through the world. But in doing so, he often loses (1) _
(A) of
(B) and traditions
(2) (3) (4) (C) the roots that give sub¬
(D) sight
(A) men of
(B) than
(11) (12) (C) wiser Instead, the ease
(D) earlier generations
Passage C
The “traffic problem” was created by man and perhaps it can progressively be
solved by man. You, the driver, have a big stake (1) -(2) -
(A) effective
(B) in
(4) (C) control of traffic, which includes everyone
(3)
(D) the
and everything that moves on our streets and highways—pedestrians, passenger cars,
Signs, signals, and markings are used to direct and regulate traffic and to (9)
Police supervision
is provided to aid and protect the safe drivers and to remove from the traffic those
(A) of people
(B) throughout this country
(15) (16) (C) working to help keep you
(D) tens of thousands
safe.
NAME: ______ INSTRUCTOR: ______
DIRECTIONS: You will read three short passages one at a time on the screen
in the front of the room. You must write down everything you
remember after you read the passage. Try to get the meaning.
The exact wording of the passage is less important. You will
have exactly one minute to study the passage. Then, the pro¬
jector will be turned off and you will not see the passage
again. When the projector is off, you may begin writing. You
may not take notes while the projector is on. You will have
five minutes to write what you remember of the passage. Here
is a very simple example:
EXAMPLE: Sarah is an older woman who lives alone. She works at a
library. She rarely goes out at night because she is afraid of
the street gangs in New York.
ANSWER:
Now you may check your answer. Did you tell where Sarah
lives? Her name? Did you mention her job and where she
works? Did you tell about the fact that she lives alone? Did
you mention her fear and why she is afraid?
Your score depends on how much of the passage you are able
to reconstruct from memory and write down. The facts are
important. The order and manner in which you write them
down may be important, too. Try not to leave out anything, or
to add things that were not in the passage.
Passage A
A snake sheds its skin as it grows. First, the snake breaks the skin on its nose. Then
the snake crawls out, leaving the skin behind. Although snakes have no legs, they
move about very well. Some can crawl as fast as you can walk. Some climb trees.
Most of them can swim.
Passage B
Composition, oral as well as written, is the controlled use of language. The two
forms make up the “expressive” language arts of speaking and writing. But compo¬
sition is more than merely talk or writing; it is speech or writing with a plan and a
purpose and a conscious choice of words and ideas.
Passage C
22. Grammar
DIRECTIONS: You will read three passages. The first is about students prac-
ticing for an English test; the second is about a mother and
her children; the third is about two friends discussing work.
In each passage, there are blanks where a word has been left
out; in a few cases, a contraction (like I’m, isn’t, doesn’t) is
needed to fill the blank. Read the entire passage before filling
in the blanks because you may get help from the next
sentence or from a sentence further along in the passage.
Write in the word (or the contraction) you think goes best in
the blank; if you change your mind, erase and write in a new
word.
EXAMPLE: The doctor said that John ______ swim if he wanted
______, but that he should ______ careful.
John said, “Well, ______ you think I should
______ go swimming, I won’t.”
ANSWER: The doctor said that John _could_ swim if he wanted
_to_, but that he should _be_ careful.
John said, “Well, _if_ you think I should
_not_ go swimming, I won’t.”
In a few cases you may be able to think of more than one word
that can fit. Be sure that each one fits exactly with the words
that come before and after it; then you can write in either one
of the words. REMEMBER: Only one word in each blank
(contractions count as one word).
Passage A
Two students are practicing for an English test. They are asking each other
questions and giving the answers. “(1) _How_ many months are there in a
(2) _year_?” “There are twelve, three in (3) _each_ season, and of
course that means (4) _there_ are four seasons.”
“Yes. Do you know the (5) _days_ of the week?”
“Of course I (6) _do_. Ask me a hard question!”
“OK. Do you want (7) _to_ practice spelling?”
“No. I (8) _can_ spell everything without (9) _any_ trouble.”
“All right. Do you know (10) _when_ summer begins?”
“No, I don’t. Ask me (11) _something_ else.”
“All right. (12) _Will_ it snow next month?”
Passage B
Mrs. Smith was (26) _reading_ a book to her children, and they (27)
_were_ listening very carefully to her. (28) _Their_ faces showed
how deeply they were interested (29) _in_ the story, and if one of (30)
_them_ made any sound, the other children (31) _would_ tell the
offender to be silent. The smallest, only four years (32) _old_, was sound
asleep on a cushion (33) _next_ to his mother, and the others were becoming
(34) _sleepy_ also, yawning and rubbing their (35) _eyes_.
Mother looked around and said to them, “(36) _Has_ everyone heard
enough yet?” “(37) _Please_ don’t stop, Mother,” said the oldest child, “(38)
_there_ isn’t much more.” “Yes, keep reading. I (39) _am_ enjoying
it so much,” said the next oldest. “(40) _Which_ one of you wants to
read? I simply (41) _must_ wash the dishes.” “Do you really (42)
_have_ to?” asked the second youngest. “I think Daddy (43) _should_
do the dishes tonight.” “Your father (44) _has_ worked hard all day.”
“But so have (45) _you_, Mother. You’ve worked all day too, (46)
_isn’t_ that so?” asked the oldest. “I agree (47) _with_ you, my
dear child, but mothers get (48) _used_ to working hard all day, more (49)
_than_ fathers.” The second youngest said, “Daddy has (50) _been_
sleeping for an hour, so I think mothers are much stronger than fathers.”
Passage C
John: “Well, hello there, Frank! How are you these days? I haven’t seen you
since last September. How is everything?”
Frank: “Just fine. I’ve really (51) _been_ studying hard this semester.”
John: “Listen, (52) _have_ you eaten lunch yet? (53) _Would_
you like to have something (54) ______ with me?”
Frank: “That sounds fine. I (55) _didn’t_ even have coffee this morning
or anything at all (56) _to_ eat since I began work. There (57)
_were_ people coming in all morning for information.”
John: “(58) _Do_ you have to work like (59) _that_ every
day?”
Frank: “No, but today half (60) _of_ the workers were out sick, so
(61) _the_ rest of us, those who came in, (62) _had_ to do everything.”
John: “Oh, I see. And (63) _do_ you enjoy your job? Is it (64)
_something_ you can do easily at the (65) _same_ time that you’re going
to school?”
Frank: “(66) _Why_ do you think it’s different from any (67)
_other_ job here at school? (68) _Does_ it seem like hard work?
Actually, it (69) _isn’t_ hard at all: I go (70) _there_ at work for
three hours (71) _and_ leave at noon. I think that (72) _most_
students work at least three hours a day. That (73) _doesn’t_ seem like a lot to
me.”
John: “O.K. Now (74) _where_ shall we go for lunch?”
Frank: “Well, Superburger (75) _is_ good enough and it’s close,
that is, (76) _if_ you like hamburgers.”
NAME: ______ INSTRUCTOR: ______
DIRECTIONS: You will read two passages. The first is about a literature test;
the second is about taking a morning walk. In each passage,
there are blanks where a word has been left out; in a few
cases, a contraction (like I'm, isn't, doesn't) is needed to fill
the blank. Read the entire passage before filling in the blanks
because you may get help from the next sentence or from a
sentence further along in the passage. Write in the word (or
the contraction) you think goes best in the blank; if you
change your mind, erase and write in the new word.
EXAMPLE: The doctor said that John ______ swim if he wanted
______, but that he should ______ careful.
John said, “Well, ______ you think I should
______ go swimming, I won’t.”
In a few cases you may be able to think of more than one word
that can fit. Be sure that each one fits exactly with the words
that come before and after it; then you can write in either one
of the words.
Passage A
An interesting discussion arose in our literature class the other day. Since it
was getting near the final examination period, the teacher (77) _told_ the
class that there (78) _would_ be a test on the entire textbook the following
week. “(79) _Does_ that mean that we have (80) _to_ know
everything?” asked Tom. “Not quite everything, (81) _but_ almost everything,”
answered the teacher. “(82) _Did_ you say the whole book?”
asked John. “(83) _Are_ you asking because it’s too (84) _much_
reading?” said the teacher. “Actually, it (85) _does_ seem like a lot,” said
John, “because (86) _it’s_ such a big book.” “Would you (87)
_rather_ have two short tests on the book, (88) _instead_ of a long
test?” “I would, but the (89) _others_ might not.” “All right. Let’s see: how
(90) _do_ the rest of you students feel (91) _about_ the matter?
How many of you (92) _prefer_ only one test and how many prefer two?”
“(93) _Is_ it possible for us to have (94) _only_ one test on only
one half (95) _of_ the book?” asked Sam. “I’m (96) _afraid_
not,” said Mr. Jackson, “because (97) _if_ I don’t test you on the entire
book, I (98) _can’t_ grade you on the entire book.” “But (99)
_we_ all read the book, Mr. Jackson!” said Helen. “(100) _I’m_
sure of that, but I don’t know (101) _who_ understands the meaning of the
book, and (102) _that_ is what tests are really for.”
Passage B
I got up very early this morning and went out for a little walk. I think that it
was about 6 a.m., still not very light out. I put on a warm sweater and a heavy jacket
(103) _because_ it was quite cold. I wore (104) _gloves_ on my hands,
and I put a scarf (105) _around_ my neck to keep warm. The weather (106)
_was_ fair, and I could see that (107) _it_ was going to be a nice
day. There was (108) _nobody_ outside but me, and I (109) _thought_
that was unusual; after all, six o’clock (110) _wasn’t_ really very early. I
asked (111) _myself_ why nobody was in the street: (112) _could_ it
be that my watch was wrong? (113) _Was_ it really five o’clock, and not
six? (114) _After_ walking another block without meeting anyone, I (115)
______ a newsboy delivering papers on his bicycle. “(116) _Why_
are the papers so thick today?” I wondered. (117) _Like_ a bolt of lightning,
the reason flashed (118) _through_ my head: It was Sunday! Somehow it
(119) _had_ slipped my mind, and that was strange: (120) _After_
all, I wouldn’t have gotten (121) _up_ so early if it hadn’t been Sunday,
(122) _would_ I? When I returned home, everyone was (123)
_still_ sleeping, so I prepared a big (124) _breakfast_ for myself, sat
and ate it while reading the morning (125) _paper_. And still nobody got up!
“Well,” I said to myself after some reflection, “(126) _If_ you don’t like
to be alone, you should (127) _not_ get up so early.”
23. Attitude questionnaire
NAME: ______ COURSE: ______
Dear Student:
The CESL staff would like to know your reaction to all the testing that has been
done this term. Would you please give us your reaction in addition to filling out the
other questions below? Thank you.
1. I thought the tests were
a. too easy
b. a little too easy
c. okay
d. a little too hard
e. much too hard
2. I would have preferred
a. much less testing
b. a little less testing
c. about the same
d. a little more testing
e. much more testing
3. How long have you been in the United States?_
300 APPENDIX
4. How long did you study English before coming to the United States?
a. less than one year d. five to six years
b. one to two years e. seven years or more
c. three to four years
5. For how many hours a day did you study English in your own country?
a. less than one hour a day c. three to four hours
b. one to two hours d. more than four hours
6. What languages did your English teacher in your own country speak?
a. English only
b. English and some other language
c. only a little English
7. Do the people you live with here in the United States speak English or your
native language?
a. we only speak in my native language
b. we speak some in English and some in another language
8. What was the highest level of education of either of your parents?
a. Ph.D., doctorate, or the equivalent
b. Masters, graduate study, or the equivalent
c. BA, BS, or college graduate, or the equivalent
d. secondary (high) school graduation or the equivalent
e. eighth grade education, or equivalent
f. went to school for 4 or more years
g. less than 4 years of school
h. no schooling
9. What is your father’s job in your home country?
a. a businessman d. professor or school teacher
b. farmer or laborer e. government official
c. doctor or lawyer f. other
10. Do you enjoy your English classes?
a. always d. rarely
b. usually e. never
c. sometimes
11. Do you feel that you learn from your English instructors?
a. never d. usually
b. rarely e. always
c. sometimes
12. How much time do you spend each day studying your English lessons
(CESL) outside of class?
a none d. two hours or more
b. an hour e. three hours or more
c. more than an hour
13. Do you review the material covered in each class outside of class on your own?
a. every day d. rarely
b. usually e. never
c. sometimes
Following is a list of words that might be used to describe people. First, indicate
whether these qualities are desirable, neutral, or undesirable. Circle the word
which is desirable. If neither word is desirable, don’t circle anything. For example,
if you think that kind is better than unkind, you would circle kind.
How well do these words describe you? Notice that these words are on a scale. If
you think that you are kind most of the time, but not all of the time, you might
indicate that in the following manner:
Below is a list of words that might be used to describe people. First indicate
whether these qualities are desirable, neutral, or undesirable. Circle the word
which is desirable. If neither word is desirable, don't circle anything.
Below is a list of words that might be used to describe people. Think of each word
in terms of how well it describes Americans. For example, if you think Americans
are helpful, you would indicate as follows:
① 2 3 4 5 6 7 8
Below is a list of reasons frequently given by students for studying English. After
careful thought, please evaluate each statement according to how it reflects your
feelings by placing an X in one of the 7 blanks. Look at the example below.
agree _X_ : ___ : ___ : ___ : ___ : ___ : ___ disagree
1. It will help me to understand the American people and their way of life.
agree disagree
304 APPENDIX
4. One needs a good knowledge of at least one foreign language to merit social
recognition.
6. I feel that no one is really educated unless he is fluent in the English language.
9. If you could stay in the United States for as long as you wanted after you finish
your studies, how long would you stay? Please circle your answer. I would
a. leave as soon as my studies were finished.
b. stay 3 months.
c. stay 6 months.
d. stay 1 year.
e. stay 5 years.
f. stay permanently. (Immigrate)
About the Authors
Donn R. Callaway holds the B.A. from the University of Santa Clara and the M.A. in
English as a Second Language from SIU. During academic 1977-1978 and 1978-
1979 he worked as a master teacher in the joint program between the University of
Illinois (Urbana-Champaign) and Arya Mehr University in Isfahan, Iran. While at
Arya Mehr, he coordinated the English language testing program as well as beginning
and advanced courses in technical English.
Jill Evola received the B.A. in Latin American Studies from the University of
Michigan in 1975. Recently she, too, completed requirements for an M.A. in
Linguistics at Southern Illinois University.
Michelle Fishman has an M.A. in ESL from SIU and is currently teaching
English as a second language at the Defense Language Institute at Lackland Air
Force Base in San Antonio, Texas. During academic 1976-1977 she served as a
Teaching Assistant in the Center for English as a Second Language at SIU.
Kay Hisama holds the Ph.D. in Educational Psychology from Southern Illinois
University. She has presented a number of papers on language testing at national
and international conferences. Her most recent effort includes the preparation of a
book on psychological aspects of second language learning.
Christine Hjelt completed her B.A. at Willamette University and earned her
M.A. in English as a Second Language at Southern Illinois University. She has
taught ESL/EFL at CESL and in Zambia and Botswana.
Thomas Ray Johnson received his B.A. in Philosophy from Northern Illinois
University and an M.A. in Linguistics at Southern Illinois University. From 1972
to 1975 he served as an EFL Instructor in Safi, Morocco. Presently he is teaching
EFL at Lockhart English Language Academy in Pamplona, Spain.
Becky Lentz holds the B.A. in Spanish and Latin American Studies from the
University of Arkansas and the M.A. in English as a Second Language from Southern
Illinois University. She is now employed by the Office of International Education
at the Central YMCA Community College in Chicago.
Keith Pharis received an M.A. in English as a Second Language from SIU and
has worked in the Peace Corps training programs in Micronesia. He has also taught
EFL in Saudi Arabia and Japan and is currently an Instructor at the Center for
English as a Second Language at SIU.
George Scholz holds the M.A. in English as a Second Language from SIU and is
currently responsible for the testing program at the Institute for Electronics and
Electricity, English Training Program, at Boumerdes, Algeria, North Africa.
Scholz has presented papers at national TESOL meetings and has recently completed
additional work on the factorial structure of language proficiency.
Becky Gerlach Snow is an Instructor in the ESL program at SIU. Snow has
presented professional papers at TESOL and NAFSA in recent years and has had
wide ranging experience in teaching and testing ESL. She also holds an M.A. in the
Teaching of English as a Second Language.
Randon Spurling completed the M.A. in ESL at SIU before accepting a post at
INELEC in Boumerdes, Algeria, where she is presently an instructor in EFL.
Lela Vandenburg finished her M.A. in ESL at SIU in the spring of 1977 and is
currently teaching ESL in Africa.