
Oller, J. W., Jr. & Streiff, V. (1975). Dictation: A test of grammar based expectancies. In R. Jones and B. Spolsky (Eds.), Testing language proficiency (pp. 71-88). Arlington, Virginia: Center for Applied Linguistics (reprinted from English Language Teaching). https://eric.ed.gov/?id=EJ144883

Dictation: A Test of Grammar Based Expectancies


John W. Oller, Jr. and Virginia Streiff*

I. DICTATION REVISITED
Since the publication of "Dictation as a Device for Testing Foreign
Language Proficiency" in English Language Teaching (henceforth
referred to as the 1971 paper),[1] the utility of dictation for testing has
been demonstrated repeatedly. It is an excellent measure of overall
language proficiency (Johansson 1974; Oller 1972a, 1972b) and has
proved useful as an elicitation technique for diagnostic data (Angelis
1974). Although some of the discussion concerning the validity of
dictation has been skeptical (Rand 1972; Breitenstein 1972), careful
research increasingly supports confidence in the technique.
The purpose of this paper is to present a re-evaluation of the 1971
paper. Those data showed that the Dictation scores on the UCLA English
as a Second Language Placement Examination (UCLA ESLPE 1) correlated
more highly with Total test scores and with other Part scores
than did any other Part of the ESLPE. The re-evaluation was prompted
by useful critiques (Rand 1972; Breitenstein 1972). An error in
the computation of correlations between Part (subtest) scores and
Total scores in that analysis is corrected; additional information con-
cerning test rationale, administration, scoring, and interpretation is
provided; and finally, a more comprehensive theoretical explanation
is offered to account for the utility of dictation as a measure of lan-
guage proficiency.
In a Reader's Letter, Breitenstein (1972) commented that many
factors which enter into the process of giving and taking dictation
were not mentioned in the 1971 paper. For example, there is "the
eyesight of the reader" (or the "dictator," as Breitenstein terms him),
the condition of his eyeglasses (which "may be dirty or due for
renewal"), "the speaker's diction" (possibly affected by "speech
defects or an ill-fitting denture"), "the size of the room," "the
acoustics of the room," or the hearing acuity of the examinees, etc.
The hyperbole of Breitenstein's facetious commentary reaches its
asymptote when he observes that "Oller's statement that 'dictation
tests a broad range of integrative skills' is now taking on a wider
meaning than he probably meant."

*We wish to thank Professor Lois McIntosh (UCLA) for providing us with a detailed
description of the test given in the fall of 1968. It is Professor McIntosh, whose
teaching skill and experience supported confidence in dictation, who is at base
responsible not only for this paper but for a number of others on the topic. We
gratefully acknowledge our indebtedness to her. Without her insight into the testing
of language skills, the facts discussed here, which were originally uncovered more or
less by accident in a routine analysis, might have gone unnoticed for another 20 years
of discrete-point testing.
Quite apart from the humor in Breitenstein's remarks, there is an
implied serious criticism that merits attention. The earlier paper did
not mention some important facts about how the dictation was se-
lected, administered, scored, and interpreted. We discuss these
questions below.[2]
Rand's critique (1972) suggests a re-evaluation of the statistical data
reported in the 1971 paper. Rand correctly observes that the inter-
correlations between Part scores and the Total score on the UCLA
ESLPE 1 were influenced by the weighting of the Part scores. (See the
discussion of the test Parts and their weighting below.) In order to
achieve a more accurate picture of the intercorrelations, it is neces-
sary to adjust the weightings of the Part scores so that an equal num-
ber of points are allowed on each subsection of the test, or alterna-
tively to systematically eliminate the Part scores from the Total score
for purposes of correlation.

II. RE-EVALUATION OF DATA DISCUSSED IN THE 1971 PAPER


We will present the re-evaluation of the data from the 1971 paper in
three parts: (1) a more complete description of the tested population
and the rationale behind the test (in response to Breitenstein 1972),
(2) a more complete description of the test, and (3) a new look at the
Part and Total score correlations (in response to Rand 1972).
Population and Test Rationale
The UCLA ESLPE 1 was administered to about 350 students in the fall
of 1968. A sample of 102 students was selected. They were representa-
tive of about 50 different language backgrounds. About 70 percent of
them were males, and 30 percent females. Approximately 60 percent
of the students were graduates, while the remainder were under-
graduates with regular or part-time status. (See Oller 1972c for a
description of a similar population tested in the fall of 1970.)
The objective of the test is to measure English language proficiency
for placement purposes. Students who have near native speaker
proficiency are exempted from ESL courses and are allowed to enroll in
a full course load in their regular studies. Those students who have
difficulties with English are required to take one or more courses in
remedial English and may be limited to a smaller course load in their
regular course of study.
Prior to 1969, when the research reported in the 1971 paper was
carried out, the UCLA ESLPE 1 had never been subjected to the close
empirical scrutiny of any statistical analysis. It had been assumed
earlier that Part I measured skills closely associated with reading
comprehension, Part II indicated how well students could handle
English structure, Part III was a good measure of essay writing ability,
Part IV tested discrimination skills in the area of sounds, and Part V
was a good measure of spelling and listening comprehension. The
extent of overlap between the various Parts, and the meaning of the
Total score, were actually unknown. The intent of the test was to
provide a reliable and valid estimate of overall skill in English along
with diagnostic information concerning possible areas of specific
weakness.
It would not be difficult to formulate criticisms of the test as a
whole and its particular subsections independent of any statistical
analysis. This is not the concern of this paper, however. What we are
interested in are answers to the following questions. Given the several
parts of the UCLA ESLPE 1, what was the amount of overlap between
them? Was there one subtest that provided more information than the
rest? Should any one or more subtests have been replaced or done
away with? These are some of the concerns that prompted the analy-
sis presented in the 1971 paper and which, together with the observa-
tions stated earlier in this paper, motivated the computations reported
here.
Description of the Test: UCLA ESLPE 1
The UCLA ESLPE 1 consists of five parts. Part I, a Vocabulary Test of
20 items, requires the student to match a word in a story-like context
with a synonym. For example:

But the frontier fostered positive traits too . . .

FOSTERED
(A) discouraged
(B) promoted
(C) adopted

The student reads the context and then selects from (A), (B), or (C)
the one that most nearly matches the meaning of the stem word
FOSTERED.
Part II is a Grammar Test of 50 items. Each item asks the student to
select the most acceptable sentence from three choices. For instance:

(A) The boy's parents let him to play in the water.
(B) The boy's parents let him play in the water.
(C) The boy's parents let him playing in the water.
Part III is a Composition. Students were instructed:

Write a composition of 200 words, discussing ONE of the following
topics. Your ideas should be clear and well organized. When
you have finished, examine your paper carefully to be sure that
your grammar, spelling and punctuation are correct. Then count
the number of words. PLACE A LARGE X after the two hundredth
word (200). If you have written fewer than 200 words give
the exact number at the end of your composition. Choose ONE
and ONLY ONE of the following topics:
1. An interesting place to visit in my country.
2. Advances in human relations in our time.
3. A problem not yet solved by science.
4. The most popular sport in my country.
Part IV, Phonology, tests perception of English sounds. It consists
of 30 tape-recorded items. The student hears a sentence on tape. The
sentence contains one of two words that are similar phonologically,
e.g. long and wrong as in "His answer was (A) long (B) wrong." The
student has a written form of the sentence on the test paper and must
decide which of the two words was on the tape.
Part V is a Dictation. The Dictation is actually in two sections. The
two passages selected are each about 100 words in length. One is on a
topic of general interest; the other has a science-oriented focus. The
material selected for the Dictation is language of a type college-level
students are expected to encounter in their course of study. The stu-
dent is given the following instructions in writing and on tape:
The purpose of this dictation exercise is to test your aural comprehension
and spelling of English. First, listen as the instructor
reads the selection at a normal rate. Then proceed to write as the
instructor begins to read the selection a second time sentence by
sentence. Correct your work when he reads each sentence a
third time. The instructor will tell you when to punctuate.
The student then hears the dictation on tape. The text for the UCLA
ESLPE 1 follows:
(1)
There are many lessons which a new student has to learn when
he first comes to a large university. Among other things he must
adjust himself to the new environment; he must learn to be inde-
pendent and wise in managing his affairs; he must learn to get
along with many people. Above all, he should recognize with
humility that there is much to be learned and that his main job is
to grow in intellect and in spirit. But he mustn't lose sight of the
fact that education, like life, is most worthwhile when it is enjoyed.
(2)
In scientific inquiry, it becomes a matter of duty to expose a
supposed law to every kind of verification, and to take care,
moreover, that it is done intentionally. For instance, if you drop
something, it will immediately fall to the ground. That is a very
common verification of one of the best established laws of
nature, the law of gravitation. We believe it in such an extensive,
thorough, and unhesitating manner because the universal experience
of mankind verifies it. And that is the strongest foundation
on which any natural law can rest.
The scoring of Parts I, II, and IV, all of which were multiple-choice
questions, was purely objective. Each item in Part I was worth 1 point,
the whole section being worth 20 points. Items in Part II were each
worth ½ point, making the whole section worth 25 points. Part IV was
worth 15 points, with each item valued at ½ point.
Parts III and V require more explanation. Part III, the Composition,
was worth a total of 25 points, with each error subtracting ½ point.
Students who made more than 50 errors (with a maximum of 1 error per
word attempted) were given a score of 0. There were no negative scores,
i.e. if a student made 50 errors or more, he scored 0. Spelling errors
were counted along with errors in word order, grammatical form, choice
of words, and the like. If the student wrote fewer than 200 words, his
errors were pro-rated on the basis of the following proportion: Number
of words written by the student ÷ 200 words = Number of errors
made by the student ÷ X.
The variable X is the pro-rated number of errors, so the student's
pro-rated score would be 25 - ½X. For example, if he wrote 100 words
and made 10 errors, by the formula X = 20, and his score would be
25 - ½(20) = 15 points. The scoring of Part III involved a considerable
amount of subjective judgment and was probably less reliable than
the scoring of any of the other sections.
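The pro-rating rule is compact enough to state procedurally. Below is a minimal sketch in Python; the function and argument names are ours, for illustration only, and the worked example from the text above serves as a check:

```python
def composition_score(words_written: int, errors: int) -> float:
    """Part III (Composition) scoring: 25 points maximum, 1/2 point
    per error; compositions under 200 words have their error count
    pro-rated up to a 200-word equivalent."""
    if words_written < 200:
        # words_written / 200 = errors / X, so X = errors * 200 / words_written
        errors = errors * 200 / words_written
    return max(0.0, 25 - 0.5 * errors)

# Worked example from the text: 100 words and 10 errors give X = 20,
# so the score is 25 - 1/2(20) = 15 points.
assert composition_score(100, 10) == 15.0
```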
A maximum of 15 points was allowed for the Dictation. Clear errors
in spelling (e.g. shagrin for chagrin), phonology (e.g. long hair for
lawn care), grammar (e.g. it became for it becomes), or choice of
wording (e.g. humanity for mankind) counted as ¼ point subtracted
from the maximum possible score of 15 points. A maximum of ¼
point could be subtracted for multiple errors in a single word, e.g. an
extra word inserted into the text which was ungrammatical, misspelled,
and out of order would count as only one error. If the student
made 60 errors or more on the Dictation, a score of 0 was recorded.
Alternative methods of scoring are suggested by Valette (1967).
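The Dictation rule just described has the same shape as the Composition rule; a sketch under the same caveat (names ours):

```python
def dictation_score(errors: int) -> float:
    """Part V (Dictation) scoring: 15 points maximum, 1/4 point per
    error, at most one error counted per word; 60 or more errors
    floor the score at 0."""
    return max(0.0, 15 - 0.25 * errors)

assert dictation_score(60) == 0.0  # 60+ errors are recorded as 0
assert dictation_score(4) == 14.0  # four clear errors cost one point
```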
Part and Total Intercorrelations on the UCLA ESLPE 1
The surprising finding in the 1971 paper was that the Dictation corre-
lated better with each other Part of the UCLA ESLPE 1 than did
any other Part. Also, Dictation correlated at .86 with the Total score,
which was only slightly less than the correlation of .88 between the
Total and the Composition score. What these data suggested was that
the Dictation was providing more information concerning the totality
of skills being measured than any other Part of the test. In fact, it
seemed to be tapping an underlying competence in English.
The data presented in the 1971 paper, however, have been ques-
tioned by Rand (1972). As mentioned earlier, Rand (1972) correctly
observes that the weightings of Part scores will affect their correlation
with the Total score. Obviously, there is perfect correlation between
the portion of the Total score and the Part score to which it corre-
sponds. Also, differential weightings of scores will have slight effects
on Part and Total correlations even if the self-correlations are sys-
tematically eliminated. If Part scores are unevenly weighted (which
they were in the 1971 paper), the intercorrelations between Part
scores and the Total will be misleading.
One way of removing the error is to adjust the weightings of the
Part scores so that each part is worth an equal number of points
toward the Total. Table I presents the results of a re-analysis of the
data on just such a basis (see Appendix). For convenience of comparison,
the correlation data from the 1971 paper are reproduced as
Table II (see Appendix). Table II was actually based on 102 subjects,
rather than 100, as was incorrectly reported in the earlier paper. Two
errors in the data deck discovered in the re-analysis and corrected
in Table I are not corrected for Table II. It is reproduced exactly as
it was originally presented in the 1971 paper.
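The mechanics of the adjustment are easy to show. The sketch below uses numpy with randomly generated stand-in scores (not the 1971 data deck) to illustrate rescaling each Part to 25 points before computing Part-Total correlations, as in Table I:

```python
import numpy as np

rng = np.random.default_rng(1971)
n = 102  # number of subjects in the re-analysis

# Stand-in raw Part scores under the original, unequal maxima:
# Vocabulary 20, Grammar 25, Composition 25, Phonology 15, Dictation 15.
maxima = np.array([20.0, 25.0, 25.0, 15.0, 15.0])
raw = rng.uniform(0.0, 1.0, size=(n, 5)) * maxima

# Adjust the weightings: every Part rescaled to a 25-point maximum.
equal = raw / maxima * 25.0
total = equal.sum(axis=1)  # the 125-point Total of Table I

names = ["Vocabulary", "Grammar", "Composition", "Phonology", "Dictation"]
for j, name in enumerate(names):
    r = np.corrcoef(equal[:, j], total)[0, 1]
    print(f"{name:11s}  r with Total = {r:.2f}")
```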
It is noteworthy that the re-analysis (see Table I) shows a .94 cor-
relation between the adjusted Dictation score and adjusted Total,
while the correlation between Composition and Total is reduced from
.88 (Table II) to .85 (Table I). Corrections of the two errors detected
in the data cards account for the slight discrepancies in intercorrelations
between the Parts in Tables I and II.
The data indicate that the Dictation by itself could validly be substituted
for the Total (where the Total is computed by adding the
equally weighted scores on Vocabulary, Grammar, Composition,
Phonology, and Dictation).
Table III (see Appendix) presents correlations with the Total scores,
eliminating self-correlations of Parts in a step-wise fashion. In other
words, each Part is correlated with the Total computed by the sum of
scores on the remaining Parts. For example, Dictation is correlated
with the sum of Vocabulary, Grammar, Composition, and Phonology.
Here again we see clearly the superior performance of Dictation as a
measure of the composite of skills being tested.
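Table III's procedure can be sketched the same way; again the scores below are randomly generated stand-ins, not the original data. Each Part is correlated with the Total recomputed from the four remaining Parts:

```python
import numpy as np

rng = np.random.default_rng(1971)
parts = rng.uniform(0.0, 25.0, size=(102, 5))  # five equally weighted Parts

names = ["Vocabulary", "Grammar", "Composition", "Phonology", "Dictation"]
for j, name in enumerate(names):
    # Total with Part j removed, i.e. its self-correlation eliminated.
    rest = parts.sum(axis=1) - parts[:, j]
    r = np.corrcoef(parts[:, j], rest)[0, 1]
    print(f"{name:11s}  r with Total of remaining Parts = {r:.2f}")
```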
Together with the earlier research of Valette (1964, 1967), the
follow-up research of Johansson (1974), and Oller (1972a, 1972b,
1972c), the foregoing constitutes a clear refutation of the claims by
language testing experts that dictation is not a good language test (cf.
Harris 1969; Lado 1961; Somaratne 1957; Anderson 1953, as cited in the
1971 paper but not in the references to this paper).
Moreover, the high correlations achieved repeatedly between dicta-
tion and other integrative tests such as the cloze procedure (see Oller
1972b, 1972c) support a psycholinguistic basis contrary to much recent
theorizing (see TOEFL: Interpretive Manual, 1970) for interpreting
intercorrelations of tests of language proficiency. When intercorrela-
tions between diverse tests are near or above the .90 level, a psy-
cholinguistic model leads us to infer high test validity for both tests.
In a cloze test, for example, material is presented visually, whereas in
dictation, it is presented auditorily. When such vastly different tests
consistently intercorrelate at the .85 level or better (cf. Oller 1972c,
and references), we may reasonably conclude that they are tapping an
underlying competence. Since we can assume on the grounds of inde-
pendent psycholinguistic research that such an underlying com-
petence exists, we may without danger of circular reasoning argue
that the two tests cross-validate each other. Obviously this will lead
us to expect high intercorrelations between valid language tests of all
sorts. Low intercorrelations must be interpreted as indicating low test
validity, i.e. that one of the tests being correlated does not tap under-
lying linguistic competence or that it does so to an insufficient extent.

III. HOW DOES DICTATION MEASURE LANGUAGE COMPETENCE?


The complexity of taking dictation is greater than might have been
suspected before the advent of "constructivist" models of speech per-
ception and information processing (Neisser 1967; Chomsky and
Halle 1968; Cooper 1972; Stevens and House 1972; Liberman et al.
1967). The claim underlying these psycholinguistic models is that
comprehension of speech, like other perceptual activities, requires
active analysis-by-synthesis. "All of these models for perception ...
have in common a listener who actively participates in producing
speech as well as in listening to it in order that he may compare ...
[his synthesis] with the incoming [sequence]. It may be that the
comparators are the functional component of central interest...."[3] We
suggest that the comparator is no more nor less than a grammar of
expectancy. It seems that the perceiver formulates expectancies (or
hypotheses) concerning the sound stream based on his internalized
grammar of the language.[4] We refer to this process in the title of the
paper, where we suggest that dictation is a device which measures the
efficiency of grammar-based expectancies.
Neisser (1967) posits a two-stage model of cognitive processing of
speech input and other sorts of cognitive information as well. In the
case of speech perception, the listener first formulates a kind of
synthesis that is "fast, crude, wholistic, and parallel"; the second
stage of perception is a "deliberate, attentive, detailed, and sequential"
analysis. We may apply this model to the writing of a dictation,
provided that we remember there must be a rapid-fire alternation
between synthetic and analytic processes. We may assume that a
non-native speaker forms a "fast, crude ..." notion of what is being
talked about (i.e. meaning) and then analyzes in a "deliberate, attentive
... sequential" fashion in order to write down the segmented and
classified sequences that he has heard. As Chomsky and Halle (1968)
suggest in another context, "the hypothesis [or "synthesis based on
grammar generated expectancies," in our terms] will then be accepted
if it is not too radically at variance with the acoustic material."[5]
Of course, if the student's (or listener's) grammar of expectancy
is incomplete, the kinds of hypotheses that he will accept
will deviate substantially from the actual sequences of elements in
the dictation. When students convert a phrase like "scientists from
many nations" into "scientist's imaginations" and "scientist's examinations,"
an active analysis-by-synthesis is clearly apparent. On a
dictation given at UCLA not long ago, one student converted an entire
paragraph on "brain cells" into a fairly readable and phonetically
similar paragraph on "brand sales." It would be absurd to suggest that
the process of analysis-by-synthesis is only taking place when students
make errors. It is the process underlying their listening behavior
in general and is only more obvious in creative errors.
Since dictation activates the learner's internalized grammar of
expectancy, which we assume is the central component of his lan-
guage competence, it is not surprising that a dictation test yields
substantial information concerning his overall proficiency in the lan-
guage - indeed, more information than some other tests that have
been blessed with greater approval by the "experts" (see discussion
in the 1971 paper). As a testing device it "yields useful information on
errors at all levels" (Angelis 1974) and meets rigorous standards of
validity (Johansson 1974). It seems likely to be a useful instrument
for testing short-term instructional goals as well as integrated lan-
guage achievement over the long-term. There are many experimental
and practical uses which remain to be explored.

NOTES

1. The paper referred to actually appeared first in UCLA TESL Workpapers 4 (1970),
37-41. It was published subsequently in English Language Teaching 25:3 (June 1971),
254-9, and in a revised and expanded form in H. B. Allen and R. N. Campbell, eds.,
Teaching English as a Second Language: A Book of Readings, New York: McGraw-Hill,
1972, pp. 346-54.
2. On the other hand, Breitenstein's remarks also indicate two serious misunderstandings.
The first concerns the use of dictation as a test. Breitenstein suggests, "let us not
forget that in our mother tongue we can fill in gaps in what we hear up to ten times
better than in the case of a foreign language we have not yet mastered" (p. 203).
Ignoring the trivial matter of Breitenstein's arithmetic and its questionable empirical
basis, his observation does not point up a disadvantage of dictation as a testing
device, but rather a crucial advantage. It is largely the disparity between our ability to
"fill in gaps in our mother tongue" and in a "foreign language" that a dictation test
serves to reveal.
The second misunderstanding in Breitenstein's letter concerns student errors.
He says, "the mistakes are there, but are they due to the 'dictator,' the acoustics of
the room, the hearing of the candidate, or his knowledge?" (p. 203). Admittedly, bad
room acoustics or weak hearing may result in errors unique to a particular student,
but difficulties generated by the person giving the dictation will show up in the
performance of many if not all of the examinees and, contrary to what Breitenstein
implies, it is possible to identify such errors. Moreover, the purpose of the particular
dictation Breitenstein was discussing was to measure the listening comprehension of
college-level, non-native speakers of English under simulated classroom listening
conditions. To attempt perfect control of acoustic conditions and hearing acuity
would not be realistic. An important aspect of the ability to understand spoken
English is being able to do it under the constraints and difficulties afforded by a
normal classroom situation.
3. Cooper 1972, p. 42.
4. Throughout this paper we assume a pragmatic definition of grammar as discussed
by Oller (1970, 1973a) and Oller and Richards (1973). The main distinction between this
sort of definition of grammar and the early Chomskyan paradigm is our claim that
one must include semantic and pragmatic facts in the grammar. Also see Oller
(1973b). Later Chomskyan theory has begun to take steps to correct the earlier
inadequacy (Chomsky 1972).
5. As cited by Cooper 1972, p. 41.

REFERENCES

Allen, H. B. and R. N. Campbell, eds. (1972). Teaching English as a Second Language:
A Book of Readings. New York: McGraw-Hill.
Angelis, P. (1974). "Listening Comprehension and Error Analysis." In G. Nickel, ed.,
AILA Proceedings, Copenhagen 1972, Volume I: Applied Contrastive Linguistics.
Heidelberg: Julius Groos Verlag, 1-11.
Breitenstein, P. H. (1972). "Reader's Letters." English Language Teaching 26:2, 202-3.
Chomsky, N. (1972). Language and Mind. 2nd ed. New York: Harcourt Brace Jovanovich.
___ and M. Halle (1968). The Sound Pattern of English. New York: Harper and Row.
Cooper, F. (1972). "How Is Language Conveyed by Speech?" In Kavanagh and Mattingly,
eds., 25-46.
Johansson, S. (1974). "Controlled Distortion as a Language Testing Tool." In J. Qvistgaard,
H. Schwarz and H. Spang-Hanssen, eds., AILA Proceedings, Copenhagen
1972, Volume III: Applied Linguistics, Problems and Solutions. Heidelberg: Julius
Groos Verlag, 397-411.
Kavanagh, J. F. and I. G. Mattingly, eds. (1972). Language by Ear and by Eye: The
Relationships Between Speech and Reading. Cambridge, Mass.: M.I.T. Press.
Liberman, A. M., F. S. Cooper, D. P. Shankweiler and M. Studdert-Kennedy (1967).
"The Perception of the Speech Code." Psychological Review 74, 431-61.
Makkai, A., V. B. Makkai and L. Heilman, eds. (1973). Linguistics at the Crossroads:
Proceedings of the 11th International Congress of Linguists, Bologna, Italy. The
Hague: Mouton.
Neisser, U. (1967). Cognitive Psychology. New York: Appleton-Century-Crofts.
Oller, J. W., Jr. (1970). "Transformational Theory and Pragmatics." Modern Language
Journal 54:7, 504-7.
___ (1971). "Dictation as a Device for Testing Foreign Language Proficiency." English
Language Teaching 25:3, 254-9.
___ (1972a). "Assessing Competence in ESL: Reading." Paper presented at the Annual
Convention of Teachers of English to Speakers of Other Languages, Washington, D.C.
Published in TESOL Quarterly 6:4, 313-24.
___ (1972b). "Dictation as a Test of ESL Proficiency." In Allen and Campbell, eds.,
346-54.
___ (1972c). "Scoring Methods and Difficulty Levels for Cloze Tests of Proficiency
in English as a Second Language." Modern Language Journal 56:3, 151-8.
___ (1973a). "On the Relation Between Syntax, Semantics, and Pragmatics." In
Makkai, Makkai, and Heilman, eds.
___ (1973b). "Pragmatics and Language Testing." Paper presented at the First Joint
Meeting of AILA/TESOL, San Juan, Puerto Rico. Revised and expanded version in
Spolsky (1973).
___ and J. C. Richards, eds. (1973). Focus on the Learner: Pragmatic Perspectives for
the Language Teacher. Rowley, Mass.: Newbury House.
Rand, E. J. (1972). "Integrative and Discrete Point Tests at UCLA." UCLA TESL
Workpapers (June), 67-78.
Spolsky, B., ed. Current Trends in Language Testing. Forthcoming.
Stevens, K. N. and A. S. House (1972). "Speech Perception." In Wathen-Dunn, ed.,
Models for the Perception of Speech and Visual Form. Cambridge, Mass.: M.I.T.
Press.
Valette, R. M. (1964). "The Use of the Dictée in the French Language Classroom."
Modern Language Journal 48:7, 431-4.
___ (1967). Modern Language Testing: A Handbook. New York: Harcourt, Brace,
and World.
APPENDIX

Table I
Re-evaluation of Intercorrelations Between Part Scores and Total Score
on the UCLA ESLPE 1 with Adjusted (Equal) Weightings of Part Scores (n = 102)

                  Vocabulary   Grammar   Composition   Phonology   Dictation
                  (25 pts)     (25 pts)  (25 pts)      (25 pts)    (25 pts)
Total (125 pts)      .79          .76        .85           .69         .94
Vocabulary                        .57        .52           .42         .72
Grammar                                      .50           .50         .65
Composition                                                .50         .72
Phonology                                                              .57

Table II
Original Intercorrelations Between Part Scores and Total Score
on UCLA ESLPE 1 from Oller (1971), Weightings Indicated (n = 102)

                  Vocabulary   Grammar   Composition   Phonology   Dictation
                  (20 pts)     (25 pts)  (25 pts)      (15 pts)    (15 pts)
Total (100 pts)      .77          .78        .88           .69         .86
Vocabulary                        .58        .51           .45         .67
Grammar                                      .55           .50         .64
Composition                                                .53         .69
Phonology                                                              .57
Table III
Intercorrelations of Part Scores and Total on UCLA ESLPE 1: With
Self-correlations Removed and with Equal Weightings of Part Scores (n = 102)

                                   1            2            3            4           5
                               Vocabulary   Grammar    Composition  Phonology   Dictation
                               (25 pts)     (25 pts)   (25 pts)     (25 pts)    (25 pts)
Total I   (2+3+4+5 = 100 pts)     .69
Total II  (1+3+4+5 = 100 pts)                  .69
Total III (1+2+4+5 = 100 pts)                               .72
Total IV  (1+2+3+5 = 100 pts)                                           .59
Total V   (1+2+3+4 = 100 pts)                                                       .85

DISCUSSION

Davies: May I make two points? The first relates to the last point that John
Oller made about high and low correlations. It seems to me that the classical
view of this would be that in a test battery you are looking for low correlations
between tests or subtests, but high correlations between each subtest
and some kind of criterion. Clearly, if as he suggests two tests are correlating
highly with one another, this would mean that they would both be valid in
terms of the criterion, assuming that you have a criterion. It would also mean
presumably that you would only need to use one of them. Now the other
point, this business of the grammar of expectancy. I find John Oller's comments
very persuasive. Clearly, what we have is a test that is spreading people
very widely. He didn't tell us what the standard deviation was, but I
would suspect that it would be quite high, and it is essentially for this reason,
I think, that he's getting the high correlations with the other tests when he
groups them together. The dictation test is providing a rank order, which is
what one demands from a test, and it is spreading people out. Now, this is a
persuasive argument in favor of a test. Of course it isn't the ultimate one,
because the ultimate one is whether the test is valid. However, he provides
evidence for this validity in terms of the additive thing he's done with the
other subtests. But I don't understand why this has to be linked onto a gram-
mar of expectancy. It seems to me that if there is a grammar of expectancy, it
should be justified in its own terms, in grammatical terms. And I don't know
where this justification is. It seems to me that what we have is a satisfactory
test which is, if you like, a kind of work sample test. I don't understand the
connection that is being made, if I understand it rightly, on theoretical
grounds, and I don't see the need for this.
Oller: Concerning the point on correlation, I tried to start off with what I
think is a substantial departure from common testing theory of the 50s and 60s
that says that low correlations indicate that test parts are actually measuring
different skills. I don't think there's any psycholinguistic basis for that kind of
inference. That is, unless there is some obvious reason why the two skills in
question might not be related, like spelling for example. We know that native
speakers in many cases can't spell, so that the degree of facility with the
language is obviously not related to spelling. On the other hand, the inference
that grammatical proficiency or grammatical skills in terms of manipulation
of structures is not related, say, to vocabulary, seems a little less defensible.
There are a great many studies now that show that even fairly traditional
grammatical tests, provided they're "beefed up," are long enough, and contain
enough items and alternatives, intercorrelate at about the 80-85% level. I think
I have about nine tables on as many different studies in an article that appeared
in the TESOL Quarterly of 1972 illustrating that. Now, if those tests
intercorrelate at that level, you have to search for some explanation for that.
It turns out that I expected this, not on the basis of testing theory, but rather
on the basis of what I think is an understanding of the way language functions
from a linguistic point of view. If you've read my stuff, you know that I
haven't bought the Chomskyan paradigm, but rather argue for a grammar that
is pragmatically based, that relates sentences to extra-linguistic context, and
it seems to me that a crucial element in a realistic grammar underlying language
use has to involve the element of time. That's an element that I think
has been rather mistakenly left out of transformational theory until quite
recently, and now we're beginning to talk about presuppositions and the
notion of pragmatics. So much for the theoretical justification of the notion of
grammar of expectancy. If you're interested in going into it further, I would
suggest articles by Robert Woods of Harvard University, who is developing
some computer simulation models for grammars that meet the criteria of what
I call the grammar of expectancy. On the other comment, you asked about the
spread on a dictation test. Typically, on a 50-point dictation I think that
the usual standard deviation is about 15 points. Compare that against a standard
deviation of probably 8 or 9 points on grammar tests of the sort I have
described. So it's about twice as much as on other tests, and you're getting
just that much more information, apparently, out of the dictation.
Petersen: Are you assuming here that test variance increases proportionately
to length, and is that the way you weighted these?
Oller: No. There is a tendency for test variance to increase somewhat according
to length, but probably not in a linear way. However, I don't mean to
suggest that it is proportionate. What I am suggesting is that typically the
variance, that is the amount of spread, the tendency of a test to spread people
out on a scale, is higher for dictation than it is for more traditional tests. What
that means in terms of reliability is that you can have a shorter dictation and
get the same reliability as you would get with the much longer discrete point
grammar test.
Petersen: There's one thing I'm wondering about here in terms of your part-
whole correlations with the total. Why didn't you standardize within your
subtests first instead of multiplying by a constant? It would seem to me to
be a much better procedure just to convert your subtotals to standard scores
before running the correlation.
Oller: I suppose statistically that would have been a more sensible way of
doing it. I wanted the comparison with the 1971 study to be as straightforward
as possible, and frankly when I ran those statistics I really wasn't aware of
that statistical error. But even when it's corrected, it supports the notion. I
guess my defense there would be that I'm not relying primarily on the statis-
tics but rather on the psycholinguistic argument which seems to explain the
statistics. The statistics, after all, are quite reliable. They've been repeated
many times now in a great many different studies. Dick Tucker got similar
results in comparing a cloze test, for example, with the American University
of Beirut's test of English language proficiency. They have 96% reliability
on practically every form that they've generated.
Frey: The time required for scoring these two tests is incredibly different,
because in one case it's an objective test where time is hardly a factor, and in
the other case it's a dictation where you have to hand score it, I assume.
Oller: It's a little harder to score dictation than it is to score an objectively
constructed vocabulary test. On the other hand, it's a whole lot harder to con-
struct a good multiple-choice vocabulary test than it is to construct a good
dictation. So I think that the two factors tend to balance out and the advan-
tages gained in validity on the side of dictation tend to vie for that. I think
that's part of the motivation behind Gradman and Spolsky's research. In
investigating possible multiple-choice formats for dictation, it is perhaps pos-
sible that you can objectivize the technique. This has been done very effec-
tively with reading comprehension tests. Just because a test is multiple-
choice doesn't necessarily mean that it has to be based on naive discrete point
testing philosophy. A paraphrase matching task, for example, seems to work
rather well as an estimation of reading comprehension, and it can be done
in a multiple-choice format. The only trouble with that for classroom pur-
poses or for unsophisticated test researchers is that it's awfully easy to make
a very bad multiple-choice test. And it needs pre-testing; it needs some statis-
tics done on it; you need to run item facility and item discrimination indices,
and to make deletions and changes, rewrites; and you can't do that in a class-
room situation. So there are serious disadvantages on the side of the multiple-
choice test as well.
Sako: I wonder if you could explain how dictation measures language com-
petence?
Oller: Dictation invokes the learner's internalized grammar of the language,
and if that grammar is incomplete, it will be reflected in the score on the
dictation. If it's more complete, that too will be reflected in the score. You can
show evidence for that by virtue of the fact that native speakers nearly always
score 100% on dictations, or at least the ones we've investigated, and
non-native speakers tend to vary according to their proficiency. So I think
that it's an indication of an internalized competence on the part of the
learner.
Clark: Just a technical question. I think we're concerned about the practicality
of our testing instruments, and certainly when large volumes of students
are involved we want to devise a procedure which can be very efficiently
used. It's always impressed me that a typical dictation has quite a lot
of dead material in it, in the sense that the student is rather easily able to,
let's say, do half of the sentence, and it's the second half of the sentence
where the problem comes. If this is the case, would it be possible to think of
some format where the student is not required to write out the entire passage,
but only to write a certain portion of the passage, let's say when a light comes
on at a critical moment? I think this might objectify the administration and
scoring process quite a bit.
Oller: I don't think there's any dead material in a dictation. A person can
make errors at any point whatsoever, and he makes all kinds of creative
errors, for example, the ocean and its waves instead of the ocean and its
ways. We had one student at UCLA who converted an entire passage on
"brain cells" into fairly readable prose on "brand sales." The fact is that
listening comprehension as exhibited by taking dictation is a highly creative
process, and it's creative in much the same way that speech production is.
Stig Johansson did do the kind of thing that you've suggested. He deleted the
last half of a sentence, and in spite of the fact that both Spolsky and I disagree
with some of his inferences about the noise test, he did show that the
deletion of the last half of the sentence works just about as well, and seems
to have similar properties as a test, as does the straight dictation. And I think
that's a perfectly viable way of getting data.
Spolsky: Dictation tests with all their theoretical justification are in practice
likely to be as limited as FSI tests in their necessary relevance; that is, they
suit a particular kind of language learner. The FSI test is only a direct measure
specifically for people who are going to engage in fairly limited kinds of
conversational language use in particular domains. This was defined nicely
by the statement that only standard dialects are acceptable, which is a very
good way of limiting the range, in the same way that the dictation test tends to
be limited to a literate subject, and therefore it's likely to be most useful in
dealing with college students. That raises another interesting point, namely
that the research with these tests is done with specific populations. The work
with the noise test and the dictation test has been done largely with foreign
students studying in the United States. The basic research work of the FSI
interview has been done specifically with government employees. There's
always the danger that we might do the same sort of thing as psychologists
have done for so long when they assume that, since all their experiments
were performed with rats or college freshmen, they are able to make generalizations
from that to other kinds of animals and to other kinds of human
beings.
Oller: I don't think that's necessarily true; Stig Johansson did some work with
a modified cloze dictation type of procedure in Lund, Sweden with university
level students. His research was more or less replicated with a dictation design,
and similar results were achieved with a population of elementary
children in Sweden. Spolsky's and Gradman's research in many ways produces
similar results to those found at UCLA with a population of foreign
students from a tremendous variety of backgrounds, and was comparable to
the results found by Tucker in the Middle East. They were also very similar
to results of other studies I have seen. That is, these tests seem to have certain
remarkably stable properties; they tend to be robust, resistant to level of
language differences. They seem to produce a very high level of variance
and to spread people out rather widely on a scale, and there seems to be a
comparability between tests of widely divergent sorts; that is, cloze tests are
quite different from dictations, and yet the results are similar. So I think that
all of these factors taken together seem to suggest that there's something
rather fundamental that is similar about language processing in these various
superficially different modes, and these tests seem to be capable of revealing
that. I think that you can produce a sociolinguistically significant difference
in performance in tests. Quaker showed that in New Mexico with elementary
school children. She presented an oral cloze test and found she could discriminate
between four major ethnic groups: Spanish-speaking, native Americans,
Blacks and Anglos. She found significant differences between each of these
groups, but I think if she looked at the characteristics of the test, she would
find that the test was performing similarly across the groups, in spite of the
slight but significant sociocultural variances. So I would suggest that the
sociocultural variable is probably a significant but small factor, and the bulk
of the variance is attributed to certain rather robust properties of the tests.
Davies: I'm interested in how the dictation text was selected. How do you
sample for your dictation?
Oller: In this particular study the dictation was selected on the basis of Lois
McIntosh's intuition. She assumed that there was a difference between humanities
and sciences that might be revealed in dictation. Therefore, to counterbalance
for that, she selected a passage from the sciences and a passage
from the humanities. I did a little research on that later with some 359 incoming
students. I selected a passage from an elementary level text that was horrendously
simple. Then another passage was selected from the Jean Praninskas
grammar review. It was a slightly higher level of language; there were
more complicated sentences and so forth. Then another passage was selected
from a reader of college level literary pieces, a much more complicated
passage. Those three dictations were all given to random samples of the same
population. The performance on the three dictations in terms of correlation
with other external validating criteria was almost the same in spite of the
widely divergent levels of difficulty. So again, the evidence suggests that it
doesn't make a whole lot of difference whether you take a fairly hard passage,
a fairly easy one or one somewhere in the middle. The test seems to
perform similarly, and the correlations you get with external validating criteria
are similar. The Praninskas passage and the other passage correlated almost
identically with each of the other external criteria.
Davies: But doesn't this depend on the level of your student? I mean, if you
take too easy a passage with an advanced group, your mean score goes right
up.
Oller: You're back to the dead data in dictation. If a fairly advanced student
wouldn't make any errors, granted he'll be off the scale. But the fact is that
fairly advanced students make errors in fairly simple passages, otherwise you
wouldn't get that kind of correlation.
Scott: I'd like to know whether the students whom you tested were taught
dictation as a teaching technique and whether or not they were therefore
accustomed to taking dictation. It's been my experience that a student who's
been taught dictation can do much better on a dictation test.
Oller: Rebecca Valette seemed to think that it made a difference in her study.
All I know is that in the several studies that were done at UCLA by Kern in
33 classes of beginning, intermediate and advanced classes of English, there
seemed to be no practice effect. That is, you give people dictations 15 or 20
times during the course of a quarter and they don't seem to do much better
at the end of the quarter than they did at the beginning. The test seems to
resist the practice effect. Perhaps one way of attacking that would be to
replace the dictation with passages that were demonstrably similar in difficulty
level, in terms of mean scores of similar populations. And then do a
pre- and post-test evaluation and see if people improved. There might be a
slightly significant improvement, but it's again going to be a very small part
of the total variance unless current research at UCLA was all wrong.
Scott: Is the underlying premise for the use of dictation tests that natives tend
to perform perfectly and near natives or non-natives tend to perform at some
point down the scale from that? If that's taken as a basic premise, then I think
there may be some problems with trying to apply that to learners at the op-
posite end of the scale, namely beginning college learners of a second lan-
guage.
Oller: This is part of the argument in favor of the validity of the test. One of
the surprising things that you find in discrete-point tests of various sorts is
that if they're sufficiently cleverly constructed, as some discrete-point tests
are, non-native speakers will do slightly better than natives. That is probably
because of a tendency to emphasize certain kinds of things that people con-
centrate on in classroom situations. What we're doing is teaching a sort of
artificial classroomese instead of teaching the language. So I think that it's
very important to test the examination against native speaker performance.
Surprisingly, tests that have been in existence for some time now have not
consistently used that technique to test their own validity. For example, that
is not a standard procedure in the development of the TOEFL exam. I think it
should be. I think it ought to be for exams at all of our institutions where
we're trying to measure language proficiency. If the native speaker can't do
it, then it's not a language test, it's something else.
Cartier: John, you talked about how the selections were chosen, and I'm
inclined to believe that experienced people can, in fact, rank prose by difficulty
without the use of the Flesch scale, although Flesch or Lorge or Dale or
Chall might be useful in this respect. But I don't think we ought to slide over
the point quite so lightly as you seem to. I doubt very much whether you're
presenting the whole range, because I can think of a paragraph from Korzybski's
Science and Sanity, for example, and that's part of the language too.
And to use your intuition only is to introduce the bias that you're going to
give the students something that you think they can cope with. I don't think
you're really justified in saying that the difficulty level doesn't seem to be
all that important.
Oller: I think you're probably right. If you really stretched it to its limits and
presented something like e. e. Cummings' "up so many floating bells down
anyone lived in a pretty how town," or something like that, then you're out
of the norms of language usage. If you present John Dewey's prose, I think
people would have a little more trouble with that than they would with
Walter Cronkite. The point is that people are able to make fairly good sub-
jective judgments. Language teachers can judge pretty well what level is ap-
propriate to the students that they're teaching. If you're trying to find out
if these people can succeed in a college-level course of study, then the ob-
vious material is college-level text and lecture material, the kind of thing
they're going to have to deal with in the classroom. Here you come back
to Spolsky's point. If you've got a different kind of task, if they're going to
have to drive tanks in battle, then the sociolinguistic variables would dictate
a different kind of level, a different kind of a task perhaps. I think that peo-
ple can make pretty good subjective judgments, though, about levels. Much
better than we've thought. Much better than the Dale and Chall and the other
formulas that are available. Our own subjective judgments are usually super-
ior to those kinds of evaluations.
