Dictation: A Test of Grammar Based Expectancies
1. DICTATION REVISITED
Since the publication of "Dictation as a Device for Testing Foreign
Language Proficiency" in English Language Teaching (henceforth
referred to as the 1971 paper),¹ the utility of dictation for testing has
been demonstrated repeatedly. It is an excellent measure of overall
language proficiency (Johansson 1974; Oller 1972a, 1972b) and has
proved useful as an elicitation technique for diagnostic data (Angelis
1974). Although some of the discussion concerning the validity of
dictation has been skeptical (Rand 1972; Breitenstein 1972), careful
research increasingly supports confidence in the technique.
The purpose of this paper is to present a re-evaluation of the 1971
paper. Its data showed that Dictation scores on the UCLA English as
a Second Language Placement Examination (UCLA ESLPE 1) corre-
lated more highly with Total test scores and with other Part scores
than did any other Part of the ESLPE. The re-evaluation was prompt-
ed by useful critiques (Rand 1972; Breitenstein 1972). An error in
the computation of correlations between Part (subtest) scores and
Total scores in that analysis is corrected; additional information con-
cerning test rationale, administration, scoring, and interpretation is
provided; and finally, a more comprehensive theoretical explanation
is offered to account for the utility of dictation as a measure of lan-
guage proficiency.
In a Reader's Letter, Breitenstein (1972) commented that many
factors which enter into the process of giving and taking dictation
were not mentioned in the 1971 paper. For example, there is "the
eyesight of the reader" (or the "dictator," as Breitenstein terms him),
the condition of his eyeglasses (which "may be dirty or due for re-
newal"), "the speaker's diction" (possibly affected by "speech defects"), and so on.
*We wish to thank Professor Lois McIntosh (UCLA) for providing us with a detailed
description of the test given in the fall of 1968. It is Professor McIntosh, whose
teaching skill and experience supported confidence in dictation, who is at base respon-
sible not only for this paper but for a number of others on the topic. We gratefully ac-
knowledge our indebtedness to her. Without her insight into the testing of language
skills, the facts discussed here, which were originally uncovered more or less by acci-
dent in a routine analysis, might have gone unnoticed for another 20 years of discrete-
point testing.
Prior to 1969, when the research reported in the 1971 paper was
carried out, the UCLA ESLPE 1 had never been subjected to the close
empirical scrutiny of any statistical analysis. It had been assumed
earlier that Part I measured skills closely associated with reading
comprehension, Part II indicated how well students could handle
English structure, Part III was a good measure of essay writing ability,
Part IV tested discrimination skills in the area of sounds, and Part V
was a good measure of spelling and listening comprehension. The
extent of overlap between the various Parts, and the meaning of the
Total score, were actually unknown. The intent of the test was to
provide a reliable and valid estimate of overall skill in English along
with diagnostic information concerning possible areas of specific
weakness.
It would not be difficult to formulate criticisms of the test as a
whole and of its particular subsections independent of any statistical
analysis. This is not the concern of this paper, however. What we are
interested in are answers to the following questions. Given the several
parts of the UCLA ESLPE 1, what was the amount of overlap between
them? Was there one subtest that provided more information than the
rest? Should any one or more subtests have been replaced or done
away with? These are some of the concerns that prompted the analy-
sis presented in the 1971 paper and which, together with the observa-
tions stated earlier in this paper, motivated the computations reported
here.
Description of the Test: UCLA ESLPE 1
The UCLA ESLPE 1 consists of five parts. Part I, a Vocabulary Test of
20 items, requires the student to match a word in a story-like context
with a synonym. For example:
But the frontier fostered positive traits too . . .        FOSTERED
(A) discouraged
(B) promoted
(C) adopted
The student reads the context and then selects from (A), (B), or (C)
the one that most nearly matches the meaning of the stem word
FOSTERED.
Part II is a Grammar Test of 50 items. Each item asks the student to
select the most acceptable sentence from three choices. For instance:
(A) The boy's parents let him to play in the water.
(B) The boy's parents let him play in the water.
(C) The boy's parents let him playing in the water.
Part III is a Composition. Students were instructed:
(2)
In scientific inquiry, it becomes a matter of duty to expose a
supposed law to every kind of verification, and to take care,
moreover, that it is done intentionally. For instance, if you drop
something, it will immediately fall to the ground. That is a very
common verification of one of the best established laws of na-
ture - the law of gravitation. We believe it in such an extensive,
thorough, and unhesitating manner because the universal experi-
ence of mankind verifies it. And that is the strongest foundation
on which any natural law can rest.
The scoring of Parts I, II, and IV, all of which consisted of multiple-choice ques-
tions, was purely objective. Each item in Part I was worth 1 point,
the whole section being worth 20 points. Items in Part II were each
worth ½ point, making the whole section worth 25 points. Part IV was
worth 15 points, with each item valued at ½ point.
Parts III and V require more explanation. Part III, the Composition, was worth a total
of 25 points with each error subtracting ½ point. Students who made
more than 50 errors (with a maximum of 1 error per word attempted)
were given a score of 0. There were no negative scores; i.e. if a stu-
dent made 50 errors or more, he scored 0. Spelling errors were
counted along with errors in word order, grammatical form, choice of
words, and the like. If the student wrote fewer than 200 words, his
errors were pro-rated on the basis of the following proportion: number
of words written by the student ÷ 200 words = number of errors
made by the student ÷ X.
The variable X is the pro-rated number of errors, so the student's
pro-rated score would be 25 - (½)X. For example, if he wrote 100 words
and made 10 errors, by the formula X = 20, and his score would be
25 - ½(20) = 15 points. The scoring of Part III involved a considerable
amount of subjective judgment and was probably less reliable than
the scoring of any of the other sections.
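The pro-rating rule can be summarized in a short routine (a minimal sketch in Python; the function name and layout are illustrative, while the ½-point deduction, the 50-error cap, and the 200-word basis follow the description above):

    def composition_score(words_written, errors_made, max_points=25.0):
        # Each error costs 1/2 point; when fewer than 200 words are written,
        # errors are pro-rated to a 200-word basis; no score falls below zero
        # (so 50 or more pro-rated errors yield a score of 0).
        if 0 < words_written < 200:
            # words_written / 200 = errors_made / X
            errors = errors_made * 200.0 / words_written
        else:
            errors = float(errors_made)
        return max(max_points - 0.5 * errors, 0.0)

    # Worked example from the text: 100 words, 10 errors -> X = 20 -> 25 - 10 = 15
    print(composition_score(100, 10))   # 15.0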
A maximum of 15 points was allowed for the Dictation. Clear errors
in spelling (e.g. shagrin for chagrin), phonology (e.g. long hair for
lawn care), grammar (e.g. it became for it becomes), or choice of
wording (e.g. humanity for mankind) counted as ¼ point subtracted
from the maximum possible score of 15 points. A maximum of ¼
point could be subtracted for multiple errors in a single word; e.g. an
extra word inserted into the text which was ungrammatical, mis-
spelled, and out of order would count as only one error. If the student
made 60 errors or more on the Dictation, a score of 0 was recorded.
Alternative methods of scoring are suggested by Valette (1967).
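The Dictation rule can be sketched in the same way (again a minimal illustration; the ¼-point deduction per error is implied by the 15-point maximum and the 60-error cutoff noted above):

    def dictation_score(errors, max_points=15.0, per_error=0.25):
        # Each clear error in spelling, phonology, grammar, or wording
        # costs 1/4 point; at most one error is counted per word;
        # 60 or more errors yield a score of zero.
        if errors >= 60:
            return 0.0
        return max(max_points - per_error * errors, 0.0)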
Part and Total Intercorrelations on the UCLA ESLPE 1
The surprising finding in the 1971 paper was that the Dictation corre-
lated better with each other Part of the UCLA ESLPE 1 than did
any other Part. Also, Dictation correlated at .86 with the Total score,
which was only slightly less than the correlation of .88 between the
Total and the Composition score. What these data suggested was that
the Dictation was providing more information concerning the totality
of skills being measured than any other Part of the test. In fact, it
seemed to be tapping an underlying competence in English.
The data presented in the 1971 paper, however, have been ques-
tioned by Rand (1972). As mentioned earlier, Rand (1972) correctly
observes that the weightings of Part scores will affect their correlation
with the Total score. Obviously, each Part score correlates perfectly
with the portion of the Total score to which it corresponds.
Also, differential weightings of scores will have slight effects
on Part and Total correlations even if the self-correlations are sys-
tematically eliminated. If Part scores are unevenly weighted (which
they were in the 1971 paper), the intercorrelations between Part
scores and the Total will be misleading.
One way of removing the error is to adjust the weightings of the
Part scores so that each part is worth an equal number of points
toward the Total. Table I presents the results of a re-analysis of the
data on just such a basis (see Appendix). For convenience of com-
parison, the correlation data from the 1971 paper are reproduced as
Table II (see Appendix). Table II was actually based on 102 subjects,
rather than 100, as was incorrectly reported in the earlier paper. Two
errors in the data deck discovered in the re-analysis and corrected
in Table I are not corrected for Table II. It is reproduced exactly as
it was originally presented in the 1971 paper.
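The adjustment itself is simple: each Part's raw score is rescaled so that every Part contributes the same maximum (25 points) to the Total before the correlations are recomputed. A minimal sketch, assuming the original Part maxima given above and scores held in a plain dictionary (the names and data layout are illustrative):

    RAW_MAX = {"Vocabulary": 20, "Grammar": 25, "Composition": 25,
               "Phonology": 15, "Dictation": 15}

    def equal_weight(raw_scores, target_max=25.0):
        # Rescale each Part score so that every Part is worth target_max points.
        return {part: raw_scores[part] * target_max / RAW_MAX[part]
                for part in RAW_MAX}

    def adjusted_total(raw_scores):
        # Total formed from the equally weighted Part scores.
        return sum(equal_weight(raw_scores).values())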
It is noteworthy that the re-analysis (see Table I) shows a .94 cor-
relation between the adjusted Dictation score and adjusted Total,
while the correlation between Composition and Total is reduced from
.88 (Table II) to .85 (Table I). Corrections of the two errors detected
in the data cards account for the slight discrepancies in intercorrela-
tions between the Parts in Tables I and II.
The data indicate that the Dictation by itself could validly be sub-
stituted for the Total (where the Total is computed by adding the
equally weighted scores on Vocabulary, Grammar, Composition,
Phonology, and Dictation).
Table III (see Appendix) presents correlations with the Total scores,
eliminating self-correlations of Parts in a step-wise fashion. In other
words, each Part is correlated with the Total computed by the sum of
scores on the remaining Parts. For example, Dictation is correlated
with the sum of Vocabulary, Grammar, Composition, and Phonology.
Here again we see clearly the superior performance of Dictation as a
measure of the composite of skills being tested.
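In computational terms, the step-wise procedure amounts to correlating each Part with the sum of the remaining Parts. A minimal sketch (Python with numpy; the data layout is assumed, one array of equally weighted scores per Part):

    import numpy as np

    def part_vs_rest_correlations(parts):
        # parts: dict mapping Part name -> array of equally weighted scores,
        # one entry per examinee. For each Part, correlate its scores with
        # the Total formed by summing the remaining Parts (as in Table III).
        names = list(parts)
        out = {}
        for name in names:
            rest = sum(np.asarray(parts[n], float) for n in names if n != name)
            out[name] = np.corrcoef(np.asarray(parts[name], float), rest)[0, 1]
        return out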
NOTES
1. The paper referred to actually appeared first in UCLA TESL Workpapers 4 (1970),
37-41. It was published subsequently in English Language Teaching 25:3 (June 1971),
254-9, and in a revised and expanded form in H. B. Allen and R. N. Campbell, eds.,
REFERENCES
APPENDIX
Table I
Re-evaluation of Intercorrelations Between
Part Scores and Total Score on the UCLA ESLPE 1
with Adjusted (Equal) Weightings of Part Scores (n = 102)
Phonology .57
Table II
Original Intercorrelations Between Part Scores and Total Score
on UCLA ESLPE 1 from Oller (1971), Weightings Indicated
(n = 102)
Phonology .57
Table III
Intercorrelations of Part Scores and Total on UCLA ESLPE 1:
with Self-correlations Removed and with Equal Weightings of
Part Scores (n = 102)

Parts (25 pts each): 1 Vocabulary, 2 Grammar, 3 Composition, 4 Phonology, 5 Dictation

Total I   (2 + 3 + 4 + 5 = 100 pts)   .69
Total II  (1 + 3 + 4 + 5 = 100 pts)   .69
Total III (1 + 2 + 4 + 5 = 100 pts)   .72
Total IV  (1 + 2 + 3 + 5 = 100 pts)   .59
Total V   (1 + 2 + 3 + 4 = 100 pts)   .85

(Each value is the correlation between the omitted Part and the corresponding Total.)
DISCUSSION
Davies: May I make two points? The first relates to the last point that John
Oller made about high and low correlations. It seems to me that the classical
view of this would be that in a test battery you are looking for low correla-
tions between tests or subtests, but high correlations between each subtest
and some kind of criterion. Clearly, if as he suggests two tests are correlating
highly with one another, this would mean that they would both be valid in
terms of the criterion, assuming that you have a criterion. It would also mean
presumably that you would only need to use one of them. Now the other
point, this business of the grammar of expectancy. I find John Oller's com-
ments very persuasive. Clearly, what we have is a test that is spreading peo-
ple very widely. He didn't tell us what the standard deviation was, but I
would suspect that it would be quite high, and it is essentially for this reason.
I think, that he's getting the high correlations with the other tests when he
groups them together. The dictation test is providing a rank order, which is
what one demands from a test, and it is spreading people out. Now, this is a
persuasive argument in favor of a test. Of course it isn't the ultimate one, be-
cause the ultimate one is whether the test is valid. However, he provides
evidence for this validity in terms of the additive thing he's done with the
other subtests. But I don't understand why this has to be linked onto a gram-
mar of expectancy. It seems to me that if there is a grammar of expectancy, it
should be justified in its own terms, in grammatical terms. And I don't know
where this justification is. It seems to me that what we have is a satisfactory
test which is, if you like, a kind of work sample test. I don't understand the
connection that is being made, if I understand it rightly, on theoretical
grounds, and I don't see the need for this.
Oller: Concerning the point on correlation, I tried to start off with what I
think is a substantial departure from common testing theory of the 50s and 60s
that says that low correlations indicate that test parts are actually measuring
different skills. I don't think there's any psycholinguistic basis for that kind of
inference. That is, unless there is some obvious reason why the two skills in
question might not be related, like spelling for example. We know that native
speakers in many cases can't spell, so that the degree of facility with the
language is obviously not related to spelling. On the other hand, the inference
that grammatical proficiency or grammatical skills in terms of manipulation
of structures is not related, say to vocabulary, seems a little less defensible.
There are a great many studies now that show that even fairly traditional
grammatical tests, provided they're "beefed up," are long enough, and contain
enough items and alternatives, intercorrelate at about the 80-85% level. I think
I have about nine tables on as many different studies in an article that ap-
peared in the TESOL Quarterly of 1972 illustrating that. Now, if those tests
intercorrelate at that level, you have to search for some explanation for that.
It turns out that I expected this, not on the basis of testing theory, but rather
on the basis of what I think is an understanding of the way language func-
tions from a linguistic point of view. If you've read my stuff, you know that I
haven't bought the Chomskyan paradigm, but rather argue for a grammar that
is pragmatically based, that relates sentences to extra-linguistic context, and
it seems to me that a crucial element in a realistic grammar underlying lan-
guage use has to involve the element of time. That's an element that I think
has been rather mistakenly left out of transformational theory until quite
recently, and now we're beginning to talk about presuppositions and the
notion of pragmatics. So much for the theoretical justification of the notion of
grammar of expectancy. If you're interested in going into it further, I would
suggest articles by Robert Woods of Harvard University, who is developing
some computer simulation models for grammars that meet the criteria of what
I call the grammar of expectancy. On the other comment, you asked about the
spread on a dictation test. Typically, on a 50-point dictation I think that
the usual standard deviation is about 15 points. Compare that against a stand-
ard deviation of probably 8 or 9 points on grammar tests of the sort I have
described. So it's about twice as much as on other tests, and you're getting
just that much more information, apparently, out of the dictation.
Petersen: Are you assuming here that test variance increases proportionate
and to make deletions and changes, rewrites; and you can't do that in a class-
room situation. So there are serious disadvantages on the side of the multiple-
choice test as well.
Sako: I wonder if you could explain how dictation measures language com-
petence?
Oller: Dictation invokes the learner's internalized grammar of the language,
and if that grammar is incomplete, it will be reflected in the score on the dic-
tation. If it's more complete, that too will be reflected in the score. You can
show evidence for that by virtue of the fact that native speakers nearly al-
ways score 100% on dictations, or at least the ones we've investigated, and
non-native speakers tend to vary according to their proficiency. So I think
that it's an indication of an internalized competence on the part of the
learner.
Clark: Just a technical question. I think we're concerned about the prac-
ticality of our testing instruments, and certainly when large volumes of stu-
dents are involved we want to devise a procedure which can be very effi-
ciently used. It's always impressed me that a typical dictation has quite a lot
of dead material in it in the sense that the student is rather easily able to,
let's say, do half of the sentence, and it's the second half of the sentence
where the problem comes. If this is the case, would it be possible to think of
some format where the student is not required to write out the entire passage,
but only to write a certain portion of the passage, let's say when a light comes
on at a critical moment? I think this might objectify the administration and
scoring process quite a bit.
Oller: I don't think there's any dead material in a dictation. A person can
make errors at any point whatsoever, and he makes all kinds of creative
errors, for example, the ocean and its waves instead of the ocean and its
ways. We had one student at UCLA who converted an entire passage on
"brain cells" into fairly readable prose on " brand sales. " The fact is that
listening comprehension as exhibited by taking dictation is a highly creative
process, and it's creative in much the same way that speech production is.
Stig Johansson did do the kind of thing that you've suggested. He deleted the
last half of a sentence, and in spite of the fact that both Spolsky and I dis-
agree with some of his inferences about the noise test, he did show that the
deletion of the last half of the sentence works just about as well, and seems
to have similar properties as a test, as does the straight dictation. And I think
that's a perfectly viable way of getting data.
Spolsky: Dictation tests with all their theoretical justification in practice are
likely to be as limited as FSI tests in their necessary relevance, that is, they
suit a particular kind of language learner. The FSI test is only a direct meas-
ure specifically for people who are going to engage in fairly limited kinds of
conversational language use in particular domains. This was defined nicely
by the statement that only standard dialects are acceptable, which is a very
good way of limiting the range in the same way that the dictation test tends to
from the humanities. I did a little research on that later with some 359 incom-
ing students. I selected a passage from an elementary level text that was hor-
rendously simple. Then another passage was selected from the Jean Pranins-
kas grammar review. It was a slightly higher level of language; there were
more complicated sentences and so forth. Then another passage was selected
from a reader of college-level literary pieces, a much more complicated
passage. Those three dictations were all given to random samples of the same
population. The performance on the three dictations in terms of correlation
with other external validating criteria was almost the same in spite of the
widely divergent levels of difficulty. So again, the evidence suggests that it
doesn't make a whole lot of difference whether you take a fairly hard pas-
sage, a fairly easy one or one somewhere in the middle. The test seems to
perform similarly, and the correlations you get with external validating cri-
teria are similar. The Praninskas passage and the other passage correlated al-
most identically with each of the other external criteria.
Davies: But doesn't this depend on the level of your student? I mean, if you
take too easy a passage with an advanced group, your mean score goes right
up.
Oller: You're back to the dead data in dictation. If a fairly advanced student
wouldn't make any errors, granted he'll be off the scale. But the fact is that
fairly advanced students make errors in fairly simple passages, otherwise you
wouldn't get that kind of correlation.
Scott: I'd like to know whether the students whom you tested were taught
dictation as a teaching technique and whether or not they were therefore
accustomed to taking dictation. It's been my experience that a student who's
been taught dictation can do much better on a dictation test.
Oller: Rebecca Valette seemed to think that it made a difference in her study.
All I know is that in the several studies that were done at UCLA by Kern in
33 classes of beginning, intermediate and advanced classes of English, there
seemed to be no practice effect. That is, you give people dictations 15 or 20
times during the course of a quarter and they don't seem to do much better
at the end of the quarter than they did at the beginning. The test seems to
resist the practice effect. Perhaps one way of attacking that would be to
replace the dictation with passages that were demonstrably similar in dif-
ficulty level, in terms of mean scores of similar populations. And then do a
pre- and post-test evaluation and see if people improved. There might be a
slightly significant improvement, but it's again going to be a very small part
of the total variance unless current research at UCLA was all wrong.
Scott: Is the underlying premise for the use of dictation tests that natives tend
to perform perfectly and near natives or non-natives tend to perform at some
point down the scale from that? If that's taken as a basic premise, then I think
there may be some problems with trying to apply that to learners at the op-
posite end of the scale, namely beginning college learners of a second lan-
guage.
Oller: This is part of the argument in favor of the validity of the test. One of
the surprising things that you find in discrete-point tests of various sorts is
that if they're sufficiently cleverly constructed, as some discrete-point tests
are, non-native speakers will do slightly better than natives. That is probably
because of a tendency to emphasize certain kinds of things that people con-
centrate on in classroom situations. What we're doing is teaching a sort of
artificial classroomese instead of teaching the language. So I think that it's
very important to test the examination against native speaker performance.
Surprisingly, tests that have been in existence for some time now have not
consistently used that technique to test their own validity. For example, that
is not a standard procedure in the development of the TOEFL exam. I think it
should be. I think it ought to be for exams at all of our institutions where
we're trying to measure language proficiency. If the native speaker can't do
it, then it's not a language test, it's something else.
Cartier: John, you talked about how the selections were chosen, and I'm
inclined to believe that experienced people can, in fact, rank prose by dif-
ficulty without the use of the Flesch scale, although Flesch or Lorge or Dale or
Chall might be useful in this respect. But I don't think we ought to slide over
the point quite so lightly as you seem to. I doubt very much whether you're
presenting the whole range, because I can think of a paragraph from Korzyb-
ski's Science and Sanity, for example, and that's part of the language too.
And to use your intuition only is to introduce the bias that you're going to
give the students something that you think they can cope with. I don't think
you're really justified in saying that the difficulty level doesn't seem to be
all that important.
Oller: I think you're probably right. If you really stretched it to its limits and
presented something like e. e. Cummings' "up so many floating bells down
anyone lived in a pretty how town," or something like that, then you're out
of the norms of language usage. If you present John Dewey's prose, I think
people would have a little more trouble with that than they would with
Walter Cronkite. The point is that people are able to make fairly good sub-
jective judgments. Language teachers can judge pretty well what level is ap-
propriate to the students that they're teaching. If you're trying to find out
if these people can succeed in a college-level course of study, then the ob-
vious material is college-level text and lecture material, the kind of thing
they're going to have to deal with in the classroom. Here you come back
to Spolsky's point. If you've got a different kind of task, if they're going to
have to drive tanks in battle, then the sociolinguistic variables would dictate
a different kind of level, a different kind of a task perhaps. I think that peo-
ple can make pretty good subjective judgments, though, about levels. Much
better than we've thought. Much better than the Dale and Chall and the other
formulas that are available. Our own subjective judgments are usually super-
ior to those kinds of evaluations.