A framework for second language vocabulary assessment
John Read, Victoria University of Wellington
Carol A. Chapelle, Iowa State University
Vocabulary tests are used for a wide range of instructional and research purposes,
but we lack a comprehensive basis for evaluating the current instruments or
developing new lexical measures for the future. This article presents a framework
that takes as its starting point an analysis of test purpose and then shows how
purpose can be systematically related to test design. The link between the two is
based on three considerations which derive from Messick’s (1989) validation
theory: construct definition, performance summary and reporting, and test presen-
tation. The components of the framework are illustrated throughout by reference
to eight well-known vocabulary measures; for each one there is a description of
its design and an analysis of its purpose. It is argued that the way forward for
vocabulary assessment is to take account of test purposes in the design and vali-
dation of tests, as well as considering an interactionalist approach to construct
definition. This means that a vocabulary test should require learners to perform
tasks under contextual constraints that are relevant to the inferences to be made
about their lexical ability.
I Introduction
Research on second language vocabulary development has been thriv-
ing for the last 10 years or more, as attested by numerous articles in
applied linguistics journals, anthologies (Huckin et al., 1993; Harley,
1995; Coady and Huckin, 1997; Schmitt and McCarthy, 1997) and
single-authored volumes (Singleton, 1999; Read, 2000; Nation,
2001). An observation that emerges from a review of this literature
is the ill-defined nature of vocabulary as a construct, in the sense that
different authors appear to approach this from different perspectives,
making a variety of – often implicit – assumptions about the nature
and scope of the lexical dimension of learners’ language. One per-
spective, reflected in the work of influential researchers like Laufer,
Meara and Nation (see Laufer, 1998; Laufer and Nation, 1999; Meara
Address for correspondence: John Read, School of Linguistics and Applied Language Studies,
Victoria University of Wellington, PO Box 600, Wellington, New Zealand; email:
john.read@vuw.ac.nz
the simple structure of the test items, it seems obvious that this test
is assessing vocabulary knowledge rather than, say, grammatical
knowledge or reading comprehension ability. It represents the tra-
ditional conception of what a vocabulary test is like. On the other
hand, vocabulary assessment may also be embedded as part of the
measurement of a larger construct. For instance, the ESL Composition
Profile (Jacobs et al., 1981) (Appendix 3) is an instrument to measure
the construct of writing proficiency in English by means of five rating
scales, one of which focuses on the range and appropriateness of the
test-takers’ vocabulary use. In this case, vocabulary is separately rated
in the first instance but then the rating is combined with those from
the other four scales to form an overall profile of the learners’ writ-
ing performance.
According to Read’s second dimension, we can distinguish vocabu-
lary tests which are selective, in that they focus on specific lexical
items, from other vocabulary measures based on a comprehensive
analysis of all the content words either in an input text or in the
learner’s response to a test task. Conventional vocabulary tests are
selective in nature: the test designer chooses particular target words
as the focus of the assessment. For example, Paribakht and Wesche
(1997) developed their Vocabulary Knowledge Scale (Appendix 7)
as a means of tracking how much knowledge of specific words a
group of learners acquired through encountering them in their reading
during a university ESL course. In the test the students were presented
with the words in isolation and prompted to show how much they
could recall of the meaning and use of each one. Another case of a
selective test is the multiple-choice rational deletion cloze (Hale et
al., 1989) (Appendix 5), which is created by selecting certain content
words in a written text and using each one as the basis for a multiple-
choice item. By contrast, the Lexical Density Index (O’Loughlin,
1995) (Appendix 8) is a comprehensive measure, which calculates
the proportion of content words in the test-takers’ responses to a
speaking test. Every word is taken into account in the calculation.
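To make the calculation concrete, the sketch below computes such a proportion for a short transcript. It is purely illustrative: the tiny function-word stoplist and the simple tokenization are stand-ins, not the classification scheme that O'Loughlin (1995) actually applied.

```python
# Illustrative only: a minimal lexical density calculation, i.e. the
# proportion of content words among all words in a spoken response.
# The stoplist below is a small stand-in, not O'Loughlin's classification.

FUNCTION_WORDS = {
    "the", "a", "an", "and", "or", "but", "of", "to", "in", "on", "at",
    "is", "are", "was", "were", "be", "i", "you", "he", "she", "it",
    "we", "they", "this", "that", "with", "for", "as", "not", "do", "some",
}

def lexical_density(transcript: str) -> float:
    """Return the proportion of content (non-function) words in a transcript."""
    tokens = [t.strip(".,!?;:\"'").lower() for t in transcript.split()]
    tokens = [t for t in tokens if t]                       # drop empty strings
    content = [t for t in tokens if t not in FUNCTION_WORDS]
    return len(content) / len(tokens) if tokens else 0.0

print(lexical_density("I went to the market and bought some fresh bread"))
# 5 content words out of 10 tokens -> 0.5
```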
Read’s third dimension is concerned with the role of context in a
vocabulary test. A context-independent instrument like the Vocabu-
lary Levels Test presents words to the test-takers in isolation and
requires them to select meanings for the words without reference to
any linguistic context. A less obvious example of such a measure is
represented by the vocabulary items in the reading section of the
paper-based Test of English as a Foreign Language (TOEFL)
(Educational Testing Service, 1995) (Appendix 4). These items focus
on words taken from the reading passages, and thus on the face of it
are ‘contextualized’, but the issue is whether the test-takers are
required to make use of contextual information to answer the items
correctly. Read (1997, 2000) points out that in many cases they can
be answered as if they were isolated items and, to the extent that
this is true, the TOEFL items can be classified as relatively context
independent, despite the manner in which they are presented. On the
other hand, in the C-test (Singleton and Little, 1991) (Appendix 6),
the test-takers have a clear need to make use of contextual clues to
complete the mutilated words in the text; this makes it an example
of a context-dependent test. Similarly, in the ESL Composition Pro-
file the test-takers are assessed on their ability to use lexical items
correctly and appropriately in the context defined by a text that they
themselves create. The other measure based on written compositions,
the Lexical Frequency Profile (LFP) (Laufer and Nation, 1995)
(Appendix 2), is somewhat less context dependent in this sense. The
use of the LFP includes a procedure whereby content words which
have clearly been used incorrectly by the learner are excluded from
the frequency analysis. Thus, the context dependence of the LFP has
to be judged according to the extent to which contextual, rather than
purely formal, criteria are employed for excluding words.
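As a rough illustration of how such a profile might be computed, the sketch below assigns each correctly used content word to a frequency band and reports percentages. The band lists, band names and misuse flags are assumed inputs for the example; the published LFP procedure relies on Nation's frequency lists and the University Word List.

```python
# Illustrative only: a frequency profile of the kind the LFP reports,
# computed after excluding words flagged as clearly misused.
from collections import Counter

def frequency_profile(content_words, bands, misused=frozenset()):
    """Percentage of correctly used content words falling into each band.

    content_words: word forms from the learner's composition
    bands: dict mapping a band name (e.g. 'first_1000') to a set of words
    misused: words judged to be clearly used incorrectly (excluded first)
    """
    kept = [w.lower() for w in content_words if w.lower() not in misused]
    counts = Counter()
    for w in kept:
        band = next((name for name, words in bands.items() if w in words), "off_list")
        counts[band] += 1
    total = len(kept)
    return {name: 100 * n / total for name, n in counts.items()} if total else {}

bands = {"first_1000": {"house", "small", "live"}, "second_1000": {"narrow"}}
print(frequency_profile(["house", "narrow", "ubiquitous", "live"], bands))
# {'first_1000': 50.0, 'second_1000': 25.0, 'off_list': 25.0}
```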
In principle, we see the three dimensions as being continua. Never-
theless, for the sake of parsimonious analysis, it is useful to take
all eight of the vocabulary measures we are using as examples and
summarize their design in the relatively dichotomous fashion in which
they have been discussed so far. The summary is set out in Table 1.
Classifying the tests according to the first two dimensions is a rela-
tively straightforward matter; context dependence, on the other hand,
is more variable and applies somewhat differently from one test to
another. The tilde (~) symbol is used to indicate that the TOEFL
vocabulary items are variably context dependent. A more extensive
discussion of context dependence can be found in Read (2000).
Table 1 Design features of the eight exemplary tests (profiled in Appendixes 1–8)
may be largely eliminated from the test items. The Vocabulary Levels
Test (Appendix 1) is a clear case of such an instrument, because it
is designed to measure learners’ vocabulary size as a trait without
reference to any particular context of use, and the target words them-
selves are presented in isolation, devoid of any linguistic context that
might indicate how they are used. Another less clear-cut example is
Singleton and Little’s use of the C-test to investigate the nature of
the mental lexicon (Appendix 6). In this case, the test items occurred
within a written text and the test-takers had to look for clues else-
where in the text to make the correct responses. However, the text
itself was not chosen to represent a particular context of use; the
researchers merely employed it as a vehicle for eliciting evidence of
general mental processes involved in lexical access and retrieval.
The second type of construct definition, adopted by the behav-
iourists, goes to the other extreme in a sense, by giving a central role
to features of context. Language testing researchers working within
the tradition of ‘performance testing’ (e.g., Wesche, 1987) have relied
on the same theoretical foundation when they have attempted to create
test methods which replicate the conditions of the settings for which
they wish to predict the test-taker’s future performance. The idea is
that the learner’s underlying knowledge is considered too elusive to
define and so construct definition becomes a matter of defining the
context in which language is used. In practice this means that usually
no one aspect of linguistic knowledge – like vocabulary – is singled
out for discrete scoring in the assessment of learners on a performance
test. Vocabulary may not be explicitly mentioned at all, or it is
included among the factors that raters are to take into account in
making an overall judgement of communicative effectiveness.
None of the measures profiled in the Appendixes exemplifies this
approach to the definition of vocabulary, but we can illustrate it by
reference to the Interagency Language Roundtable (n.d.) skill level
descriptions. These scales are used to assess whether US diplomats
and other government personnel assigned to positions requiring com-
petence in a foreign language have sufficient proficiency to be able
to carry out their duties effectively. Here is the description of the
Speaking 2 (Limited working proficiency) level:
Able to satisfy routine social demands and limited work requirements. Can
handle routine work-related interactions that are limited in scope. In more com-
plex and sophisticated work-related tasks, language usage generally disturbs
the native speaker. Can handle with confidence, but not with facility, most
normal, high-frequency social conversational situations including extensive,
but casual conversations about current events, as well as work, family, and
autobiographical information. The individual can get the gist of most everyday
conversations but has some difficulty understanding native speakers in situ-
ations that require specialized or sophisticated knowledge. The individual’s
utterances are minimally cohesive. Linguistic structure is usually not very elab-
orate and not thoroughly controlled; errors are frequent. Vocabulary use is
appropriate for high-frequency utterances, but unusual or imprecise elsewhere.
(Interagency Language Roundtable, n.d.; emphasis added)
VI Mediating factors
Returning to the framework in Figure 2, we can see that the validity
considerations identified in the second line help to identify three fac-
tors that mediate between test purpose and test design.
1 Construct definition
The first mediating factor, construct definition, has already been
extensively discussed in Section III above, where we presented three
ways of approaching the defining of vocabulary constructs. Each type
of definition entails a distinctive form of inference, which in turn has
implications for test design and validation.
A trait definition of vocabulary generally entails a test which is
discrete, selective and context independent, such as the Vocabulary
Levels Test. Even if the target words are presented in some form of
The second aspect of purpose – the uses made of test scores – has
implications for the way in which relevance and utility should be
evaluated and, in turn, the way in which test-takers’ performance
should be reported and interpreted. Performance can be summarized
as one score or as a profile containing multiple components. When
test scores are used for making decisions, such as whether or not a
student should be admitted to college or hired for a job, the decision
maker involved often desires a single score as a summary of the appli-
cant’s language ability. In contrast, when test uses include achieve-
ment or diagnosis for instructional purposes within the language
classroom or program, more specific information is necessary. For
example, for placing students into language programs, administrators
may need a profile indicating learners’ levels of development in areas
corresponding to the classes in the program. And when learners take
classroom progress and achievement tests, they themselves want to
know how they are doing and what they need to study (Canale, 1987).
In these latter cases, one single score fails to provide sufficient infor-
mation.
An example of a vocabulary test tailored specifically for its use by
teachers has been included in the placement test battery administered
at the English Language Institute of Victoria University of Welling-
ton. For many years, the core texts in the English proficiency program
at the institute were the five workbooks in the Advanced English
Vocabulary series (Barnard, 1971–75). In order to assist teachers in
determining which workbooks to use with a particular class, the test
was based on a random sample of words from each of the three fre-
quency levels covered by the workbooks. By reviewing the section
scores, teachers could see which workbooks were likely to be the
most suitable for the vocabulary learning needs of students in their
class. This, then, represents a test whose design and reporting were
very much governed by the diagnostic use of the test scores by class-
room teachers, and the case for its validity resides primarily in evi-
dence of its usefulness for making the appropriate instructional
decisions.
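A test of this kind might be assembled along the lines of the sketch below, which draws a random sample of target words from each frequency level so that section scores can be mapped onto the corresponding workbooks. The level names and sample size are hypothetical, not details of the Victoria University placement test.

```python
# Illustrative only: drawing one test section per frequency level by random
# sampling. Level names and the sample size are hypothetical.
import random

def build_sections(levels, n_per_level=18, seed=0):
    """levels: dict mapping a level name to a list of candidate words."""
    rng = random.Random(seed)
    return {name: rng.sample(words, min(n_per_level, len(words)))
            for name, words in levels.items()}

# e.g. build_sections({"level_1": word_list_1, "level_2": word_list_2,
#                      "level_3": word_list_3}, n_per_level=18)
```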
By contrast, the reporting of test-takers’ performance on TOEFL
reflects the rather different uses made of the test results. The main
users of the test are admissions personnel in colleges and universities
in the USA, who have a strong preference for a single overall score
for each international student applicant. Although the TOEFL score
report form includes scores for the individual sections of the test, the
focus of the decision makers is on the ‘magic number’ represented
by the total score (550, 213 or whatever it may be). Thus, the
3 Test presentation
The third aspect of purpose – impacts of the test – is tied to the
investigation of the actual consequences of testing and therefore
implies that it is necessary to consider how and to whom the test is
to be presented. Test developers choose to portray their tests in ways
that will appeal to particular audiences, such as L2 teachers, program
administrators or L2 researchers. The audience of test users may con-
sist of admissions people who value information that is easy to inter-
pret and presented with comparative data from other institutions. On
the other hand, teachers, students and researchers may be more inter-
ested in tests offering comprehensive information which can be inter-
preted relative to learners’ needs and areas of development. Teachers
may also be concerned about the degree of ‘authenticity’ in language
tests, and with the washback effects of tests on their instructional pro-
grams.
To illustrate the role of presentation, let us turn once more to the
Vocabulary Levels Test (Appendix 1). Nation designed this test as a
practical instrument for classroom teachers to encourage them to take
a more systematic approach to identifying their learners’ existing
word knowledge and planning their vocabulary learning program. In
order to achieve this impact, he included the full test in two publi-
cations (Nation, 1983, 1990) and made it freely available to teachers
through other channels. In addition, he provided guidelines as to how
the test scores should be interpreted and what kind of vocabulary
study was appropriate for learners at the various word frequency lev-
els covered by the test. One unintended impact of the wide dissemi-
nation of the test was, however, that it has also been adopted by a
number of researchers as a convenient measure of vocabulary size in
studies of L2 lexical acquisition. If this latter effect had been
intended, the original presentation of the test should ideally have
included technical evidence of its reliability and validity for this rather
different use. Some such evidence eventually became available (Read,
1988; Beglar and Hunt, 1999), but it is only now – nearly two dec-
ades after the original test was devised – that two more thoroughly
If test scores are reported as...             Then the central validity questions about relevance and utility are...
  a single score                              is the test score relevant and useful for the intended selection or placement decisions?
  multiple scores                             are the multiple scores relevant and useful for informing instructional decisions or a theory of L2 lexical development?

If the audience for test presentation is...   Then the central consequential validity question is...
  students/teachers                           does the test help to focus on the relevant aspects of vocabulary development?
  administrators/the public                   does the test play a positive role in program decisions?
  applied linguists                           does a sounder theory of L2 vocabulary development result from test use?
VIII Conclusion
One problem with many current vocabulary tests is that they appear
to follow principles of test design which are out of step with current
thinking in educational measurement and applied linguistics. They are
often discrete measures, composed of vocabulary items selected on a
IX References
Alderson, J.C. and Wall, D. 1993: Does washback exist? Applied Linguistics 14, 115–29.
Bachman, L.F. 1990: Fundamental considerations in language testing. Oxford: Oxford University Press.
Bachman, L.F. and Palmer, A.S. 1996: Language testing in practice. Oxford: Oxford University Press.
Barnard, H. 1971–75: Advanced English vocabulary. Workbooks 1–3B. Rowley, MA: Newbury House.
Beglar, D. and Hunt, A. 1999: Revising and validating the 2,000 Word Level and University Word Level Vocabulary Tests. Language Testing 16, 131–62.
Brindley, G. 1991: Assessing achievement in a learner-centred curriculum. In Alderson, J.C. and North, B., editors, Language testing in the 1990s. London: Macmillan, 153–66.
Impacts: Relatively low-stakes test for the learners. It may encour-
age the teaching and learning of words in isolation, and may lead
students to focus on the particular words tested.
PURPOSE
Inferences: Interactionalist definition (vocabulary ability within the
context of academic writing). Subtest level: The vocabulary compo-
nent of the profile measures the extent to which the test-takers choose
a good range of words and idioms, and use them correctly and appro-
priately in their writing.
PURPOSE
Inferences: Trait definition of vocabulary (the speaking test was
intended to provide a general sample of oral performance). Subtest
level: The extent to which each format (direct vs. semi-direct) for
each speaking task produces responses that are more ‘oral’ or more
‘literate’ in nature. If the semi-direct format produces speech that is
substantially more literate than that elicited by the direct one, this may
be evidence that the two formats are measuring different constructs.