
A framework for second language vocabulary assessment

John Read, Victoria University of Wellington
and Carol A. Chapelle, Iowa State University

Vocabulary tests are used for a wide range of instructional and research purposes
but we lack a comprehensive basis for evaluating the current instruments or
developing new lexical measures for the future. This article presents a framework
that takes as its starting point an analysis of test purpose and then shows how
purpose can be systematically related to test design. The link between the two is
based on three considerations which derive from Messick’s (1989) validation
theory: construct definition, performance summary and reporting, and test presen-
tation. The components of the framework are illustrated throughout by reference
to eight well-known vocabulary measures; for each one there is a description of
its design and an analysis of its purpose. It is argued that the way forward for
vocabulary assessment is to take account of test purposes in the design and vali-
dation of tests, as well as considering an interactionalist approach to construct
definition. This means that a vocabulary test should require learners to perform
tasks under contextual constraints that are relevant to the inferences to be made
about their lexical ability.

I Introduction
Research on second language vocabulary development has been thriv-
ing for the last 10 years or more, as attested by numerous articles in
applied linguistics journals, anthologies (Huckin et al., 1993; Harley,
1995; Coady and Huckin, 1997; Schmitt and McCarthy, 1997) and
single-authored volumes (Singleton, 1999; Read, 2000; Nation,
2001). An observation that emerges from a review of this literature
is the ill-defined nature of vocabulary as a construct, in the sense that
different authors appear to approach this from different perspectives,
making a variety of – often implicit – assumptions about the nature
and scope of the lexical dimension of learners’ language. One per-
spective, reflected in the work of influential researchers like Laufer,
Meara and Nation (see Laufer, 1998; Laufer and Nation, 1999; Meara

Address for correspondence: John Read, School of Linguistics and Applied Language Studies,
Victoria University of Wellington, PO Box 600, Wellington, New Zealand; email:
john.read@vuw.ac.nz

Language Testing 2001 18 (1) 1–32 0265-5322(01)LT190OA © 2001 Arnold


and Fitzpatrick, 2000), is to investigate the size and growth of lear-
ners’ vocabulary, largely on the basis of counting, classifying and
assessing knowledge of individual word forms. They appear to treat
vocabulary as a separate component of language knowledge, which
can be investigated without reference to the functions of words in
grammatical structures, text or discourse. A contrasting perspective
is offered by Singleton (1999), who adduces a wide range of findings
from linguistic and applied linguistic research to make a strong case
for the pervasiveness of lexical phenomena in language, to the extent
that ‘the viability of a separate lexical construct has to be seriously
questioned’ (1999: 269). He goes on to argue that it is no longer
justifiable to restrict vocabulary research to measures involving
knowledge of individual content words, and the scope of vocabulary
teaching needs to be similarly expanded.
A third perspective is represented by the work of Skehan and his
associates (Skehan, 1996, 1998; Foster and Skehan, 1996), which did
not set out to investigate vocabulary as such. In the course of
investigating the effects of variables such as the type of task and
amount of planning time on the linguistic outcomes of language learn-
ing tasks in the classroom, they have discovered that lexical measures
are a useful tool in the analysis. In their earlier studies, there was no
lexical dimension to the analysis but, more recently, Mehnert (1998)
found that a vocabulary measure was a useful indicator of the fluency
of the learners’ speech and Foster (2001) has explored the extent to
which native and non-native speakers use lexicalized language in the
performance of tasks. This analysis draws on Skehan’s (1996) theor-
etical framework, which proposes that memorized lexical units play
a major role in learner production both at the early stages of language
acquisition and in the development of nativelike fluency at more
advanced stages.
At the theoretical level, the distinctions among these three perspec-
tives on vocabulary may not be clear cut. They become evident only
through examination of the distinct forms of vocabulary assessment
associated with each one. Those scholars who treat vocabulary as a
separate construct tend to use tests that fit comfortably within the
psychometric-structuralist tradition in language testing, assessing
knowledge of content words with such relatively decontextualized
item types as multiple-choice, word-definition matching (Beglar and
Hunt, 1999), word completion (Laufer and Nation, 1999) and the
checklist (Meara, 1992). By contrast, Singleton (Singleton and Little,
1991; Singleton, 1999) has employed as a basic instrument in his
research on the mental lexicon the C-test, a more integrative measure
that requires learners to restore partial deletions in a series of short
texts. The deletions result in items intended to sample a number of
linguistic features in the texts rather than just knowledge of content
words. The third perspective relies on tests requiring extended linguis-
tic performance. Mehnert (1998) used the lexical density statistic (the
proportion of content words) and Foster (2001) has analysed the
occurrence of lexicalized phrases in the learners’ output, as identified
by a group of applied linguists acting as expert judges.
Can tests constructed through such different methods all be con-
sidered to measure the construct of vocabulary? Is there more than
one lexical construct, which should be measured in different ways?
How can results from different vocabulary tests be interpreted and
integrated in a way that allows for progress in our understanding of
learners’ vocabulary and how to measure it?
Development of lexical knowledge is now regarded, by both
researchers and teachers, as central to learning a language, and thus
vocabulary tests are being used for a wide variety of purposes. What
is lacking, though, is any comprehensive basis for evaluating current
tests and developing new lexical measures for the future. With regard
to test design, decisions must be made about, for example, whether
vocabulary is to be tested separately from other constructs such as
reading comprehension, how vocabulary items are to be chosen, and
the amount of linguistic context in the test input and expected
response. In addition, there need to be guidelines for evaluating the
quality of test outcomes, like the decisions made about students’
achievement in vocabulary learning on the basis of their test scores.
On what grounds can such decisions be justified? We see the need
for a framework which articulates the interrelationships between the
various factors influencing vocabulary test design and validation. It
is still common to find that the primary case for the validity of a
vocabulary test is made – in a perpetuation of traditional psycho-
metric practice – by means of a simple correlation with a criterion
measure. There should be a more sophisticated approach that draws
on contemporary theory and practice in test validation, as developed
by researchers in educational measurement, together with current per-
spectives on language adopted by applied linguists. Using Messick’s
(1989) theory of test validity, Chapelle (1994) made a first attempt
to set out the forms of evidence required to justify the use of the C-
test as a lexical measure. The present article builds on that work to
provide a more general framework for vocabulary assessment that
seeks to encompass the range of possible uses for vocabulary tests.
One key innovation in our framework is a fuller specification than
has hitherto been available of the purposes of language assessment.
We believe that debate about the appropriateness of particular lexical
measures should be much better informed by a consideration of the
implications of test purpose for test design. Thus, we have identified
three components of test purpose, which can be systematically related
to the design of vocabulary tests – or other language tests for that
matter because, although we have chosen to focus on vocabulary
assessment here, the principles are more generally applicable. As we
present the framework progressively through the various sections of
the article, we will be illustrating its components by reference to eight
lexical measures which are well documented in the language testing
literature. There are profiles of these eight measures in Appendixes
1–8, including for each one a summary of its design and a specifi-
cation of its purpose.

II Design of vocabulary measures


Let us take the design of vocabulary tests as a starting point. There
is a wide variety of characteristics that needs to be considered in a
full specification of how a language test is to be set up (cf. Bachman
and Palmer, 1996: Chapter 3). Our intention here is to identify design
options that have particular relevance for vocabulary measures, using
the three dimensions proposed by Read (2000) (Figure 1).
The first option to consider is whether vocabulary is being assessed
as a discrete construct. At the simplest level, this means that the test
is labelled as a ‘vocabulary test’ and is seen as measuring some aspect
of the learners’ knowledge of target language words. The Vocabulary
Levels Test (Nation, 1990) (Appendix 1) is a good example of a dis-
crete test, in that it is intended to estimate the size of a learner’s
vocabulary using a sample of high-frequency English words. Given

Figure 1 Three dimensions of vocabulary assessment (Read, 2000: 9)


the simple structure of the test items, it seems obvious that this test
is assessing vocabulary knowledge rather than, say, grammatical
knowledge or reading comprehension ability. It represents the tra-
ditional conception of what a vocabulary test is like. On the other
hand, vocabulary assessment may also be embedded as part of the
measurement of a larger construct. For instance, the ESL Composition
Profile (Jacobs et al., 1981) (Appendix 3) is an instrument to measure
the construct of writing proficiency in English by means of five rating
scales, one of which focuses on the range and appropriateness of the
test-takers’ vocabulary use. In this case, vocabulary is separately rated
in the first instance but then the rating is combined with those from
the other four scales to form an overall profile of the learners’ writ-
ing performance.
According to Read’s second dimension, we can distinguish vocabu-
lary tests which are selective, in that they focus on specific lexical
items, from other vocabulary measures based on a comprehensive
analysis of all the content words either in an input text or in the
learner’s response to a test task. Conventional vocabulary tests are
selective in nature: the test designer chooses particular target words
as the focus of the assessment. For example, Paribakht and Wesche
(1997) developed their Vocabulary Knowledge Scale (Appendix 7)
as a means of tracking how much knowledge of specific words a
group of learners acquired through encountering them in their reading
during a university ESL course. In the test the students were presented
with the words in isolation and prompted to show how much they
could recall of the meaning and use of each one. Another case of a
selective test is the multiple-choice rational deletion cloze (Hale et
al., 1989) (Appendix 5), which is created by selecting certain content
words in a written text and using each one as the basis for a multiple-
choice item. By contrast, the Lexical Density Index (O’Loughlin,
1995) (Appendix 8) is a comprehensive measure, which calculates
the proportion of content words in the test-takers’ responses to a
speaking test. Every word is taken into account in the calculation.
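
To make that calculation concrete, here is a minimal sketch of a lexical density computation of the kind just described: the proportion of content-word tokens among all word tokens in a transcribed response. The tokenizer and the small function-word set are illustrative assumptions only; O'Loughlin's actual procedure uses much fuller word lists and further refinements.

```python
# Minimal sketch: lexical density as the proportion of content words in a
# transcript. FUNCTION_WORDS is a small illustrative subset, not a full list.
import re

FUNCTION_WORDS = {
    "the", "a", "an", "and", "or", "but", "of", "in", "on", "at", "to",
    "is", "are", "was", "were", "be", "been", "it", "he", "she", "they",
    "i", "you", "we", "this", "that", "with", "for", "not", "do", "did",
}

def lexical_density(transcript: str) -> float:
    """Proportion of content-word tokens among all word tokens."""
    tokens = re.findall(r"[a-z']+", transcript.lower())
    if not tokens:
        return 0.0
    content = [t for t in tokens if t not in FUNCTION_WORDS]
    return len(content) / len(tokens)

print(lexical_density("I was walking to the station and I met an old friend"))
```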
Read’s third dimension is concerned with the role of context in a
vocabulary test. A context-independent instrument like the Vocabu-
lary Levels Test presents words to the test-takers in isolation and
requires them to select meanings for the words without reference to
any linguistic context. A less obvious example of such a measure is
represented by the vocabulary items in the reading section of the
paper-based Test of English as a Foreign Language (TOEFL)
(Educational Testing Service, 1995) (Appendix 4). These items focus
on words taken from the reading passages, and thus on the face of it
are ‘contextualized’, but the issue is whether the test-takers are
required to make use of contextual information to answer the items
correctly. Read (1997, 2000) points out that in many cases they can
be answered as if they were isolated items and, to the extent that
this is true, the TOEFL items can be classified as relatively context
independent, despite the manner in which they are presented. On the
other hand, in the C-test (Singleton and Little, 1991) (Appendix 6),
the test-takers have a clear need to make use of contextual clues to
complete the mutilated words in the text; this makes it an example
of a context-dependent test. Similarly, in the ESL Composition Pro-
file the test-takers are assessed on their ability to use lexical items
correctly and appropriately in the context defined by a text that they
themselves create. The other measure based on written compositions,
the Lexical Frequency Profile (LFP) (Laufer and Nation, 1995)
(Appendix 2), is somewhat less context dependent in this sense. The
use of the LFP includes a procedure whereby content words which
have clearly been used incorrectly by the learner are excluded from
the frequency analysis. Thus, the context dependence of the LFP has
to be judged according to the extent to which contextual, rather than
purely formal, criteria are employed for excluding words.
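
As an illustration of this kind of frequency analysis, the sketch below computes a simplified, token-based Lexical Frequency Profile: the proportion of a composition's words falling into each frequency band, after words judged to have been clearly misused are excluded. The tiny band lists are placeholders rather than Laufer and Nation's published lists, and the actual LFP works with word families rather than raw tokens.

```python
# Minimal sketch of an LFP-style analysis. BANDS holds placeholder word lists;
# a real profile would use the published frequency lists and lemmatization.
import re

BANDS = {
    "first_1000": {"people", "good", "work", "make", "time"},
    "second_1000": {"argue", "improve", "standard"},
    "academic": {"data", "context", "analyse"},
}

def lexical_frequency_profile(composition, misused=frozenset()):
    # Exclude words the analyst has judged to be clearly misused.
    tokens = [t for t in re.findall(r"[a-z']+", composition.lower())
              if t not in misused]
    counts = dict.fromkeys(BANDS, 0)
    counts["off_list"] = 0
    for token in tokens:
        for band, words in BANDS.items():
            if token in words:
                counts[band] += 1
                break
        else:
            counts["off_list"] += 1
    total = len(tokens) or 1
    return {band: count / total for band, count in counts.items()}

print(lexical_frequency_profile("People argue that good data improve their work"))
```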
In principle, we see the three dimensions as being continua. Never-
theless, for the sake of parsimonious analysis, it is useful to take
all eight of the vocabulary measures we are using as examples and
summarize their design in the relatively dichotomous fashion in which
they have been discussed so far. The summary is set out in Table 1.
Classifying the tests according to the first two dimensions is a rela-
tively straightforward matter; context dependence, on the other hand,
is more variable and applies somewhat differently from one test to
another. The tilde (~) symbol is used to indicate that the TOEFL
vocabulary items are variably context dependent. A more extensive
discussion of context dependence can be found in Read (2000).

Table 1 Design features of the eight exemplary tests (profiled in the Appendixes 1–8)

Test                             Features
1) Vocabulary Levels Test        Discrete   Selective       Context independent
2) Lexical Frequency Profile     Discrete   Comprehensive   Context dependent
3) ESL Composition Profile       Embedded   Comprehensive   Context dependent
4) TOEFL vocabulary items        Embedded   Selective       ~ Context dependent
5) Multiple-choice cloze         Embedded   Selective       Context dependent
6) C-test                        Discrete   Selective       Context dependent
7) Vocabulary Knowledge Scale    Discrete   Selective       Context independent
8) Lexical Density Index         Embedded   Comprehensive   Context dependent

Note: ~ indicates that the items are variably context dependent


Six of the eight possible combinations of the features actually
occur. We have not been able to find a clear case of a vocabulary
measure which is both comprehensive and context independent. Since
a comprehensive measure is by definition text-based, the text provides
context for the lexical items which compose it and – whether they
engage in comprehension or production – the learners will be required
to pay attention to the context in order to produce the correct or appro-
priate response. It is difficult to imagine a situation in which test-
takers would be expected to deal with meaningless strings of words
as test input, or be given credit for producing similarly incoherent
output. Therefore, a comprehensive but context-independent measure
is probably ruled out in principle.

III Approaches to construct definition


Returning to the first dimension of test design, we need to explore
further the nature of vocabulary as a discrete construct. Second langu-
age (L2) vocabulary researchers have given comparatively little atten-
tion to defining ‘vocabulary knowledge’ or ‘vocabulary size’ as theor-
etical constructs, forming the basis of test choice or design.
Conversely, in language testing – as in educational measurement gen-
erally – construct definition is recognized as having a central role in
the validation of tests because at least a working conception of an
underlying construct is needed for test results to be meaningfully
interpreted. Drawing on Messick’s (1981, 1989) work, Chapelle
(1998) outlines three ways of defining the construct of vocabulary,
as summarized in Table 2.
A trait definition attributes test performance to the characteristics
of the learner and therefore contextual variables play no significant
role in the design of a test to assess a particular trait. The implication
of this perspective for vocabulary testing is that context of any kind

Table 2 Three approaches to construct definition

Trait definition
  Principle underlying construct definition: Person characteristics must be specified independent of context.
  Example of construct definition: Vocabulary size

Behaviourist definition
  Principle underlying construct definition: Person characteristics cannot be specified. Context must be specified.
  Example of construct definition: Vocabulary use in mathematics writing

Interactionalist definition
  Principle underlying construct definition: Person characteristics must be specified relative to a particular context.
  Example of construct definition: Vocabulary size for writing in mathematics

may be largely eliminated from the test items. The Vocabulary Levels
Test (Appendix 1) is a clear case of such an instrument, because it
is designed to measure learners’ vocabulary size as a trait without
reference to any particular context of use, and the target words them-
selves are presented in isolation, devoid of any linguistic context that
might indicate how they are used. Another less clear-cut example is
Singleton and Little’s use of the C-test to investigate the nature of
the mental lexicon (Appendix 6). In this case, the test items occurred
within a written text and the test-takers had to look for clues else-
where in the text to make the correct responses. However, the text
itself was not chosen to represent a particular context of use; the
researchers merely employed it as a vehicle for eliciting evidence of
general mental processes involved in lexical access and retrieval.
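
As background on the instrument itself, the sketch below generates C-test passages under what is usually described as the classic construction rule: the first sentence is left intact and the second half of every second word thereafter is deleted. This is an assumption about the general format, not a reproduction of Singleton and Little's actual materials.

```python
# Minimal sketch of C-test construction, assuming the classic rule: leave the
# first sentence intact, then delete the second half of every second word.
# Punctuation handling is deliberately naive; real C-tests are hand-checked.
import re

def make_c_test(text: str) -> str:
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    intact, rest = sentences[0], " ".join(sentences[1:])
    words = rest.split()
    for i in range(1, len(words), 2):              # every second word
        word = words[i]
        keep = (len(word) + 1) // 2                # keep the first half, rounded up
        words[i] = word[:keep] + "_" * (len(word) - keep)
    return intact + " " + " ".join(words)

print(make_c_test("Words matter. Learners use the surrounding context to restore the missing endings."))
```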
The second type of construct definition, adopted by the behav-
iourists, goes to the other extreme in a sense, by giving a central role
to features of context. Language testing researchers working within
the tradition of ‘performance testing’ (e.g., Wesche, 1987) have relied
on the same theoretical foundation when they have attempted to create
test methods which replicate the conditions of the settings for which
they wish to predict the test-taker’s future performance. The idea is
that the learner’s underlying knowledge is considered too elusive to
define and so construct definition becomes a matter of defining the
context in which language is used. In practice this means that usually
no one aspect of linguistic knowledge – like vocabulary – is singled
out for discrete scoring in the assessment of learners on a performance
test. Vocabulary may not be explicitly mentioned at all, or it is
included among the factors that raters are to take into account in
making an overall judgement of communicative effectiveness.
None of the measures profiled in the Appendixes exemplifies this
approach to the definition of vocabulary, but we can illustrate it by
reference to the Interagency Language Roundtable (n.d.) skill level
descriptions. These scales are used to assess whether US diplomats
and other government personnel assigned to positions requiring com-
petence in a foreign language have sufficient proficiency to be able
to carry out their duties effectively. Here is the description of the
Speaking 2 (Limited working proficiency) level:
Able to satisfy routine social demands and limited work requirements. Can
handle routine work-related interactions that are limited in scope. In more com-
plex and sophisticated work-related tasks, language usage generally disturbs
the native speaker. Can handle with confidence, but not with facility, most
normal, high-frequency social conversational situations including extensive,
but casual conversations about current events, as well as work, family, and
autobiographical information. The individual can get the gist of most everyday
conversations but has some difficulty understanding native speakers in situ-
ations that require specialized or sophisticated knowledge. The individual’s
utterances are minimally cohesive. Linguistic structure is usually not very elab-
orate and not thoroughly controlled; errors are frequent. Vocabulary use is
appropriate for high-frequency utterances, but unusual or imprecise elsewhere.
(Interagency Language Roundtable, n.d.; emphasis added)

To use the terminology introduced above, vocabulary assessment


is both comprehensive and deeply embedded here, to the extent that
it plays a strictly limited role in determining the overall rating of the
learner’s performance of the test task. The italicized last sentence is
the only one that refers explicitly to vocabulary, although the refer-
ences in the fourth sentence to the kind of conversational topics which
the test-takers can handle could also be interpreted as statements
about the adequacy and range of their vocabulary use. Nevertheless,
no explicit inferences are made about underlying lexical competence;
the focus is on describing characteristics of the actual performance.
The third type of construct definition, the interactionalist approach,
requires the researcher to specify the relevant aspects of both trait
and context because it refers to a context-specific underlying ability.
This is different from the trait definition, which assumes the construct
is equally relevant across situations and different from the behav-
iourist construct, which is defined as performance in a particular con-
text. An interactionalist construct defines vocabulary as an underlying
trait, but one that needs to be specified relative to a particular context
of use. Several of the tests in the Appendixes might be analysed as
consistent with an interactionalist definition of vocabulary. For
instance, both the Lexical Frequency Profile (Appendix 2) and the
ESL Composition Profile (Appendix 3) assess learners’ ability to use
vocabulary correctly and appropriately in written compositions. If the
writing task is explicitly academic in nature – and especially if the
test-takers are required to write in the genre of a specific academic
discipline – the test results allow inferences to be made about their
vocabulary ability in the context of academic study. Similarly, the
TOEFL vocabulary items (Appendix 4) and the Vocabulary Knowl-
edge Scale (Appendix 7) may provide a basis for making inferences
about learners’ knowledge of the words used in academic texts, which
could be quite different from vocabulary use in other contexts, even
when the same word forms are involved.
In short, then, there has been no real tradition in vocabulary testing
of construct definition in any explicit form. Implicitly, this area has
been dominated by trait definitions, operationalized in discrete, selec-
tive and context-independent tests of learners’ knowledge of individ-
ual words presented in isolation. The dominance of trait-oriented con-
structs of vocabulary continued through the 1980s, when for a limited
period advocates of a strong version of performance testing intro-
duced behaviourist definitions of language constructs into the field
of language testing. At this point, in keeping with current trends in
educational measurement generally and language testing in particular,
the interactionalist approach should be adopted more widely as the
basis for developing vocabulary constructs which can be located in
meaningful contexts of language use rather than representing simply
a discrete form of structural knowledge. However, a change in current
practices requires a better understanding of the interrelated factors
underlying decisions about test design.

IV An overview of the framework


Construct definition represents just one element of our overall frame-
work (outlined in Figure 2), the main objective of which is to specify
the relationship between test purpose and test design, by explaining
how validity considerations come into play. In other words, the design
of a vocabulary test must be founded on an explicit description of
the purpose it is intended to serve, and it is through processes of
systematic analysis and inquiry that we establish the extent to which
the link between the two is well motivated in practice: does the oper-
ational test deliver the information required about the learners’
vocabulary knowledge or ability, while having the desired impacts on
those who use it?

Figure 2 A framework for vocabulary testing


V Test purpose and validity considerations


The first step in elaborating the framework, then, is to analyse test
purpose. We have devoted considerable attention to this aspect
because previous treatments in the language testing literature (e.g.,
Bachman and Palmer, 1996: 95–100) have not fully explored it. We
define test purpose as consisting of three components, which are
identified in the top line of Figure 2 and further classified in Table 3.
Let us look at each one in turn.

1 Inferences and construct validity


Inferences refer to the conclusions drawn about language ability or
performance on the basis of how the test-takers perform on the test.
As a simple example, the five scores on the Vocabulary Levels Test
(Appendix 1) can be interpreted as estimates of a learner’s ‘vocabu-
lary size’ at each of the five frequency levels covered by the test.
Obviously, the extent of the learner’s vocabulary knowledge cannot
be observed directly but is inferred from the proportion of correct
responses to the items in each part of the test. Inferences may be
made at various levels: (1) item, (2) subtest and (3) whole test. The
example just given illustrates a subtest inference. The five subtest
scores could be summed to give a more general measure of the lear-
ner’s vocabulary knowledge at the whole test level. Although item-
level inferences are also possible in this case, they make less sense
in a test where the target items sample a particular word frequency
level rather than being words of interest in their own right. A better
example of a test for which item-level inferences are appropriate is
the Vocabulary Knowledge Scale (Appendix 7), which is used to
measure what growth in knowledge of individual words occurs when
learners encounter them repeatedly through reading tasks.
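
To make the subtest-level inference for the Levels Test concrete, the following sketch converts the proportion of correct responses at a frequency level into a vocabulary-size estimate for that level, and then sums the subtest estimates for a whole-test figure. The items-per-level and band-size values, and the level labels, are illustrative assumptions rather than the test's published specifications.

```python
# Minimal sketch of the subtest-to-size inference: proportion correct at a
# frequency level, scaled up to the number of words that level represents.
# The default figures below are illustrative assumptions only.
def estimate_level_size(correct: int, items_per_level: int = 18,
                        band_size: int = 1000) -> float:
    """Estimated number of words known at one frequency level."""
    return (correct / items_per_level) * band_size

# Subtest-level inferences for hypothetical scores, then a whole-test total.
scores = {"2000": 16, "3000": 14, "5000": 10, "academic": 12, "10000": 4}
estimates = {level: estimate_level_size(c) for level, c in scores.items()}
print(estimates)
print("whole-test estimate:", round(sum(estimates.values())))
```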
Inferences at the subtest level are made when the vocabulary meas-
ure itself is not a discrete one but embedded in a larger test. A case

Table 3 Components of test purpose

1) Inferences to be drawn from test performance
   item level
   sub-test level
   whole test level
2) Uses of the test results
   instruction: placement, progress/achievement, diagnosis, proficiency
   research
   evaluation
3) Impacts that the test is intended to have

in point is the ESL Composition Profile (Appendix 3), in which a
rating is made of vocabulary use as one component of the overall
score. In this test, the inference at the whole test level concerns the
learner’s writing ability. Similarly in TOEFL (Appendix 4) any infer-
ence about vocabulary is strictly at the subtest level, since the whole
test focuses on the learner’s proficiency in English for academic
study. Whether they be at the item, subtest or whole test level, infer-
ences need to be supported or defended through the process of con-
struct validation, which requires evidence that they are justified.

2 Uses: relevance and utility


The second component of test purpose, Uses, refers to the practical
outcomes of test results, what they are used for. Uses can be divided
into three categories, which are identified by Bachman and Palmer
(1996: 96–99) in their account of test purpose, but we have classified
and labelled them a little differently. Instructional uses are defined
broadly as involving decision making about learners, most commonly
in an educational context. These are the most familiar uses of langu-
age tests, represented by labels such as placement, achievement and
proficiency. For instance, the original use of the Vocabulary Levels
Test (Appendix 1) was as a diagnostic tool for classroom teachers,
to assist them in preparing suitable vocabulary learning programs for
their students. And one use of the ESL Composition Profile
(Appendix 3) is as an achievement measure to assess how well lear-
ners have developed their writing skills by the end of an English
composition course. Research uses relate to the role of vocabulary
tests in empirical investigations in the fields of language testing and
second language acquisition. Thus, the research may consist of studies
to explore the suitability of a test for a particular instructional purpose
– as in the work of Hale et al. (1989) on the merits of a multiple-
choice cloze as a subtest of the TOEFL battery (Appendix 5) – or
studies to gain a better understanding of the processes of language
acquisition – as in Singleton and Little’s (1991) work on the nature
of lexical storage among L2 learners (Appendix 6). The essential dif-
ference in these cases from instructional uses of vocabulary measures
is that the test results are not being employed to make any decisions
about the learners as individuals, but rather to address the research
questions formulated by the investigators.
The third category, evaluation uses, involves decision making
about the quality of language education methods and programs. On
the face of it, none of our test profiles exemplifies this use of vocabu-
lary tests. Nonetheless, Paribakht and Wesche’s (1997) study
(Appendix 7), which they present as experimental research on inci-
dental vocabulary learning, could also be seen as an evaluation of
two design options for an ESL reading course: Reading Only (with
no class work on vocabulary) and Reading Plus (with structured
vocabulary learning exercises). On a larger scale, vocabulary tests
may be included in a battery of measures to evaluate a whole langu-
age teaching program, as in the case of the Reading English for
Science and Technology Project at the University of Guadalajara,
Mexico (Lynch, 1992). Although evaluation has much in common
with research, it is distinguished by its explicit focus on decisions
about, say, whether to recommend wider adoption of a learning tech-
nique or whether a language program should continue to be funded.
A particular test may have more than one use. The TOEFL func-
tions not only as a proficiency measure determining admission to
academic degree programs in US universities but often also as a
placement test into ESL classes for students who do not achieve the
minimum score for academic admission. Similarly, from its original
purpose as a classroom diagnostic test, the use of the Vocabulary
Levels Test (Appendix 1) has been extended to acting as a measure
of vocabulary size for the subjects in various research studies on the
learning of L2 words. Evaluation of test use is conducted through
inquiry into its relevance and utility for making the decisions for
which it is intended.

3 Intended impacts and actual consequences


Impacts, the third component of test purpose, refer to the test’s
intended effects on its users, including individual students and teach-
ers, language classes and programs, and even society as a whole (cf.
Bachman and Palmer, 1996: 29–35). Intended impact needs to be
considered during the process of test design if actual consequences
are to be evaluated in a well-motivated fashion as part of the vali-
dation process. Current theory and practice in test evaluation places
considerable significance on the consequences of testing (Frederiksen,
1984; Canale, 1987; Messick, 1989; Bachman and Palmer, 1996) and,
in the field of language testing specifically, there is great interest in
the washback of major tests on teaching and learning (Alderson and
Wall, 1993; Messick, 1996; Wall, 1997). Research on washback has
concentrated on the impact of tests which were already being adminis-
tered for instructional purposes, but we are arguing here that the logic
of incorporating consequences into test evaluation is that intended
effects should be seen as an integral part of the design of new tests
and should thus be specified as one component of test purpose. The
actual consequences of implementing the test for a particular purpose
can then be evaluated in relation to intended effects at the design
stage. In seeking to illustrate impacts in this sense, though, we have
encountered the practical difficulty that the authors of the tests pro-
filed in the Appendixes have typically not articulated the intended
effects of their instruments and so we have been obliged to infer what
they might have been, as well as noting the actual effects of the test
in some cases.
At the classroom level, the impact of a weekly progress test is
presumably to encourage the students to study and revise the vocabu-
lary items presented in each unit of their course textbook. In the case
of TOEFL, the basic motivation for developing the whole battery in
the early 1960s – i.e., its original macrolevel impact, in our terms –
was a concern that decisions about the linguistic ability of inter-
national students to cope with study in a US university should be
made on a fair and consistent basis by means of an instrument which
met high standards of psychometric quality (Spolsky, 1995: 217–36).
At a more microlevel, the original TOEFL vocabulary items, which
were selective and context-independent multiple-choice items
presenting words in isolation, were criticized by language teachers as
giving international students an incentive to spend time unproduc-
tively memorizing long lists of words together with their synonyms
or definitions. To a significant extent, then, subsequent revisions of
the test, which have presented words in some context and integrated
the vocabulary items more into the reading comprehension section of
TOEFL (cf. Appendix 4), have had the intended effect of encouraging
learners to develop skill at dealing with words as they occur in written
academic texts (for further discussion, see Read, 1997, 2000).

VI Mediating factors
Returning to the framework in Figure 2, we can see that the validity
considerations identified in the second line help to identify three fac-
tors that mediate between test purpose and test design.

1 Construct definition
The first mediating factor, construct definition, has already been
extensively discussed in Section III above, where we presented three
ways of approaching the defining of vocabulary constructs. Each type
of definition entails a distinctive form of inference, which in turn has
implications for test design and validation.
A trait definition of vocabulary generally entails a test which is
discrete, selective and context independent, such as the Vocabulary
Levels Test. Even if the target words are presented in some form of
linguistic context, the objective of the assessment is to determine
whether the learners really ‘know’ the words, without having the
opportunity to infer the meaning from contextual cues. The focus
tends to be on receptive knowledge of vocabulary; usually, productive
ability is assessed only in the limited sense of requiring the test-takers
to supply the word within a restricted context, e.g., by completing a
blank in a short sentence. Researchers and test developers who adopt
a trait-oriented approach have generally sought to validate their tests
by means of correlational procedures. Two or more vocabulary tests
consisting of different item types (or test methods) are correlated,
primarily to obtain evidence of convergent validity by controlling for
the influence of method variance.
From the behaviourist perspective, vocabulary as such is typically
not very salient in test design. The constructs of interest may be
macroskills like listening comprehension or writing ability, but they
are usually embedded in broader concepts of communicative pro-
ficiency for specified academic, occupational or social purposes. Tests
are based on integrated tasks which resemble real-life uses of langu-
age as closely as possible, thus creating a context for appropriate
vocabulary use. To the extent that vocabulary plays any explicit role
in the assessment of learner performance, it is embedded, comprehen-
sive and context dependent in nature. Certainly, individual lexical
items are not considered crucial for comprehension or production, and
no attempt is made to single them out for attention. In the validation
of performance tests, the match between the test task and the corre-
sponding language use situation is of particular importance and is
investigated through content analysis of their respective features.
An interactionalist approach assesses vocabulary as a trait which
is manifested in particular contexts of use. This means that there is
more explicit attention to vocabulary than a behaviourist definition
would allow. There are also more design options available as com-
pared to the other types of construct definition. Some form of context
dependence should be an essential feature of test design if learners
are to demonstrate their ability to deal with vocabulary receptively
and/or productively under construct-relevant contextual constraints,
but otherwise an interactionalist vocabulary test may be either discrete
or embedded, while both selective and comprehensive methods of
identifying the lexical items are possible. Validation of tests from an
interactionalist perspective is likely to be multi-faceted, drawing on
the various forms of validity enquiry identified by Messick (1989,
1996) and discussed in relation to vocabulary testing by Chapelle
(1994, 1998).

2 Performance summary and reporting

The second aspect of purpose – the uses made of test scores – has
implications for the way in which relevance and utility should be
evaluated and, in turn, the way in which test-takers’ performance
should be reported and interpreted. Performance can be summarized
as one score or as a profile containing multiple components. When
test scores are used for making decisions, such as whether or not a
student should be admitted to college or hired for a job, the decision
maker involved often desires a single score as a summary of the appli-
cant’s language ability. In contrast, when test uses include achieve-
ment or diagnosis for instructional purposes within the language
classroom or program, more specific information is necessary. For
example, for placing students into language programs, administrators
may need a profile indicating learners’ levels of development in areas
corresponding to the classes in the program. And when learners take
classroom progress and achievement tests, they themselves want to
know how they are doing and what they need to study (Canale, 1987).
In these latter cases, one single score fails to provide sufficient infor-
mation.
An example of a vocabulary test tailored specifically for use by
teachers has been included in the placement test battery administered
at the English Language Institute of Victoria University of Welling-
ton. For many years, the core texts in the English proficiency program
at the institute were the five workbooks in the Advanced English
Vocabulary series (Barnard, 1971–75). In order to assist teachers in
determining which workbooks to use with a particular class, the test
was based on a random sample of words from each of the three fre-
quency levels covered by the workbooks. By reviewing the section
scores, teachers could see which workbooks were likely to be the
most suitable for the vocabulary learning needs of students in their
class. This, then, represents a test whose design and reporting were
very much governed by the diagnostic use of the test scores by class-
room teachers, and the case for its validity resides primarily in evi-
dence of its usefulness for making the appropriate instructional
decisions.
By contrast, the reporting of test-takers’ performance on TOEFL
reflects the rather different uses made of the test results. The main
users of the test are admissions personnel in colleges and universities
in the USA, who have a strong preference for a single overall score
for each international student applicant. Although the TOEFL score
report form includes scores for the individual sections of the test, the
focus of the decision makers is on the ‘magic number’ represented
by the total score (550, 213 or whatever it may be). Thus, the
reporting of TOEFL results primarily serves the needs of these users,
and arguments for the validity of the test should produce evidence that
the total scores allow them to make reliable decisions about whether
incoming students can cope with the language demands of their aca-
demic degree programs. Once they arrive on campus, many of these
students need to be placed in courses within the university’s ESL
program. Since TOEFL scores do not provide the diagnostic infor-
mation required for placement, ESL programs have developed their
own tests for this purpose. Examples are the English as a Second
Language Placement Examination (ESLPE) at UCLA and the ELI
Placement Test at the University of Hawaii (Brown, 1993). It is only
through such tests that information can be obtained about the stu-
dents’ vocabulary knowledge or ability, for it has been many years
since TOEFL provided a section score for vocabulary.
The kind of mismatch illustrated here between the reporting needs
of external audiences (like admissions personnel) and those involved
in language teaching programs has been extensively discussed by
Brindley (1991, 1998). In his earlier work, he distinguishes three lev-
els of assessment. Level 1, which assesses overall language achieve-
ment, is carried out by means of standardized proficiency tests like
TOEFL or IELTS (International English Language Testing System),
and generally produces a single score or rating. At Level 2, students
are assessed continuously during a language course on their achieve-
ment of particular communicative objectives. The results are typically
reported in the form of a profile describing in functional
terms what the learner can do in the second language. At the third
level, less formal assessment is undertaken by teachers to monitor
students’ learning of the course content on a regular basis. The result
may consist of corrective feedback on a short writing task or the
number of items correct in a weekly vocabulary test. Thus, the audi-
ence for Level 3 assessment is limited to the teacher and learners in
the classroom, whereas at Level 2 and particularly Level 1 there is a
wider range of people who may be interested in the results as well,
including educational administrators, evaluators, government officials
and perhaps even the general public.
In a recent (1998) article, Brindley explores the implications of the
trend in the 1990s in many countries towards assessment and
reporting systems based on educational outcomes, in the form of
national standards, benchmarks, competencies and so on. On the one
hand, teachers are increasingly expected to use various performance
assessment procedures on a continuing basis in the classroom, but on
the other hand it is difficult to use the formative information they
obtain to provide the kind of summative reporting of student achieve-
ment required by policy makers and educational bureaucrats. Within
this environment, discrete vocabulary measures used by teachers and
language program administrators will probably not be interpretable
by a wider audience, unless perhaps they are properly contextualized
measures of vocabulary size. Any reporting of students’ vocabulary
ability to external decision makers is thus likely to be embedded in
a summary of the learners’ more general communicative ability in
the language.

3 Test presentation
The third aspect of purpose – impacts of the test – is tied to the
investigation of the actual consequences of testing and therefore
implies that it is necessary to consider how and to whom the test is
to be presented. Test developers choose to portray their tests in ways
that will appeal to particular audiences, such as L2 teachers, program
administrators or L2 researchers. The audience of test users may con-
sist of admissions people who value information that is easy to inter-
pret and presented with comparative data from other institutions. On
the other hand, teachers, students and researchers may be more inter-
ested in tests offering comprehensive information which can be inter-
preted relative to learners’ needs and areas of development. Teachers
may also be concerned about the degree of ‘authenticity’ in language
tests, and with the washback effects of tests on their instructional pro-
grams.
To illustrate the role of presentation, let us turn once more to the
Vocabulary Levels Test (Appendix 1). Nation designed this test as a
practical instrument for classroom teachers to encourage them to take
a more systematic approach to identifying their learners’ existing
word knowledge and planning their vocabulary learning program. In
order to achieve this impact, he included the full test in two publi-
cations (Nation, 1983, 1990) and made it freely available to teachers
through other channels. In addition, he provided guidelines as to how
the test scores should be interpreted and what kind of vocabulary
study was appropriate for learners at the various word frequency lev-
els covered by the test. One unintended impact of the wide dissemi-
nation of the test was, however, that it has also been adopted by a
number of researchers as a convenient measure of vocabulary size in
studies of L2 lexical acquisition. If this latter effect had been
intended, the original presentation of the test should ideally have
included technical evidence of its reliability and validity for this rather
different use. Some such evidence eventually became available (Read,
1988; Beglar and Hunt, 1999), but it is only now – nearly two dec-
ades after the original test was devised – that two more thoroughly
researched versions have been presented for general use by Schmitt
and his associates (Schmitt, 2000; Schmitt et al., 2001).
The ESL Composition Profile (Appendix 3) was presented as a
central component of a three-volume English Composition Program,
which was designed to encourage ESL teachers at the time (the early
1980s) first to give more emphasis to the skill of writing and, sec-
ondly, to develop writing as a communicative resource for their lear-
ners, rather than just a contextualized grammar activity. Thus, the
profile had the role of providing the basis for reliable and valid assess-
ment of the students’ writing from a communicative perspective. In
keeping with this objective, the content, organization and vocabulary
subscales were given greater weight collectively in the scoring system
than language use (syntax and morphology) and mechanics. In
addition, teachers and program administrators were given a wealth of
advice on all aspects of writing assessment. Perhaps anticipating the
fact that the profile would subsequently be used in numerous research
studies on L2 writing, the authors included in the testing volume a
technical chapter that presents statistical evidence for its reliability
and validity as a measure of college-level writing ability.

VII Implications for test design and validation


The implications of construct definition, performance summary and
reporting, and test presentation for test design are set out in Table 4.
The basis on which the construct is defined should have a strong
influence on the form of the test. Trait principles are most likely to
lead to a discrete vocabulary test with context-independent items,
whereas a behaviourist definition requires task-based test design,
which may employ embedded and comprehensive lexical measures.
An interactionalist approach will pay more explicit attention to
vocabulary than the behaviourist one, while at the same time
assessing vocabulary in relation to a specific context of use. For
reporting purposes, it is important that the test results should be in a
form that will allow the users to make soundly-based decisions,
whether they be for instruction, research or evaluation. In addition,
the quality of their decisions will be enhanced if they are well infor-
med about the nature of the test and how the results should properly
be interpreted.
These factors, which were drawn from validation theory, in turn
have implications for the process of validation, as summarized in
Table 5 through the guiding questions for validation. The construct
validity issues follow logically from the three types of construct defi-
nition, and it is the role of contextual influences on test performance
that is the key element distinguishing the three approaches. This

Table 4 The implications of construct definition, score reporting, and test presentation for test design

If the construct definition is based on . . .   Then . . .
trait principles: design a discrete test with randomly selected vocabulary items.
behaviourist principles: design a test comprised of tasks with characteristics sufficiently similar to those in the context of interest.
interactionalist principles: design a discrete or embedded test with carefully chosen vocabulary items and tasks with characteristics similar to those in the context of interest.

If test scores are reported as . . .   Then . . .
a single score: ensure that the various items/parts of the test can be meaningfully combined for reporting to the test users.
multiple scores: ensure that each element of the profile can be reliably assessed and the overall profile meets the users' needs.

If the audience for test presentation is . . .   Then be sure to include . . .
students/teachers: a statement of how the test is intended to help the teaching/learning process.
administrators/the public: an explanation of the relevance of the test for making decisions.
applied linguists: evidence concerning what the test does and does not measure.

means that test developers who adopt either behaviourist or interac-
tionalist principles should give priority to obtaining evidence of the
relationship between test content and relevant situations of language
use. With regard to relevance and utility, the usefulness of the scores
for making the appropriate decisions or addressing the research ques-
tions needs to be established. Finally there should be evidence that
the use of the test has a positive impact on its intended audience and
that any negative effects are minimized.
As Messick (1989) suggested, the process of validation should con-
sist of an argument entailing theoretical and empirical rationales for
the outcomes of testing. As indicated in Figure 2, such argu-
ments should be based on theory, evidence and consequences of test-
ing. Validation theory in general, however, needs to be guided by
testing needs within a particular context, where specific choices concern-
ing construct definition, score reporting and presentation need to be justi-
fied on the basis of an integrated argument. The questions set out in
Table 5 illustrate the way in which these choices motivate particular
questions that might be probed through the process of validation.
Such questions may form the basis of logical analysis and a more

Table 5 The implications of construct definition, score reporting, and test presentation for validation

If construct definition is based on . . .   Then the central construct validity question is . . .
trait principles: does the test measure the defined underlying characteristics without any influence from the test context?
behaviourist principles: does the test sample performance in a test context which has characteristics sufficiently similar to the context of interest?
interactionalist principles: does the test measure the defined underlying characteristics with appropriate influence from the test context?

If test scores are reported as . . .   Then the central validity questions about relevance and utility are . . .
a single score: is the test score relevant and useful for the intended selection or placement decisions?
multiple scores: are the multiple scores relevant and useful for informing instructional decisions or a theory of L2 lexical development?

If the audience for test presentation is . . .   Then the central consequential validity question is . . .
students/teachers: does the test help to focus on the relevant aspects of vocabulary development?
administrators/the public: does the test play a positive role in program decisions?
applied linguists: does a sounder theory of L2 vocabulary development result from test use?

elaborate series of questions for this purpose can be found in Bach-
man and Palmer’s (1996: 149–55) checklist for evaluating test useful-
ness. Such questions can also provide guidance for conceptualizing
empirical analysis through a variety of recognized methods for val-
idity inquiry (e.g., Bachman, 1990; Chapelle, 1998) including empiri-
cal item investigation, empirical task analysis, correlations with other
tests and behaviours, and experimental studies of test performance.
Examination of these three components – construct definition, score
reporting, and presentation – is useful for identifying particular ques-
tions to direct validation, but ultimately the validity argument must
integrate the various forms of justification into a single statement con-
cerning test interpretation and use.

VIII Conclusion
One problem with many current vocabulary tests is that they appear
to follow principles of test design which are out of step with current
thinking in educational measurement and applied linguistics. They are
often discrete measures, composed of vocabulary items selected on a
random basis to provide an estimate of learners' vocabulary size with-
out reference to contexts of use. In these tests, the target words are
typically presented with little or no linguistic context. If we view the
situation from the perspective of the framework presented in this arti-
cle, we can see that vocabulary tests of this kind adhere to the design
principles associated with the trait definition of vocabulary. There is
certainly a role for such measures, but difficulties arise when tests
designed for specific, narrowly focused research uses are employed
in various other ways in the instructional domain without adequate
consideration of their wider usefulness. To the extent that the ultimate
learning objectives of contemporary language teaching programs,
especially those for learners with specific purposes, focus on com-
municative use of language in particular contexts, these objectives are
not well served by tests based on a trait definition of vocabulary. The
prevalence of discrete, selective and context-independent tests may
in fact have a negative educational impact if, for instance, language
teachers avoid vocabulary assessment because the only measures
available seem to be irrelevant to their needs. Thus, our framework
draws attention to the importance of considering not just the practical
value of the test results but also the construct definition on which the
test is based and the educational or social consequences of test use.
The major development that we advocate in this article is a rethink-
ing of vocabulary assessment from a three-component perspective on
test purpose and an interactionalist perspective on construct definition
when it is appropriate. This means a fresh analysis of test purpose
taking in all three of its components within our framework. An inter-
actionalist approach to inferences requires that vocabulary knowledge
and use should be defined in relation to particular contexts. For
instance, if the construct of vocabulary size is still a productive one
for researchers or educationists, it needs to be framed in terms of the
adequacy of the learner’s lexical knowledge in achieving particular
communicative purposes. This has implications not only for the selec-
tion of lexical items to be assessed but also the nature of the task the
learners are asked to perform: a ‘contextualized’ test design has little
value unless the test-takers are required to engage with the contextual
features in a meaningful way.
A new approach to test uses means going beyond tests designed
to measure learners’ knowledge of relatively decontextualized word
lists and considering what other vocabulary assessment needs have
to be met. In the area of instructional uses, for example, how should
we test the achievement of learners in a specific-purpose course
whose objectives include the effective use of their lexical resources
within relevant genres, styles or registers? In second language acqui-
sition research, what measures are available to investigate the role of
multiword lexical items in the development of native-like fluency by
advanced learners? And in program evaluation, what diagnostic meas-
ures can be employed with students who are being educated through
immersion in a second language to assess the adequacy of their
vocabulary knowledge in the various curriculum areas?
The third component of the framework, impacts, is almost virgin
territory for vocabulary assessment. Test design – especially for esti-
mating vocabulary size – seems to be driven largely by practical con-
siderations: which of the existing word lists is the most convenient
to sample from, and which simple item type will allow a large number
of words to be covered. The enduring popularity of the Vocabulary
Levels Test for a whole range of uses must owe much to its avail-
ability, simplicity and convenience. But what about tests that give
learners the incentive to deepen their knowledge of lexical items, or
to develop effective communication strategies to deal with gaps in
their vocabulary knowledge? Instead of designing tests which have
unanticipated outcomes, one might identify and plan for particular
positive consequences. For instance, if the lexicon is as central to
language use as Singleton (1999) claims, how should that be made
salient to learners and teachers in the way that performance tasks
are assessed?
There is plenty of scope, then, for the development of innovative
types of vocabulary assessment that are congruent with educational
measurement theory and at the same time responsive to the changing
needs of language teachers and applied linguistic researchers. An
understanding of test purpose, and the factors intervening between
purpose and design, provides the necessary foundation for significant
advances in vocabulary testing in the twenty-first century.

IX References
Alderson, J.C. and Wall, D. 1993: Does washback exist? Applied Linguis-
tics 14, 115–29.
Bachman, L.F. 1990: Fundamental considerations in language testing.
Oxford: Oxford University Press.
Bachman, L.F. and Palmer, A.S. 1996: Language testing in practice.
Oxford: Oxford University Press.
Barnard, H. 1971–75: Advanced English vocabulary. Workbooks 1–3B.
Rowley, MA: Newbury House.
Beglar, D. and Hunt, A. 1999: Revising and validating the 2,000 Word
Level and University Word Level Vocabulary Tests. Language Testing
16, 131–62.
Brindley, G. 1991: Assessing achievement in a learner-centred curriculum.
In Alderson, J.C. and North, B., editors, Language testing in the 1990s.
London: Macmillan, 153–66.
Brindley, G. 1998: Outcomes-based assessment and reporting in language
learning programmes: a review of the issues. Language Testing 15,
45–85.
Brown, J.D. 1993: A comprehensive criterion-referenced language testing
project. In Douglas, D. and Chapelle, C, editors, A new decade of
language testing research. Arlington, VA: TESOL, 163–84.
Canale, M. 1987: Language assessment: the method is the message. In
Tannen, D. and Alatis, J.E., editors, The interdependence of theory,
data, and application. Washington, DC: Georgetown University Press,
249–62.
Chapelle, C.A. 1994: Are C-tests valid measures for L2 vocabulary
research? Second Language Research 10, 157–87.
—— 1998: Construct definition and validity inquiry in SLA research. In
Bachman, L.F. and Cohen, A.D., editors, Interfaces between second
language acquisition and language testing research. Cambridge: Cam-
bridge University Press, 32–70.
Coady, J. and Huckin, T., editors, 1997: Second language vocabulary
acquisition. Cambridge: Cambridge University Press.
Educational Testing Service 1995: TOEFL sample test 5th edn. Princeton,
NJ: Educational Testing Service.
Foster, P. 2001: Rules and routines: a consideration of their role in the
task-based language production of native and non-native speakers. In
Bygate, M., Skehan, P. and Swain, M., editors, Language tasks: teach-
ing, learning and testing. London: Longman.
Foster, P. and Skehan, P. 1996: The influence of planning and task-type
on second language performance. Studies in Second Language Acqui-
sition 18, 299–323.
Frederiksen, N. 1984: The real test bias: influences of testing on teaching
and learning. American Psychologist 39, 193–202.
Hale, G.A., Stansfield, C.W., Rock, D.A., Hicks, M.M., Butler, F.A. and
Oller, J.W. Jr. 1989: The relation of multiple-choice cloze items to the
Test of English as a Foreign Language. Language Testing 6, 47–76.
Harley, B., editor, 1995: Lexical issues in language learning. Amsterdam:
John Benjamins.
Huckin, T., Haynes, M. and Coady, J., editors, 1993: Second language
reading and vocabulary learning. Norwood, NJ: Ablex.
Interagency Language Roundtable, n.d.: ILR skill level descriptions.
Available at: http://www.call.gov/testing/testing.htm.
Jacobs, H.L., Zingraf, S.A., Wormuth, D.R., Hartfiel, V.F. and Hughey,
J.B. 1981: Testing ESL composition: a practical approach. Rowley,
MA: Newbury House.
Laufer, B. 1998: The development of passive and active vocabulary in a
second language: same or different? Applied Linguistics 19, 255–71.
Laufer, B. and Nation, P. 1995: Vocabulary size and use: lexical richness
in L2 written production. Applied Linguistics 16, 307–22.
—— 1999: A vocabulary-size test of controlled productive ability. Language
Testing 16, 33–51.
Lynch, B.K. 1992: Evaluating a program inside and out. In Alderson, J.C.
and Beretta, A., editors, Evaluating second language education. Cam-
bridge: Cambridge University Press, 61–99.
Meara, P. 1992: EFL vocabulary tests. Swansea: Centre for Applied Langu-
age Studies, University of Wales, Swansea.
Meara, P. and Fitzpatrick, T. 2000: Lex30: an improved method of
assessing productive vocabulary in an L2. System 28, 19–30.
Mehnert, U. 1998: The effects of different lengths of time for planning on
second language performance. Studies in Second Language Acquisition
20, 83–108.
Messick, S. 1981: Constructs and their vicissitudes in educational and
psychological measurement. Psychological Bulletin 89, 575–88.
—— 1989: Validity. In Linn, R.L., editor, Educational measurement. 3rd
edn. New York: Macmillan, 13–103.
—— 1996: Validity and washback in language testing. Language Testing
13, 241–56.
Nation, I.S.P. 1983: Testing and teaching vocabulary. Guidelines 5, 12–25.
—— 1990: Teaching and learning vocabulary. New York: Heinle and Hein-
le.
—— 2001: Learning vocabulary in another language. Cambridge:
Cambridge University Press.
O’Loughlin, K. 1995: Lexical density in candidate output on direct and
semi-direct versions of an oral proficiency test. Language Testing 12,
217–37.
Paribakht, T.S. and Wesche, M. 1997: Vocabulary enhancement activities
and reading for meaning in second language vocabulary acquisition.
In Coady, J. and Huckin, T., editors, Second language vocabulary
acquisition. Cambridge: Cambridge University Press, 174–200.
Read, J. 1988: Measuring the vocabulary knowledge of second language
learners. RELC Journal 19, 12–25.
—— 1997: Vocabulary and testing. In Schmitt, N. and McCarthy, M., edi-
tors, Vocabulary: description, acquisition and pedagogy. Cambridge:
Cambridge University Press, 303–20.
—— 2000: Assessing vocabulary. Cambridge: Cambridge University Press.
Schmitt, N. 2000: Vocabulary in language teaching. Cambridge: Cambridge
University Press.
Schmitt, N. and McCarthy, M., editors, 1997: Vocabulary: description,
acquisition and pedagogy. Cambridge: Cambridge University Press.
Schmitt, N., Schmitt, D. and Clapham, C. 2001: Developing and exploring
the behaviour of two new versions of the Vocabulary Levels Test.
Language Testing 18(1), 55–89.
Singleton, D. 1999: Exploring the second language mental lexicon. Cam-
bridge: Cambridge University Press.
Singleton, D. and Little, D. 1991: The second language lexicon: some evi-
dence from learners of French and German. Second Language
Research 7, 61–81.
Skehan, P. 1996: A framework for the implementation of task-based instruc-
tion. Applied Linguistics 17, 38–62.
—— 1998: A cognitive approach to language learning. Oxford: Oxford
University Press.
Spolsky, B. 1995: Measured words. Oxford: Oxford University Press.
Wall, D. 1997: Impact and washback in language testing. In Clapham, C.
and Corson, D., editors, Encyclopedia of language and education. Vol.
7: Language testing and assessment. Dordrecht: Kluwer, 291–302.
Wesche, M.B. 1987: Second language performance testing: the Ontario Test
of ESL as an example. Language Testing 4, 28–47.

Appendixes Analyses of purpose for some well-known vocabulary tests

Appendix 1 The Vocabulary Levels Test (Nation, 1990)


DESIGN
A discrete, selective vocabulary test with the words presented in iso-
lation. Input: A 90-item test, with eighteen items for each of five frequency
levels. Items are presented in groups of three, together with six poss-
ible definitions. Expected response: Test-takers select the definition
that matches each of the target words.
PURPOSE
Inferences: Trait-definition of vocabulary (vocabulary size inde-
pendent of contexts of use). Item level: Knowledge of a common
meaning of each of a sample of high-frequency words. Test level: The
estimated size of the learner’s vocabulary, based on the proportion of
the words known at different frequency levels.
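To make the test-level inference concrete, the following sketch shows one way such an estimate might be computed. The number of word families that each sampled level is taken to represent, and the learner profile itself, are hypothetical figures introduced purely for illustration; the sketch is not Nation's scoring procedure.

```python
# A hypothetical illustration of the test-level inference: extrapolating
# from the proportion of items correct at each frequency level to a rough
# estimate of vocabulary size. The band sizes are assumptions made for
# exposition, not part of Nation's published scoring procedure.

ITEMS_PER_LEVEL = 18  # the 1990 version has 18 items at each of five levels

# Assumed number of word families that each sampled level stands for.
ASSUMED_BAND_SIZES = {
    "2000": 1000,
    "3000": 1000,
    "5000": 2000,
    "University Word List": 800,
    "10000": 5000,
}

def estimate_vocabulary_size(correct_per_level):
    """Scale the proportion correct at each level up to the assumed
    number of word families in that band and sum across bands."""
    total = 0.0
    for level, n_correct in correct_per_level.items():
        proportion_known = n_correct / ITEMS_PER_LEVEL
        total += proportion_known * ASSUMED_BAND_SIZES[level]
    return round(total)

# A hypothetical learner profile: number of items correct (out of 18).
profile = {"2000": 17, "3000": 15, "5000": 10,
           "University Word List": 12, "10000": 4}
print(estimate_vocabulary_size(profile))  # approximate word families known
```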

Uses: Instructional: A classroom test intended to assist teachers to design suitable vocabulary learning programs for their students.
Research: Intended to measure vocabulary size for various kinds of
L2 vocabulary research.

Impacts: Relatively low stakes test for the learners. It may encour-
age the teaching and learning of words in isolation, and may lead
students to focus on the particular words tested.

Appendix 2 The Lexical Frequency Profile (LFP) (Laufer and Nation, 1995)
DESIGN
A discrete, comprehensive vocabulary test, involving the use of words
by test-takers in a written text that they compose themselves.
Input: A short prompt on a controversial topic of general interest.
Expected response: The test-takers write a composition of 300–350
words in one hour, giving their opinion about the topic. Scoring: The
correctly used word forms written by each test-taker are lemmatized
and classified by a computer program into four frequency levels: First
1000 most frequent words; second 1000 words; University Word List;
and words outside the three lists. The profile consists of the percent-
ages of the total word forms that belong to each frequency level.
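The profile computation can be sketched as follows. The word lists and the lemmatizer below are crude stand-ins for the full frequency lists and lemmatization routine actually used, so the fragment illustrates the classification-and-percentage logic of the LFP rather than reproducing Laufer and Nation's program.

```python
# A minimal sketch of the Lexical Frequency Profile computation.
# The word lists and the lemmatizer are placeholders: a real profile
# uses the full frequency lists and proper lemmatization of the
# correctly used word forms.

import re

FIRST_1000 = {"the", "be", "of", "and", "have", "people", "think"}    # stand-in
SECOND_1000 = {"argue", "benefit", "poor"}                            # stand-in
UNIVERSITY_WORD_LIST = {"analyse", "concept", "environment"}          # stand-in

def lemmatize(word):
    # Placeholder: a real implementation maps inflected forms to lemmas.
    return word.lower()

def lexical_frequency_profile(text):
    """Return the percentage of word tokens falling into each of the
    four frequency categories used by the LFP."""
    tokens = [lemmatize(w) for w in re.findall(r"[a-zA-Z']+", text)]
    counts = {"first_1000": 0, "second_1000": 0, "uwl": 0, "not_in_lists": 0}
    for token in tokens:
        if token in FIRST_1000:
            counts["first_1000"] += 1
        elif token in SECOND_1000:
            counts["second_1000"] += 1
        elif token in UNIVERSITY_WORD_LIST:
            counts["uwl"] += 1
        else:
            counts["not_in_lists"] += 1
    total = len(tokens) or 1
    return {category: 100 * n / total for category, n in counts.items()}

print(lexical_frequency_profile("People argue that the environment will benefit."))
```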
PURPOSE
Inferences: Trait definition of vocabulary (the specific topic or
genre of the writing has no particular significance for the assessment).
Test level: The ability to produce words correctly in written compo-
sition. Higher percentages of lower frequency words are considered
to represent larger vocabulary size and a higher level of proficiency
in the language.

Uses: Research: A measure of the lexical richness of ESL student writing for investigating its development over time as the result of instruction.

Impacts: The scope of impact may depend on whether the test-takers compose by pen-and-paper or on a computer. The latter case
simplifies the processing of the LFP considerably and that, together
with the objectivity of the measure, may encourage wider use for
instructional purposes.

Appendix 3 ESL Composition Profile (Jacobs et al., 1981)


DESIGN
An embedded comprehensive test. The vocabulary component is one
of five analytic scales used to assess the test-takers’ compositions
(along with content, organization, language use and mechanics).
Input: A variety of prompts can be used. The original work on the
profile involved short prompts used in the Michigan Test. Expected
response: The test-takers write a composition under test conditions
(200–300 words in 30 minutes in the Michigan case). Scoring:
Vocabulary is scored by raters’ judgements on a scale of 0–20. The
scores are grouped into four levels from ‘Excellent to very good’ to
‘Very poor’, each with their own descriptors. The total profile score
can range from 0–100.
PURPOSE
Inferences: Interactionalist definition (vocabulary ability within the
context of academic writing). Subtest level: The vocabulary compo-
nent of the profile measures the extent to which the test-takers choose
a good range of words and idioms, and use them correctly and appro-
priately in their writing.

Uses: The test as a whole is designed as a flexible instrument to serve a variety of instructional and research uses within ESL programs, with the general goal of measuring the development of the learners’ written communicative abilities in the language.

Impacts: The profile has been widely used in ESL programs in North America and elsewhere during the last fifteen years. It requires
readers to look at learner compositions from several points of view
and, in particular, to consider the contribution of vocabulary use to
the overall quality of the writing. It communicates to teachers and
learners a relative value of vocabulary use in academic writing.

Appendix 4 The vocabulary items in section 3 of the Test of English as a Foreign Language (TOEFL) (Educational Testing Service, 1995)
DESIGN
In the paper-based version of TOEFL introduced in 1995, the vocabu-
lary test items are embedded in a reading comprehension test, which
in turn forms one of three sections in a proficiency test battery. Words
are presented within reading passages. Input: Section 3 of the TOEFL
consists of five reading passages (200–300 words). For each one,
there are about three multiple-choice items with the generic stem:
‘The word “[ ]” in line [ ] is closest in meaning to . . .’ Expected
response: The test-takers select one of four single-word options for
each item. Scoring: Items are scored dichotomously.
PURPOSE
Inferences: Interactionalist definition of vocabulary (vocabulary
knowledge and strategies in academic reading). Item level: Knowl-
edge of the semantic features of particular words occurring in a read-
ing text or metacognitive strategies for guessing meanings in context.
Subtest level: Comprehension of the semantic content of academic-
style reading texts.

Uses: Instructional: One component of a battery of measures of proficiency in English, for selection and admission of nonnative-speaker
students to English-medium academic study programs at the tertiary
level.

Impacts: A high stakes test, so that there is great demand from learners for intensive coaching and practice with the item types
included in the test. Some vocabulary items may require the test-
takers to pay attention to the linguistic context in which the target
words occur, but for others the correct option might be selected with-
out reference to the context. The items may encourage intensive
teaching and learning of likely target words and their synonyms.

Appendix 5 A multiple-choice cloze test for proficiency testing (Hale et al., 1989)
DESIGN
An embedded, selective vocabulary measure in which words are
presented in the context of a reading passage. Input: Three short writ-
ten texts (about 200 words each) from which a total of fifty words
are deleted selectively. Each is replaced by a four-option multiple-
choice item. Twelve items are judged to have vocabulary as the pri-
mary source of difficulty and fourteen have vocabulary as a secondary
source. Expected response: The test-takers select the correct option
for each item. Scoring: Items are scored dichotomously.
PURPOSE
Inferences: Interactionalist definition of vocabulary (to the extent
that vocabulary ability is assessed with reference to academic texts).
Item: Knowledge of word meaning in context and strategies for word
selection. Test: Language ability not precisely defined (investigated
in the research).

Uses: Research: An investigation of what is measured by items in a multiple-choice cloze test. Instructional: A possible replacement for
part of the reading comprehension section of TOEFL, assessing pro-
ficiency in English for academic study purposes.

Impacts: If it had been adopted as part of the operational versions of TOEFL, it would have assessed vocabulary knowledge integrat-
ively in a discourse context, in contrast to the sentence-based vocabu-
lary test items that were used in TOEFL at that time. This would
presumably have encouraged more contextualized study of vocabulary by learners preparing for the test.

Appendix 6 The C-test for vocabulary research (Singleton and Little, 1991)
DESIGN
A discrete, selective test with items contextualized in a written text.
Input: A short (about 100 word) text in French/German, with the
second half of every second word deleted, yielding about forty items.
Expected response: Test-takers write in the missing half of each muti-
lated word.
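The mutilation procedure lends itself to a brief sketch. The fragment below follows the description above literally (the second half of every second word is deleted); the treatment of odd-length words and punctuation reflects plausible conventions rather than Singleton and Little's actual materials, and the sample sentence is invented.

```python
# A rough generator for C-test items: the second half of every second
# word is deleted. Conventions for one-letter words and odd-length words
# vary; here the first ceil(n/2) letters of the word are retained.

import math
import re

def make_c_test(text):
    """Return the mutilated text and the list of deleted word-endings
    (the 'items' the test-taker must restore)."""
    words = text.split()
    items = []
    mutilated = []
    for i, word in enumerate(words, start=1):
        # Separate trailing punctuation so only the word itself is mutilated.
        match = re.match(r"^(\w+)(\W*)$", word)
        if match and i % 2 == 0 and len(match.group(1)) > 1:
            stem_len = math.ceil(len(match.group(1)) / 2)
            kept, deleted = match.group(1)[:stem_len], match.group(1)[stem_len:]
            items.append(deleted)
            mutilated.append(kept + "_" * len(deleted) + match.group(2))
        else:
            mutilated.append(word)
    return " ".join(mutilated), items

# An invented French fragment, used only to show the output format.
text_fragment = "Les étudiants apprennent le vocabulaire dans des contextes variés."
print(make_c_test(text_fragment))
```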
PURPOSE
Inferences: Trait definition of vocabulary (inferences about mental
lexical processes without reference to a context of use). Item: Whether
a particular response is semantically motivated; whether an incorrect
response shows evidence of lexical creativity; and whether a parti-
cular lexical creation shows cross-linguistic influence. Test: The
extent to which the L2 lexicon is semantically motivated rather than
phonologically driven. The extent to which L1 and L2 lexical storage
and processing are interconnected.

Uses: Research: A tool in the investigation of the nature of the L2 mental lexicon.

Impacts: Researchers attempted to gain L2 acquisition data from performance on classroom activities which learners saw as normal and routine.

Appendix 7 The Vocabulary Knowledge Scale (VKS) (Paribakht and Wesche, 1997)
DESIGN
A discrete, selective vocabulary test with words presented in isolation.
The test is administered twice, at the beginning and end of a period
of vocabulary instruction/acquisition. Input: A list of words taken
from reading texts that the learners study, including content words
and discourse connectives. Expected response: The test-takers rate
their knowledge of each word on a five-point scale, give an expla-
nation of the meaning and compose a sentence containing the word.
Scoring: The response to each word is scored on a scale of 1–5,
depending on the reported familiarity of the word to the test-taker, the adequacy of the explanation, and the appropriateness and grammaticality of the sentence.
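The scoring decision for a single word might be pictured as follows. The rules are a plausible reconstruction for illustration only: the way the self-report is combined with the explanation and the sentence is assumed here, and should not be read as Paribakht and Wesche's exact rubric.

```python
# A sketch of a VKS-style scoring decision for one target word.
# The decision rules are a plausible reconstruction for illustration,
# not Paribakht and Wesche's exact scoring scheme.

def score_vks_response(self_report, explanation_adequate,
                       sentence_appropriate, sentence_grammatical):
    """Map a learner's response for one word onto a 1-5 score.

    self_report: 1-5 familiarity rating claimed by the learner
    explanation_adequate: the explanation/synonym/translation is correct
    sentence_appropriate: the word is used with appropriate meaning
    sentence_grammatical: the sentence is also grammatically accurate
    """
    if self_report == 1:            # word reported as unknown
        return 1
    if not explanation_adequate:    # familiar, but meaning not demonstrated
        return 2
    if not sentence_appropriate:    # meaning known, use not demonstrated
        return 3
    if not sentence_grammatical:    # appropriate use, faulty grammar
        return 4
    return 5                        # appropriate and grammatical use

# Hypothetical example: the learner claims to know the word, explains it
# correctly, and uses it appropriately but with a grammatical error.
print(score_vks_response(4, True, True, False))   # -> 4
```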
PURPOSE
Inferences: Trait definition of vocabulary (in the test itself, knowl-
edge of word meaning is assessed independent of any particular
context of use) / Interactionalist definition of vocabulary (within the
preceding ESL course, knowledge of word meaning has developed in
the context of a theme studied in class). Item level: The gain in knowl-
edge of each target word as a result of instruction/acquisition. Test
level: The overall gain in depth of knowledge of the selected target
words.

Uses: Research: A measure of the knowledge of particular vocabulary items acquired by learners as a result of encountering the words
in texts. Instructional: An achievement measure of gains in vocabu-
lary knowledge as the result of an instructional program.

Impacts: The test is hand-scored, which limits its application on a large scale. It provided L2 researchers with the first tool of its kind
to measure depth of vocabulary knowledge.

Appendix 8 The Lexical Density Index for the speaking subtest of access (the Australian Assessment of Communicative English Skills) (O’Loughlin, 1995)
DESIGN
A comprehensive vocabulary measure embedded in the oral interac-
tion subtest of an English proficiency test administered to intending
migrants to Australia in a network of test centres outside the country.
Input: The test-takers are set a range of speaking tasks – narration,
description, discussion, role play, etc. – using a combination of
spoken and printed prompts. The test is prepared in two formats:
direct (administered live by an interlocutor following a detailed
script) and semi-direct (administered by means of a pre-recorded
tape). Expected response: The test-takers respond orally to each task
and their responses are audiotaped for later rating by trained raters
in Australia. Scoring: The index is derived by calculating the percent-
age of lexical items in the test-takers’ responses to each format for
each task.
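Once each word token in a transcribed response has been classified as a lexical (content) item or a grammatical (function) word, the index is a simple percentage, as the following sketch illustrates. The function-word list and the sample responses are invented stand-ins, and O'Loughlin's own analysis involved more refined decisions about which items to count as lexical.

```python
# A minimal sketch of a lexical density calculation: the percentage of
# running words in a spoken response that are lexical (content) items.
# The function-word list is a small stand-in for a full classification.

import re

FUNCTION_WORDS = {
    "the", "a", "an", "and", "or", "but", "to", "of", "in", "on", "at",
    "is", "are", "was", "were", "be", "i", "you", "he", "she", "it",
    "we", "they", "this", "that", "my", "your", "for", "with", "so",
}

def lexical_density(transcript):
    """Percentage of tokens in the transcript that are not function words."""
    tokens = re.findall(r"[a-zA-Z']+", transcript.lower())
    if not tokens:
        return 0.0
    lexical_tokens = [t for t in tokens if t not in FUNCTION_WORDS]
    return 100 * len(lexical_tokens) / len(tokens)

# Invented responses to the same task under the two formats.
direct = "Well I think the picture shows a family having a picnic in the park"
semi_direct = "The photograph depicts a family enjoying an outdoor meal"
print(round(lexical_density(direct), 1), round(lexical_density(semi_direct), 1))
```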
PURPOSE
Inferences: Trait definition of vocabulary (the speaking test was
intended to provide a general sample of oral performance). Subtest
level: The extent to which each format (direct vs. semi-direct) for
each speaking task produces responses that are more ‘oral’ or more
‘literate’ in nature. If the semi-direct format produces speech that is
substantially more literate than that elicited by the direct one, this may
be evidence that the two formats are measuring different constructs.

Uses: Research: The lexical density index is a tool used to investigate one significant characteristic of the speaking subtest, the format of the input.

Impacts: The access: test was a high-stakes instrument, so it was important for the test developers to ensure that the procedures for
administering the test were fair to all of the test-takers. In the case
of the speaking test, this meant obtaining evidence that it made no
meaningful difference whether the test-takers were given the direct
or the semi-direct format. The lexical density index was one kind of
evidence used to address this issue.
