SESSION 5: STANDARDISED TESTING

On completion of this session, you will be able to:

 examine the process of constructing, validating, administering, and interpreting standardised tests of language
 examine a variety of current standardized tests that claim to test overall language proficiency.
Introduction
Every educated person has at some point been touched if not deeply affected by a
standardized test. For almost a century, schools, universities, businesses, and governments
have looked to standardized measures for economical, reliable, and valid assessments of
those who would enter, continue in, or exit their institutions. Proponents of these large-scale
instruments make strong claims for their usefulness when great numbers of people must be
measured quickly and effectively. Those claims are well supported by reams of research data
that comprise construct validations of their efficacy. And so we have become a world that
abides by the results of standardized tests.
The rush to carry out standardized testing in every walk of life has not gone unchecked. Some
psychometricians have stood up in recent years to caution the public against reading too
much into tests that require what may be a narrow band of specialized intelligence (Sternberg,
1997; Gardner, 2000; Koo, 2000). Organizations such as the National Centre for Fair and Open
Testing (www.fairtest.org) have reminded us that standardization of assessment procedures
creates an illusion of validity. Strong claims from the giants of the testing industry, they say,
have pulled the collective wool over the public's eyes and in the process have incorrectly
marginalized thousands, if not millions, of children and adults worldwide.
Whichever side is "right" (and both sides have legitimate cases), it is important for teachers to
understand the educational institutions they are working in, and an integral part of virtually
all of those institutions is the use of standardized tests. So it is important for you to
understand what standardised tests are, what they are not, how to interpret them, and how
to put them into a balanced perspective in which we strive to accurately assess all learners
on all proposed outcomes or objectives. We can learn a great deal about many learners and
their competencies through standardized forms of assessment. But some of those learners
and some of those learning outcomes may not be adequately measured by a sit-down, timed,
multiple-choice format that is likely to be decontextualized.

This session has two goals: to introduce the process of constructing, validating, administering, and interpreting standardised tests of language, and to acquaint you with a variety of current standardized tests that claim to test overall language proficiency.
It should be clear from these goals that in this session we are not focusing centrally on
classroom-based assessment. Don't forget, however, that standardised tests affect all
classrooms, and some of the practical steps that are involved in creating standardized tests
are directly transferable to designing classroom tests.
WHAT IS STANDARDIZATION?
A standardized test presupposes certain standard objectives, or criteria, that are held constant from one form of the test to another. The criteria in large-scale standardized tests are designed to apply to a broad band of competencies that are usually not exclusive to one particular curriculum. A good standardized test is the product of a thorough process of empirical research and development. It dictates standard procedures for administration and scoring. And finally, it is typically a norm-referenced test, the goal of which is to place test-takers on a continuum across a range of scores and to differentiate test-takers by their relative ranking.
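Because scores on a norm-referenced test are interpreted relative to other test-takers rather than against an absolute standard, results are often reported as percentile ranks. The short sketch below (in Python, with invented scores; it illustrates the general idea, not any particular test's reporting procedure) shows how such a rank might be computed:

    def percentile_rank(score, norm_group_scores):
        """Percentage of the norm group scoring at or below the given score."""
        at_or_below = sum(s <= score for s in norm_group_scores)
        return 100 * at_or_below / len(norm_group_scores)

    # Invented norming sample of total scores on some test.
    norm_group = [45, 52, 58, 61, 63, 67, 70, 72, 75, 81]
    print(percentile_rank(67, norm_group))  # 60.0: at or above 60% of the norm group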
Most elementary and secondary schools in the United States have standardized achievement
tests to measure children's mastery of the standards or competencies that have been
prescribed for specified grade levels. These tests vary by state, county, and school district, but they all share the common objective of economical, large-scale assessment.
College entrance exams such as the Scholastic Aptitude Test (SAT) are part of the educational
experience of many high school seniors seeking further education. The Graduate Record Exam
(GRE) is a required standardized test for entry into many graduate school programmes. Tests
like the Graduate Management Admission Test (GMAT) and the Law School Admission Test (LSAT) specialize in particular disciplines. One genre of standardized test that you may already be familiar with is the Test of English as a Foreign Language (TOEFL), produced by the Educational Testing Service (ETS) in the United States, and its British counterpart, the International English Language Testing System (IELTS), which features standardized tests in affiliation with the University of Cambridge Local Examinations Syndicate (UCLES). These tests are all standardized because they specify a set of competencies (or standards) for a given domain, and through a process of construct validation they design a set of tasks to measure those competencies.
Many people are under the incorrect impression that all standardized tests consist of items
that have predetermined responses presented in a multiple-choice format. While it is true
that many standardized tests conform to a multiple-choice format, by no means is multiple-
choice a prerequisite characteristic. It so happens that a multiple-choice format provides the
test producer with an "objective" means for determining correct and incorrect responses, and
therefore is the preferred mode for large-scale tests. However, standards are equally involved in certain human-scored tests of oral production and writing, such as the Test of Spoken English (TSE) and the Test of Written English (TWE), both produced by ETS.

ADVANTAGES AND DISADVANTAGES OF STANDARDIZED TESTS
Advantages of standardized testing include, foremost, a ready-made, previously validated product that frees the teacher from having to spend hours creating a test. Administration to large groups can be accomplished within reasonable time limits. In the case of multiple-choice formats, scoring procedures are streamlined (either scannable computerized scoring or hand-scoring with a hole-punched grid) for fast turnaround time. And, for better or for worse,
there is often an air of face validity to such authoritative-looking instruments.
Disadvantages centre largely on the inappropriate use of such tests, for example, using an overall proficiency test as an achievement test simply because of the convenience of the standardization. A colleague told me about a course director who, after a frantic search for a last-minute placement test, administered a multiple-choice grammar achievement test, even though the curriculum was mostly listening and speaking and involved few of the grammar points tested. This instrument had the appearance and face validity of a good test when in reality it had no content validity whatsoever.
Another disadvantage is the potential misunderstanding of the difference between direct and indirect testing. Some standardised tests include tasks that do not directly sample the target performance outcomes or objectives. For example, before 1996 the TOEFL included neither a written nor an oral production section, yet statistics showed a reasonably strong correspondence between performance on the TOEFL and a student's written and, to a lesser extent, oral production. The comprehension-based TOEFL could therefore be claimed to be an indirect test of production. Likewise, a test of reading comprehension that proposes to measure the ability to read extensively, yet engages test-takers in reading only short one- or two-paragraph passages, is an indirect measure of extensive reading.
Those who use standardized tests need to acknowledge both the advantages and limitations
of indirect testing. In pre-1996 TOEFL administrations, the expense of giving a direct test of production was considerably reduced by offering only comprehension tasks and showing through construct validation the appropriateness of conclusions about a test-taker's production competence. Likewise, short reading passages are easier to administer, and if
research validates the assumption that short reading passages indicate extensive reading
ability, then the use of the shorter passages is justified. Yet the construct validation statistics
that offer that support never offer a 100 percent probability of the relationship, leaving room
for some possibility that the indirect test is not valid for its targeted use.
A more serious issue lies in the assumption (alluded to above) that standardized tests
correctly assess all learners equally well. Well established standardized tests usually
demonstrate high correlations between performance on such 'tests and target objectives, but
correlations are not sufficient to demonstrate unequivocally the acquisition of criterion
objectives by all test-takers. Here is a non-language example. In the United States, some driver's license renewals require taking a paper-and-pencil multiple-choice test that covers signs, safe speeds and distances, lane changes, and other rules of the road. Correlational
statistics show a strong relationship between high scores on those tests and good driving
records, so people who do well on these tests are a safe bet to relicense. Now, an extremely
high correlation (of perhaps 80% or above) may be loosely interpreted to mean that a large
majority of the drivers whose licenses are renewed by virtue of their having passed the little
quiz are good behind-the-wheel drivers. What about those few who do not fit the model?
That small minority of drivers could endanger the lives of the majority, and is that a risk worth
taking? Motor vehicle registration departments in the United States seem to think so, and
thus avoid the high cost of behind-the-wheel driving tests.

Are you willing to rely on a standardized test result in the case of all the learners in your class? Of an applicant to your institution, or of a potential degree candidate exiting your program?
DEVELOPING A STANDARDIZED TEST
While it is not likely that a classroom teacher, even with a team of test designers and researchers, would be in a position to develop a brand-new standardized test of large-scale proportions, it
is a virtual certainty that someday you will be in a position (a) to revise an existing test, (b) to
adapt or expand an existing test, and/or (c) to create a smaller-scale standardized test for a
program you are teaching in. And even if none of the above three cases should ever apply to
you, it is of paramount importance to understand the process of the development of the
standardized tests that have become ingrained in our educational institutions.
How are standardized tests developed? Where do test tasks and items come from? How are
they evaluated? Who selects items and their arrangement in a test?
How do such items and tests achieve consequential validity? How are different forms of tests equated for difficulty level? Who sets norms and cut-off limits? Are security and confidentiality an issue? Are cultural and racial biases an issue in test development? All these questions typify those that you might pose in an attempt to understand the process of test development.
In the steps outlined below, three different standardized tests will be used to exemplify the process of standardized test design:
(A) The Test of English as a Foreign Language (TOEFL), Educational Testing Service (ETS).
(B) The English as a Second Language Placement Test (ESLPT), San Francisco State University (SFSU).
(C) The Graduate Essay Test (GET), SFSU.
The first is a test of general language ability or proficiency. The second is a placement test at a university. And the third is a gate-keeping essay test that all prospective students must pass
in order to take graduate-level courses. As we look at the steps, one by one, you will see
patterns that are consistent with those outlined in the previous sessions for evaluating and
developing a classroom test.
1. Determine the purpose and objectives/outcomes of the test.
Most standardized tests are expected to provide high practicality in administration and
scoring without unduly compromising validity. The initial outlay of time and money for such
a test is significant, but the test would be used repeatedly. It is therefore important for its
purpose and objectives to be stated specifically. Let's look at the three tests.
(A) The purpose of the TOEFL is "to evaluate the English proficiency of people whose native
language is not English" (TOEFL Test and Score Manual, 2001, p. 9). More specifically, the
TOEFL is designed to help institutions of higher learning make "valid decisions concerning
English language proficiency in terms of their own requirements" (p. 9). Most colleges and
universities in the United States use TOEFL scores to admit or refuse international applicants
for admission. Various cut-off scores apply, but most institutions require scores from 475 to
525 (paper-based) or from 150 to 195 (computer-based) in order to consider students for
admission. The high-stakes, gate-keeping nature of the TOEFL is obvious.
(B) The ESLPT is designed to place already admitted students at San Francisco State University
in an appropriate course in academic writing, with the secondary goal of placing students into
courses in oral production and grammar-editing. While the test's primary purpose is to make
placements, another desirable objective is to provide teachers with some diagnostic
information about their students on the first day or two of class. The ESLPT is locally designed
by university faculty and staff.
(C) The GET, another test designed at SFSU, is given to prospective graduate students, both native and non-native speakers, in all disciplines to determine whether their writing ability is sufficient to permit them to enter graduate-level courses in their programs. It is offered at
the beginning of each term. Students who fail or marginally pass the GET are technically
ineligible to take graduate courses in their field. Instead, they may elect to take a course in
graduate-level writing of research papers. A pass in that course is equivalent to passing the
GET.
As you can see, the objectives/outcomes of each of these tests are specific. The content of
each test must be designed to accomplish those particular ends. This first stage of goal-setting might be seen as one in which the consequential validity of the test is foremost in the
mind of the developer: each test has a specific gate-keeping function to perform; therefore
the criteria for entering those gates must be specified accurately.
2. Design test specifications.
Now comes the hard part. Decisions need to be made on how to go about structuring the
specifications of the test. Before specs can be addressed, a comprehensive program of
research must identify a set of constructs underlying the test itself. This stage of laying the
foundation stones can occupy weeks, months, or even years of effort. Standardized tests that don't work are often the product of short-sighted construct validation. Let's look at the three
tests again.
(A) Construct validation for the TOEFL is carried out by the TOEFL staff at ETS under the
guidance of a Policy Council that works with a Committee of Examiners that is composed of
appointed external university faculty, linguists, and assessment specialists. Dozens of
employees are involved in a complex process of reviewing current TOEFL specifications,
commissioning and developing test tasks and items, assembling forms of the test, and
performing ongoing exploratory research related to formulating new specs. Reducing such a
complex process to a set of simple steps runs the risk of gross overgeneralization, but here is
an idea of how a TOEFL is created.

Because the TOEFL is a proficiency test, the first step in the developmental process is to define the construct of language proficiency. First, it should be made clear that many assessment specialists such as Bachman (1990) and Palmer (Bachman & Palmer, 1996) prefer the term ability to proficiency and thus speak of language ability as the overarching concept. The latter term is more consistent, they argue, with our understanding that the specific
components of language ability must be assessed separately. Others, such as the American
Council on Teaching Foreign Languages (ACTFL), still prefer the term proficiency because it
connotes more of a holistic, unitary trait view of language ability (Lowe, 1988). Most current
views accept the ability argument and therefore strive to specify and assess the many
components of language. For the purposes of consistency in this module, the term proficiency
will nevertheless be retained, with the above caveat.

How you view language will make a difference in how you assess language proficiency. After
breaking language competence down into subsets of listening, speaking, reading, and writing,
each performance mode can be examined on a continuum of linguistic units: phonology (pronunciation) and orthography (spelling), words (lexicon), sentences (grammar), discourse, and pragmatic (sociolinguistic, contextual, functional, cultural) features of language.

How will the TOEFL sample from all these possibilities? Oral production tests can be tests of
overall conversational fluency or pronunciation of a particular subset of phonology, and can
take the form of imitation, structured responses, or free responses. Listening comprehension tests can concentrate on a particular feature of language or on overall listening for general meaning. Tests of reading can cover the range of language units and can aim to test comprehension of long or short passages, single sentences, or even phrases and words. Writing tests can take on an open-ended form with free composition, or be structured to elicit
anything from correct spelling to discourse-level competence. Are you overwhelmed yet?

From the sea of potential performance modes that could be sampled in a test, the developer
must select a subset on some systematic basis. To make a very long story short (and leaving
out numerous controversies), the TOEFL had for many years included three types of
performance in its organizational specifications: listening, structure, and reading, all of which
tested comprehension through standard multiple-choice tasks.

In 1996 a major step was taken to include written production in the computer-based TOEFL by adding a slightly modified version of the already existing Test of Written English (TWE). In doing so, some face validity and content validity were improved, along with, of course, a significant increase in administrative expense! Each of these four major sections is capsulized in the box below (adapted from the description of the current computer-based TOEFL at www.toefl.org). Such descriptions are not, strictly speaking, specifications, which are kept confidential by ETS. Nevertheless, they can give a sense of many of the constraints that are placed on the design of actual TOEFL specifications.
TOEFL specifications

(B) The designing of the test specs for the ESLPT was a somewhat simpler task because the
purpose is placement and the construct validation of the test consisted of an examination of
the content of the ESL courses. In fact, in a recent revision of the ESLPT (Imao et al., 2000; Imao, 2001), content validity (coupled with its attendant face validity) was the central
theoretical issue to be considered. The major issue centred on designing practical and reliable
tasks and item response formats. Having established the importance of designing ESLPT tasks
that simulated classroom tasks used in the courses, the designers ultimately specified two
writing production tasks (one a response to an essay that students read, and the other a
summary of another essay) and one multiple-choice grammar-editing task. These specifications mirrored the reading-based, process writing approach used in the courses.
(C) Specifications for the GET arose out of the perceived need to provide a threshold of
acceptable writing ability for all prospective graduate students at SFSU, both native and non-
native speakers of English. The specifications for the GET are the skills of writing
grammatically and rhetorically acceptable prose on a topic of some interest, with clearly
produced organization of ideas and logical development. The GET is a direct test of writing
ability in which test-takers must, in a two-hour time period, write an essay on a given topic.

3. Design, select, and arrange test tasks/items.
Once specifications for a standardized test have been stipulated, the sometimes never-ending
task of designing, selecting, and arranging items begins. The specs act much like a blueprint
in determining the number and types of items to be created. Let's look at the three examples.
(A) TOEFL test design specifies that each item be coded for content and statistical characteristics. Content coding ensures that each examinee will receive test questions that assess a variety of skills (for example, comprehending the main idea of a reading or understanding inferences) and cover a variety of subject matter without unduly biasing the content toward a subset of test-takers (for example, in the listening section involving an academic lecture, the content must be universal enough for students from many different academic fields of study). Statistical characteristics, including the IRT equivalents of estimates of item facility (IF) and the ability of an item to discriminate (ID) between higher and lower ability levels, are also coded. Items are then designed by a team who select and adapt items solicited from a bank of items that have been "deposited" by freelance writers and ETS staff. Probes for the reading section, for example, are usually excerpts from authentic general or academic reading that are edited for linguistic difficulty, culture bias, or other topic biases. Items are designed to test overall comprehension, certain specific information, and inference.
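Although the actual ETS coding scheme and item bank are confidential, a minimal sketch of how an item-bank record might pair content codes with pilot statistics is given below (in Python; the field names and values are illustrative assumptions, not the ETS format):

    from dataclasses import dataclass, field

    @dataclass
    class BankItem:
        """A hypothetical item-bank record pairing content codes with pilot statistics."""
        item_id: str
        section: str                # e.g., "reading", "listening", "structure"
        skill_code: str             # e.g., "main_idea", "inference", "stated_detail"
        topic_domain: str           # kept broad to avoid biasing toward one field of study
        item_facility: float        # proportion of pilot test-takers answering correctly
        item_discrimination: float  # how well the item separates high from low scorers
        distractor_counts: dict = field(default_factory=dict)  # option -> times chosen

    item = BankItem(
        item_id="R-0413",
        section="reading",
        skill_code="inference",
        topic_domain="general_science",
        item_facility=0.62,
        item_discrimination=0.41,
        distractor_counts={"A": 12, "B": 62, "C": 18, "D": 8},
    )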
Consider the following sample of a reading selection and ten items based on it, from a practice TOEFL (Phillips, 2001, pp. 423-424):

As you can see, items target the assessment of comprehension of the main idea (item #11),
stated details (#17, 19), unstated details (#12, 15, 18), implied details (#14, 20), and
vocabulary in context (#13, 16). An argument could be made about the cultural schemata
implied in a passage about pirate ships, and you could engage in an "angels on the head of a
pin" argument about the importance of picking certain vocabulary for emphasis, but every
test item is a sample of a larger domain, and each of these fulfills its designated specification.
Before any such items are released into a form of the TOEFL (or any validated standardized
test), they are piloted and scientifically selected to meet difficulty specifications within each
subsection, section, and the test overall. Furthermore, those items are also selected to meet
a desired discrimination index. Both of these' indices are important considerations in the
design of a computer-adaptive test, where performance on one item determines the next one
to be presented to the test-taker
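The exact item-selection algorithm is proprietary, but the basic idea of a computer-adaptive test, choosing a next item whose difficulty matches the current estimate of the test-taker's ability and updating that estimate after each response, can be roughly sketched as follows (in Python; the item pool, ability scale, and update rule are deliberately simplified assumptions, not the operational procedure):

    def pick_next_item(pool, ability_estimate):
        """Choose the unused item whose difficulty is closest to the current ability estimate."""
        unused = [item for item in pool if not item["used"]]
        return min(unused, key=lambda item: abs(item["difficulty"] - ability_estimate))

    def update_ability(ability_estimate, correct, step=0.5):
        """Crude update: move the estimate up after a correct answer, down after an incorrect one."""
        return ability_estimate + step if correct else ability_estimate - step

    # Toy item pool with difficulties on an arbitrary scale.
    pool = [{"id": i, "difficulty": d, "used": False}
            for i, d in enumerate([-1.5, -0.5, 0.0, 0.7, 1.4])]

    ability = 0.0
    for _ in range(3):
        item = pick_next_item(pool, ability)
        item["used"] = True
        answered_correctly = item["difficulty"] <= ability  # stand-in for a real response
        ability = update_ability(ability, answered_correctly)

Operational adaptive testing replaces the crude update above with IRT-based ability estimation, but the selection-and-update cycle is the same.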
(B) The selection of items in the ESLPT entailed two entirely different processes. In the two subsections of the test that elicit writing performance (summary of reading; response to reading), the main hurdles were (a) selecting appropriate passages for test-takers to read, (b)
providing appropriate prompts, and (c) processing data from pilot testing. Passages have to
conform to standards of content validity by being within the genre and the difficulty of the
material used in the courses. The prompt in each case (the section asking for a summary and
the section asking for a response) has to be tailored to fit the passage, but a general template
is used.
In the multiple-choice editing test that seeks to test grammar proofreading ability, the first
and easier task is to choose an appropriate essay within which to embed errors. The more
complicated task is to embed a specified number of errors from a previously determined
taxonomy of error categories. Those error categories came directly from student errors as
perceived by their teachers (verb tenses, verb agreement, logical connectors, articles, etc.).
The distractors for each item were selected from actual errors that students make. Pilot versions were then item-analyzed for difficulty and discrimination indices, after which final assembly of items could occur.
(C) The GET prompts are designed by a faculty committee of examiners who are specialists in
the field of university academic writing. The assumption is made that the topics are universally
appealing and capable of yielding the intended product of an essay that requires an organized
logical argument and conclusion. No pilot testing of prompts is conducted. The conditions for
administration remain constant: two-hour time limit, sit-down context, paper and pencil,
closed-book format.

4. Make appropriate evaluations of different kinds of items.
In an earlier session, the concepts of item facility (IF), item discrimination (ID), and distractor analysis were introduced. As the discussion there showed, such calculations provide useful information for classroom tests, but sometimes the time and effort involved in performing them may not be practical, especially if the classroom-based test is a one-time test. Yet for a
standardized multiple-choice test that is designed to be marketed commercially, and/or
administered a number of times, and/or administered in a different form, these indices are a
must.
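As a reminder of what those indices involve, here is a minimal sketch of how IF and ID might be computed from pilot-test responses (in Python; the upper/lower-group split of 25 percent and the data are illustrative assumptions, not a prescribed procedure):

    def item_facility(responses):
        """IF: proportion of test-takers answering the item correctly (1 = correct, 0 = incorrect)."""
        return sum(responses) / len(responses)

    def item_discrimination(responses, total_scores, group_fraction=0.25):
        """ID: IF in the top-scoring group minus IF in the bottom-scoring group."""
        ranked = [r for _, r in sorted(zip(total_scores, responses), reverse=True)]
        k = max(1, int(len(ranked) * group_fraction))
        upper, lower = ranked[:k], ranked[-k:]
        return sum(upper) / k - sum(lower) / k

    # Responses to one item, plus each test-taker's total test score.
    responses = [1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0]
    totals    = [38, 35, 12, 30, 15, 33, 29, 10, 14, 31, 27, 18]
    print(item_facility(responses), item_discrimination(responses, totals))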
For other types of response formats, namely, production responses, different forms of
evaluation become important. The principles of practicality and reliability are prominent,
along with the concept of facility. Practicality issues in such items include the clarity of
directions, timing of the test, ease of administration, and how much time is required to score
responses. Reliability is a major player in instances where more than one scorer is employed, and to a lesser extent when a single scorer has to evaluate tests over long spans of time that could lead to deterioration of standards. Facility is also a key to the validity and success of an item type: unclear directions, complex language, obscure or fuzzy topics, and culturally biased information may lead to a higher level of difficulty than one desires.
(A) The IF, ID, and efficiency statistics of the multiple-choice items of current forms of the
TOEFL are not publicly available information. For reasons of security and protection of
patented, copyrighted materials, they must remain behind the closed doors of the ETS
development staff. Those statistics remain of paramount importance in the ongoing
production of TOEFL items and forms and are the foundation stones for demonstrating the equatability of forms. Statistical indices on retired forms of the TOEFL are available on request
for research purposes.
The essay portion of the TOEFL undergoes scrutiny for its practicality, reliability, and facility.
Special attention is given to reliability since two human scorers must read each essay, and
every time a third reader becomes necessary (when the two readers disagree by more than
one point), it costs ETS more money.
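That two-reader procedure, with a third reader called in only when the first two diverge, might be sketched as follows (in Python; the one-point threshold follows the description above, while the averaging and adjudication rules are illustrative assumptions rather than ETS's documented procedure):

    def score_essay(reader1, reader2, third_reader=None, max_gap=1):
        """Average two ratings; require a third rating when the first two diverge by more than max_gap."""
        if abs(reader1 - reader2) <= max_gap:
            return (reader1 + reader2) / 2
        if third_reader is None:
            raise ValueError("Readers disagree by more than one point; a third rating is needed.")
        # One possible resolution: average the third rating with the closer of the first two.
        closer = min((reader1, reader2), key=lambda r: abs(r - third_reader))
        return (closer + third_reader) / 2

    print(score_essay(4, 5))                  # 4.5, no adjudication needed
    print(score_essay(3, 5, third_reader=5))  # 5.0 after adjudication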
(B) In the case of the open-ended responses on the two written tasks on the ESLPT, a similar
set of judgments must be made. Some evaluative impressions of the effectiveness of prompts
and passages are gained from informal student and scorer feedback. In the developmental
stage of the newly revised ESLPT, both types of feedback were formally solicited through
questionnaires and interviews. That information proved to be invaluable in the revision of
prompts and stimulus reading passages. After each administration now, the teacher-scorers
provide informal feedback on their perceptions of the effectiveness of the prompts and
readings.

The multiple-choice editing passage showed the value of statistical findings in determining
the usefulness of items and pointing administrators toward revisions. Following is a sample
of the format used:
Multiple-choice editing passage (options A-D mark four segments of each sentence, one of which contains an error):

(1) Ever since supermarkets first appeared, they have been take over the world. (A B C D)

(2) Supermarkets have changed people's life styles yet and at the same time, changes in people's life styles have encouraged the opening of supermarkets. (A B C D)

The task was to locate the error in each sentence. Statistical tests on the experimental version of this section revealed that a number of the 45 items were of zero IF (no difficulty whatsoever) and of inconsequential discrimination power (some IDs of .15 and lower). Many
distractors were of no consequence because they lured no one. Such information led to a
revision of numerous items and their options, eventually strengthening the effectiveness of
this section.
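A distractor analysis of the kind described here simply tallies how often each option was chosen; a minimal sketch (in Python, with made-up responses to a single item) might look like this:

    from collections import Counter

    def distractor_counts(choices, options=("A", "B", "C", "D")):
        """Tally how often each option was chosen; a distractor chosen by no one 'lures' no one."""
        counts = Counter(choices)
        return {option: counts.get(option, 0) for option in options}

    # Hypothetical responses to one editing item whose keyed answer is "C".
    choices = ["C", "C", "B", "C", "C", "C", "B", "C", "C", "C", "C", "C"]
    print(distractor_counts(choices))  # {'A': 0, 'B': 2, 'C': 10, 'D': 0} -> A and D attract no one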
(C) The GET, like its written counterparts in the ESLPT, is a test of written ability with a single prompt, and therefore questions of practicality and facility are also largely observational. No data are collected from students on their perceptions, but the scorers have an opportunity to reflect on the validity of a given topic. After one sitting, a topic is retired, which eliminates the possibility of improving a specific topic, but future framing of topics might benefit from scorers' evaluations. Inter-rater reliability is checked periodically, and reader training sessions are modified if too many instances of unreliability appear.
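One common way such a periodic check can be run is to compute exact and adjacent agreement rates between the two readers' scores, as in the sketch below (in Python; this is an illustrative assumption about the procedure, not the GET's documented method):

    def agreement_rates(scores1, scores2):
        """Proportion of essays on which two readers agree exactly, and within one point."""
        pairs = list(zip(scores1, scores2))
        exact = sum(a == b for a, b in pairs) / len(pairs)
        adjacent = sum(abs(a - b) <= 1 for a, b in pairs) / len(pairs)
        return exact, adjacent

    reader1 = [3, 4, 2, 4, 3, 1, 4, 2]
    reader2 = [3, 3, 2, 4, 4, 2, 4, 1]
    print(agreement_rates(reader1, reader2))  # (0.5, 1.0): half exact agreement, all within one point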
5. Specify scoring procedures and reporting formats.
A systematic assembly of test items in pre-selected arrangements and sequences, all of which
are validated to conform to an expected difficulty level, should yield a test that can then be
scored accurately and reported back to test-takers and institutions efficiently.
(A) Of the three tests being exemplified here, the most straightforward scoring procedure comes from the TOEFL, the one with the most complex issues of validation, design, and assembly. Scores are calculated and reported for (a) each of the three sections of the TOEFL (the essay ratings are combined with the Structure and Written Expression score), (b) a total score (range 40 to 300 on the computer-based TOEFL and 310 to 677 on the paper-and-pencil TOEFL), and (c) a separate score for the Essay (range 0 to 6), all provided on the examinee's score record (see simulation of a score record on page 80).

The rating scale for the essay is virtually the same one that is used for the Test of Written
English (see Chapter 9 for details), with a "zero" level added for no response, copying the
topic only, writing completely off topic, or not writing in English.
(B) The ESLPT reports a score for each of the essay sections, but the rating scale differs between them because in one case the objective is to write a summary, and in the other to write a response to a reading. Each essay is read by two readers; if there is a discrepancy of more than one level, a third reader resolves the difference. The editing section is machine-scanned and scored, with a total score and part scores for each of the grammatical and rhetorical
sections. From these data, placement administrators have adequate information to make
placements, and teachers receive some diagnostic information on each student in their
classes. Students do not receive their essays back.
(C) Each GET is read by two trained readers, who give a score between 1 and 4 according to
the following scale:
Graduate Essay Test: Scoring Guide

The two readers' scores are added to yield a total possible score of 2 to 8. Test administrators
recommend a score of 6 as the threshold for allowing a student to pursue graduate-level
courses. Anything below that is accompanied by a recommendation that the student either
repeat the test or take a "remedial" course in graduate writing offered in one of several
different departments. Students receive neither their essays nor any feedback other than the
final score.
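Expressed as a small worked example, the decision rule just described (two ratings of 1 to 4 summed, with 6 as the recommended threshold) might look like this (in Python; the recommendation strings are paraphrases, not official wording):

    def get_decision(reader1, reader2, threshold=6):
        """Sum two 1-4 ratings (total 2-8) and compare the total against the recommended threshold."""
        total = reader1 + reader2
        if total >= threshold:
            return total, "eligible to pursue graduate-level courses"
        return total, "recommend repeating the test or taking the graduate writing course"

    print(get_decision(3, 4))  # (7, 'eligible to pursue graduate-level courses')
    print(get_decision(2, 3))  # (5, 'recommend repeating the test or taking the graduate writing course')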
6. Perform ongoing construct validation studies.
From the above discussion, it should be clear that no standardized instrument is expected to
be used repeatedly without a rigorous program of ongoing construct validation. Any
standardized test, once developed, must be accompanied by systematic periodic
corroboration of its effectiveness and by steps toward its improvement. This is especially true of tests that are produced in equated forms; that is, forms must be reliable across tests
such that a score on a subsequent form of a test has the same validity and interpretability as
its original.
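Operational equating relies on IRT-based procedures, but the underlying idea, rescaling scores on a new form so that they are comparable to scores on a reference form, can be illustrated with a simple linear equating sketch (in Python; the score data are invented, and real equating programs use far more elaborate designs):

    from statistics import mean, stdev

    def linear_equate(new_form_scores, ref_form_scores):
        """Return a function mapping new-form scores onto the reference form's scale
        by matching the two score distributions' means and standard deviations."""
        slope = stdev(ref_form_scores) / stdev(new_form_scores)
        intercept = mean(ref_form_scores) - slope * mean(new_form_scores)
        return lambda x: slope * x + intercept

    # Invented scores from comparable groups taking the reference form and a new form.
    reference = [510, 540, 475, 600, 520, 495, 555, 530]
    new_form  = [48, 55, 41, 66, 50, 45, 58, 52]
    equate = linear_equate(new_form, reference)
    print(round(equate(50)))  # a raw 50 on the new form, expressed on the reference scale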
(A) The TOEFL program, in cooperation with other tests produced by ETS, has an impressive
program of research. Over the years dozens of TOEFL-sponsored research studies have
appeared in the TOEFL Monograph Series. An early example of such a study was the seminal Duran et al. (1985) study, TOEFL from a Communicative Viewpoint on Language Proficiency, which examined the content characteristics of the TOEFL from a communicative perspective based on current research in applied linguistics and language proficiency assessment. More recent studies (such as Ginther, 2001; Leacock & Chodorow, 2001; Powers et al., 2002)
demonstrate an impressive array of scrutiny.
(B) For approximately 20 years, the ESLPT appeared to be placing students reliably by means
of an essay and a multiple-choice grammar and vocabulary test. Over the years the security
of the latter became suspect, and the faculty administrators wished to see some content
validity achieved in the process. In the year 2000 that process began with a group of graduate students (Imao et al., 2000) in consultation with faculty members, and continued to fruition in the form of a new ESLPT, reported in Imao (2002). The development of the new ESLPT involved a lengthy process of both content and construct validation, along with such practical issues as scoring the written sections and designing a machine-scorable multiple-choice answer sheet.
The process of ongoing validation will no doubt continue as new forms of the editing section
are created and as new prompts and reading passages are created for the writing section.
Such a validation process should also include consistent checks on placement accuracy and
on face validity.
(C) At this time there is little or no research to validate the GET itself. For its construct validation, its administrators rely on a stockpile of research on university-level academic writing tests such as the TWE. The holistic scoring rubric and the topics and administrative conditions of the GET are to some extent patterned after those of the TWE. In recent years some criticism of the GET has come from international test-takers (Hosoya, 2001) who posit that the topics and time limits of the GET, among other factors, work to the disadvantage of writers whose native language is not English. These validity issues remain to be fully addressed in a comprehensive research study.
STANDARDIZED LANGUAGE PROFICIENCY TESTING
Tests of language proficiency presuppose a comprehensive definition of the specific
competencies that comprise overall language ability. The specifications for the TOEFL
provided an illustration of an operational definition of ability for assessment purposes.
This is not the only way to conceptualize proficiency. Swain (1990) offered a
multidimensional view of proficiency assessment by referring to three linguistic traits
(grammar, discourse, and sociolinguistics) that can be assessed by means of oral, multiple-
choice, and written responses (see Table 4.1). Swain's conception was not meant to be an
exhaustive analysis of ability, but rather to serve as an operational framework for
constructing proficiency assessments.
Another definition and conceptualization of proficiency is suggested by the ACTFL association, mentioned earlier. ACTFL takes a holistic and more unitary view of proficiency
in describing four levels: superior, advanced, intermediate, and novice. Within each level,
descriptions of listening, speaking, reading, and writing are provided as guidelines for
assessment. For example, the ACTFL Guidelines describe the superior level of speaking as
follows:

The other three ACTFL levels use the same parameters in describing progressively lower
proficiencies across all four skills. Such taxonomies have the advantage of considering a
number of functions of linguistic discourse, but the disadvantage, at the lower levels, of overly
emphasizing test-takers' deficiencies.

FOUR STANDARDIZED LANGUAGE PROFICIENCY TESTS


We now turn to some of the better-known standardized tests of overall language ability, or proficiency, to examine some of the typical formats used in commercially available tests. We will not look at standardized tests of other specific skills here, but that should not lead you to think, by any means, that proficiency is the only kind of test in the field that is standardized. Standardized oral production tests such as the Test of Spoken English (TSE), the Oral Proficiency Interview (OPI), and PhonePass®, along with the Test of Written English (TWE), are examples.
Four commercially produced standardized tests of English language proficiency are described
briefly in this section: the TOEFL, the Michigan English Language Assessment Battery (MELAB),

the International English Language Testing System (IELTS) and the Test of English for
International Communication (TOEIC®). Use the following questions to help you evaluate
these four tests and their subsections:
1. What item types are included?
2. How practical and reliable does each subsection of each test appear to be?
3. Do the item types and tasks appropriately represent a conceptualization of language
proficiency (ability)? That is, can you evaluate their construct validity?
4. Do the tasks achieve face validity?
5. Are the tasks authentic?
6. Is there some washback potential in the tasks?

The construction of a valid standardized test is no minor accomplishment, whether the instrument is large- or small-scale. The designing of specifications alone, as this session illustrates, requires a sophisticated process of construct validation coupled with
considerations of practicality. Then, the construction of items and scoring/interpretation
procedures may require a lengthy period of trial and error with prototypes of the final form
of the test. With painstaking attention to all the details of construction, the end product can
result in a cost-effective, timesaving, accurate instrument. Your use of the results of such
assessments can provide useful data on learners' language abilities.
