

State-of-the-Art Review
Language testing and assessment (Part I)
J Charles Alderson and Jayanti Banerjee, Lancaster University, UK

Introduction

This is the third in a series of State-of-the-Art review articles in language testing in this journal, the first having been written by Alan Davies in 1978 and the second by Peter Skehan in 1988/1989. Skehan remarked that testing had witnessed an explosion of interest, research and publications in the ten years since the first review article, and several commentators have since made similar remarks. We can only concur, and for quantitative corroboration would refer the reader to Alderson (1991) and to the International Language Testing Association (ILTA) Bibliography 1990-1999 (Banerjee et al., 1999). In the latter bibliography, there are 866 entries, divided into 15 sections, from Testing Listening to Ethics and Standards. The field has become so large and so active that it is virtually impossible to do justice to it, even in a multi-part 'State-of-the-Art' review like this, and it is changing so rapidly that any prediction of trends is likely to be outdated before it is printed.

In this review, therefore, we not only try to avoid anything other than rather bland predictions, we also acknowledge the partiality of our choice of topics and trends, as well, necessarily, of our selection of publications. We have tried to represent the field fairly, but have tended to concentrate on articles rather than books, on the grounds that these are more likely to reflect the state of the art than are full-length books. We have also referred to other similar reviews published in the last 10 years or so, where we judged it relevant. We have usually begun our review with articles printed in or around 1988, the date of the last review, aware that this is now 13 years ago, but also conscious of the need to cover the period since the last major review in this journal. However, we have also, where we felt it appropriate, included articles published somewhat earlier.

This review is divided into two parts, each of roughly equal length. The bibliography for works referred to in each part is published with the relevant part, rather than in a complete bibliography at the end. Therefore, readers wishing to have a complete bibliography will have to put both parts together.

The rationale for the organisation of this review is that we wished to start with a relatively new concern in language testing, at least as far as publication of empirical research is concerned, before moving on to more traditional ongoing concerns and ending with aspects of testing not often addressed in international reviews, and remaining problems. Thus, we begin with an account of research into washback, which then leads us to ethics, politics and standards. We then examine trends in testing on a national level, followed by testing for specific purposes. Next, we survey developments in computer-based testing before moving on to look at self-assessment and alternative assessment. Finally in this first part, we survey a relatively new area: the assessment of young learners.

In the second part, we address new concerns in test validity theory, which argues for the inclusion of test consequences in what is now generally referred to as a unified theory of construct validity. Thereafter we deal with issues in test validation and test development, and examine in some detail more traditional research into the nature of the constructs (reading, listening, grammatical abilities, etc.) that underlie tests. Finally we discuss a number of remaining controversies and puzzles that we call, following McNamara (1995), 'Pandora's Boxes'.

We are very grateful to many colleagues for their assistance in helping us draw up this review, but in particular we would like to acknowledge the help, advice and support of the Lancaster Language Testing Research Group, above all of Dianne Wall and Caroline Clapham, for their invaluable and insightful comments. All faults that remain are entirely our responsibility.

J Charles Alderson is Professor of Linguistics and English Language Education at Lancaster University. He holds an MA in German and French from Oxford University and a PhD in Applied Linguistics from Edinburgh University. He is co-editor of the journal Language Testing (Edward Arnold), and co-editor of the Cambridge Language Assessment Series (CUP), and has published many books and articles on language testing, reading in a foreign language, and evaluation of language education.

Jayanti Banerjee is a PhD student in the Department of Linguistics and Modern English Language at Lancaster University. She has been involved in a number of test development and research projects and has taught on introductory testing courses. She has also been involved in teaching English for Academic Purposes (EAP) at Lancaster University. Her research interests include the teaching and assessment of EAP as well as qualitative research methods. She is particularly interested in issues related to the interpretation and use of test scores.

Lang. Teach. 34, 213-236. DOI: 10.1017/S0261444801001707. Printed in the United Kingdom. © 2001 Cambridge University Press.


Washback

The term 'washback' refers to the impact that tests have on teaching and learning. Such impact is usually seen as being negative: tests are said to force teachers to do things they do not necessarily wish to do. However, some have argued that tests are potentially also 'levers for change' in language education: the argument being that if a bad test has negative impact, a good test should or could have positive washback (Alderson, 1986b; Pearson, 1988).

Interestingly, Skehan, in the last review of the State of the Art in Language Testing (Skehan, 1988, 1989), makes only fleeting reference to washback, and even then, only to assertions that communicative language testing and criterion-referenced testing are likely to lead to better washback - with no evidence cited. Nor is research into washback signalled as a likely important future development within the language testing field. Let those who predict future trends do so at their peril!

In the Annual Review of Applied Linguistics series, equally, the only substantial reference to washback is by McNamara (1998) in a chapter entitled 'Policy and social considerations in language assessment'. Even the chapter entitled 'Developments in language testing' by Douglas (1995) makes no reference to washback. Given the importance assigned to consequential validity and issues of consequences in the general assessment literature, especially since the popularisation of the Messickian view of an all-encompassing construct validity (see Part Two), this is remarkable, and shows how much the field has changed in the last six or seven years. However, a recent review of validity theory (Chapelle, 1999) makes some reference to washback under construct validity, reflecting the increased interest in the topic.

Although the notion that tests have impact on teaching and learning has a long history, there was surprisingly little empirical evidence to support such notions until recently. Alderson and Wall (1993) were among the first to problematise the notion of test washback in language education, and to call for research into the impact of tests. They list a number of 'Washback Hypotheses' in an attempt to develop a research agenda. One Washback Hypothesis, for example, is that tests will have washback on what teachers teach (the content agenda), whereas a separate washback hypothesis might posit that tests also have impact on how teachers teach (the methodology agenda). Alderson and Wall also hypothesise that high-stakes tests - tests with important consequences - would have more impact than low-stakes tests. They urge researchers to broaden the scope of their enquiry, to include not only attitude measurement and teachers' accounts of washback but also classroom observation. They argue that the study of washback would benefit from a better understanding of student motivation and of the nature of innovation in education, since the notion that tests will automatically have an impact on the curriculum and on learning has been advocated atheoretically. Following on from this suggestion, Wall (1996) reviews key concepts in the field of educational innovation and shows how they might be relevant to an understanding of whether and how tests have washback. Lynch and Davidson (1994) describe an approach to criterion-referenced testing which involves practising teachers in the translation of curricular goals into test specifications. They claim that this approach can provide a link between the curriculum, teacher experience and tests and can therefore, presumably, improve the impact of tests on teaching.

Recently, a number of empirical washback studies have been carried out (see, for example, Khaniyah, 1990a, 1990b; Shohamy, 1993; Shohamy et al., 1996; Wall & Alderson, 1993; Watanabe, 1996; Cheng, 1997) in a variety of settings. There is general agreement among these that high-stakes tests do indeed impact on the content of teaching and on the nature of the teaching materials. However, the evidence that they impact on how teachers teach is much scarcer and more complicated. Wall and Alderson (1993) found no evidence for any change in teachers' methodologies before and after the introduction of a new style school-leaving examination in English in Sri Lanka. Alderson and Hamp-Lyons (1996) show that teachers may indeed change the way they teach when teaching towards a test (in this case the TOEFL, the Test of English as a Foreign Language), but they also show that the nature of the change and the methodology adopted varies from teacher to teacher, a conclusion supported by Watanabe's 1996 findings. Alderson and Hamp-Lyons argue that it is not enough to describe whether and how teachers might adapt their teaching and the content of their teaching to suit the test. They believe that it is important to explain why teachers do what they do, if we are to understand the washback effect. Alderson (1998) suggests that testing researchers should explore the literature on teacher cognition and teacher thinking to understand better what motivates teacher behaviour. Cheng (1997) shows that teachers only adapt their methodology slowly, reluctantly and with difficulty, and suggests that this may relate to the constraints on teachers and teaching from the educational system generally. Shohamy et al. (1996) show that the nature of washback varies according to factors such as the status of the language being tested, and the uses of the test. In short, the phenomenon of washback is slowly coming to be recognised as a complex matter, influenced by many factors other than simply the existence of a test or the nature of that test. Nevertheless, no major studies have yet been carried out into the effect of test preparation on test performance, which is remarkable, given the prevalence, for high-stakes tests at least, of test preparation courses.

Hahn et al. (1989) conducted a small-scale study of the effects on beginning students of German of whether or not they were graded on their oral performance in the first six months of instruction. Although no effects on developing oral proficiency were found, attitudes in the two groups were different: those who had been graded considered the experience stressful and unproductive, whereas the group that had not been graded would like to have been graded. Moeller and Reschke (1993) also found no effect whatsoever of the formal scoring of classroom performance on student proficiency or achievement. More studies are needed of learners' views of tests and test preparation.

There are in fact remarkably few studies of the impact of tests on motivation or of motivation on test preparation or test performance. A recent exception is Watanabe (2001). Watanabe calls his study a hypothesis-generating exercise, acknowledging that the relationship between motivation and test preparation is likely to be complex. He interviewed Japanese university students about their test preparation practices. He found that attitudes to test preparation varied and that impact was far from uniform, although those exams which the students thought most important for their future university careers usually had more impact than those perceived as less critical. Thus, if an examination for a university which was the student's first choice contained grammar-translation tasks, the students reported that they had studied grammar-translation exercises, whereas if a similar examination was offered by a university which was their second choice, they were much less likely to study translation exercises. Interestingly, students studied in particular those parts of the exam that they perceived to be more difficult, and more discriminating. Conversely, those sections perceived to be easy had less impact on their test preparation practices: far fewer students reported preparing for easy or non-discriminating exam sections. However, those students who perceived an exam section to be too difficult did not bother preparing for it. Watanabe concludes that washback is caused by the interplay between the test and the test taker in a complex manner, and he emphasises that what may be most important is not the objective difficulty of the test, but the students' perception of difficulty.

Wall (2000) provides a very useful overview and update of studies of the impact of tests on teaching, from the field of general education as well as in language education. She summarises research findings which show that test design is only one of the factors affecting washback, and lists as factors influencing the nature of test washback:

teacher ability, teacher understanding of the test and the approach it was based on, classroom conditions, lack of resources, management practices within the school... the status of the subject within the curriculum, feedback mechanisms between the schools and the testing agency, teacher style, commitment and willingness to innovate, teacher background, the general social and political context, the time that has elapsed since the test was introduced, and the role of publishers in materials design and teacher training (2000: 502).

In other words, test washback is far from being simply a technical matter of design and format, and needs to be understood within a much broader framework. Wall suggests that such a framework might usefully come from studies and theories of educational change and innovation, and she summarises the most important findings from these areas. She develops a framework derived from Henrichsen (1989), and owing something to the work of Hughes (1993) and Bailey (1996), and shows how such a framework might be applied to understanding better the causes and nature of washback. She makes a number of recommendations about the steps that test developers might take in the future in order to assess the amount of risk involved in attempting to bring about change through testing. These include assessing the feasibility of examination reform by studying the 'antecedent' conditions - what is increasingly referred to as a 'baseline study' (Weir & Roberts, 1994; Fekete et al., 1999); involving teachers at all stages of test development; ensuring the participation of other key stakeholders including policy-makers and key institutions; ensuring clarity and acceptability of test specifications, and clear exemplification of tests, tasks, and scoring criteria; full piloting of tests before implementation; regular monitoring and evaluation not only of test performance but also of classrooms; and an understanding that change takes time. Innovating through tests is not a quick fix if it is to be beneficial. 'Policy makers and test designers should not expect significant impact to occur immediately or in the form they intend. They should be aware that tests on their own will not have positive impact if the materials and practices they are based on have not been effective. They may, however, have negative impact and the situation must be monitored continuously to allow early intervention if it takes an undesirable turn' (2000: 507).

Similar considerations of the potential complexity of the impact of tests on teaching and learning should also inform research into the washback of existing tests. Clearly this is a rich field for further investigation. More sophisticated conceptual frameworks, which are slowly developing in the light of research findings and related studies into innovation, motivation theory and teacher thinking, are likely to provide better understanding of the reasons for washback and an explanation of how tests might be developed to contribute to the engineering of desirable change.

Ethics in language testing

Whilst Alderson (1997) and others have argued that testers have long been concerned with matters of
fairness (as expressed in their ongoing interest in validity and reliability), and that striving for fairness is an aspect of ethical behaviour, others have separated the issue of ethics from validity, as an essential part of the professionalising of language testing as a discipline (Davies, 1997). Messick (1994) argues that all testing involves making value judgements, and therefore language testing is open to a critical discussion of whose values are being represented and served; this in turn leads to a consideration of ethical conduct. Messick (1994, 1996) has redefined the scope of validity to include what he calls consequential validity - the consequences of test score interpretation and use. Hamp-Lyons (1997) argues that the notion of washback is too narrow and should be broadened to cover 'impact', defined as the effect of tests on society at large, not just on individuals or on the educational system. In this, she is expressing a concern that has grown in recent years with the political and related ethical issues which surround test use.

Both McNamara (1998) and Hamp-Lyons (1998) survey the emerging literature on the topic of ethics, and highlight the need for the development of language testing standards (see below). Both comment on a draft Code of Practice sponsored by the International Language Testing Association (ILTA, 1997), but where Hamp-Lyons sees it as a possible way forward, McNamara is more critical of what he calls its conservatism, and its inadequate acknowledgement of the force of current debates on the ethics of language testing. Davies (1997) argues that, since tests often have a prescriptive or normative role, their social consequences are potentially far-reaching. He argues for a professional morality among language testers, both to protect the profession's members, and to protect individuals from the misuse and abuse of tests. However, he also argues that the morality argument should not be taken too far, lest it lead to professional paralysis, or cynical manipulation of codes of practice.

Spolsky (1997) points out that tests and examinations have always been used as instruments of social policy and control, with the gate-keeping function of tests often justifying their existence. Shohamy (1997a) claims that language tests which contain content or employ methods which are not fair to all test-takers are not ethical, and discusses ways of reducing various sources of unfairness. She also argues that uses of tests which exercise control and manipulate stakeholders rather than providing information on proficiency levels are also unethical, and she advocates the development of 'critical language testing' (Shohamy, 1997b). She urges testers to exercise vigilance to ensure that the tests they develop are fair and democratic, however that may be defined. Lynch (1997) also argues for an ethical approach to language testing, and Rea-Dickins (1997) claims that taking full account of the views and interests of various stakeholder groups can democratise the testing process, promote fairness and therefore enhance an ethical approach.

A number of case studies have been presented recently which illustrate the use and misuse of language tests. Hawthorne (1997) describes two examples of the misuse of language tests: the use of the ACCESS test to regulate the flow of migrants into Australia, and the STEP test, allegedly designed to play a central role in the determining of asylum seekers' residential status. Unpublished language testing lore has many other examples, such as the misuse of the General Training component of the International English Language Testing System (IELTS) test with applicants for immigration to New Zealand, and the use of the TOEFL test and other proficiency tests to measure achievement and growth in instructional programmes (Alderson, 2001a). It is to be hoped that the new concern for ethical conduct will result in more accounts of such misuse.

Norton and Starfield (1997) claim, on the basis of a case study in South Africa, that unethical conduct is evident when second language students' academic writing is implicitly evaluated on linguistic grounds whilst ostensibly being assessed for the examinees' understanding of an academic subject. They argue that criteria for assessment should be made explicit and public if testers are to behave ethically. Elder (1997) investigates test bias, arguing that statistical procedures used to detect bias, such as DIF (Differential Item Functioning), are not neutral, since they do not question whether the criterion used to make group comparisons is fair and value-free. However, in her own study she concludes that what may appear to be bias may actually be construct-relevant variance, in that it indicates real differences in the ability being measured. One similar study was Chen and Henning (1985), who compared international students' performance on the UCLA (University of California, Los Angeles) English as a Second Language Placement Test, and discovered that a number of items were biased in favour of Spanish-speaking students and against Chinese-speaking students. The authors argue, however, that this 'bias' is relevant to the construct, since Spanish is typologically much closer to English, and speakers of Spanish would therefore be expected to find many aspects of English much easier to learn than speakers of Chinese would.
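
(To make the mechanics of such procedures concrete, the following minimal Python sketch - our own illustration, not code from any of the studies cited here - implements the widely used Mantel-Haenszel DIF statistic for a single dichotomously scored item; the group labels and data are hypothetical. Note that the matching criterion, the total score, is itself taken on trust, which is precisely the assumption Elder questions.)

```python
# Illustrative sketch (not from Elder, 1997): a minimal Mantel-Haenszel DIF
# check for one dichotomously scored item, standard library only. Examinees
# from a reference and a focal group are matched on total test score; the
# statistic compares the odds of a correct answer within each score band.

from collections import defaultdict

def mantel_haenszel_dif(rows):
    """rows: iterable of (group, total_score, item_correct) tuples, where
    group is 'ref' or 'foc' and item_correct is 0 or 1. Returns the MH
    common odds ratio: values near 1.0 suggest no DIF; values far from
    1.0 suggest the item favours one group at matched ability."""
    strata = defaultdict(lambda: {'ref': [0, 0], 'foc': [0, 0]})
    for group, score, correct in rows:
        strata[score][group][1 if correct else 0] += 1  # [wrong, right]
    num = den = 0.0
    for cells in strata.values():
        r_wrong, r_right = cells['ref']
        f_wrong, f_right = cells['foc']
        n = r_wrong + r_right + f_wrong + f_right
        if n == 0:
            continue
        num += r_right * f_wrong / n
        den += r_wrong * f_right / n
    return num / den if den else float('nan')

# Toy data: (group, total-score band, scored 0/1 on the item under scrutiny).
data = [('ref', 10, 1), ('ref', 10, 1), ('ref', 10, 0), ('foc', 10, 1),
        ('foc', 10, 0), ('foc', 10, 0), ('ref', 20, 1), ('ref', 20, 1),
        ('foc', 20, 1), ('foc', 20, 0)]
print(round(mantel_haenszel_dif(data), 2))  # > 1.0: item favours 'ref' here
```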
Reflecting this concern for ethical test use, Cumming (1995) reviews the use in four Canadian settings of assessment instruments to monitor learners' achievements or the effectiveness of programmes, and concludes that this is a misuse of such instruments, which should be used mainly for placing students onto programmes. Cumming (1994) asks whether use of language assessment instruments for immigrants to Canada facilitates their successful participation in Canadian society. He argues that
such a criterion should be used to evaluate whether assessment practices are able to overcome institutional or systemic barriers that immigrants may encounter, to account for the quality of language use that may be fundamental to specific aspects of Canadian life, and to prompt majority populations and institutions to better accommodate minority populations.

In the academic context, Pugsley (1988) problematises the assessment of the need of international students for pre- and in-sessional linguistic training in the light of test results. Decisions on whether a student should receive the benefit of additional language instruction are frequently made at the last minute, and in the light of competing demands on the student and on finance. Language training may be the victim of reduced funding, and many academics downplay the importance of language in academic performance. Often, teachers and students perceive students' language-related problems differently, and the question of the relevance or influence of the test result is then raised.

In another investigation of score interpretation and use, Yule (1990) analyses the performance of international teaching assistants, attempting to predict on the basis of TOEFL and Graduate Record Examinations Program scores whether the subjects should have received positive or negative recommendations to be teaching assistants. Students who received negative recommendations did indeed have lower scores on both tests than those with positive recommendations, but the relationship between subsequent grade point average (GPA) and positive recommendations only held during the first year of graduate study, not thereafter. The implications for making decisions about the award of teaching assistantships are discussed, and there are obvious ethical implications about the length of time a test score should be considered to be valid.

Both these case studies show the difficulty in interpreting language test results, and the complexity of the issues that surround gate-keeping decisions. They also emphasise that there must be a limit on what information one can ethically expect a language test to deliver, and what decisions test results can possibly inform.

Partly as a result of this heightened interest in ethics and the role of tests in society, McNamara (1998: 313) anticipates in the future:

1. a renewed awareness ... of the socially constructed nature of test performance and test score interpretation;
2. an awareness of the issues raised for testing in the context of English as an International Language;
3. a reconsideration of the social impact of technology in the delivery of tests;
4. an explicit consideration of issues of fairness at every stage of the language testing cycle, and
5. an expanded agenda for research on fairness accompanying test development.

He concludes that we are likely to see 'a broadening of the range of issues involved in language testing research, drawing on, at least, the following disciplines and fields: philosophy, especially ethics and the epistemology of social science; critical theory; policy analysis; program evaluation, and innovation theory' (loc. cit.).

The International Language Testing Association (ILTA) has recently developed a Code of Ethics (rather than finalising the draft Code of Practice referred to above), which is 'a set of principles which draws upon moral philosophy and strives to guide good professional conduct ... All professional codes should inform professional conscience and judgement ... Language testers are independent moral agents, and they are morally entitled to refuse to participate in procedures which would violate personal moral belief. Language testers accepting employment in positions where they foresee they may be called on to be involved in situations at variance with their beliefs have a responsibility to acquaint their employer or prospective employer with this fact. Employers and colleagues have a responsibility to ensure that such language testers are not discriminated against in their workplace.' [http://www.surrey.ac.uk/ELI/ltrfile/ltrframe.html]

These are indeed fine words, and the moral tone and intent of this Code is clear: testers should follow ethical practices, and have a moral responsibility to do so. Whether this Code of Ethics will be acceptable in the diverse environments in which language testers work around the world remains to be seen. Some might even see this as the imposition of Western cultural or even political values.

Politics

Tests are frequently used as instruments of educational policy, and they can be very powerful - as attested by Shohamy (2001a). Inevitably, therefore, testing - especially high-stakes testing - is a political activity, and recent publications in language testing have begun to address the relation between testing and politics, and the politics of testing, perhaps rather belatedly, given the tradition in educational assessment in general.

Brindley (1998, 2001) describes the political use of test-based assessment for reasons of public accountability, often in the context of national frameworks, standards or benchmarking. However, he points out that political rather than professional concerns are usually behind such initiatives, and are often in conflict with the desire for formative assessment to be closely related to the learning process. He addresses a number of political as well as technical and practical issues in the use of outcomes-based assessment for accountability purposes, and argues for the need for increased consultation between politicians and professionals and for research into the quality of associated instruments.

Politics can be defined as action, or activities, to
achieve power or to use power, and as beliefs about government, attitudes to power, and to the use of power. But this need not only be at the macro-political level of national or local government. National educational policy often involves innovations in testing in order to influence the curriculum, or in order to open up or restrict access to education and employment - and even, as we have seen in the cases of Australia and New Zealand, to influence immigration opportunities. But politics can also operate at lower levels, and can be a very important influence on test development and deployment. Politics can be seen as methods, tactics, intrigue, manoeuvring, within institutions which are themselves not political, but commercial, financial and educational. Indeed, Alderson (1999) argues that politics with a small 'p' includes not only institutional politics, but also personal politics: the motivation of the actors themselves and their agendas. And personal politics can influence both test development and test use.

Experience shows that, in most institutions, test development is a complex matter where individual and institutional motives interact and are interwoven. Yet the language testing literature has virtually never addressed such matters, until very recently. The literature, when it deals with test development matters at all, which is not very often, gives the impression that testing is basically a technical matter, concerned with the development of appropriate specifications, the creation and revision of appropriate test tasks and scoring criteria, and the analysis of results from piloting. But behind that facade is a complex interplay of personalities, of institutional agendas, and of intrigue. Although the macro-political level of testing is certainly important, one also needs to understand individual agendas, prejudices and motivations. However, this is an aspect of language testing which rarely sees the light of day, and which is part of the folklore of language testing. Exploring such issues is difficult because of the sensitivities involved, and it is difficult to publish any account of individual motivations for proposing or resisting test use and misuse. However, that does not make it any the less important. Alderson (2001a) has the title 'Testing is too important to be left to testers', and he argues that language testers need to take account of the different perspectives of various stakeholders: not only classroom teachers, who are all too often left out of consideration in test development, but also educational policy makers and politicians more generally. Although there are virtually no studies in this area at present (exceptions being Alderson et al., 2000a, Alderson, 1999, 2001b, and Shohamy, 2001), it is to be hoped that the next decade will see such matters discussed much more openly in language testing, since politics, ethics and fairness are rather closely related. Shohamy (2001b) describes and discusses the potential abuse of tests as instruments of power by authoritarian agencies, and argues for more democratic and accountable testing practice.

As an example of the influence of politics, it is instructive to consider Alderson (2001b). In Hungary translation is still used as a testing technique in the current school-leaving exams, and in the tests administered by the State Foreign Language Examinations Board (SFLEB), a quasi-commercial concern. Language teachers have long expressed their concern at the continued use of a test method which has uncertain validity (this has not been established to date in Hungary), where the marking of translations is felt to be subjective and highly variable, where no marking criteria or scales exist, and where the washback effect is felt to be negative (Fekete et al., 1999). New school-leaving examinations are due to be introduced in 2005, and the intention is not to use translation as a test method in future. However, many people, including teachers, and also Ministry officials, have resisted such a proposal, and it has recently been declared that the Minister himself will take the decision on this matter. Yet the Minister is not a language expert, knows nothing about language testing, and is therefore not technically competent to judge. Many suspect that the SFLEB, which wishes to retain translation, is lobbying the Minister to insist that translation be retained as a test method. Furthermore, many suspect that the SFLEB fears that foreign language examinations, which necessarily do not use translation as a test method, might take over the language test market in Hungary if translation is no longer required (by law) as a testing technique. Alderson (2001b) suggests that translation may be being used as a weapon in the cause of commercial protectionism.

Standards in testing

One area of increasing concern in language testing has been that of standards. The word 'standards' has various meanings in the literature, as the Task Force on Language Testing Standards set up by ILTA discovered (http://www.surrey.ac.uk/ELI/ilta/tfts_report.pdf). One common meaning, used by respondents to the ILTA survey, was that of procedures for ensuring quality, standards to be upheld or adhered to, as in 'codes of practice'. A second meaning was that of 'levels of proficiency' - 'what standard have you reached?' A related, third meaning was that contained in the phrase 'standardised test', which typically means a test whose difficulty level is known, which has been adequately piloted and analysed, and the results of which can be compared with those of a norming population: standardised tests are typically norm-referenced tests. In the latter context 'standards' is equivalent to 'norms'.
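
(A toy Python sketch - our own illustration, with hypothetical scores, not drawn from the ILTA report - may help make 'standards as norms' concrete: a norm-referenced result locates a candidate relative to a norming population rather than against a fixed criterion.)

```python
# Illustrative sketch (our example): what 'standards as norms' means in
# practice. A standardised, norm-referenced test reports a candidate's
# standing relative to a norming population, not mastery of a criterion.

from statistics import mean, stdev

# Hypothetical raw scores from a piloted norming population.
norming_sample = [31, 35, 38, 40, 42, 44, 45, 47, 49, 52, 54, 58]

def norm_referenced_report(raw_score, norms):
    mu, sigma = mean(norms), stdev(norms)
    z = (raw_score - mu) / sigma                       # standard score
    percentile = 100 * sum(s <= raw_score for s in norms) / len(norms)
    return z, percentile

z, pct = norm_referenced_report(49, norming_sample)
# The report is meaningful only relative to this norm group.
print(f"z = {z:+.2f}, percentile = {pct:.0f}")
```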
In recent years, language testing has sought to establish standards in the first sense (codes of practice) and to investigate whether tests are developed following appropriate professional procedures. Groot (1990) argues that the standardisation of procedures for test construction and validation is crucial to the comparability and exchangeability of test results across different education settings. Alderson and Buck (1993) and Alderson et al. (1995) describe widely accepted procedures for test development and report on a survey of the practice of British EFL examining boards. The results showed that current (in the early 1990s) practice was wanting. Practice and procedures among boards varied greatly, yet (unpublished) information was available which could have attested to the quality of examinations. Exam boards appeared not to feel obliged to follow or indeed to understand accepted procedures, nor did they appear to be accountable to the public for the quality of the tests they produced. Fulcher and Bamford (1996) argue that testing bodies in the USA conduct and report reliability and validity studies partly because of a legal requirement to ensure that all tests meet technical standards. They conclude that British examination boards should be subject to similar pressures of litigation on the grounds that their tests are unreliable, invalid or biased. In the German context, Kieweg (1999) makes a plea for common standards in examining EFL, claiming that within schools there is little or no discussion of appropriate methods of testing or of procedures for ensuring the quality of language tests.

Possibly as a result of such pressures and publications, things appear to be changing in Europe, an example of this being the publication of the ALTE (Association of Language Testers in Europe) Code of Practice, which is intended to ensure quality work in test development throughout Europe. 'In order to establish common levels of proficiency, tests must be comparable in terms of quality as well as level, and common standards need, therefore, to be applied to their production' (ALTE, 1998). To date, no mechanism exists for monitoring whether such standards are indeed being applied, but the mere existence of such a Code of Practice is a step forward in establishing the public accountability of test developers. Examples of how such standards are applied in practice are unfortunately rare, one exception being Alderson et al. (2000a), which presents an account of the development of new school-leaving examinations in Hungary.

Work on standards in the third sense, namely 'norms' for different test populations, was less commonly published in the last decade. Baker (1988) discusses the problems and procedures of producing test norms for bilingual school populations, challenging the usual a priori procedure of classifying populations into mother tongue and second language groups. Employing a range of statistical measures, Davidson (1994) examines the appropriacy of the use of a nationally standardised test normed on native English speakers, when used with non-English speaking students. Although he concludes that such a use of the test might be defensible statistically, additional measures might nevertheless be necessary for a population different from the norming group.

The meaning of 'standards' as 'levels of proficiency' or 'levels certified by public examinations' has been an issue for some considerable time, but has received new impetus, both with recent developments in Central Europe and with the publication of the Council of Europe's Common European Framework (Council of Europe, 2001). Work in the 1980s by West and Carroll led to the development of the English Speaking Union's Framework (Carroll & West, 1989), but this was not widely accepted, probably because of commercial rivalries within the British EFL examining industry. Milanovic (1995) reports on work towards the establishment of common levels of proficiency by ALTE, which has developed its own definitions of five levels of proficiency, based upon an inspection and comparison of the examinations of its members. This has had more acceptability, possibly because it was developed by cooperating examination bodies, rather than for competing bodies. However, such a framework of levels is still not seen by many as being neutral: it is, after all, associated with the main European commercial language test providers. The Council of Europe's Common European Framework, on the other hand, is not only seen as independent of any possible vested interest, it also has a long pedigree, originating over 25 years ago in the development of the Threshold level (van Ek, 1977), and thus broad acceptability across Europe is guaranteed. In addition, the scales of various aspects of language proficiency that are associated with the Framework have been extensively researched and validated by the Swiss Language Portfolio Project (North & Schneider, 1998).

de Jong (1992) predicted that international standards for language tests and assessment procedures, and internationally interpretable standards of proficiency, would be developed, with the effect that internationally comparable language tests would be established. In the 21st century, that prediction is coming true. It is now clear that the Common European Framework will become increasingly influential because of the growing need for international recognition of certificates in Europe, in order to guarantee educational and employment mobility. National language qualifications, be they provided by the state or by quasi-private organisations, presently vary in their standards - both quality standards and standards as levels. Yet international comparability of certificates has become an economic as well as an educational imperative, especially after the Bologna Declaration of 1999 (http://europa.eu.int/comm/education/socrates/erasmus/bologna.pdf), and the availability of a transparent, independent framework like the Common European Framework is crucial to
the attempt to establish a common scale of reference and comparison. Moreover, the Framework is not just a set of scales, it is also a compendium of what is known about language learning, language use and language proficiency. As an essential guide to syllabus construction, as well as to the development of test specifications and rating criteria, it is bound to be used for materials design and textbook production, as well as in teacher education. The Framework is also the anchor point for the European Language Portfolio, and for new diagnostic tests like DIALANG (see below).

The Framework is particularly relevant to countries in East and Central Europe, where many educational systems are currently revising their assessment procedures. The intention is that the reformed examinations should have international recognition, unlike the current school-leaving exams. Calibrating the new tests against the Framework is essential, and there is currently a great deal of activity in the development of school-leaving achievement tests in the region (for one account of such development, see Alderson et al., 2000a). We are confident that we will hear much more about the Common European Framework in the coming years, and it will increasingly become a point of reference for language examinations across Europe and beyond.

National tests

The development of national language tests continues to be the focus of many publications, although many are either simply descriptions of test development or discussions of controversies, rather than reports on research done in connection with test development.

In the UK context, Neil (1989) discusses what should be included in an assessment system for foreign languages in the UK secondary system but reports no research. Roy (1988) claims that writing tasks for modern languages should be more relevant, task-based and authentic, yet criticises an emphasis on letter writing, and argues for other forms of writing, like paragraph writing. Again, no research is reported. Page (1993) discusses the value and validity of having test questions and rubrics in the target language and asserts that the authenticity of such tasks is in doubt. He argues that the use of the target language in questions makes it more difficult to sample the syllabus adequately, and claims that the more communicative and authentic the tasks in examinations become, the more English (the mother tongue) has to be used on the examination paper in order to safeguard both the validity and the authenticity of the task. No empirical research into this issue is reported. Richards and Chambers (1996) and Chambers and Richards (1992) examine the reliability and validity of teacher assessments in oral production tasks in the school-leaving GCSE (General Certificate of Secondary Education) French examination, and find problems particularly in the rating criteria, which they hold should be based on a principled model of language proficiency and be informed by an analysis of communicative development. Hurman (1990) is similarly critical of the imprecise specifications of objectives, tasks and criteria for assessing speaking ability in French at GCSE level. Barnes and Pomfrett (1998) find that teachers need training in order to conform to good practice in assessing German for pupils at Key Stage 3 (age 14). Buckby (1999) reports an empirical comparison of recent and older GCSE examinations, to determine whether standards of achievement are falling, and concludes that although the evidence is that standards are indeed being maintained, there is a need for a range of different question types in order to enable candidates to demonstrate their competencies. Barnes et al. (1999) consider the recent introduction of the use of bilingual dictionaries in school examinations, report teachers' positive reactions to this innovation, but call for more research into the use and impact of dictionaries on pupil performance in examinations.

Similar research in the Netherlands (Jansen & Peer, 1999) reports a study of the recently introduced use of dictionaries in Dutch foreign language examinations and shows that dictionary use does not have any significant effect on test scores. Nevertheless, pupils are very positive about being allowed to use dictionaries, claiming that it reduces anxiety and enhances their text comprehension. Also in the Netherlands, Welling-Slootmaekers (1999) describes the introduction of a range of open-ended questions into national examinations of reading ability in foreign languages, arguing that these will improve the assessment of language ability (the questions are to be answered in Dutch, not the target foreign language). Van Elmpt and Loonen (1998) question the assumption that answering test questions in the target language is a handicap, and report research that shows results to be similar, regardless of whether candidates answered comprehension questions in Dutch (the mother tongue) or in English (the target language). However, Bügel and Leijn (1999) report research that showed low interrater reliability in the marking of these new item types, and they call for improved assessment practice.
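
(Interrater reliability of the kind Bügel and Leijn report is typically quantified with a chance-corrected agreement statistic. The following short Python sketch - our own illustration with toy data, not from their study - computes Cohen's kappa for two raters.)

```python
# Illustrative sketch (our example): Cohen's kappa, a standard index of
# interrater reliability. Kappa corrects raw agreement for the agreement
# two raters would reach by chance; values near 0 signal the unreliable
# marking reported above, values near 1 signal consistent marking.

from collections import Counter

def cohens_kappa(rater_a, rater_b):
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n**2
    return (observed - expected) / (1 - expected)

# Two raters marking ten open-ended answers as 'p'(ass)/'f'(ail): toy data.
a = ['p', 'p', 'f', 'p', 'f', 'p', 'p', 'f', 'p', 'f']
b = ['p', 'f', 'f', 'p', 'p', 'p', 'f', 'f', 'p', 'f']
print(round(cohens_kappa(a, b), 2))  # 0.4: only moderate agreement
```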
Guillon (1997) evaluates the assessment of English in French secondary schools, criticises the time taken by test-based assessment and the technical quality of the tests, and makes suggestions for improved pupil profiling. Mundzeck (1993) similarly criticises many of the objective tests in use in Germany for official school assessment of modern languages, arguing that they do not reflect the communicative approach to language required by the syllabus. He recommends that more open-ended tasks be used, and that teachers be trained in the reliable use of
valid criteria for subjective marking, instead of their for focusing and organising learning activities and
current practice of merely counting errors in pro- find them motivating and useful for the feedback
duction. Kieweg (1992) makes proposals for the they provide to learners.
improvement of English assessment in German In the USA, one example of concern with school-
schools, and for the comparability of standards with- based assessment is Manley (1995) who describes a
in and across schools. project in a large Texas school district to develop
Dollerup et al. (1994) describe the development in tape-mediated tests of oral language proficiency in
Denmark of an English language reading proficiency French, German, Spanish and Japanese, with positive
test which is claimed to help diagnose reading weak- outcomes.
nesses in undergraduates. Further afield, in Australia, These descriptive accounts of local and national
Liddicoat (1996) describes the Language Profile oral test development contrast markedly with the litera-
interaction component which sees listening and ture surrounding international language proficiency
speaking as interdependent skills and assesses school examinations, like TOEFL, TWE (Test of Written
pupils' ability to participate successfully in sponta- English), IELTS and some Cambridge exams.
neous conversation. Liddicoat (1998) criticises the Although some reports of the development of inter-
Australian Capital Territory's guidelines for the national proficiency tests are merely descriptive (for
assessment of proficiency in languages like Chinese, example, Charge & Taylor, 1997, and Kalter &
Japanese and Indonesian, as well as French, German, Vossen, 1990), empirical research into various aspects
Spanish and Italian. He argues that empirically-based of the validity and reliability of such tests is common-
descriptions of the achievement of learners of place, often revealing great sophistication in analytic
such different languages should inform the revision methodology.
of the descriptors of different levels in profiles of This raises a continuing problem: language testing
achievement. researchers tend to research and write about large-
In Hong Kong, dissatisfaction with graduating scale international tests, and not about more localised
students' levels of language proficiency has resulted tests (including school-leaving achievement tests
in plans for tertiary institution exit controls of lan- which are clearly relatively high-stakes). Thus, the
guage. Li (1997) describes these plans and discusses a language testing and more general educational com-
range of problematic issues that need resolving munities lack empirical evidence about the value of
before valid measures can be introduced. Coniam many influential assessment instruments, and research
(1994, 1995) describes the construction of a com- often fails to address matters of educational political
mon scale which attempts to cover the range of importance.
English language ability of Hong Kong secondary However, there are exceptions. For example, in
school pupils in English. An Item Response Theory- connection with examination reform in Hungary,
based test bank - the TeleNex - has been construct- research studies have addressed issues like the use
ed to provide teachers both with reference points for of sequencing as a test method (Alderson et al.,
ability levels and help in school-based testing. 2000b), the pairing of candidates in oral tests (Csepes
A similar concern with levels or standards of profi- et al., 2000), experimentation with procedures for
ciency is evinced by Peirce and Stewart (1997), who standard setting (Alderson, 2000a), and evidence
describe the development of the Canadian Language informing ongoing debates about how many hours
Benchmarks Assessment (CLBA), which is intended per week should be devoted to foreign language
to be used across Canada to place newcomers into education in the secondary school system (Alderson,
appropriate English instructional programmes, as 2000b).
part of a movement to establish a common frame- In commenting on the lack of international dis-
work for the description of adult ESL language pro- semination of national or regional test development
ficiency. The authors give an account of the history work, we do not wish to deny the value of local
of the project and the development of the instru- descriptive publications. Indeed, such descriptions
ments. However, Rossiter and Pawlikowsska-Smith can serve many needs, including necessary publicity
(1999) are critical of the usefulness of the CLBA for reform work, helping teachers to understand
because it is based on very broad-band differences in developments, their rationale and the need for them,
proficiency among individuals and is insensitive to persuading authorities about a desired course of
smaller, but important, differences in proficiency. action or counselling against other possible actions.
They argue that the CLBA should be supplemented Publication can serve political as well as professional
by more appropriate placement instruments. and academic purposes. Standard setting data can
Vandergrift and Belanger (1998) describe the back- reveal what levels are achieved by the school popula-
ground to and development of formative instru- tion, including comparisons of those who started
ments to evaluate achievement in Canadian National learning the language early with late-starters, those
Core French programmes, and argue that research studying a first foreign language with those studying
shows that reactions to the instruments are positive. the same language as their second or third foreign
Both teachers and pupils regard them as beneficial language, and so on.
221

http://journals.cambridge.org Downloaded: 26 Mar 2009 IP address: 194.80.32.9


Language testing and assessment (Part 1)
Language testing can inform debates in language education more generally. Examples of this include baseline studies associated with examination reform which attempt to describe current practice in language classrooms (Fekete et al., 1999). What such studies have revealed has been used in in-service and pre-service teacher education, and baseline studies can also be referred to in impact studies to show the effect of innovations, and to help language educators to understand how to do things more effectively. Washback studies have also been used in teacher training, both in order to influence test preparation practices, but also to encourage teachers to reflect on the reasons for their and others' practices.

LSP Testing

The development of specific purpose testing, i.e., tests in which the test content and test method are derived from a particular language use context rather than more general language use situations, can be traced back to the Temporary Registration Assessment Board (TRAB), introduced by the British General Medical Council in 1976 (see Rea-Dickins, 1987) and the development of the English Language Testing Development Unit (ELTDU) scales (Douglas, 2000). The 1980s saw the introduction of English for Academic Purposes (EAP) tests and it is these that have subsequently dominated the research and development agenda. It is important to note, however, that Language for Specific Purposes (LSP) tests are not the diametric opposite of general purpose tests. Rather, they typically fall along a continuum between general purpose tests and those for highly specialised contexts and include tests for academic purposes (e.g., the International English Language Testing System, IELTS) and for occupational or professional purposes (e.g., the Occupational English Test, OET).

Douglas (1997, 2000) identifies two aspects that typically distinguish LSP testing from general purpose testing. The first is the authenticity of the tasks, i.e., the test tasks share key features with the tasks that a test taker might encounter in the target language use situation. The assumption here is that the more closely the test and 'real-life' tasks are linked, the more likely it is that the test takers' performance on the test task would reflect their performance in the target situation. The second distinguishing feature of LSP testing is the interaction between language knowledge and specific content knowledge. This is perhaps the most crucial difference between general purpose testing and LSP testing, for in the former, any sort of background knowledge is considered to be a confounding variable that contributes construct-irrelevant variance to the test score. However, in the case of LSP testing, background knowledge constitutes an integral part of what is being tested, since it is hypothesised that test takers' language knowledge has developed within the context of their academic or professional field and that they would be disadvantaged by taking a test based on content outside that field.

The development of an LSP test typically begins with an in-depth analysis of the target language use situation, perhaps using genre analysis (see Tarone, 2001). Attention is paid to general situational features such as topics, typical lexis and grammatical structures. Specifications are then developed that take into account the specific language characteristics of the context as well as typical scenarios that occur (e.g., Plakans & Abraham, 1990; Stansfield et al., 1990; Scott et al., 1996; Stansfield et al., 1997; Stansfield et al., 2000). Particular areas of concern, quite understandably, tend to relate to issues of background knowledge and topic choice (e.g., Jensen & Hansen, 1995; Clapham, 1996; Fox et al., 1997; Celestine & Cheah, 1999; Jennings et al., 1999; Papajohn, 1999; Douglas, 2001a) and authenticity of task, input or, indeed, output (e.g., Lumley & Brown, 1998; Moore & Morton, 1999; Lewkowicz, 2000; Elder, 2001; Douglas, 2001a; Wu & Stansfield, 2001), and these areas of concern have been a major focus of research attention in the last decade.

Results, though somewhat mixed (cf. Jensen & Hansen, 1995 and Fox et al., 1997), suggest that background knowledge and language knowledge interact differently depending on the language proficiency of the test taker. Clapham's (1996) research into subject-specific reading tests (research she conducted during and after the ELTS revision project) shows that, at least in the case of her data, the scores of neither lower nor higher proficiency test takers seemed influenced by their background knowledge. She hypothesises that for the former this was because they were most concerned with decoding the text, and for the latter it was because their linguistic knowledge was sufficient for them to be able to decode the text with that alone. However, the scores of medium proficiency test takers were affected by their background knowledge. On the basis of these findings she argues that subject-specific tests are not equally valid for test takers at different levels of language proficiency.

Fox et al. (1997), examining the role of background knowledge in the context of the listening section of an integrated test of English for Academic Purposes (the Carleton Academic English Test, CAEL), report a slight variation on this finding. They too find a significant interaction between language proficiency and background knowledge, with the scores of low proficiency test takers showing no benefit from background knowledge. However, the scores of the high proficiency candidates, and analysis of their verbal protocols, indicate that they did make use of their background knowledge to process the listening task.

Clapham (1996) has further shown that background knowledge is an extremely complex
concept. She reveals dilemmas including the difficulty of identifying with any precision the absolute specificity of an input passage and the nigh impossibility of being certain about test takers' background knowledge (particularly given that test takers often read outside their chosen academic field and might even have studied in a different academic area in the past). This is of particular concern when tests are topic-based and all the sub-tests and tasks relate to a single topic area. Jennings et al. (1999) and Papajohn (1999) look at the possible effect of topic, in the case of the former, for the CAEL and, in the case of the latter, in the chemistry TEACH test for international teaching assistants. They argue that the presence of topic effect would compromise the construct validity of the test whether test takers are offered a choice of topic during test administration (as with the CAEL) or not. Papajohn finds that topic does play a role in chemistry TEACH test scores and warns of the danger of assuming that subject-specificity automatically guarantees topic equivalence. Jennings et al. are relieved to report that choice of topic does not seem to affect test taker performance on the CAEL. However, they do note that there is a pattern in the choices made by test takers of different proficiency levels and suggest that more research is needed into the implications of these patterns for test performance.

Another particular concern of LSP test developers has been authenticity (of task, input and/or output), one example of the care taken to ensure that the test materials are authentic being Wu and Stansfield's (2001) description of the test construction procedure for the LSTE-Taiwanese (listening summary translation exam). Yet Lewkowicz (1997) somewhat puts the cat among the pigeons when she demonstrates that it is not always possible accurately to identify authentic texts from those specially constructed for testing purposes. She further problematises the valuing of authenticity in her study of a group of test takers' perceptions of an EAP test, finding that they seemed unconcerned about whether the test materials were situationally authentic or not. Indeed, they may even consider multiple-choice tests to be authentic tests of language, as opposed to tests of authentic language (Lewkowicz, 2000). (For further discussion of this topic, see Part Two of this review.)

Other test development concerns, however, are very much like those of researchers developing tests in different sub-skills. Indeed, researchers working on LSP tests have contributed a great deal to our understanding of a number of issues related to the testing of reading, writing, speaking and listening. Apart from being concerned with how best to elicit samples of language for assessment (Read, 1990), they have investigated the influence of interlocutor behaviour on test takers' performance in speaking tests (e.g., Brown & Lumley, 1997; McNamara & Lumley, 1997; Reed & Halleck, 1997). They have also studied the assumptions underpinning rating scales (Hamilton et al., 1993) as well as the effect of rater variables on test scores (Brown, 1995; Lumley & McNamara, 1995) and the question of who should rate test performances — language specialists or subject specialists (Lumley, 1998).

There have also been concerns related to the interpretation of test scores. Just as in general purpose testing, LSP test developers are concerned with minimising and accounting for construct-irrelevant variables. However, this can be a particularly thorny issue in LSP testing, since construct-irrelevant variables can be introduced as a result of the situational authenticity of the test tasks. For instance, in his study of the chemistry TEACH test, Papajohn (1999) describes the difficulty of identifying when a teaching assistant's teaching skills (rather than language skills) are contributing to his/her test performance. He argues that test behaviours such as the provision of accessible examples or good use of the blackboard are not easily distinguished as teaching or language skills, and this can result in construct-irrelevant variance being introduced into the test score. He suggests that test takers should be given specific instructions on how to present their topics, i.e., teaching tips, so that teaching skills do not vary widely across performances. Stansfield et al. (2000) have taken a similar approach in their development of the LSTE-Taiwanese. The assessment begins with an instruction section on the summary skills needed for the test, with the aim of ensuring that test performances are not unduly influenced by a lack of understanding of the task requirements.

It must be noted, however, that, because of the need for in-depth analysis of the target language use situation, LSP tests are time-consuming and expensive to produce. It is also debatable whether English for Specific Purposes (ESP) tests are more informative than a general purpose test. Furthermore, it is increasingly unclear just how 'specific' an LSP test is or can be. Indeed, more than a decade has passed since Alderson (1988) first asked the crucial question of how specific ESP testing could get. This question is recast in Elder's (2001) work on LSP tests for teachers when she asks whether for all their 'teacherliness' these tests elicit language that is essentially different from that elicited by a general language test.

An additional concern is the finding that construct-relevant variables such as background knowledge and compensatory strategies interact differently with language knowledge depending on the language proficiency of the test taker (e.g., Halleck & Moder, 1995; Clapham, 1996). As a consequence of Clapham's (1996) research, the current IELTS test has no subject-specific reading texts and care is taken to ensure that the input materials are not biased for or against test takers of different disciplines. Though the extent to which this lack of bias has been achieved is debatable (see Celestine & Cheah, 1999), it can still be argued that the attempt to make texts
accessible regardless of background knowledge has resulted in the IELTS test being very weakly specific. Its claims to specificity (and indeed similar claims by many EAP tests) rest entirely on the fact that it is testing the generic language skills needed in academic contexts. This leaves it unprotected against suggestions like Clapham's (2000a) when she questions the theoretical soundness of assessing discourse knowledge that the test taker, by registering for a degree taught in English, might arguably be hoping to learn and that even a native speaker of English might lack. Recently the British General Medical Council has abandoned its specific purpose test, the Professional and Linguistic Assessment Board (PLAB, a revised version of the TRAB), replacing it with a two-stage assessment process that includes the use of the IELTS test to assess linguistic proficiency. These developments represent the thin end of the wedge. Though the IELTS is still a specific purpose test, it is itself less so than its precursor, the English Language Testing System (ELTS), and it is certainly less so than the PLAB. And so the questioning continues. Davies (2001) has joined the debate, debunking the theoretical justifications typically put forward to explain LSP testing, in particular the principle that different fields demand different language abilities. He argues that this principle is based far more on differences of content rather than on differences of language (see also Fulcher, 1999a). He also questions the view that content areas are discrete and heterogeneous.

Despite all the rumblings of discontent, Douglas (2000) stands firmly by claims made much earlier in the decade that in highly field-specific language contexts, a field-specific language test is a better predictor of performance than a general purpose test (Douglas & Selinker, 1992). He concedes that many of these contexts will be small-scale educational, professional or vocational programmes in which the number of test takers is small, but maintains (Douglas, 2000: 282):

if we want to know how well individuals can use a language in specific contexts of use, we will require a measure that takes into account both their language knowledge and their background knowledge, and their use of strategic competence in relating the salient characteristics of the target language use situation to their specific purpose language abilities. It is only by so doing ... that we can make valid interpretations of test performances.

He also suggests that the problem might not be with the LSP tests or with their specification of the target language use domain but with the assessment criteria applied. He argues (Douglas, 2001b) that just as we analyse the target language use situation in order to develop the test content and methods, we should exploit that source when we develop the assessment criteria. This might help us to avoid expecting a perfection of the test taker that is not manifested in authentic performances in the target language use situation.

But perhaps the real challenge to the field is in identifying when it is absolutely necessary to know how well someone can communicate in a specific context, or if the information being sought is equally obtainable through a general-purpose language test. The answer to this challenge might not be as easily reached as is sometimes presumed.

Computer-based testing

Computer-based testing has witnessed rapid growth in the past decade and computers are now used to deliver language tests in many settings. A computer-based version of the TOEFL was introduced on a regional basis in the summer of 1998, tests are now available on CD-ROM, and the Internet is increasingly used to deliver tests to users. Alderson (1996) points out that computers have much to offer language testing: not just for test delivery, but also for test construction, test compilation, response capture, test scoring, result calculation and delivery, and test analysis. They can also, of course, be used for storing tests and details of candidates.

In short, computers can be used at all stages in the test development and administration process. Most work reported in the literature, however, concerns the compilation, delivery and scoring of tests by computer. Fulcher (1999b) describes the delivery of an English language placement test over the Web and Gervais (1997) reports the mixed results of transferring a diagnostic paper-and-pencil test to the computer. Such articles set the scene for studies of computer-based testing which compare the accuracy of the computer-based test with a traditional paper-and-pencil test, addressing the advantages of a computer-delivered test in terms of accessibility and speed of results, and possible disadvantages in terms of bias against those with no computer familiarity, or with negative attitudes to computers.

This concern with bias is a recurrent theme in the literature, and it inspired a large-scale study by the Educational Testing Service (ETS), the developers of the computer-based version of the TOEFL, who needed to show that such a test would not be biased against those with no computer literacy. Jamieson et al. (1998) describe the development of a computer-based tutorial intended to train examinees to take the computerised TOEFL. Taylor et al. (1999) examine the relationship between computer familiarity and TOEFL scores, showing that those with high computer familiarity tend to score higher on the traditional TOEFL. They compare examinees with high and low computer familiarity in terms of their performance on the computer tutorial and on computerised TOEFL-like tasks. They claim that no relationship was found between computer familiarity and performance on the computerised tasks after controlling for English language proficiency. They conclude that there is no evidence of bias against candidates with low computer familiarity, but also
take comfort in the fact that all candidates will be able to take the computer tutorial before taking an operational computer-based TOEFL.

The commonest use of computers in language testing is to deliver tests adaptively (e.g., Young et al., 1996). This means that the computer adjusts the items to be delivered to a candidate in the light of that candidate's success or failure on previous items. If the candidate fails a difficult item, s/he is presented with an easier item, and if s/he gets an item correct, s/he is presented with a more difficult item. This has advantages: firstly, candidates are presented with items at their level of ability, and are not faced with items that are either too easy or too difficult, and secondly, computer-adaptive tests (CATs) are typically quicker to deliver, and security is less of a problem since different candidates are presented with different items. Many authors discuss the advantages of CATs (Laurier, 1998; Brown, 1997; Chalhoub-Deville & Deville, 1999; Dunkel, 1999), but they also emphasise issues that test developers and score users must address when developing or using CATs. When designing such tests, developers have to take a number of decisions: what should the entry level be, and how is this best determined for any given population? At what point should testing cease (the so-called exit point) and what should the criteria be that determine this? How can content balance best be assured in tests where the main principle for adaptation is psychometric? What are the consequences of not allowing users to skip items, and can these consequences be ameliorated? How can it be ensured that some items are not presented much more frequently than others (item exposure), because of their facility, or their content? Brown and Iwashita (1996) point out that grammar items in particular will vary in difficulty according to the language background of candidates, and they show how a computer-adaptive test of Japanese resulted in very different item difficulties for speakers of English and Chinese. Thus a CAT may also need to take account of the language background of candidates when deciding which items to present, at least in grammar tests, and conceivably also in tests of vocabulary.
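The item-selection logic just described can be illustrated with a short sketch. The following Python fragment is a deliberately simplified illustration of the easier/harder adjustment, not the algorithm of any operational CAT: real systems typically estimate ability with Item Response Theory and add content-balancing and item-exposure controls, and all names and values here are invented for the example.

```python
# A deliberately simplified computer-adaptive item selector.
# Operational CATs estimate ability with Item Response Theory and
# add content-balancing and item-exposure controls; this sketch only
# illustrates the basic easier/harder adjustment described above.

def run_cat(item_bank, answer, ability=0.0, max_items=20, step=0.5):
    """item_bank: list of (item_id, difficulty) pairs.
    answer: callback taking an item_id, returning True if correct.
    Returns the final ability estimate."""
    unused = dict(item_bank)
    for _ in range(min(max_items, len(item_bank))):
        # Present the unused item whose difficulty is closest to the
        # current ability estimate, so the candidate faces items that
        # are neither too easy nor too difficult.
        item_id = min(unused, key=lambda i: abs(unused[i] - ability))
        del unused[item_id]
        if answer(item_id):
            ability += step   # correct: move towards harder items
        else:
            ability -= step   # incorrect: move towards easier items
        step = max(step * 0.8, 0.1)  # smaller steps as evidence accumulates
    return ability

# Hypothetical usage: a five-item bank and a candidate who answers
# the three easiest items correctly.
bank = [("q1", -1.0), ("q2", -0.5), ("q3", 0.0), ("q4", 0.5), ("q5", 1.0)]
estimate = run_cat(bank, answer=lambda item_id: item_id in {"q1", "q2", "q3"})
```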
Chalhoub-Deville and Deville (1999) point out that, despite the apparent advantages of computer-based tests, computer-based testing relies overwhelmingly on selected-response (typically multiple-choice questions) discrete-point tasks rather than performance-based items, and thus computer-based testing may be restricted to testing linguistic knowledge rather than communicative skills. However, many computer-based tests include tests of reading, which is surely a communicative skill. The question is whether computer-based testing offers any added value over paper-and-pencil reading tests: adaptivity is one possibility, although some test developers are concerned that since reading tests typically present several items on one text — what is known in the jargon as a testlet — they may not be suitable for computer-adaptivity. This concern for the inherent conservatism of computer-based testing has a long history (see Alderson, 1986a, 1986b, for example), and some claimed innovations, for example, computer-generated cloze and multiple-choice tests (Coniam, 1997, 1998), were actually implemented as early as the 1970s, and were often criticised in the literature for risking the assumption of automatic validity. But recent developments offer some hope. Burstein et al. (1996) argue for the relevance of new technologies in innovation in test design, construction, trialling, delivery, management, scoring, analysis and reporting. They review ways in which new input devices (e.g., voice and handwriting recognition), output devices (e.g., video, virtual reality), software such as authoring tools, and knowledge-based systems for language analysis could be used, and explore advances in the use of new technologies in computer-assisted learning materials. However, as they point out, 'innovations applied to language assessment lag behind their instructional counterparts ... the situation is created in which a relatively rich language presentation is followed by a limited productive assessment' (1996: 245).

No doubt this is largely due to the fact that computer-based tests require the computer to score responses. However, Burstein et al. (1996) argue that human-assisted scoring systems could reduce this dependency. (Human-assisted scoring systems are computer-based systems where most scoring of responses is done by computer, but responses that the programs are unable to score are given to humans for grading.) They also give details of free-response scoring tools which are capable of scoring responses up to 15 words long which correlate highly with human judgements (coefficients of between .89 and .98 are reported). Development of such systems for short-answer questions and for essay questions has since gone on apace. For example, ETS has developed an automated system for assessing productive language abilities, called 'e-rater'. e-rater uses natural language processing techniques to duplicate the performance of humans rating open-ended essays. Already, the system is used to rate GMAT (Graduate Management Admission Test) essays and research is ongoing for other programmes, including second/foreign language testing situations. Burstein et al. conclude that 'the barriers to the successful use of technology for language testing are less technical than conceptual' (1996: 253), but progress since that article was published is extremely promising.

An example of the use of IT to assess aspects of the speaking ability of second/foreign language learners of English is PhonePass. PhonePass (www.ordinate.org) is delivered over the telephone, and candidates are asked to read texts aloud, repeat heard sentences, say words opposite in meaning to heard words, and give short answers to questions.
The system uses speech recognition technology to rate responses, by comparing candidate performance to statistical models of native and non-native performance on the tasks. The system gives a score that reflects a candidate's ability to understand and respond appropriately to decontextualised spoken material, with 40% of the evaluation reflecting the fluency and pronunciation of the responses. Alderson (2000c) reports that reliability coefficients of 0.91 have been found, as well as correlations with the Test of Spoken English (TSE) of 0.88 and with an ILR (Inter-agency Language Roundtable) Oral Proficiency Interview (OPI) of 0.77. An interesting feature is that the scored sample is retained on a database, classified according to the various scores assigned. This enables users to access the speech sample, in order to make their own judgements about the performance for their particular purposes, and to compare how their candidate has performed with other speech samples that have been rated either the same, or higher or lower.
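The reported 40% weighting can be made concrete with a toy scoring function. The component names, the 0-100 scales and the way the remaining 60% is allocated below are assumptions made purely for illustration; Ordinate's actual scoring model is statistical and is not reproduced here.

```python
# Illustrative composite scoring only: fluency and pronunciation
# together contribute 40% of the evaluation, as reported for
# PhonePass; the remaining 60% (here a single 'accuracy' component)
# and the 0-100 scales are invented for this sketch.

def composite_score(accuracy, fluency, pronunciation):
    """Each input is a component score on a 0-100 scale."""
    delivery = (fluency + pronunciation) / 2  # the 40% share
    return 0.6 * accuracy + 0.4 * delivery

print(composite_score(accuracy=78, fluency=64, pronunciation=70))  # 73.6
```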
In addition to e-rater and PhonePass there are a number of promising initiatives in the use of computers in testing. The listening section of the computer-based TOEFL uses photos and graphics to create context and support the content of the mini-lectures, producing stimuli that more closely approximate 'real world' situations in which people do more than just listen to voices. Moreover, candidates wear headphones, can adjust the volume control, and are allowed to control how soon the next question is presented. One innovation in test method is that candidates are required to select a visual or part of a visual; in some questions candidates must select two choices, usually out of four, and in others candidates are asked to match or order objects or texts. Moreover, candidates see and hear the test questions before the response options appear. (Interestingly, Ginther, forthcoming, suggests, however, that the use of visuals in the computer-based TOEFL listening test depresses scores somewhat, compared with traditionally delivered tests. More research is clearly needed.)

In the Reading section candidates are required to select a word, phrase, sentence or paragraph in the text itself, and other questions ask candidates to insert a sentence where it fits best. Although these techniques have been used elsewhere in paper-and-pencil tests, one advantage of their computer format is that the candidate can see the result of their choice in context, before making a final decision. Although these innovations may not seem very exciting, Bennett (1998) claims that the best way to innovate in computer-based testing is first to mount on computer what can already be done in paper-and-pencil format, with possible minor improvements allowed by the medium, in order to ensure that the basic software works well, before innovating in test method and construct. Once the delivery mechanisms work, it is argued, then computer-based deliveries can be developed that incorporate desirable innovations.

DIALANG (http://www.dialang.org) is a suite of computer-based diagnostic tests (funded by the European Union) which are available over the Internet, thus capitalising on the advantages of Internet-based delivery (see below). DIALANG uses self-assessment as an integral part of diagnosis. Users' self-ratings are combined with objective test results in order to identify a suitably difficult test for the user. DIALANG gives users feedback immediately, not only on their test scores, but also on the relationship between their test results and their self-assessment. DIALANG also gives extensive advice to users on how they can progress from their current level to the next level of language proficiency, basing this advice on the Common European Framework (Council of Europe, 2001). The interface and support language, and the language of self-assessment and of the feedback, can be chosen by the test user from a list of 14 European languages. Users can decide which skill or language aspect (reading, writing, listening, grammar and vocabulary) they wish to be tested in, in any one of the same 14 European languages. Currently available test methods consist of multiple-choice, gap-filling and short-answer questions, but DIALANG has already produced CD-based demonstrations of 18 different experimental item types which could be implemented in the future, and the CD demonstrates the use of help, clue, dictionary and multiple-attempt features.
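How self-ratings and objective results might jointly determine the difficulty of the test administered can be sketched schematically. The band boundaries, the equal weighting of the two signals and the function and parameter names below are invented for illustration; DIALANG's actual routing rules are not reproduced here.

```python
# Schematic sketch: route a user to an easier or harder test version
# by combining a self-rating with a short objective placement score.
# The weighting and cut-offs are assumptions for illustration only,
# not DIALANG's published algorithm.

def route_test(self_rating, placement_score):
    """self_rating: 1-6 (user's own estimate of level);
    placement_score: 0-100 (e.g., a short objective placement test)."""
    combined = 0.5 * (self_rating / 6) + 0.5 * (placement_score / 100)
    if combined < 0.4:
        return "easy"
    if combined < 0.7:
        return "medium"
    return "hard"

print(route_test(self_rating=4, placement_score=55))  # medium
```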
Although DIALANG is limited in its ability to assess users' productive language abilities, the experimental item types include a promising combination of self-assessment and benchmarking. Tasks for the elicitation of speaking and writing performances are administered to pilot candidates and performances are rated by human judges. Those performances on which raters achieve the greatest agreement are chosen as 'benchmarks'. A DIALANG user is presented with the same task and, in the case of a writing task, responds via the keyboard. The user's performance is then presented on screen alongside the pre-rated benchmarks. The user can compare their own performance with the benchmarks. In addition, since the benchmarks are pre-analysed, the user can choose to see raters' comments on various features of the benchmarks, in hypertext form, and consider whether they could produce a similar quality of such features. In the case of Speaking tasks, the candidate is simply asked to imagine how they would respond to the task, rather than actually to record their performance. They are then presented with recorded benchmark performances, and are asked to estimate whether they could do better or worse than each performance. Since the performances are graded, once candidates have self-assessed themselves against a number of performances, the system can tell them roughly what level their own (imagined) performance is likely to be.
These developments illustrate some of the advantages of computer-based assessment, which make computer-based testing not only more user-friendly, but also more compatible with language pedagogy. However, Alderson (2000c) argues the need for a research agenda which would address the challenge of the opportunities afforded by computer-based testing and the data that can be amassed. Such an agenda would investigate the comparative advantages and added value of each form of assessment — IT-based or not IT-based. This includes issues like the effect of providing immediate feedback, support facilities, second attempts, self-assessment, confidence testing, and the like. Above all, it would seek to throw more light onto the nature of the constructs that can be tested by computer-based testing:

What is needed above all is research that will reveal more about the validity of the tests, that will enable us to estimate the effects of the test method and delivery medium; research that will provide insights into the processes and strategies test-takers use; studies that will enable the exploration of the constructs that are being measured, or that might be measured ... And we need research into the impact of the use of the technology on learning, on learners and on the curriculum. (Alderson, 2000c: 603)

Self-assessment

The previous section has shown how computer-based testing can incorporate test takers' self-assessment of their abilities in the target language. Until the 1980s references to self-assessment were rare, but since then interest in self-assessment has increased. This increase can at least in part be attributed to an increased interest in involving the learner in all phases of the learning process and in encouraging learner autonomy and decision making in (and outside) the language classroom (e.g., Blanche & Merino, 1989). The introduction of self-assessment was viewed as promising by many, especially in formative assessment contexts (Oscarson, 1989). It was considered to encourage increasing sophistication in learner awareness, helping learners to: gain confidence in their own judgement; acquire a view of evaluation that covers the whole learning process; and see errors as something helpful. It was also seen to be potentially useful to teachers, providing information on learning styles, on areas needing remediation and feedback on teaching (Barbot, 1991).

However, self-assessment also met with considerable scepticism, largely due to concerns about the ability of learners to provide accurate judgements of their achievement and proficiency. For instance, Blue (1988), while acknowledging that self-assessment is an important element in self-directed learning and that learners can play an active role in the assessment of their own language learning, argues that learners cannot self-assess unaided. Taking self-assessment data gathered from students on a pre-sessional EAP programme, he reports a poor correlation between teachers' assessments of the students and their own self-assessments. He also shows that in multicultural groups such as those typical of pre-sessional EAP courses, overestimates of language proficiency are more common than underestimates. Finally, he argues that learners' lack of familiarity with metalanguage and with the practice of discussing language proficiency in terms of its composite skills impairs their capacity for identifying their precise language learning needs.

Such concerns, however, did not dampen enthusiasm for investigations in this area, and research in the 1980s was concerned with the development of self-assessment instruments and their validation (e.g., Oscarson, 1984; Lewkowicz & Moon, 1985). Consequently, a variety of approaches were developed, including pupil progress cards, learning diaries, log books, rating scales and questionnaires. In the last decade the research focus has shifted towards enhancing our understanding of the evaluation techniques that were already in existence through continued validation exercises and by applying self-assessment in new contexts or in new ways.

For instance, Blanche (1990) uses standardised achievement and oral proficiency tests both for testing and for self-assessment purposes, arguing that this approach helps to circumvent the problems of training that are associated with self-assessment questionnaires. Hargan (1994) documents the use of a 'do-it-yourself' instrument for placement purposes, reporting that it results in much the same placement levels as suggested by a traditional multiple-choice test. Hargan argues that placement testing for large numbers in her context has resulted in the implementation of a traditional multiple-choice grammar-based placement test and a consequent emphasis on teaching analytic grammar skills. She believes that the 'do-it-yourself-placement' instrument might help to redress the emphasis on grammar and stem the neglect of reading and writing skills in the classroom.

Carton (1993) discusses how self-assessment can become part of the learning process. He describes his use of questionnaires to encourage learners to reflect on their learning objectives and preferred modes of learning. He also presents an approach to monitoring learning that involves the learners in devising their own criteria, an approach that he argues helps learners to become more aware of their own cognitive processes.

A typical approach to validating self-assessment instruments has been to obtain concurrent validity statistics by correlating the self-assessment measure with one or more external measures of student performance (e.g., Shameem, 1998; Ross, 1998). Other approaches have included the use of multi-trait multi-method (MTMM) designs and factor analysis (Bachman & Palmer, 1989) and a split-ballot technique (Heilenman, 1990). In general, these studies have found self-assessment to be a robust method for
gathering information about learner proficiency and that the risk of cheating is low (see Barbot, 1991). However, they also indicate that some approaches to gathering self-assessment data are more effective than others. Bachman and Palmer (1989) report that learners were more able to identify what they found difficult to do in a language than what they found easy. Therefore, 'Can-do' questions were the least effective question type of the three they used in their MTMM study, while the most effective question type appeared to be that which asked about the learners' perceived difficulties with aspects of the language.
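Computationally, concurrent validation of the kind described above reduces to a correlation between paired scores. The sketch below computes a Pearson correlation between self-ratings and an external measure; the data and variable names are invented for the example.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation between paired score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Invented data: learners' self-ratings paired with an external
# measure (e.g., teacher ratings) for the same eight learners.
self_ratings    = [3, 4, 2, 5, 4, 3, 5, 2]
teacher_ratings = [3, 4, 3, 5, 3, 3, 4, 2]
print(round(pearson_r(self_ratings, teacher_ratings), 2))  # 0.85
```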
Additionally, learner experience of the self-assessment procedure and/or the language skill being assessed has been found to affect self-assessments. Heilenman (1990), in a study of the role of response effects, reports both an acquiescence effect (the tendency to respond positively to an item regardless of its content) and a tendency to overestimate ability, these tendencies being more marked among less experienced learners. Ross (1998) has found that the reliability of learners' self-assessments is affected by their experience of the skill being assessed. He suggests that when learners do not have memory of a criterion, they resort to recollections of their general proficiency in order to make their judgement. This process is more likely to be affected by the method of the self-assessment instrument and by factors such as self-flattery. He argues, therefore, for the design of instruments that are cast in terms which offer learners a reference point such as specific curricular content. In a similar finding, Shameem (1998) reports that respondents' self-assessments of their oral proficiency in Fijian Hindi are less reliable at the highest levels of the self-assessment scale. Like Ross, he attributes this slip in accuracy to the respondents' lack of familiarity with the criterion measure.

Oscarson (1997) sums up progress in this area by reminding us that research in self-assessment is still relatively new. He acknowledges that conundrums remain. For instance, learner goals and interpretations need to be reconciled with external imperatives. Also, self-assessment is not self-explanatory; it must be introduced slowly and learners need to be guided and supported in their use of the instruments. Furthermore, particularly when using self-assessment in multicultural groups, it is important to consider the cultural influences on self-assessment. Nevertheless, he considers the research so far to be promising. Despite residual concerns about the accuracy of self-assessment, the majority of studies report favourable results and we have already learned a great deal about the appropriate methodology to use for capturing self-assessments. However, as Oscarson points out, more work is needed, both in the study of factors that influence self-assessment ratings in various contexts and in the selection and design of materials and methods for self-assessment.

Alternative assessment

Self-assessment is one example of what is increasingly called 'alternative assessment'. 'Alternative assessment' is usually taken to mean assessment procedures which are less formal than traditional testing, which are gathered over a period of time rather than being taken at one point in time, which are usually formative rather than summative in function, are often low-stakes in terms of consequences, and are claimed to have beneficial washback effects. Although such procedures may be time-consuming and not very easy to administer and score, their claimed advantages are that they provide easily understood information, they are more integrative than traditional tests and they are more easily integrated into the classroom. McNamara (1998) makes the point that alternative assessment procedures are often developed in an attempt to make testing and assessment more responsive and accountable to individual learners, to promote learning and to enhance access and equity in education (1998: 310). Hamayan (1995) presents a detailed rationale for alternative assessment, describes different types of such assessment, and discusses procedures for setting up alternative assessment. She also provides a very useful bibliography for further reference.

A recent special issue of Language Testing, guest-edited by McNamara (Vol. 18, 4, October 2001), reports on a symposium to discuss challenges to the current mainstream in language testing research, covering issues like assessment as social practice, democratic assessment, the use of outcomes-based assessment and processes of classroom assessment. Such discussions of alternative perspectives are closely linked to so-called critical perspectives (what Shohamy calls critical language testing).

The alternative assessment movement, if it may be termed such, probably began in writing assessment, where the limitations of a one-off impromptu single writing task are apparent. Students are usually given only one, or at most two tasks, yet generalisations about writing ability across a range of genres are often made. Moreover, it is evidently the case that most writing, certainly for academic purposes but also in business settings, takes place over time, involves much planning, editing, revising and redrafting, and usually involves the integration of input from a variety of (usually written) sources. This is in clear contrast with the traditional essay, which usually has a short prompt, gives students minimal input, minimal time for planning and virtually no opportunity to redraft or revise what they have produced under often stressful, time-bound circumstances. In such situations, the advocacy of portfolios of pieces of writing became a commonplace, and a whole portfolio assessment movement has developed, especially in the USA for first language writing (Hamp-Lyons & Condon, 1993, 1999) but also increasingly
for ESL writing assessment (Hamp-Lyons, 1996) and for the assessment of writing in foreign languages (French, Spanish, German, etc.).

Although portfolio assessment in other subject areas (art, graphic design, architecture, music) is not new, in foreign language education portfolios have been hailed as a major innovation, supposedly overcoming the drawbacks of traditional assessment. A typical example is Padilla et al. (1996), who describe the design and implementation of portfolio assessment in Japanese, Chinese, Korean and Russian, to assess growth in foreign language proficiency. They make a number of practical recommendations to assist teachers wishing to use portfolios in progress assessment.

Hughes Wilhelm (1996) describes how portfolio assessment was integrated with criterion-referenced grading in a pre-university English for academic purposes programme, together with the use of contract grading and collaborative revision of grading criteria. It is claimed that such an assessment scheme encourages learner control whilst maintaining standards of performance.

Short (1993) discusses the need for better assessment models for instruction where content and language instruction are integrated. She describes examples of the implementation of a number of alternative assessment measures, such as checklists, portfolios, interviews and performance tasks, in elementary and secondary school integrated content and language classes.

Alderson (2000d) describes a number of alternative procedures for assessing reading, including checklists, teacher-pupil conferences, learner diaries and journals, informal reading inventories, classroom reading aloud sessions, portfolios of books read, self-assessments of progress in reading, and the like.

Many of the accounts of alternative assessment are for classroom-based assessment, often for assessing progress through a programme of instruction. Gimenez (1996) gives an account of the use of process assessment in an ESP course; Bruton (1991) describes the use of continuous assessment over a full school year in Spain, to measure achievement of objectives and learner progress. Haggstrom (1994) describes ways she has successfully used a video camera and task-based activities to make classroom-based oral testing more communicative and realistic, less time-consuming for the teacher, and more enjoyable and less stressful for students. Lynch (1988) describes an experimental system of peer evaluation using questionnaires in a pre-sessional EAP summer programme, to assess speaking abilities. He concludes that this form of evaluation had a marked effect on the extent to which speakers took their audience into account. Lee (1989) discusses how assessment can be integrated with the learning process, illustrating her argument with an example where pupils prepare, practise and perform a set task in Spanish together. She offers practical tips for how teachers can reduce the amount of paperwork involved in classroom assessment of this sort. Sciarone (1995) discusses the difficulties of monitoring learning with large groups of students (in contrast with that of individuals) and describes the use, with 200 learners of Dutch, of a simple monitoring tool (a personal computer) to keep track of the performance of individual learners on a variety of learning tasks.

Typical of these accounts, however, is the fact that they are descriptive and persuasive, rather than research-based, or empirical studies of the advantages and disadvantages of 'alternative assessment'. Brown and Hudson (1998) present a critical overview of such approaches, criticising the evangelical way in which advocates assert the value and indeed validity of their procedures without any evidence to support their assertions. They point out that there is no such thing as automatic validity, a claim all too often made by the advocates of alternative assessment. Instead of 'alternative assessment', they propose the term 'alternatives in assessment', pointing out that there are many different testing methods available for assessing student learning and achievement. They present a description of these methods, including selected-response techniques, constructed-response techniques and personal-response techniques. Portfolio and other forms of 'alternative assessment' are classified under the latter category, but Brown and Hudson emphasise that they should be subject to the same criteria of reliability, validity and practicality as any other assessment procedure, and should be critically evaluated for their 'fitness for purpose', what Bachman and Palmer (1996) called 'usefulness'. Hamp-Lyons (1996) concludes that portfolio scoring is less reliable than traditional writing rating; little training is given and raters may be judging the writer as much as the writing. Brown and Hudson emphasise that decisions for use of any assessment procedure should be informed by considerations of consequences (washback), the significance and need for, and value of, feedback based on the assessment results, and the importance of using multiple sources of information when making decisions based on assessment information.

Clapham (2000b) makes the point that many alternative assessment procedures are not pre-tested and trialled, their tasks and mark schemes are therefore of unknown or even dubious quality, and despite face validity, they may not tell the user very much at all about learners' abilities.

In short, as Hamayan (1995) admits, alternative assessment procedures have yet to 'come of age', not only in terms of demonstrating beyond doubt their usefulness, in Bachman and Palmer's terms, but also in terms of being implemented in mainstream assessment, rather than in informal class-based assessment. She argues that consistency in the application of alternative assessment is still a problem, that
mechanisms for thorough self-criticism and evaluation of alternative assessment procedures are lacking, that some degree of standardisation of such procedures will be needed if they are to be used for high-stakes assessment, and that the financial and logistic viability of such procedures remains to be demonstrated.

Assessing young learners

Finally, in this first part of our review, we consider recent developments in the assessment of young learners, an area where it is often argued that alternative assessment procedures are more appropriate than formal testing procedures. Typically considered to apply to the assessment of children between the ages of 5 and 12 (but also including much younger and slightly older children), the assessment of young learners dates back to the 1960s. However, research interest in this area is relatively new and the last decade has witnessed a plethora of studies (e.g., Low et al., 1993; McKay et al., 1994; Edelenbos & Johnstone, 1996; Breen et al., 1997; Leung & Teasdale, 1997; TESOL, 1998; Blondin et al., 1998). This trend can be largely attributed to three factors. Firstly, second language teaching (particularly English) to children in the pre-primary and primary age groups, both within mainstream education and by commercial organisations, has mushroomed. Secondly, it is recognised that classrooms have become increasingly multi-cultural and, particularly in the context of Australia, Canada, the United States and the UK, many learners are speakers of English as an additional/second language (rather than heritage speakers of English). Thirdly, the decade has seen an increased proliferation, within mainstream education, of teaching and learning standards (e.g., the National Curriculum Guidelines in England and Wales) and demands for accountability to stakeholders.

The research that has resulted falls broadly into three areas: the assessment of language delay and/or impairment, the assessment of young learners with English as an additional/second language, and the assessment of foreign languages in primary/elementary school.

Changes in the measurement of language delay and/or impairment have been attributed to theoretical and practical advances in speech and language therapy. It is claimed that these advances have, in turn, wrought changes in the scope of what is involved in language assessment and in the methods by which it takes place (Howard et al., 1995). Resulting research has included reflection on the predictive validity of tests involving language production that are used as standard screening for language delay in children as young as 18 months (particularly in the light of research evidence that production and comprehension are not functionally discrete before 28 months) (Boyle et al., 1996). Other research, however, has looked at the nature of the language disorder. Windsor (1999) investigates the effect of semantic inconsistency on sentence grammaticality judgements for children with and without language-learning disabilities (LD), finding that children with LD differed most from their chronological age-group peers in the identification of ungrammatical sentences and that it is important to consider the effect on performance of competing linguistic information in the task. Holm et al. (1999) have developed a phonological assessment procedure for bilingual children, using this assessment to describe the phonological development, in each language, of normally developing bilingual children as well as of two bilingual children with speech disorders. They conclude that the normal phonological development of bilingual children differs from monolingual development in each of the languages and that the phonological output of bilingual children with speech disorders reflects a single underlying deficit. The findings of these studies have implications for the design of assessment tools as well as for the need to identify appropriate norms against which to measure performance on the assessments.

Such issues, particularly the identification of appropriate norms of performance, are also important in studies of young learners' readiness to access mainstream education in a language other than their heritage language. Recent research involving learners of English as an additional or second language (EAL/ESL) has benefited from work in the 1980s (e.g., Stansfield, 1981; Cummins, 1984a, 1984b; Barrs et al., 1988; Trueba, 1989) which problematised the use of standardised tests that had been normed on monolingual learners of English. The equity considerations they raised, particularly the false positive diagnosis of EAL/ESL learners as having learning disabilities, have resulted in the development of EAL/ESL learner 'profiles' (also called standards/benchmarks/scales) (see NLLIA, 1993; Australian Education Council, 1994; TESOL, 1998). Research has also focused on the provision of guidance for teachers when monitoring and reporting on learner progress (see McKay & Scarino, 1991; Genesee & Hamayan, 1994; Law & Eckes, 1995). Curriculum-based age-level tasks have also been developed to help teachers observe performance and place learners on a common framework/standard (Lumley et al., 1993).

However, these directions, though productive, have not been unproblematic, not least because they imply (and indeed encourage) differential assessment for EAL/ESL learners in order for individual students' needs to be identified and addressed. This can result in tension between the concerns of the educational system for ease of administration, appearances of equity and accountability and those of teachers for support in teaching and learning (see Brindley, 1995). Indeed, Australia and England and Wales have now introduced standardised testing
for all learners regardless of language background. The latter two countries are purportedly following a policy of entitlement for all but, as McKay (2000) argues, their motives are far more likely to be to simplify/rationalise reporting in order to make comparisons across schools and on which to predicate funding. Furthermore, and somewhat paradoxically, as Leung and Teasdale (1996) have established, the use of standardised attainment targets does not result in more equitable treatment of learners, because teachers implicitly apply native-speaker norms in making judgements of EAL/ESL learner performances.

Latterly, research has focused on classroom-based teacher assessment, looking, in the case of Rea-Dickins and Gardner (2000), at the constructs underlying formative and summative assessment and, in the case of Teasdale and Leung (2000), at the epistemic and practical challenges for alternative assessment. The overriding conclusion of both studies is that 'insufficient research has been done to establish what, if any, elements of assessment for learning and assessment as measurement are compatible' (Teasdale & Leung, 2000: 180), a concern no doubt shared by researchers studying the introduction of assessment of foreign languages in primary/elementary schools.

Indeed, the growing tendency to introduce a foreign language at the primary school level has resulted in a parallel growth in interest in how this early learning might be assessed. This research focuses on both formative (e.g., Hasselgren, 1998; Gattullo, 2000; Hasselgren, 2000; Zangl, 2000) and summative assessment (Johnstone, 2000; Edelenbos & Vinje, 2000) and is primarily concerned with how young learners' foreign language skills might be assessed, with an emphasis on identifying what learners can do. Motivated in many cases by a need to evaluate the effectiveness of language programmes (e.g., Carpenter et al., 1995; Edelenbos & Vinje, 2000), these studies document the challenges of designing tests for young learners. In doing so they cite, among other factors: the learners' need for fantasy and fun, the potentially detrimental effect of perceived 'failure' on future language learning, the need to design tasks that are developmentally appropriate and comparable for children of different language abilities who have studied in different schools/language programmes, and the potential problem inherent in tasks which encourage children to interact with an unfamiliar adult in the test situation (see Carpenter et al., 1995; Hasselgren, 1998, 2000). The studies also reflect a desire to understand how teachers implement assessment (Gattullo, 2000) as well as a need for inducting teachers into assessment practices in contexts where there is no tradition of assessment (Hasselgren, 1998).

Recent years have also seen a phenomenal increase in the number of commercial language classes for young learners, with a consequent market for certification of progress. The latest additions to the certificates available are the Saxoncourt Tests for Young Learners of English (STYLE) (http://www.saxoncourt.com/publishing.htm) and a suite of tests for young learners developed by the University of Cambridge Local Examinations Syndicate (UCLES): Starters, Movers and Flyers (http://www.cambridge-efl.org/exam/young/bg_yle.htm).

In the development of the latter, the cognitive development of young learners has purportedly been taken into account and, though certificates are issued, these are intended to reward young learners for what they can do. By adopting this approach it is hoped that the tests will be used to find out what the learners already know/have learned and to check if teaching objectives have been achieved (Wilson, 2001).

It is clear that, despite an avowed preference for teacher-based formative assessment, recent research on assessing young learners documents a growth in formal assessment, and ongoing research exemplifies the movement towards greater standardisation of assessment activities and measures of attainment. Furthermore, the expansion in formal assessment has led to increased specification of the language targets young learners might plausibly be expected to reach and indicates the spread of centrally specified curriculum goals. It seems that the field has moved forward in its understanding of the assessment needs of young learners yet has been pressed back by economic considerations. The challenge in the next decade will perhaps lie in addressing the tension between these competing agendas.

In this first part of the two-part review of language testing and assessment, we have reviewed relatively new concerns in language testing, beginning with an account of research into washback, and then moving on to discuss issues in the ethics and politics of language testing and the development of standards for language tests. After describing trends in testing on a national level and developments in testing for specific purposes, we surveyed developments in computer-based testing before discussing self-assessment and alternative assessment. Finally we reviewed the assessment of young learners.

In the second part of this review, to appear in April 2002, we describe developments in what are basically rather traditional concerns in language testing research, looking at the major language constructs (reading, listening, and so on) but in the context of a new approach to validity and validation, sometimes known as the Messick approach, or construct validation.

References
ALDERSON, J. C. (1988). Testing English for Specific Purposes: how specific can we get? ELT Documents, 127, 16-28.

ALDERSON, J. C. (1991). Language testing in the 1990s: How far have we got? How much further have we to go? In S. Anivan (Ed.), Current developments in language testing (Vol. 25, pp. 1-26). Singapore: SEAMEO Regional Language Centre.

ALDERSON, J. C. (1996). Do corpora have a role in language assessment? In J. Thomas & M. Short (Eds.), Using corpora for language research (pp. 248-59). Harlow: Longman.

ALDERSON, J. C. (1997). Ethics and language testing. Paper presented at the annual TESOL Convention, Orlando, Florida.

ALDERSON, J. C. (1998). Testing and teaching: the dream and the reality. novELTy, 5(4), 23-37.

ALDERSON, J. C. (1999). What does PESTI have to do with us testers? Paper presented at the International Language Education Conference, Hong Kong.

ALDERSON, J. C. (2000a). Levels of performance. In J. C. Alderson, E. Nagy & E. Oveges (Eds.), English language education in Hungary, Part II: Examining Hungarian learners' achievements in English. Budapest: The British Council.

ALDERSON, J. C. (2000b). Exploding myths: Does the number of hours per week matter? novELTy, 7(1), 17-32.

ALDERSON, J. C. (2000c). Technology in testing: the present and the future. System, 28(4), 593-603.

ALDERSON, J. C. (2000d). Assessing reading. Cambridge: Cambridge University Press.

ALDERSON, J. C. (2001a). Testing is too important to be left to the tester. Paper presented at the 3rd Annual Language Testing Symposium, Dubai, United Arab Emirates.

ALDERSON, J. C. (2001b). The lift is being fixed. You will be unbearable today (Or why we hope that there will not be translation on the new English érettségi). Paper presented at the Magyar Macmillan Conference, Budapest, Hungary.

ALDERSON, J. C. & BUCK, G. (1993). Standards in testing: a study of the practice of UK examination boards in EFL/ESL testing. Language Testing, 10(1), 1-26.

ALDERSON, J. C., CLAPHAM, C. & WALL, D. (1995). Language test construction and evaluation. Cambridge: Cambridge University Press.

ALDERSON, J. C. & HAMP-LYONS, L. (1996). TOEFL preparation courses: a study of washback. Language Testing, 13(3), 280-97.

ALDERSON, J. C., NAGY, E. & OVEGES, E. (Eds.) (2000a). English language education in Hungary, Part II: Examining Hungarian learners' achievements in English. Budapest: The British Council.

ALDERSON, J. C., PERCSICH, R. & SZABO, G. (2000b). Sequencing as an item type. Language Testing, 17(4), 423-47.

ALDERSON, J. C. & WALL, D. (1993). Does washback exist? Applied Linguistics, 14(2), 115-29.

ALTE (1998). The ALTE handbook of European examinations and examination systems. Cambridge: UCLES.

AUSTRALIAN EDUCATION COUNCIL (1994). ESL Scales. Melbourne: Curriculum Corporation.

BACHMAN, L. F. & PALMER, A. S. (1989). The construct validation of self-ratings of communicative language ability. Language Testing, 6(1), 14-29.

BACHMAN, L. F. & PALMER, A. S. (1996). Language testing in practice. Oxford: Oxford University Press.

BAILEY, K. (1996). Working for washback: A review of the washback concept in language testing. Language Testing, 13(3), 257-79.

BAKER, C. (1988). Normative testing and bilingual populations. Journal of Multilingual and Multicultural Development, 9(5), 399-409.

BANERJEE, J., CLAPHAM, C., CLAPHAM, P. & WALL, D. (Eds.) (1999). ILTA language testing bibliography 1990-1999, First edition. Lancaster, UK: Language Testing Update.

BARBOT, M.-J. (1991). New approaches to evaluation in self-access learning (trans. from French). Études de Linguistique Appliquée, 79, 77-94.

BARNES, A., HUNT, M. & POWELL, B. (1999). Dictionary use in the teaching and examining of MFLs at GCSE. Language Learning Journal, 19, 19-27.

BARNES, A. & POMFRETT, G. (1998). Assessment in German at KS3: how can it be consistent, fair and appropriate? Deutsch: Lehren und Lernen, 17, 2-6.

BARRS, M., ELLIS, S., HESTER, H. & THOMAS, A. (1988). The Primary Language Record: A handbook for teachers. London: Centre for Language in Primary Education.

BENNETT, R. E. (1998). Reinventing assessment: speculations on the future of large-scale educational testing. Princeton, New Jersey: Educational Testing Service.

BÜGEL, K. & LEIJN, M. (1999). New exams in secondary education, new question types. An investigation into the reliability of the evaluation of open-ended questions in foreign-language exams. Levende Talen, 537, 173-81.

BLANCHE, P. (1990). Using standardised achievement and oral proficiency tests for self-assessment purposes: the DLIFLC study. Language Testing, 7(2), 202-29.

BLANCHE, P. & MERINO, B. J. (1989). Self-assessment of foreign language skills: implications for teachers and researchers. Language Learning, 39(3), 313-40.

BLONDIN, C., CANDELIER, M., EDELENBOS, P., JOHNSTONE, R., KUBANEK-GERMAN, A. & TAESCHNER, T. (1998). Foreign languages in primary and preschool education: context and outcomes. A review of recent research within the European Union. London: CILT.

BLUE, G. M. (1988). Self assessment: the limits of learner independence. ELT Documents, 131, 100-18.

BOLOGNA DECLARATION (1999). Joint declaration of the European Ministers of Education convened in Bologna on the 19th of June 1999. http://europa.eu.int/comm/education/socrates/erasmus/bologna.pdf

BOYLE, J., GILLHAM, B. & SMITH, N. (1996). Screening for early language delay in the 18-36 month age-range: the predictive validity of tests of production and implications for practice. Child Language Teaching and Therapy, 12(2), 113-27.

BREEN, M. P., BARRATT-PUGH, C., DEREWIANKA, B., HOUSE, H., HUDSON, C., LUMLEY, T. & ROHL, M. (Eds.) (1997). Profiling ESL children: how teachers interpret and use national and state assessment frameworks (Vol. 1). Commonwealth of Australia: Department of Employment, Education, Training and Youth Affairs.

BRINDLEY, G. (1995). Assessment and reporting in language learning programs: Purposes, problems and pitfalls. Plenary presentation at the International Conference on Testing and Evaluation in Second Language Education, Hong Kong University of Science and Technology, 21-24 June 1995.

BRINDLEY, G. (1998). Outcomes-based assessment and reporting in language learning programmes: a review of the issues. Language Testing, 15(1), 45-85.

BRINDLEY, G. (2001). Outcomes-based assessment in practice: some examples and emerging insights. Language Testing, 18(4), 393-407.

BROWN, A. (1995). The effect of rater variables in the development of an occupation-specific language performance test. Language Testing, 12(1), 1-15.

BROWN, A. & IWASHITA, N. (1996). Language background and item difficulty: the development of a computer-adaptive test of Japanese. System, 24(2), 199-206.

BROWN, A. & LUMLEY, T. (1997). Interviewer variability in specific-purpose language performance tests. In A. Huhta, V. Kohonen, L. Kurki-Suonio & S. Luoma (Eds.), Current developments and alternatives in language assessment (137-50). Jyväskylä: Centre for Applied Language Studies, University of Jyväskylä.

BROWN, J. D. (1997). Computers in language testing: present research and some future directions. Language Learning and Technology, 1(1), 44-59.

BROWN, J. D. & HUDSON, T. (1998). The alternatives in language assessment. TESOL Quarterly, 32(4), 653-75.
BRUTON, A. (1991). Continuous assessment in Spanish state schools. Language Testing Update, 10, 14-20.

BUCKBY, M. (1999). The use of the target language at GCSE. Language Learning Journal, 19, 4-11.

BURSTEIN, J., FRASE, L. T., GINTHER, A. & GRANT, L. (1996). Technologies for language assessment. Annual Review of Applied Linguistics, 16, 240-60.

CARPENTER, K., FUJII, N. & KATAOKA, H. (1995). An oral interview procedure for assessing second language abilities in children. Language Testing, 12(2), 157-81.

CARROLL, B. J. & WEST, R. (1989). ESU Framework: Performance scales for English language examinations. Harlow: Longman.

CARTON, F. (1993). Self-evaluation at the heart of learning. Le Français dans le Monde (special number), 28-35.

CELESTINE, C. & CHEAH, S. M. (1999). The effect of background disciplines on IELTS scores. In R. Tulloh (Ed.), IELTS Research Reports 1999 (Vol. 2, 36-51). Canberra: IELTS Australia Pty Limited.

CHALHOUB-DEVILLE, M. & DEVILLE, C. (1999). Computer-adaptive testing in second language contexts. Annual Review of Applied Linguistics, 19, 273-99.

CHAMBERS, F. & RICHARDS, B. (1992). Criteria for oral assessment. Language Learning Journal, 6, 5-9.

CHAPELLE, C. (1999). Validity in language assessment. Annual Review of Applied Linguistics, 19, 254-72.

CHARGE, N. & TAYLOR, L. B. (1997). Recent developments in IELTS. ELT Journal, 51(4), 374-80.

CHEN, Z. & HENNING, G. (1985). Linguistic and cultural bias in language proficiency tests. Language Testing, 2(2), 155-63.

CHENG, L. (1997). How does washback influence teaching? Implications for Hong Kong. Language and Education, 11(1), 38-54.

CLAPHAM, C. (1996). The development of IELTS: a study of the effect of background knowledge on reading comprehension (Studies in Language Testing Series, Vol. 4). Cambridge: University of Cambridge Local Examinations Syndicate and Cambridge University Press.

CLAPHAM, C. (2000a). Assessment for academic purposes: where next? System, 28, 511-21.

CLAPHAM, C. (2000b). Assessment and testing. Annual Review of Applied Linguistics, 20, 147-61.

CONIAM, D. (1994). Designing an ability scale for English across the range of secondary school forms. Hong Kong Papers in Linguistics and Language Teaching, 17, 55-61.

CONIAM, D. (1995). Towards a common ability scale for Hong Kong English secondary-school forms. Language Testing, 12(2), 182-93.

CONIAM, D. (1997). A computerised English language proofing cloze program. Computer-Assisted Language Learning, 10(1), 83-97.

CONIAM, D. (1998). From text to test, automatically - an evaluation of a computer cloze-test generator. Hong Kong Journal of Applied Linguistics, 3(1), 41-60.

COUNCIL OF EUROPE (2001). A Common European Framework of reference for learning, teaching and assessment. Cambridge: Cambridge University Press.

CSEPES, I., SULYOK, A. & OVEGES, E. (2000). The pilot speaking examinations. In J. C. Alderson, E. Nagy & E. Oveges (Eds.), English language education in Hungary, Part II: Examining Hungarian learners' achievements in English. Budapest: The British Council.

CUMMING, A. (1994). Does language assessment facilitate recent immigrants' participation in Canadian society? TESL Canada Journal, 11(2), 117-33.

CUMMING, A. (1995). Changing definitions of language proficiency: functions of language assessment in educational programmes for recent immigrant learners of English in Canada. Journal of the CAAL, 17(1), 35-48.

CUMMINS, J. (1984a). Bilingualism and special education: Issues in assessment and pedagogy. Clevedon, England: Multilingual Matters.

CUMMINS, J. (1984b). Wanted: A theoretical framework for relating language proficiency to academic achievement among bilingual students. In C. Rivera (Ed.), Language proficiency and academic achievement (Vol. 10). Clevedon, England: Multilingual Matters.

DAVIDSON, F. (1994). Norms appropriacy of achievement tests: Spanish-speaking children and English children's norms. Language Testing, 11(1), 83-95.

DAVIES, A. (1978). Language testing: survey articles 1 and 2. Language Teaching and Linguistics Abstracts, 11, 145-59 and 215-31.

DAVIES, A. (1997). Demands of being professional in language testing. Language Testing, 14(3), 328-39.

DAVIES, A. (2001). The logic of testing Languages for Specific Purposes. Language Testing, 18(2), 133-47.

DE JONG, J. H. A. L. (1992). Assessment of language proficiency in the perspective of the 21st century. AILA Review, 9, 39-45.

DOLLERUP, C., GLAHN, E. & ROSENBERG HANSEN, C. (1994). 'Sprogtest': a smart test (or how to develop a reliable and anonymous EFL reading test). Language Testing, 11(1), 65-81.

DOUGLAS, D. (1995). Developments in language testing. Annual Review of Applied Linguistics, 15, 167-87.

DOUGLAS, D. (1997). Language for specific purposes testing. In C. Clapham & D. Corson (Eds.), Language testing and assessment (Vol. 7, 111-19). Dordrecht, The Netherlands: Kluwer Academic Publishers.

DOUGLAS, D. (2000). Assessing languages for specific purposes. Cambridge: Cambridge University Press.

DOUGLAS, D. (2001a). Three problems in testing language for specific purposes: authenticity, specificity and inseparability. In C. Elder, A. Brown, E. Grove, K. Hill, N. Iwashita, T. Lumley, T. F. McNamara & K. O'Loughlin (Eds.), Experimenting with uncertainty: essays in honour of Alan Davies (Studies in Language Testing Series, Vol. 11, 45-51). Cambridge: University of Cambridge Local Examinations Syndicate and Cambridge University Press.

DOUGLAS, D. (2001b). Language for Specific Purposes assessment criteria: where do they come from? Language Testing, 18(2), 171-85.

DOUGLAS, D. & SELINKER, L. (1992). Analysing oral proficiency test performance in general and specific-purpose contexts. System, 20(3), 317-28.

DUNKEL, P. (1999). Considerations in developing or using second/foreign language proficiency computer-adaptive tests. Language Learning and Technology, 2(2), 77-93.

EDELENBOS, P. & JOHNSTONE, R. (Eds.) (1996). Researching languages at primary school: some European perspectives. London: CILT, in collaboration with Scottish CILT and GION.

EDELENBOS, P. & VINJE, M. P. (2000). The assessment of a foreign language at the end of primary (elementary) education. Language Testing, 17(2), 144-62.

ELDER, C. (1997). What does test bias have to do with fairness? Language Testing, 14(3), 261-77.

ELDER, C. (2001). Assessing the language proficiency of teachers: are there any border controls? Language Testing, 18(2), 149-70.

FEKETE, H., MAJOR, E. & NIKOLOV, M. (Eds.) (1999). English language education in Hungary: A baseline study. Budapest: The British Council.

FOX, J., PYCHYL, T. & ZUMBO, B. (1997). An investigation of background knowledge in the assessment of language proficiency. In A. Huhta, V. Kohonen, L. Kurki-Suonio & S. Luoma (Eds.), Current developments and alternatives in language assessment (367-83). Jyväskylä: University of Jyväskylä.

FULCHER, G. (1999a). Assessment in English for Academic Purposes: putting content validity in its place. Applied Linguistics, 20(2), 221-36.

FULCHER, G. (1999b). Computerising an English language placement test. ELT Journal, 53(4), 289-99.

FULCHER, G. & BAMFORD, R. (1996). I didn't get the grade I need. Where's my solicitor? System, 24(4), 437-48.

GATTULLO, F. (2000). Formative assessment in ELT primary (elementary) classrooms: an Italian case study. Language Testing, 17(2), 278-88.
GENESEE, F. & HAMAYAN, E. V. (1994). Classroom-based assessment. In F. Genesee (Ed.), Educating second language children. Cambridge: Cambridge University Press.

GERVAIS, C. (1997). Computers and language testing: a harmonious relationship? Francophonie, 16, 3-7.

GIMENEZ, J. C. (1996). Process assessment in ESP: input, throughput and output. English for Specific Purposes, 15(3), 233-41.

GINTHER, A. (forthcoming). Context and content visuals and performance on listening comprehension stimuli. Language Testing.

GROOT, P. J. M. (1990). Language testing in research and education: the need for standards. AILA Review, 7, 9-23.

GUILLON, M. (1997). L'évaluation ministérielle en classe de seconde en anglais. Les Langues Modernes, 2, 32-39.

HAGGSTROM, M. (1994). Using a videocamera and task-based activities to make classroom oral testing a more realistic communicative experience. Foreign Language Annals, 27(2), 161-75.

HAHN, S., STASSEN, T. & DESCHKE, C. (1989). Grading classroom oral activities: effects on motivation and proficiency. Foreign Language Annals, 22(3), 241-52.

HALLECK, G. B. & MODER, C. L. (1995). Testing language and teaching skills of international teaching assistants: the limits of compensatory strategies. TESOL Quarterly, 29(4), 733-57.

HAMAYAN, E. (1995). Approaches to alternative assessment. Annual Review of Applied Linguistics, 15, 212-26.

HAMILTON, J., LOPES, M., MCNAMARA, T. & SHERIDAN, E. (1993). Rating scales and native speaker performance on a communicatively oriented EAP test. Language Testing, 10(3), 337-53.

HAMP-LYONS, L. (1996). Applying ethical standards to portfolio assessment of writing in English as a second language. In M. Milanovic & N. Saville (Eds.), Performance testing, cognition and assessment: Selected papers from the 15th Language Testing Research Colloquium (Studies in Language Testing Series, Vol. 3, 151-64). Cambridge: Cambridge University Press.

HAMP-LYONS, L. (1997). Washback, impact and validity: ethical concerns. Language Testing, 14(3), 295-303.

HAMP-LYONS, L. (1998). Ethics in language testing. In C. M. Clapham & D. Corson (Eds.), Language testing and assessment (Vol. 7). Dordrecht, The Netherlands: Kluwer Academic Publishing.

HAMP-LYONS, L. & CONDON, W. (1993). Questioning assumptions about portfolio-based assessment. College Composition and Communication, 44(2), 176-90.

HAMP-LYONS, L. & CONDON, W. (1999). Assessing college writing portfolios: principles for practice, theory, research. Cresskill, NJ: Hampton Press.

HARGAN, N. (1994). Learner autonomy by remote control. System, 22(4), 455-62.

HASSELGREN, A. (1998). Small words and good testing. Unpublished PhD dissertation, University of Bergen, Bergen.

HASSELGREN, A. (2000). The assessment of the English ability of young learners in Norwegian schools: an innovative approach. Language Testing, 17(2), 261-77.

HAWTHORNE, L. (1997). The political dimension of language testing in Australia. Language Testing, 14(3), 248-60.

HEILENMAN, L. K. (1990). Self-assessment of second language ability: the role of response effects. Language Testing, 7(2), 174-201.

HENRICHSEN, L. E. (1989). Diffusion of innovations in English language teaching: The ELEC effort in Japan, 1956-1968. New York: Greenwood Press.

HOLM, A., DODD, B., STOW, C. & PERT, S. (1999). Identification and differential diagnosis of phonological disorder in bilingual children. Language Testing, 16(3), 271-92.

HOWARD, S., HARTLEY, J. & MUELLER, D. (1995). The changing face of child language assessment: 1985-1995. Child Language Teaching and Therapy, 11(1), 7-22.

HUGHES, A. (1993). Backwash and TOEFL 2000. Unpublished manuscript, University of Reading.

HUGHES WILHELM, K. (1996). Combined assessment model for EAP writing workshop: portfolio decision-making, criterion-referenced grading and contract negotiation. TESL Canada Journal, 14(1), 21-33.

HURMAN, J. (1990). Deficiency and development. Francophonie, 1, 8-12.

ILTA - INTERNATIONAL LANGUAGE TESTING ASSOCIATION (1997). Code of practice for foreign/second language testing. Lancaster: ILTA. [Draft, March, 1997].

ILTA - INTERNATIONAL LANGUAGE TESTING ASSOCIATION. Code of Ethics. [http://www.surrey.ac.uk/ELI/ltrfile/ltr-frame.html]

JAMIESON, J., TAYLOR, C., KIRSCH, I. & EIGNOR, D. (1998). Design and evaluation of a computer-based TOEFL tutorial. System, 26(4), 485-513.

JANSEN, H. & PEER, C. (1999). Using dictionaries with national foreign-language examinations for reading comprehension. Levende Talen, 544, 639-41.

JENNINGS, M., FOX, J., GRAVES, B. & SHOHAMY, E. (1999). The test-takers' choice: an investigation of the effect of topic on language-test performance. Language Testing, 16(4), 426-56.

JENSEN, C. & HANSEN, C. (1995). The effect of prior knowledge on EAP listening-test performance. Language Testing, 12(1), 99-119.

JOHNSTONE, R. (2000). Context-sensitive assessment of modern languages in primary (elementary) and early secondary education: Scotland and the European experience. Language Testing, 17(2), 123-43.

KALTER, A. O. & VOSSEN, P. W. J. E. (1990). EUROCERT: an international standard for certification of language proficiency. AILA Review, 7, 91-106.

KHANIYA, T. R. (1990a). Examinations as instruments for educational change: Investigating the washback effect of the Nepalese English exams. Unpublished PhD dissertation, University of Edinburgh, Edinburgh.

KHANIYA, T. R. (1990b). The washback effect of a textbook-based test. Edinburgh Working Papers in Applied Linguistics, 1, 48-58.

KIEWEG, W. (1992). Leistungsmessung im Fach Englisch: Praktische Vorschläge zur Konzeption von Lernzielkontrollen. Fremdsprachenunterricht, 45(6), 321-32.

KIEWEG, W. (1999). Allgemeine Gütekriterien für Lernzielkontrollen (Common standards for the control of learning). Der Fremdsprachliche Unterricht Englisch, 37(1), 4-11.

LAURIER, M. (1998). Méthodologie d'évaluation dans des contextes d'apprentissage des langages assistés par des environnements informatiques multimédias. Études de Linguistique Appliquée, 110, 247-55.

LAW, B. & ECKES, M. (1995). Assessment and ESL. Winnipeg, Canada: Peguis.

LEE, B. (1989). Classroom-based assessment - why and how? British Journal of Language Teaching, 27(2), 73-6.

LEUNG, C. & TEASDALE, A. (1996). English as an additional language within the National Curriculum: A study of assessment practices. Prospect, 12(2), 58-68.

LEUNG, C. & TEASDALE, A. (1997). What do teachers mean by speaking and listening: a contextualised study of assessment in the English National Curriculum. In A. Huhta, V. Kohonen, L. Kurki-Suonio & S. Luoma (Eds.), New contexts, goals and alternatives in language assessment (291-324). Jyväskylä: University of Jyväskylä.

LEWKOWICZ, J. A. (1997). Investigating authenticity in language testing. Unpublished PhD dissertation, Lancaster University, Lancaster.

LEWKOWICZ, J. A. (2000). Authenticity in language testing: some outstanding questions. Language Testing, 17(1), 43-64.

LEWKOWICZ, J. A. & MOON, J. (1985). Evaluation, a way of involving the learner. In J. C. Alderson (Ed.), Lancaster Practical Papers in English Language Education (Vol. 6: Evaluation), 45-80. Oxford: Pergamon Press.
LI, K. C. (1997). The labyrinth of exit standard controls. Hong Kong Journal of Applied Linguistics, 2(1), 23-38.

LIDDICOAT, A. (1996). The Language Profile: oral interaction. Babel, 31(2), 4-7, 35.

LIDDICOAT, A. J. (1998). Trialling the languages profile in the A.C.T. Babel, 33(2), 14-38.

LOW, L., DUFFIELD, J., BROWN, S. & JOHNSTONE, R. (1993). Evaluating foreign languages in Scottish primary schools: report to Scottish Office. Stirling: University of Stirling: Scottish CILT.

LUMLEY, T. (1998). Perceptions of language-trained raters and occupational experts in a test of occupational English language proficiency. English for Specific Purposes, 17(4), 347-67.

LUMLEY, T. & BROWN, A. (1998). Authenticity of discourse in a specific purpose test. In E. Li & G. James (Eds.), Testing and evaluation in second language education (22-33). Hong Kong: The Language Centre, The University of Science and Technology.

LUMLEY, T. & MCNAMARA, T. F. (1995). Rater characteristics and rater bias: implications for training. Language Testing, 12(1), 54-71.

LUMLEY, T., RASO, E. & MINCHAM, L. (1993). Exemplar assessment activities. In NLLIA (Ed.), NLLIA ESL Development: Language and Literacy in Schools. Canberra: National Languages and Literacy Institute of Australia.

LYNCH, B. (1997). In search of the ethical test. Language Testing, 14(3), 315-27.

LYNCH, B. & DAVIDSON, F. (1994). Criterion-referenced test development: linking curricula, teachers and tests. TESOL Quarterly, 28(4), 727-43.

LYNCH, T. (1988). Peer evaluation in practice. ELT Documents, 131, 119-25.

MANLEY, J. H. (1995). Assessing students' oral language: one school district's response. Foreign Language Annals, 28(1), 93-102.

MCKAY, P. (2000). On ESL standards for school-age learners. Language Testing, 17(2), 185-214.

MCKAY, P., HUDSON, C. & SAPUPPO, M. (1994). ESL bandscales, NLLIA ESL development: language and literacy in schools project. Canberra: National Languages and Literacy Institute of Australia.

MCKAY, P. & SCARINO, A. (1991). The ESL Framework of Stages. Melbourne: Curriculum Corporation.

MCNAMARA, T. (1998). Policy and social considerations in language assessment. Annual Review of Applied Linguistics, 18, 304-19.

MCNAMARA, T. F. (1995). Modelling performance: opening Pandora's box. Applied Linguistics, 16(2), 159-75.

MCNAMARA, T. F. & LUMLEY, T. (1997). The effect of interlocutor and assessment mode variables in overseas assessments of speaking skills in occupational settings. Language Testing, 14(2), 140-56.

MESSICK, S. (1994). The interplay of evidence and consequences in the validation of performance assessments. Educational Researcher, 23(2), 13-23.

MESSICK, S. (1996). Validity and washback in language testing. Language Testing, 13(3), 241-56.

MILANOVIC, M. (1995). Comparing language qualifications in different languages: a framework and code of practice. System, 23(4), 467-79.

MOELLER, A. J. & RESCHKE, C. (1993). A second look at grading and classroom performance: report of a research study. Modern Language Journal, 77(2), 163-9.

MOORE, T. & MORTON, J. (1999). Authenticity in the IELTS academic module writing test: a comparative study of task 2 items and university assignments. In R. Tulloh (Ed.), IELTS Research Reports 1999 (Vol. 2, 64-106). Canberra: IELTS Australia Pty Limited.

MUNDZECK, F. (1993). Die Problematik objektiver Leistungsmessung in einem kommunikativen Fremdsprachenunterricht: am Beispiel des Französischen. Fremdsprachenunterricht, 46, 449-54.

NLLIA (NATIONAL LANGUAGES AND LITERACY INSTITUTE OF AUSTRALIA) (1993). NLLIA ESL Development: Language and Literacy in Schools. Canberra: National Languages and Literacy Institute of Australia.

NEIL, D. (1989). Foreign languages in the National Curriculum - what to teach and how to test? A proposal for the Languages Task Group. Modern Languages, 70(1), 5-9.

NORTH, B. & SCHNEIDER, G. (1998). Scaling descriptors for language proficiency scales. Language Testing, 15(2), 217-62.

NORTON, B. & STARFIELD, S. (1997). Covert language assessment in academic writing. Language Testing, 14(3), 278-94.

OSCARSON, M. (1984). Self-assessment of foreign language skills: a survey of research and development work. Strasbourg, France: Council of Europe, Council for Cultural Co-operation.

OSCARSON, M. (1989). Self-assessment of language proficiency: rationale and applications. Language Testing, 6(1), 1-13.

OSCARSON, M. (1997). Self-assessment of foreign and second language proficiency. In C. Clapham & D. Corson (Eds.), Language testing and assessment (Vol. 7, 175-87). Dordrecht, The Netherlands: Kluwer Academic Publishers.

PADILLA, A. M., ANINAO, J. C. & SUNG, H. (1996). Development and implementation of student portfolios in foreign language programs. Foreign Language Annals, 29(3), 429-38.

PAGE, B. (1993). The target language and examinations. Language Learning Journal, 8, 6-7.

PAPAJOHN, D. (1999). The effect of topic variation in performance testing: the case of the chemistry TEACH test for international teaching assistants. Language Testing, 16(1), 52-81.

PEARSON, I. (1988). Tests as levers for change. In D. Chamberlain & R. Baumgardner (Eds.), ESP in the classroom: Practice and evaluation (Vol. 128, 98-107). London: Modern English Publications.

PEIRCE, B. N. & STEWART, G. (1997). The development of the Canadian Language Benchmarks Assessment. TESL Canada Journal, 14(2), 17-31.

PLAKANS, B. & ABRAHAM, R. G. (1990). The testing and evaluation of international teaching assistants. In D. Douglas (Ed.), English language testing in U.S. colleges and universities (68-81). Washington DC: NAFSA.

PUGSLEY, J. (1988). Autonomy and individualisation in language learning: institutional implications. ELT Documents, 131, 54-61.

REA-DICKINS, P. (1987). Testing doctors' written communicative competence: an experimental technique in English for specialist purposes. Quantitative Linguistics, 34, 185-218.

REA-DICKINS, P. (1997). So why do we need relationships with stakeholders in language testing? A view from the UK. Language Testing, 14(3), 304-14.

REA-DICKINS, P. & GARDNER, S. (2000). Snares or silver bullets: disentangling the construct of formative assessment. Language Testing, 17(2), 215-43.

READ, J. (1990). Providing relevant content in an EAP writing test. English for Specific Purposes, 1, 243-68.

REED, D. J. & HALLECK, G. B. (1997). Probing above the ceiling in oral interviews: what's up there? In A. Huhta, V. Kohonen, L. Kurki-Suonio & S. Luoma (Eds.), Current developments and alternatives in language assessment. Jyväskylä: University of Jyväskylä.

RICHARDS, B. & CHAMBERS, F. (1996). Reliability and validity in the GCSE oral examination. Language Learning Journal, 14, 28-34.

ROSS, S. (1998). Self-assessment in second language testing: a meta-analysis of experiential factors. Language Testing, 15(1), 1-20.

ROSSITER, M. & PAWLIKOWSKA-SMITH, G. (1999). The use of CLBA scores in LINC program placement practices in Western Canada. TESL Canada Journal, 16(2), 39-52.
ROY, M.-J. (1988). Writing in the GCSE - modern languages. British Journal of Language Teaching, 26(2), 99-102.

SCIARONE, A. G. (1995). A fully automatic homework checking system. IRAL, 33(1), 35-46.

SCOTT, M. L., STANSFIELD, C. W. & KENYON, D. M. (1996). Examining validity in a performance test: the listening summary translation exam (LSTE). Language Testing, 13, 83-109.

SHAMEEM, N. (1998). Validating self-reported language proficiency by testing performance in an immigrant community: the Wellington Indo-Fijians. Language Testing, 15(1), 86-108.

SHOHAMY, E. (1993). The power of tests: The impact of language tests on teaching and learning. NFLC Occasional Papers. Washington, D.C.: The National Foreign Language Center.

SHOHAMY, E. (1997a). Testing methods, testing consequences: are they ethical? Language Testing, 14(3), 340-9.

SHOHAMY, E. (1997b). Critical language testing and beyond. Plenary paper presented at the American Association for Applied Linguistics, Orlando, Florida, 8-11 March.

SHOHAMY, E. (2001a). The power of tests. London: Longman.

SHOHAMY, E. (2001b). Democratic assessment as an alternative. Language Testing, 18(4), 373-92.

SHOHAMY, E., DONITSA-SCHMIDT, S. & FERMAN, I. (1996). Test impact revisited: washback effect over time. Language Testing, 13(3), 298-317.

SHORT, D. (1993). Assessing integrated language and content instruction. TESOL Quarterly, 27(4), 627-56.

SKEHAN, P. (1988). State of the art: language testing, part I. Language Teaching, 21, 211-21.

SKEHAN, P. (1989). State of the art: language testing, part II. Language Teaching, 22, 1-13.

SPOLSKY, B. (1997). The ethics of gatekeeping tests: what have we learned in a hundred years? Language Testing, 14(3), 242-7.

STANSFIELD, C. W. (1981). The assessment of language proficiency in bilingual children: An analysis of theories and instrumentation. In R. V. Padilla (Ed.), Bilingual education and technology.

STANSFIELD, C. W., SCOTT, M. L. & KENYON, D. M. (1990). Listening summary translation exam (LSTE) - Spanish (Final Project Report. ERIC Document Reproduction Service, ED 323 786). Washington DC: Centre for Applied Linguistics.

STANSFIELD, C. W., WU, W. M. & LIU, C. C. (1997). Listening Summary Translation Exam (LSTE) in Taiwanese, aka Minnan (Final Project Report. ERIC Document Reproduction Service, ED 413 788). N. Bethesda, MD: Second Language Testing, Inc.

STANSFIELD, C. W., WU, W. M. & VAN DER HEIDE, M. (2000). A job-relevant listening summary translation exam in Minnan. In A. J. Kunnan (Ed.), Fairness and validation in language assessment (Studies in Language Testing Series, Vol. 9, 177-200). Cambridge: University of Cambridge Local Examinations Syndicate and Cambridge University Press.

TARONE, E. (2001). Assessing language skills for specific purposes: describing and analysing the 'behaviour domain'. In C. Elder, A. Brown, E. Grove, K. Hill, N. Iwashita, T. Lumley, T. F. McNamara & K. O'Loughlin (Eds.), Experimenting with uncertainty: essays in honour of Alan Davies (Studies in Language Testing Series, Vol. 11, 53-60). Cambridge: University of Cambridge Local Examinations Syndicate and Cambridge University Press.

TAYLOR, C., KIRSCH, I., EIGNOR, D. & JAMIESON, J. (1999). Examining the relationship between computer familiarity and performance on computer-based language tasks. Language Learning, 49(2), 219-74.

TEASDALE, A. & LEUNG, C. (2000). Teacher assessment and psychometric theory: a case of paradigm crossing? Language Testing, 17(2), 163-84.

TESOL (1998). Managing the assessment process. A framework for measuring student attainment of the ESL standards. Alexandria, VA: TESOL.

TRUEBA, H. T. (1989). Raising silent voices: educating the linguistic minorities for the twenty-first century. New York: Newbury House.

VAN EK, J. A. (1997). The Threshold Level for modern language learning in schools. London: Longman.

VAN ELMPT, M. & LOONEN, P. (1998). Open questions: answers in the foreign language? Toegepaste Taalwetenschap in Artikelen, 58, 149-54.

VANDERGRIFT, L. & BELANGER, C. (1998). The National Core French Assessment Project: design and field test of formative evaluation instruments at the intermediate level. The Canadian Modern Language Review, 54(4), 553-78.

WALL, D. (1996). Introducing new tests into traditional systems: Insights from general education and from innovation theory. Language Testing, 13(3), 334-54.

WALL, D. (2000). The impact of high-stakes testing on teaching and learning: can this be predicted or controlled? System, 28, 499-509.

WALL, D. & ALDERSON, J. C. (1993). Examining washback: The Sri Lankan impact study. Language Testing, 10(1), 41-69.

WATANABE, Y. (1996). Does Grammar-Translation come from the Entrance Examination? Preliminary findings from classroom-based research. Language Testing, 13(3), 319-33.

WATANABE, Y. (2001). Does the university entrance examination motivate learners? A case study of learner interviews. In Akita Association of English Studies (Ed.), Trans-equator exchanges: A collection of academic papers in honour of Professor David Ingram, 100-10.

WEIR, C. J. & ROBERTS, J. (1994). Evaluation in ELT. Oxford: Blackwell Publishers.

WELLING-SLOOTMAEKERS, M. (1999). Language examinations in Dutch secondary schools from 2000 onwards. Levende Talen, 542, 488-90.

WILSON, J. (2001). Assessing young learners: what makes a good test? Paper presented at the Association of Language Testers in Europe (ALTE) Conference, Barcelona, 5-7 July 2001.

WINDSOR, J. (1999). Effect of semantic inconsistency on sentence grammaticality judgements for children with and without language-learning disabilities. Language Testing, 16(3), 293-313.

WU, W. M. & STANSFIELD, C. W. (2001). Towards authenticity of task in test development. Language Testing, 18(2), 187-206.

YOUNG, R., SHERMIS, M. D., BRUTTEN, S. R. & PERKINS, K. (1996). From conventional to computer-adaptive testing of ESL reading comprehension. System, 24(1), 23-40.

YULE, G. (1990). Predicting success for international teaching assistants in a US university. TESOL Quarterly, 24(2), 227-43.

ZANGL, R. (2000). Monitoring language skills in Austrian primary (elementary) schools: a case study. Language Testing, 17(2), 250-60.