Language testing and assessment (Part I)

J Charles Alderson and Jayanti Banerjee, Lancaster University, UK

Lang. Teach. 34, 213-236. DOI: 10.1017/S0261444801001707
J Charles Alderson is Professor of Linguistics and English Language Education at Lancaster University. He holds an MA in German and French from Oxford University and a PhD in Applied Linguistics from Edinburgh University. He is co-editor of the journal Language Testing (Edward Arnold), and co-editor of the Cambridge Language Assessment Series (CUP), and has published many books and articles on language testing, reading in a foreign language, and evaluation of language education.

Jayanti Banerjee is a PhD student in the Department of Linguistics and Modern English Language at Lancaster University. She has been involved in a number of test development and research projects and has taught on introductory testing courses. She has also been involved in teaching English for Academic Purposes (EAP) at Lancaster University. Her research interests include the teaching and assessment of EAP as well as qualitative research methods. She is particularly interested in issues related to the interpretation and use of test scores.

Introduction

This is the third in a series of State-of-the-Art review articles in language testing in this journal, the first having been written by Alan Davies in 1978 and the second by Peter Skehan in 1988/1989. Skehan remarked that testing had witnessed an explosion of interest, research and publications in the ten years since the first review article, and several commentators have since made similar remarks. We can only concur, and for quantitative corroboration would refer the reader to Alderson (1991) and to the International Language Testing Association (ILTA) Bibliography 1990-1999 (Banerjee et al., 1999). In the latter bibliography, there are 866 entries, divided into 15 sections, from Testing Listening to Ethics and Standards. The field has become so large and so active that it is virtually impossible to do justice to it, even in a multi-part 'State-of-the-Art' review like this, and it is changing so rapidly that any prediction of trends is likely to be outdated before it is printed.

In this review, therefore, we not only try to avoid anything other than rather bland predictions, we also acknowledge the partiality of our choice of topics and trends, as well, necessarily, of our selection of publications. We have tried to represent the field fairly, but have tended to concentrate on articles rather than books, on the grounds that these are more likely to reflect the state of the art than are full-length books. We have also referred to other similar reviews published in the last 10 years or so, where we judged it relevant. We have usually begun our review with articles printed in or around 1988, the date of the last review, aware that this is now 13 years ago, but also conscious of the need to cover the period since the last major review in this journal. However, we have also, where we felt it appropriate, included articles published somewhat earlier.

This review is divided into two parts, each of roughly equal length. The bibliography for works referred to in each part is published with the relevant part, rather than in a complete bibliography at the end. Therefore, readers wishing to have a complete bibliography will have to put both parts together.

The rationale for the organisation of this review is that we wished to start with a relatively new concern in language testing, at least as far as publication of empirical research is concerned, before moving on to more traditional ongoing concerns and ending with aspects of testing not often addressed in international reviews, and remaining problems. Thus, we begin with an account of research into washback, which then leads us to ethics, politics and standards. We then examine trends in testing on a national level, followed by testing for specific purposes. Next, we survey developments in computer-based testing before moving on to look at self-assessment and alternative assessment. Finally in this first part, we survey a relatively new area: the assessment of young learners.

In the second part, we address new concerns in test validity theory, which argues for the inclusion of test consequences in what is now generally referred to as a unified theory of construct validity. Thereafter we deal with issues in test validation and test development, and examine in some detail more traditional research into the nature of the constructs (reading, listening, grammatical abilities, etc.) that underlie tests. Finally we discuss a number of remaining controversies and puzzles that we call, following McNamara (1995), 'Pandora's Boxes'.

We are very grateful to many colleagues for their assistance in helping us draw up this review, but in particular we would like to acknowledge the help, advice and support of the Lancaster Language Testing Research Group, above all of Dianne Wall and Caroline Clapham, for their invaluable and insightful comments. All faults that remain are entirely our responsibility.
Washback
In other words, test washback is far from being simply a technical matter of design and format, and needs to be understood within a much broader framework. Wall suggests that such a framework might usefully come from studies and theories of educational change and innovation, and she summarises the most important findings from these areas. She develops a framework derived from Henrichsen (1989), and owing something to the work of Hughes (1993) and Bailey (1996), and shows how such a framework might be applied to understanding better the causes and nature of washback.

She makes a number of recommendations about the steps that test developers might take in the future in order to assess the amount of risk involved in attempting to bring about change through testing. These include assessing the feasibility of examination reform by studying the 'antecedent' conditions, in what is increasingly referred to as a 'baseline study' (Weir & Roberts, 1994; Fekete et al., 1999); involving teachers at all stages of test development; ensuring the participation of other key stakeholders, including policy-makers and key institutions; ensuring clarity and acceptability of test specifications, and clear exemplification of tests, tasks, and scoring criteria; full piloting of tests before implementation; regular monitoring and evaluation not only of test performance but also of classrooms; and an understanding that change takes time. Innovating through tests is not a quick fix if it is to be beneficial: 'Policy makers and test designers should not expect significant impact to occur immediately or in the form they intend. They should be aware that tests on their own will not have positive impact if the materials and practices they are based on have not been effective. They may, however, have negative impact and the situation must be monitored continuously to allow early intervention if it takes an undesirable turn' (2000: 507).

Similar considerations of the potential complexity of the impact of tests on teaching and learning should also inform research into the washback of existing tests. Clearly this is a rich field for further investigation. More sophisticated conceptual frameworks, which are slowly developing in the light of research findings and related studies into innovation, motivation theory and teacher thinking, are likely to provide better understanding of the reasons for washback and an explanation of how tests might be developed to contribute to the engineering of desirable change.
Various stakeholder groups can democratise the testing process, promote fairness and therefore enhance an ethical approach. A number of case studies have been presented recently which illustrate the use and misuse of language tests. Hawthorne (1997) describes two examples of the misuse of language tests: the use of the access test to regulate the flow of migrants into Australia, and the step test, allegedly designed to play a central role in determining asylum seekers' residential status. Unpublished language testing lore has many other examples, such as the misuse of the General Training component of the International English Language Testing System (IELTS) test with applicants for immigration to New Zealand, and the use of the TOEFL test and other proficiency tests to measure achievement and growth in instructional programmes (Alderson, 2001a). It is to be hoped that the new concern for ethical conduct will result in more accounts of such misuse.

Norton and Starfield (1997) claim, on the basis of a case study in South Africa, that unethical conduct is evident when second language students' academic writing is implicitly evaluated on linguistic grounds whilst ostensibly being assessed for the examinees' understanding of an academic subject. They argue that criteria for assessment should be made explicit and public if testers are to behave ethically.

Elder (1997) investigates test bias, arguing that statistical procedures used to detect bias, such as DIF (Differential Item Functioning), are not neutral, since they do not question whether the criterion used to make group comparisons is fair and value-free. However, in her own study she concludes that what may appear to be bias may actually be construct-relevant variance, in that it indicates real differences in the ability being measured. One similar study was Chen and Henning (1985), who compared international students' performance on the UCLA (University of California, Los Angeles) English as a Second Language Placement Test, and discovered that a number of items were biased in favour of Spanish-speaking students and against Chinese-speaking students. The authors argue, however, that this 'bias' is relevant to the construct, since Spanish is typologically much closer to English, and speakers of Spanish would therefore be expected to find many aspects of English much easier to learn than speakers of Chinese would.

Reflecting this concern for ethical test use, Cumming (1995) reviews the use in four Canadian settings of assessment instruments to monitor learners' achievements or the effectiveness of programmes, and concludes that this is a misuse of such instruments, which should be used mainly for placing students onto programmes. Cumming (1994) asks whether use of language assessment instruments for immigrants to Canada facilitates their successful participation in Canadian society. He argues that
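Since DIF analysis may be unfamiliar to some readers, the sketch below illustrates, in very schematic form, the kind of procedure Elder is questioning. It is purely illustrative: none of the studies cited here report their computations in this form, and the data are invented. The Mantel-Haenszel statistic commonly used to flag DIF compares the odds of answering an item correctly for a 'reference' and a 'focal' group (for example, two first-language groups) after matching test takers on total score; a common odds ratio far from 1 flags the item for inspection, although, as Elder notes, the flag itself cannot say whether the difference reflects bias or construct-relevant variance.

```python
# Illustrative only: a minimal Mantel-Haenszel DIF check on invented data.
from collections import defaultdict

def mantel_haenszel_dif(total_scores, groups, item_correct):
    """Common odds ratio across score strata; values far from 1 flag possible DIF."""
    strata = defaultdict(list)
    for s, g, y in zip(total_scores, groups, item_correct):
        strata[s].append((g, y))
    num = den = 0.0
    for members in strata.values():
        a = sum(1 for g, y in members if g == "ref" and y == 1)    # reference group, correct
        b = sum(1 for g, y in members if g == "ref" and y == 0)    # reference group, incorrect
        c = sum(1 for g, y in members if g == "focal" and y == 1)  # focal group, correct
        d = sum(1 for g, y in members if g == "focal" and y == 0)  # focal group, incorrect
        n = a + b + c + d
        if n:
            num += a * d / n
            den += b * c / n
    return num / den if den else float("nan")

# Invented toy data: eight candidates matched on total score (1 or 2), two L1 groups.
total_scores = [1, 1, 1, 1, 2, 2, 2, 2]
groups       = ["ref", "ref", "focal", "focal", "ref", "ref", "focal", "focal"]
item_correct = [1, 0, 0, 1, 1, 1, 1, 0]
print(mantel_haenszel_dif(total_scores, groups, item_correct))  # 3.0: the item appears to favour 'ref'
```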
National tests
The development of national language tests continues to be the focus of many publications, although many are either simply descriptions of test development or discussions of controversies, rather than reports on research done in connection with test development. In the UK context, Neil (1989) discusses what should be included in an assessment system for foreign languages in the UK secondary system but reports no research. Roy (1988) claims that writing tasks for modern languages should be more relevant, task-based and authentic, yet criticises an emphasis on letter writing, and argues for other forms of writing, like paragraph writing. Again, no research is reported.

Page (1993) discusses the value and validity of having test questions and rubrics in the target language and asserts that the authenticity of such tasks is in doubt. He argues that the use of the target language in questions makes it more difficult to sample the syllabus adequately, and claims that the more communicative and authentic the tasks in examinations become, the more English (the mother tongue) has to be used on the examination paper in order to safeguard both the validity and the authenticity of the task. No empirical research into this issue is reported.

Richards and Chambers (1996) and Chambers and Richards (1992) examine the reliability and validity of teacher assessments in oral production tasks in the school-leaving GCSE (General Certificate of Secondary Education).
He also suggests that the problem might not be with the LSP tests or with their specification of the target language use domain but with the assessment criteria applied. He argues (Douglas, 2001b) that just as we analyse the target language use situation in order to develop the test content and methods, we should exploit that source when we develop the assessment criteria. This might help us to avoid expecting a perfection of the test taker that is not manifested in authentic performances in the target language use situation. But perhaps the real challenge to the field is in identifying when it is absolutely necessary to know how well someone can communicate in a specific context or if the information being sought is equally obtainable through a general-purpose language test. The answer to this challenge might not be as easily reached as is sometimes presumed.
Computer-based testing
Computer-based testing has witnessed rapid growth in the past decade and computers are now used to deliver language tests in many settings. A computer-based version of the TOEFL was introduced on a regional basis in the summer of 1998, tests are now available on CD-ROM, and the Internet is increasingly used to deliver tests to users. Alderson (1996) points out that computers have much to offer language testing: not just for test delivery, but also for test construction, test compilation, response capture, test scoring, result calculation and delivery, and test analysis. They can also, of course, be used for storing tests and details of candidates. In short, computers can be used at all stages in the test development and administration process. Most work reported in the literature, however, concerns the compilation, delivery and scoring of tests by computer. Fulcher (1999b) describes the delivery of an English language placement test over the Web and Gervais (1997) reports the mixed results of transferring a diagnostic paper-and-pencil test to the computer. Such articles set the scene for studies of computer-based testing which compare the accuracy of the computer-based test with a traditional paper-and-pencil test, addressing the advantages of a computer-delivered test in terms of accessibility and speed of results, and possible disadvantages in terms of bias against those with no computer familiarity, or with negative attitudes to computers.

This concern with bias is a recurrent theme in the literature, and it inspired a large-scale study by the Educational Testing Service (ETS), the developers of the computer-based version of the TOEFL, who needed to show that such a test would not be biased against those with no computer literacy. Jamieson et al. (1998) describe the development of a computer-based tutorial intended to train examinees to take the computerised TOEFL. Taylor et al. (1999) examine the relationship between computer familiarity and TOEFL scores, showing that those with high computer familiarity tend to score higher on the traditional TOEFL. They compare examinees with high and low computer familiarity in terms of their performance on the computer tutorial and on computerised TOEFL-like tasks. They claim that no relationship was found between computer familiarity and performance on the computerised tasks after controlling for English language proficiency. They conclude that there is no evidence of bias against candidates with low computer familiarity, but also
Alternative assessment
Self-assessment is one example of what is increasingly called 'alternative assessment'. 'Alternative assessment' is usually taken to mean assessment procedures which are less formal than traditional testing, which are gathered over a period of time rather than being taken at one point in time, which are usually formative rather than summative in function, are often low-stakes in terms of consequences, and are claimed to have beneficial washback effects. Although such procedures may be time-consuming and not very easy to administer and score, their claimed advantages are that they provide easily understood information, they are more integrative than traditional tests and they are more easily integrated into the classroom. McNamara (1998) makes the point that alternative assessment procedures are often developed in an attempt to make testing and assessment more responsive and accountable to individual learners, to promote learning and to enhance access and equity in education (1998: 310). Hamayan (1995) presents a detailed rationale for alternative assessment, describes different types of such assessment, and discusses procedures for setting up alternative assessment. She also provides a very useful bibliography for further reference.

A recent special issue of Language Testing, guest-edited by McNamara (Vol. 18, 4, October 2001), reports on a symposium to discuss challenges to the current mainstream in language testing research, covering issues like assessment as social practice, democratic assessment, the use of outcomes-based assessment and processes of classroom assessment. Such discussions of alternative perspectives are closely linked to so-called critical perspectives (what Shohamy calls critical language testing).

The alternative assessment movement, if it may be termed such, probably began in writing assessment, where the limitations of a one-off impromptu single writing task are apparent. Students are usually given only one, or at most two, tasks, yet generalisations about writing ability across a range of genres are often made. Moreover, it is evidently the case that most writing, certainly for academic purposes but also in business settings, takes place over time, involves much planning, editing, revising and redrafting, and usually involves the integration of input from a variety of (usually written) sources. This is in clear contrast with the traditional essay, which usually has a short prompt, gives students minimal input, minimal time for planning and virtually no opportunity to redraft or revise what they have produced under often stressful, time-bound circumstances. In such situations, the advocacy of portfolios of pieces of writing became commonplace, and a whole portfolio assessment movement has developed, especially in the USA for first language writing (Hamp-Lyons & Condon, 1993, 1999) but also increasingly
References
ALDERSON, J. C. (1986a). Computers in language testing. In G. N. Leech & C. N. Candlin (Eds.), Computers in English language education and research (pp. 99-111). London: Longman.
ALDERSON, J. C. & HAMP-LYONS, L. (1996). TOEFL preparation courses: a study of washback. Language Testing, 13(3), 280-97.
ALDERSON, J. C., NAGY, E. & ÖVEGES, E. (Eds.) (2000a). English language education in Hungary, Part II: Examining Hungarian learners' achievements in English. Budapest: The British Council.
ALDERSON, J. C., PERCSICH, R. & SZABO, G. (2000b). Sequencing as an item type. Language Testing, 17(4), 423-47.
ALDERSON, J. C. & WALL, D. (1993). Does washback exist? Applied Linguistics, 14(2), 115-29.
ALTE (1998). The ALTE handbook of European examinations and examination systems. Cambridge: UCLES.
AUSTRALIAN EDUCATION COUNCIL (1994). ESL Scales. Melbourne: Curriculum Corporation.
BACHMAN, L. F. & PALMER, A. S. (1989). The construct validation of self-ratings of communicative language ability. Language Testing, 6(1), 14-29.
BACHMAN, L. F. & PALMER, A. S. (1996). Language testing in practice. Oxford: Oxford University Press.
BAILEY, K. (1996). Working for washback: A review of the washback concept in language testing. Language Testing, 13(3), 257-79.
BAKER, C. (1988). Normative testing and bilingual populations. Journal of Multilingual and Multicultural Development, 9(5), 399-409.
BANERJEE, J., CLAPHAM, C., CLAPHAM, P. & WALL, D. (Eds.) (1999). ILTA language testing bibliography 1990-1999, First edition. Lancaster, UK: Language Testing Update.
BARBOT, M.-J. (1991). New approaches to evaluation in self-access learning (trans. from French). Etudes de Linguistique Appliquée, 79, 77-94.
[...] the teaching and examining of MFLs at GCSE. Language Learning Journal, 19, 19-27.
BARNES, A. & POMFRETT, G. (1998). Assessment in German at KS3: how can it be consistent, fair and appropriate? Deutsch: Lehren und Lernen, 17, 2-6.
BARRS, M., ELLIS, S., HESTER, H. & THOMAS, A. (1988). The Primary Language Record: A handbook for teachers. London: Centre for Language in Primary Education.
BENNETT, R. E. (1998). Reinventing assessment: speculations on the future of large-scale educational testing. Princeton, New Jersey: Educational Testing Service.
BÜGEL, K. & LEIJN, M. (1999). New exams in secondary education, new question types. An investigation into the reliability of the evaluation of open-ended questions in foreign-language exams. Levende Talen, 537, 173-81.
BLANCHE, P. (1990). Using standardised achievement and oral proficiency tests for self-assessment purposes: the DLIFLC study. Language Testing, 7(2), 202-29.
BLANCHE, P. & MERINO, B. J. (1989). Self-assessment of foreign language skills: implications for teachers and researchers. Language Learning, 39(3), 313-40.
BLONDIN, C., CANDELIER, M., EDELENBOS, P., JOHNSTONE, R., [...] languages in primary and preschool education: context and outcomes. A review of recent research within the European Union. London: CILT.
BLUE, G. M. (1988). Self assessment: the limits of learner independence. ELT Documents, 131, 100-18.
BOLOGNA DECLARATION (1999). Joint declaration of the European Ministers of Education convened in Bologna on the 19th of June 1999. http://europa.eu.int/comm/education/socrates/erasmus/bologna.pdf
BOYLE, J., GILLHAM, B. & SMITH, N. (1996). Screening for early language delay in the 18-36 month age-range: the predictive validity of tests of production and implications for practice. Child Language Teaching and Therapy, 12(2), 113-27.
BREEN, M. P., BARRATT-PUGH, C., DEREWIANKA, B., HOUSE, H., [...] ESL children: how teachers interpret and use national and state assessment frameworks (Vol. 1). Commonwealth of Australia: Department of Employment, Education, Training and Youth Affairs.
BRINDLEY, G. (1995). Assessment and reporting in language learning programs: Purposes, problems and pitfalls. Plenary presentation at the International Conference on Testing and Evaluation in Second Language Education, Hong Kong University of Science and Technology, 21-24 June 1995.
BRINDLEY, G. (1998). Outcomes-based assessment and reporting in language learning programmes: a review of the issues. Language Testing, 15(1), 45-85.
BRINDLEY, G. (2001). Outcomes-based assessment in practice: some examples and emerging insights. Language Testing, 18(4), 393-407.
BROWN, A. (1995). The effect of rater variables in the development of an occupation-specific language performance test. Language Testing, 12(1), 1-15.
BROWN, A. & IWASHITA, N. (1996). Language background and item difficulty: the development of a computer-adaptive test of Japanese. System, 24(2), 199-206.
BROWN, A. & LUMLEY, T. (1997). Interviewer variability in specific-purpose language performance tests. In A. Huhta, V. Kohonen, L. Kurki-Suonio & S. Luoma (Eds.), Current developments and alternatives in language assessment (137-50). Jyvaskyla: Centre for Applied Language Studies, University of Jyvaskyla.
BROWN, J. D. (1997). Computers in language testing: present research and some future directions. Language Learning and Technology, 3(1), 44-59.
BROWN, J. D. & HUDSON, T. (1998). The alternatives in language assessment. TESOL Quarterly, 32(4), 653-75.
[...] test-takers' choice: an investigation of the effect of topic on oral language-test performance. Language Testing, 16(4), 426-56.
HALLECK, G. B. & MODER, C. L. (1995). Testing language and teaching skills of international teaching assistants: the limits of compensatory strategies. TESOL Quarterly, 29(4), 733-57.
HAMAYAN, E. (1995). Approaches to alternative assessment. Annual Review of Applied Linguistics, 15, 212-26.
HAMILTON, J., LOPES, M., MCNAMARA, T. & SHERIDAN, E. (1993). Rating scales and native speaker performance on a communicatively oriented EAP test. Language Testing, 10(3), 337-53.
HAMP-LYONS, L. (1996). Applying ethical standards to portfolio assessment of writing in English as a second language. In M. Milanovic & N. Saville (Eds.), Performance testing, cognition and assessment: Selected papers from the 15th Language Testing Research Colloquium (Studies in Language Testing Series, Vol. 3, 151-64). Cambridge: Cambridge University Press.
HAMP-LYONS, L. (1997). Washback, impact and validity: ethical concerns. Language Testing, 14(3), 295-303.
HAMP-LYONS, L. (1998). Ethics in language testing. In C. M. Clapham & D. Corson (Eds.), Language testing and assessment (Vol. 7). Dordrecht, The Netherlands: Kluwer Academic Publishing.
HAMP-LYONS, L. & CONDON, W. (1993). Questioning assumptions about portfolio-based assessment. College Composition and Communication, 44(2), 176-90.
HAMP-LYONS, L. & CONDON, W. (1999). Assessing college writing portfolios: principles for practice, theory, research. Cresskill, NJ: Hampton Press.
HARGAN, N. (1994). Learner autonomy by remote control. System, 22(4), 455-62.
HASSELGREN, A. (1998). Small words and good testing. Unpublished PhD dissertation, University of Bergen, Bergen.
HASSELGREN, A. (2000). The assessment of the English ability of young learners in Norwegian schools: an innovative approach. Language Testing, 17(2), 261-77.
HAWTHORNE, L. (1997). The political dimension of language testing in Australia. Language Testing, 14(3), 248-60.
HEILENMAN, L. K. (1990). Self-assessment of second language ability: the role of response effects. Language Testing, 7(2), 174-201.
HENRICHSEN, L. E. (1989). Diffusion of innovations in English language teaching: The ELEC effort in Japan, 1956-1968. New York: Greenwood Press.
HOLM, A., DODD, B., STOW, C. & PERT, S. (1999). Identification and differential diagnosis of phonological disorder in bilingual children. Language Testing, 16(3), 271-92.
HOWARD, S., HARTLEY, J. & MUELLER, D. (1995). The changing face of child language assessment: 1985-1995. Child Language Teaching and Therapy, 11(1), 7-22.
[...] oral activities: effects on motivation and proficiency. Foreign Language Annals, 22(3), 241-52.
JENSEN, C. & HANSEN, C. (1995). The effect of prior knowledge on EAP listening-test performance. Language Testing, 12(1), 99-119.
JOHNSTONE, R. (2000). Context-sensitive assessment of modern languages in primary (elementary) and early secondary education: Scotland and the European experience. Language Testing, 17(2), 123-43.
KALTER, A. O. & VOSSEN, P. W. J. E. (1990). EUROCERT: an international standard for certification of language proficiency. AILA Review, 7, 91-106.
KHANIYA, T. R. (1990a). Examinations as instruments for educational change: Investigating the washback effect of the Nepalese English exams. Unpublished PhD dissertation, University of Edinburgh, Edinburgh.
KHANIYA, T. R. (1990b). The washback effect of a textbook-based test. Edinburgh Working Papers in Applied Linguistics, 1, 48-58.
KIEWEG, W. (1992). Leistungsmessung im Fach Englisch: Praktische Vorschläge zur Konzeption von Lernzielkontrollen. Fremdsprachenunterricht, 45(6), 321-32.
KIEWEG, W. (1999). Allgemeine Gütekriterien für Lernzielkontrollen (Common standards for the control of learning). Der Fremdsprachliche Unterricht Englisch, 37(1), 4-11.
LAURIER, M. (1998). Méthodologie d'évaluation dans des contextes d'apprentissage des langages assistés par des environnements informatiques multimédias. Etudes de Linguistique Appliquée, 110, 247-55.
LAW, B. & ECKES, M. (1995). Assessment and ESL. Winnipeg, Canada: Peguis.
LEE, B. (1989). Classroom-based assessment - why and how? British Journal of Language Teaching, 27(2), 73-6.
LEUNG, C. & TEASDALE, A. (1996). English as an additional language within the National Curriculum: A study of assessment practices. Prospect, 12(2), 58-68.
LEUNG, C. & TEASDALE, A. (1997). What do teachers mean by speaking and listening: a contextualised study of assessment in the English National Curriculum. In A. Huhta, V. Kohonen, L. Kurki-Suonio & S. Luoma (Eds.), New contexts, goals and alternatives in language assessment (291-324). Jyvaskyla: University of Jyvaskyla.
LEWKOWICZ, J. A. (1997). Investigating authenticity in language testing. Unpublished PhD dissertation, Lancaster University, Lancaster.
LEWKOWICZ, J. A. (2000). Authenticity in language testing: some outstanding questions. Language Testing, 17(1), 43-64.
LEWKOWICZ, J. A. & MOON, J. (1985). Evaluation, a way of involving the learner. In J. C. Alderson (Ed.), Lancaster Practical [...]
SCOTT, M. L., STANSFIELD, C. W. & KENYON, D. M. (1996). Examining validity in a performance test: the listening summary translation exam (LSTE). Language Testing, 13, 83-109.
SHAMEEM, N. (1998). Validating self-reported language proficiency by testing performance in an immigrant community: the Wellington Indo-Fijians. Language Testing, 15(1), 86-108.
SHOHAMY, E. (1993). The power of tests: The impact of language tests on teaching and learning. NFLC Occasional Papers. Washington, D.C.: The National Foreign Language Center.
SHOHAMY, E. (1997a). Testing methods, testing consequences: are they ethical? Language Testing, 14(3), 340-9.
SHOHAMY, E. (1997b). Critical language testing and beyond. Plenary paper presented at the American Association for Applied Linguistics, Orlando, Florida, 8-11 March.
SHOHAMY, E. (2001a). The power of tests. London: Longman.
SHOHAMY, E. (2001b). Democratic assessment as an alternative. Language Testing, 18(4), 373-92.
SHOHAMY, E., DONITSA-SCHMIDT, S. & FERMAN, I. (1996). Test impact revisited: washback effect over time. Language Testing, 13(3), 298-317.
SHORT, D. (1993). Assessing integrated language and content instruction. TESOL Quarterly, 27(4), 627-56.
SKEHAN, P. (1988). State of the art: language testing, part I. Language Teaching, 21, 211-21.
SKEHAN, P. (1989). State of the art: language testing, part II. Language Teaching, 22, 1-13.
SPOLSKY, B. (1997). The ethics of gatekeeping tests: what have we learned in a hundred years? Language Testing, 14(3), 242-7.
STANSFIELD, C. W. (1981). The assessment of language proficiency in bilingual children: An analysis of theories and instrumentation. In R. V. Padilla (Ed.), Bilingual education and technology.
STANSFIELD, C. W., SCOTT, M. L. & KENYON, D. M. (1990). Listening summary translation exam (LSTE) Spanish (Final Project Report. ERIC Document Reproduction Service, ED 323 786). Washington DC: Centre for Applied Linguistics.
STANSFIELD, C. W., WU, W. M. & LIU, C. C. (1997). Listening Summary Translation Exam (LSTE) in Taiwanese, aka Minnan (Final Project Report. ERIC Document Reproduction Service, ED 413 788). N. Bethesda, MD: Second Language Testing, Inc.
STANSFIELD, C. W., WU, W. M. & VAN DER HEIDE, M. (2000). A job-relevant listening summary translation exam in Minnan. In A. J. Kunnan (Ed.), Fairness and validation in language assessment (Studies in Language Testing Series, Vol. 9, 177-200). Cambridge: University of Cambridge Local Examinations Syndicate and Cambridge University Press.
TARONE, E. (2001). Assessing language skills for specific purposes: describing and analysing the 'behaviour domain'. In C. Elder, A. Brown, E. Grove, K. Hill, N. Iwashita, T. Lumley, T. F. McNamara & K. O'Loughlin (Eds.), Experimenting with uncertainty: essays in honour of Alan Davies (Studies in Language Testing Series, Vol. 11, 53-60). Cambridge: University of Cambridge Local Examinations Syndicate and Cambridge University Press.
TAYLOR, C., KIRSCH, I., JAMIESON, J. & EIGNOR, D. (1999). Examining the relationship between computer familiarity and performance on computer-based language tasks. Language Learning, 49(2), 219-74.
TEASDALE, A. & LEUNG, C. (2000). Teacher assessment and psychometric theory: a case of paradigm crossing? Language Testing, 17(2), 163-84.
TESOL (1998). Managing the assessment process. A framework for measuring student attainment of the ESL standards. Alexandria, VA: TESOL.
TRUEBA, H. T. (1989). Raising silent voices: educating the linguistic minorities for the twenty-first century. New York: Newbury House.
VAN EK, J. A. (1997). The Threshold Level for modern language learning in schools. London: Longman.
VAN ELMPT, M. & LOONEN, P. (1998). Open questions: answers in the foreign language? Toegepaste Taalwetenschap in Artikelen, 58, 149-54.
VANDERGRIFT, L. & BELANGER, C. (1998). The National Core French Assessment Project: design and field test of formative evaluation instruments at the intermediate level. The Canadian Modern Language Review, 54(4), 553-78.
WALL, D. (1996). Introducing new tests into traditional systems: Insights from general education and from innovation theory. Language Testing, 13(3), 334-54.
WALL, D. (2000). The impact of high-stakes testing on teaching and learning: can this be predicted or controlled? System, 28, 499-509.
WALL, D. & ALDERSON, J. C. (1993). Examining washback: The Sri Lankan impact study. Language Testing, 10(1), 41-69.
WATANABE, Y. (1996). Does Grammar-Translation come from the Entrance Examination? Preliminary findings from classroom-based research. Language Testing, 13(3), 319-33.
WATANABE, Y. (2001). Does the university entrance examination motivate learners? A case study of learner interviews. In Akita Association of English Studies (Ed.), Trans-equator exchanges: A collection of academic papers in honour of Professor David Ingram, 100-10.
WEIR, C. J. & ROBERTS, J. (1994). Evaluation in ELT. Oxford: Blackwell Publishers.
WELLING-SLOOTMAEKERS, M. (1999). Language examinations in Dutch secondary schools from 2000 onwards. Levende Talen, 542, 488-90.
WILSON, J. (2001). Assessing young learners: what makes a good test? Paper presented at the Association of Language Testers in Europe (ALTE) Conference, Barcelona, 5-7 July 2001.
WINDSOR, J. (1999). Effect of semantic inconsistency on sentence grammaticality judgements for children with and without language-learning disabilities. Language Testing, 16(3), 293-313.
WU, W. M. & STANSFIELD, C. W. (2001). Towards authenticity of task in test development. Language Testing, 18(2), 187-206.
YOUNG, R., SHERMIS, M. D., BRUTTEN, S. R. & PERKINS, K. (1996). From conventional to computer-adaptive testing of ESL reading comprehension. System, 24(1), 23-40.
YULE, G. (1990). Predicting success for international teaching assistants in a US university. TESOL Quarterly, 24(2), 227-43.
ZANGL, R. (2000). Monitoring language skills in Austrian primary (elementary) schools: a case study. Language Testing, 17(2), 250-60.