Language Assessment
NIM : 204200043
Class : TBI B
Chapter 1
ASSESSMENT CONCEPTS AND ISSUES
Tests, a subset of assessment, are one genre of assessment techniques. They are prepared
administrative procedures that occur at identifiable times in a curriculum, when learners
muster all their faculties to offer peak performance, knowing that their responses are being
measured and evaluated. In scientific terms, a test is a method of measuring a person's
ability, knowledge, or performance in a given domain.
Evaluation does not necessarily entail testing; rather, evaluation is involved when the results
of a test (or other assessment procedure) are used to make decisions (Bachman, 1990, pp. 22-
23). Evaluation involves the interpretation of information. Simply recording numbers or
making check marks on a chart does not constitute evaluation.
Assessment and Learning. Although tests can be useful devices, they are only one among many
procedures and tasks that teachers can ultimately use to assess (and measure) students. For
optimal learning to take place, students in the classroom must have the freedom to experiment,
to try out their own hypotheses about language without feeling that their overall competence
is being judged in terms of those trials and errors. In the same way, tournament tennis players
must, before a tournament, have the freedom to practice their skills with no implications for
their final placement on that day of days.
Informal assessment can take a number of forms, starting with incidental, unplanned
comments and responses, along with coaching and other impromptu feedback to the student.
Examples include putting a smiley face on homework or saying “Nice job!” or “Good work!”
Informal assessment is virtually always nonjudgmental, in that you as a teacher are not making
ultimate decisions about the student’s performance.
Formal assessments are exercises or procedures specifically designed to tap into a storehouse
of skills and knowledge. They are systematic, planned sampling techniques constructed to give
teacher and student an appraisal of student achievement. To extend the tennis analogy, formal
assessments are the tournament games that occur periodically in the course of a regimen of
practice.
Formative assessment evaluates students in the process of “forming” their competencies
and skills, with the goal of helping them to continue that growth process. The key to such
formation is the delivery (by the teacher) and internalization (by the student) of appropriate
feedback on performance, with an eye toward the future continuation (or formation) of
learning.
Summative assessment aims to measure, or summarize, what a student has grasped and
typically occurs at the end of a course or unit of instruction. A summation of what a student
has learned implies looking back and taking stock of how well that student has accomplished
objectives, but it does not necessarily point to future progress. Final exams in a course and
general proficiency exams are examples of summative assessment. Summative assessment
often, but not always, involves evaluation (decision making).
Criterion-referenced tests are designed to give test-takers feedback, usually in the form of
grades, on specific course or lesson objectives. Classroom tests involving students in only
one course and connected to a particular curriculum are typical of criterion-referenced
testing.
A diagnostic test is designed to identify aspects of a language that a student needs to develop
or that a course should include. A test of pronunciation, for example, might diagnose the
phonological features of English that are difficult for learners and should therefore become
part of a curriculum.
Placement tests serve to place a student into a particular level or section of a language
curriculum or school. A placement test usually, but not always, includes a sampling of the
material to be covered in the various courses in a curriculum; a student’s performance on the
test should indicate the point at which the student will find material neither too easy nor too
difficult but appropriately challenging.
A proficiency test is not limited to any one course, curriculum, or single skill in the language;
rather, it tests overall ability. Proficiency tests have traditionally consisted of standardized
multiple-choice items on grammar, vocabulary, reading comprehension, and aural
comprehension. Many commercially produced proficiency tests (the TOEFL, for example)
include a sample of writing as well as oral production performance.
An aptitude test is designed to measure capacity or general ability to learn a foreign language a
priori (before taking a course) and ultimate predicted success in that undertaking. Language
aptitude tests were ostensibly designed to apply to the classroom learning of any language.
Traditional and “Alternative” Assessment. Research and practice during the 1990s provided
compelling arguments against the notion that all people and all skills could be measured by
traditional tests. The result was the emergence of what came to be labeled alternative
assessment.
Tests of pragmatics have primarily been informed by research in interlanguage and cross-
cultural pragmatics (Bardovi-Harlig & Hartford, 2016; Blum-Kulka, House, & Kasper, 1989;
Kasper & Rose, 2002; Stadler, 2013). Much of pragmatics research has focused on speech acts
(e.g., requests, apologies, refusals, compliments, advice, complaints, agreements, and
disagreements).
Chapter 2
PRINCIPLES OF LANGUAGE ASSESSMENT
A reliable test is consistent and dependable. If you give the same test to the same student or
matched students on two different occasions, the test should yield similar results.
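As a rough illustration of this consistency, the minimal Python sketch below estimates test-retest reliability as the Pearson correlation between two sittings of the same test. The correlation statistic and the student scores are assumptions for illustration only; the chapter itself does not prescribe a formula.

```python
# Test-retest reliability sketch: ten students take the same test twice,
# and reliability is estimated as the Pearson correlation between the two
# sets of scores. All scores here are invented for illustration.

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

first_sitting = [72, 85, 90, 64, 78, 88, 55, 69, 81, 93]
second_sitting = [75, 83, 92, 61, 80, 85, 58, 72, 79, 95]

# A coefficient close to 1.0 suggests the test yields similar results.
print(f"test-retest reliability estimate: r = {pearson(first_sitting, second_sitting):.2f}")
```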
Rater Reliability. Human error, subjectivity, and bias may enter into the scoring process.
Inter-rater reliability is achieved when two or more scorers yield consistent scores on the same test.
Failure to achieve inter-rater reliability could stem from lack of adherence to scoring criteria,
inexperience, inattention, or even preconceived biases. Lumley (2002) provided some helpful
hints to ensure inter-rater reliability.
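One common way to quantify inter-rater agreement is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is illustrative only: the kappa statistic and the band scores from two hypothetical raters are assumptions, not procedures prescribed by Lumley (2002) or the chapter.

```python
# Inter-rater reliability sketch using Cohen's kappa: two raters assign
# each essay a band score from 1 to 5, and kappa measures their agreement
# beyond chance. The ratings are invented for illustration.

from collections import Counter

def cohens_kappa(rater_a, rater_b):
    n = len(rater_a)
    # Proportion of essays on which the two raters agree exactly.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's score distribution.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n ** 2
    return (observed - expected) / (1 - expected)

rater_1 = [4, 3, 5, 2, 4, 3, 5, 4]
rater_2 = [4, 3, 4, 2, 4, 2, 5, 4]

# Values near 1.0 indicate strong agreement; values near 0 indicate
# agreement no better than chance.
print(f"Cohen's kappa: {cohens_kappa(rater_1, rater_2):.2f}")
```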
Test Administration Reliability. Unreliability may also result from the conditions in which
the test is administered.
Validity. By far the most complex criterion of an effective test and arguably the most important
principle is validity, “the extent to which inferences made from assessment results are
appropriate, meaningful, and useful in terms of the purpose of the assessment” (Gronlund,
1998, p. 226).
Content-Related Evidence. If a test actually samples the subject matter about which
conclusions are to be drawn, and if it requires the test-taker to perform the behavior measured,
it can claim content-related evidence of validity, often popularly referred to as content-related
validity (e.g., Hughes, 2003; Mousavi, 2009).
Criterion-Related Evidence. A second form of evidence of the validity of a test may be found
in what is called criterion-related evidence, also referred to as criterion-related validity, or the
extent to which the “criterion” of the test has actually been reached.
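As one illustration of how such evidence might be gathered, the sketch below correlates hypothetical placement-test scores with a later criterion measure, here end-of-course grades. The use of a correlation coefficient and all of the data are assumptions for illustration; the chapter does not specify this procedure.

```python
# Criterion-related (predictive) evidence sketch: if the placement test
# reaches its criterion, scores should correlate strongly with the later
# criterion measure. Data are invented for illustration.

from statistics import correlation  # Pearson r; Python 3.10+

placement_scores = [52, 67, 74, 81, 60, 88, 45, 70]
course_grades = [58, 70, 72, 85, 63, 90, 50, 68]

print(f"predictive validity estimate: r = {correlation(placement_scores, course_grades):.2f}")
```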
Construct-Related Evidence. A third kind of evidence that can support validity, but one that
does not play as large a role for classroom teachers, is construct-related validity, commonly
referred to as construct validity.
Face validity refers to “the degree to which a test looks right, and appears to measure the
knowledge or abilities it claims to measure, based on the subjective judgment of the examinees
who take it, the administrative personnel who decide on its use, and other psychometrically
unsophisticated observers” (Mousavi, 2009, p. 247).
Authenticity. A fourth major principle of language testing is authenticity, a concept that is
difficult to define, especially within the art and science of evaluating and designing tests.
Bachman and Palmer (1996) defined authenticity as “the degree of correspondence of the
characteristics of a given language test task to the features of a target language task” (p. 23)
and then suggested an agenda for identifying those target language tasks and for transforming
them into valid test items.
Washback. A facet of consequential validity is “the effect of testing on teaching and learning”
(Hughes, 2003, p. 1), otherwise known in the language assessment field as washback. Messick
(1996, p. 241) reminded us that the washback effect may refer to both the promotion and the
inhibition of learning, thus emphasizing what may be referred to as beneficial versus harmful
(or negative) washback. Alderson and Wall (1993) considered washback an important enough
concept to define a washback hypothesis that essentially elaborated on how tests influence both
teaching and learning. Cheng, Watanabe, and Curtis (2004) devoted an entire anthology to the
issue of washback, and Spratt (2005) challenged teachers to become agents of beneficial
washback in their language classrooms. (See Cheng, 2014, for a more recent discussion of this
topic.)