Assessment for Language Teaching

Aek Phakiti
University of Sydney

Constant Leung
King’s College London

Cambridge University Press & Assessment
Shaftesbury Road, Cambridge CB2 8EA, United Kingdom
One Liberty Plaza, 20th Floor, New York, NY 10006, USA
477 Williamstown Road, Port Melbourne, VIC 3207, Australia
314–321, 3rd Floor, Plot 3, Splendor Forum, Jasola District Centre, New Delhi – 110025, India
103 Penang Road, #05–06/07, Visioncrest Commercial, Singapore 238467

www.cambridge.org
Information on this title: www.cambridge.org/9781009468152
DOI: 10.1017/9781108934091

© Aek Phakiti and Constant Leung 2024

This publication is in copyright. Subject to statutory exception and to the provisions of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press & Assessment.

When citing this work, please include a reference to the DOI 10.1017/9781108934091.

First published 2024
First published online: April 2024

A catalogue record for this publication is available from the British Library.

ISBN 978-1-009-46815-2 Hardback

Author for correspondence: Aek Phakiti, [email protected]
Contents

1 Introduction
2 Assessment
8 Further Developments
References
1 Introduction
This Element focusses on the various forms and practices of language assessment frequently found in educational settings. We will use ‘assessment’ as a superordinate term and ‘testing’ when referring to a specific assessment instrument. Assessment is an everyday classroom activity.
Various assessment activities can be placed along a continuum from implicit
and informal to explicit and formal. For example, when teachers ask, ‘Do
you have any questions?’ as part of a teaching activity, they are engaging in informal formative assessment, which focusses on assessing student learning processes. When teachers evaluate students’
submitted work using a scoring rubric and comment on strong and weak
points about their work, they are said to engage in formative assessment
(FA) if they provide feedback designed to improve learning and summative
assessment (SA) if they give a mark, grade, or score without feedback
comments. Teachers can, of course, do both.
This Element aims to help teachers develop language assessment literacy
(LAL), which refers to the knowledge, skills, and competencies involved in
the principles, roles, and types of assessment; appropriate assessment tasks
and/or task designs; and ethical and fairness considerations (Fulcher, 2012;
Taylor, 2009). Teachers can use LAL to help them carry out effective,
appropriate, and fair assessments. This knowledge will also allow them to recognise inappropriate and unethical assessment practices whenever they occur.
2 Assessment
Assessment is a broad term that includes various approaches and methods
for collecting evidence of learning and performance in language teaching
contexts. In language teaching, assessment is used to gather evidence of
students’ language knowledge, ability, skills, and attainment levels for
decision-making purposes. Such decisions relate to certification, placement,
selection for summative purposes, feedback on learning, and syllabus/cur-
riculum development for formative purposes. Assessment subsumes testing
all students; as such, they are seen to be fair because all students are treated in the
same way, allowing comparisons among students’ performances. On the strength
of this reasoning, decisions on students’ achievement levels, represented by
marks or grades, can often be based solely on their test performances. While
standardised tests can be valuable and transparent for decision-making on stu-
dents, teachers know that the content of many tests does not require their students
to perform tasks related to their actual classroom activities or real-life situations.
We note that in anglophone applied linguistics and assessment of English as
a foreign/second language, there is a tendency to see the notion of ‘stakes’ (high
versus low) in terms of impact on individuals or on institutional functions. The concept of high stakes within this purview can vary in degree and consequences. For instance, the impact of passing or failing a high-stakes exit or
national test in English is first and foremost felt by the student. At the same time,
such a test serves an important institutional function for educational institutions
(e.g., student selection), employers (e.g., personnel selection of qualified pro-
fessionals), and society (e.g., fairness for all). Seen in this light, high or low
stakes relate to how assessment use affects individuals and society.
Reflection Box 1
Is language assessment necessary for language teaching and learning?
Why or why not?
Share your thoughts here: https://bit.ly/3tcbeZX.
3.1.1 Testing

Figure 1 Testing, assessment, and evaluation as nested circles: testing sits within assessment, and assessment within evaluation

3.1.2 Assessment

3.1.3 Evaluation
Figure 1 shows that testing and assessment implicate evaluation – a process of
judging the value or quality of information collected by tests, measures, or other
assessment tasks. Evaluation is a broader concept than assessment (depicted in
another larger circle). Evaluation requires language teachers to take a broader
view of assessment outcomes and to interpret them with reference to other
relevant issues. That is, it may require some logical inferencing and reasoning
processes. For instance, teachers ask what scores or observed performance mean for students’ learning status.
Lado’s (1961) book was a milestone in testing and language assessment because
it helped formalise language testing as a discipline within applied linguistics
(see Read, 2015; Spolsky, 1995). In the past forty years or so, the communica-
tive model of language has been prominent (see Section 2.2.1). Further, LTA
research in the English language has made invaluable contributions to many
aspects of language education (e.g., Fulcher & Harding, 2022; Winke &
Brunfaut, 2021). The following subsections discuss the related theoretical
concepts influencing our understanding of language proficiency.
Figure 2 A model of language proficiency: language knowledge (organisational and pragmatic knowledge) and strategic competence (metacognitive knowledge and strategies: goal setting, planning, appraising), set within social dimensions and non-linguistic factors, with language use, observed performance, and scores at the top of the model
In Figure 2, language use, observed performance, and scores at the top of the
model are seen as the results of interactions between a given person’s language
proficiency and the context (i.e., language tasks and settings). This multi-
componential view of general language proficiency has had a significant
methodological influence on how people’s language proficiency is tested or
assessed. This view underpins the design of test items requiring students to
produce their responses communicatively. This orientation is also reflected by
the scoring rubrics in language proficiency tests such as TOEFL and IELTS
that focus on assessing test-takers’ vocabulary range, pragmatic appropriacy,
task fulfilment, and grammatical accuracy.
• Reading skills refer to the ability to process and understand the meaning of
written texts (see Alderson, 2000; Brunfaut, 2022; Grabe, 2009). Reading
comprehension is derived from the reader’s construction and interpretation of
meaning from a text through language processing and activation of their prior
world knowledge.
• Listening skills refer to processing and understanding audio and spoken texts
and obtaining information (see Field, 2008; Rost, 2016). Unlike reading
a written text where the reader can move back and forth, spoken language is
temporally streamed, making retrieving what has been missed more difficult.
• Speaking skills refer to the ability to process and produce coherent speech
that is meaningful and appropriate for a given purpose in a given context (see
Fulcher, 2014; Luoma, 2004). Speaking can be one-way (e.g., reading aloud,
talks, announcements) or two-way (e.g., conversations, interviews).
• Writing skills refer to the ability to process and produce ideas or information using a given writing system (e.g., alphabets and letters, word order, punctuation, syntax) to convey meaning (see Hyland, 2016; Weigle, 2002).
Table 1 Reading and listening skills and the verbal processes involved (in alphabetical order)
Table 2 Speaking and writing skills and the verbal processes involved (in alphabetical order)
There are different layers to the social dimensions of language proficiency and
language assessment practice (e.g., McNamara & Roever, 2006; Young, 2022).
In terms of language proficiency, an individual’s underlying language profi-
ciency does not entirely manifest in observed language performance. People’s
performance in a test is also affected by the tasks, conditions and social
interactions involved (e.g., the choice of tasks decided by test developers, the
technology used, the type/s of language used in group discussion).
The social dimension in language assessment within an educational or
professional context includes the use of assessment by authorities as
a mechanism or system for controlling or regulating people’s language learning
or language use behaviours and choices. Mandatory or compulsory assessments
(e.g., final examinations, exit tests, entrance or admission tests) have a political-
cum-social function of controlling and regulating students (see Section 1.3).
In some countries, immigrants are required to take a test at a particular proficiency level, such as the Common European Framework of Reference for Languages (CEFR) A2, before being granted a visa or citizenship (see also Rocca et al., 2018; Shohamy, 2006).
From a teaching point of view, it is useful to know the various social functions
of assessment in the different classroom and educational contexts. Assessments
can be seen as tools to help teachers ascertain learner progress and the extent to
which teaching has been effective. In particular, teachers have the power given
by compulsory tests and assessments to decide whether students will pass or fail
(in a particular course or programme of study). However, formative classroom
assessment practice is not necessarily about power and control: FA can engen-
der a unique social dimension for establishing and maintaining a productive
teacher–student relationship. The fostering of such relationships should also
Until recently, the concept of language proficiency used in LTA was modelled
on what native speakers of a given language could do when they communicated
(see McNamara, 1996). Such modelling is based on a monoglossic ideology
underpinning a language proficiency view that idealises the notion of native-
speaker competence as the gold standard for learners of English. It follows that
language tests and assessments based on such conceptual assumptions have
tended to norm the TLU on that of native speakers.
However, an idealised language proficiency model requires revision on theoretical and methodological grounds. For instance, there are many native-
speaker varieties of English (e.g., American, British, Australian, Singaporean, and
Indian Englishes). Furthermore, a one-standard approach in multicultural, multi-
lingual, and globalised societies can be problematic in LTA. Especially in
a localised context, there can be different models and acceptable norms of language
proficiency. Many individuals in ethnolinguistically diverse communities routinely
communicate with one another through their multilingual repertoire without regard
1 The term ‘fluid/flexible multilingualism’ refers to the use of all linguistic resources for communication. For instance, speakers with a knowledge of English, Japanese, and Spanish may draw on their total linguistic repertoire to communicate with one another without necessarily observing language-specific grammatical conventions. The term ‘multilingual’ is used to refer to societies that have more than one language community, but some members of each language community may be monolingual. The term ‘plurilingualism’ is used by the Council of Europe (2020) to refer to the communicative repertoire of speakers with knowledge of several languages. A plurilingual speaker uses all of their linguistic knowledge and skills to enhance communication with others. Plurilingualism moves away from the ideal native-speaker proficiency as the ultimate attainment benchmark; instead, it focusses on speakers who can freely draw on their diverse and unique linguistic and cultural repertoire in their communication. In this Element, we use the term ‘multilingualism’ as it is more commonly used than plurilingualism at this time, but the term ‘plurilingualism’ is likely to figure in language education and assessment research in future years (see Leung, 2022a for a detailed discussion).
Reflection Box 2
What challenges do you often face when assessing students’ language learning?
Share your thoughts here: https://bit.ly/3ReJOKX.
Figure 3 A continuum of assessment, from informal formative to formal summative and large-scale:
• e.g., embedded assessment in exercises; guided questions; needs assessment; assessment of learners’ cognitive and affective processes; self- and peer-assessment; portfolio assessment
• e.g., summative assessment; formal periodic quizzes; achievement tests; midterm and final examinations
• e.g., admission tests; state or national examinations; commercial language proficiency tests; professional association certification assessments
In the professional literature, the term AfL is also used to refer to assessment
with a formative purpose. Generally, AfL is embedded within teaching and
learning. It can range from being informal or ‘on the run’ (e.g., spontaneous or
impromptu assessment of students’ current knowledge or understanding during
classroom activities, such as teachers’ use of questions and clarification
requests) to being formal (e.g., planned assessment activities, such as portfolios
and project-based assessment). Teachers engage in AfL to ensure that students
develop the required knowledge and skills.
As indicated earlier, Figure 3 shows that FA and SA are interconnected. We
will elaborate on this point further here. For example, teachers need to under-
stand the standards or learning outcomes assessed at the end of a learning period.
Standard 1
1. Students can obtain meaning from reading a short, simple text.
2. Students can use basic familiar vocabulary, sounds, and sentence structures
to aid their reading comprehension.
3. Students can independently use their classroom reading experiences to
complete similar reading tasks and activities.
After AoL data have been evaluated, students are awarded scores or grades that best describe their level of attainment against the prescribed standards. In AoL, students’ performance should be assessed against the standards (i.e., criteria or benchmarks) rather than being compared with their peers. This approach is called the criterion-referenced approach (CRA).
Alignment is an important aspect of AoL. Since all forms of assessment can
impact student learning quality in various ways, alignments among the target
learning outcomes, teaching and learning activities, and assessment are essential. Alignments can warrant assessment relevance and promote fairness in assessment. We illustrate an example of such alignments in Figure 4.

Figure 4 Alignments among learning outcomes, teaching and learning, and assessment of learning
The basics of alignments are that, first, teaching and learning in the classroom
should be relevant to the learning outcomes to ensure that students develop their
language repertoires in association with the learning outcomes. Second, assess-
ment should be aligned with the learning outcomes and the teaching and
learning activities. Figure 5 shows the alignments among learning outcomes and teaching and learning activities.

Figure 5 Example alignment of learning outcomes with teaching and learning activities

Learning outcomes for reading (LOR): Students are able to
1. use key words for understanding explanations when reading or listening to texts being read.
2. read previously seen explanatory texts to increase accuracy and fluency and improve appropriate pauses and intonation.
3. identify sequences using linking words such as first, second, next.
4. use visual supports such as diagrams and illustrations to interpret meaning in an explanation.
5. match a sentence or caption to a visual support of a phenomenon.

Teaching and learning activities (teacher direct activities): Teacher provides language lessons by
1. giving various examples of explanatory texts (3 texts of 100–150 words) – LOR 1
2. reading them aloud several times to help students match words and sounds, focusing on appropriate pauses and intonation – LOR 2
3. discussing how to focus on key information from explanatory texts – LORs 1 & 3
4. using visual supports to link to explanatory texts – LORs 4 & 5
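To make such alignment checks concrete, the following minimal sketch (our illustration only; the assessment tasks are hypothetical and not drawn from the curriculum documents cited) records which LOR each teaching activity and each assessment task addresses, and flags any outcome that is taught but never assessed, or assessed but never taught:

# A minimal alignment check. The LOR codes follow Figure 5; the
# assessment tasks below are hypothetical examples.

activities = {
    "giving examples of explanatory texts": ["LOR 1"],
    "reading texts aloud (pauses and intonation)": ["LOR 2"],
    "discussing key information in explanatory texts": ["LOR 1", "LOR 3"],
    "using visual supports linked to explanatory texts": ["LOR 4", "LOR 5"],
}

assessment_tasks = {
    "answer key-word questions on a familiar text": ["LOR 1"],
    "read a familiar text aloud": ["LOR 2"],
    "order the steps of an explanation using linking words": ["LOR 3"],
    "match captions and diagrams to an explanation": ["LOR 4", "LOR 5"],
}

outcomes = {f"LOR {i}" for i in range(1, 6)}

def covered(mapping):
    """Collect every outcome code addressed by a set of activities or tasks."""
    return {code for codes in mapping.values() for code in codes}

# Non-empty sets signal misalignments that threaten relevance and fairness.
print("Taught but never assessed:", covered(activities) - covered(assessment_tasks))
print("Assessed but never taught:", covered(assessment_tasks) - covered(activities))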
Table 3 Example of the ESL standards (emerging and developing phases) for EAL/D students in Years 3–6 (based on Australian Curriculum,
Assessment and Reporting Authority (ACARA), 2014, pp. 9, 16)
As illustrated in Figure 6, SDs are placed around the mean (average score).
A minus sign appears before them when they are below the mean, but no plus sign
is used when they are above the mean. Theoretically, this kind of score distribu-
tion suggests a ranking system. For example, students whose scores are below the mean score are considered below the average, whereas those above it are considered above the average; how far below or above is defined by the values of SDs.
The perfect normal distribution of scores is symmetrical (mirror image) on
the left and the right sides of the mean. When the normal distribution is
considered, scores are placed along a continuum of the lowest and highest
possible score. When there is a normal distribution, many students will likely
be placed around the average score (approximately 68 per cent within ±1SD).
To illustrate, if the mean score is 60 and the SD is 10, the scores between the
mean and 1SD can range from 60 to 70 (making up approximately 34 per cent of
all students), and the scores between the mean and –1SD can range from 50 to 60 (another 34 per cent).
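These proportions can be verified with Python’s statistics module; the following short sketch uses the example values above (mean = 60, SD = 10):

# Checking the normal-distribution proportions for mean = 60, SD = 10.
from statistics import NormalDist

dist = NormalDist(mu=60, sigma=10)

# Expected share of students between -1SD (50) and +1SD (70): about 68%.
print(f"Within ±1SD: {dist.cdf(70) - dist.cdf(50):.1%}")   # ~68.3%

# Expected share between the mean (60) and +1SD (70): about 34%.
print(f"Mean to +1SD: {dist.cdf(70) - dist.cdf(60):.1%}")  # ~34.1%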
to derive each student’s ATAR score, which is out of 100 (i.e., a percentile).
Therefore, if a student’s ATAR is 95, it means that this student performs better
than 95 per cent of the students (i.e., this student is in the top 5 per cent). Generally
speaking, a high ATAR is specified for a competitive field of study (e.g., medicine)
in a highly ranked university.
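The percentile-rank computation behind such a score is straightforward; in this minimal sketch, the cohort scores are invented purely for illustration:

# Percentile rank: the percentage of a cohort scoring below a given student.
def percentile_rank(score: float, cohort: list[float]) -> float:
    below = sum(1 for s in cohort if s < score)
    return 100 * below / len(cohort)

cohort = [48, 55, 60, 62, 67, 70, 74, 81, 88, 95]  # invented scores
print(percentile_rank(88, cohort))  # 80.0 -> better than 80% of this cohort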
In summary, NRA is used in SA when limited places, quotas, or positions are
available. It is also used in general or specific-purpose language proficiency
tests in which test-takers must be identified as beginners, intermediate, or
advanced, and so forth. In practice, even though scores may not be normally
distributed, scores are rank-ordered, and those with the highest scores are
selected.
correctly answer the questions related to the main ideas in a reading test, then
it can be said that those students have met the objective (i.e., passed). In
practice, the cut or minimum expected scores decide whether students have
met the criteria or standards (e.g., at least 70 per cent correct). In this example,
students’ scores are not ranked as in the NRA. Accordingly, CRA focusses on
matching or aligning students’ performance against the given standards or
criteria (see also Fulcher, 2010). In other words, the CRA decision is absolute
as it focusses on whether a student has achieved the stipulated benchmark, be
it in the form of a single answer or a minimum number of acceptable answers/
responses.
Of course, in many language courses, students are not necessarily given a pass or
fail grade. There may be different letter grades that are associated with a range of
percentage scores (e.g., 0−49 = F, 50−59 = D, 60−69 = C, 70−79 = B, 80−89 = A,
and 90−100 = A+). The decision to award particular grades to students would
depend on where their scores fall in the ranges. In this grading system, teachers award the grade whose band contains each student’s percentage score.
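The band lookup itself is mechanical, as this minimal sketch of the example grading scheme above shows:

# Criterion-referenced letter grading using the percentage bands above.
def letter_grade(percent: float) -> str:
    bands = [(90, "A+"), (80, "A"), (70, "B"), (60, "C"), (50, "D")]
    for cutoff, grade in bands:
        if percent >= cutoff:
            return grade
    return "F"

for score in (95, 83, 72, 64, 51, 38):
    print(score, letter_grade(score))  # 95 A+, 83 A, 72 B, 64 C, 51 D, 38 F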
Reflection Box 3
Should assessment for and of learning be norm-referenced? Why or why
not?
Share your thoughts here: https://bit.ly/4aaPyhx.
4 Eliciting Student Performance (Evidence of Learning/Knowing)
This section elaborates on fundamental considerations when teachers elicit or
collect evidence of student learning or performance (as specified in the learning
outcomes). Suppose teachers aim to know whether students can orally tell others
about themselves (e.g., their names, hobbies, or interests) after a lesson. In that
case, they can design a paired task where students exchange information about
themselves. This method seems appropriate and relevant. If, however, students
are asked to complete missing gaps in a written conversation dialogue, their
speaking is not appropriately assessed. In this situation, their performance also
depends on reading comprehension and writing skills.
Language assessment professionals have developed and fine-tuned the use of
test or assessment techniques for assessing language skills or components (see
Brown & Abeywickrama, 2019; Green, 2020; Coombe, 2018; Weir, 1990,
2005; Winke & Brunfaut, 2021). This section considers three (interrelated)
methods for eliciting students’ performance.
Choosing the most appropriate method to assess language skills (i.e., direct or
example, a dictation task that asks students to write what they hear is
integrative since it requires them to integrate their listening and writing
skills, vocabulary, grammatical knowledge, and spelling skills. Similarly,
a cloze task that asks students to fill in a gap in a phrase or sentence with only one suitable word requires them to read the text for global comprehension and to use specific vocabulary and a correct syntactic form (i.e., vocabulary
and grammatical knowledge) in the provided space. The word integrative,
therefore, suggests an interactive nature of language use as students respond
to the task.
2. Integrated tasks are considered to be a variety of integrative tasks. Unlike dictation or cloze tasks, which require students to integrate different aspects of their own language knowledge, skills, and processes, integrated tasks explicitly ask students to select and/or use a set of provided stimulus materials (e.g., written or spoken texts) in combination to generate their test performance. For example, students are asked to read a written passage about
public health and to listen to a talk on a related topic. Then, they need to
speak to respond to a question or prompt by using information from the
reading and listening texts to form their viewpoints. In this example,
students are said to integrate various sources of information to produce
their responses. Integrated tasks are useful for assessing language pro-
duction directly, such as speaking and writing for a specific purpose.
Receptive skills are indirectly evaluated in such an integrated test task.
One of CTT’s core assumptions is that a test score comprises what we are
looking for plus some irrelevant information (labelled as errors) that can
interfere with or give a false understanding of students’ ability, knowledge, or
attainment, for example. We summarise this concept as follows:

observed score = true score + error score
According to this principle, the first component of a test score is a true score that
is explained by what we aim to assess (e.g., the underlying construct(s)) by
asking students to complete test tasks or questions. For example, we seek to
know whether students can give directions using a map. In that case, their scores
should represent their ability to use appropriate words, comprehensible expres-
sions, and specific information relevant to the map. As can be seen from this
example, the ability we are interested in includes grammatical and lexical
knowledge, pronunciation, social skills (e.g., turn-taking, politeness), and
knowledge of how to read a map. This is the part of scores we often think of as
a true score – what underlies a given score.
Hypothetically, if a score comprises a true score plus zero error (true score + 0
error), a test or assessment has captured 100 per cent of the underlying con-
struct. However, it is unlikely that a test score is made up only of a true score
since various interfering factors can impact students’ performance. Therefore,
in CTT, the products of interfering factors or conditions are known as ‘errors’ or
an ‘error score’. Now let us return to the example of the map-reading and giving
direction task and think about possible errors that may affect a score. We will
feedback and other support to help students correct their errors. Students’
mistakes are not the same as measurement errors in assessment (manifested
through testing and test scores). For example, when a teacher asks her
students to explain the grammar rule related to a communicative task they
are learning, one student representative explains it correctly. Therefore, the
teacher feels delighted that her students have got the concept right.
However, in the exam, most students fail to correctly use the grammar rule
in a similar task they had accomplished in the classroom. To the teacher, this
is a shocking discovery. In this scenario, the teacher might have made an
error in her generalisation about student learning: while the student repre-
sentative might have correctly understood the grammatical concept, the rest
of the class might not have. Therefore, more students with different ability
levels should have been asked to avoid such errors. In this example, the
teacher’s inaccurate understanding creates the illusion of student collective
knowing. Such an observation error can be likened to an error score in CTT,
leading to ineffective assessment.
Finally, CTT has limitations, for example its reliance on the use of raw
scores to estimate errors, its assumption of item equivalence across the
whole test (i.e., sum scores), and error of measurement being treated the
same for all ability levels.
Understanding CTT is beneficial for teachers when they design a test or
assessment task or use an already developed test; it makes them aware that
a single observation is likely to result in a misleading conclusion about
performance. Awareness of errors in assessment is relevant to tests and
other non-test tasks such as portfolios and assigned coursework. For example,
in portfolio assessment of young children, parents may assist their children in
completing their portfolios. Therefore, the portfolio may include contribu-
tions from the parents involved. Furthermore, in another non-test condition,
many unknown factors can contribute to errors in assessment. For example,
students may use Chat Generative Pre-trained Transformer (ChatGPT) to help
them write an essay that can be completed at home, but they cannot access any
such assistance when they are asked to hand-write an essay in a controlled
testing condition. Their writing scores can differ significantly under these
different conditions.
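A simple simulation can make the ‘observed score = true score + error score’ idea tangible. In the following sketch (with invented values), single observed scores scatter around a student’s true score, while the mean of repeated observations comes closer to it, which is why a single observation can mislead:

# Simulating CTT: each observed score is the true score plus random error.
import random

random.seed(1)   # fixed seed so the illustration is reproducible
true_score = 70  # the ability level we actually want to capture
error_sd = 8     # interfering factors (anxiety, guessing, topic effects, etc.)

observations = [true_score + random.gauss(0, error_sd) for _ in range(5)]
print([round(x, 1) for x in observations])              # single scores vary
print(round(sum(observations) / len(observations), 1))  # the mean is closer to 70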
Reflection Box 4
What do you do to reduce errors in FA or SA?
Share your thoughts here: https://bit.ly/3t35TnR.
Figure 7 Stages of test development: plan → develop → administer → score → use (with ‘use’ feeding back into ‘plan’)
constructs to be tested, the methods and tasks being used to elicit student
performance, and the scoring methods. Test specifications can evolve and be
improved over time, and test items, tasks, and questions can be collected,
revised, and reused when appropriate (see the arrow from ‘use’ pointing to
‘plan’ in Figure 7). The appendix at the end of this Element provides an
example of Davidson and Lynch’s (2002) test specifications (a simplified data-structure sketch follows this list).
2. Question and task creation requires teachers to produce instructions for
students and create test items (questions and tasks) to elicit responses or
performance. Teachers need to choose assessment techniques when they
develop their specifications. Assessment techniques are determined with
reference to the nature of language skills or components. (We will further
discuss some test and assessment techniques in Section 5.1.6.)
3. Piloting and/or improving before use requires teachers to gather informa-
tion about test questions’ and prompts’ quality, appropriateness, and
suitability. A pilot study can start from internal team reviews (e.g., all
After these sub-stages have been completed and the tasks and items have been
corrected and improved, the test can be assembled for administration and use
(which involves formatting, programming, second-round pre-testing or piloting if
needed, deciding on the scoring methods, and ensuring test security such as no
accidental release of the test before the scheduled administration).
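Returning to stage 1 above, a test specification can be drafted as structured data. The field names in the following rough sketch loosely echo Davidson and Lynch’s (2002) specification components, but the rendering is our simplification, not their actual format:

# A simplified test specification represented as data (illustrative only).
spec = {
    "general_description": "Students identify main ideas in short explanatory texts.",
    "prompt_attributes": "One 100-150-word explanatory text per item set.",
    "response_attributes": "Four-option multiple choice; one correct key per item.",
    "sample_item": "What is the text mainly about? (A) ... (B) ... (C) ... (D) ...",
    "specification_supplement": "Vocabulary drawn from taught word lists.",
}

for field, value in spec.items():
    print(f"{field}: {value}")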
There are three main types of scoring method: objective, semi-objective, and
subjective.
• Objective scoring refers to scoring that does not require raters’ judgements.
That is, answer keys can be used for scoring. Restricted-response techniques
(e.g., multiple-choice and true/false) are scored objectively.
• Semi-objective scoring suggests that there can be some variation in the
correct answers or responses. Short constructed-response techniques such
as short-answer questions and cloze or gap-filling tasks may involve semi-
objective scoring as correct answers can appear in variable forms (e.g., word
choices and grammatical correctness of responses) that require informed rater
judgements. Some constructed-response tasks, such as dictation, may involve semi-objective scoring as acceptable responses are pre-identified (see further discussion on the dictation technique in Section 5.1.6
under ‘Constructed-Response Techniques’).
• Subjective scoring requires scorers or raters to make their judgements on
responses. Usually, performance assessments such as direct speaking and
writing tasks require subjective scoring. The term subjective is used because
test scores are based on scorers’ or raters’ perceptions about the quality of the
performance. It is subjective because there can be variations in test scores.
For example, two raters may assign the same performance different scores. In
subjective scoring, scoring rubrics and rater moderation can help ensure
consistency (Galaczi & Lim, 2022; Pill & Smart, 2021).
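The contrast between objective and semi-objective scoring can be sketched in a few lines of code (a hypothetical example; subjective scoring, by contrast, cannot be reduced to an answer key because it relies on rater judgement):

# Objective items use a single answer key; semi-objective items accept a
# pre-identified set of acceptable variants.
answer_key = {"q1": "b", "q2": "true"}                   # objective items
acceptable = {"q3": {"went", "travelled", "traveled"}}   # semi-objective gap-fill

def score(responses: dict[str, str]) -> int:
    total = 0
    for item, key in answer_key.items():
        total += responses.get(item, "").strip().lower() == key
    for item, variants in acceptable.items():
        total += responses.get(item, "").strip().lower() in variants
    return total

print(score({"q1": "b", "q2": "True", "q3": "travelled"}))  # 3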
with the highest test scores are generally considered for selection (as discussed in
the NRA).
In summary, developing summative tests or assessments is a complex process.
Typically, the development of standardised testing for large-scale assessment is
not a task for a teacher working on their own; it requires teamwork and expert
knowledge, with each team member responsible for leading a specific stage. To
learn more about the stages discussed in this section, see Bachman and Damböck
(2017), Davidson and Lynch (2002), and Green (2020).
The following are test and assessment techniques teachers may consider in their question (also known as ‘test item’) design
(see Brown & Abeywickrama, 2019; Purpura, 2016; Weir, 1990, 2005 for
a comprehensive discussion of test and assessment techniques).
Selected-Response Techniques
In selected-response techniques, students are limited in how they can respond to
questions or tasks. Such techniques include discrete-point techniques such as
checklists, analogies, multiple-choice, true/false, and matching. These techniques
are often used for assessing receptive language skills or linguistic knowledge (e.g.,
grammar and vocabulary) because the focus is often on comprehension, under-
standing, and/or identification. The following are selected examples of these techniques.
skills. A test using this technique is relatively easy to construct (e.g., ideas or
information from the text can be presented as statements with prompts) and
quick to score. In FA, teachers can use this technique to check students’
understanding in the classroom. In binary or dichotomous items, e.g., yes/no
questions, students have a 50 per cent chance of being correct using blind
guessing – a source of error in scores. They may answer incorrectly owing to
flaws, a lack of clarity, or inaccuracy in the statement, and they may obtain
the correct answer for the wrong reason or by merely picking randomly.
Teachers can use FA techniques to overcome this by asking students to
explain their answers. Providing grounds for answers is an extra task that requires reflexive thinking skills and additional writing.
b. A multiple-choice technique is often adopted in language assessment. These
items allow students to select an answer from three to five choices, although
four may be the most common. The popularity of this technique is driven by
the practical nature of its administration and scoring, especially in large-
scale standardised assessments. This technique can test a broad range of
reading and listening constructs or knowledge of productive skills such as
grammatical, vocabulary, pronunciation, and pragmatic ability. Figure 8
illustrates an example of a multiple-choice reading comprehension test.
As with the dichotomous techniques, students are asked to choose only one
of the multiple options as the correct answer. It is essential to pre-test, edit, and
check the difficulty levels and distractor functions for this type of test. The sole
use of and reliance on this technique in a test has been criticised and discour-
aged (Alderson, 2000). For instance, there is the possibility of guessing. Even
when five-choice questions are used, the chance of being correct by guessing is 20 per cent.
□ Go to the beach
□ Go to the dentist
□ Do some gardening
□ Get a haircut
□ Go shopping
Constructed-Response Techniques
Constructed-response techniques ask students to produce answers or responses
on their own. Compared to the selected-response techniques, these minimise the
effect of guessing. These techniques can be used to elicit short or extended
responses. Short constructed-response techniques are, for instance, short-
answer questions (e.g., one word only, or no more than three words), cloze or
gap-filling (e.g., one word per one missing space in a text), and diagram
completion (e.g., no more than three words). These techniques are somewhat
restricting and controlling. The following are examples of this technique in
more detail:
a. A cloze technique is mainly used for assessing reading skills. Every nth word
(e.g., every fifth, sixth, or seventh word) in a paragraph is deleted, and
students must devise a word to replace it. This technique taps into students’
broad comprehension of the stimulus text. They have to work out what might
be missing that could complete the meaning. When they complete the gap,
they must pay attention to content meaning, grammatical features (e.g.,
tenses, verb forms), and word forms (e.g., plurals, nouns, adjectives, or
adverbs). Figure 10 presents an example of a cloze technique in which every sixth word has been deleted.
The cloze technique can be (1) __________ with other techniques, such
as (2) __________ mixing it with a multiple-choice (3) __________ or by
providing a list (4) __________ words from which students or (5)
__________ can select the correct answer. (6) __________ combination
makes test tasks easier (7) __________ the cloze format on its own.
Answers (based on the original text): (1) combined (2) by (3) technique
(4) of (5) candidates (6) This (7) than
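The deletion procedure itself is easy to automate. This minimal sketch (a simplified implementation that does not handle punctuation or choose a sensible starting point) deletes every nth word and keeps the deleted words as the answer key:

# Generate a cloze task by deleting every nth word of a text.
def make_cloze(text: str, n: int = 6) -> tuple[str, list[str]]:
    words, key = text.split(), []
    for i in range(n - 1, len(words), n):  # every nth word, counting from 1
        key.append(words[i])
        words[i] = f"({len(key)}) ________"
    return " ".join(words), key

passage = ("The cloze technique can be combined with other techniques, "
           "such as by mixing it with a multiple-choice technique.")
cloze_text, answers = make_cloze(passage, n=6)
print(cloze_text)  # gaps replace 'combined', 'by', and 'technique.'
print(answers)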
tions). These techniques can assess productive skills such as writing and
speaking.
Instructions: Look at the picture below. You are to talk about what you can see. You have
30 seconds to look at the picture carefully and to prepare what to include in your talk. You
will have 30 seconds to complete your talk. You should speak clearly and not rush. Start
with the words ‘In the picture, I can see ….’
Instructions: You are to listen to a talk about the use of dictation tests to
control immigration in Australia. You are to write down exactly what you
have heard in the missing spaces. First, the text will be read once at
a natural speed; second, the text will be read with a pause between
sentences; third, the text will be read again at a natural speed. You should
check your spelling and use of capital letters.
Test-takers hear:
The ‘White Australia Policy’ and the dictation test under which it was
infamously enforced provided central policy tools in the quest to control
Australia’s immigrant population from Federation in 1901 until well into the
twentieth century. The dictation test, which was a key element of the
Immigration Restriction Act 1901, has always been associated with the
question of race.
It was administered to ‘coloureds’ and ‘Asians’ in order to have an
apparently neutral reason to deport them. The last person to pass the test
Figure 12 (cont.)
It was administered to ‘coloureds’ and ‘Asians’ in order to have an
apparently neutral reason to deport them. (3) _______________
_______________________________________________. It became fool-
proof, as it was designed to be. The applicant would be given the test in
a language that (4) _____________________________________________
__________________________________________________________
and, upon failing, they would be told that the authorities (5)
___________________________________________________________
_________.
mation from a text. A summary technique can be combined with the dictation
technique. Students read and summarise a text through writing or speaking.
Instructions must ensure that students understand the requirement (e.g., Write
no more than 50 words; Speak within 30 seconds after the beep sound.).
d. A free-verbal or free-written recall technique is a variant of the dictation and
summary technique. This is an extended integrative technique for assessing
reading, listening, speaking, or writing. In a reading or listening test, students
read or listen to a text and then speak or write about what they have understood
from their reading or listening. This type of assessment focusses on accurate
comprehension of a given text. In a speaking assessment, in addition to correct
recollections of information, the assessment can focus on the intelligibility of
speech, pronunciation, and fluency. In writing tests, the assessment criteria
can include content accuracy, spelling, grammatical accuracy, vocabulary use,
and mechanics (e.g., punctuation). This technique can become an integrated
task when students are required to add or relate their position to the topic and
their examples or support in an essay or speech. This technique seems natural
in the sense that there are no test questions that intervene in students’ receptive
or productive language processes.
In SA, performance is, however, dependent on task familiarity, practice,
memory capacity, and the topics and the complexity of texts. Some students
can comprehend a text well while reading or listening but may struggle to
remember what they have just read or heard shortly after finishing reading or
listening. In a reading or listening assessment, this technique may not reveal
students’ natural ways of reading or listening because they do not usually
need to tell people what and how much they can recall. The relevance of this
technique for FA is that it helps students learn to use memory strategies,
which are essential for enhancing language learning and use. Teachers can
help students with feedback and memory strategy instructions. Students can
practise recalling what they have read or heard, and they can start with
simple texts, then move to more complex and lengthier texts.
e. An essay technique is a flexible constructed-response method for assessing
writing. Generally, students are asked to independently write an essay
responding to a statement or question (see integrated tasks in
Section 4.2.2). Essay topics can range from personal topics, such as my
last holiday, my hobby, or my family, to academic or social topics, such as
issues in climate change, technology, education, or social issues. In EMI and
CLIL settings, essays are most likely to be subject topic–related. Students
are expected to state their position, provide support, reasons, and examples,
and organise an essay based on what they have been taught (e.g., introduc-
tion–body–conclusion). As far as possible, it would be advisable to avoid
offensive topics that may impact negatively on student performance.
Dynamic Assessment
Dynamic assessment focusses on the development and processes of learning
during teacher-guided activities. Generally speaking, it tends to focus on an
individual student who is regarded as unique in their language learning needs and
their ability to tackle a learning task successfully. When students encounter
difficulties, teachers provide scaffolding and feedback to support task engage-
ment, considering the student’s current knowledge and ability. Guidance is fine-
tuned to each student’s particular learning needs or difficulties with the learning
task. The aim is not to spoon-feed students in terms of how to correct their errors
but to provide advice on how to address problems, taking account of their learning
dispositions. Teacher feedback in dynamic assessment should be premised on
a clear view of the learning processes or steps involved. To achieve this, teachers
have to analyse and understand the nature of learning involved in any pedagogic
task. This feature makes dynamic assessment different from other kinds of
formative support. The principles of dynamic assessment have been extended
to cover whole class teaching contexts. See Poehner and Infante (2017) for further
discussion.
Learning-Oriented Assessment
Learning-oriented assessment (LoA) takes a learning approach in which the
real-life educational environment (e.g., learning outcomes, mandatory internal
and external assessments), cognition (e.g., thinking processes required to
develop and learn the target language features), students’ affect (e.g., feelings and emotions), and social contexts (classroom settings, teachers and peers) are
considered (Carless, 2007, 2015; Turner & Purpura, 2017). Thus, LoA can
involve a cycle of teaching, learning, and assessment both inside and outside the
classroom, and it can use all forms of assessment, including both FA and SA.
For example, it can begin with teachers using the learning outcomes or object-
ives to develop teaching, learning, and assessment activities (e.g., how to
engage in small talk). Students are then introduced to the concept of small
talk with real-life examples of small talk. They work individually, in pairs, or in
groups to learn from the examples. After that, they are encouraged to personal-
ise their own small talk. In-class activities include teachers monitoring student
learning, tracking student progress, and providing feedback on their learning
and performance (e.g., suggesting correct pronunciation or word choice). Self-
and peer-assessment can also be promoted as students complete learning activi-
ties. Teachers can also use SA techniques, such as quizzes and achievement
tests, and non-test assessment tasks, such as project-based and portfolio assess-
ments, to help students realise real-life applications of small talk. Both FA and
SA techniques can be aligned with the learning outcomes or objectives. In out-
of-class activities, students are invited to engage in independent study projects
(on topics and materials guided by teachers). They may be encouraged to use
digital technology to conduct their studies multimodally. Teachers can also offer
a follow-up discussion with students about their out-of-class learning during
class time. The fundamental premise of LoA is that students should learn new
skills or useful knowledge while participating in assessment activities. In LoA,
targeted language-use scenarios are adopted to ensure that the learning activities
meet students’ personal, educational, and social goals as appropriate. See
Chong and Reinders (2023), Jones and Saville (2016), and Turner and
Purpura (2017).
Usage-Based Assessment
Ellis et al. (2015) and Douglas Fir Group (2016) discuss a theoretical
framework that describes and explains the nature of language use as lan-
guage learning. This framework postulates that people recognise the func-
tion of language through how it is used. For example, frequently or repeatedly used language features in a given context are more likely to be noticed, in terms of how and what they are used for, than infrequently used ones. This idea recognises that language use in a real-life activity is often
patterned (see Cadierno & Eskildsen, 2015; Dolgova & Tyler, 2019; Hall,
2019). For that reason, concepts in usage-based language learning have
implications for language teaching and assessment in terms of what to
teach and assess. For example, Thai students learning to use English in
Thailand require different foci on content and activities from those studying
English in England because the needs for and exposure to English usage are different. Therefore, usage-based assessment should be sensitive to local
language use and learning.
Scenario-Based Assessment
Purpura (2021) provides a useful discussion of scenario-based assessment.
A scenario is an imaginary language-use scene that is used to generate a real-
life experience in which a student interacts with a task. The task may include
a collaborative activity among students and teachers. Language use in a given
scenario can be dynamic and flexible. Purpura (2021) operationalises scenario-
based assessment through computer technology, although it can be carried out in
The following are some significant non-test SA design features that are relevant
to the four approaches discussed above. We focus on assessments that will
contribute to student attainment or achievement decision-making.
not constrained by the extent of time pressure as they are in a standardised test
condition.
Portfolio Assessment
Portfolio assessment is a broad approach that focusses on a collection of
students’ work over time (e.g., an assigned task that cannot be completed in
standardised conditions). It can accommodate collections of separate or inte-
grated language skills. It takes a learner-centred approach to assessment
because it focusses on students’ ongoing learning processes and the outcomes
of such processes. Therefore, students collect various drafts and outlines of their
work that have been commented on for improvement, the materials they have
used and produced, and their reflections as they completed the assessment task.
Portfolios can be constructed in the form of physical folders or artefacts.
E-portfolios are portfolios that embrace computer and information technology
as well as cloud-based storage and sharing. E-portfolios have gained popularity
Example of an analytic rubric criterion (extract):

Analysis (25%) – Point: ________
• 25 to >21.25: Effectively analyses the interview to support the research question.
• 21.25 to >18.75: Analyses the interview to support the research question very well.
• 18.75 to >16.25: Competently analyses the interview to support the research question.
• 16.25 to >12.5: Meets the minimum requirements for interview analysis.
• 12.5 to 0: Does not meet the minimum requirements for interview analysis.
Task-Based Assessment
Task-based assessment is directly related to a task-based language teaching
approach, which focusses on helping students learn and use the target language
communicatively in authentic language-use situations. Task-based teaching and
learning is a communicative approach to language teaching and learning that
focusses on fulfilling communicative tasks. It does not ignore the importance of
raising awareness of relevant linguistic form that makes language use accurate and/
or appropriate (i.e., focus on form). Task accomplishment tends to be integrative
and integrated as students must combine various language skills for successful
communication. Task-based language assessment can be both formative and sum-
mative as students can receive feedback on their performance to improve their
communicative skills, contributing to overall learning attainment. Task-based
assessment may be designed similarly to portfolio assessment, but students may not
be required to submit all their drafts and other artefacts for evaluation. Instead, the
focus may be on the outcome of fulfilling the task rather than on the detailed
processes and development that led to the outcome, as in portfolio assessment.
Reflection Box 5
Is it necessary to plan an SA for a language classroom? Why or why not?
Share your thoughts here: https://bit.ly/46WJmXw.
Figure: Quality criteria in language assessment – reliability, validity, impact, practicality, ethics, and fairness

7.1 Reliability
students’ performance. When students complete the same test, they should perform
similarly today, yesterday, and tomorrow. If students pass the test today but fail the
same test tomorrow, then there is a problem with reliability. We cannot rely on this
test for decision-making. In statistics, a reliability estimate of a test ranges from 0
(completely unreliable or random) to 1 (completely reliable). A reliability estimate
of 0.90 is expected for tests used to make a high-stakes decision.
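In practice, reliability is often estimated from a single administration as internal consistency. The following minimal sketch (with invented scores for five students on four dichotomous items) computes one common estimate, Cronbach’s alpha, from the item and total-score variances:

# Cronbach's alpha: k/(k-1) * (1 - sum(item variances) / variance(totals)).
from statistics import pvariance

def cronbach_alpha(item_scores: list[list[float]]) -> float:
    """item_scores: one inner list of item scores per student."""
    k = len(item_scores[0])  # number of items
    totals = [sum(student) for student in item_scores]
    item_vars = [pvariance([s[i] for s in item_scores]) for i in range(k)]
    return k / (k - 1) * (1 - sum(item_vars) / pvariance(totals))

data = [[1, 1, 1, 0], [1, 0, 1, 1], [0, 0, 1, 0], [1, 1, 1, 1], [0, 0, 0, 0]]
print(round(cronbach_alpha(data), 2))  # 0.79 in this invented example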
It is important to note that the concepts of reliability in CTT cannot be
applied to some aspects of classroom assessment, such as learning-oriented,
formative, and dynamic assessment. It is unreasonable to expect that stu-
dents’ performances can be consistently measured because they are develop-
ing their knowledge and skills. Their performances can fluctuate over time
and across contexts and tasks. Similarly, in portfolio assessment, much of
students’ work can be in the form of drafts and revised drafts, so their
performances cannot be expected to be consistent. Therefore, in learning-
oriented or dynamic assessment, trustworthiness is more appropriately used.
The notion of ‘trustworthiness’ in language assessment refers to the quality
of assessment activities that can be dependable, useful, and relevant to
teaching and learning activities.
7.2 Validity
In the mainstream language assessment literature, validity refers to the extent to
which scores infer the target ability or construct and are used appropriately and
ethically in decision-making (Chapelle, 2021; Chapelle & Lee, 2022; Phakiti &
Isaacs, 2021). In this section, we explain three aspects of validity that are
relevant for consideration in FA and SA: construct, content, and face validity.
7.3 Practicality
In language assessment, ideally, we would like to collect as much language and
language-related information as possible to have confidence in our decisions.
Nonetheless, there is a need to consider whether such an expectation is realistic
because assessment needs to be practical. The term practicality is related to the
extent to which an assessment is feasible in a given context. In both FA and SA,
teachers need to consider the cost, resources, and time required to develop, administer, and score or evaluate students’ responses.
The practicality considerations of FA differ from those of SA in various ways.
For example, classroom AfL is ongoing and embedded in teaching and learning
activities. The teaching and learning activities, whether in-person or online, are
designed to lead to qualitative feedback to the students. Thus, AfL can be time-
demanding for teachers and students. It may not be practical to give individual-
ised feedback to each student. In SA, since the main concern is obtaining outcomes accurately and efficiently, test designers routinely
consider how much time will be needed to develop a test or assessment task,
how much time students will need to complete the test or assessment, and how
much money will be spent on development and administration. Practicality
considerations can also cover issues such as throughput efficiency (the number
of students/test-takers per administration) and marking efficiency (use of
machine marking or human raters). A lengthy and complex assessment can
collect extensive evidence of learning, but it is more expensive and time-
consuming to develop and administer. A shorter assessment may be time-
efficient and affordable, but may not sufficiently collect students’ knowledge
or skills of interest.
7.4 Ethics
The issues of ethics and fairness impact the quality aspects of testing and
assessment. These concepts are intertwined (see the International Language
Testing Association (ILTA) Guidelines for Practice, which comprehensively
cover ethical assessment). Generally speaking, the term ‘ethics’ concerns the
broad educational principles and social responsibilities that assessment develop-
ers, curriculum designers, teachers, and policymakers have to address. Ethics are
at the heart of testing and assessment practice. Ethical considerations are part of
the decision-making when teachers or test developers select test questions or
assessment tasks for a given purpose and group of students, decide on the marking
and reporting framework, and design protocols for safeguarding students’ per-
sonal and assessment data.
1. designing and using tests and assessments relevant to the unit of study or the
subject being taught. Using tests or assessments not aligned with the learn-
ing outcomes or the teaching and learning activities is unethical;
2. informing students about essential tests and assessments they must complete
in the course and the consequences of not succeeding. Students should be
aware of their assessment responsibilities;
3. reminding students of assessment dates and allowing them to ask questions
to help them prepare for assessment tasks;
4. ensuring that students have equal access to essential resources for learning.
7.5 Fairness
In the discussion on ethics in Section 7.4, issues of fairness are implicated.
Fairness is related to equality of treatment and opportunity for all students (see
Kunnan, 2018). In FA, for example, if teachers give detailed feedback and
support only to some students, but not to others, they are unfair in their
assessment practice. If a high-stakes SA offers advantages only to some stu-
dents, but not to others (e.g., owing to gender, race, religion, or socio-economic
status), it is not a fair assessment. An assessment can be considered unfair if the
test items and tasks involve knowledge and skills that are accessible only to
some of the students and/or test-takers, for example topics that some students
may know more about owing to their backgrounds in terms of language,
ethnicity/race, gender, religion, and socio-economic status (for further discus-
sion, see Mirhosseini & De Costa, 2019). The following scenario illustrates how
fairness and ethics can be intertwined, presenting plausible dilemmas that
teachers may face.
Imagine that a school principal has decided to introduce a computer-based English test with automated scoring for the school. The argument for this decision is that it will save time for marking as scoring is automated, making the school look 'up-to-the-minute with technology' to the
public. The teachers know that many students do not yet have sufficient skills to
use computers and are from a low socio-economic background. Their students
barely use computers to complete classroom tasks. The teachers also learn that
none of the teaching and learning activities they have been covering in class are
related to the test tasks and activities. There is a severe lack of alignment between
the classroom activities and the test tasks. They fear that many students will fail
if this computer-based test is used. This does not seem fair for students as the test
content and the delivery method are inappropriate. Nonetheless, the teachers
know that the school principal does not like people to disagree with their ideas;
expressing such an opposing concern could mean running a career risk. If you
were the teachers, what would you do?
In this scenario, the principal has not advanced a ‘fairness’ argument for the
use of computer auto-marking. The professional dilemma for the teachers is
whether to challenge the principal’s authority. To do that might cause friction
between the teachers and the principal. At the same time, the teachers cannot
simply ignore the concerns; that course of (in)action might lead to students
failing the test simply because it does not align with what they have been taught.
In many contemporary public education systems, schools and colleges
serve minoritised students from diverse language backgrounds (e.g., students from Hispanic, Spanish-speaking communities in the USA, or speakers of indigenous languages in Australia). Where the language of academic communication is standard English (however defined), many minoritised students may receive unequal treatment and opportunity, particularly those at an early stage of learning English. For these
students, the educational provision is unfair.
On a positive note, fairness can be facilitated by special considerations and
accommodations. Considerations of accommodations for students with disabil-
ities (both physical and cognitive) or those in difficult circumstances
(e.g., unexpected illness or family tragedy) are essential (see Abedi, 2014;
Abedi et al., 2020). Accommodations in language assessment may require
some changes or modifications to the test questions, assessment tasks, adminis-
tration conditions, and procedures. For example, accommodations may include
extending the amount of time for completion of the test, allowing the use of text-
to-speech software to read a text in a reading test or to produce speech in
response to a speaking task, or providing an amanuensis (a person writing
down answers for the student or test-taker).
7.6 Impact
Earlier in this Element, we discussed the meanings of 'stakes' in testing and assessment. One well-known form of assessment impact is washback: the influence of testing and assessment on teaching and learning. Washback is relevant to both FA and SA. Teachers' AfL can
have positive washback impacts on students’ linguistic, cognitive, and social
engagement. For example, when students receive some corrective feedback on
their learning, they can recognise their current difficulties or weaknesses and focus on improving them. This is an example of positive washback. When students
are guided to become autonomous and self-regulated in their learning through
the use of assessment criteria (often in the form of rubrics and sample questions/
tasks), self-practices or rehearsals, and self-tests and assessments, they are
likely to develop a capacity for autonomous learning as they have realised the
benefits of self-monitoring and exploratory enquiry. The impact of assessment is omnipresent in formal education; in English language education, it is most keenly felt in SA. The washback of large-scale
commercially marketed tests such as IELTS and TOEFL often impacts the
content of teaching programmes.
Reflection Box 6
Choose one of the quality criteria to focus on in a classroom assessment.
How would you apply its principles in your assessment practice?
Share your thoughts here: https://bit.ly/3TkzqnK.
8 Further Developments
In this Element, we have explored functions and practices of language assessment. Section 8.2 focusses on the potential contributions of research and development to testing and assessment activities in language teaching. In Section 8.3, we provide recommendations for further professional development. A Glossary of Language Assessment and an Appendix are provided after this section.
Testing and assessment activities, for example AfL and AoL, serve various
educational and professional functions and purposes. The alignment among
them is often implicit. Many language curriculum specifications serve as teach-
ing and learning objectives and as assessment criteria at the same time. Table 4
is an example of teaching–learning objectives that can also be used as assess-
ment criteria.
Students need to know where they are going and what steps will lead them there; FA can help students
achieve this by offering appropriate guidance (Duckor & Holmberg, 2023).
Teachers should ask the following questions before giving feedback: (1) Should
feedback be provided? (2) What feedback should be provided? (3) When should
it be delivered? (4) Who should give it? and (5) How should it be given?
Table 4 Assessment purposes, activities, what to assess, examples, and primary intended uses

1. Purpose: To enable students to achieve or attain the target learning objectives.
Assessment activities: AfL & AaL ➔ AoL.
What to assess: Linguistic knowledge, language skills, cognitive and psychological processes, and other academic challenges to learning.
Examples: Quizzes, exercises, homework, informal discussion, teachers' questions, diaries, and other reflections.
Primary intended use: To provide feedback on performance and to adjust or develop teaching activities.

2. Purpose: To determine whether students have achieved or attained the learning outcomes of a course; to summarise the level of achievement or attainment.
Assessment activities: AoL ➔ AfL & AaL.
What to assess: Language skills or abilities related to the learning outcomes and classroom activities.
Examples: Midterm and final tests, assignments, portfolios, and group projects.
Primary intended use: To award a grade; to produce grade transcripts; to certify course fulfilments.

3. Purpose: To find out the level of readiness or preparedness (e.g., what areas of language skills/abilities need more improvement and support).
Assessment activities: Diagnostic assessment; FA (for support of learning & success) & SA (as based on identified ability criteria); see No. 6.
What to assess: Language skills that enable success in the study of different areas; language skills to perform required occupational responsibilities (e.g., as a receptionist or a secretary).
Examples: Practice or mock tests; diagnostic assessments (e.g., DELNA); in-house workplace assessment.
Primary intended use: To inform a recommendation of whether a given student or employee needs further specific language support.

4. Purpose: To admit or accept new students into a programme; to determine which applicants should be employed.
Assessment activities: Admission or selection tests; aptitude tests; SA (as based on target language constructs, skills, and knowledge); FA (for test-taking preparation & score users); see No. 6.
What to assess: Language ability specific to a programme or degree offered in a given academic institute or provider, or to a job interview; language aptitude.
Examples: College or university entrance examinations; Test of English for International Communication (TOEIC); job interviews in the target language; language aptitude tests for military personnel selection.
Primary intended use: To accept or reject students or applicants.

5. Purpose: To place students or candidates into an appropriate level of a subject or area.
Assessment activities: Placement tests.
What to assess: Language skills at the point before being placed into a specific subject; a raw ability to learn new languages.
Examples: English Placement Test; Oxford Placement Test; in-house placement tests.
Primary intended use: To allocate students or candidates into a programme that suits their current language ability.

6. Purpose: To determine a level of general or specific-purpose proficiency (e.g., beginner, intermediate, advanced).
Assessment activities: Proficiency or specific-purpose language tests; admission tests; see Nos. 3 & 4.
What to assess: General language ability free from any previous learning, specific instructions, or language courses; professional specific-purpose language ability.
Examples: Academic language proficiency tests such as TOEFL and IELTS; university or college entrance examinations; specific-purpose tests such as OET; language tests for immigrants; Aviation English Test.
Primary intended use: To certify a level of proficiency; to accept or reject applicants.

7. Purpose: To determine the level of student learning attainment according to predetermined criteria (known as standards or benchmarks) used to guide language curriculum and assessment design.
Assessment activities: State, province, or national assessment; see Nos. 1 & 2.
What to assess: Language skills related to the expected level of attainment in a given school grade (e.g., Grades 1 to 12) defined by governments (e.g., ministry of education).
Examples: National/state curriculum standards; Bloom's taxonomy; ESL scales used in Australia; ACTFL Proficiency Guidelines; CLB standards for English as a second language; CEFR.
Primary intended use: To ensure that students meet an expected level of standards relative to their grades or levels; to promote students' self-assessment; to fund schools or educational sectors.
Teachers can also guide students to engage in self- and peer-assessment (e.g., how to use assessment criteria and what to look for in their work).
[Figure: Topics and areas for language assessment — proficiency or competency frameworks; multilingual proficiency; social dimensions in classroom assessment contexts; fairness and ethical assessment; applications of computer technology; integrated language skills]
5. Research that focusses on applications of computer technology in FA and SA. However, teachers need to be vigilant and critical about the
use of technology since its introduction changes the nature of assessment
methods, language use, and ways of observing performance. Teachers
should aim to examine and address the theoretical, methodological, and
fairness issues and other challenges they face, for example, in test design,
administration, and scoring, as well as those faced by students and test-takers
(e.g., test preparation, access to resources, and technology).
In addition to the use of technology for test delivery, there has been a rise
in the use of automated scoring technology (AST), in which AI and natural
language processing (NLP), for example, have been adopted to replace
human scoring. Developers who adopt AST as part of their test or assess-
ment design have been challenged by critical questions about the reliability,
the accuracy, and the suitability of automated scoring use (in terms of the
real-world construct of writing or speaking ability). This, incidentally, has
triggered public suspicion in relation to claims about the reliability and the
validity of automated scoring made by researchers who are affiliated with
the institutions that are developing and marketing the products.
Teachers should enquire about the use and the impact of AST on their
classroom practice, particularly as there could be unforeseen or unintended
consequences for students, test-takers, and stakeholders. For example, much
attention has been drawn to AI technology that can threaten the integrity of
assessments (e.g., by writing on behalf of students). AI technology is already sufficiently sophisticated that detecting AI plagiarism can be difficult. Teaching students to use AI ethically is also essential. If teachers do not have the knowledge and/or the technology to detect such use, it would be
challenging to determine students’ actual learning attainment. Therefore,
now that technology has become an integral part of language assessment, it is
essential to research its impacts on the validity and the trustworthiness of
assessment practice and students’ lifelong learning.
6. Research that focusses on fairness and ethics in language assessment.
Fairness and ethical considerations in classroom language assessment are
integral to the abovementioned topics and areas. Fairness and ethics are
complex matters as all assessment contexts are different; each case has to be
considered in its own context.
Although there is no one-size-fits-all, bullet-proof measure to guarantee
fairness in language assessment, we argue that research into practices of
fair and ethical assessment would allow teachers to realise the problems
that they and their students face when assessment is unfair. For example,
teachers can focus on transparency of purpose in language assessment
(e.g., reasons for promoting self- and peer-assessment in the classroom;
clarity in the task, the assessment instructions, and the evaluation criteria
used; and how classroom assessment includes students with special needs).
Teachers can also investigate how they and their students shape the class-
room environment in which all students have access to equitable support
and resources (e.g., through appropriate feedback provision and opportun-
ity for students to access educational material or technology) that give
them an equal chance to be successful in their learning and achievement.
Reflection Box 7
Reflect on the symbolic use of tests and assessments as guns by Spolsky
(2012). Have you seen or experienced instances in which tests or assess-
ment practices were misused?
Share your thoughts here: https://bit.ly/48cSNTW.
Appendix: Test Specifications
There is no single correct structure or framework for test specifications (see Alderson
et al., 1995; Bachman & Damböck, 2017; Bachman & Palmer, 1996; Brown &
Abeywickrama, 2019; Carr, 2011; Fulcher, 2010). The complexity of test
specifications depends on the nature of the language constructs or skills to be
tested and the resources available. Fulcher (2010), for instance, presents various
specifications for the whole language assessment system (e.g., specifications for
assessment production, administration, scoring, and validation). This appendix
presents Davidson and Lynch’s (2002) test specifications, which are useful for
assessments in the classroom context.
1. The general description (GD): This section provides a brief statement of what the test or assessment is intended to assess. The following is an example of a GD:
GD: Students need to be able to identify the main topic of each paragraph in
a written text. These days, students read texts on computers, tablets, and mobile
phones daily. In this test, students are required to illustrate their ability to
identify the main topic of each paragraph by reading a text with several
paragraphs, selecting a heading for each paragraph from a list of headings,
and then dragging and dropping it into the space above the relevant paragraph.
2. The prompt attribute (PA): This section describes the assessment tasks’ charac-
teristics, that is, what students or test-takers will be given to inform them of what
they need to do to complete the test. A PA may include an instruction or direction
to complete the task and the technique(s) to be used (e.g., multiple-choice, short
answers, essay, or oral interview). The PA may vary according to the skills or
language components being assessed. For example, in a reading or listening test,
the PA needs to include details of the characteristics of the text (e.g., topics,
vocabulary range, length, text familiarity, and sources, such as authentic, simplified, or scripted texts). For a speaking test, the PA describes how students or test-
takers will be prompted to produce their speech (e.g., warm-up questions
followed by a series of tasks (usually from simple to more complex ones);
who will interact with them; and whether their speech will be recorded). The
PA may describe the prompt’s presentation mode (e.g., paper-based, computer-
based, multimedia-multimode prompts) and the time allowance (e.g., ‘You have
1 hour to complete this assessment.’). The following is an example of a PA:
PA: The students will read two written passages about a holiday destination
and wildlife. The students are familiar with these topics and relevant
vocabulary from their coursework. They have also had exposure to passages
of a similar length. The passages can be based on or adapted from online
magazines or newspaper articles. The content can be modified in terms of
vocabulary and sentence structures to suit the students’ proficiency levels.
Each passage should be about 220–250 words long and be organised into
six to seven short paragraphs.
Students will complete five items for each passage (ten items across the two
passages). They will be given 20 minutes to complete the items for both
passages. The first passage and its items will be presented on screen. After
completing the first passage, students must submit their answers by clicking
the ‘Next’ button. At that time students will be prompted to proceed to
the second passage. Should the time limit of 20 minutes be reached, their
answers will be saved automatically, and no further answers can be supplied
at that time. Students will then be prompted to go to the next section of the test.
Each text should appear similar to how it appeared in its original form, but
this could be adjusted to suit the context of the students’ learning or to make its
appearance similar to that of the texts used in class. Provision should be made
for students with disabilities to complete the test without undue hindrance.
Each passage will include a title and will be formatted in a similar way to
the original version. A picture may be incorporated into the text if that is
deemed to improve its authenticity. Seven plausible headings describing the
paragraphs’ main topics will be provided at the top of the screen. A space or
box will be provided above each paragraph. Students are to drag and drop
one of the provided headings into each space using the computer mouse so
that each heading describes the paragraph below it. Each of the headings provided can be used only once; two extra headings do not describe any of the paragraphs. The heading for the first paragraph may be offered to
students as an example of how to respond to the task.
3. The response attribute (RA): The content of this section overlaps with that
of the PA section. It includes how students or test-takers will respond to the
PA. The information includes how students or test-takers should provide
answers or responses to the questions or tasks. For example, test-takers
choose one option per question on a computer screen in a computer-based
multiple-choice test. Once an answer has been submitted, it cannot be
changed. In an essay test, test-takers are asked to respond to the task by
writing in a designated place and producing a specified word count (e.g.,
‘Write at least 200 words.’). The following is an example of an RA of the
same reading test as in Point 2 (PA):
RA: The students will read two passages about a holiday destination and
wildlife. Each passage contains several paragraphs. A list of seven plausible headings is provided above the passage.
They are to read each text first and consider the main topic discussed in each
paragraph. Then they will choose one of the provided headings for each
paragraph that best describes that paragraph. The first paragraph has been
completed for students as an example. They are given a total of 20 minutes for
this section.
They will use the computer mouse to drag and drop each chosen heading
into the provided space/box above or next to the relevant paragraph. Each
heading can be used only once.
4. The sample item (SI): This section provides an example of the assessment
tasks to be developed. The directions or instructions to complete the tasks
should be explicit and formulated for students or test-takers to follow and for
item writers to understand. Examples allow item or task writers to replicate the
example tasks in parallel form. An SI may link to the specification supplement
in Point (5); examples of lessons, classroom activities, exercises, or past tests
or assessments that can be modelled may be included in this section.
5. The specification supplement (SS): While this section is optional, its inclu-
sion can be useful for item writers. It can function similarly to an appendix.
Too much detailed information within each section (Points (1)–(4)) can distract the item or task writers from the crucial points, so such detail can be moved here. An SS may include relevant tips or
suggestions for selecting topics, texts, question or task formulations, sample
texts and source texts, lessons, previous tests or assessment tasks, and
criteria for scoring or assessment.
Glossary of Language Assessment
Accommodations: Alterations or modifications of assessment procedures,
deliveries, or administrations to allow students with special educational
needs and/or disabilities to participate in language assessment activities in
a way that will enable them fully to show their potential and their
capabilities.
Accountability: A concept to ensure that objectives and learning outcomes are
effectively and appropriately delivered, supported, and assessed in accord-
ance with published criteria. This concept is essential for gaining public
confidence.
Additional language: An increasingly adopted term to replace the notion of a
second or foreign language. ‘Second’ or ‘foreign’ denotes contexts and
processes of language use and language learning that are assumed to be
separate and different from those of the first language; for example, English is
learnt and used as a foreign language in places such as Spain and South Korea.
The term ‘additional’ offers more comprehensive coverage for the diverse
language learning contexts and use in contemporary settings.
Administration: A predetermined process, usually officially sanctioned and
verified by the assessment authority, that test administrators, proctors, and
students must follow when completing a given assessment (e.g., instruc-
tions, order of test sections, time allowance).
Integrated task: An assessment task that requires students to use more than
one language skill to illustrate their ability or performance (i.e., multi-
modal assessment). The performance outcomes are related to productive
language skills, that is, writing and speaking. For example, students read a
news article about issues of animal extinction, watch a documentary about
it, and then write an essay to present their position about the issues using
information from the article and the documentary as well as their own
views.
Language assessment literacy (LAL): Conceptual understanding and
working knowledge about various kinds and purposes of language
assessment, how to vary and use them to improve or inform teaching
and learning, and how to use them to make decisions about students.
Language for specific purposes: Specific language style and specialist or
technical language for a given profession or context of language use. For
example, the terms ‘bull’ and ‘bear’ to describe stocks and shares market
conditions in economics.
Language proficiency: General ability to understand language and use it to
communicate across various language modes. Typically, language profi-
ciency is placed along a continuum from non-user and beginner to inter-
mediate and advanced users. See Constructs.
Language skills: These are conventionally labelled as listening, reading, speaking, and writing. They are not entirely independent from one another
in language use or learning. For instance, writing requires an ability to
read, and speaking involves listening skills.
Learning-oriented assessment (LoA): Assessment that emphasises the
promotion of various aspects of learning as a key goal.
Measurement: Quantification of abstract concepts or ability into numbers or
scales. For example, a zero is given when students answer a question incor-
rectly, and one is given when they answer correctly. Their scores are accumu-
lated to derive a total score that is then used to quantify their knowledge.
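For illustration, consider a quiz of $k$ items scored dichotomously; the total score is simply the sum of the item scores:
$$X \;=\; \sum_{i=1}^{k} u_i, \qquad u_i \in \{0, 1\},$$
so, for example, the response pattern $(1, 0, 1, 1, 0)$ yields a total of $X = 3$ out of $k = 5$.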
Multilingualism: The knowledge and ability to use multiple languages to
communicate with others. Multilingualism does not mean equal fluency in
terms of proficiency levels in the languages involved, and it may be used
separately or in combination or conjunction with other languages, depend-
ing on the target audience and the specific context.
Multimodality: Using one or more language skills and other graphic/audio-
visual modes to communicate or address a language task.
scoring process.
Reliability: The consistency of assessment results or scoring methods for
standards-based assessment. The term reliability is often associated with
CTT.
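As an illustrative sketch, the standard CTT formulation models an observed score $X$ as the sum of a true score $T$ and measurement error $E$ (i.e., $X = T + E$), and defines reliability as the proportion of observed-score variance attributable to true scores:
$$\rho_{XX'} \;=\; \frac{\sigma_T^2}{\sigma_X^2} \;=\; \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2}.$$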
Selected-response techniques: Assessment techniques that provide students
with options or choices when responding to questions or tasks (e.g., true/
false and multiple-choice items).
Social dimensions: An aspect of language assessment that is inextricably
linked to the social context involved. For example, in a writing task, the clarity of the instructions and the expectations regarding social conventions of language use are social aspects (created by test-writers) that will influence students' performance. Another social dimension of
language assessment is its educational and political functions (e.g., to limit
access to study at a given educational institute, to provide credentials to students).
Test specifications: A design document from which an assessment or a test can be created for a given purpose and for particular students.
Testing: A predetermined procedure for collecting specific information about language learning, skills, and ability using test or assessment tools. See
Assessment.
Test-taking strategies: Students' or test-takers' know-how for responding to test questions and tasks, and their awareness of how to deal with specific test techniques. For example, they know to choose only one answer per question in a multiple-choice test and, similarly, to supply only one word per gap in a cloze test.
Test-wiseness strategies: Students’ or test-takers’ know-how or shortcuts to
answer a question correctly without much engagement in a given question
or task. For example, in a multiple-choice reading test, students can learn
to eliminate impossible answers or to guess an answer correctly without reading the text closely.
References
Lam, R. (2023). E-portfolios: What we know, what we don’t, and what we need
to know. RELC Journal, 54(1), 208–15. https://doi.org/10.1177/
0033688220974102.
Larsen-Freeman, D. (1989). Pedagogical descriptions of language: Grammar.
Annual Review of Applied Linguistics, 10, 187–95. https://doi.org/10.1017/S026719050000129X.
Leung, C. (2014). Classroom-based assessment: Issues for language teacher
education. In A. Kunnan (Ed.), Companion to language assessment (vol. III,
pp. 1510–19). Wiley-Blackwell.
Leung, C. (2022a). Action-oriented plurilingual mediation: A search for fluid
foundation. In D. Little & N. Figueras (Eds.), Reflecting on the Common
European Framework of Reference for Languages and its companion volume
(pp. 78–94). Multilingual Matters.
Leung, C. (2022b). Language proficiency: From description to prescription and
back. Educational Linguistics, 1(1), 56–81. https://doi.org/10.1515/eduling-
2021-0006.
Lewkowicz, J., & Leung, C. (2021). Classroom-based assessment –
Timeline. Language Teaching, 54(1), 47–57. https://doi.org/10.1017/S0261444820000506.
Luoma, S. (2004). Assessing speaking. Cambridge University Press.
Lynch, B. K. (2003). Language assessment and program evaluation. Edinburgh
University Press.
Lyster, R., & Ranta, L. (1997). Corrective feedback and learner uptake. Studies in Second Language Acquisition, 19(1), 37–66. https://doi.org/10.1017/S0272263197001034.
McMillan, J. H. (Ed.) (2013). Sage handbook of research on classroom assess-
ment. Sage.
McNamara, T. (1996). Measuring second language performance. Longman.
McNamara, T., & Roever, C. (2006). Language testing: The social dimension.
Blackwell.
Mirhosseini, S.-A., & De Costa, P. (Eds.) (2019). The sociopolitics of English
language testing. Bloomsbury.
Moder, C. L., & Halleck, G. B. (2022). Designing language tests for specific
purposes. In G. Fulcher & L. Harding (Eds.), Routledge handbook of lan-
guage testing (2nd ed., pp. 81–95). Routledge.
New South Wales (NSW) Department of Education and Training. (2004). ESL
steps: ESL curriculum framework K-6. New South Wales, Australia.
O’Sullivan, B. (2012). The assessment development process. In C. Coombe,
P. Davidson, B. O’Sullivan, & S. Stoynoff (Eds.), Cambridge guide to second
language assessment (pp. 47–58). Cambridge University Press.
S. May (Eds.), Language testing and assessment (3rd ed.) (pp. 375–84).
Springer.
Stobart, G. (2006). The validity of formative assessment. In J. Gardner (Ed.),
Assessment and learning (pp. 133–46). Sage.
Taylor, L. (2009). Developing assessment literacy. Annual Review of Applied
Linguistics, 29(1), 21–36. https://doi.org/10.1017/S0267190509090035.
Tsagari, D., & Banerjee J. (Eds.) (2017). Handbook of second language assess-
ment. De Gruyter Mouton.
Tsagari, D., & Cheng, L. (2017). Washback, impact, and consequences
revisited. In E. Shohamy, I. G. Or, & S. May (Eds.), Language testing and
assessment (3rd ed., pp. 359–72). Springer.
Turner, C. E., & Purpura, J. E. (2017). Learning-oriented assessment in second
and foreign language classrooms. In D. Tsagari & J. Banerjee (Eds.),
Handbook of second language assessment (pp. 255–73). De Gruyter Mouton.
Wall, D. (2012). Washback. In G. Fulcher & F. Davidson (Eds.), Routledge
handbook of language testing (pp. 79–92). Routledge.
Heath Rose
University of Oxford
Heath Rose is Professor of Applied Linguistics and Deputy Director (People) of the
Department of Education at the University of Oxford. At Oxford, he is the course director
of the MSc in Applied Linguistics for Language Teaching. Before moving into academia,
Heath worked as a language teacher in Australia and Japan in both school and university
contexts. He is author of numerous books, such as Introducing Global Englishes, The
Japanese Writing System, Data Collection Research Methods in Applied Linguistics, and
Global Englishes for Language Teaching. Heath’s research interests are firmly situated
within the field of second language teaching, and include work on Global Englishes,
teaching English as an international language, and English Medium Instruction.
Jim McKinley
University College London
Jim McKinley is a Professor of Applied Linguistics and TESOL at UCL Institute of Education,
where he serves as Academic Head of Learning and Teaching. His major research areas
are second language writing in global contexts, the internationalisation of higher educa-
tion, and the relationship between teaching and research. Jim has edited or authored
numerous books, including the Routledge Handbook of Research Methods in Applied
Linguistics, Data Collection Research Methods in Applied Linguistics, and Doing Research in
Applied Linguistics. He is also an editor of the journal System. Before moving into academia,
Jim taught in a range of diverse contexts including the US, Australia, Japan and Uganda.
Advisory Board
Brian Paltridge, University of Sydney
Gary Barkhuizen, University of Auckland
Marta Gonzalez-Lloret, University of Hawaii