Linguaskill
Building a Validity Argument for the Speaking Test
Validity report
Contents
Summary
Introduction
1. Test purpose
2. Target language use domain
3. Test description
4. Marking of speaking performance
5. Framework of test validation
6. Validity evidence for the Linguaskill Speaking test
References
Summary
The Linguaskill Speaking test is a computer-based oral English test that is enhanced by auto-marking technology. It takes a hybrid approach to marking which combines the strengths and benefits of artificial intelligence with those of the decision-making of experienced human markers. The test is appealing to institutional users who may need to assess a large number of English language learners in a short time frame. Individual learners may also find the browser-based speaking test highly accessible in that it can be taken at home on any Windows computer with a high-speed internet connection. The test results of Linguaskill are reported within 48 hours thanks to auto-marking technology.

This paper presents a validity argument for the Linguaskill Speaking test by weaving together a narrative about the research evidence that has been collected to support the intended interpretations and uses of the test scores. We begin by describing the purpose, the target language use (TLU) domain and the format of the test. Then we unveil the design of the auto-marker, the training programme of human examiners and the hybrid marking model used in the test. In what follows we delineate the structure of the Linguaskill validity argument and explain why each element in it is essential for arguing for test validity. Finally, we present research evidence related to three elements in the validity argument, namely Test Content, Marking of Responses, and Interpretation of Test Results. It is notable that test validation is a cumulative process. Further research is being carried out to gather additional evidence to support the validity argument.

The evidence presented in this paper supports the following validity claims about the test.

• Test Content: The speaking topics, which cover daily routines, dialogues at social activities, exchanges at workplaces and telecommunications, are overall a good representation of the communicative situations that the candidates will likely encounter in the real world.
• Test Content: The speaking topics are interesting, neither too easy nor too difficult.
• Test Content: Candidates generally feel comfortable with speaking to a computer.
• Marking of Responses: The reliability of human marking is satisfactory.
• Marking of Responses: With hybrid marking in place, the auto-marker achieved 95.6% exact agreement and 100% adjacent agreement on Common European Framework of Reference (CEFR) grades with human examiners.
• Marking of Responses: The auto-marker is capable of detecting suspicious non-English speech and escalating it to human examiners for verification.
• Interpretation of Test Results: Standard-setting exercises are conducted regularly to establish the link between the test and the CEFR. For this reason, the test results can be interpreted confidently based on the CEFR.
• Interpretation of Test Results: Confirmatory factor analysis suggests that a single, overarching speaking construct is assessed by the test.
Introduction
The Linguaskill Speaking test is a computer-based oral English test that is enhanced by auto-marking technology. In contrast to other automated speaking assessments, Linguaskill Speaking takes a hybrid approach to marking, which means its test responses are marked by a combination of human examiners and auto-marking technology. If the computer indicates low confidence in marking a response, the response is escalated to human marking. This hybrid model aims to address the challenges that fully automated assessment brings, by marrying the latest auto-marking technology with the decision-making of experienced human markers.

With the advancement of natural language processing, machine learning and speech recognition technologies, automated speaking assessment is growing in popularity. Compared with traditional face-to-face speaking exams, automated speaking assessment, which is delivered on a computer or mobile device, offers the benefits of much faster score reporting, simple test administration and on-demand testing.

Automated speaking assessment is particularly appealing to institutional users who may find large-scale administration of face-to-face speaking exams unfeasible. Individual learners may see increased accessibility in automated assessment because, with a remote proctoring solution in place, the speaking test can be taken even at home. However, automated assessment also brings about challenges and problems that are not typically associated with human marking, such as scepticism about construct coverage and susceptibility to candidate cheating (Chun 2006, Fan 2014, Xi 2010, Xu 2015).

This paper presents a validity argument for the Linguaskill Speaking test. A validity argument provides an overall evaluation of the intended interpretations and uses of test scores by conducting coherent analysis on various strands of research evidence either for or against the proposed interpretations and uses (Cronbach 1988, Kane 2013). We begin by describing the test specifications, including the intended test purpose, target language use domain and test format. Then, we introduce the hybrid marking model applied to the oral assessment, in which marking tasks are shared by an auto-marker and human examiners. In what follows, we present a clear validation framework which lays out critical validity considerations at different stages of a testing cycle. In the remainder of the paper, we present validity evidence that has been collected based on this framework.
1. Test purpose

The Linguaskill Speaking test assesses candidates' oral English proficiency for everyday communication. It can be taken on its own or in conjunction with the other Linguaskill modules of Reading and Listening, and Writing. Linguaskill aims to provide fast, reliable and clearly interpretable results based on the Common European Framework of Reference for Languages (CEFR), a widely recognised standard for describing the progression of language learning and acquisition (Council of Europe 2001, 2018), as well as more granular scores based on the Cambridge English Scale. Intended uses of Linguaskill include a) measuring a candidate's level of English for placement, progression, or graduation at education institutions and b) measuring a candidate's level of English for job or development opportunities at companies. The target candidates of Linguaskill are English language learners over the age of 16 years.

2. Target language use domain

A target language use (TLU) domain is a hypothetical description of the situations or contexts in which candidates need to be able to use the language outside the test. By delineating the scope of this domain and identifying the key characteristics of language use in it, the test developer is able to design tasks that mimic these language use activities. Then, candidates' test-taking behaviours can be viewed as a sample of their predicted language performance in the TLU domain.

The critical TLU tasks selected for Linguaskill generally fall into four categories. They are daily routines (e.g., discussing leisure-time habits, giving preferences), dialogues at social activities (e.g., describing a situation or issue, recounting news or personal experience), exchanges at workplaces (e.g., raising a problem or issue, reporting on data), and telecommunications (e.g., requesting information and leaving a telephone message).

3. Test description

The Linguaskill Speaking test is browser-based so candidates can sit the test on any Windows computer¹ with a high-speed internet connection in an invigilated setting. The test is remotely proctored if a candidate chooses to take it at home. Questions are presented to the candidate through the computer screen and headphones, and their responses are recorded and remotely assessed by either computer algorithms or examiners (see Section 4). The test is multi-level, meaning that it is designed to elicit and assess oral performance of multiple proficiency levels based on the CEFR, including below A1, A1, A2, B1, B2, and C1 and above. The test results are reported within 48 hours.

The Linguaskill Speaking test has five parts: Interview, Reading Aloud, Presentation, Presentation with Visual Information, and Communication Activity. All parts are weighted equally and focus on different aspects of speaking ability. The format, testing aim and evaluation criteria of the five parts are presented below and summarised in Table 1.
Table 1
Part 1, Interview: The candidate answers eight questions about themselves. Length of response(s): 4 x 10 secs and 4 x 20 secs. Preparation time: none. Marks: 20%.
Part 2, Reading Aloud: The candidate reads aloud eight sentences. Length of response(s): 8 x 10 secs. Preparation time: none. Marks: 20%.

¹ At this point, sitting Linguaskill on a Mac computer is not supported. It is suggested that candidates use Google Chrome or Mozilla Firefox to take the test on a PC.
3.1 Interview

b. Testing aim
As well as introducing the candidate to the computer format of the test, the focus of this test part is to assess the candidate's ability to answer personal questions and to give lower-proficiency candidates a more accessible and achievable task.

c. Evaluation criteria
Candidates are assessed on their linguistic output in terms of pronunciation and fluency, and language resource.

3.2 Reading Aloud

a. Format
In the Reading Aloud task, the candidate is required to read aloud eight sentences. They have 10 seconds to read aloud each sentence. Sentences are of the kind that candidates may have to read aloud in real-world situations and are presented in increasing level of difficulty, covering a wide range of phonological features and syntactic structures.

b. Testing aim
The focus of this test part is to assess the candidate's ability to transform the written form of the language into speech and to handle elements of pronunciation at sentence level.

c. Evaluation criteria
Candidates are assessed based on phonological criteria, including their overall intelligibility, their ability to produce individual sounds, as well as their stress, rhythm and intonation.

3.3 Presentation

a. Format
In the Presentation task, the candidate is required to speak for 1 minute on a given topic. There is no choice of topic. A preparation time of 40 seconds is given before a candidate records their response.

b. Testing aim
The focus of this test part is to assess the candidate's ability to deliver a long turn. As well as a description of a situation or issue, the candidate is encouraged to state and/or justify an opinion through bulleted prompts.

c. Evaluation criteria
Candidates are assessed on their linguistic output in terms of pronunciation and fluency, language resource and discourse management. Additionally, the marking takes into account the candidate's ability to complete the task appropriately in accordance with the rubric and instructions.

3.4 Presentation with Visual Information

a. Format
In the Presentation with Visual Information task, the candidate is required to talk for 1 minute about information presented to them in visual form. A preparation time of 1 minute is given before a candidate records their response. The candidate is asked to present the information within a specific context, such as leaving a voicemail for a friend or giving a presentation in class.

b. Testing aim
The focus of this test part is to assess the candidate's ability to deliver a long turn which involves the interpretation of very simple visual information and providing a recommendation, explanation or suggestion.

c. Evaluation criteria
Candidates are assessed on their linguistic output in terms of pronunciation and fluency, language resource and discourse management. Additionally, the marking takes into account the candidate's ability to complete the task appropriately in accordance with the rubric and instructions.

3.5 Communication Activity

a. Format
In the Communication Activity task, the candidate is required to answer five questions related to a scenario. A preparation time of 40 seconds is given before the candidate hears the first question. Each question has a 20-second response window and asks the candidate to provide an opinion, speculate on a hypothesis, or make an evaluation.

b. Testing aim
The focus of this test part is to assess the candidate's ability to express opinions and ideas on a given topic in response to an aural prompt. It is an opportunity for higher-level candidates to demonstrate their higher-level skills.

c. Evaluation criteria
Candidates are assessed on their linguistic output in terms of pronunciation and fluency, language resource and discourse management. Additionally, the marking takes into account the candidate's ability to complete the task appropriately in accordance with the rubric and instructions.

4. Marking of speaking performance

The Linguaskill Speaking test adopts a hybrid or human-in-the-loop marking model in which an auto-marker is used in live assessment, but with the involvement of human examiners. This section discusses the design of the auto-marker, examiner training and certification, and how hybrid marking is applied.

4.1 Auto-marker

An auto-marker is a set of computer algorithms designed to mark constructed test responses such as extended speaking and writing. Cambridge Assessment English (henceforth Cambridge English), in collaboration with research groups from the University of Cambridge, started to develop automated marking of spontaneous non-native English speech in 2012. The auto-marker used in the Linguaskill Speaking test is called the Custom Automated Speech Engine (CASE), which was developed by Enhanced Speech Technology Ltd building upon technology transferred from the Institute for Automated Language Teaching and Assessment (ALTA), an interdisciplinary research centre of the University of Cambridge, using machine learning technologies.
Figure 1. The architecture of the auto-marker (Knill et al 2018): a speech recogniser converts the audio into text, a feature extraction module derives features from the audio and the text, and a grader produces the grade.
CASE, as shown in Figure 1, consists of three major components: a speech recogniser, a feature extraction module, and a grader (Knill et al 2018, Wang et al 2018). The speech recogniser conducts Automated Speech Recognition (ASR), converting the audio signal of speech into a structured representation of the underlying word transcription. It was trained based on deep neural network models using learner speech supplied by Cambridge English and combined with crowd-sourced transcriptions (see the ASR2 system in Lu et al 2019). Feature extraction is about deriving features relevant to the speaking construct (e.g., fluency, pronunciation accuracy, vocabulary diversity) from both the audio signal and the structured word transcription as the basis for grading. Based on these features, the grader uses state-of-the-art machine learning models to return a distribution over scores from which feedback to the candidate, such as the CEFR grade, is derived. The training sample for the grader includes a large set of Linguaskill Speaking test responses produced by learners of various first languages and all CEFR levels, as well as the marks awarded to these responses by examiners. In addition, the machine learning grader models used in CASE have been selected to provide an uncertainty measure based on the similarity between the input and the training data (van Dalen, Knill and Gales 2015, Malinin, Ragni, Knill and Gales 2017). This uncertainty measure is a meaningful indicator of the reliability of the auto-marker score and is useful for identifying test responses that require human marking.

4.2 Examiners

All examiners for the Linguaskill Speaking test undergo a rigorous training programme in order to qualify (Figure 2). Prospective examiners must meet the minimum professional requirements, which include being educated to first degree level or equivalent, holding a recognised language teaching qualification, providing proof of substantial and relevant teaching experience within the last two years, and having suitable English language competency.

Approved applicants are provided with training materials through an online portal, which includes extensive documentation about the marking procedure and sample speaking responses with marks, along with detailed comments about performance and marking rationales. Applicants are guided through the material by the documentation.

The certification test must be taken successfully within 30 days of access. Certification tests include a selection of speaking items previously marked by a pool of experienced and reliable examiners, with a statistically adjusted average score² as the final approved mark. Applicants must award a mark within 0.5 of the approved mark on at least 80% of the items. Two attempts are provided, with different versions of the test. Examiners who have failed both attempts have their access to the portal automatically revoked.

Once applicants have successfully passed the certification test, the system identifies them as certificated; they are added to the marking pool and can start marking candidates. Once examiners start providing marks, they are continuously statistically monitored. Marking behaviour analysis is carried out to check consistency and to identify possible bias and non-compliant behaviour. Examiners who are flagged up statistically are investigated and removed from the marking pool if their behaviour is confirmed as unsatisfactory. Re-certification occurs every two years with new training and test material.

4.3 Hybrid marking

Hybrid marking aims to combine the strengths and benefits of artificial intelligence (AI) with those of human examiners. Computer marking is speedy and cost-effective but is only reliable when the responses being marked are close to the training sample of the AI system. Some impediments to auto-marker accuracy include poor audio quality, aberrant speaking behaviours and training sample underrepresentation. Poor audio quality is likely to significantly reduce ASR accuracy and affect the auto-marking performance. Learners may also be tempted to apply strategies to trick the marking system into giving them higher marks (Xi, Schmidgall and Wang 2016). Thus, it can be argued that human examiners play a key role as gatekeepers in preventing less reliable auto-marker scores being released to candidates.

The hybrid marking model is about using human examiner expertise to support and further develop the auto-marker. It is also based on the assumption that the computer can provide information to indicate its confidence in score prediction. When this confidence is low, the test response is flagged up and escalated to human examiners.
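The confidence-based escalation described above can be illustrated with a short sketch. This is a minimal illustration only, not the production CASE code: the names `grade_response` and `UNCERTAINTY_THRESHOLD` are hypothetical, and the real grader derives its uncertainty measure from the machine learning model itself rather than from the simple spread of samples used here.

```python
from dataclasses import dataclass
from statistics import mean, stdev

# Hypothetical threshold: responses whose score uncertainty exceeds it
# are escalated to a human examiner rather than auto-marked.
UNCERTAINTY_THRESHOLD = 0.5

@dataclass
class MarkingDecision:
    score: float        # predicted raw score on the 0-6 scale
    uncertainty: float  # spread of the grader's score distribution
    needs_human: bool   # True if the response is escalated

def grade_response(score_samples: list[float]) -> MarkingDecision:
    """Turn a grader's score distribution into a marking decision.

    `score_samples` stands in for the distribution over scores returned
    by the grader; its mean is treated as the predicted score and its
    standard deviation as a simple uncertainty proxy.
    """
    predicted = mean(score_samples)
    uncertainty = stdev(score_samples)
    return MarkingDecision(
        score=predicted,
        uncertainty=uncertainty,
        needs_human=uncertainty > UNCERTAINTY_THRESHOLD,
    )

# A confident prediction is auto-marked; a diffuse one goes to an examiner.
print(grade_response([4.1, 4.0, 4.2, 4.1]))   # low spread  -> auto-marked
print(grade_response([2.0, 4.5, 3.1, 5.2]))   # high spread -> human examiner
```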
² We use fair average scores, which are average scores adjusted for marker severity by multi-faceted Rasch analysis (Linacre 1989).
Figure 2. The procedure for certificating Linguaskill speaking examiners: nomination and a check of the minimum professional requirements, followed by training and the certification test; successful applicants become examiners and are then monitored in years when they are not re-certificated and re-certificated in years when they are not monitored, while unsuccessful applicants can re-apply to become a Linguaskill online examiner after 12 months.
In the Cambridge English hybrid marking model (Figure 3), escalation to human marking is determined by setting thresholds on three features generated by the auto-marker in addition to the predicted score. They are the Assessment Quality score, the Language Quality score and the Audio Quality score. The Assessment Quality score is an uncertainty measure produced by the grader which suggests the amount of confidence the grader has in its score prediction (see Section 4.1). The Language Quality score is an ASR confidence score returned by the speech recogniser. It represents the system's confidence in the accuracy of its transcription, which in turn can be a useful proxy for identifying candidates who are not actually speaking English during the test (see Lu et al 2019). The Audio Quality score indicates the clarity of the voice recording and is derived from three separate measures: dynamic ratio (differences in amplitude between loud and quiet parts of the audio), clipping (frames of audio that reach the maximum/minimum possible values and hence are distorted) and noise. It also incorporates a variety of other ASR errors linked to audio quality or the intermediate processes during the speech-to-text conversion. In addition, test responses with auto-marker scores falling below or above certain cut-off values are flagged for examiner marking. This is informed by our auto-marker evaluation suggesting that the auto-marker score tends to be less reliable on the lower and higher ends of the scoring scale. In the current hybrid marking model, a large proportion of test responses is marked by human examiners to ensure the quality of marking and provide marking data to further train the auto-marker. The proportion of human marking will gradually decrease with the enhancement of the auto-marker. The evaluation of the auto-marker and the hybrid marking model will be further discussed in Section 6.2.

5. Framework of test validation

Validity is the most fundamental issue of assessment. Validity is the degree to which evidence and theory support the interpretations of test results for intended test uses (AERA, APA and NCME 2014). Two common frameworks used for language test validation are the argument-based framework (Bachman and Palmer 2010, Kane 2013) and the socio-cognitive framework (Weir 2005). The former focuses on decomposing Messick's (1989) complex validity theory by structuring validity enquiry around practical arguments. The latter applies Messick's (1989) validity theory to language assessment and, similar to Messick, takes a cumulative approach to evidence collection. The Linguaskill validity argument, as shown in Figure 4, is constructed by integrating the two, in order to make validity claims and evaluate supporting evidence.
Figure 3. The Cambridge English hybrid marking model: when the AI auto-marker is confident, it assigns the level; when it is not confident, the response goes to a human examiner, and the examiner marks provide new data for further training of the auto-marker.
The Linguaskill validity argument consists of six parts: Test Content, Response Processes, Marking of Responses, Interpretation of Test Results, Test Use and Test Impact. They represent a sequence of activities that a typical testing cycle comprises, i.e., from test construction to the impact of testing on stakeholders. This section explains what these notions mean and how they can help guide the validation research for Linguaskill.

5.1 Test Content

5.2 Response Processes

Validity evidence for response processes includes coherence between test-taking behaviours/strategies and construct theories of language abilities, appropriate task delivery and administration, clear test instruction, and accommodation of candidates with special needs. Such evidence helps rule out an alternative interpretation of the test scores that factors other than targeted language ability had an effect on candidates' test performance (AERA, APA and NCME 2014).
Figure 4. The outline of the Linguaskill validity argument. Marking of Responses:
• Appropriate use of mark schemes in line with knowledge of targeted language abilities
• Psychometric evidence about score reliability and generalisability
• Degree of overlap between computer and human scoring
Test Impact:
• Positive impact of testing on test preparation and language education inside and outside the classroom
• Positive impact of testing on society
5.4 Interpretation of Test Results

Test scores are simply numbers, so meanings have to be assigned to them to make them useful for various purposes. This is at the heart of score interpretation. In most cases, the test developer provides test users with the suggested interpretations, in the form of Can Do statements of abilities associated with the scores, but this interpretation has to be backed up by theories of cognitive processes, language development, or second language acquisition (Weir 2005). Validity evidence for score interpretation can be obtained from standard-setting exercises that aim to align test scores to a theory-driven and/or research-based standard for describing language proficiency, such as the CEFR (Council of Europe 2001, 2018). Additionally, this evidence may be collected from concurrent studies that examine the relationship between test results and other measures of targeted language abilities. This is traditionally called concurrent validity (APA, AERA and NCME 1974). Validity evidence for score interpretation may also be collected from latent factor analysis which investigates the underlying factor structure of the test (e.g., Sawaki, Stricker and Oranje 2009). This piece of evidence is particularly relevant to integrated language assessment in which two or more language skills (e.g., listening and speaking) are assessed at the same time.

5.5 Test Use

Based on the test results, stakeholders, such as candidates, teachers, employers and admission officers, will likely take actions. For example, a candidate may decide to put in more effort to improve a particular skill; a teacher may tweak his or her lesson plans to meet students' learning needs; an employer may select a team among high-scoring candidates to expand overseas markets; a school admission officer may make acceptance decisions on applicants. Validity evidence related to test use is about the extent to which test results help stakeholders make informed decisions or take the right actions. This evidence can be sought in the following two areas. First, suggested test use and meanings of test scores should be well understood by test users to avoid unintended score interpretations and uses. Second, test results should be useful for predicting future behaviours of interest, such as job performance and academic performance, that are related to language use. This use of validity in its predictive sense was called predictive validity and had been predominant before construct validity came into being in the 1950s (APA, AERA and NCME 1974).
5.6 Test Impact

The use of test scores will exert an impact on stakeholders in a range of teaching, learning and social contexts. For example, the way a high-stakes language test is designed is likely to influence how learners learn a language, how teachers teach a language, and even social values regarding language proficiency and fairness. This impact is also called consequences, washback, or consequential validity (Cheng 2014, Messick 1996, Weir 2005) and is an integral aspect of the concept of validity. Test impact is closely related to how a test is used. If Linguaskill were misused for unintended purposes, the test impact would probably be negative. The responsibility to ensure positive test impact is shared by both the test provider and test users.

6. Validity evidence for the Linguaskill Speaking test

This section presents the research evidence to support the use of the Linguaskill Speaking test for its intended purposes. Test validation is a cumulative and ongoing process (Messick 1989). Validity evidence is collected and refined over time as more data is collected, and as the Linguaskill test is relatively new, not all aspects of the validity argument have yet been fully documented. Extensive evidence has been obtained for Test Content, Marking of Responses and Interpretation of Test Results. Further research is being carried out to gather additional evidence for Response Processes, Test Use and Test Impact in specific countries and regions where the test is used.

6.1 Validity evidence for Test Content

Expert judgment is an essential element in attesting the relevance of the test tasks to the TLU domain (Messick 1989, p. 39). For Linguaskill, the judgment on content relevance and construct coverage (i.e., the skills assessed by the test) is made in test review meetings at the test development phase by a group of experts consisting of item writers, senior examiners and language testing researchers (Figure 5). The review of speaking items focuses on item difficulty, clarity of the prompt and instruction, authenticity of topics and situations, background knowledge needed to give a response, and the skills being assessed. Test items that fail the review are either discarded or revised before being reviewed once again. Those which pass the expert review are trialled with a large group of learners selected from the target test population before being included in a live test. Any problematic test items found by the trial are returned to the development phase.

Xu and Gallacher (2017) conducted a survey of 3,601 adult English language learners from 23 countries in a global trial of the Linguaskill Speaking test. The majority of the participants reported that the speaking tasks were similar to how they used English in the real world (65.3%) and that the speaking topics were closely related to their life (57.9%). In addition, approximately 70% of the participants agreed or strongly agreed that the Linguaskill Speaking test allowed them to demonstrate their English-speaking ability.

Analysis of the qualitative feedback received in the survey suggested that the speaking topics were interesting, neither too easy nor too difficult, and related to everyday life or work. For example, one participant related test questions to his daily language use situations:

'I find the topics relevant, not too easy nor difficult. I think that these topics are related to what normally happens in daily life. These are topics that most people learning English should master because they are what takes place in the real world.' (Participant ID 1742853)

Many participants reported the speaking test was not as stressful as they had anticipated. As one participant put it:

'I always feel worried in exams, but as I hear the questions, I felt more comfortable and relax. So they were easy for me.' (Participant ID 1721614)

Participants had mixed feelings about speaking to a computer. Some felt that talking to a computer was 'just like talking to a real person' (Participant ID 1750497) or even less stressful than talking to a speaking examiner (Participant ID 1741200). Others indicated that they were very used to interacting with a digital device since 'interaction through [a] mobile phone is quite popular nowadays' (Participant ID 1756618). A small proportion (19.3%) of participants still preferred to speak to a human interlocutor as they had expected exchanges of information (Participant ID 1646082) and seeing a human face (Participant ID 1714910) in real-life oral communication.

In short, the quality-assurance process underpinning Linguaskill content production and the findings from this large-scale trial study suggest that the content of the Linguaskill Speaking test is overall a good representation of the speaking tasks that candidates will likely encounter in the real world. The ability to interact with an interlocutor (Brown 2003, Galaczi and Taylor 2018), which is typically assessed in a face-to-face interview, is not represented in the Linguaskill Speaking test. It is therefore not possible to interpret the Linguaskill Speaking test score as a direct measure of interactional competence. Nevertheless, monologic speaking performance may to some extent predict interactional speaking performance (Bernstein, Van Moere and Cheng 2010), and it can be argued that the Linguaskill Speaking test is designed to cover a variety of communicative speaking functions.
Figure 5. The test development cycle adopted by Linguaskill (Cambridge English 2016), covering the planning phase, the design phase (initial specifications, revision), the operational phase (live test) and the monitoring phase (evaluation, review).
6.2 Validity evidence for Marking of Responses

Reliable marking of test responses serves as the basis for accurate estimation of candidates' targeted language abilities. A caveat associated with automated speaking assessment relates to the reliability of marking open-ended speech (Xu 2015). Before discussing the reliability of CASE, the speech auto-marker used in Linguaskill, we first report the reliability of examiner marking. This is because a) a large proportion of the Linguaskill Speaking tests are still marked by examiners and b) the auto-marker was trained on human-marked spoken data and thus cannot outperform the best examiners in the training sample.

6.2.1 Reliability of examiner marking

Xu and Gallacher (2017) conducted a study to investigate the reliability of human marking in the Linguaskill Speaking test. Five Linguaskill speaking examiners randomly selected from a larger pool were asked to mark a common dataset consisting of test responses produced by 60 candidates of various proficiency levels. In other words, each part of the test was marked by the same five examiners. Reliability of human marking on each test part and the whole test (which was the average of each part) was estimated using intraclass correlation coefficients (ICC). This coefficient indicates the degree to which a single mark on a response represents the other marks on the same response (Shrout and Fleiss 1979). In general, an ICC value between 0.75 and 0.90 is considered good reliability and a value above 0.90 indicates excellent reliability (Cicchetti 1994).
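For readers who wish to see how such a coefficient is obtained, the sketch below is a simplified illustration, assuming a fully crossed design in which every examiner marks every candidate. It computes one common Shrout and Fleiss form (a two-way, single-rater, absolute-agreement ICC) from ANOVA mean squares; it is not the exact routine used in the studies cited here, and the data are invented.

```python
import numpy as np

def single_rater_icc(scores: np.ndarray) -> float:
    """Two-way random-effects, absolute-agreement ICC for a single rater,
    often labelled ICC(2,1) after Shrout and Fleiss (1979).

    `scores` is an (n_candidates x k_examiners) matrix in which every
    examiner has marked every candidate.
    """
    n, k = scores.shape
    grand_mean = scores.mean()
    row_means = scores.mean(axis=1)   # per-candidate means
    col_means = scores.mean(axis=0)   # per-examiner means

    # ANOVA mean squares.
    ss_rows = k * np.sum((row_means - grand_mean) ** 2)
    ss_cols = n * np.sum((col_means - grand_mean) ** 2)
    ss_total = np.sum((scores - grand_mean) ** 2)
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )

# Illustrative data only: 60 candidates marked by the same 5 examiners.
rng = np.random.default_rng(0)
ability = rng.uniform(0, 6, size=(60, 1))                  # candidate level
marks = np.clip(ability + rng.normal(0, 0.5, (60, 5)), 0, 6)
print(round(single_rater_icc(marks), 2))
```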
Figure 6. Histograms of examiner and auto-marker raw scores (0–6 scale) without hybrid marking.
The ICC values of each test part as well as the whole test are presented in Table 2. It can be seen that the reliability of single human marking varies from 0.84 to 0.91 in the five test parts and is 0.91 for the whole test, thus indicating adequate reliability of human marking at task level and excellent reliability at the test level. Brenchley (2020) re-examined inter-rater reliability using a larger dataset of 204 Linguaskill Speaking tests marked independently by three examiners. The study reported a single-marker ICC of 0.90 for the whole test.

Table 2. Intraclass correlation coefficients of single examiner marking (Xu and Gallacher 2017)
Part 1: 0.84; Part 2: 0.87; Part 3: 0.90; Part 4: 0.88; Part 5: 0.91; Whole test: 0.91

6.2.2 Reliability of the auto-marker on its own

Auto-marker evaluation is often performed by computing the correlation or agreement between computer marking and human marking (e.g., Bernstein et al 2010, Wang et al 2018). Jones, Brenchley and Benjamin (2020) conducted an evaluation study on the current version of the auto-marker, focusing first on the performance of the auto-marker on its own – that is, not embedded in a hybrid marking system (see the next section). The evaluation was based on a dataset of 9,286 Linguaskill Speaking tests reflecting live candidature. The distribution of human CEFR grades in the dataset was approximately 1% Below A1, 5% A1, 13% A2, 36% B1, 35% B2 and 10% C1 or above. The dataset contained speakers of a large number of native languages, of which Spanish (29%), Arabic (26%) and Portuguese (15%) were the most frequent.

The study found that when the same CEFR cut-off values are applied to auto-marker and human raw scores, the auto-marker awarded the same CEFR grade as the examiners in 56.8% of the tests. In 96.6% of the tests, the difference between computer marking and human marking was equal to or smaller than one CEFR level. In 3.4% of the tests, the auto-marker and human examiners differed by more than one CEFR level (Table 3). In operational testing, these inaccurate auto-marker scores are overridden by examiner scores, which will be discussed in the next section.

Table 3. Percentage agreement between auto-marker and human CEFR grades (n = 9,286)
Exact agreement (no difference): 56.8%
Adjacent agreement (difference <= 1 CEFR level): 96.6%
Mismarking (difference > 1 CEFR level): 3.4%

The research also found that although the distributions of the auto-marker and human raw scores largely overlapped (Figure 6), the auto-marker was comparatively harsher at the higher end of the scoring scale and comparatively lenient at the lower end (Figure 7). Again, these inaccuracies are addressed through the use of hybrid marking, as well as continued training and improvement of the auto-marker. The root mean square error (RMSE) of the auto-marker raw score was 0.64, about half a CEFR band. RMSE is the standard deviation of the residuals (auto-marker prediction errors) and an indicator of how concentrated the data points are around the diagonal regression line where exact human–machine agreement is achieved (Figure 7).

In addition to auto-marker agreement with examiners, the research also evaluated the usefulness of the Language Quality score, a confidence measure of speech recognition (see Section 4.3), for identifying non-English speech or gibberish in the test responses. Based on a subset of data (n = 284) which included aberrant speaking behaviours, it was found that normal English speech resulted in significantly higher Language Quality scores than non-English speech (see Figure 8).
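As a worked illustration of the statistics reported in this subsection, the sketch below computes exact agreement, adjacent agreement and RMSE from paired grades and raw scores. The data and the CEFR coding used here are invented for illustration; the actual figures above come from the 9,286-test evaluation dataset.

```python
import numpy as np

# CEFR grades coded as ordered integers so differences can be compared.
CEFR_LEVELS = {"Below A1": 0, "A1": 1, "A2": 2, "B1": 3, "B2": 4, "C1+": 5}

def agreement_rates(human_grades, machine_grades):
    """Percentage of tests with exact and adjacent (<= 1 level) agreement."""
    h = np.array([CEFR_LEVELS[g] for g in human_grades])
    m = np.array([CEFR_LEVELS[g] for g in machine_grades])
    diff = np.abs(h - m)
    return np.mean(diff == 0) * 100, np.mean(diff <= 1) * 100

def rmse(human_raw, machine_raw):
    """Root mean square error of auto-marker raw scores against examiner scores."""
    residuals = np.asarray(machine_raw) - np.asarray(human_raw)
    return float(np.sqrt(np.mean(residuals ** 2)))

# Invented example data.
human = ["B1", "B2", "A2", "B1", "C1+"]
machine = ["B1", "B1", "A2", "B2", "B2"]
print(agreement_rates(human, machine))          # (40.0, 100.0)
print(rmse([3.2, 4.5, 2.1], [3.0, 4.4, 2.6]))   # about 0.32
```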
Figure 7. A scatter plot of examiner vs. auto-marker raw scores (0–6 scale) without hybrid marking.
By applying a cut-off value to this score, the 19 non-English-speaking responses in the dataset were all successfully identified. The research suggests that the Language Quality score is sensitive to non-English speech and helpful for recognising candidates with an intent to game the auto-marker.

6.2.3 Reliability of hybrid marking

Hybrid marking, as mentioned in Section 4.3, is about escalating to human examiners any responses which the auto-marker may have mismarked, defined as cases where the auto-marker and human scores are likely to be further than one CEFR level apart on the scoring scale. In the Linguaskill hybrid marking model, rules are applied to a number of features generated by the auto-marker, including Assessment Quality, Language Quality, Audio Quality, auto-marker score (lower bound) and auto-marker score (higher bound). Each rule is an inequality statement, such as 'the Language Quality score is below 0.9'. If a response satisfies any of the rules, it is passed to a human examiner. The thresholds that are used in the rules (e.g. 0.9) were determined by a process of optimisation.

This constrained optimisation was done using exhaustive search, also known as brute-force search. For each variable a set of possible thresholds was created. For example, the Language Quality score ranges from 0 to 1, so the set of thresholds might be 0, 0.01, 0.02, ..., 1. The recall statistic was calculated for every possible combination of the five thresholds based on the dataset of 9,286 Linguaskill Speaking tests, and the combination yielding the highest recall was selected (Jones et al 2020). The recall statistic, which is reported as a percentage, indicates the completeness of flagging. For example, a recall value of 0.90 means that of all the test responses mismarked by the auto-marker, 90% of them are successfully flagged by the application of the rules.
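The rule set and the brute-force threshold search described above can be sketched as follows. This is an illustration under assumptions rather than the operational code: the field names, candidate threshold grids and labelled dataset are invented, and the real optimisation used much finer grids over the full evaluation set.

```python
from itertools import product

def flag_for_human(resp, t_assess, t_lang, t_audio, t_low, t_high):
    """Return True if any escalation rule fires for this response."""
    return (resp["assessment_quality"] < t_assess
            or resp["language_quality"] < t_lang
            or resp["audio_quality"] < t_audio
            or resp["score"] < t_low
            or resp["score"] > t_high)

def recall_and_precision(responses, thresholds):
    """Recall/precision of the rules for catching mismarked responses.

    Each response dict carries a `mismarked` label: True when the
    auto-marker and human scores were more than one CEFR level apart.
    """
    flagged = [flag_for_human(r, *thresholds) for r in responses]
    tp = sum(f and r["mismarked"] for f, r in zip(flagged, responses))
    mismarked_total = sum(r["mismarked"] for r in responses)
    recall = tp / mismarked_total if mismarked_total else 0.0
    precision = tp / sum(flagged) if any(flagged) else 0.0
    return recall, precision

# Coarse, invented candidate grids for the five thresholds.
grids = [
    [0.3, 0.5, 0.7],   # Assessment Quality
    [0.7, 0.8, 0.9],   # Language Quality
    [0.4, 0.6, 0.8],   # Audio Quality
    [0.5, 1.0, 1.5],   # auto-marker score lower bound
    [4.5, 5.0, 5.5],   # auto-marker score upper bound
]

def best_thresholds(responses):
    """Exhaustive (brute-force) search for the threshold combination
    with the highest recall on a labelled evaluation set."""
    return max(product(*grids),
               key=lambda ts: recall_and_precision(responses, ts)[0])
```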
Figure 8. Language Quality scores for English-speaking tests and non-English-speaking tests.

Figure 9. Single Factor model (Xu and Seed 2017).
A statistic that is often reported along with recall is precision, which is an indicator of the accuracy of flagging. For example, a precision of 0.90 means that of all the flagged test responses, 90% of them were indeed mismarked by the auto-marker. There is always a trade-off between precision and recall – a high recall value will lead to a low precision value and vice versa. In designing the Linguaskill piping rules, we pursued a high recall value in order to prevent unreliable auto-marker scores being released to the candidates.

Given our emphasis on high reliability of marking, we initially opted for threshold values that would result in a recall of 0.96, at the cost of escalating a large proportion of test responses to human examiners. The high recall, in turn, led to a small auto-marker RMSE of 0.16 and excellent human–machine agreement: 95.6% exact agreement and 100% adjacent agreement on CEFR grades. We are, however, continually improving the auto-marker and evaluating the threshold values to decrease the proportion of test responses that are examiner-marked.

6.3 Validity evidence for Interpretation of Test Results

The interpretation of language test results must be supported by construct theories about targeted language abilities. On the one hand, construct theories about language are chosen by test developers to inform test design, assign meanings to the test scores and account for the variance in test scores. On the other hand, test validation is also a process of theory validation in that the observed test data may either confirm or refute the chosen theories for score interpretation (Cronbach and Meehl 1955). A construct theory may be a set of language proficiency descriptors, as in the CEFR, which detail the course of language development. Alternatively, it can be a speculation on the composition of a language ability. The validity evidence supporting the proposed score interpretation of the Linguaskill Speaking test has been collected via standard setting and factor analysis. The former links the performance on the test to a theory about speaking proficiency progression whereas the latter examines the underlying structure of the speaking construct targeted by the test.

6.3.1 Standard setting

As the Linguaskill Speaking test reports CEFR-based test results, standard-setting exercises were performed periodically to align its test results to the CEFR framework. This alignment allows test users to interpret the test results in a wider context by referring to the language proficiency descriptors provided by the CEFR.

Standard setting refers to the process of establishing one or more cut scores on examinations (Cizek and Bunch 2007, p. 13). In the case of Linguaskill, cut scores are used to divide candidates into six proficiency groups in line with the CEFR proficiency levels: Below A1, A1, A2, B1, B2 and C1 or above. The most recent standard-setting exercise on the Linguaskill Speaking test was conducted by Lopes and Cheung (2020), who followed a modified Bookmark method recommended by a manual for relating language tests to the CEFR (Council of Europe 2009).
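To make the role of cut scores concrete, the short sketch below shows how raw scores can be mapped onto the six reporting groups. The cut values used here are placeholders for illustration only; the operational Linguaskill cut scores are those established through the standard-setting exercises and are not reproduced here.

```python
import bisect

# Hypothetical cut scores on a 0-6 raw scale (illustrative values only).
CUT_SCORES = [1.0, 2.0, 3.0, 4.0, 5.0]
CEFR_GROUPS = ["Below A1", "A1", "A2", "B1", "B2", "C1 or above"]

def cefr_group(raw_score: float) -> str:
    """Assign a raw score to one of the six CEFR-based proficiency groups.

    bisect_right places a score exactly at a cut into the higher group,
    i.e. the cut score is the minimum score needed for that group.
    """
    return CEFR_GROUPS[bisect.bisect_right(CUT_SCORES, raw_score)]

print(cefr_group(0.4))   # Below A1
print(cefr_group(3.7))   # B1
print(cefr_group(5.6))   # C1 or above
```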
6.3.2 Factor structure

In addition to standard setting, factor analysis was performed to examine the underlying structure of the Linguaskill Speaking test. It was hypothesised that the abilities assessed in the five test parts were unidimensional, meaning that a single, overarching speaking construct was assessed by the test. However, it appeared that Reading Aloud, the second part of the test, might assess a slightly different construct from the other four spontaneous speaking tasks.

To test the above hypothesis, Xu and Seed (2017) conducted an item-level confirmatory factor analysis on 3,250 speaking tests solely marked by examiners. The study found that a Single Factor model (Figure 9) fit the data well, resulting in a Comparative Fit Index (CFI) value of 0.99, a Non-Normed Fit Index (NNFI) value of 0.98, and a Root Mean Square Error of Approximation (RMSEA) value of 0.08. Generally, a CFI or NNFI value of 0.90 or above, or an RMSEA value of 0.08 or below, indicates an adequate model fit (Sawaki et al 2009). The finding suggests that a single speaking construct was able to account for test performances in all the five parts, thus supporting the practice of averaging the five parts to produce an overall test score. It was, however, also noted that the residual (error) term associated with Part 2, Reading Aloud, was relatively larger than those associated with the other parts. The researchers regarded this as a piece of evidence for distinguishing between reading aloud and spontaneous speaking in speaking assessment, and cautioned against using constrained speaking tasks alone to measure communicative speaking ability.
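A single-factor model of this kind can be specified in a few lines. The sketch below assumes the third-party semopy package and a data frame with one column of scores per test part; it illustrates the modelling approach rather than reproducing the analysis code used by Xu and Seed (2017), the file name and column names are hypothetical, and the exact set of fit-statistic columns may vary across semopy versions.

```python
import pandas as pd
import semopy

# One row per test, one column per test part (assumed file and columns).
# In the cited study the dataset held 3,250 examiner-marked tests.
df = pd.read_csv("speaking_part_scores.csv")   # columns: part1 ... part5

# Single-factor model: one latent speaking construct loads on all five parts.
model_desc = "speaking =~ part1 + part2 + part3 + part4 + part5"

model = semopy.Model(model_desc)
model.fit(df)

# Fit statistics such as CFI, TLI (equivalent to NNFI) and RMSEA.
stats = semopy.calc_stats(model)
print(stats[["CFI", "TLI", "RMSEA"]])
```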
References

AERA, APA and NCME (2014) Standards for educational and psychological testing, Washington, DC: AERA.
APA, AERA and NCME (1974) Standards for educational and psychological tests, Washington, DC: APA.
Bachman, L F and Palmer, A S (1996) Language testing in practice, Oxford: Oxford University Press.
Bachman, L F and Palmer, A S (2010) Language assessment in practice, Oxford: Oxford University Press.
Bernstein, J, Van Moere, A and Cheng, J (2010) Validating automated speaking tests, Language Testing 27 (3), 355–377.
Brenchley, M (2020) Re-examining the reliability of human marking in the Linguaskill Speaking test, Cambridge Assessment English internal research report.
Brown, A (2003) Interviewer variation and the co-construction of speaking proficiency, Language Testing 20 (1), 1–25.
Cambridge Assessment English (2016) Principles of good practice: Research and innovation in language learning and assessment, Cambridge, UK: Cambridge Assessment.
Chapelle, C A, Enright, M K and Jamieson, J M (2010) Does an argument-based approach to validity make a difference? Educational Measurement: Issues and Practice 29 (1), 3–13.
Cheng, L (2014) Consequences, impact, and washback, in Kunnan, A J (Ed.) The Companion to Language Assessment (Vol. III), Chichester, West Sussex: John Wiley and Sons, 1,130–1,146.
Chun, C W (2006) An analysis of a language test for employment: The authenticity of the PhonePass test, Language Assessment Quarterly 3 (3), 295–306.
Cicchetti, D V (1994) Guidelines, criteria, and rules of thumb for evaluating normed and standardized assessment instruments in psychology, Psychological Assessment 6 (4), 284–290.
Cizek, G J and Bunch, M B (2007) Standard setting: A guide to establishing and evaluating performance standards on tests, Thousand Oaks, CA: Sage.
Council of Europe (2001) Common European Framework of Reference for Languages: Learning, teaching, assessment, Strasbourg: Council of Europe.
Council of Europe (2009) Relating Language Examinations to the Common European Framework of Reference for Languages: Learning, Teaching, Assessment (CEFR). A Manual, Strasbourg: Council of Europe.
Council of Europe (2018) Common European Framework of Reference for Languages: Learning, teaching, assessment (Companion volume with new descriptors), Strasbourg: Council of Europe.
Cronbach, L J (1988) Five perspectives on validity argument, in Wainer, H and Braun, H I (Eds) Test validity, Hillsdale, NJ: Lawrence Erlbaum, 3–17.
Cronbach, L J and Meehl, P E (1955) Construct validity in psychological tests, Psychological Bulletin 52 (4), 281–302.
Fan, J (2014) Chinese test takers' attitudes towards the Versant English Test: A mixed-methods approach, Language Testing in Asia 4 (6), 1–17.
Galaczi, E and Taylor, L (2018) Interactional competence: Conceptualisations, operationalisations, and outstanding questions, Language Assessment Quarterly 15 (3), 219–236.
Haertel, E H (2006) Reliability, in Brennan, R L (Ed.) Educational Measurement (4th edn), Westport, CT: Praeger, 65–110.
Jones, E, Brenchley, M and Benjamin, T (2020) An investigation into the hybrid marking model for the Linguaskill Speaking test, Cambridge Assessment English internal research report.
Kane, M T (2013) Validating the interpretations and uses of test scores, Journal of Educational Measurement 50 (1), 1–73.
Knill, K, Gales, M, Kyriakopoulos, K, Malinin, A, Ragni, A, Wang, Y and Caines, A (2018) Impact of ASR performance on free speaking language assessment, Proc. Interspeech 2018, 1,641–1,645. https://doi.org/10.21437/Interspeech.2018-1312
Linacre, J M (1989) Many-facet Rasch measurement, Chicago: MESA Press.
Lopes, S and Cheung, K (2020) Final report on the December 2018 standard setting of the Linguaskill General papers to the CEFR, Cambridge Assessment English internal research report.
Lu, Y, Gales, M, Knill, K, Manakul, P, Wang, L and Wang, Y (2019) Impact of ASR performance on spoken grammatical error detection, Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, September 2019, 1,876–1,880. https://doi.org/10.21437/Interspeech.2019-1706
Malinin, A, Ragni, A, Knill, K and Gales, M (2017) Incorporating uncertainty into deep learning for spoken language assessment, Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics 2, 45–50. https://doi.org/10.18653/v1/P17-2008
Messick, S (1989) Validity, in Linn, R L (Ed.) Educational measurement (3rd edn), New York: Macmillan, 13–103.
Messick, S (1996) Validity and washback in language testing, Language Testing 13 (3), 241–256.
Sawaki, Y, Stricker, L J and Oranje, A H (2009) Factor structure of the TOEFL Internet-based test, Language Testing 26 (1), 5–30.
Shrout, P E and Fleiss, J L (1979) Intraclass correlations: Uses in assessing rater reliability, Psychological Bulletin 86 (2), 420–428.
van Dalen, R C, Knill, K and Gales, M (2015) Automatically grading learners' English using a Gaussian process, SLaTE 2015: Workshop on Speech and Language Technology in Education, 7–12. https://www.isca-speech.org/archive/slate_2015/sl15_007.html
Wang, Y, Gales, M J F, Knill, K M, Kyriakopoulos, K, Malinin, A, van Dalen, R C and Rashid, M (2018) Towards automatic assessment of spontaneous spoken English, Speech Communication 104, 47–56. https://doi.org/10.1016/j.specom.2018.09.002
Weir, C J (2005) Language testing and validation: An evidence-based approach, Basingstoke: Palgrave Macmillan.
Xi, X (2010) Automated scoring and feedback systems: Where are we and where are we heading? Language Testing 27 (3), 291–300.
Xi, X, Schmidgall, J and Wang, Y (2016) Chinese users' perceptions of the use of automated scoring for a speaking practice test, in Yu, G and Jin, Y (Eds) Assessing Chinese learners of English: Language constructs, consequences and conundrums, Basingstoke, Hampshire: Palgrave Macmillan, 150–175.
Xu, J (2015) Predicting ESL learners' oral proficiency by measuring the collocations in their spontaneous speech, unpublished doctoral dissertation, Iowa State University, Ames, IA.
Xu, J and Gallacher, T (2017) Linguaskill Speaking trial report, Cambridge Assessment English internal research report.
Xu, J and Seed, G (2017) Automated speaking tests: Merging technology, assessment and customer needs, paper presented at the Language Testing Forum 2017, Huddersfield, UK.
All details are correct at the time of going to print in June 2020. Copyright © UCLES 2020 | CER/6644/V1/JUN20