Linguaskill
Building a Validity Argument for the Speaking Test
Validity report
Contents
Summary
Introduction
1. Test purpose
2. Target language use domain
3. Test description
4. Marking of speaking performance
5. Framework of test validation
6. Validity evidence for the Linguaskill Speaking test
References
Summary
The Linguaskill Speaking test is a computer-based oral English test that is enhanced by auto-marking technology. It takes a hybrid approach to marking which combines the strengths and benefits of artificial intelligence with those of the decision-making of experienced human markers. The test is appealing to institutional users who may need to assess a large number of English language learners in a short time frame. Individual learners may also find the browser-based speaking test highly accessible in that it can be taken at home on any Windows computer with a high-speed internet connection. The test results of Linguaskill are reported within 48 hours thanks to auto-marking technology.

This paper presents a validity argument for the Linguaskill Speaking test by weaving together a narrative about the research evidence that has been collected to support the intended interpretations and uses of the test scores. We begin by describing the purpose, the target language use (TLU) domain and the format of the test. Then we unveil the design of the auto-marker, the training programme of human examiners and the hybrid marking model used in the test. In what follows we delineate the structure of the Linguaskill validity argument and explain why each element in it is essential for arguing for test validity. Finally, we present research evidence related to three elements in the validity argument, namely Test Content, Marking of Responses, and Interpretation of Test Results. It is notable that test validation is a cumulative process. Further research is being carried out to gather additional evidence to support the validity argument.

The evidence presented in this paper supports the following validity claims about the test.

• Test Content: The speaking topics, which cover daily routines, dialogues at social activities, exchanges at workplaces and telecommunications, are overall a good representation of the communicative situations that the candidates will likely encounter in the real world.
• Test Content: The speaking topics are interesting, neither too easy nor too difficult.
• Test Content: Candidates generally feel comfortable with speaking to a computer.
• Marking of Responses: The reliability of human marking is satisfactory.
• Marking of Responses: With hybrid marking in place, the auto-marker achieved 95.6% exact agreement and 100% adjacent agreement on Common European Framework of Reference (CEFR) grades with human examiners.
• Marking of Responses: The auto-marker is capable of detecting suspicious non-English speech and escalating it to human examiners for verification.
• Interpretation of Test Results: Standard-setting exercises are conducted regularly to establish the link between the test and the CEFR. For this reason, the test results can be interpreted confidently based on the CEFR.
• Interpretation of Test Results: Confirmatory factor analysis suggests that a single, overarching speaking construct is assessed by the test.
Introduction
The Linguaskill Speaking test is a computer-based oral English test that is enhanced by auto-marking technology. In contrast to other automated speaking assessments, Linguaskill Speaking takes a hybrid approach to marking, which means its test responses are marked by a combination of human examiners and auto-marking technology. If the computer indicates low confidence in marking a response, the response is escalated to human marking. This hybrid model aims to address the challenges that fully automated assessment brings, by marrying the latest auto-marking technology with the decision-making of experienced human markers.

With the advancement of natural language processing, machine learning and speech recognition technologies, automated speaking assessment is growing in popularity. Compared with traditional face-to-face speaking exams, automated speaking assessment, which is delivered on a computer or mobile device, offers the benefits of much faster score reporting, simple test administration and on-demand testing.

Automated speaking assessment is particularly appealing to institutional users who may find large-scale administration of face-to-face speaking exams unfeasible. Individual learners may see increased accessibility in automated assessment because, with a remote proctoring solution in place, the speaking test can be taken even at home. However, automated assessment also brings about challenges and problems that are not typically associated with human marking, such as scepticism about construct coverage and susceptibility to candidate cheating (Chun 2006, Fan 2014, Xi 2010, Xu 2015).

This paper presents a validity argument for the Linguaskill Speaking test. A validity argument provides an overall evaluation of the intended interpretations and uses of test scores by conducting coherent analysis on various strands of research evidence either for or against the proposed interpretations and uses (Cronbach 1988, Kane 2013). We begin by describing the test specifications, including the intended test purpose, target language use domain and test format. Then, we introduce the hybrid marking model applied to the oral assessment, in which marking tasks are shared by an auto-marker and human examiners. In what follows, we present a clear validation framework which lays out critical validity considerations at different stages of a testing cycle. In the remainder of the paper, we present validity evidence that has been collected based on this framework.
1. Test purpose

The Linguaskill Speaking test assesses candidates' oral English proficiency for everyday communication. It can be taken on its own or in conjunction with the other Linguaskill modules of Reading and Listening, and Writing. Linguaskill aims to provide fast, reliable and clearly interpretable results based on the Common European Framework of Reference for Languages (CEFR), a widely recognised standard for describing the progression of language learning and acquisition (Council of Europe 2001, 2018), as well as more granular scores based on the Cambridge English Scale. Intended uses of Linguaskill include a) measuring a candidate's level of English for placement, progression, or graduation at education institutions and b) measuring a candidate's level of English for job or development opportunities at companies. The target candidates of Linguaskill are English language learners over the age of 16 years.

2. Target language use domain

A target language use (TLU) domain is a hypothetical description of the situations or contexts in which candidates need to be able to use the language outside the test. By delineating the scope of this domain and identifying the key characteristics of language use in it, the test developer is able to design tasks that mimic these language use activities. Then, candidates' test-taking behaviours can be viewed as a sample of their predicted language performance in the TLU domain.

The critical TLU tasks selected for Linguaskill generally fall into four categories. They are daily routines (e.g., discussing leisure-time habits, giving preferences), dialogues at social activities (e.g., describing a situation or issue, recounting news or personal experience), exchanges at workplaces (e.g., raising a problem or issue, reporting on data), and telecommunications (e.g., requesting information and leaving a telephone message).

3. Test description

The Linguaskill Speaking test is browser-based so candidates can sit the test on any Windows computer¹ with a high-speed internet connection in an invigilated setting. The test is remotely proctored if a candidate chooses to take it at home. Questions are presented to the candidate through the computer screen and headphones, and their responses are recorded and remotely assessed by either computer algorithms or examiners (see Section 4). The test is multi-level, meaning that it is designed to elicit and assess oral performance of multiple proficiency levels based on the CEFR, including below A1, A1, A2, B1, B2, and C1 and above. The test results are reported within 48 hours.

The Linguaskill Speaking test has five parts: Interview, Reading Aloud, Presentation, Presentation with Visual Information, and Communication Activity. All parts are weighted equally and focus on different aspects of speaking ability. The format, testing aim and evaluation criteria of the five parts are presented below and summarised in Table 1.
Table 1
Part 1, Interview: The candidate answers eight questions about themselves. Length of response(s): 4 x 10 secs and 4 x 20 secs. Preparation time: none. Marks: 20%.
Part 2, Reading Aloud: The candidate reads aloud eight sentences. Length of response(s): 8 x 10 secs. Preparation time: none. Marks: 20%.

¹ At this point, sitting Linguaskill on a Mac computer is not supported. It is suggested that candidates use Google Chrome or Mozilla Firefox to take the test on a PC.
3.1 Interview

b. Testing aim
As well as introducing the candidate to the computer format of the test, the focus of this test part is to assess the candidate's ability to answer personal questions and to give lower-proficiency candidates a more accessible and achievable task.

c. Evaluation criteria
Candidates are assessed on their linguistic output in terms of pronunciation and fluency, and language resource.

3.2 Reading Aloud

a. Format
In the Reading Aloud task, the candidate is required to read aloud eight sentences. They have 10 seconds to read aloud each sentence. Sentences are of the kind that candidates may have to read aloud in real-world situations and are presented in increasing level of difficulty, covering a wide range of phonological features and syntactic structures.

b. Testing aim
The focus of this test part is to assess the candidate's ability to transform the written form of the language into speech and to handle elements of pronunciation at sentence level.

c. Evaluation criteria
Candidates are assessed based on phonological criteria, including their overall intelligibility, their ability to produce individual sounds, as well as their stress, rhythm and intonation.

3.3 Presentation

a. Format
In the Presentation task, the candidate is required to speak for 1 minute on a given topic. There is no choice of topic. A preparation time of 40 seconds is given before a candidate records their response.

b. Testing aim
The focus of this test part is to assess the candidate's ability to deliver a long turn. As well as a description of a situation or issue, the candidate is encouraged to state and/or justify an opinion through bulleted prompts.

c. Evaluation criteria
Candidates are assessed on their linguistic output in terms of pronunciation and fluency, language resource and discourse management. Additionally, the marking takes into account the candidate's ability to complete the task appropriately in accordance with the rubric and instructions.

3.4 Presentation with Visual Information

a. Format
In the Presentation with Visual Information task, the candidate is required to talk for 1 minute about information presented to them in visual form. A preparation time of 1 minute is given before a candidate records their response. The candidate is asked to present the information within a specific context, such as leaving a voicemail for a friend or giving a presentation in class.

b. Testing aim
The focus of this test part is to assess the candidate's ability to deliver a long turn which involves the interpretation of very simple visual information and providing a recommendation, explanation or suggestion.

c. Evaluation criteria
Candidates are assessed on their linguistic output in terms of pronunciation and fluency, language resource and discourse management. Additionally, the marking takes into account the candidate's ability to complete the task appropriately in accordance with the rubric and instructions.

3.5 Communication Activity

a. Format
In the Communication Activity task, the candidate is required to answer five questions related to a scenario. A preparation time of 40 seconds is given before the candidate hears the first question. Each question has a 20-second response window and asks the candidate to provide an opinion, speculate on a hypothesis, or make an evaluation.

b. Testing aim
The focus of this test part is to assess the candidate's ability to express opinions and ideas on a given topic in response to an aural prompt. It is an opportunity for higher-level candidates to demonstrate their higher-level skills.

c. Evaluation criteria
Candidates are assessed on their linguistic output in terms of pronunciation and fluency, language resource and discourse management. Additionally, the marking takes into account the candidate's ability to complete the task appropriately in accordance with the rubric and instructions.

4. Marking of speaking performance

The Linguaskill Speaking test adopts a hybrid or human-in-the-loop marking model in which an auto-marker is used in live assessment, but with the involvement of human examiners. This section discusses the design of the auto-marker, examiner training and certification, and how hybrid marking is applied.

4.1 Auto-marker

An auto-marker is a set of computer algorithms designed to mark constructed test responses such as extended speaking and writing. Cambridge Assessment English (henceforth Cambridge English), in collaboration with research groups from the University of Cambridge, started to develop automated marking of spontaneous non-native English speech in 2012. The auto-marker used in the Linguaskill Speaking test is called the Custom Automated Speech Engine (CASE), which was developed by Enhanced Speech Technology Ltd building upon technology transferred from the Institute for Automated Language Teaching and Assessment (ALTA), an interdisciplinary research centre of the University of Cambridge, using machine learning technologies.
Figure 1. The architecture of the auto-marker (Knill et al 2018): a speech recogniser converts the audio into text, a feature extraction module derives features from the audio and the text, and a grader produces the grade.
CASE, as shown in Figure 1, consists of three major components: a speech recogniser, a feature extraction module, and a grader (Knill et al 2018, Wang et al 2018). The speech recogniser conducts Automated Speech Recognition (ASR), converting the audio signal of speech into a structured representation of the underlying word transcription. It was trained based on deep neural network models using learner speech supplied by Cambridge English and combined with crowd-sourced transcriptions (see the ASR2 system in Lu et al 2019). Feature extraction is about deriving features relevant to the speaking construct (e.g., fluency, pronunciation accuracy, vocabulary diversity) from both the audio signal and the structured word transcription as the basis for grading. Based on these features, the grader uses state-of-the-art machine learning models to return a distribution over scores from which feedback to the candidate, such as the CEFR grade, is derived. The training sample for the grader includes a large set of Linguaskill Speaking test responses produced by learners of various first languages and all CEFR levels, as well as the marks awarded to these responses by examiners. In addition, the machine learning grader models used in CASE have been selected to provide an uncertainty measure based on the similarity between the input and the training data (van Dalen, Knill and Gales 2015, Malinin, Ragni, Knill and Gales 2017). This uncertainty measure is a meaningful indicator of the reliability of the auto-marker score and is useful for identifying test responses that require human marking.

4.2 Examiners

All examiners for the Linguaskill Speaking test undergo a rigorous training programme in order to qualify (Figure 2). Prospective examiners must meet the minimum professional requirements, which include being educated to first degree level or equivalent, holding a recognised language teaching qualification, providing proof of substantial and relevant teaching experience within the last two years, and having suitable English language competency.

Approved applicants are provided with training materials through an online portal, which includes extensive documentation about the marking procedure and sample speaking responses with marks, along with detailed comments about performance and marking rationales. Applicants are guided through the material by the documentation.

The certification test must be taken successfully within 30 days of access. Certification tests include a selection of speaking items previously marked by a pool of experienced and reliable examiners, with a statistically adjusted average score² as the final approved mark. Applicants must award a mark within 0.5 of the approved mark on at least 80% of the items. Two attempts are provided, with different versions of the test. Examiners who have failed both attempts have their access to the portal automatically revoked.

Once applicants have successfully passed the certification test, the system identifies them as certificated; they are added to the marking pool and can start marking candidates. Once examiners start providing marks, they are continuously statistically monitored. Marking behaviour analysis is carried out to check consistency and to identify possible bias and non-compliant behaviour. Examiners who are flagged up statistically are investigated and removed from the marking pool if their behaviour is confirmed as unsatisfactory. Re-certification occurs every two years with new training and test material.

4.3 Hybrid marking

Hybrid marking aims to combine the strengths and benefits of artificial intelligence (AI) with those of human examiners. Computer marking is speedy and cost-effective but is only reliable when the responses being marked are close to the training sample of the AI system. Some impediments to auto-marker accuracy include poor audio quality, aberrant speaking behaviours and training sample underrepresentation. Poor audio quality is likely to significantly reduce ASR accuracy and affect the auto-marking performance. Learners may also be tempted to apply strategies to trick the marking system into giving them higher marks (Xi, Schmidgall and Wang 2016). Thus, it can be argued that human examiners play a key role as gatekeepers in preventing less reliable auto-marker scores being released to candidates.

The hybrid marking model is about using human examiner expertise to support and further develop the auto-marker. It is also based on the assumption that the computer can provide information to indicate its confidence in score prediction. When this confidence is low, the test response is flagged up and escalated to human examiners.
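The confidence-based escalation described above can be illustrated with a short sketch. This is a minimal illustration only, not the production CASE code: the names `grade_response` and `UNCERTAINTY_THRESHOLD` are hypothetical, and the real grader derives its uncertainty measure from the machine learning model itself rather than from the simple spread of samples used here.

```python
from dataclasses import dataclass
from statistics import mean, stdev

# Hypothetical threshold: responses whose score uncertainty exceeds it
# are escalated to a human examiner rather than auto-marked.
UNCERTAINTY_THRESHOLD = 0.5

@dataclass
class MarkingDecision:
    score: float        # predicted raw score on the 0-6 scale
    uncertainty: float  # spread of the grader's score distribution
    needs_human: bool   # True if the response is escalated

def grade_response(score_samples: list[float]) -> MarkingDecision:
    """Turn a grader's score distribution into a marking decision.

    `score_samples` stands in for the distribution over scores returned
    by the grader; its mean is treated as the predicted score and its
    standard deviation as a simple uncertainty proxy.
    """
    predicted = mean(score_samples)
    uncertainty = stdev(score_samples)
    return MarkingDecision(
        score=predicted,
        uncertainty=uncertainty,
        needs_human=uncertainty > UNCERTAINTY_THRESHOLD,
    )

# A confident prediction is auto-marked; a diffuse one goes to an examiner.
print(grade_response([4.1, 4.0, 4.2, 4.1]))   # low spread  -> auto-marked
print(grade_response([2.0, 4.5, 3.1, 5.2]))   # high spread -> human examiner
```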
² We use fair average scores, which are average scores adjusted for marker severity by multi-faceted Rasch analysis (Linacre 1989).
Figure 2. The procedure for certificating Linguaskill speaking examiners: nomination and a check of the minimum professional requirements, followed by training and the certification test; successful applicants become examiners and are then monitored in years when they are not re-certificated and re-certificated in years when they are not monitored, while unsuccessful applicants can re-apply to become a Linguaskill online examiner after 12 months.
In the Cambridge English hybrid marking model (Figure 3), escalation to human marking is determined by setting thresholds on three features generated by the auto-marker in addition to the predicted score. They are the Assessment Quality score, the Language Quality score and the Audio Quality score. The Assessment Quality score is an uncertainty measure produced by the grader which suggests the amount of confidence the grader has in its score prediction (see Section 4.1). The Language Quality score is an ASR confidence score returned by the speech recogniser. It represents the system's confidence in the accuracy of its transcription, which in turn can be a useful proxy for identifying candidates who are not actually speaking English during the test (see Lu et al 2019). The Audio Quality score indicates the clarity of the voice recording and is derived from three separate measures: dynamic ratio (differences in amplitude between loud and quiet parts of the audio), clipping (frames of audio that reach the maximum/minimum possible values and hence are distorted) and noise. It also incorporates a variety of other ASR errors linked to audio quality or the intermediate processes during the speech-to-text conversion. In addition, test responses with auto-marker scores falling below or above certain cut-off values are flagged for examiner marking. This is informed by our auto-marker evaluation suggesting that the auto-marker score tends to be less reliable on the lower and higher ends of the scoring scale. In the current hybrid marking model, a large proportion of test responses is marked by human examiners to ensure the quality of marking and provide marking data to further train the auto-marker. The proportion of human marking will gradually decrease with the enhancement of the auto-marker. The evaluation of the auto-marker and the hybrid marking model will be further discussed in Section 6.2.

5. Framework of test validation

Validity is the most fundamental issue of assessment. Validity is the degree to which evidence and theory support the interpretations of test results for intended test uses (AERA, APA and NCME 2014). Two common frameworks used for language test validation are the argument-based framework (Bachman and Palmer 2010, Kane 2013) and the socio-cognitive framework (Weir 2005). The former focuses on decomposing Messick's (1989) complex validity theory by structuring validity enquiry around practical arguments. The latter applies Messick's (1989) validity theory to language assessment and, similar to Messick, takes a cumulative approach to evidence collection. The Linguaskill validity argument, as shown in Figure 4, is constructed by integrating the two, in order to make validity claims and evaluate supporting evidence.
Figure 3. The Cambridge English hybrid marking model: when the AI auto-marker is confident, it assigns the level; when it is not confident, the response goes to a human examiner, and the examiner marks provide new data for further training of the auto-marker.
The Linguaskill validity argument consists of six parts: Test Content, Response Processes, Marking of Responses, Interpretation of Test Results, Test Use and Test Impact. They represent a sequence of activities that a typical testing cycle comprises, i.e., from test construction to the impact of testing on stakeholders. This section explains what these notions mean and how they can help guide the validation research for Linguaskill.

5.1 Test Content

5.2 Response Processes

Validity evidence for response processes includes coherence between test-taking behaviours/strategies and construct theories of language abilities, appropriate task delivery and administration, clear test instruction, and accommodation of candidates with special needs. Such evidence helps rule out an alternative interpretation of the test scores that factors other than targeted language ability had an effect on candidates' test performance (AERA, APA and NCME 2014).
Figure 4. The outline of the Linguaskill validity argument. Marking of Responses:
• Appropriate use of mark schemes in line with knowledge of targeted language abilities
• Psychometric evidence about score reliability and generalisability
• Degree of overlap between computer and human scoring
Test Impact:
• Positive impact of testing on test preparation and language education inside and outside the classroom
• Positive impact of testing on society
5.4 Interpretation of Test Results

Test scores are simply numbers, so meanings have to be assigned to them to make them useful for various purposes. This is at the heart of score interpretation. In most cases, the test developer provides test users with the suggested interpretations, in the form of Can Do statements of abilities associated with the scores, but this interpretation has to be backed up by theories of cognitive processes, language development, or second language acquisition (Weir 2005). Validity evidence for score interpretation can be obtained from standard-setting exercises that aim to align test scores to a theory-driven and/or research-based standard for describing language proficiency, such as the CEFR (Council of Europe 2001, 2018). Additionally, this evidence may be collected from concurrent studies that examine the relationship between test results and other measures of targeted language abilities. This is traditionally called concurrent validity (APA, AERA and NCME 1974). Validity evidence for score interpretation may also be collected from latent factor analysis which investigates the underlying factor structure of the test (e.g., Sawaki, Stricker and Oranje 2009). This piece of evidence is particularly relevant to integrated language assessment in which two or more language skills (e.g., listening and speaking) are assessed at the same time.

5.5 Test Use

Based on the test results, stakeholders, such as candidates, teachers, employers and admission officers, will likely take actions. For example, a candidate may decide to put in more effort to improve a particular skill; a teacher may tweak his or her lesson plans to meet students' learning needs; an employer may select a team among high-scoring candidates to expand overseas markets; a school admission officer may make acceptance decisions on applicants. Validity evidence related to test use is about the extent to which test results help stakeholders make informed decisions or take the right actions. This evidence can be sought in the following two areas. First, suggested test use and meanings of test scores should be well understood by test users to avoid unintended score interpretations and uses. Second, test results should be useful for predicting future behaviours of interest, such as job performance and academic performance, that are related to language use. This use of validity in its predictive sense was called predictive validity and had been predominant before construct validity came into being in the 1950s (APA, AERA and NCME 1974).
5.6 Test Impact

The use of test scores will exert an impact on stakeholders in a range of teaching, learning and social contexts. For example, the way a high-stakes language test is designed is likely to influence how learners learn a language, how teachers teach a language, and even social values regarding language proficiency and fairness. This impact is also called consequences, washback, or consequential validity (Cheng 2014, Messick 1996, Weir 2005) and is an integral aspect of the concept of validity. Test impact is closely related to how a test is used. If Linguaskill were misused for unintended purposes, the test impact would probably be negative. The responsibility to ensure positive test impact is shared by both the test provider and test users.

6. Validity evidence for the Linguaskill Speaking test

This section presents the research evidence to support the use of the Linguaskill Speaking test for its intended purposes. Test validation is a cumulative and ongoing process (Messick 1989). Validity evidence is collected and refined over time as more data is collected, and as the Linguaskill test is relatively new, not all aspects of the validity argument have yet been fully documented. Extensive evidence has been obtained for Test Content, Marking of Responses and Interpretation of Test Results. Further research is being carried out to gather additional evidence for Response Processes, Test Use and Test Impact in specific countries and regions where the test is used.

6.1 Validity evidence for Test Content

Expert judgment is an essential element in attesting the relevance of the test tasks to the TLU domain (Messick 1989, p. 39). For Linguaskill, the judgment on content relevance and construct coverage (i.e., the skills assessed by the test) is made in test review meetings at the test development phase by a group of experts consisting of item writers, senior examiners and language testing researchers (Figure 5). The review of speaking items focuses on item difficulty, clarity of the prompt and instruction, authenticity of topics and situations, background knowledge needed to give a response, and the skills being assessed. Test items that fail the review are either discarded or revised before being reviewed once again. Those which pass the expert review are trialled with a large group of learners selected from the target test population before being included in a live test. Any problematic test items found by the trial are returned to the development phase.

Xu and Gallacher (2017) conducted a survey of 3,601 adult English language learners from 23 countries in a global trial of the Linguaskill Speaking test. The majority of the participants reported that the speaking tasks were similar to how they used English in the real world (65.3%) and that the speaking topics were closely related to their life (57.9%). In addition, approximately 70% of the participants agreed or strongly agreed that the Linguaskill Speaking test allowed them to demonstrate their English-speaking ability.

Analysis of the qualitative feedback received in the survey suggested that the speaking topics were interesting, neither too easy nor too difficult, and related to everyday life or work. For example, one participant related test questions to his daily language use situations:

'I find the topics relevant, not too easy nor difficult. I think that these topics are related to what normally happens in daily life. These are topics that most people learning English should master because they are what takes place in the real world.' (Participant ID 1742853)

Many participants reported the speaking test was not as stressful as they had anticipated. As one participant put it:

'I always feel worried in exams, but as I hear the questions, I felt more comfortable and relax. So they were easy for me.' (Participant ID 1721614)

Participants had mixed feelings about speaking to a computer. Some felt that talking to a computer was 'just like talking to a real person' (Participant ID 1750497) or even less stressful than talking to a speaking examiner (Participant ID 1741200). Others indicated that they were very used to interacting with a digital device since 'interaction through [a] mobile phone is quite popular nowadays' (Participant ID 1756618). A small proportion (19.3%) of participants still preferred to speak to a human interlocutor as they had expected exchanges of information (Participant ID 1646082) and seeing a human face (Participant ID 1714910) in real-life oral communication.

In short, the quality-assurance process underpinning Linguaskill content production and the findings from this large-scale trial study suggest that the content of the Linguaskill Speaking test is overall a good representation of the speaking tasks that candidates will likely encounter in the real world. The ability to interact with an interlocutor (Brown 2003, Galaczi and Taylor 2018), which is typically assessed in a face-to-face interview, is not represented in the Linguaskill Speaking test. It is therefore not possible to interpret the Linguaskill Speaking test score as a direct measure of interactional competence. Nevertheless, monologic speaking performance may to some extent predict interactional speaking performance (Bernstein, Van Moere and Cheng 2010), and it can be argued that the Linguaskill Speaking test is designed to cover a variety of communicative speaking functions.
Figure 5. The test development cycle adopted by Linguaskill (Cambridge English 2016), covering the planning phase, the design phase (initial specifications, revision), the operational phase (live test) and the monitoring phase (evaluation, review).
6.2 Validity evidence for Marking of Responses

Reliable marking of test responses serves as the basis for accurate estimation of candidates' targeted language abilities. A caveat associated with automated speaking assessment relates to the reliability of marking open-ended speech (Xu 2015). Before discussing the reliability of CASE, the speech auto-marker used in Linguaskill, we first report the reliability of examiner marking. This is because a) a large proportion of the Linguaskill Speaking tests are still marked by examiners and b) the auto-marker was trained on human-marked spoken data and thus cannot outperform the best examiners in the training sample.

6.2.1 Reliability of examiner marking

Xu and Gallacher (2017) conducted a study to investigate the reliability of human marking in the Linguaskill Speaking test. Five Linguaskill speaking examiners randomly selected from a larger pool were asked to mark a common dataset consisting of test responses produced by 60 candidates of various proficiency levels. In other words, each part of the test was marked by the same five examiners. Reliability of human marking on each test part and the whole test (which was the average of each part) was estimated using intraclass correlation coefficients (ICC). This coefficient indicates the degree to which a single mark on a response represents the other marks on the same response (Shrout and Fleiss 1979). In general, an ICC value between 0.75 and 0.90 is considered good reliability and a value above 0.90 indicates excellent reliability (Cicchetti 1994).
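For readers who wish to see how such a coefficient is obtained, the sketch below is a simplified illustration, assuming a fully crossed design in which every examiner marks every candidate. It computes one common Shrout and Fleiss form (a two-way, single-rater, absolute-agreement ICC) from ANOVA mean squares; it is not the exact routine used in the studies cited here, and the data are invented.

```python
import numpy as np

def single_rater_icc(scores: np.ndarray) -> float:
    """Two-way random-effects, absolute-agreement ICC for a single rater,
    often labelled ICC(2,1) after Shrout and Fleiss (1979).

    `scores` is an (n_candidates x k_examiners) matrix in which every
    examiner has marked every candidate.
    """
    n, k = scores.shape
    grand_mean = scores.mean()
    row_means = scores.mean(axis=1)   # per-candidate means
    col_means = scores.mean(axis=0)   # per-examiner means

    # ANOVA mean squares.
    ss_rows = k * np.sum((row_means - grand_mean) ** 2)
    ss_cols = n * np.sum((col_means - grand_mean) ** 2)
    ss_total = np.sum((scores - grand_mean) ** 2)
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )

# Illustrative data only: 60 candidates marked by the same 5 examiners.
rng = np.random.default_rng(0)
ability = rng.uniform(0, 6, size=(60, 1))                  # candidate level
marks = np.clip(ability + rng.normal(0, 0.5, (60, 5)), 0, 6)
print(round(single_rater_icc(marks), 2))
```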
Figure 6. Histograms of examiner and auto-marker raw scores (0–6 scale) without hybrid marking.
The ICC values of each test part as well as the whole test are presented in Table 2. It can be seen that the reliability of single human marking varies from 0.84 to 0.91 in the five test parts and is 0.91 for the whole test, thus indicating adequate reliability of human marking at task level and excellent reliability at the test level. Brenchley (2020) re-examined inter-rater reliability using a larger dataset of 204 Linguaskill Speaking tests marked independently by three examiners. The study reported a single-marker ICC of 0.90 for the whole test.

Table 2. Intraclass correlation coefficients of single examiner marking (Xu and Gallacher 2017)
Part 1: 0.84; Part 2: 0.87; Part 3: 0.90; Part 4: 0.88; Part 5: 0.91; Whole test: 0.91

6.2.2 Reliability of the auto-marker on its own

Auto-marker evaluation is often performed by computing the correlation or agreement between computer marking and human marking (e.g., Bernstein et al 2010, Wang et al 2018). Jones, Brenchley and Benjamin (2020) conducted an evaluation study on the current version of the auto-marker, focusing first on the performance of the auto-marker on its own – that is, not embedded in a hybrid marking system (see the next section). The evaluation was based on a dataset of 9,286 Linguaskill Speaking tests reflecting live candidature. The distribution of human CEFR grades in the dataset was approximately 1% Below A1, 5% A1, 13% A2, 36% B1, 35% B2 and 10% C1 or above. The dataset contained speakers of a large number of native languages, of which Spanish (29%), Arabic (26%) and Portuguese (15%) were the most frequent.

The study found that when the same CEFR cut-off values are applied to auto-marker and human raw scores, the auto-marker awarded the same CEFR grade as the examiners in 56.8% of the tests. In 96.6% of the tests, the difference between computer marking and human marking was equal to or smaller than one CEFR level. In 3.4% of the tests, the auto-marker and human examiners differed by more than one CEFR level (Table 3). In operational testing, these inaccurate auto-marker scores are overridden by examiner scores, which will be discussed in the next section.

Table 3. Percentage agreement between auto-marker and human CEFR grades (n = 9,286)
Exact agreement (no difference): 56.8%
Adjacent agreement (difference <= 1 CEFR level): 96.6%
Mismarking (difference > 1 CEFR level): 3.4%

The research also found that although the distributions of the auto-marker and human raw scores largely overlapped (Figure 6), the auto-marker was comparatively harsher at the higher end of the scoring scale and comparatively lenient at the lower end (Figure 7). Again, these inaccuracies are addressed through the use of hybrid marking, as well as continued training and improvement of the auto-marker. The root mean square error (RMSE) of the auto-marker raw score was 0.64, about half a CEFR band. RMSE is the standard deviation of the residuals (auto-marker prediction errors) and an indicator of how concentrated the data points are around the diagonal regression line where exact human–machine agreement is achieved (Figure 7).

In addition to auto-marker agreement with examiners, the research also evaluated the usefulness of the Language Quality score, a confidence measure of speech recognition (see Section 4.3), for identifying non-English speech or gibberish in the test responses. Based on a subset of data (n = 284) which included aberrant speaking behaviours, it was found that normal English speech resulted in significantly higher Language Quality scores than non-English speech (see Figure 8).
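As a worked illustration of the statistics reported in this subsection, the sketch below computes exact agreement, adjacent agreement and RMSE from paired grades and raw scores. The data and the CEFR coding used here are invented for illustration; the actual figures above come from the 9,286-test evaluation dataset.

```python
import numpy as np

# CEFR grades coded as ordered integers so differences can be compared.
CEFR_LEVELS = {"Below A1": 0, "A1": 1, "A2": 2, "B1": 3, "B2": 4, "C1+": 5}

def agreement_rates(human_grades, machine_grades):
    """Percentage of tests with exact and adjacent (<= 1 level) agreement."""
    h = np.array([CEFR_LEVELS[g] for g in human_grades])
    m = np.array([CEFR_LEVELS[g] for g in machine_grades])
    diff = np.abs(h - m)
    return np.mean(diff == 0) * 100, np.mean(diff <= 1) * 100

def rmse(human_raw, machine_raw):
    """Root mean square error of auto-marker raw scores against examiner scores."""
    residuals = np.asarray(machine_raw) - np.asarray(human_raw)
    return float(np.sqrt(np.mean(residuals ** 2)))

# Invented example data.
human = ["B1", "B2", "A2", "B1", "C1+"]
machine = ["B1", "B1", "A2", "B2", "B2"]
print(agreement_rates(human, machine))          # (40.0, 100.0)
print(rmse([3.2, 4.5, 2.1], [3.0, 4.4, 2.6]))   # about 0.32
```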
Figure 7. A scatter plot of examiner vs. auto-marker raw scores (0–6 scale) without hybrid marking.
By applying a cut-off value to this score, the 19 non-English-speaking responses in the dataset were all successfully identified. The research suggests that the Language Quality score is sensitive to non-English speech and helpful for recognising candidates with an intent to game the auto-marker.

6.2.3 Reliability of hybrid marking

Hybrid marking, as mentioned in Section 4.3, is about escalating to human examiners any responses which the auto-marker may have mismarked, defined as cases where the auto-marker and human scores are likely to be further than one CEFR level apart on the scoring scale. In the Linguaskill hybrid marking model, rules are applied to a number of features generated by the auto-marker, including Assessment Quality, Language Quality, Audio Quality, auto-marker score (lower bound) and auto-marker score (higher bound). Each rule is an inequality statement, such as 'the Language Quality score is below 0.9'. If a response satisfies any of the rules, it is passed to a human examiner. The thresholds that are used in the rules (e.g. 0.9) were determined by a process of optimisation.

This constrained optimisation was done using exhaustive search, also known as brute-force search. For each variable a set of possible thresholds was created. For example, the Language Quality score ranges from 0 to 1, so the set of thresholds might be 0, 0.01, 0.02, ..., 1. The recall statistic was calculated for every possible combination of the five thresholds based on the dataset of 9,286 Linguaskill Speaking tests, and the combination yielding the highest recall was selected (Jones et al 2020). The recall statistic, which is reported as a percentage, indicates the completeness of flagging. For example, a recall value of 0.90 means that of all the test responses mismarked by the auto-marker, 90% of them are successfully flagged by the application of the rules.
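The rule set and the brute-force threshold search described above can be sketched as follows. This is an illustration under assumptions rather than the operational code: the field names, candidate threshold grids and labelled dataset are invented, and the real optimisation used much finer grids over the full evaluation set.

```python
from itertools import product

def flag_for_human(resp, t_assess, t_lang, t_audio, t_low, t_high):
    """Return True if any escalation rule fires for this response."""
    return (resp["assessment_quality"] < t_assess
            or resp["language_quality"] < t_lang
            or resp["audio_quality"] < t_audio
            or resp["score"] < t_low
            or resp["score"] > t_high)

def recall_and_precision(responses, thresholds):
    """Recall/precision of the rules for catching mismarked responses.

    Each response dict carries a `mismarked` label: True when the
    auto-marker and human scores were more than one CEFR level apart.
    """
    flagged = [flag_for_human(r, *thresholds) for r in responses]
    tp = sum(f and r["mismarked"] for f, r in zip(flagged, responses))
    mismarked_total = sum(r["mismarked"] for r in responses)
    recall = tp / mismarked_total if mismarked_total else 0.0
    precision = tp / sum(flagged) if any(flagged) else 0.0
    return recall, precision

# Coarse, invented candidate grids for the five thresholds.
grids = [
    [0.3, 0.5, 0.7],   # Assessment Quality
    [0.7, 0.8, 0.9],   # Language Quality
    [0.4, 0.6, 0.8],   # Audio Quality
    [0.5, 1.0, 1.5],   # auto-marker score lower bound
    [4.5, 5.0, 5.5],   # auto-marker score upper bound
]

def best_thresholds(responses):
    """Exhaustive (brute-force) search for the threshold combination
    with the highest recall on a labelled evaluation set."""
    return max(product(*grids),
               key=lambda ts: recall_and_precision(responses, ts)[0])
```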
Figure 8. Language Quality scores for English-speaking tests and non-English-speaking tests.

Figure 9. Single Factor model (Xu and Seed 2017).
A statistic that is often reported along with recall is precision, which is an indicator of the accuracy of flagging. For example, a precision of 0.90 means that of all the flagged test responses, 90% of them were indeed mismarked by the auto-marker. There is always a trade-off between precision and recall – a high recall value will lead to a low precision value and vice versa. In designing the Linguaskill piping rules, we pursued a high recall value in order to prevent unreliable auto-marker scores being released to the candidates.

Given our emphasis on high reliability of marking, we initially opted for threshold values that would result in a recall of 0.96, at the cost of escalating a large proportion of test responses to human examiners. The high recall, in turn, led to a small auto-marker RMSE of 0.16 and excellent human–machine agreement: 95.6% exact agreement and 100% adjacent agreement on CEFR grades. We are, however, continually improving the auto-marker and evaluating the threshold values to decrease the proportion of test responses that are examiner-marked.

6.3 Validity evidence for Interpretation of Test Results

The interpretation of language test results must be supported by construct theories about targeted language abilities. On the one hand, construct theories about language are chosen by test developers to inform test design, assign meanings to the test scores and account for the variance in test scores. On the other hand, test validation is also a process of theory validation in that the observed test data may either confirm or refute the chosen theories for score interpretation (Cronbach and Meehl 1955). A construct theory may be a set of language proficiency descriptors, as in the CEFR, which detail the course of language development. Alternatively, it can be a speculation on the composition of a language ability. The validity evidence supporting the proposed score interpretation of the Linguaskill Speaking test has been collected via standard setting and factor analysis. The former links the performance on the test to a theory about speaking proficiency progression whereas the latter examines the underlying structure of the speaking construct targeted by the test.

6.3.1 Standard setting

As the Linguaskill Speaking test reports CEFR-based test results, standard-setting exercises were performed periodically to align its test results to the CEFR framework. This alignment allows test users to interpret the test results in a wider context by referring to the language proficiency descriptors provided by the CEFR.

Standard setting refers to the process of establishing one or more cut scores on examinations (Cizek and Bunch 2007, p. 13). In the case of Linguaskill, cut scores are used to divide candidates into six proficiency groups in line with the CEFR proficiency levels: Below A1, A1, A2, B1, B2 and C1 or above. The most recent standard-setting exercise on the Linguaskill Speaking test was conducted by Lopes and Cheung (2020), who followed a modified Bookmark method recommended by a manual for relating language tests to the CEFR (Council of Europe 2009).
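To make the role of cut scores concrete, the short sketch below shows how raw scores can be mapped onto the six reporting groups. The cut values used here are placeholders for illustration only; the operational Linguaskill cut scores are those established through the standard-setting exercises and are not reproduced here.

```python
import bisect

# Hypothetical cut scores on a 0-6 raw scale (illustrative values only).
CUT_SCORES = [1.0, 2.0, 3.0, 4.0, 5.0]
CEFR_GROUPS = ["Below A1", "A1", "A2", "B1", "B2", "C1 or above"]

def cefr_group(raw_score: float) -> str:
    """Assign a raw score to one of the six CEFR-based proficiency groups.

    bisect_right places a score exactly at a cut into the higher group,
    i.e. the cut score is the minimum score needed for that group.
    """
    return CEFR_GROUPS[bisect.bisect_right(CUT_SCORES, raw_score)]

print(cefr_group(0.4))   # Below A1
print(cefr_group(3.7))   # B1
print(cefr_group(5.6))   # C1 or above
```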
6.3.2 Factor structure

In addition to standard setting, factor analysis was performed to examine the underlying structure of the Linguaskill Speaking test. It was hypothesised that the abilities assessed in the five test parts were unidimensional, meaning that a single, overarching speaking construct was assessed by the test. However, it appeared that Reading Aloud, the second part of the test, might assess a slightly different construct from the other four spontaneous speaking tasks.

To test the above hypothesis, Xu and Seed (2017) conducted an item-level confirmatory factor analysis on 3,250 speaking tests solely marked by examiners. The study found that a Single Factor model (Figure 9) fit the data well, resulting in a Comparative Fit Index (CFI) value of 0.99, a Non-Normed Fit Index (NNFI) value of 0.98, and a Root Mean Square Error of Approximation (RMSEA) value of 0.08. Generally, a CFI or NNFI value of 0.90 or above, or an RMSEA value of 0.08 or below, indicates an adequate model fit (Sawaki et al 2009). The finding suggests that a single speaking construct was able to account for test performances in all the five parts, thus supporting the practice of averaging the five parts to produce an overall test score. It was, however, also noted that the residual (error) term associated with Part 2, Reading Aloud, was relatively larger than those associated with the other parts. The researchers regarded this as a piece of evidence for distinguishing between reading aloud and spontaneous speaking in speaking assessment, and cautioned against using constrained speaking tasks alone to measure communicative speaking ability.
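A single-factor model of this kind can be specified in a few lines. The sketch below assumes the third-party semopy package and a data frame with one column of scores per test part; it illustrates the modelling approach rather than reproducing the analysis code used by Xu and Seed (2017), the file name and column names are hypothetical, and the exact set of fit-statistic columns may vary across semopy versions.

```python
import pandas as pd
import semopy

# One row per test, one column per test part (assumed file and columns).
# In the cited study the dataset held 3,250 examiner-marked tests.
df = pd.read_csv("speaking_part_scores.csv")   # columns: part1 ... part5

# Single-factor model: one latent speaking construct loads on all five parts.
model_desc = "speaking =~ part1 + part2 + part3 + part4 + part5"

model = semopy.Model(model_desc)
model.fit(df)

# Fit statistics such as CFI, TLI (equivalent to NNFI) and RMSEA.
stats = semopy.calc_stats(model)
print(stats[["CFI", "TLI", "RMSEA"]])
```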
References

AERA, APA and NCME (2014) Standards for educational and psychological testing, Washington, DC: AERA.
APA, AERA and NCME (1974) Standards for educational and psychological tests, Washington, DC: APA.
Bachman, L F and Palmer, A S (1996) Language testing in practice, Oxford: Oxford University Press.
Bachman, L F and Palmer, A S (2010) Language assessment in practice, Oxford: Oxford University Press.
Bernstein, J, Van Moere, A and Cheng, J (2010) Validating automated speaking tests, Language Testing 27 (3), 355–377.
Brenchley, M (2020) Re-examining the reliability of human marking in the Linguaskill Speaking test, Cambridge Assessment English internal research report.
Brown, A (2003) Interviewer variation and the co-construction of speaking proficiency, Language Testing 20 (1), 1–25.
Cambridge Assessment English (2016) Principles of good practice: Research and innovation in language learning and assessment, Cambridge, UK: Cambridge Assessment.
Chapelle, C A, Enright, M K and Jamieson, J M (2010) Does an argument-based approach to validity make a difference? Educational Measurement: Issues and Practice 29 (1), 3–13.
Cheng, L (2014) Consequences, impact, and washback, in Kunnan, A J (Ed.) The Companion to Language Assessment (Vol. III), Chichester, West Sussex: John Wiley and Sons, 1,130–1,146.
Chun, C W (2006) An analysis of a language test for employment: The authenticity of the PhonePass test, Language Assessment Quarterly 3 (3), 295–306.
Cicchetti, D V (1994) Guidelines, criteria, and rules of thumb for evaluating normed and standardized assessment instruments in psychology, Psychological Assessment 6 (4), 284–290.
Cizek, G J and Bunch, M B (2007) Standard setting: A guide to establishing and evaluating performance standards on tests, Thousand Oaks, CA: Sage.
Council of Europe (2001) Common European Framework of Reference for Languages: Learning, teaching, assessment, Strasbourg: Council of Europe.
Council of Europe (2009) Relating Language Examinations to the Common European Framework of Reference for Languages: Learning, Teaching, Assessment (CEFR). A Manual, Strasbourg: Council of Europe.
Council of Europe (2018) Common European Framework of Reference for Languages: Learning, teaching, assessment (Companion volume with new descriptors), Strasbourg: Council of Europe.
Cronbach, L J (1988) Five perspectives on validity argument, in Wainer, H and Braun, H I (Eds) Test validity, Hillsdale, NJ: Lawrence Erlbaum, 3–17.
Cronbach, L J and Meehl, P E (1955) Construct validity in psychological tests, Psychological Bulletin 52 (4), 281–302.
Fan, J (2014) Chinese test takers' attitudes towards the Versant English Test: A mixed-methods approach, Language Testing in Asia 4 (6), 1–17.
Galaczi, E and Taylor, L (2018) Interactional competence: Conceptualisations, operationalisations, and outstanding questions, Language Assessment Quarterly 15 (3), 219–236.
Haertel, E H (2006) Reliability, in Brennan, R L (Ed.) Educational Measurement (4th edn), Westport, CT: Praeger, 65–110.
Jones, E, Brenchley, M and Benjamin, T (2020) An investigation into the hybrid marking model for the Linguaskill Speaking test, Cambridge Assessment English internal research report.
Kane, M T (2013) Validating the interpretations and uses of test scores, Journal of Educational Measurement 50 (1), 1–73.
Knill, K, Gales, M, Kyriakopoulos, K, Malinin, A, Ragni, A, Wang, Y and Caines, A (2018) Impact of ASR performance on free speaking language assessment, Proc. Interspeech 2018, 1,641–1,645. https://doi.org/10.21437/Interspeech.2018-1312
Linacre, J M (1989) Many-facet Rasch measurement, Chicago: MESA Press.
Lopes, S and Cheung, K (2020) Final report on the December 2018 standard setting of the Linguaskill General papers to the CEFR, Cambridge Assessment English internal research report.
Lu, Y, Gales, M, Knill, K, Manakul, P, Wang, L and Wang, Y (2019) Impact of ASR performance on spoken grammatical error detection, Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, September 2019, 1,876–1,880. https://doi.org/10.21437/Interspeech.2019-1706
Malinin, A, Ragni, A, Knill, K and Gales, M (2017) Incorporating uncertainty into deep learning for spoken language assessment, Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics 2, 45–50. https://doi.org/10.18653/v1/P17-2008
Messick, S (1989) Validity, in Linn, R L (Ed.) Educational measurement (3rd edn), New York: Macmillan, 13–103.
Messick, S (1996) Validity and washback in language testing, Language Testing 13 (3), 241–256.
Sawaki, Y, Stricker, L J and Oranje, A H (2009) Factor structure of the TOEFL Internet-based test, Language Testing 26 (1), 5–30.
Shrout, P E and Fleiss, J L (1979) Intraclass correlations: Uses in assessing rater reliability, Psychological Bulletin 86 (2), 420–428.
van Dalen, R C, Knill, K and Gales, M (2015) Automatically grading learners' English using a Gaussian process, SLaTE 2015: Workshop on Speech and Language Technology in Education, 7–12. https://www.isca-speech.org/archive/slate_2015/sl15_007.html
Wang, Y, Gales, M J F, Knill, K M, Kyriakopoulos, K, Malinin, A, van Dalen, R C and Rashid, M (2018) Towards automatic assessment of spontaneous spoken English, Speech Communication 104, 47–56. https://doi.org/10.1016/j.specom.2018.09.002
Weir, C J (2005) Language testing and validation: An evidence-based approach, Basingstoke: Palgrave Macmillan.
Xi, X (2010) Automated scoring and feedback systems: Where are we and where are we heading? Language Testing 27 (3), 291–300.
Xi, X, Schmidgall, J and Wang, Y (2016) Chinese users' perceptions of the use of automated scoring for a speaking practice test, in Yu, G and Jin, Y (Eds) Assessing Chinese learners of English: Language constructs, consequences and conundrums, Basingstoke, Hampshire: Palgrave Macmillan, 150–175.
Xu, J (2015) Predicting ESL learners' oral proficiency by measuring the collocations in their spontaneous speech, unpublished doctoral dissertation, Iowa State University, Ames, IA.
Xu, J and Gallacher, T (2017) Linguaskill Speaking trial report, Cambridge Assessment English internal research report.
Xu, J and Seed, G (2017) Automated speaking tests: Merging technology, assessment and customer needs, paper presented at the Language Testing Forum 2017, Huddersfield, UK.
All details are correct at the time of going to print in June 2020. Copyright © UCLES 2020 | CER/6644/V1/JUN20