This is a preprint of the paper published in the Assessment in Education journal.

Please cite the paper as follows:


Xu, J., Jones, E., Laxton, V., & Galaczi, E. (2021). Assessing L2 English speaking using automated scoring technology:
Examining automarker reliability. Assessment in education: Principles, policy & practice, 28(4), 411–436. https://
doi.org/10.1080/0969594X.2021.1979467
Assessing L2 English speaking using automated scoring technology:
Examining automarker reliability

Jing Xu, Edmund Jones, Victoria Laxton and Evelina Galaczi

Recent advances in machine learning have made automated scoring of candidate


speech widespread, and yet validation research that provides support for applying
automated scoring technology to assessment is still in its infancy. Both the
educational measurement and language assessment communities have called for
greater transparency in describing scoring algorithms and research evidence
about the reliability of automated scoring. This paper reports on a study that
investigated the reliability of an automarker with candidate responses produced in
an online oral English test. Based on ‘limits of agreement’ and multi-faceted
Rasch analyses on automarker scores and individual examiner scores, the study
found that the automarker, while exhibiting excellent internal consistency, was
slightly more lenient than examiner fair average scores, particularly for low-
proficiency speakers. Additionally, it was found that an automarker uncertainty
measure termed Language Quality, which indicates the confidence of speech
recognition, was useful for predicting automarker reliability and flagging
abnormal speech.

Keywords: automated scoring; L2 speaking assessment; limits of agreement

1. Introduction

With the rapid advancement of speech recognition, natural language processing, and

machine learning technologies, training a computer to evaluate learner spoken language

is no longer a dream. Automated scoring is a viable alternative (and, in some cases,

complement) to examiner scoring in speaking assessment in that it improves scoring

consistency, increases the speed of score reporting, reduces the logistical complexity of

test administration and has the potential for generating individualised feedback for

learning. Despite the increased popularity of automated speaking assessment, validation


work in this area is in its infancy. The limited transparency in

how candidate speech is scored by computer algorithms, together with the limited evidence

for the reliability of these algorithms, has not only raised language assessment professionals’ concerns but also

provoked scepticism over automated scoring among language teachers, learners, and

test users (Fan, 2014; Khabbazbashi et al., 2021; Xi, 2012; Xi et al., 2016).

This paper addresses this gap in language assessment by providing evidence

for the performance of the Custom Automated Speech Engine or CASE (v1.9), the

automarker designed for the Cambridge Assessment English Linguaskill General

Speaking test, and by critically discussing traditional approaches to establishing

automarker reliability. The purpose of our paper is twofold: to provide validity

evidence supporting score interpretation for the Linguaskill speaking test (based on an

evaluation of the automarker) and to extend the range of methodologies used for

establishing accuracy measures in automated assessment. We begin with a brief

overview of the functioning of L2 speech automarkers, in particular CASE, and broad

validity considerations for automated speaking assessment.

2. Previous Research

2.1. L2 speech automarker design

The main goal of an automarker is to evaluate a candidate’s spoken language ability and

provide a score that is as close as possible to that of an examiner. As shown in Figure 1,

an automarker for spontaneous speech usually has three main components: a speech

recogniser, a feature extraction module, and a grader (Yu Wang et al., 2018; Xi et al.,

2008).
Insert Figure 1 about here

A speech recogniser conducts automated speech recognition (ASR), which

converts the audio signal of speech into a structured representation of the underlying

word transcription. Two components underlying a speech recogniser are the acoustic

model and the language model. The former maps sound to phonemes/words whereas the

latter estimates the probability of hypothesised word sequences based on training

corpora (Yu & Deng, 2016). To illustrate how a speech recogniser works, Lieberman et

al. (2005, p. 1) list two possible recognition outputs from a hypothetical acoustic model:

“wreck a nice beach you sing calm incense” or “recognise speech using common

sense.” Based on probability estimates gained from the language model, the recogniser

will select the latter as the more likely transcription.

The acoustic model is usually trained via deep neural network models on a set of

accurately transcribed spoken data. The training process involves pairing the audio with

the human expert transcriptions, so that the model learns the association between sounds

and their orthographic representations (Yu & Deng, 2016). The performance of a speech

recogniser is measured by word error rate (WER), i.e. the rate of word-level error in the

machine transcription as compared to the expert transcription.
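
To make the WER definition concrete, the minimal Python sketch below computes word-level edit distance between a reference transcription and an ASR hypothesis; the transcriptions are invented and the implementation is illustrative rather than the one used for CASE.

```python
# Minimal word error rate (WER) sketch: word-level Levenshtein distance
# divided by the number of reference words. Transcriptions are invented.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])  # substitution or match
            d[i][j] = min(sub,
                          d[i - 1][j] + 1,   # deletion
                          d[i][j - 1] + 1)   # insertion
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("recognise speech using common sense",
          "wreck a nice beach using common sense"))  # 0.8 (4 edits / 5 reference words)
```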

The acoustic model of CASE (v1.9) was trained using a “time delay neural

network” (TDNN-F) on 426 hours of Business Language Testing Service (BULATS)

data from 12,375 speakers and 78 hours of individual head-mounted microphone (IHM)

data from the publicly available Augmented Multi-party Interaction (AMI) corpus of

meeting recordings (EST Ltd., 2020). The WER of the CASE speech recogniser is

22.8%, meaning that approximately 77% of words in candidate speech are accurately

transcribed (Lu et al., 2019). This level of accuracy is considerably lower than we
would expect from native speaker speech (e.g. Song, 2020), but on a par with other

ASR research on spontaneous L2 speech (e.g. Yanhong Wang et al., 2020).

The feature extraction module automatically extracts construct-relevant features

from both the audio signal and the transcription generated by the speech recogniser.

These features are used as proxies of human assessment criteria. CASE (v1.9) extracts

features related to intelligibility, intonation, fluency, vocabulary, and grammar. For

example, fluency features include speech rate and normalised frequency of long pauses;

vocabulary features include lexical diversity and associated quantities such as

normalised frequencies of unique unigrams, bigrams, and trigrams in the ASR

transcription. An independent content appropriateness feature, which does not

contribute to score prediction but aims to detect off-topic responses, will be added

soon.
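
As an illustration of the kinds of features described above, the sketch below computes a simple speech rate and an n-gram diversity ratio from a transcription. The exact feature definitions and normalisations used in CASE are not published, so these formulas are simplified assumptions.

```python
# Simplified sketch of two feature families mentioned above: speech rate and
# n-gram diversity. CASE's actual feature definitions are not public, so the
# normalisations here are illustrative assumptions only.
from collections import Counter

def speech_rate(transcript: str, duration_seconds: float) -> float:
    """Words per second over the whole response."""
    return len(transcript.split()) / duration_seconds

def ngram_diversity(transcript: str, n: int) -> float:
    """Unique n-grams divided by total n-grams (a type-token-style ratio)."""
    words = transcript.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    return len(Counter(ngrams)) / max(len(ngrams), 1)

text = "i think the best way to learn a language is to practise the language every day"
print(speech_rate(text, duration_seconds=8.0))   # 2.0 words per second
print(ngram_diversity(text, n=2))                # proportion of unique bigrams
```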

Finally, the grader (also called scoring model) makes predictions of examiner

scores based on the features. Researchers have tried various approaches to grader

design, using regression models (Xi, Higgins, Zechner, & Williamson, 2012),

classification trees (Xi et al. 2012), and non-linear models (Van Moere & Downey,

2016). The grader of CASE (v1.9) uses a Gaussian Process (GP), a statistical model

based on the multivariate normal distribution. The main advantage of a GP is that it

allows the grader to produce an uncertainty measure about its score prediction based on

the similarity between the speech input and the training data; when the input is close to

the training sample, the variance of the predicted score is small (van Dalen, Knill, &

Gales, 2015).
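
The behaviour described above can be illustrated with a generic Gaussian Process regressor from scikit-learn: the predictive standard deviation is small for inputs close to the training data and grows for inputs unlike anything seen during training. The feature vectors and scores below are invented toy data, and this sketch is not the CASE grader itself.

```python
# Illustration of GP score prediction with uncertainty (not the CASE grader).
# Feature vectors and examiner scores are invented toy data.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Toy training data: 2-dimensional "feature" vectors with scores on a 0-6 scale.
X_train = np.array([[1.0, 0.2], [2.0, 0.4], [3.0, 0.5], [4.0, 0.7], [5.0, 0.9]])
y_train = np.array([1.5, 2.5, 3.5, 4.5, 5.5])

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=0.1)
gp.fit(X_train, y_train)

# Predict for one input near the training data and one far from it.
X_new = np.array([[3.2, 0.55],    # similar to training samples -> small std
                  [9.0, 3.00]])   # unlike anything in training -> large std
mean, std = gp.predict(X_new, return_std=True)
for m, s in zip(mean, std):
    print(f"predicted score {m:.2f}, predictive std {s:.2f}")
```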

The training of the grader usually requires a large amount of learner data that

consists of spoken test responses and reliable examiner scores associated with them.
The CASE grader was trained on 2,632 Linguaskill General Speaking tests representing

performances at six Common European Framework of Reference for Languages

(CEFR) levels: Below A1, A1, A2, B1, B2, and C1 or above (EST Ltd., 2020).

CASE produces two uncertainty measures, Language Quality (LQ) and

Assessment Quality (AQ). LQ (generated by the speech recogniser) is an indicator of

ASR confidence, i.e. how confident the system is that words, phonemes, or phrases in a

response have been correctly recognised. Low ASR confidence could result from

unclear or incorrect pronunciation, a strong accent, low-quality audio or excessive

grammatical errors and disfluencies in the speech input (Yu Wang et al., 2018). The AQ

score (generated by the grader based on the GP) indicates how confident the system is

with its score prediction. Both measures are continuous variables ranging from 0 to 1,

and higher values suggest higher levels of confidence.

2.2. Typical task types in an automated speaking test

A number of task types are typically used in automated speaking assessment, generally

falling into two main categories: constrained and free speaking tasks (Xu, 2015).

Common constrained speaking tasks include reading sentences aloud, repeating

sentences, saying opposite words, giving short answers to questions, and building a

sentence from phrases. Such tasks form a large proportion of Pearson’s Versant

English test (formerly PhonePass; Chun, 2006) and the Duolingo English test (Wagner,

2020; Wagner & Kunnan, 2015).

An advantage of constrained speaking tasks is the relative ease of training

automated scoring systems, as candidate responses to such tasks are highly predictable.

However, it has been argued that this test design trades authenticity for practicality and

convenience (Xu, 2015; Chun, 2008; Wagner & Kunnan, 2015). One core validity
consideration against constrained speaking tasks concerns construct under-

representation. In a review of PhonePass, Chun (2006) questions whether the speaking

behaviours observed in the test are a true reflection of everyday use of English. Wagner

(2020) and Wagner and Kunnan (2015) also criticise the Duolingo English Test for

failing to tap into a broader range of cognitive processes, such as the ability to

comprehend and compose language.

Free speaking tasks, in contrast, require candidates to produce open-ended

responses such as speaking on a given topic, describing a visual, and responding to

listening/reading input in integrated tasks. This approach has been adopted in the Test

of English as a Foreign Language (TOEFL) and in Linguaskill (Xi et al., 2008; Xu, et

al., 2020). Compared to constrained speaking tasks, free speaking tasks tap into a more

communication-oriented oral construct and are considered more authentic (Chun, 2006).

Due to current technological limitations, automated systems cannot simulate co-

constructed interactions or generate interactional behaviours such as multi-turn topic

development, comprehension confirmations, follow-up questions, and backchannel

feedback – all features shown to be an important part of the construct of interaction

(Galaczi & Taylor, 2018). This limited representation of interactional skills further adds

to the validity debates around automated speaking assessment and its use for different

purposes (e.g. Xi, 2010; Xu, 2015). Some promising work has been done in exploring

the viability of automated dialogic systems. This suggests that, notwithstanding

limitations in terms of eliciting interactional competence, such systems are feasible

(Litman et al., 2018; Ockey & Chukharev-Hudilainen, in press).


2.3. Validating automated speaking assessment: Conceptual and
methodological approaches

Automated scoring brings about challenges that are not typically associated with

examiner marking, such as the limited range of construct features in making score

decisions and its susceptibility to candidate cheating. As a result, the validity of

automated systems is often questioned (Chun, 2006; Khabbazbashi et al., 2021; Xi,

2010, 2012).

Based on the argument-based approach to test validation (Chapelle et al., 2008;

Kane, 1992, 2006), a series of guiding questions for validating automated language

assessment was suggested by Xi (2010). These are based on six inferential steps that

support the intended interpretation and use of test scores, including domain

representation, evaluation, generalisation, explanation, extrapolation, and utilisation.

Each step is essential for justifying the use of automated scoring and they should fit

together to form a coherent validity argument.

Further to these inferential steps, Xi (2012) suggests that practitioners

attempting to validate the use of automated scoring should assess the relevant validity

issues and then determine which areas need to be prioritised, depending on the amount

of evidence for and against each validity claim. There are two main questions: what is

the intended use of the scores, and how is the automated scoring system used to produce

the scores? Aspects of the validity argument may have differing levels of importance

depending on the stakes of the exam.

Automarker reliability is certainly a key validity question of relevance for

stakeholders. It is usually interpreted as the accuracy and consistency of automated

scoring, and falls under the evaluation and generalisation inferences in Xi’s (2010)
validation framework. For example, validity concerns would be raised if automated

scoring yielded test scores that were inaccurate indicators of the quality of test

performance (Xi, 2010). Inaccurate scores would fail to support score-based claims

about candidates’ ability and thus weaken the entire validity argument. Likewise, if

automated scoring yields inconsistent scores in different measurement contexts (e.g.

more or less accurate under certain conditions such as the extremities of a measurement

scale), then the generalisation step could be questioned and challenged.

Evidence for automarker reliability is usually gathered via an investigation into

the agreement between automated scores and examiner scores. The latest Standards for

Educational and Psychological Testing, for example, recommends that test developers

provide test users with research evidence about the “agreement rates” between

automated scoring algorithms and examiners (AERA et al., 2014, p. 92). To evaluate

this machine–examiner agreement, researchers have commonly used the Pearson

correlation coefficient (hereafter correlation; see Chen et al., 2018; Van Dalen et al.,

2015; Weigle, 2010; Williamson et al., 2012). However, we argue that this is

inappropriate and alternative measures should be used. Correlation is a measure of

linear association, not agreement. For example, if the automarker score is always

exactly ten points higher than the examiner score, the correlation is 1 (perfect) but the

agreement is clearly low (Yannakoudakis & Cummins, 2015).
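
This point can be verified with a brief numerical check (the examiner scores below are hypothetical):

```python
# A perfect Pearson correlation can coexist with poor agreement.
# Hypothetical examiner scores; the automarker is always exactly 10 points higher.
import numpy as np

examiner = np.array([40.0, 55.0, 60.0, 72.0, 85.0])
automarker = examiner + 10.0

print(np.corrcoef(examiner, automarker)[0, 1])   # 1.0: perfect linear association
print(np.mean(automarker - examiner))            # 10.0: every score is off by ten points
```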

Researchers sometimes measure machine–examiner agreement using quadratic-

weighted kappa (QWK; Cohen, 1968). The original kappa (Cohen, 1960) is an adjusted

version of the probability of exact agreement that takes account of the probability of

agreement by chance, and QWK is a modified version that additionally takes account of

the distance between the two variables when they disagree. Kappa and QWK are hard to

interpret – they are not probabilities (in fact they can be negative), and the concept of
adjusting for agreement by chance can be hard to understand. They also behave counter-

intuitively. The simplest example is a test where the only scores are “pass” and “fail”

and an examiner gives “pass” 80% of the time. If there are two automarkers, A and B, it

can happen that A agrees with the examiner 40% of the time, and B agrees with the

examiner 65% of the time, but kappa is higher for A than for B (Yannakoudakis &

Cummins, 2015). This phenomenon arises from the assumption in kappa that if the

automarker and human examiner assigned scores randomly, they would follow fixed

marginal distributions, as if they had been told in advance what proportion of responses

should get each score, as in a norm-referenced test (Brennan & Prediger, 1981;

Yannakoudakis & Cummins, 2015). Similar possibilities exist with QWK and

continuous scores (for more detail on the drawbacks of kappa see Brennan & Prediger,

1981; Di Eugenio & Glass, 2004; Pontius & Millones, 2011).
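
The pass/fail example can be reproduced numerically. In the sketch below, the label sequences are constructed purely to exhibit the phenomenon reported by Yannakoudakis and Cummins (2015): automarker B agrees with the examiner far more often than automarker A, yet Cohen's kappa ranks A higher.

```python
# Reconstructing the counter-intuitive kappa behaviour described above.
# The label sequences are constructed for illustration only.
from sklearn.metrics import cohen_kappa_score

# Examiner awards "pass" to 80 of 100 responses.
examiner = ["pass"] * 80 + ["fail"] * 20

# Automarker A agrees on 40/100 responses (25 passes + 15 fails correct).
marker_a = ["pass"] * 25 + ["fail"] * 55 + ["fail"] * 15 + ["pass"] * 5

# Automarker B agrees on 65/100 responses (65 passes correct, all fails missed).
marker_b = ["pass"] * 65 + ["fail"] * 15 + ["pass"] * 20

print(sum(e == a for e, a in zip(examiner, marker_a)) / 100)  # 0.40 raw agreement
print(sum(e == b for e, b in zip(examiner, marker_b)) / 100)  # 0.65 raw agreement
print(cohen_kappa_score(examiner, marker_a))  # ~ 0.03  (higher kappa)
print(cohen_kappa_score(examiner, marker_b))  # ~ -0.21 (lower kappa despite better agreement)
```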

Bland and Altman (1986, 1999) proposed a better way to evaluate agreement,

called ‘limits of agreement’ (LOAs), which we will apply in this study. The idea of this

approach is to consider the differences between the two measurements—in automated

assessment, the difference is the automarker score minus the examiner score—and

describe their distribution. LOAs are upper and lower bounds that contain 95% of the

differences. If these are close together, and close to zero, the automarker is performing

well. Unlike correlation, LOAs describe the automarker–examiner agreement directly, and

they are easy to understand and interpret (correlation, by contrast, relies on rules of thumb

such as “correlation over 0.7 is strong”). The details of the approach are explained in Section

4.5 and illustrated in Section 5.

Other measures of agreement include the proportion of responses for which the

automarker and human scores are within 0.5 or 1 of each other (Wu et al., 2020). These

are similar to the basic version of LOAs, though LOAs show whether the differences
tend to be positive or negative as well as their spread. QWK, LOAs, and other measures

are all affected in different ways by random error in the scores.

3. Research Questions

The four research questions (RQs) of this study are directly related to the evaluation and

generalisation inferences in Xi’s (2010) argument-based validation framework. The first

question concerns the accuracy of the automarker, i.e. it centres on evidence for the

evaluation inference. The second addresses the consistency and severity of the

automarker, i.e. its focus is evidence for the generalisation inference. The third

investigates the generalisability of automated scores by examining automarker

performance at different confidence levels. The fourth investigates the robustness of the

automarker against abnormal test behaviours and thus provides supporting evidence for

the evaluation inference.

(1) How well does the automarker agree with the examiner gold standard?

(Evaluation)

(2) Does the automarker display the same level of internal consistency and severity

as examiners? (Generalisation)

(3) Are LQ scores and AQ scores useful for identifying unreliable automarker

scores? (Generalisation)

(4) Can the automarker reliably distinguish between English speech and non-

English speech, including gibberish? (Evaluation)

4. Methodology

To answer these questions, we conducted a small-scale automarker evaluation study

using test responses from the Linguaskill General Speaking test. The study was carried
out in 2020 using data gathered in 2019 and CASE (v1.9); later CASE versions may differ

in any of the aspects described here. The numerical scores used in the analyses

were aligned with CEFR levels so that 1–2 is A1, 2–3 is A2, and so on (see Figures 2–

5).

4.1. Linguaskill General Speaking test

The Linguaskill General Speaking test is browser-based so candidates can sit the test on

any computer with a high-speed internet connection and with human invigilation in

place. Questions are presented through the computer screen and headphones, and the

candidate’s responses are recorded and remotely assessed by either computer algorithms

or trained examiners. The test is multi-level, i.e. designed to elicit oral performances of

six CEFR levels.

The test has five parts: Interview, Reading Aloud, Presentation, Presentation

with Visual Information, and Communication Activity. Each part focuses on a different

aspect of speaking ability and is marked independently and weighted equally. The

format, testing aim, and evaluation criteria of the five parts are summarised in Table 1.

Insert Table 1 about here

4.2. Evaluation dataset

The evaluation dataset of the study consisted of 209 test responses randomly selected

from all six proficiency levels of Linguaskill candidates. Two responses were dropped

at the beginning of the study because they were identified by examiners as unmarkable.
Thus, a total of 207 test responses were used. All identifying information about the

candidates was removed when the data was received to comply with the ethics of using

test data. The candidates spoke 30 different first languages (L1s); the top five were

Arabic (n =37), Spanish (n = 31), Portuguese (n = 24), Japanese (n = 22), and

Indonesian (n = 11). The gender distribution was 45.9% female, 48.3% male, and 5.8%

unidentified. Based on fair average scores derived from triple examiner marking (see

Section 4.4. below), the sample included eight (3.9%) below A1 responses, 34 (16.4%)

A1 responses, 41 (19.8%) A2 responses, 50 (24.2%) B1 responses, 49 (23.7%) B2

responses, 15 (7.2%) C1 or above responses, and 10 (4.8%) unusual responses which

are further discussed in Sections 5.1 and 6.1.

A secondary dataset including 19 non-English-speaking test responses was used

to investigate the fourth question. These responses were produced by colleagues of the

authors, all non-native speakers of English, who deliberately attempted to trick the automarker.

They were instructed to talk in their native languages, code-switch, or speak gibberish.

4.3. Human markers

The main evaluation dataset was marked independently by three professional

examiners. They were experienced examiners who had been marking the Linguaskill

General Speaking test since its launch in 2018 and had completed examiner

recertification shortly before the marking exercise. Two of them each had over 10 years

of marking experience on a range of English language tests. The other had been an

examiner for Cambridge English Qualifications for five years before joining

Linguaskill.
4.4. Examiner gold standard

Distinct from previous studies on automarker evaluation (Van Moere, 2012; Xi et al.,

2008), we used fair average scores derived from multifaceted Rasch measurement

(MFRM) as the gold standard criterion of oral proficiency. The fair average scores

resulting from MFRM are average scores adjusted for marker severity (Myford &

Wolfe, 2003, 2004). They can be deemed as scores that would be given by an average

marker chosen from a pool of markers (Linacre, 1989). MFRM is commonly used by

language testing researchers to identify and measure the factors that contribute to

variability in assessment results (Barkaoui, 2014; Brown et al., 2005; Linacre, 1989;

Yan, 2014) and was used in this study to offset differential severity among markers. The

“examiner gold standard scores” discussed in the following sections refer to fair average

scores derived from triple examiner marking. That is, every test part answered by every

candidate was marked independently by all three examiners.

Computing fair average scores based on multiply marked responses is

logistically complex and does not reflect the score reporting practice of the Linguaskill

General Speaking test. Fair average scores were computed to create reliable estimates of

candidate abilities or scores that closely reflect candidates’ true oral proficiency. This is

a standard procedure that the Linguaskill General Speaking test follows in examiner

training, certification, and monitoring in which each examiner’s marking is compared

against the fair average scores resulting from marking by senior examiners. In this

study, we evaluated the automarker using the same examiner gold standard.

4.5. Statistical analysis

The data for this study consisted of automarker scores, automarker uncertainty

measures, examiner raw scores, and examiner fair average scores. The agreement
between automarker and examiner gold standard scores (RQ1) was evaluated using

LOAs, the standard approach in medical science for comparing two methods of clinical

measurement (Bland & Altman, 1986, 1999). This approach is based on looking at the

differences between the two measurements on each of the subjects. The first step is to

make a scatterplot, to check two assumptions—that the distribution of the differences is

approximately constant over the score range, and that they follow a normal distribution.

If the assumptions are met, the second step is to calculate the LOAs. These are upper

and lower bounds such that approximately 95% of the differences lie between them. If

the LOAs are both close to zero, the two measurements agree well with each other. If

the assumptions in the first step are not met, there are modified methods that can be

used for the second step, as illustrated in the next section.
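
A minimal computational sketch of the basic version of the method is given below; the score arrays are placeholders rather than the study data.

```python
# Basic Bland-Altman limits of agreement (LOAs): bias +/- 1.96 * SD of the differences.
# The score arrays below are placeholders, not the study data.
import numpy as np

automarker = np.array([3.4, 2.1, 4.8, 3.9, 5.2, 1.8, 4.1, 3.0])
examiner   = np.array([3.0, 1.9, 4.5, 3.6, 5.0, 1.2, 3.8, 2.7])

diff = automarker - examiner           # automarker score minus examiner score
bias = diff.mean()                     # mean difference
sd = diff.std(ddof=1)                  # sample standard deviation of the differences
loa_lower, loa_upper = bias - 1.96 * sd, bias + 1.96 * sd

print(f"bias = {bias:.2f}, LOAs = ({loa_lower:.2f}, {loa_upper:.2f})")
# In practice the differences would first be plotted against the mean of the two
# scores (as in Figure 3) to check the assumptions before trusting these summaries.
```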

The idea of the approach is that domain experts, such as language testing

researchers or testing organisation employees, can look at the LOAs between the

automarker and the examiner gold standard and judge whether they are satisfactory.

This contrasts with measures such as QWK, for which interpretation often relies on

rather arbitrary rules of thumb, and which do not have much meaning to people who

understand the scoring scale but not statistical science, in addition to the problems

mentioned in Section 2.3.

We also calculated the percentage agreement between the automarker and the

examiner gold standard on CEFR classification. This is simply the proportion of cases

in which the automarker and examiner gold standard give the same CEFR level. We

include this analysis because CEFR levels are the primary test results reported for

Linguaskill. Its disadvantages are that it depends on the number of possible categories

for the classification, and for cases where the automarker and examiner disagree it does

not describe the degree of disagreement.
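
The calculation itself is straightforward, as the sketch below shows; the cut points follow the score-to-CEFR alignment described above (1–2 is A1, 2–3 is A2, and so on), and the score vectors are invented rather than taken from the evaluation dataset.

```python
# Exact and adjacent agreement on CEFR classification. Cut points follow the
# alignment described above (1-2 = A1, 2-3 = A2, and so on); the score vectors
# are invented examples.
import numpy as np

def cefr_band(score: float) -> int:
    """Map a 0-6 score to a band index: 0 = Below A1, 1 = A1, ..., 5 = C1 or above."""
    return min(int(score), 5)

automarker = np.array([2.4, 3.7, 1.2, 4.5, 5.1])
examiner   = np.array([2.1, 2.9, 1.4, 4.4, 5.6])

auto_bands = [cefr_band(s) for s in automarker]
exam_bands = [cefr_band(s) for s in examiner]

exact    = np.mean([a == e for a, e in zip(auto_bands, exam_bands)])
adjacent = np.mean([abs(a - e) <= 1 for a, e in zip(auto_bands, exam_bands)])
print(f"exact agreement: {exact:.1%}, adjacent agreement: {adjacent:.1%}")
```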


Automarker consistency and severity (RQ2) were investigated via the MFRM

analysis on raw scores awarded by each of the three examiners and the automarker,

using the FACETS computer program (Version 3.71; Linacre, 2014). In this analysis the

10 unusual responses were excluded. In contrast to the Bland and Altman method on

overall test scores, the MFRM analysis was performed on test part scores, to take into

consideration the variance in assessment results caused by test items. The examiner

scores were discrete data points from 0 to 6 in increments of 0.5, whereas the

automarker scores on each test part were continuous data from 0 to 6. To put the data

into the same form, the automarker scores were rounded to the nearest 0.5. FACETS

requires integer scores, so both automarker scores and examiner scores were doubled,

and the MFRM analysis was conducted on scores in the form of integers from 0 to 12.

The fair averages resulting from the analysis were then halved to be on the original

scale of measurement.
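
The score transformation described above is summarised in the short sketch below; the part scores and the FACETS output value are invented for illustration.

```python
# Preparing scores for FACETS: round automarker scores to the nearest 0.5, then
# double both automarker and examiner scores so all values are integers from 0
# to 12. Fair averages returned by FACETS are halved afterwards.
import numpy as np

automarker_raw = np.array([3.27, 4.81, 2.04])   # continuous 0-6 part scores (invented)
examiner_raw   = np.array([3.5, 4.5, 2.0])      # already in 0.5 increments

automarker_rounded = np.round(automarker_raw * 2) / 2   # 3.5, 5.0, 2.0
facets_auto = (automarker_rounded * 2).astype(int)      # 7, 10, 4
facets_exam = (examiner_raw * 2).astype(int)            # 7, 9, 4

fair_average_from_facets = 7.76                          # hypothetical FACETS output
print(fair_average_from_facets / 2)                      # back on the 0-6 scale: 3.88
```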

The relationship between automarker uncertainty measures and automarker

accuracy (RQ3) was investigated by making a simple scatterplot of the two

variables. Additionally, a reduced dataset containing selected responses of high

automarker confidence was used to validate the observed relationship. Automarker

sensitivity to non-English speech (RQ4) was investigated by using the difference in the

ASR confidence score between English and non-English speech. A Mann–Whitney test

was used to test this difference, as the normality assumption was not met.
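
For illustration, this comparison can be run in a few lines with SciPy; the LQ values below are invented placeholders rather than the study data.

```python
# Comparing LQ scores for English vs non-English responses with a Mann-Whitney U test.
# The LQ values below are invented placeholders, not the study data.
from scipy.stats import mannwhitneyu

lq_english     = [0.91, 0.88, 0.93, 0.86, 0.90, 0.89, 0.92]
lq_non_english = [0.64, 0.67, 0.62, 0.70, 0.66]

u_stat, p_value = mannwhitneyu(lq_english, lq_non_english, alternative="two-sided")
print(f"U = {u_stat}, p = {p_value:.4f}")
# Complete separation of the two groups yields an extreme U statistic
# (0 or n1*n2, depending on ordering), analogous to the U = 0 reported in Section 5.4.
```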
5. Results

5.1. RQ1: Agreement between automarker and the examiner gold standard

5.1.1. Agreement on test scores

We show three analyses in order to illustrate the variety of methods that are available

with Bland and Altman’s approach. This is also intended to show the steps that a

researcher goes through when analysing this kind of data, including dealing with

unusual values, and to illustrate some common misunderstandings of the approach. As

mentioned above, the “examiner score” was the Rasch fair average of the three

examiners’ scores, which is a gold standard.

Figure 2 shows the automarker and examiner scores, and the diagonal line shows

perfect agreement—a regression line could be added, but this is not recommended and

is not a suitable method for analysing agreement (Bland & Altman 1999, 2003). As

shown on the axes, the 0–6 scale corresponds to the six levels of the CEFR. As can be

seen, there is a clear tendency for the automarker scores to be higher than the examiner

scores. Several individual points are unusual: in five cases the examiner and automarker

scores are both very close to zero; for five others, the examiner score is close to zero but

the automarker score is several bands away.

Insert Figure 2 about here

Bland and Altman recommended a different scatterplot, shown in Figure 3. The

differences, shown on the y-axis, are easier to see, and their downward trend is more
obvious. (This trend could be due to the automarker being poorly calibrated for low and

high scores; see Guo et al., 2017). Bland and Altman also recommended a histogram of

the differences, which we omit for reasons of space. In the scatterplot, if the distribution

of the differences was roughly the same across the range of scores, and normally

distributed, the next step would be to summarise it by calculating three quantities: the

bias, which is defined to be the mean of the differences, and LOAs, which contain 95%

of the differences. The LOAs are calculated as the bias plus or minus 1.96 times the

standard deviation. These are shown in Figure 3 (thick dashed lines), with confidence

intervals, as an illustration of the most basic version of Bland and Altman’s approach.

The idea is that most of the differences lie within the range shown by the LOAs, so if

the LOAs are sufficiently close together and close to zero, then the automarker is

performing well.

Insert Figure 3 about here

For this dataset, the horizontal LOAs do not describe the distribution of the

differences well, because of the downward trend in the differences. The unusual scores

mentioned above also have a distorting effect and do not follow the assumption that the

differences are normally distributed. In order to explore potential reasons for these

unusual scores, we listened to some of the audio files and looked at the examiner scores.

Many of the zero-score responses were silent apart from microphone crackle. For a

further analysis we decided to drop 10 responses that had received examiner scores

below 0.5 when the Rasch fair averages were calculated from 207 responses and had

also received comments from the examiners such as “no meaningful response”. The
automarker performance on these unusual responses indicated that in half of these cases,

the fair average score was extremely close to 0 (below 0.1) and the automarker score

was 0, which is very satisfactory; for the other half the automarker score was between

1.9 and 3.8, which is unsatisfactory. In addition, none of the five unsatisfactory cases

had an LQ score over 0.9, a cut score set to distinguish between reliable and unreliable

automarker scores (see further discussion in Section 5.3).

Figure 4 shows the new scatterplot, based on 197 cases; naturally the LOAs are

closer together. In Figure 5 a modified approach has been used, to give sloping LOAs

that fit the data better (Bland & Altman, 1999). For example, for scores around the

middle of the B1 range, it can be seen that the bias is approximately 0.4 and the LOAs

are –0.7 and +1.4. The sloping LOAs describe the distribution of the differences well,

but they are harder to interpret—it is harder to judge from them whether the automarker

performance is satisfactory.

Insert Figure 4 and Figure 5 about here

To summarise, if the 10 unusual responses are excluded, the automarker tends to

give scores on average 0.41 higher than the gold standard, and 95% of the differences

lie between –0.75 and 1.57. There is a distinct trend for the differences to be higher in

the A1 and A2 range. Of course, the unusual responses cannot be ignored, and the

automarker needs to be developed and trained to deal with these correctly. In this case

the automarker only identified five of the 10 responses that should have been given a

score very close to zero.


In Figures 3–5 we followed Bland and Altman in putting the mean of the two

scores on the x-axis. Their argument for this choice of x-axis assumes that the two

measurement methods have reasonably similar random error (Bland & Altman 1995),

which is probably not the case here, because the Rasch fair average of three human

scores has lower variance than a single score. An alternative might be to put the Rasch

fair average on the x-axis. However, this would not affect our observation of the

downward trend in the differences, which is also visible in Figure 2, or our main results,

which used the fixed horizontal limits of agreement in Figure 4.

5.1.2. Agreement on CEFR classification

In addition to the automarker–examiner agreement on test scores, we also investigated

the agreement on CEFR classification. CEFR classification was chosen since

Linguaskill reports candidates’ CEFR levels as the primary test result. We conducted

this analysis on the 197 responses, as we were more interested in automarker accuracy

on normal responses—unusual responses can be flagged by uncertainty measures (see

Section 5.3) and are normally marked by examiners. By applying CEFR cut scores, we

converted the raw scores awarded by the automarker and the three markers to CEFR

levels. Our question was how well the CEFR classification made by the automarker and

the three markers agreed with the examiner gold standard.

Table 2 shows three types of agreement: exact agreement (assigning to the same

level), adjacent agreement (assigning to the same level or one level up or down), and

disagreement (difference greater than one CEFR level). On the 197 Linguaskill General

Speaking tests, the automarker achieved 48.3% exact agreement and 93.3% adjacent

agreement with the examiner gold standard. Compared with the individual

marks awarded by the three examiners, this performance was slightly worse than
Marker 1 and considerably worse than Marker 2 and Marker 3.

Insert Table 2 about here

5.2. RQ2: Automarker consistency and severity

The MFRM analysis reports individual-level measures for marker consistency via outfit

and infit mean square residuals. As the outfit mean square residual is unweighted and

thus more sensitive to outliers, the infit mean square residual (hereafter, infit) is usually

considered a more robust indicator of a marker’s internal consistency (Yan, 2014).

Ideally, an infit should be close to 1. An infit lower than 1 suggests a higher degree of

predictability or tendency to award the same score; an infit higher than 1 suggests a

higher degree of randomness in marking. Generally, examiners with an infit higher than

1.5 tend to mark inconsistently or unpredictably, whereas markers with an infit lower

than 0.5 tend to be too predictable in their marking behaviours (Linacre, 2014; Yan,

2014). Table 3 shows the infit statistics of the examiners and the automarker. All were

in the range of 0.5–1.5, suggesting they were generally consistent.

In addition to internal consistency, the MFRM analysis provided measures of

marker severity at both group and individual levels. At the group level, FACETS

performed a fixed chi-square test on the null hypothesis that all markers are at the same

level of severity. The chi-square test indicated a significant difference in severity among

the four markers (χ2 = 503.8, df = 3, p < .01). At the individual level, marker severity is

measured in logits with a positive value indicating severity and a negative value

indicating leniency.
As shown in the second column of Table 3, marker severity ranged from −0.42

logits to 0.50 logits. All the severity measures were close to zero, suggesting that none

of the markers were too harsh or too lenient. The automarker, with a severity of –0.42,

was the most lenient, but close to Marker 1, the most consistent marker, whose severity

was –0.40. The ranking of severity among the four markers can be seen in the ‘Marker’

column of the Wright map (Figure 6). The higher a marker is in this column, the more

severe it is. Likewise, marker severity can be inferred from the fair average a marker

awarded to all candidates in the sample. As seen in Table 3, Marker 3, the most severe

marker among the four, had a fair average of 3.32 whereas the automarker had a fair

average of 3.88, which is approximately a half-band difference.

To summarise, the automarker exhibited almost the same level of internal consistency

as the most consistent examiner. The marker severity measures were closely distributed

around zero (M = 0.00, SD = 0.48), suggesting no marker (including the automarker)

was extremely harsh or lenient. However, the automarker was found to be the most

lenient of the four, which confirms the finding from the LOAs analyses above.

Insert Table 3 and Figure 6 about here

5.3. RQ3: Using uncertainty measures to predict unreliable automarker scores

As mentioned in Section 2.1, the automarker produces two uncertainty measures, AQ

and LQ, to indicate its confidence in scoring a test response. Both are continuous

variables ranging from 0 to 1, with a higher value indicating higher confidence. The AQ

score had a small range (0.7, 0.9) and low variability (SD = 0.02) in the present
dataset—95.7% of the test responses received an AQ score of 0.9. For this reason, AQ

was not found to be a useful predictor for automarker reliability.

The LQ score, in contrast, had a wider range (0.46, 0.95) and greater variability

(SD = 0.08). The scatterplot in Figure 7 shows the relationship between LQ (x-axis) and

the absolute differences between the automarker and examiner fair average scores (y-

axis). The unit of measurement for the y-axis is one CEFR band. The clustered

datapoints in the bottom right corner of the scatterplot suggest that the absolute

difference between the automarker and the examiner gold standard tends to be smaller

than one CEFR band (M = 0.40, SD = 0.26) when the LQ score is greater than or equal

to 0.9 (i.e. data points to the right of the red vertical line). In contrast, when the LQ

score is smaller than 0.9 (i.e. data points to the left of the vertical line), this difference

tends to be larger (M = 0.68, SD = 0.51); that is, the likelihood of obtaining a deviant

automarker score increases. For example, the five unusual responses that had inaccurate

automarker scores (see Section 5.1.1.) had an LQ score ranging from 0.62 to 0.85.

Figure 7 only shows the absolute differences, but there was no systematic tendency in

the direction of the differences.

Insert Figure 7 about here

To validate this observation, automarker–examiner percentage agreement on

CEFR categorisation was recalculated after excluding test responses with an LQ score

lower than 0.9. This dataset contained 79 tests with examiner fair average scores

ranging from 1.14 to 5.94, 72 (90%) of which had an examiner fair average over 3.0,
equivalent to B1 proficiency on the CEFR. Table 4 shows the percentage agreement

statistics of the four markers in the reduced dataset. Compared to the numbers from the

original dataset (Table 2), the automarker exact agreement increased from 48.3% to

61.3% and adjacent agreement increased from 93.3% to 100%. The automarker

performance drew closer to that of Marker 1 and Marker 3 but was still much lower

than that of Marker 2.

Insert Table 4 about here

In addition, the MFRM analysis was conducted on the reduced dataset to

confirm the improved automarker performance on responses with high LQ scores. The

results (Table 5) indicate that the automarker had average severity (–0.25) among the

four markers with its internal consistency (infit = 0.89) still being the second best. This

is a visible improvement as compared to its slightly lenient performance on the whole

dataset, as discussed in Section 5.2. The severity ranking is illustrated in the second

column of Figure 8 in which the automarker falls in the middle and is very close to zero,

the middle point on the severity scale.

In short, the LQ score seems to be a useful feature for predicting automarker

reliability. When LQ is higher than 0.9, the automarker tends to achieve a closer

agreement with the examiner gold standard as well as ideal marker severity.

Insert Table 5 and Figure 8 about here

5.4. RQ4: Automatic detection of non-English speech

To investigate the automarker’s ability to detect non-English speech, 19 non-English-


speaking responses were added to the evaluation data. It was hypothesised that the

automarker would produce lower LQ scores on these as the speech recogniser was

designed to indicate lower confidence in transcribing non-English speech.

As expected, the non-English-speaking group received a lower LQ score (M =

0.66, SD = 0.03) than the English-speaking group (M = 0.89, SD = 0.04). Because the

normality assumption was not met, the Mann-Whitney test, a non-parametric test, was

chosen to test the null hypothesis that the two groups were the same. This suggested

a significant difference, with U = 0, p < .01; a U value of zero indicates a complete

separation of datapoints between the two groups. This separation can be observed in the

boxplot shown in Figure 9.

To summarise, the LQ score generated by the speech recogniser seems a reliable

measure for judging whether a test response is in English or not. In the English-

speaking group, A1 and A2 candidates, who usually produce unintelligible or accented

speech, received an average LQ score of 0.86 with a range from 0.78 to 0.91. Thus, if

the LQ score is below 0.7, there is a high chance that the speech is not English.

Insert Figure 9 about here

6. Conclusions

6.1. CASE performance on scoring spontaneous L2 English speech

For the 95.2% of responses with a fair average score above 0.5, the LOAs indicated that

the average difference between CASE and the examiner gold standard was 0.41, with
most of the differences lying between –0.75 and 1.57 (1 usually represents one CEFR

band). For the other responses, CASE agreed closely with the examiner gold standard

50% of the time. This level of agreement is inadequate to justify a decision to fully

replace examiners with the automarker in high-stakes assessment contexts (Xi, 2010),

but it could be considered adequate for low-stakes speaking practice in which

candidates are keen to receive quick feedback on their oral performance.

Based on the MFRM analysis, CASE exhibited a high level of internal

consistency and was almost on a par with the most consistent examiner. When

uncertainty measures were not considered, the automarker had moderate exact and

adjacent agreement on CEFR grading with the examiner gold standard. It also tended to

be more lenient than the examiner gold standard, particularly in the A1–A2 range.

However, the automarker performance was considerably more accurate in cases where

it indicated high confidence in speech recognition. In a subset of the data in which LQ

scores were greater than or equal to 0.9, the automarker achieved average severity

among the four markers and was nearly as accurate in CEFR grading as two of the

human markers. These findings seem to suggest that speech intelligibility and audio

quality may have a significant impact on automarker reliability. In other words, the

automarker tended to score clear-speaking candidates more reliably.

The LQ score generated by the speech recogniser was also found useful for

detecting non-English speech. Generally, the LQ score of low-proficiency English

speech was still much higher than non-English speech or gibberish. Human inspection

of test responses with a low LQ score seems a viable means of detecting malpractice (Xi

et al., 2016).

This research cannot be straightforwardly compared with previous automarker


evaluation studies due to the differences in test format and research methodology. In

this study, we used LOAs to measure the deviation of the automarker scores from the

examiner gold standard on the original scale of measurement, i.e. the CEFR. We also

showed how to construct confidence intervals for this deviation at different score ranges

so that varying automarker performance on responses at different levels of proficiency

could be scrutinised. In contrast, previous research used standardised measures, such as

Pearson correlation and Cohen’s kappa (e.g. Higgins et al., 2011; Van Moere et al.,

2012; Yu Wang et al., 2018), which we argue are unsuitable for automarker evaluation

and hard to interpret (see Section 2.3). Unlike in previous automarker research, we used

examiner fair averages resulting from MFRM as the examiner gold standard. The fair

average score is generally considered a more reliable measure of candidate ability than

an arithmetic average or single marking in that the fair average score is statistically

adjusted for marker severity based on marker behaviours in the entire sample (Myford

& Wolfe, 2003, 2004).

To conclude, this study, which addresses the evaluation and generalisation

inferences in a validity argument for automated scoring, did not find evidence to

support the use of the automarker (CASE v1.9) on its own in the Linguaskill General

Speaking test, a relatively high-stakes English language assessment. However, the study

highlighted promising aspects of the automarker, including its internal consistency of

marking, ability to indicate its confidence and flag abnormal responses, and human-like

reliability on selected speech samples.

6.2. Implications

One of the aims of this study was to address the lack of transparency in automarker

research and validation (Khabbazbashi et al., 2021). Accordingly, the first major

contribution of the present research is that it provides the stakeholders of Linguaskill


and language testing researchers with up-to-date information about a state-of-the-art

speech automarker, as well as empirical data about its performance.

Second, the study introduced a new method, based on LOAs, to validate

automated scoring. We argue that LOAs are measures of agreement, but correlation is

not. This paper demonstrated how LOAs could be used to 1) examine automarker

agreement with, or deviation from, the examiner gold standard on the original scale of

measurement and 2) construct 95% confidence intervals for the differences. LOAs

arguably present more granular information about automarker reliability than a simple

correlation or kappa coefficient. In addition, we combined LOAs with MFRM, a more

traditional method for evaluating marker severity and consistency, and found that the

two methods generated converging evidence.

Third, the research on the relationship between uncertainty measures and

automarker reliability has shown the potential of implementing a hybrid approach to

marking computer-based speaking assessment. That is, if the automarker uncertainty

measures such as the LQ score are useful in predicting automarker reliability, then

examiner marking may only be needed when the automarker is predicted to be

unreliable. By setting thresholds on the uncertainty measures, we can divide test

responses into two categories: those that must be marked by examiners and those on

which computer marking is trustworthy. This hybrid approach combines the strengths

and benefits of artificial intelligence with those of examiners and can improve the

overall efficiency of marking without lowering score reliability (see Xu et al., 2020).
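
A minimal sketch of this routing logic is given below; the threshold of 0.9 follows the analysis in Section 5.3, and the function and field names are illustrative rather than part of CASE.

```python
# Hybrid marking: route a response to an examiner when the automarker's own
# uncertainty measure (LQ) falls below a threshold. Names are illustrative;
# the 0.9 threshold follows the analysis in Section 5.3.
LQ_THRESHOLD = 0.9

def route_response(automarker_score: float, lq: float) -> dict:
    if lq >= LQ_THRESHOLD:
        # High ASR confidence: the automated score can be reported directly.
        return {"score": automarker_score, "marked_by": "automarker"}
    # Low ASR confidence: flag for examiner marking (and possible malpractice review).
    return {"score": None, "marked_by": "examiner", "flag": "low LQ"}

print(route_response(4.2, lq=0.93))   # trusted automated score
print(route_response(3.1, lq=0.68))   # sent to a human examiner
```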

6.3. Limitations

The present study had two major limitations. First, we did not have enough data for a

thorough investigation of automarker performance on abnormal test behaviours. As


such, that part of the study should be seen as exploratory. Second, the study was

conducted on a relatively small sample covering a limited number of first languages. It

is not known how the automarker would perform on unfamiliar accents. The responses

were exclusively retrieved from the Linguaskill General Speaking test, so the findings

cannot be generalised to the Linguaskill Business Speaking test or other tests in which

test-taking behaviours may be different. These limitations are being addressed as part of

the ongoing validation research on Linguaskill.

Acknowledgments
We would like to thank Mark Gales, Kate Knill, Trevor Benjamin, the editors, and the

anonymous reviewers for their stimulating and helpful comments on the manuscript.

Grateful acknowledgements are extended to Annabelle Pinnington, David Dursun,

Martin Robinson, Bronagh Rolph, Kevin Cheung, Ardeshir Geranpayeh, and John

Savage, for their advice and assistance.

References

AERA, APA, & NCME. (2014). Standards for educational and psychological testing.

AERA.

Barkaoui, K. (2014). Multifaceted Rasch analysis for test evaluation. In A. J. Kunnan

(Ed.), The companion to language assessment (Vol. III, pp. 1301–1322). John

Wiley & Sons.

Bland, J. M., & Altman, D. G. (1986). Statistical methods for assessing agreement

between two methods of clinical measurement. Lancet, 327(8476), 307–310.

https://doi.org/10.1016/S0140-6736(86)90837-8

Bland, J. M., & Altman, D. G. (1999). Measuring agreement in method comparison


studies. Statistical Methods in Medical Research, 8(2), 135–160.

https://doi.org/10.1177/096228029900800204

Bland, J. M., & Altman, D. G. (1995). Comparing methods of measurement: Why

plotting difference against standard method is misleading. Lancet, 346(8982),

1085–1087. https://doi.org/10.1016/S0140-6736(95)91748-9

Bland, J. M. & Altman, D. G. (2003). Applying the right statistics: Analyses of

measurement studies. Ultrasound in Obstetrics & Gynecology, 22(1), 85–93.

https://doi.org/10.1002/uog.122

Brennan, R. L., & Prediger, D. J. (1981). Coefficient kappa: Some uses, misuses, and

alternatives. Educational and Psychological Measurement, 41(3), 687-699.

https://doi.org/10.1177/001316448104100307

Brown, A., Iwashita, N., & McNamara, T. (2005). An Examination of rater orientations

and test-taker performance on English-for-academic-purposes speaking tasks

(ETS Research Report No. RR-05-05). Educational Testing Service.

http://dx.doi.org/10.1002/j.2333-8504.2005.tb01982.x

Chapelle, C. A., Enright, M. K., & Jamieson, J. M. (Eds.). (2008). Building a validity

argument for the Test of English as a Foreign Language. Routledge.

Chen, L., Zechner, K., Yoon, S.-Y., Evanini, K., Wang, X., Loukina, A., Tao, J., Davis,

L., Lee, C. M., Ma, M., Mundkowsky, R., Lu, C., Leong, C. W., & Gyawali, B.

(2018). Automated scoring of nonnative speech using the SpeechRater v. 5.0

Engine (ETS Research Report No. RR-18-10). Educational Testing Service.

https://doi.org/10.1002/ets2.12198

Chun, C. W. (2006). Commentary: An analysis of a language test for employment: The


authenticity of the PhonePass test. Language Assessment Quarterly, 3(3), 295–

306. https://doi.org/10.1207/s15434311laq0303_4

Chun, C. W. (2008). Comments on ‘Evaluation of the usefulness of the Versant for

Englsih Test: A response’: The author responds. Language Assessment

Quarterly, 5(2), 168–172. https://doi.org/10.1080/15434300801934751

Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and

Psychological Measurement, 20(1), 37–46.

https://doi.org/10.1177/001316446002000104

Cohen, J. (1968). Weighted kappa: Nominal scale agreement provision for scaled

disagreement or partial credit. Psychological Bulletin, 70(4), 213–220.

https://doi.org/10.1037/h0026256

Di Eugenio, B., & Glass, M. (2004). The kappa statistic: A second look. Computational

Linguistics, 30(1), 95–101. https://doi.org/10.1162/089120104773633402

Enhanced Speech Technology Ltd. (2020). EST custom automated speech engine

(CASE) v1.0. Cambridge English internal research report.

Fan, J. (2014). Chinese test takers' attitudes towards the Versant English Test: A mixed-

methods approach. Language Testing in Asia, 4(6), 1–17.

https://doi.org/10.1186/s40468-014-0006-9

Galaczi, E., & Taylor, L. (2018). Interactional competence: Conceptualisations,

operationalisations, and outstanding questions. Language Assessment Quarterly,

15(3), 219-236. https://doi.org/10.1080/15434303.2018.1453816

Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017). On calibration of modern
neural networks. Proceedings of Machine Learning Research, 70, 1321–1330.

http://proceedings.mlr.press/v70/guo17a.html

Higgins, D., Xi, X., Zechner, K., & Williamson, D. M. (2011). A three-stage approach

to the automated scoring of spontaneous spoken responses. Computer Speech

and Language, 25, 282–306. https://doi.org/10.1016/j.csl.2010.06.001

Kane, M. T. (1992). An argument-based approach to validity. Psychological Bulletin,

112(3), 527–535.

Kane, M. T. (2006). Validation. In R. L. Brennan (Ed.), Educational Measurement (4th

ed., pp. 17–64). Praeger.

Khabbazbashi, N., Xu, J., & Galaczi, E. (2021). Opening the black box: Exploring

automated speaking evaluation. In B. Lanteigne, C. Coombe, & J. D. Brown

(Eds.), Challenges in language testing around the world (pp. 333-343).

Springer.

Lieberman, H., Faaborg, A., Daher, W., & Espinosa, J. (2005). How to wreck a nice

beach you sing calm incense [Paper presentation]. 10th International Conference

on Intelligent User Interfaces, San Diego, CA, USA.

Linacre, J. M. (1989). Many-facet Rasch measurement. MESA Press.

Linacre, J. M. (2014). Many-facet Rasch measurement: Facets tutorial. Winsteps.

https://www.winsteps.com/tutorials.htm

Litman, D., Strik, H., & Lim, G.S. (2018). Speech technologies and the assessment of

second language speaking: Approaches, challenges, and opportunities. Language

Assessment Quarterly, 15(3), 294–309.


https://doi.org/10.1080/15434303.2018.1472265

Lu, Y., Gales, M., Knill, K., Manakul, P., Wang, L., & Wang, Y. (2019). Impact of

ASR performance on spoken grammatical error detection. Proc. Interspeech

2019, 1876–1880. https://doi.org/10.21437/Interspeech.2019-1706

Myford, C. M., & Wolfe, E. W. (2003). Detecting and measuring rater effects using

many-facet rasch measurement: Part I. Journal of Applied Measurement, 4(4),

386–422.

Myford, C. M., & Wolfe, E. W. (2004). Detecting and measuring rater effects using

many-facet Rasch measurement: Part II. Journal of Applied Measurement, 5(2),

189–227.

Ockey, G. J., & Chukharev-Hudilainen, E. (in press). Human versus computer partner in

the paired oral discussion test. Applied Linguistics.

https://doi.org/10.1093/applin/amaa067

Pontius, R. G., & Millones, M. (2011). Death to kappa: Birth of quantity disagreement

and allocation disagreement for accuracy assessment. International Journal of

Remote Sensing, 32(15), 4407–4429.

https://doi.org/10.1080/01431161.2011.552923

Song, Z. (2020). English speech recognition based on deep learning with multiple

features. Computing, 102(3), 663–682. https://doi.org/10.1007/s00607-019-00753-0

Van Dalen, R. C., Knill, K. M., & Gales, M. J. F. (2015). Automatically grading

learners’ English using a Gaussian process. Workshop on speech and language

technology in education (SLaTe) 2015, 7–12.

https://www.slate2015.org/files/SLaTE2015-Proceedings.pdf
Van Moere, A. (2012). A psycholinguistic approach to oral language assessment.

Language Testing, 29(3), 325–344. https://doi.org/10.1177/0265532211424478

Van Moere, A., & Downey, R. (2016). Technology and artificial intelligence in

language assessment. In D. Tsagari & J. Banerjee (Eds.), Handbook of second

language assessment (pp. 341–358). Walter de Gruyter.

Wagner, E. (2020). Duolingo English test, Revised version July 2019. Language

Assessment Quarterly, 17(3), 300–315.

https://doi.org/10.1080/15434303.2020.1771343

Wagner, E., & Kunnan, A. J. (2015). The Duolingo English test. Language Assessment

Quarterly, 12(3), 320–331. https://doi.org/10.1080/15434303.2015.1061530

Wang, Y. [Yu], Gales, M. J. F., Knill, K. M., Kyriakopoulos, K., Malinin, A., van

Dalen, R. C., & Rashid, M. (2018). Towards automatic assessment of

spontaneous spoken English. Speech Communication, 104, 47–56.

https://doi.org/10.1016/j.specom.2018.09.002

Wang, Y. [Yanhong], Luan, H., Yuan, J., Wang, B., Lin, H. (2020) LAIX corpus of

Chinese learner English: Towards a benchmark for L2 English ASR. Proc.

Interspeech 2020, 414–418. https://www.isca-

speech.org/archive/Interspeech_2020/pdfs/1677.pdf

Weigle, S. C. (2010). Validation of automated scores of TOEFL iBT tasks against non-

test indicators of writing ability. Language Testing, 27(3), 335–353.

https://doi.org/10.1177/0265532210364406

Williamson, D. M., Xi, X., & Breyer, F. J. (2012). A framework for evaluation and use

of automated scoring. Educational Measurement: Issues and Practice, 31(1), 2–


13. https://doi.org/10.1111/j.1745-3992.2011.00223.x

Wu, X., Knill, K., Gales, M., & Malinin, A. (2020). Ensemble approaches for

uncertainty in spoken language assessment. Proc. Interspeech 2020, 3860–

3864. https://doi.org/10.21437/Interspeech.2020-2238

Xi, X. (2010). Automated scoring and feedback systems: Where are we and where are

we heading? Language Testing, 27(3), 291–300.

https://doi.org/10.1177/0265532210364643

Xi, X. (2012). Validity in the automated scoring of performance tests. In G. Fulcher &

F. Davidson (Eds.), The Routledge handbook of language testing (pp. 438–451).

Routledge.

Xi, X., Higgins, D., Zechner, K., & Williamson, D. M. (2008). Automated scoring of

spontaneous speech using SpeechRaterSM v1.0. (ETS Research Report No. RR-

08-62). Educational Testing Service. https://doi.org/10.1002/j.2333-

8504.2008.tb02148.x

Xi, X., Higgins, D., Zechner, K., & Williamson, D. (2012). A comparison of two

scoring methods for an automated speech scoring system. Language Testing,

29(3), 371–394. https://doi.org/10.1177/0265532211425673

Xi, X., Schmidgall, J., & Wang, Y. (2016). Chinese users' perceptions of the use of

automated scoring for a speaking practice test. In G. Yu & Y. Jin (Eds.),

Assessing Chinese learners of English: Language constructs, consequences and

conundrums (pp. 150–175). Palgrave Macmillan.


Xu, J. (2015). Predicting ESL learners’ oral proficiency by measuring the collocations

in their spontaneous speech [Doctoral dissertation, Iowa State University]. Iowa

State University Digital Repository. https://doi.org/10.31274/etd-180810-4474

Xu, J., Brenchley, J., Jones, E., Pinnington, A., Benjamin, T., Knill, K., Seal-Coon, G.,

Robinson, M., & Geranpayeh, A. (2020). Linguaskill: Building a validity

argument for the Speaking test. Cambridge Assessment English.

https://www.cambridgeenglish.org/Images/589637-linguaskill-building-a-

validity-argument-for-the-speaking-test.pdf

Yan, X. (2014). An examination of rater performance on a local oral English

proficiency test: A mixed-methods approach. Language Testing, 31(4), 501–527.

https://doi.org/10.1177/0265532214536171

Yannakoudakis, H., & Cummins, R. (2015). Evaluating the performance of Automated

Text Scoring systems. In J. Tetreault, J. Burstein & C. Leacock (Eds.),

Proceedings of the Tenth Workshop on Innovative Use of NLP for Building

Educational Applications (pp. 213–223). Association for Computational

Linguistics. https://doi.org/10.3115/v1/W15-0625

Yu, D., & Deng, L. (2016). Automatic speech recognition. Springer.


Table 1. An overview of the Linguaskill General Speaking test design
Part | Task | Description | Length of response(s) | Preparation time | Weight
1 | Interview | The candidate answers eight questions about themselves. | 4 × 10 secs and 4 × 20 secs | none | 20%
2 | Reading Aloud | The candidate reads eight sentences aloud. | 8 × 10 secs | none | 20%
3 | Presentation | The candidate speaks on a given topic. | 1 minute | 40 secs | 20%
4 | Presentation with Visual Information | The candidate gives a presentation based on visual information. | 1 minute | 1 minute | 20%
5 | Communication Activity | The candidate gives responses to five opinion-focused questions related to a scenario. | 5 × 20 secs | 40 secs | 20%

Table 2. Agreement on CEFR classification between the examiner gold standard and single marking from automarker and three markers (n = 197).
Marker | Exact agreement with examiner gold standard | Adjacent agreement with examiner gold standard
Marker 1 | 55.2% | 100%
Marker 2 | 83.7% | 99.5%
Marker 3 | 73.4% | 99.5%
Automarker | 48.3% | 93.3%
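As an illustration of how the exact and adjacent agreement percentages in Tables 2 and 4 can be computed, the short Python sketch below compares a single marking against the examiner gold standard. It is not the operational scoring code; the CEFR level ordering and the example data are hypothetical.

```python
# Illustrative sketch only: exact and adjacent agreement between two sets of
# CEFR classifications (cf. Tables 2 and 4). Level labels and data are hypothetical.

CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1"]  # assumed ordering of adjacent levels


def agreement(gold, single):
    """Return (exact, adjacent) agreement proportions between two label lists."""
    assert len(gold) == len(single) and len(gold) > 0
    rank = {level: i for i, level in enumerate(CEFR_LEVELS)}
    exact = sum(g == s for g, s in zip(gold, single)) / len(gold)
    adjacent = sum(abs(rank[g] - rank[s]) <= 1 for g, s in zip(gold, single)) / len(gold)
    return exact, adjacent


# Hypothetical example: examiner gold standard vs. one marker's single marking
gold = ["B1", "B2", "A2", "C1", "B1"]
marker = ["B1", "B1", "A2", "B2", "B2"]
print(agreement(gold, marker))  # (0.6, 1.0)
```

Adjacent agreement treats a classification within one CEFR level of the gold standard as a match, which is why it is never lower than exact agreement.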

Table 3. Measurement estimates of the reliability of three markers and automarker (n = 197).
Marker | Severity | Fair Avg | Model SE | Infit MSq | Infit ZStd | Outfit MSq | Outfit ZStd | PtBis Corr
Marker 1 | –0.40 | 3.87 | 0.04 | 1.01 | 0.10 | 0.97 | –0.30 | 0.85
Marker 2 | 0.33 | 3.44 | 0.04 | 1.11 | 1.60 | 1.14 | 2.10 | 0.86
Marker 3 | 0.50 | 3.32 | 0.04 | 0.78 | –3.50 | 0.91 | –1.40 | 0.91
Automarker | –0.42 | 3.88 | 0.04 | 0.93 | –0.90 | 1.07 | 1.00 | 0.82
Fixed (all same) chi-square: 503.8, df = 3, p < .01
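For reference, the fit indices reported in Tables 3 and 5 follow the standard definitions used in the many-facet Rasch literature (e.g. Linacre, 1989; Myford & Wolfe, 2003). The equations below restate those definitions in our own notation; they are a reminder of the general formulas rather than a reproduction of the Facets output.

\[
Z_{nj} = \frac{X_{nj} - E_{nj}}{\sqrt{W_{nj}}}, \qquad
\text{Outfit MSq}_j = \frac{1}{N_j} \sum_{n=1}^{N_j} Z_{nj}^{2}, \qquad
\text{Infit MSq}_j = \frac{\sum_{n=1}^{N_j} W_{nj}\, Z_{nj}^{2}}{\sum_{n=1}^{N_j} W_{nj}}
\]

where, for marker j, X_nj is the observed rating on response n, E_nj is the model-expected rating, W_nj is its model variance and N_j is the number of ratings given by that marker. Values near 1 indicate ratings consistent with model expectations; the ZStd columns report these mean squares transformed to approximately unit-normal z-statistics. The fair average is the rating a candidate would be expected to receive from a marker of average severity, which places marker differences back on the original score scale.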
Table 4. Agreement on CEFR classification between the examiner gold standard and single marking from automarker and three markers when LQ is greater than 0.9 (n = 79).
Marker | Exact agreement with examiner gold standard | Adjacent agreement with examiner gold standard
Marker 1 | 64.1% | 100%
Marker 2 | 85.9% | 100%
Marker 3 | 69.2% | 100%
Automarker | 61.3% | 100%

Table 5. Measurement estimates of marker reliability on a subset of test data in which LQ is larger than 0.9 (n = 79).
Marker | Severity | Fair Avg | Model SE | Infit MSq | Infit ZStd | Outfit MSq | Outfit ZStd | PtBis Corr
Marker 1 | –0.66 | 4.47 | 0.07 | 0.84 | –1.50 | 0.77 | –2.30 | 0.84
Marker 2 | 0.34 | 4.03 | 0.07 | 1.16 | 1.40 | 1.22 | 2.00 | 0.86
Marker 3 | 0.57 | 3.92 | 0.07 | 0.95 | –0.40 | 1.06 | 0.50 | 0.90
Automarker | –0.25 | 4.30 | 0.07 | 0.89 | –0.90 | 1.01 | 0.10 | 0.79
Fixed (all same) chi-square: 189.7, df = 3, p < .01
Figure 1. Typical architecture of a speech automarker.
Figure 2. Basic scatterplot of the automarker and human scores. The human scores are the Rasch
fair average of three human examiners’ scores.
Figure 3. Bland–Altman plot of the automarker and human scores. The bias is shown by the thick solid line and the limits of agreement (LOAs) by the thick dashed lines; because the differences show a downward trend, however, the LOAs do not summarise their distribution satisfactorily. The numbers in parentheses are 95% confidence intervals for the bias and the two LOAs.
Figure 4. Bland–Altman plot with unusual responses excluded. As in Figure 3, the bias is shown
by the thick solid line and the LOAs by the thick dashed lines.
Figure 5. Bland–Altman plot, with unusual responses excluded, and with sloping limits of
agreement (LOAs) calculated using a regression-based method.
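To make the quantities plotted in Figures 3–5 concrete, the sketch below computes the bias, the conventional 95% limits of agreement and a simple regression-based version of sloping LOAs in which the differences are regressed on the means. It is an illustration of the general method rather than the authors' analysis script; in particular, the regression approach shown for the sloping LOAs and the confidence-interval approximations are our assumptions and may differ from what was used for Figure 5.

```python
# Illustrative sketch only (not the authors' analysis code): Bland-Altman bias,
# conventional 95% limits of agreement (LOAs), and one simple regression-based
# sloping-LOA variant (cf. Figures 3-5). The confidence-interval formulas are
# commonly used large-sample approximations.
import numpy as np


def bland_altman(auto, human):
    """Bias and 95% LOAs for automarker scores against human fair averages."""
    auto, human = np.asarray(auto, dtype=float), np.asarray(human, dtype=float)
    diff = auto - human                      # automarker minus human
    n = len(diff)
    bias, sd = diff.mean(), diff.std(ddof=1)
    lower, upper = bias - 1.96 * sd, bias + 1.96 * sd
    se_bias = sd / np.sqrt(n)                # standard error of the bias
    se_loa = sd * np.sqrt(3.0 / n)           # approximate standard error of each LOA
    ci = lambda x, se: (x - 1.96 * se, x + 1.96 * se)
    return {"bias": (bias, *ci(bias, se_bias)),
            "lower_loa": (lower, *ci(lower, se_loa)),
            "upper_loa": (upper, *ci(upper, se_loa))}


def sloping_loas(auto, human):
    """Return a function m -> (lower, upper) giving LOAs that vary with the
    mean score m, from a least-squares fit of the differences on the means."""
    auto, human = np.asarray(auto, dtype=float), np.asarray(human, dtype=float)
    diff, mean = auto - human, (auto + human) / 2
    slope, intercept = np.polyfit(mean, diff, 1)   # diff ~ intercept + slope * mean
    resid_sd = (diff - (intercept + slope * mean)).std(ddof=1)
    return lambda m: (intercept + slope * m - 1.96 * resid_sd,
                      intercept + slope * m + 1.96 * resid_sd)
```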
Figure 6. Wright map for multiple test responses (n = 197).
Figure 7. Basic scatterplot of the LQ score and absolute difference between automarker and human
fair average scores.
Figure 8. Wright map for the subset of test responses for which the automarker indicated higher confidence in its marking (n = 80).
Figure 9. Distributions of LQ scores by English vs. non-English speech (n = 217).
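Tables 4 and 5 and Figures 7–9 suggest one practical use of the Language Quality (LQ) measure: responses recognised with low confidence can be routed to examiners rather than scored automatically. The sketch below illustrates such a routing rule; the 0.9 threshold mirrors the subset analysis reported here, but the data structure, field names and example values are hypothetical.

```python
# Illustrative sketch only: routing responses by the automarker's Language
# Quality (LQ) confidence measure (cf. Tables 4-5, Figures 7-9).
# Field names and example data are hypothetical.

def route_by_lq(responses, threshold=0.9):
    """Split responses into automarker-scored and flagged-for-examiner sets."""
    auto_scored = [r for r in responses if r["lq"] > threshold]
    flagged = [r for r in responses if r["lq"] <= threshold]
    return auto_scored, flagged


responses = [
    {"id": "cand_001", "lq": 0.95, "auto_score": 4.2},
    {"id": "cand_002", "lq": 0.55, "auto_score": 3.1},  # low ASR confidence, e.g. non-English or off-topic speech
]
auto_scored, flagged = route_by_lq(responses)
print([r["id"] for r in flagged])  # ['cand_002']
```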
