Assessing L2 English Speaking Using Automated Scoring Technology: Examining Automarker Reliability
1. Introduction
With the rapid advancement of speech recognition, natural language processing, and
consistency, increases the speed of score reporting, reduces the logistical complexity of
test administration and has the potential for generating individualised feedback for
how candidate speech is scored by computer algorithms and evidence for the reliability
of these algorithms has not only raised language assessment professionals’ concerns but
also provoked scepticism over automated scoring among language teachers, learners, and
test users (Fan, 2014; Khabbazbashi et al., 2021; Xi, 2012; Xi et al., 2016).
for the performance of the Custom Automated Speech Engine or CASE (v1.9), the
evidence supporting score interpretation for the Linguaskill speaking test (based on an
evaluation of the automarker) and to extend the range of methodologies used for
2. Previous Research
The main goal of an automarker is to evaluate a candidate’s spoken language ability and
an automarker for spontaneous speech usually has three main components: a speech
recogniser, a feature extraction module, and a grader (Yu Wang et al., 2018; Xi et al.,
2008).
Insert Figure 1 about here
converts the audio signal of speech into a structured representation of the underlying
word transcription. Two components underlying a speech recogniser are the acoustic
model and the language model. The former maps sound to phonemes/words whereas the
corpora (Yu & Deng, 2016). To illustrate how a speech recogniser works, Lieberman et
al. (2005, p. 1) list two possible recognition outputs from a hypothetical acoustic model:
“wreck a nice beach you sing calm incense” or “recognise speech using common
sense.” Based on probability estimates gained from the language model, the recogniser
The acoustic model is usually trained via deep neural network models on a set of
accurately transcribed spoken data. The training process involves pairing the audio with
the human expert transcriptions, so that the model learns the association between sounds
and their orthographic representations (Yu & Deng, 2016). The performance of a speech
recogniser is measured by word error rate (WER), i.e. the rate of word-level error in the
The acoustic model of CASE (v1.9) was trained using a “time delay neural
data from 12,375 speakers and 78 hours of individual head-mounted microphone (IHM)
data from the publicly available Augmented Multi-party Interaction (AMI) corpus of
meeting recordings (EST Ltd., 2020). The WER of the CASE speech recogniser is
22.8%, meaning that approximately 77% of words in candidate speech are accurately
transcribed (Lu et al., 2019). This level of accuracy is considerably lower than we
would expect from native speaker speech (e.g. Song, 2020), but on a par with other
from both the audio signal and the transcription generated by the speech recogniser.
These features are used as proxies of human assessment criteria. CASE (v1.9) extracts
example, fluency features include speech rate and normalised frequency of long pauses;
contribute to score prediction, will be added soon, and aims to detect off-topic
responses.
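To make the feature extraction step concrete, the sketch below computes two fluency proxies of the kind mentioned above, speech rate and the normalised frequency of long pauses, from a time-aligned transcription. It is a minimal illustration: the Token structure, its field names, and the 0.5-second pause threshold are our assumptions, not documented CASE internals.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Token:
    """One recognised word with its time alignment (hypothetical structure)."""
    word: str
    start: float  # seconds
    end: float    # seconds

def fluency_features(tokens: List[Token], long_pause: float = 0.5) -> Dict[str, float]:
    """Compute two illustrative fluency proxies from a time-aligned transcription.

    The 0.5-second long-pause threshold is an assumption for illustration,
    not a documented CASE setting.
    """
    if not tokens:
        return {"speech_rate": 0.0, "long_pauses_per_word": 0.0}
    duration = tokens[-1].end - tokens[0].start
    n_words = len(tokens)
    # Count silent gaps between consecutive words that exceed the threshold.
    long_pauses = sum(
        1 for prev, curr in zip(tokens, tokens[1:]) if curr.start - prev.end > long_pause
    )
    return {
        "speech_rate": n_words / duration if duration > 0 else 0.0,  # words per second
        "long_pauses_per_word": long_pauses / n_words,
    }
```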
Finally, the grader (also called scoring model) makes predictions of examiner
scores based on the features. Researchers have tried various approaches to grader
design, using regression models (Xi, Higgins, Zechner, & Williamson, 2012),
classification trees (Xi et al., 2012), and non-linear models (Van Moere & Downey,
2016). The grader of CASE (v1.9) uses a Gaussian Process (GP), a statistical model that
allows the grader to produce an uncertainty measure about its score prediction based on
the similarity between the speech input and the training data; when the input is close to
the training sample, the variance of the predicted score is small (van Dalen, Knill, &
Gales, 2015).
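The CASE grader itself is not publicly available, so the following is only a generic sketch of the idea described above, using scikit-learn's Gaussian Process regressor: the model returns a predicted score together with a standard deviation that grows as a new feature vector moves away from the training data. The toy feature values and the kernel choice are ours, not CASE's.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Toy training data: each row is a feature vector (e.g. speech rate, long-pause
# frequency) and y holds the corresponding examiner scores on the 0-6 scale.
X_train = np.array([[2.1, 0.05], [1.4, 0.20], [2.8, 0.02], [1.0, 0.35]])
y_train = np.array([4.5, 3.0, 5.5, 2.0])

gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gp.fit(X_train, y_train)

# Predict a score together with an uncertainty estimate for a new response.
X_new = np.array([[1.8, 0.10]])
mean, std = gp.predict(X_new, return_std=True)
print(f"predicted score: {mean[0]:.2f}, predictive SD: {std[0]:.2f}")
# The SD is small when X_new resembles the training data and large otherwise,
# which is the property an AQ-style confidence measure builds on.
```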
The training of the grader usually requires a large amount of learner data that
consists of spoken test responses and reliable examiner scores associated with them.
The CASE grader was trained on 2,632 Linguaskill General Speaking tests representing all
six Common European Framework of Reference for Languages (CEFR) levels: Below A1, A1,
A2, B1, B2, and C1 or above (EST Ltd., 2020).
ASR confidence, i.e. how confident the system is that words, phonemes, or phrases in a
response have been correctly recognised. Low ASR confidence could result from
grammatical errors and disfluencies in the speech input (Yu Wang et al., 2018). The AQ
score (generated by the grader based on the GP) indicates how confident the system is
with its score prediction. Both measures are continuous variables ranging from 0 to 1,
A number of task types are typically used in automated speaking assessment, generally
falling into two main categories: constrained and free speaking tasks (Xu, 2015).
sentences, saying opposite words, giving short answers to questions, and building a
sentence from phrases. Such tasks form a large proportion of the Pearson’s Versant
English test (formerly PhonePass, Chun, 2006) and the Duolingo English test (Wagner,
automated scoring systems, as candidate responses to such tasks are highly predictable.
However, it has been argued that this test design trades authenticity for practicality and
convenience (Xu, 2015; Chun, 2008; Wagner & Kunnan, 2015). One core validity
consideration against constrained speaking tasks concerns construct under-
behaviours observed in the test are a true reflection of everyday use of English. Wagner
(2020) and Wagner and Kunnan (2015) also criticise the Duolingo English Test for
failing to tap into a broader range of cognitive processes, such as the ability to
listening/reading input in integrated tasks. This approach has been adopted in the Test
of English as a Foreign Language (TOEFL) and in Linguaskill (Xi et al., 2008; Xu, et
al., 2020). Compared to constrained speaking tasks, free speaking tasks tap into a more
communication-oriented oral construct and are considered more authentic (Chun, 2006).
(Galaczi & Taylor, 2018). This limited representation of interactional skills further adds
to the validity debates around automated speaking assessment and its use for different
purposes (e.g. Xi, 2010; Xu, 2015). Some promising work has been done in exploring
Automated scoring brings about challenges that are not typically associated with
examiner marking, such as the limited range of construct features in making score
automated systems is often questioned (Chun, 2006; Khabbazbashi et al., 2021; Xi,
2010, 2012).
Kane, 1992, 2006), a series of guiding questions for validating automated language
assessment was suggested by Xi (2010). These are based on six inferential steps that
support the intended interpretation and use of test scores, including domain
Each step is essential for justifying the use of automated scoring and they should fit
attempting to validate the use of automated scoring should assess the relevant validity
issues and then determine which areas need to be prioritised, depending on the amount
of evidence for and against each validity claim. There are two main questions: what is
the intended use of the scores, and how is the automated scoring system used to produce
the scores? Aspects of the validity argument may have differing levels of importance
scoring, and falls under the evaluation and generalisation inferences in Xi’s (2010)
validation framework. For example, validity concerns would be raised if automated
scoring yielded test scores that were inaccurate indicators of the quality of test
performance (Xi, 2010). Inaccurate scores would fail to support score-based claims
about candidates’ ability and thus weaken the entire validity argument. Likewise, if
more or less accurate under certain conditions such as the extremities of a measurement
the agreement between automated scores and examiner scores. The latest Standards for
provide test users with research evidence about the “agreement rates” between
automated scoring algorithms and examiners (AERA et al., 2014, p. 92). To evaluate
correlation coefficient (hereafter correlation; see Chen et al., 2018; Van Dalen et al.,
2015; Weigle, 2010; Williamson et al., 2012). However, we argue that this is
linear association, not agreement. For example, if the automarker score is always
exactly ten points higher than the examiner score, the correlation is 1 (perfect) but the
weighted kappa (QWK; Cohen, 1968). The original kappa (Cohen, 1960) is an adjusted
version of the probability of exact agreement that takes account of the probability of
agreement by chance, and QWK is a modified version that additionally takes account of
the distance between the two variables when they disagree. Kappa and QWK are hard to
interpret – they are not probabilities (in fact they can be negative), and the concept of
adjusting for agreement by chance can be hard to understand. They also behave counter-
intuitively. The simplest example is a test where the only scores are “pass” and “fail”
and an examiner gives “pass” 80% of the time. If there are two automarkers, A and B, it
can happen that A agrees with the examiner 40% of the time, and B agrees with the
examiner 65% of the time, but kappa is higher for A than for B (Yannakoudakis &
Cummins, 2015). This phenomenon arises from the assumption in kappa that if the
automarker and human examiner assigned scores randomly, they would follow fixed
marginal distributions, as if they had been told in advance what proportion of responses
should get each score, as in a norm-referenced test (Brennan & Prediger, 1981;
Yannakoudakis & Cummins, 2015). Similar possibilities exist with QWK and
continuous scores (for more detail on the drawbacks of kappa see Brennan & Prediger,
1981; Di Eugenio & Glass, 2004; Pontius & Millones, 2011).
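The counter-intuitive behaviour described above is easy to reproduce. The confusion matrices in the sketch below use illustrative counts that we constructed to match the pass/fail scenario (the examiner passes 80% of candidates; automarker A agrees 40% of the time, B agrees 65%); they are not data from Yannakoudakis and Cummins (2015).

```python
import numpy as np

def cohen_kappa(confusion: np.ndarray) -> float:
    """Cohen's kappa from a confusion matrix of counts (rows: examiner, columns: automarker)."""
    total = confusion.sum()
    p_observed = np.trace(confusion) / total
    p_chance = (confusion.sum(axis=1) / total) @ (confusion.sum(axis=0) / total)
    return (p_observed - p_chance) / (1 - p_chance)

# Illustrative counts out of 200 responses; category order is (pass, fail).
A = np.array([[40, 120],   # examiner "pass": A agrees on 40, disagrees on 120
              [0,   40]])  # examiner "fail": A agrees on 40
B = np.array([[125, 35],   # examiner "pass": B agrees on 125
              [35,   5]])  # examiner "fail": B agrees on 5

print(np.trace(A) / A.sum(), cohen_kappa(A))  # 0.40 agreement, kappa ~ 0.12
print(np.trace(B) / B.sum(), cohen_kappa(B))  # 0.65 agreement, kappa ~ -0.09
```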
Bland and Altman (1986, 1999) proposed a better way to evaluate agreement,
called ‘limits of agreement’ (LOAs), which we will apply in this study. The idea of this
approach is to take the differences between the two measurements on each subject (in our
assessment, the difference is the automarker score minus the examiner score) and
describe their distribution. LOAs are upper and lower bounds that contain 95% of the
differences. If these are close together, and close to zero, the automarker is performing
easy to understand and interpret (unlike correlation with its rules of thumb such as
“correlation over 0.7 is strong”). The details of the approach are explained in Section
Other measures of agreement include the proportion of responses for which the
automarker and human scores are within 0.5 or 1 of each other (Wu et al., 2020). These
are similar to the basic version of LOAs, though LOAs show whether the differences
tend to be positive or negative as well as their spread. QWK, LOAs, and other measures
3. Research Questions
The four research questions (RQs) of this study are directly related to the evaluation and
generalisation inferences discussed above. The first question concerns the accuracy of the
automarker, i.e. it centres on evidence for the
evaluation inference. The second addresses the consistency and severity of the
automarker, i.e. its focus is evidence for the generalisation inference. The third examines
automarker performance at different confidence levels. The fourth investigates the robustness of the
automarker against abnormal test behaviours and thus provides supporting evidence for
(1) How well does the automarker agree with the examiner gold standard?
(Evaluation)
(2) Does the automarker display the same level of internal consistency and severity
as examiners? (Generalisation)
(3) Are LQ scores and AQ scores useful for identifying unreliable automarker
scores? (Generalisation)
(4) Can the automarker reliably distinguish between English speech and non-English speech?
4. Methodology
using test responses from the Linguaskill General Speaking test. The study was carried
out in 2020 using data gathered in 2019 and CASE (v1.9); later CASE versions are not
necessarily the same in any particular aspect. The numerical scores used in the analyses
were aligned with CEFR levels so that 1–2 is A1, 2–3 is A2, and so on (see Figures 2–
5).
The Linguaskill General Speaking test is browser-based so candidates can sit the test on
any computer with a high-speed internet connection and with human invigilation in
place. Questions are presented through the computer screen and headphones, and the
candidate’s responses are recorded and remotely assessed by either computer algorithms
or trained examiners. The test is multi-level, i.e. designed to elicit oral performances of
The test has five parts: Interview, Reading Aloud, Presentation, Presentation
with Visual Information, and Communication Activity. Each part focuses on a different
aspect of speaking ability and is marked independently and weighted equally. The
format, testing aim, and evaluation criteria of the five parts are summarised in Table 1.
The evaluation dataset of the study consisted of 209 test responses randomly selected
from all six proficiency levels of Linguaskill candidates. Two responses were dropped
at the beginning of the study because they were identified by examiners as unmarkable.
Thus, a total of 207 test responses were used. All identifying information about the
candidates was removed when the data was received to comply with the ethics of using
test data. The candidates spoke 30 different first languages (L1s); the top five were
Indonesian (n = 11). The gender distribution was 45.9% female, 48.3% male, and 5.8%
unidentified. Based on fair average scores derived from triple examiner marking (see
Section 4.4. below), the sample included eight (3.9%) below A1 responses, 34 (16.4%)
to investigate the fourth question. These responses were produced by colleagues of the
authors, none of whom are native speakers of English, with the intention of tricking the automarker.
They were instructed to talk in their native languages, code-switch, or speak gibberish.
examiners. They were experienced examiners who had been marking the Linguaskill
General Speaking test since its launch in 2018 and had completed examiner
recertification shortly before the marking exercise. Two of them each had over 10 years
of marking experience on a range of English language tests. The other had been an
examiner for Cambridge English Qualifications for five years before joining
Linguaskill.
4.4. Examiner gold standard
Distinct from previous studies on automarker evaluation (Van Moere, 2012; Xi et al.,
2008), we used fair average scores derived from multifaceted Rasch measurement
(MFRM) as the gold standard criterion of oral proficiency. The fair average scores
resulting from MFRM are average scores adjusted for marker severity (Myford &
Wolfe, 2003, 2004). They can be deemed as scores that would be given by an average
marker chosen from a pool of markers (Linacre, 1989). MFRM is commonly used by
language testing researchers to identify and measure the factors that contribute to
variability in assessment results (Barkaoui, 2014; Brown et al., 2005; Linacre, 1989;
Yan, 2014) and was used in this study to offset differential severity among markers. The
“examiner gold standard scores” discussed in the following sections refer to fair average
scores derived from triple examiner marking. That is, every test part answered by every
logistically complex and does not reflect the score reporting practice of the Linguaskill
General Speaking test. Fair average scores were computed to create reliable estimates of
candidate abilities or scores that closely reflect candidates’ true oral proficiency. This is
a standard procedure that the Linguaskill General Speaking test follows in examiner
against the fair average scores resulting from marking by senior examiners. In this
study, we evaluated the automarker using the same examiner gold standard.
The data for this study consisted of automarker scores, automarker uncertainty
measures, examiner raw scores, and examiner fair average scores. The agreement
between automarker and examiner gold standard scores (RQ1) was evaluated using
LOAs, the standard approach in medical science for comparing two methods of clinical
measurement (Bland & Altman, 1986, 1999). This approach is based on looking at the
differences between the two measurements on each of the subjects. The first step is to
plot the differences and check that they are approximately constant over the score range,
and that they follow a normal distribution.
If the assumptions are met, the second step is to calculate the LOAs. These are upper
and lower bounds such that approximately 95% of the differences lie between them. If
the LOAs are both close to zero, the two measurements agree well with each other. If
the assumptions in the first step are not met, there are modified methods that can be used.
The idea of the approach is that domain experts, such as language testing
researchers or testing organisation employees, can look at the LOAs between the
automarker and the examiner gold standard and judge whether they are satisfactory.
This contrasts with measures such as QWK, for which interpretation often relies on
rather arbitrary rules of thumb, and which do not have much meaning to people who
understand the scoring scale but not statistical science, in addition to the problems
We also calculated the percentage agreement between the automarker and the
examiner gold standard on CEFR classification. This is simply the proportion of cases
in which the automarker and examiner gold standard give the same CEFR level. We
include this analysis because CEFR levels are the primary test results reported for
Linguaskill. Its disadvantages are that it depends on the number of possible categories
for the classification, and for cases where the automarker and examiner disagree it does
not show how far apart they are. To address the second research question, we conducted an MFRM
analysis on raw scores awarded by each of the three examiners and the automarker,
using the FACETS computer program (Version 3.71; Linacre, 2014). In this analysis the
10 unusual responses were excluded. In contrast to the Bland and Altman method on
overall test scores, the MFRM analysis was performed on test part scores, to take into
consideration the variance in assessment results caused by test items. The examiner
scores were discrete data points from 0 to 6 in increments of 0.5, whereas the
automarker scores on each test part were continuous data from 0 to 6. To put the data
into the same form, the automarker scores were rounded to the nearest 0.5. FACETS
requires integer scores, so both automarker scores and examiner scores were doubled,
and the MFRM analysis was conducted on scores in the form of integers from 0 to 12.
The fair averages resulting from the analysis were then halved to be on the original
scale of measurement.
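The rescaling described above is simple but easy to mis-specify, so the arithmetic is sketched below; this is our reconstruction for illustration rather than the preprocessing code actually used.

```python
import numpy as np

def prepare_for_facets(automarker: np.ndarray, examiner: np.ndarray):
    """Put automarker and examiner part scores on a common integer scale (0-12).

    Automarker scores are continuous on 0-6 and are first rounded to the nearest
    0.5; both sets of scores are then doubled so FACETS receives integers.
    """
    auto_halfpoints = np.round(automarker * 2) / 2
    return (auto_halfpoints * 2).astype(int), (examiner * 2).astype(int)

def back_to_reporting_scale(fair_average_0_12: np.ndarray) -> np.ndarray:
    """Halve the FACETS fair averages to return to the original 0-6 scale."""
    return fair_average_0_12 / 2

auto = np.array([3.27, 4.81])
exam = np.array([3.5, 4.5])
print(prepare_for_facets(auto, exam))  # (array([ 7, 10]), array([7, 9]))
```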
accuracy (RQ3) was investigated by making a simple scatterplot between the two
sensitivity to non-English speech (RQ4) was investigated by using the difference in the
ASR confidence score between English and non-English speech. A Mann–Whitney test
was used to test this difference, as the normality assumption was not met.
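A minimal example of this test, using SciPy and placeholder LQ values rather than the study data, is shown below.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Placeholder LQ scores for the two groups (not the study data).
lq_english = np.array([0.91, 0.88, 0.93, 0.85, 0.90, 0.87])
lq_non_english = np.array([0.66, 0.62, 0.70, 0.64, 0.68])

# Two-sided Mann-Whitney U test: no normality assumption is required.
stat, p_value = mannwhitneyu(lq_english, lq_non_english, alternative="two-sided")
print(f"U = {stat:.1f}, p = {p_value:.4f}")
```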
5. Results
5.1. RQ1: Agreement between automarker and the examiner gold standard
We show three analyses in order to illustrate the variety of methods that are available
with Bland and Altman’s approach. This is also intended to show the steps that a
researcher goes through when analysing this kind of data, including dealing with unusual
responses. As mentioned above, the “examiner score” was the Rasch fair average of the three
examiners’ scores.
Figure 2 shows the automarker and examiner scores, and the diagonal line shows
perfect agreement—a regression line could be added, but this is not recommended and
is not a suitable method for analysing agreement (Bland & Altman, 1999, 2003). As
shown on the axes, the 0–6 scale corresponds to the six levels of the CEFR. As can be
seen, there is a clear tendency for the automarker scores to be higher than the examiner
scores. Several individual points are unusual: in five cases the examiner and automarker
scores are both very close to zero; for five others, the examiner score is close to zero but
differences, shown on the y-axis, are easier to see, and their downward trend is more
obvious. (This trend could be due to the automarker being poorly calibrated for low and
high scores; see Guo et al., 2017). Bland and Altman also recommended a histogram of
the differences, which we omit for reasons of space. In the scatterplot, if the distribution
of the differences was roughly the same across the range of scores, and normally
distributed, the next step would be to summarise it by calculating three quantities: the
bias, which is defined to be the mean of the differences, and LOAs, which contain 95%
of the differences. The LOAs are calculated as the bias plus or minus 1.96 times the
standard deviation. These are shown in Figure 3 (thick dashed lines), with confidence
intervals, as an illustration of the most basic version of Bland and Altman’s approach.
The idea is that most of the differences lie within the range shown by the LOAs, so if
the LOAs are sufficiently close together and close to zero, then the automarker is
performing well.
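The calculation of the bias and the basic (horizontal) LOAs can be written in a few lines; the scores in the sketch below are invented purely for illustration.

```python
import numpy as np

def limits_of_agreement(automarker: np.ndarray, examiner: np.ndarray):
    """Bland-Altman bias and basic 95% limits of agreement."""
    diffs = automarker - examiner          # automarker minus examiner, as in the paper
    bias = diffs.mean()
    sd = diffs.std(ddof=1)
    return bias, (bias - 1.96 * sd, bias + 1.96 * sd)

# Invented scores on the 0-6 scale, purely for illustration.
auto = np.array([3.4, 4.1, 2.8, 5.0, 3.9, 2.2])
exam = np.array([3.0, 3.6, 2.9, 4.4, 3.3, 1.9])
bias, (lower, upper) = limits_of_agreement(auto, exam)
print(f"bias = {bias:.2f}, 95% LOAs = ({lower:.2f}, {upper:.2f})")
```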
For this dataset, the horizontal LOAs do not describe the distribution of the
differences well, because of the downward trend in the differences. The unusual scores
mentioned above also have a distorting effect and do not follow the assumption that the
differences are normally distributed. In order to explore potential reasons for these
unusual scores, we listened to some of the audio files and looked at the examiner scores.
Many of the zero-score responses were silent apart from microphone crackle. For a
further analysis we decided to drop 10 responses that had received examiner scores
below 0.5 when the Rasch fair averages were calculated from 207 responses and had
also received comments from the examiners such as “no meaningful response”. The
automarker performance on these unusual responses indicated that in half of these cases,
the fair average score was extremely close to 0 (below 0.1) and the automarker score
was 0, which is very satisfactory; for the other half the automarker score was between
1.9 and 3.8, which is unsatisfactory. In addition, none of the five unsatisfactory cases
had an LQ score over 0.9, a cut score set to distinguish between reliable and unreliable
automarker scores.
Figure 4 shows the new scatterplot, based on 197 cases; naturally the LOAs are
closer together. In Figure 5 a modified approach has been used, to give sloping LOAs
that fit the data better (Bland & Altman, 1999). For example, for scores around the
middle of the B1 range, it can be seen that the bias is approximately 0.4 and the LOAs
are –0.7 and +1.4. The sloping LOAs describe the distribution of the differences well,
but they are harder to interpret—it is harder to judge from them whether the automarker
performance is satisfactory.
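The sloping limits come from Bland and Altman's (1999) regression method, in which both the bias and the spread of the differences are modelled as linear functions of the score. The sketch below is our implementation of that general method with invented data and the average of the two scores on the x-axis; it is not the exact code behind Figure 5.

```python
import numpy as np

def sloping_loas(automarker: np.ndarray, examiner: np.ndarray, x: np.ndarray):
    """Regression-based (sloping) bias and 95% LOAs, after Bland & Altman (1999).

    `x` is the variable on the horizontal axis, here the average of the two scores.
    """
    diffs = automarker - examiner
    # Step 1: model how the bias changes across the score range.
    b1, b0 = np.polyfit(x, diffs, 1)
    fitted_bias = b0 + b1 * x
    # Step 2: model how the spread changes, via the absolute residuals.
    abs_resid = np.abs(diffs - fitted_bias)
    c1, c0 = np.polyfit(x, abs_resid, 1)
    # For normal residuals E|r| = sigma * sqrt(2/pi), so scale up by sqrt(pi/2).
    fitted_sd = np.sqrt(np.pi / 2) * (c0 + c1 * x)
    return fitted_bias, fitted_bias - 1.96 * fitted_sd, fitted_bias + 1.96 * fitted_sd

# Invented data for illustration.
auto = np.array([2.6, 3.2, 3.9, 4.4, 5.1, 2.0, 3.5, 4.8])
exam = np.array([1.9, 2.7, 3.6, 4.2, 5.0, 1.3, 3.1, 4.7])
bias, lower, upper = sloping_loas(auto, exam, (auto + exam) / 2)
```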
give scores on average 0.41 higher than the gold standard, and 95% of the differences
lie between –0.75 and 1.57. There is a distinct trend for the differences to be higher in
the A1 and A2 range. Of course, the unusual responses cannot be ignored, and the
automarker needs to be developed and trained to deal with these correctly. In this case
the automarker only identified five of the 10 responses that should have been given a
scores on the x-axis. Their argument for this choice of x-axis assumes that the two
measurement methods have reasonably similar random error (Bland & Altman, 1995),
which is probably not the case here, because the Rasch fair average of three human
scores has lower variance than a single score. An alternative might be to put the Rasch
fair average on the x-axis. However, this would not affect our observation of the
downward trend in the differences, which is also visible in Figure 2, or our main results,
Linguaskill reports candidates’ CEFR levels as the primary test result. We conducted
this analysis on the 197 responses, as we were more interested in automarker accuracy
Section 5.3) and are normally marked by examiners. By applying CEFR cut scores, we
converted the raw scores awarded by the automarker and the three markers to CEFR
levels. Our question was how well the CEFR classifications made by the automarker and by
each individual examiner agreed with the examiner gold standard.
Table 2 shows three types of agreement: exact agreement (assigning to the same
level), adjacent agreement (assigning to the same level or one level up or down), and
disagreement (difference greater than one CEFR level). On the 197 Linguaskill General
Speaking tests, the automarker achieved 48.3% exact agreement and 93.3% adjacent
agreement with the examiner gold standard. Compared with the individual marks awarded by
the three examiners, this performance was slightly worse than that of Marker 1 and
considerably worse than those of Marker 2 and Marker 3.
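As an illustration of how figures of this kind are obtained, the sketch below converts numeric scores to CEFR bands using the alignment described in the Methodology (1–2 is A1, 2–3 is A2, and so on) and then computes exact and adjacent agreement. The handling of band boundaries and the example score vectors are our assumptions, not the operational conversion used by Linguaskill.

```python
import numpy as np

LEVELS = ["Below A1", "A1", "A2", "B1", "B2", "C1 or above"]

def cefr_band(score: float) -> int:
    """Map a 0-6 score to a CEFR band index; boundary handling is our assumption."""
    return int(min(np.floor(score), 5))

def agreement(auto_scores, examiner_scores):
    auto_bands = np.array([cefr_band(s) for s in auto_scores])
    exam_bands = np.array([cefr_band(s) for s in examiner_scores])
    gap = np.abs(auto_bands - exam_bands)
    return {
        "exact": float(np.mean(gap == 0)),      # same CEFR level
        "adjacent": float(np.mean(gap <= 1)),   # same level or one level apart
    }

# Placeholder scores, not the study data.
auto = [3.4, 4.1, 2.2, 5.3, 1.8]
exam = [3.0, 3.6, 2.9, 4.4, 1.9]
print(agreement(auto, exam))  # {'exact': 0.6, 'adjacent': 1.0}
```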
The MFRM analysis reports individual-level measures for marker consistency via outfit
and infit mean square residuals. As the outfit mean square residual is unweighted and
thus more sensitive to outliers, the infit mean square residual (hereafter, infit) is usually preferred.
Ideally, an infit should be close to 1. An infit lower than 1 suggests a higher degree of
predictability or tendency to award the same score; an infit higher than 1 suggests a
higher degree of randomness in marking. Generally, examiners with an infit higher than
1.5 tend to mark inconsistently or unpredictably, whereas markers with an infit lower
than 0.5 tend to be too predictable in their marking behaviours (Linacre, 2014; Yan,
2014). Table 3 shows the infit statistics of the examiners and the automarker. All were
within the acceptable range of 0.5 to 1.5.
The MFRM analysis also measures marker severity at both group and individual levels. At the group level, FACETS
performed a fixed chi-square test on the null hypothesis that all markers are at the same
level of severity. The chi-square test indicated a significant difference in severity among
the four markers (χ² = 503.8, df = 3, p < .01). At the individual level, marker severity is
measured in logits with a positive value indicating severity and a negative value
indicating leniency.
As shown in the second column of Table 3, marker severity ranged from −0.42
logits to 0.50 logits. All the severity measures were close to zero, suggesting that none
of the markers were too harsh or too lenient. The automarker, with a severity of –0.42,
was the most lenient, but close to Marker 1, the most consistent marker, whose severity
was –0.40. The ranking of severity among the four markers can be seen in the ‘Marker’
column of the Wright map (Figure 6). The higher markers are in this column, the more
severe they are. Likewise, marker severity can be inferred by the fair average a marker
awarded to all candidates in the sample. As seen in Table 3, Marker 3, the most severe
marker among the four, had a fair average of 3.32 whereas the automarker had a fair
average of 3.88, the highest of the four.
To summarise, the automarker exhibited almost the same level of internal consistency
as the most consistent examiner. The marker severity measures were closely distributed
around zero, and none of the markers was extremely harsh or lenient. However, the automarker was found to be the most
lenient of the four, which confirms the finding from the LOAs analyses above.
The automarker generates two uncertainty measures, AQ and LQ, to indicate its confidence in scoring a test response. Both are continuous
variables ranging from 0 to 1, with a higher value indicating higher confidence. The AQ
score had a small range (0.7, 0.9) and low variability (SD = 0.02) in the present
dataset—95.7% of the test responses received an AQ score of 0.9. For this reason, AQ was
not analysed further.
The LQ score, in contrast, had a wider range (0.46, 0.95) and greater variability
(SD = 0.08). The scatterplot in Figure 7 shows the relationship between LQ (x-axis) and
the absolute differences between the automarker and examiner fair average scores (y-
axis). The unit of measurement for the y-axis is one CEFR band. The clustered
datapoints in the bottom right corner of the scatterplot suggest that the absolute
difference between the automarker and the examiner gold standard tends to be smaller
than one CEFR band (M = 0.40, SD = 0.26) when the LQ score is greater than or equal
to 0.9 (i.e. data points to the right of the red vertical line). In contrast, when the LQ
score is smaller than 0.9 (i.e. data points to the left of the vertical line), this difference
automarker score increases. For example, the five unusual responses that had inaccurate
automarker scores (see Section 5.1.1.) had an LQ score ranging from 0.62 to 0.85.
Figure 7 only shows the absolute differences, but there was no systematic tendency in
the direction of the differences when LQ was low. Percentage agreement on
CEFR categorisation was recalculated after excluding test responses with an LQ score
lower than 0.9. This dataset contained 79 tests with examiner fair average scores
ranging from 1.14 to 5.94, 72 (90%) of which had an examiner fair average over 3.0,
equivalent to B1 proficiency on the CEFR. Table 4 shows the percentage agreement
statistics of the four markers in the reduced dataset. Compared to the numbers from the
original dataset (Table 2), the automarker exact agreement increased from 48.3% to
61.3% and adjacent agreement increased from 93.3% to 100%. The automarker
performance drew closer to that of Marker 1 and Marker 3 but was still much lower than
that of Marker 2. The MFRM analysis was repeated on this reduced dataset to
confirm the improved automarker performance on responses with high LQ scores. The
results (Table 5) indicate that the automarker had average severity (–0.25) among the
four markers with its internal consistency (infit = 0.89) still being the second best. This
contrasts with its position as the most lenient marker in the full
dataset, as discussed in Section 5.2. The severity ranking is illustrated in the second
column of Figure 8 in which the automarker falls in the middle and is very close to zero,
reliability. When LQ is higher than 0.9, the automarker tends to achieve a closer
agreement with the examiner gold standard as well as ideal marker severity.
automarker would produce lower LQ scores on these as the speech recogniser was
0.66, SD = 0.03) than the English-speaking group (M = 0.89, SD = 0.04). Because the
normality assumption was not met, the Mann–Whitney test, a non-parametric test, was
chosen to test the null hypothesis that the two groups were the same. This suggested
separation of datapoints between the two groups. This separation can be observed in the
measure for judging whether a test response is in English or not. In the English-
speech, received an average LQ score of 0.86 with a range from 0.78 to 0.91. Thus, if
the LQ score is below 0.7, there is a high chance that the speech is not English.
6. Conclusions
For the 95.2% of responses with a fair average score above 0.5, the LOAs indicated that
the average difference between CASE and the examiner gold standard was 0.41, with
most of the differences lying between –0.75 and 1.57 (1 usually represents one CEFR
band). For the other responses, CASE agreed closely with the examiner gold standard
50% of the time. This level of agreement is inadequate to justify a decision to fully
replace examiners with the automarker in high-stakes assessment contexts (Xi, 2010),
consistency and was almost on a par with the most consistent examiner. When
uncertainty measures were not considered, the automarker had moderate exact and
adjacent agreement on CEFR grading with the examiner gold standard. It also tended to
be more lenient than the examiner gold standard, particularly in the A1–A2 range.
However, the automarker performance was considerably more accurate in cases where it
reported high confidence. When LQ scores were greater than or equal to 0.9, the automarker achieved average severity
among the four markers and was nearly as accurate in CEFR grading as two of the
human markers. These findings seem to suggest that speech intelligibility and audio
quality may have a significant impact on automarker reliability. In other words, the
The LQ score generated by the speech recogniser was also found useful for detecting
non-English responses: the LQ score of English speech was still much higher than that of
non-English speech or gibberish. Human inspection
of test responses with a low LQ score seems a viable means of detecting malpractice (Xi
et al., 2016).
this study, we used LOAs to measure the deviation of the automarker scores from the
examiner gold standard on the original scale of measurement, i.e. the CEFR. We also
showed how to construct confidence intervals for this deviation at different score ranges.
This contrasts with previous studies, which have typically reported
Pearson correlation and Cohen’s kappa (e.g. Higgins et al., 2011; Van Moere et al.,
2012; Yu Wang et al., 2018), which we argue are unsuitable for automarker evaluation
and hard to interpret (see Section 2.3). Unlike in previous automarker research, we used
examiner fair averages resulting from MFRM as the examiner gold standard. The fair
average score is generally considered a more reliable measure of candidate ability than
an arithmetic average or single marking in that the fair average score is statistically
adjusted for marker severity based on marker behaviours in the entire sample (Myford
inferences in a validity argument for automated scoring, did not find evidence to
support the use of the automarker (CASE v1.9) on its own in the Linguaskill General
Speaking test, a relatively high-stakes English language assessment. However, the study
marking, ability to indicate its confidence and flag abnormal responses, and human-like
6.2. Implications
One of the aims of this study was to address the lack of transparency on automarker
research and validation (Khabbazbashi et al., 2021). Accordingly, the first major
contribution of this study is methodological: the use of LOAs to evaluate agreement in
automated scoring. We argue that LOAs are measures of agreement, but correlation is
not. This paper demonstrated how LOAs could be used to 1) examine automarker
agreement with, or deviation from, the examiner gold standard on the original scale of
measurement and 2) construct 95% confidence intervals for the differences. LOAs
arguably present more granular information about automarker reliability than a simple
traditional method for evaluating marker severity and consistency and found that the
measures such as the LQ score are useful in predicting automarker reliability, then they
can be used to sort responses into two categories: those that must be marked by examiners and those on
which computer marking is trustworthy. This hybrid approach combines the strengths
and benefits of artificial intelligence with those of examiners and can improve the
overall efficiency of marking without lowering score reliability (see Xu et al., 2020).
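As a concrete illustration, such a hybrid workflow could be as simple as a threshold rule on the LQ score. The function below sketches the routing logic only; the 0.9 cut-off follows the analysis reported in Section 5, and the data structure is hypothetical rather than part of any operational system.

```python
from typing import NamedTuple

class ScoredResponse(NamedTuple):
    """Hypothetical container for one automarked response."""
    response_id: str
    automarker_score: float   # 0-6 scale
    lq: float                 # speech recogniser confidence, 0-1

def route(response: ScoredResponse, lq_threshold: float = 0.9) -> str:
    """Route a response to automated or examiner marking based on its LQ score.

    Responses below the threshold (including the LQ < 0.7 region associated
    with non-English speech earlier in the paper) go to an examiner.
    """
    if response.lq >= lq_threshold:
        return "report automarker score"
    return "send to examiner"

print(route(ScoredResponse("r1", 4.2, 0.93)))  # report automarker score
print(route(ScoredResponse("r2", 3.1, 0.68)))  # send to examiner
```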
6.3. Limitations
The present study had two major limitations. First, we did not have enough data for a
detailed analysis across candidate first languages, so it is not known how the automarker
would perform on unfamiliar accents. Second, the responses
were exclusively retrieved from the Linguaskill General Speaking test, so the findings
cannot be generalised to the Linguaskill Business Speaking test or other tests in which
test-taking behaviours may be different. These limitations are being addressed as part of
Acknowledgments
We would like to thank Mark Gales, Kate Knill, Trevor Benjamin, the editors, and the
anonymous reviewers for their stimulating and helpful comments on the manuscript.
Martin Robinson, Bronagh Rolph, Kevin Cheung, Ardeshir Geranpayeh, and John
References
AERA, APA, & NCME. (2014). Standards for educational and psychological testing.
AERA.
(Ed.), The companion to language assessment (Vol. III, pp. 1301–1322). John
Bland, J. M., & Altman, D. G. (1986). Statistical methods for assessing agreement
https://doi.org/10.1016/S0140-6736(86)90837-8
https://doi.org/10.1177/096228029900800204
1085–1087. https://doi.org/10.1016/S0140-6736(95)91748-9
https://doi.org/10.1002/uog.122
Brennan, R. L., & Prediger, D. J. (1981). Coefficient kappa: Some uses, misuses, and
https://doi.org/10.1177/001316448104100307
Brown, A., Iwashita, N., & McNamara, T. (2005). An Examination of rater orientations
http://dx.doi.org/10.1002/j.2333-8504.2005.tb01982.x
Chapelle, C. A., Enright, M. K., & Jamieson, J. M. (Eds.). (2008). Building a validity
Chen, L., Zechner, K., Yoon, S.-Y., Evanini, K., Wang, X., Loukina, A., Tao, J., Davis,
L., Lee, C. M., Ma, M., Mundkowsky, R., Lu, C., Leong, C. W., & Gyawali, B.
https://doi.org/10.1002/ets2.12198
306. https://doi.org/10.1207/s15434311laq0303_4
https://doi.org/10.1177/001316446002000104
Cohen, J. (1968). Weighted kappa: Nominal scale agreement provision for scaled
https://doi.org/10.1037/h0026256
Di Eugenio, B., & Glass, M. (2004). The kappa statistic: A second look. Computational
Enhanced Speech Technology Ltd. (2020). EST custom automated speech engine
Fan, J. (2014). Chinese test takers' attitudes towards the Versant English Test: A mixed-
https://doi.org/10.1186/s40468-014-0006-9
Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017). On calibration of modern
neural networks. Proceedings of Machine Learning Research, 70, 1321–1330.
http://proceedings.mlr.press/v70/guo17a.html
Higgins, D., Xi, X., Zechner, K., & Williamson, D. M. (2011). A three-stage approach
112(3), 527–535.
Khabbazbashi, N., Xu, J., & Galaczi, E. (2021). Opening the black box: Exploring
Springer.
Lieberman, H., Faaborg, A., Daher, W., & Espinosa, J. (2005). How to wreck a nice
beach you sing calm incense [Paper presentation]. 10th International Conference
https://www.winsteps.com/tutorials.htm
Litman, D., Strik, H., & Lim, G.S. (2018). Speech technologies and the assessment of
Lu, Y., Gales, M., Knill, K., Manakul, P., Wang, L., & Wang, Y. (2019). Impact of
Myford, C. M., & Wolfe, E. W. (2003). Detecting and measuring rater effects using
386–422.
Myford, C. M., & Wolfe, E. W. (2004). Detecting and measuring rater effects using
189–227.
Ockey, G. J., & Chukharev-Hudilainen, E. (in press). Human versus computer partner in
https://doi.org/10.1093/applin/amaa067
Pontius, R. G., & Millones, M. (2011). Death to kappa: Birth of quantity disagreement
https://doi.org/10.1080/01431161.2011.552923
Song, Z. (2020). English speech recognition based on deep learning with multiple
Van Dalen, R. C., Knill, K. M., & Gales, M. J. F. (2015). Automatically grading
https://www.slate2015.org/files/SLaTE2015-Proceedings.pdf
Van Moere, A. (2012). A psycholinguistic approach to oral language assessment.
Van Moere, A., & Downey, R. (2016). Technology and artificial intelligence in
Wagner, E. (2020). Duolingo English test, Revised version July 2019. Language
https://doi.org/10.1080/15434303.2020.1771343
Wagner, E., & Kunnan, A. J. (2015). The Duolingo English test. Language Assessment
Wang, Y. [Yu], Gales, M. J. F., Knill, K. M., Kyriakopoulos, K., Malinin, A., van
https://doi.org/10.1016/j.specom.2018.09.002
Wang, Y. [Yanhong], Luan, H., Yuan, J., Wang, B., & Lin, H. (2020). LAIX corpus of
speech.org/archive/Interspeech_2020/pdfs/1677.pdf
Weigle, S. C. (2010). Validation of automated scores of TOEFL iBT tasks against non-
https://doi.org/10.1177/0265532210364406
Williamson, D. M., Xi, X., & Breyer, F. J. (2012). A framework for evaluation and use
Wu, X., Knill, K., Gales, M., & Malinin, A. (2020). Ensemble approaches for
3864. https://doi.org/10.21437/Interspeech.2020-2238
Xi, X. (2010). Automated scoring and feedback systems: Where are we and where are
https://doi.org/10.1177/0265532210364643
Xi, X. (2012). Validity in the automated scoring of performance tests. In G. Fulcher &
Routledge.
Xi, X., Higgins, D., Zechner, K., & Williamson, D. M. (2008). Automated scoring of
spontaneous speech using SpeechRaterSM v1.0. (ETS Research Report No. RR-
8504.2008.tb02148.x
Xi, X., Higgins, D., Zechner, K., & Williamson, D. (2012). A comparison of two
Xi, X., Schmidgall, J., & Wang, Y. (2016). Chinese users' perceptions of the use of
Xu, J., Brenchley, J., Jones, E., Pinnington, A., Benjamin, T., Knill, K., Seal-Coon, G.,
https://www.cambridgeenglish.org/Images/589637-linguaskill-building-a-
validity-argument-for-the-speaking-test.pdf
https://doi.org/10.1177/0265532214536171
Linguistics. https://doi.org/10.3115/v1/W15-0625
Table 2. Agreement on CEFR classification between the examiner gold standard and single
marking from automarker and three markers (n = 197).
                 Examiner gold standard
              Exact agreement   Adjacent agreement
Marker 1          55.2%              100%
Marker 2          83.7%              99.5%
Marker 3          73.4%              99.5%
Automarker        48.3%              93.3%
Table 3. Measurement estimates of the reliability of three markers and automarker (n = 197).
Marker        Severity   Fair Avg   Model SE   Infit MSq   Infit ZStd   Outfit MSq   Outfit ZStd   PtBis Corr
Marker 1       –0.40       3.87       0.04        1.01         0.10         0.97         –0.30         0.85
Marker 2        0.33       3.44       0.04        1.11         1.60         1.14          2.10         0.86
Marker 3        0.50       3.32       0.04        0.78        –3.50         0.91         –1.40         0.91
Automarker     –0.42       3.88       0.04        0.93        –0.90         1.07          1.0          0.82
Fixed (all same) chi-square: 503.8, df = 3, p < .01
Table 4. Agreement on CEFR classification between the examiner gold standard and single
marking from automarker and three markers when LQ is greater than 0.9 (n = 79).
                 Examiner gold standard
              Exact agreement   Adjacent agreement
Marker 1          64.1%              100%
Marker 2          85.9%              100%
Marker 3          69.2%              100%
Automarker        61.3%              100%