Using Sampling to Estimate and Improve Performance of Automated Scoring Systems with Guarantees

Abstract

Automated Scoring (AS), the natural language processing task of scoring essays and speeches in an educational testing setting, is growing in popularity and being deployed across contexts from government examinations to companies providing language proficiency services. However, existing systems either forgo human raters entirely, thus harming the reliability of the test, or score every response by both human and machine, thereby increasing costs. We target the spectrum of possible solutions in between, making use of both humans and machines to provide a higher-quality test while keeping costs reasonable to democratize access to AS. In this work, we propose a combination of the existing paradigms, intelligently sampling responses to be scored by humans. We propose reward sampling and observe significant gains in accuracy (19.80% increase on average) and quadratic weighted kappa (QWK) (25.60% on average) with a relatively small human budget (30% of samples) using our proposed sampling. The accuracy increases observed using standard random and importance sampling baselines are 8.6% and 12.2%, respectively. Furthermore, we demonstrate the system's model-agnostic nature by measuring its performance on a variety of models currently deployed in AS settings as well as on pseudo models. Finally, we propose an algorithm to estimate the accuracy/QWK with statistical guarantees [1].

Figure 1: Existing Automated Scoring systems [2] either do not involve humans at all in their scoring (Duolingo, Second Language Testing Inc (SLTI)), or utilize human raters for every single response (Educational Testing Services (ETS)). Crucially, there are no solutions that target the gulf in between, where humans are involved in scoring only some percentage of the responses.

* These authors contributed equally.
Copyright © 2022, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
[1] Our code is available at https://git.io/J1IOy
[2] TOEFL by ETS, Pearson PTE, SLTI, Linguaskill by Cambridge, Duolingo English Test, and TrueNorth by Emmersion are registered brand names and are shown here for illustration purposes only. The authors claim no rights over their logos or brand names. In this work, we mainly refer to the automatically scored speaking and writing proficiency measurement tests of these companies.

1 Introduction

Automated Scoring (AS), the task of assigning scores to unstructured responses to open-ended questions, is an NLP application typically deployed in an educational setting. Historically, its origins have been traced to the work of Ellis Page (Page 1967), who first argued for the possibility of scoring essays by computer. The factors behind the rise of Automated Scoring systems and its subtasks, Automated Essay Scoring (AES) and Automated Speech Scoring (ASS), are numerous, including but not limited to the costs involved in providing and scoring a test, and ensuring that all test takers are scored on a uniform set of rubrics applied across all students, standardizing the scoring for these unstructured responses. The promise of lower costs and uniform scoring rubrics, among other factors, has fueled the popularity of Automated Scoring systems, and various ML and DL systems are being increasingly deployed in AS contexts (Kumar et al. 2019; Liu, Xu, and Zhu 2019; Singla et al. 2021a). AS systems are behind some of the world's most popular language tests, such as ETS' Test of English as a Foreign Language (TOEFL) (Zechner et al. 2009) and Duolingo's English Test (DET) (LaFlair and Settles 2019), among others. Various governmental institutions and businesses have also instituted automated systems to augment the scoring process, such as the state schools of Utah (Incorporated 2017) and Ohio (O'Donnell 2018), and a majority of BPOs. It is estimated that automatic scoring has a large market size of more than USD 110 billion, with a US market size alone of USD 17.1 billion (TechNavio 2020; Service 2020; Strauss 2020; Le 2020).
Figure 2: From a dataset, records are sampled and assigned to expert human raters for double scoring based on a human-machine agreement matrix. A second sample is then drawn to check predictions, and metrics are estimated with statistical guarantees.
However, this popularity has not been without backlash, with criticism focusing on different aspects, such as "the overreliance on surface features of responses, the insensitivity to the content of responses and to creativity, and the vulnerability to new types of cheating and test-taking strategies" (Yang et al. 2002). Others have given harsher criticisms, such as Perelman et al. (2014), who show that it is possible to game the system and achieve near-perfect scores on ETS and Vantage Technologies' AES systems with gibberish prose. This led to the revoking of the NAPLAN AES in Australia (ACARA 2018).

Nonetheless, the ability of AS systems to instantly provide scores, reduce costs, and make language proficiency tests more widely available to all makes them an important research area, and subsequently there is considerable interest in improving them across multiple dimensions, from leveraging advancements in NLP to achieve state-of-the-art performance (Liu, Xu, and Zhu 2019) to improving their robustness (Kumar et al. 2020; Parekh et al. 2020; Singla et al. 2021b). In this work, we tackle another facet of Automatic Scoring systems: improving performance by bringing humans into the loop.

Typically in an AS task, a test taker's responses are scored on prompts of varying difficulty levels. Each prompt has its own difficulty level, and based on the prompts' difficulty and the quality of the candidate's answers to these prompts, a score is assigned to the candidate. The Common European Framework of Reference for Languages (CEFR) is an international standard for measuring language proficiency and assigns scores on a six-point scale from A1 (beginner) to C2 (proficient), each score with its own rubrics for evaluation (Broeder and Martyniuk 2008). Each prompt and response is assigned a score on this scale, and a global score is computed by aggregating these individual scores.

Existing AS systems are typically of two varieties (Fig 1):

Double Scoring: Examinations such as ETS' TOEFL score every response by one human and an AES system as the second rater. A second human rater resolves any disagreements between the two (Yang et al. 2002). This effectively means that at least one human rater is required for every test, driving up costs, as evidenced by the TOEFL's high price of ∼230 USD (ETS 2021).

Machine-only Scoring: On the other end of the scale are tests like the Duolingo English Test (DET), which are scored by machines alone, without any human intervention, keeping costs low but decreasing the reliability of the test. This is one of the main reasons the DET costs USD 49, less than one-fourth of what the TOEFL costs. All tests surveyed in Fig 1 except Pearson PTE are priced around the same price point.

Our solution (Fig 2) proposes to unify these varieties, allocating the available human budget intelligently to balance the reliability of the test with the cost to the test-taker. To the best of our knowledge, no existing systems target this continuum of utilizing both humans and AS raters. Providing this option would allow AS models to be deployed in more versatile scenarios, working in tandem with expert human raters to provide both reliability and lower-cost solutions. Increasing reliability helps to build trust in automatically scored exams, thus leading to broader adoption. Cost is a critical consideration for lower-income test-takers and those who need to take the test multiple times.

We define the problem and solution more formally as follows: given a set of responses to be scored, a target AS model, and an expert human budget (that is, the number of responses we can have scored by expert human raters), our goal is to efficiently sample responses to be scored by the expert. These expert-scored samples are then combined with automatically scored samples to maximize the overall system performance metric. We propose a novel Monte-Carlo-sampling-based reward sampling algorithm to efficiently sample responses to maximize the system performance.

Usually, one or more of accuracy, Quadratic Weighted Kappa (QWK), or Cohen's kappa (Taghipour and Ng 2016; Zhao et al. 2017; Kumar et al. 2019; Grover et al. 2020; Singla et al. 2021a) are used in the automatic scoring literature, as they are robust measures of inter-rater reliability, a primary goal in Automated Scoring. A key point to be noted is that the reliability of the test (i.e., how consistently a test measures a characteristic) is measured on the global score (the aggregate of the responses) and less on the scores of the individual responses.
The global score determines admissions, interviews, and career growth, while per-item scores are used as indicators of particular skills. While, intuitively, there exists a monotonically increasing relationship between the reliability of the test on individual questions and on the overall score, we show that it is more efficient to consider the global context rather than the item-level context while sampling responses to be double-scored by humans.

We establish strong baselines using Uncertainty Sampling (§3.2), an importance sampling formulation that samples using the probability of being wrong output by the AS model. We propose Reward Sampling (§3.3), which samples based on the estimated reward of correcting a mistake.

We summarize our main contributions as follows:

- We propose to combine existing paradigms to integrate humans with Automated Scoring systems. Provided a budget indicating the number of responses that can be scored by human raters, we observe significant gains in accuracy and QWK using our proposed sampling model, Reward Sampling (§3.3). For instance, using a 40% human budget with an AS model of 64% accuracy, our sampling methodology can achieve an accuracy gain of 23%, while random sampling leads to 14% and uncertainty sampling leads to 15%. To the best of our knowledge, this is the first time such a formulation has been considered in Automatic Scoring systems.

- We conduct experiments on various models differing in accuracy to show our algorithm's model-agnostic nature (§3). We include results ranging from models deployed in real-world AS settings to crafted pseudo models. Averaging over these models, we observe a 19.80% increase in accuracy and a 25.60% increase in QWK when using reward sampling with 30% of the dataset as the human budget. The random sampling and uncertainty sampling baselines achieve 8.6% and 12.2% gains in accuracy, respectively.

- While augmenting the system's performance is an important goal, it is equally important to quantify this improvement, especially when deployed in the real world, where there are no labeled datasets to compare against and the consequences of misgrading, for both businesses and test takers, could be catastrophic. Thus, we also propose an algorithm to estimate the accuracy and QWK achieved, with statistical guarantees (§3.4).

2 Related Work

Broadly, our paper covers two areas of research: Automatic Scoring and sampling methods. Here we cover them briefly.

Automatic Scoring: The goal of an automatic scorer is to assess the language competence of a candidate with an accuracy matching that of a human grader, but faster, with greater consistency, and at a fraction of the cost (Malinin 2019; Yan, Rupp, and Foltz 2020). Almost all work in the automatic scoring domain has aimed to better model the scoring of essays and speech traits as a natural language processing task. The techniques have ranged from manually engineered natural language features (Kumar et al. 2019; Dong and Zhang 2016) to LSTMs, memory networks (Zhao et al. 2017), and transformers (Singla et al. 2021a). There has also been some recent work on other facets of AS, including adversarial testing (Ding et al. 2020; Kumar et al. 2020; Parekh et al. 2020), explainability (Kumar and Boulanger 2020), uncertainty estimation (Malinin 2019), off-topic detection (Malinin et al. 2016), evaluation metrics (Loukina et al. 2020), etc. To the best of our knowledge, there is no work on increasing the reliability of automatic scoring systems by bringing humans into the loop. Most white papers from second language testing firms mention results on historical data as a measure of their reliability (Brenzel and Settles 2017; Pearson 2019). However, historical results are not a guarantee of performance over time. Due to continuous domain shift, historical results cannot be trusted for a model's future performance. Therefore, performance guarantees of AS models are essential to establish institutional trust in them. To fill this research gap, we propose reward sampling based on Monte Carlo sampling methods for measuring and increasing AS systems' reliability.

Monte-Carlo Sampling For Evaluation: There has been much work on improving automatic metrics using Monte-Carlo sampling methods in machine translation and natural language (NL) evaluation (Chaganty, Mussman, and Liang 2018; Hashimoto, Zhang, and Liang 2019; Wei and Jia 2021). These works use statistical sampling methods like importance sampling and control variates to combine automatic NL evaluation with expensive human queries. To the best of our knowledge, we are the first to extend sampling techniques to the context of automatic scoring. We use them to combine relatively cheap automatic scoring model results with expensive human expert scorers. Kang et al. (2020) use sampling for approximate selection queries. They combine cheap classifiers with expensive estimators to meet minimum precision or recall targets with guarantees. We extend their work to take the global context into account while estimating accuracy (§3.4).

3 System Overview

This section describes the components of the proposed solution, the intuition and reasoning behind the sampling mechanisms, and the algorithm for estimating the metrics with statistical guarantees. Given an Automated Scoring model, a dataset to be scored, and a human budget indicating the percentage of records we can provide to expert human raters for scoring, records are sampled making use of a pre-computed human-machine agreement matrix (described below). For the samples selected, we replace the predictions made by the AS model with the scores provided by the human raters and compute an estimate of the increase in accuracy and QWK with guarantees (Fig 2).

When considering sampling, the baseline approach is random sampling, i.e., sampling each record in the dataset with uniform probability. This is not a good allocation of resources: for models of high quality, most samples will not provide any value. For example, with a model of 75% accuracy, random sampling would only provide value for ∼25% of samples, as the rest would have been correctly scored anyway.
This motivates our search for a more efficient sampling mechanism, one that takes into account the probability of the model being wrong with respect to a human expert and, crucially, the reward that would be gained by correcting this mistake. We define the reward as the magnitude of the change in the global score that would occur when a local response is changed as a result of human correction of the machine score (§3.3).

Figure 3: A sample human-machine agreement matrix on a CEFR-aligned scoring scale. The rows indicate machine predictions, and each row is normalized to give the probability of the machine class matching the human-labeled class.

3.1 Human-Machine Agreement Matrix

The human-machine agreement matrix is a normalized confusion matrix of the AS model's predictions and the ground truth, precomputed on validation data or historical test data. As the matrix is normalized, each entry indicates the probability of the class predicted by the machine aligning with the class labeled by the human. Fig 3 shows a sample human-machine agreement matrix where m[Low B1][High B1] = 0.27 indicates the probability of the ground truth being High B1 when the machine has predicted Low B1.
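To make the construction concrete, the following minimal NumPy sketch computes such a row-normalized confusion matrix from validation predictions. The class list, variable names, and toy data are illustrative assumptions, not taken from the released code.

import numpy as np

# Hypothetical CEFR-aligned classes, used for illustration only.
CLASSES = ["A2", "Low B1", "High B1", "Low B2", "High B2", "C1"]

def agreement_matrix(machine_preds, human_labels, n_classes=len(CLASSES)):
    # Row-normalized confusion matrix: rows = machine prediction,
    # columns = human (ground-truth) label, entries = p(human class | machine class).
    counts = np.zeros((n_classes, n_classes))
    for m, h in zip(machine_preds, human_labels):
        counts[m, h] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0  # avoid division by zero for never-predicted classes
    return counts / row_sums

# Toy usage: class indices for a handful of validation records.
machine = np.array([1, 1, 2, 3, 1, 4])
human = np.array([1, 2, 2, 3, 3, 4])
M = agreement_matrix(machine, human)
print(M[1])  # distribution of the ground truth when the machine predicts "Low B1"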
3.2 Uncertainty Sampling

The key idea behind uncertainty sampling is that the machine is not equally likely to be wrong across all prediction classes. Some scores may be assigned with much better accuracy than others. This idea is borne out by the human-machine agreement matrix as well, where the probabilities of a correct prediction lie along the principal diagonal. We can see in Fig 3 that High B1 and Low B2 are accurately predicted, whereas A2 and High B2 predictions are likely to be wrong. Since the machine is likely making a wrong judgement when predicting these classes, it is more efficient to sample more from the records where these predictions have been made and have them corrected by human labelers.

To quantify this, we formulate Uncertainty Sampling as vanilla importance sampling, where the uncertainty of each class is calculated using the cross-entropy function. Each row in the human-machine agreement matrix represents the probability distribution of the ground truth when that particular class has been predicted. The cross entropy of this distribution with the ideal distribution (the one-hot encoding for that class) is calculated. For example, the distribution associated with Low B1 in the matrix is [0.0057, 0.61, 0.27, 0.11, 0.0029, 0]. The cross entropy of this distribution with the ideal distribution for Low B1 ([0, 1, 0, 0, 0, 0]) is calculated. In this way, we can quantify the "loss" associated with Low B1. Subsequently, every record is assigned the loss associated with the prediction made for that record, and this is normalized over the entire dataset to create a probability distribution. We draw a sample s ∼ U(D) without replacement from this uncertainty distribution U(D) over the dataset. The provided human budget indicates the number of samples to be drawn, and the likelihood of a record being drawn corresponds to the uncertainty associated with its prediction class.
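A minimal sketch of this procedure is given below, assuming the agreement matrix M and integer class indices from the previous snippet. Because the ideal distribution is one-hot, the cross entropy reduces to the negative log of the row's diagonal entry; the helper names and seed are illustrative.

import numpy as np

def class_uncertainties(M, eps=1e-12):
    # Cross entropy between each row of M and the one-hot ideal distribution
    # for that row's class, which reduces to -log p(correct | predicted class).
    return -np.log(np.diag(M) + eps)

def uncertainty_sample(machine_preds, M, budget, rng=np.random.default_rng(0)):
    # Draw `budget` record indices without replacement, with probability
    # proportional to the uncertainty of each record's predicted class.
    losses = class_uncertainties(M)[machine_preds]  # one loss per record
    probs = losses / losses.sum()                   # normalize over the dataset
    return rng.choice(len(machine_preds), size=budget, replace=False, p=probs)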
3.3 Reward Sampling

For single-skill testing exams (e.g., testing one of speaking, writing, or reading), like those offered by SLTI (2021) and LTI (2021), the test reliability and validity are measured over the complete test as opposed to individual prompts. While increasing accuracy on individual prompts (through sampling and subsequent human intervention) is a sure way of increasing the accuracy of the overall exam, it is more efficient to directly sample records which are more likely to affect the overall result, rather than simply sampling those which the machine is uncertain about. In uncertainty sampling, we sample records based on the likelihood of the prediction being wrong, but we do not consider whether being right would actually change the global score. This is the motivation behind reward sampling. Here we sample records which are more likely to generate a larger reward, i.e., a change in the score at the global level. To this end, the expected reward E_R is calculated for each record in the dataset as:

E_R(d) = \sum_{c \in C} p(c \mid m) \cdot \mathrm{reward}(d, c) \quad \forall d \in D    (1)

where d represents one record in the dataset D, c and m represent classes in the set of all classes C, p(c | m) indicates the probability of the ground truth being c when the machine has predicted m, and the reward function is denoted as reward. The expected reward encodes the reward gained by the ground truth being c when the machine has predicted m, weighted by the probability of the same, summed over every class c. p(c | m) is looked up from the human-machine agreement matrix, and the output of the reward function is weighted by this probability.

The reward function calculates the reward gained by swapping the predicted class with a different class. The aggregate label for the candidate associated with d is calculated before and after the swap with the new class, and the reward is defined as the absolute difference between the two scores, which encodes the magnitude of the score change that would happen if the prediction class were changed from m to c. The absolute difference is used because it is equally important whether the new score is greater or less than the predicted score, so both incur the same reward. If the prediction is an outlier compared to the predictions on other responses of the same candidate, a large reward could be generated when changing the prediction, making it a prime target for sampling. On the other hand, if changing the class to c does not change the final score, then a reward of 0 would be generated. With a zero reward, these records would never be sampled. Thus, to ensure that every record has a nonzero reward, i.e., a nonzero probability of being sampled, the reward is additively smoothed: E_R(d) = E_R(d) + Δ, where Δ = 0.001. In this manner, an expected reward is calculated for each record in the dataset.

The sampling procedure then proceeds as before: the rewards are normalized to create a probability distribution, from which a sample s ∼ E_R(D) is drawn. In using this sampling mechanism, we directly sample records that are most likely to provide an improvement at the aggregate level, compared to indirectly improving the aggregate metrics via uncertainty sampling.
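The sketch below implements Eq. (1) and the additive smoothing under the same assumptions as the earlier snippets. The aggregation rule shown (rounded mean of a candidate's item-level class indices) is a stand-in, since the exact global-score aggregation is not specified here; treat it as illustrative rather than as the deployed rule.

import numpy as np

DELTA = 0.001  # additive smoothing constant

def aggregate(item_classes):
    # Illustrative global-score rule: rounded mean of a candidate's
    # item-level class indices (the production rule may differ).
    return int(round(float(np.mean(item_classes))))

def expected_rewards(preds, M, candidate_of, delta=DELTA):
    # preds: predicted class index per response; M: human-machine agreement
    # matrix (rows = machine class); candidate_of: candidate id per response.
    n_classes = M.shape[0]
    ER = np.zeros(len(preds), dtype=float)
    for d in range(len(preds)):
        items = np.where(candidate_of == candidate_of[d])[0]
        pos = int(np.where(items == d)[0][0])
        base = aggregate(preds[items])
        m = preds[d]
        for c in range(n_classes):
            swapped = preds[items].copy()
            swapped[pos] = c
            reward = abs(aggregate(swapped) - base)  # |change in global score|
            ER[d] += M[m, c] * reward                # Eq. (1): weight by p(c | m)
    return ER + delta                                # smoothing: nonzero sampling probability

def reward_sample(ER, budget, rng=np.random.default_rng(0)):
    probs = ER / ER.sum()
    return rng.choice(len(ER), size=budget, replace=False, p=probs)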
3.4 Estimation with Guarantees

In high-stakes testing scenarios, it is critical to ensure that the system does not fail catastrophically. For this reason, it is important to provide estimates of system metrics with guarantees. Kang et al. (2020) describe an algorithm that provides statistical guarantees on precision/recall for a subset of results returned from a dataset. More specifically, given a dataset, a precision/recall target value, a sample size, and a failure probability, the algorithm returns a result (a subset of the dataset) which meets the required precision/recall target with a probabilistic guarantee.

Our task is similar, but instead looks at providing guarantees on the accuracy/QWK of the overall score on the entire dataset rather than on a subset and individual samples. To provide these guarantees, we form confidence intervals over accuracy/QWK and take the lower bound.

The samples selected by the reward and uncertainty sampling procedures are not a good fit for estimation, as they have been taken with the purpose of correcting mistakes and improving reliability. This means that highly underconfident samples would be selected, thus leading to inaccurate performance estimates. Kang et al. (2020) show that importance sampling based on a model's confidence of prediction improves over uniform random sampling by providing a lower-variance estimate. More specifically, they show that the squared confidence of the model minimizes the variance of the estimate. As we have not considered model confidence in our work, we take the following formulation as a proxy for confidence, applied over every candidate who wrote the test (not over responses to individual questions):

\zeta(t) = \sum_{i \in t} (1 - u_i)^2 \quad \forall t \in T    (2)

where ζ represents the confidence associated with a test taker t in the set of all test-takers T, i represents the individual responses of t, and u_i represents the normalized uncertainty of response i. From uncertainty sampling, we have a normalized uncertainty associated with each response; this is subtracted from 1, squared, and summed over all responses of a candidate to provide a confidence estimate. This confidence is normalized to create a probability distribution. A secondary, smaller sample is taken over this distribution of candidates, effectively sampling all underlying responses of each sampled candidate. Using the aggregated labels and predictions, lower-bound estimates of accuracy and kappa (McHugh 2012) are calculated.
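A minimal sketch of the estimation step is shown below, assuming per-response uncertainties from the uncertainty-sampling snippet and human-verified global labels for the secondary sample. The one-sided normal-approximation interval is one possible reading of "form confidence intervals and take the lower bound", and for clarity the sketch omits the importance-weight correction that a full treatment following Kang et al. (2020) would apply.

import numpy as np

def candidate_confidence(uncertainty, candidate_of):
    # Eq. (2): zeta(t) = sum over t's responses of (1 - u_i)^2.
    cands = np.unique(candidate_of)
    zeta = np.array([np.sum((1.0 - uncertainty[candidate_of == t]) ** 2) for t in cands])
    return cands, zeta

def accuracy_lower_bound(global_preds, global_labels, zeta, sample_size=200,
                         z=1.645, rng=np.random.default_rng(0)):
    # Draw a secondary sample of candidates proportionally to confidence and
    # return the lower end of a one-sided 95% normal-approximation interval
    # (z = 1.645) on global-score accuracy; QWK can be handled analogously.
    probs = zeta / zeta.sum()
    idx = rng.choice(len(zeta), size=min(sample_size, len(zeta)), replace=False, p=probs)
    correct = (global_preds[idx] == global_labels[idx]).astype(float)
    acc = correct.mean()
    se = correct.std(ddof=1) / np.sqrt(len(correct))
    return acc - z * se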
Table 1: Accuracy (acc) and Quadratic Weighted Kappa (kappa) for various models across multiple sampling methods and increasing percentages of the dataset as sample size. Bold indicates the best performing variant for each configuration.

4 Experiments

4.1 Dataset

To evaluate our method, we make use of data collected by Second Language Testing Inc. (SLTI) while conducting the Simulated Oral Proficiency Interview (SOPI) exam. The SOPI exam has been used since 1992 and studied extensively (Stansfield and Kenyon 1992, 1996). SOPI is used for interviews, university admissions, skill development, and as a test in several online courses (SLTI 2021). SOPI offers psychometric advantages in terms of reliability and validity, particularly in standardized testing situations.
The candidates in the dataset are primarily Filipino high school graduates and above. A test-taker is presented with six prompts on their computer, and their responses to each individual item are recorded. The prompts and the rubrics for evaluation follow the Common European Framework of Reference for Languages (CEFR) (Broeder and Martyniuk 2008) guidelines. The prompts' difficulty varies from B1 to C1. A candidate receives both prompt-level scores and a global score calculated from the individual prompt-level scores. The SOPI dataset has eight question papers (forms) containing six prompts each, and each form was attempted by 7,200 speakers on average. Many other works have used the SLTI dataset for tasks including automated scoring and coherence modeling (Grover et al. 2020; Patil et al. 2020; Stansfield and Winke 2008; Singla et al. 2021a).

4.2 Experimental Setup

To demonstrate that the sampling methods described are model agnostic, we conduct experiments using multiple models of varying accuracy. We leverage speech scoring models from Singla et al. (2021a), making use of state-of-the-art models such as BERT and bi-directional LSTMs, in both baseline versions and versions conditioned on speaker information. In addition, we also run experiments on a pseudo model, described as follows and sketched in code below. For a given accuracy, a pseudo model's predictions are generated by randomly changing 100−acc% of the ground truth labels. For example, the predictions of a pseudo model with 65% accuracy are the ground truth labels with 35% of labels randomly changed.
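A minimal sketch of such a pseudo model is given below; replacing a flipped label with a uniformly chosen different class is an illustrative assumption, as the text only specifies that 100−acc% of the labels are randomly changed.

import numpy as np

def pseudo_model_predictions(labels, accuracy, n_classes, rng=np.random.default_rng(0)):
    # labels: NumPy array of ground-truth class indices.
    # Corrupt (1 - accuracy) of the labels to simulate a model with the
    # given local-level accuracy.
    preds = labels.copy()
    n_flip = int(round((1.0 - accuracy) * len(labels)))
    flip_idx = rng.choice(len(labels), size=n_flip, replace=False)
    for i in flip_idx:
        # Replace with a uniformly chosen *different* class (an illustrative choice).
        choices = [c for c in range(n_classes) if c != labels[i]]
        preds[i] = rng.choice(choices)
    return preds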
Predictions, and hence accuracy, are generated at the local level, for each response, whereas we are concerned with the metrics at the global level, which are typically lower. The dataset is split into train and test sets, with the additional constraint that all responses of one candidate are contained in a single set and not split between train and test. For our experiments, since we do not have a precomputed human-machine agreement matrix, we compute it using the training set and hold out the test set for verifying our proposed system. In addition, the aggregate dataset must also be calculated from each candidate's individual responses, computing the candidate's global score from their responses. The experiments were conducted with sample sizes of up to 80% of the dataset to observe the effect of sample size on the improvement in accuracy for each sampling method. Records are sampled from the test set according to Reward Sampling, along with Random (uniform) Sampling and Uncertainty (importance) Sampling, our baselines. We replace the predictions of the records in the sample with the ground truth, following which we recompute the aggregate dataset. This dataset is used to calculate the system-level metrics, i.e., the metrics of the combination of the model and the human in the loop, not of the model alone. In estimating metrics, a secondary sample was taken. Empirically, we observed that a sample size of 200 was sufficient for stable estimation of a 95% confidence interval. We report our estimates on Reward Sampling when considering 80% of the dataset as the human budget, i.e., the most performant configuration.

Figure 4: For each model, we show the change in accuracy (left) and quadratic weighted kappa (QWK) (right) after sampling, with the sample size (human budget) shown on the x-axis. As can be seen, reward sampling outperforms both the uncertainty sampling and random sampling baselines for each model.
5 Results

5.1 Improving Reliability

Table 1 presents the results of the experiments conducted across configurations of models, sampling methods, and human budgets. Entries in bold indicate the best performing configuration, which is identical across nearly all models (Reward Sampling with the maximum sample size, 80% of the dataset). The models considered are BERT (baseline), BERT (two-stage speaker conditioning), BD-LSTM with Attention (baseline), BD-LSTM with Attention (two-stage speaker conditioning), and a pseudo model with accuracy = 0.75 at the local level. For all models, the dataset is aggregated, following which accuracy and QWK are calculated, giving the values in the Model metrics column.

Fig 4 shows the change in both accuracy and QWK at the global level for the various models. The changes are measured for increasing human budget, i.e., the percentage of responses available to be scored by humans, up to 80% of the dataset. For illustration, we consider the BERT-Baseline model. Firstly, we observe that random sampling shows minimal improvement (3%) over the actual accuracy when sampling 10% of the dataset to be scored by human raters. This is due to the model's accuracy (66% of all samples would have been predicted correctly anyway), and the remaining gains are further diminished by the aggregation process. Reward Sampling, on the other hand, shows an 8% gain in accuracy, more than twice the gain achieved by random sampling. Interestingly, Uncertainty Sampling shows gains similar to random sampling in all models except the pseudo model, where its performance is more in line with reward sampling. The predictions of the pseudo model are randomly generated, hence local gains translate well to global gains. This difference is likely the reason for the large gap in performance when considering uncertainty sampling on pseudo versus real models. Reward Sampling, where the reward that is gained by having a record rated by a human is also a sampling factor, shows significant gains across models, as shown in Fig 4. We note that the gains provided by Reward Sampling decline relative to the baseline sampling methods with increasing sample sizes. Initially, Reward Sampling outperforms the other sampling methods with large gains, up to a sample size of ∼30%. Beyond this mark, the gains are no longer as significant and the other methods slowly catch up. This trend holds across all models, indicating that Reward Sampling shows maximal gains over the baselines when sampling less than half of the dataset for human scoring.

Figure 5: The metrics of two models [0 - BERT-TwoStage, 1 - LSTM-Baseline] are presented. Results for other models are similar and are not shown for visual clarity.

5.2 Estimation with Guarantees

Fig 5 visualizes the metrics of the models when utilizing reward sampling, together with an estimate of the same. The sample size used for estimation remains constant; it is only the sample used for reward sampling that changes. After reward sampling, a sample based on the confidence distribution (§3.4) is drawn, and the 95% confidence interval for both accuracy and kappa is calculated. The lower bound is taken to provide a statistical guarantee that accuracy/QWK will fall below the estimated values in only 5% of all runs.

6 Conclusion

Automatic Scoring (AS) helps assess the language competency of candidates with accuracy matching that of a human grader, but faster, with greater consistency, and at a fraction of the cost. Existing systems either rely on double scoring, effectively scoring each sample by both a human and an AS system, or score solely by an AS system. Although double scoring is more reliable, it is considerably more expensive. We develop novel, sample-efficient algorithms to target the spectrum of possible solutions between these two extremes. We show that by using a relatively small human budget, we can improve and estimate performance with guarantees, thus increasing the reliability and trustworthiness of the system. We implement and evaluate our algorithms on real exam data, showing that they outperform naive baselines in all settings evaluated. These results indicate the promise of probabilistic algorithms for improving and estimating automatic scoring reliability with statistical guarantees.

As part of future research, we plan to work on even more sample-efficient algorithms and on incorporating trait scoring while sampling. Another possible research avenue where we can apply our algorithms is test design. While test design currently involves linguistic validity assessment studies, it does not take into account the reliability of the final test built. Reliability of a test could easily be incorporated as another constraint through our modelling paradigm.
References

ACARA. 2018. ACARA News (January 2018). https://www.acara.edu.au/news-and-media/news-details?section=201801250300#201801250300.

Brenzel, J.; and Settles, B. 2017. The Duolingo English Test—Design, Validity, and Value. DET Whitepaper (Short).

Broeder, P.; and Martyniuk, W. 2008. Language education in Europe: The common European framework of reference. Encyclopedia of language and education, 209–226.

Chaganty, A. T.; Mussman, S.; and Liang, P. 2018. The price of debiasing automatic metrics in natural language evaluation. arXiv preprint arXiv:1807.02202.

Ding, Y.; Riordan, B.; Horbach, A.; Cahill, A.; and Zesch, T. 2020. Don't take "nswvtnvakgxpm" for an answer–The surprising vulnerability of automatic content scoring systems to adversarial input. In Proceedings of the 28th International Conference on Computational Linguistics, 882–892.

Dong, F.; and Zhang, Y. 2016. Automatic features for essay scoring–an empirical study. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 1072–1077.

ETS. 2021. Find Test Centers and Dates for ETS TOEFL Exam. https://v2.ereg.ets.org/ereg/public/testcenter/availability/seats? p=TEL.

Grover, M. S.; Kumar, Y.; Sarin, S.; Vafaee, P.; Hama, M.; and Shah, R. R. 2020. Multi-modal automated speech scoring using attention fusion. arXiv preprint arXiv:2005.08182.

Hashimoto, T. B.; Zhang, H.; and Liang, P. 2019. Unifying Human and Statistical Evaluation for Natural Language Generation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics.

Incorporated, M. 2017. PEG - The engine driving Automated Essay Scoring.

Kang, D.; Gan, E.; Bailis, P.; Hashimoto, T.; and Zaharia, M. 2020. Approximate selection with guarantees using proxies. arXiv preprint arXiv:2004.00827.

Kumar, V.; and Boulanger, D. 2020. Explainable Automated Essay Scoring: Deep Learning Really Has Pedagogical Value. In Frontiers in Education, volume 5, 186. Frontiers.

Kumar, Y.; Aggarwal, S.; Mahata, D.; Shah, R. R.; Kumaraguru, P.; and Zimmermann, R. 2019. Get IT Scored Using AutoSAS—An Automated System for Scoring Short Answers. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 9662–9669.

Kumar, Y.; Bhatia, M.; Kabra, A.; Li, J. J.; Jin, D.; and Shah, R. R. 2020. Calling out bluff: Attacking the robustness of automatic scoring systems with simple adversarial testing. arXiv preprint arXiv:2007.06796.

LaFlair, G. T.; and Settles, B. 2019. Duolingo English test: Technical manual. Retrieved April, 28: 2020.

Le, T. 2020. Testing & Educational Support in the US. https://my.ibisworld.com/us/en/industry/61171/key-statistics.

Liu, J.; Xu, Y.; and Zhu, Y. 2019. Automated essay scoring based on two-stage learning. arXiv preprint arXiv:1901.07744.

Loukina, A.; Madnani, N.; Cahill, A.; Yao, L.; Johnson, M. S.; Riordan, B.; and McCaffrey, D. F. 2020. Using PRMSE to evaluate automated scoring systems in the presence of label noise. In Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications, 18–29.

LTI. 2021. Language Testing Inc. https://www.languagetesting.com/lti-information/general-test-descriptions.

Malinin, A. 2019. Uncertainty estimation in deep learning with application to spoken language assessment. Ph.D. thesis, University of Cambridge.

Malinin, A.; Van Dalen, R.; Knill, K.; Wang, Y.; and Gales, M. 2016. Off-topic response detection for spontaneous spoken English assessment. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1075–1084.

McHugh, M. L. 2012. Interrater reliability: the kappa statistic. Biochemia medica, 22(3): 276–282.

O'Donnell, P. 2018. Computers are now grading essays on Ohio's state tests.

Page, E. B. 1967. Statistical and linguistic strategies in the computer grading of essays. In COLING 1967 Volume 1: Conference Internationale Sur Le Traitement Automatique Des Langues.

Parekh, S.; Singla, Y. K.; Chen, C.; Li, J. J.; and Shah, R. R. 2020. My Teacher Thinks The World Is Flat! Interpreting Automatic Essay Scoring Mechanism. arXiv preprint arXiv:2012.13872.

Patil, R.; Singla, Y. K.; Shah, R. R.; Hama, M.; and Zimmermann, R. 2020. Towards Modelling Coherence in Spoken Discourse. arXiv preprint arXiv:2101.00056.

Pearson. 2019. Pearson Test of English Academic: Automated Scoring. https://assets.ctfassets.net/yqwtwibiobs4/26s58z1YI9J4oRtv0qo3mo/88121f3d60b5f4bc2e5d175974d52951/Pearson-Test-of-English-Academic-Automated-Scoring-White-Paper-May-2018.pdf.

Perelman, L.; Sobel, L.; Beckman, M.; and Jiang, D. 2014. The Basic Automatic B.S. Essay Language Generator (BABEL Generator). https://lesperelman.com/writing-assessment-robo-grading/babel-generator/.

Service, E. T. 2020. Education Testing Service EIN 21-0634479. https://www.causeiq.com/organizations/educational-testing-service,210634479/.

Singla, Y. K.; Gupta, A.; Bagga, S.; Chen, C.; Krishnamurthy, B.; and Shah, R. R. 2021a. Speaker-Conditioned Hierarchical Modeling for Automated Speech Scoring. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, 1681–1691.

Singla, Y. K.; Parekh, S.; Singh, S.; Li, J. J.; Shah, R. R.; and Chen, C. 2021b. AES Systems Are Both Overstable And Oversensitive: Explaining Why And Proposing Defenses. arXiv preprint arXiv:2109.11728.
SLTI. 2021. Second Language Testing Inc. https://secondlanguagetesting.com/.

Stansfield, C.; and Winke, P. 2008. Testing aptitude for second language learning. Encyclopaedia of language and education, 2nd Edition: Language Testing and Assessment, 7: 81–94.

Stansfield, C. W.; and Kenyon, D. M. 1992. The development and validation of a simulated oral proficiency interview. The Modern Language Journal, 76(2): 129–141.

Stansfield, C. W.; and Kenyon, D. M. 1996. Test Development Handbook: Simulated Oral Proficiency Interview (SOPI). Center for Applied Linguistics.

Strauss, V. 2020. How much do big education nonprofits pay their bosses? Quite a bit, it turns out. https://www.washingtonpost.com/news/answer-sheet/wp/2015/09/30/how-much-do-big-education-nonprofits-pay-their-bosses-quite-a-bit-it-turns-out/.

Taghipour, K.; and Ng, H. T. 2016. A neural approach to automated essay scoring. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 1882–1891.

TechNavio. 2020. Global Higher Education Testing and Assessment Market 2020-2024. https://www.researchandmarkets.com/reports/5136950/global-higher-education-testing-and-assessment.

Wei, J.; and Jia, R. 2021. The statistical advantage of automatic NLG metrics at the system level. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics.

Yan, D.; Rupp, A. A.; and Foltz, P. W. 2020. Handbook of automated scoring: Theory into practice. CRC Press.

Yang, Y.; Buckendahl, C. W.; Juszkiewicz, P. J.; and Bhola, D. S. 2002. A review of strategies for validating computer-automated scoring. Applied Measurement in Education, 15(4): 391–412.

Zechner, K.; Higgins, D.; Xi, X.; and Williamson, D. M. 2009. Automatic scoring of non-native spontaneous speech in tests of spoken English. Speech Communication, 51(10): 883–895.

Zhao, S.; Zhang, Y.; Xiong, X.; Botelho, A.; and Heffernan, N. 2017. A memory-augmented neural model for automated grading. In Proceedings of the Fourth (2017) ACM Conference on Learning@Scale, 189–192.