
On the Use of BERT for Automated Essay Scoring:

Joint Learning of Multi-Scale Essay Representation


Yongjie Wang1  Chuan Wang1  Ruobing Li1  Hui Lin1,2
1 LAIX Inc.
2 Shanghai Key Laboratory of Artificial Intelligence in Learning and Cognitive Science
{yongjie.wang, chuan.wang, ruobing.li, hui.lin}@liulishuo.com

arXiv:2205.03835v2 [cs.CL] 21 May 2022

Abstract

In recent years, pre-trained models have become dominant in most natural language processing (NLP) tasks. However, in the area of Automated Essay Scoring (AES), pre-trained models such as BERT have not been properly used to outperform other deep learning models such as LSTM. In this paper, we introduce a novel multi-scale essay representation for BERT that can be jointly learned. We also employ multiple losses and transfer learning from out-of-domain essays to further improve the performance. Experiment results show that our approach derives much benefit from joint learning of multi-scale essay representation and obtains almost the state-of-the-art result among all deep learning models in the ASAP1 task. Our multi-scale essay representation also generalizes well to the CommonLit Readability Prize (CRP2) data set, which suggests that the novel text representation proposed in this paper may be a new and effective choice for long-text tasks.

1 https://www.kaggle.com/c/asap-aes
2 https://www.kaggle.com/c/commonlitreadabilityprize/data

1 Introduction

AES is a valuable task, which can promote the development of automated assessment and help teachers reduce the heavy burden of assessment. With the rise of online education in recent years, more and more researchers have begun to pay attention to this field.

AES systems typically consist of two modules: an essay representation module and an essay scoring module. The essay representation module extracts features to represent an essay, and the essay scoring module rates the essay with the extracted features.

When a teacher rates an essay, the score is often affected by multiple signals from different granularity levels, such as the token level, sentence level and paragraph level. For example, the features may include the number of words, the essay structure, the mastery of vocabulary, the syntactic complexity, etc. These features come from different scales of the essay. This inspires us to extract multi-scale features from the essays which represent multi-level characteristics of the essays.

Most deep neural network AES systems use LSTM or CNN. Some researchers (Uto et al., 2020; Rodriguez et al., 2019; Mayfield and Black, 2020) attempt to use BERT (Devlin et al., 2019) in their AES systems but fail to outperform other deep neural network methods (Dong et al., 2017; Tay et al., 2018). We believe previous approaches using BERT for AES suffer from at least three limitations. First, the pre-trained models are usually trained at the sentence level, and thus fail to learn enough knowledge of whole essays. Second, the AES training data is usually quite limited for direct fine-tuning of the pre-trained models in order to learn a better representation of essays. Last but not least, mean squared error (MSE) is commonly used as the loss function in the AES task. However, the distribution of the sample population and the sorting properties between samples are also important issues to be considered when designing the loss functions, as they imitate the psychological process of teachers rating essays. Different optimizations can also bring diversity to the final overall score distribution and contribute to the effectiveness of ensemble learning.

To address the aforementioned issues and limitations, we introduce joint learning of multi-scale essay representation into the AES task with BERT, which outperforms the state-of-the-art deep learning models based on LSTM (Dong et al., 2017; Tay et al., 2018). We propose to explicitly model more effective representations by extracting multi-scale features as well as leveraging the knowledge learned from numerous sentence data. As the training data is limited, we also employ transfer learning from out-of-domain essays, which is inspired by (Song et al., 2020).
To introduce diversity into the essay score distribution, we combine two other loss functions with MSE. When training our model with multiple losses and transfer learning using R-Drop (Liang et al., 2021), we almost achieve the state-of-the-art result among all deep learning models. The source code of the prediction module with a trained model for ASAP's prompt 8 is publicly available3.

3 https://github.com/lingochamp/Multi-Scale-BERT-AES

In summary, the contributions of this work are as follows:

• We propose a novel essay scoring approach to jointly learn multi-scale essay representation with BERT, which significantly improves the result compared to the traditional use of pre-trained language models.

• Our method shows significant advantages in long-text tasks and obtains almost the state-of-the-art result among all deep learning models in the ASAP task.

• We introduce two new loss functions which are inspired by the mental process of teachers rating essays, and employ transfer learning from out-of-domain essays with R-Drop (Liang et al., 2021), which further improves the performance for rating essays.

2 Related Work

The dominant approaches in AES can be grouped into three categories: traditional AES, deep neural network AES and pre-training AES.

• Traditional AES usually uses regression or ranking systems with complicated handcrafted features to rate an essay (Larkey, 1998; Rudner and Liang, 2002; Attali and Burstein, 2006; Yannakoudakis et al., 2011; Chen and He, 2013; Phandi et al., 2015; Cozma et al., 2018). These handcrafted features are based on the prior knowledge of linguists; therefore they can achieve good performance even with small amounts of data.

• Deep Neural Network AES has made great progress and recently achieved results comparable with traditional AES (Taghipour and Ng, 2016; Dong and Zhang, 2016; Dong et al., 2017; Alikaniotis et al., 2016; Wang et al., 2018; Tay et al., 2018; Farag et al., 2018; Song et al., 2020; Ridley et al., 2021; Muangkammuen and Fukumoto, 2020; Mathias et al., 2020). While handcrafted features are complicated to implement and their careful manual design makes them less portable, deep neural networks such as LSTM or CNN can automatically discover and learn complex features of essays, which makes AES an end-to-end task. Besides saving much feature design time, deep neural networks transfer well among different AES tasks. By combining traditional and deep neural network approaches, AES can obtain an even better result, which benefits from both representations (Jin et al., 2018; Dasgupta et al., 2018; Uto et al., 2020). However, the ensemble approach still needs handcrafted features, which cost researchers considerable effort.

• Pre-training AES uses a pre-trained language model as the initial essay representation module and fine-tunes the model on the essay training set. Though pre-trained methods have achieved state-of-the-art performance in most NLP tasks, most of them (Uto et al., 2020; Rodriguez et al., 2019; Mayfield and Black, 2020) fail to show an advantage over other deep learning methods (Dong et al., 2017; Tay et al., 2018) in the AES task. As far as we know, the works of Cao et al. (2020) and Yang et al. (2020) are the only two pre-training approaches which surpass the other deep learning methods. Their improvement mainly comes from the training optimization: Cao et al. (2020) employ two self-supervised tasks and domain adversarial training, while Yang et al. (2020) combine regression and ranking to train their model.

3 Approach

3.1 Task Formulation

The AES task is defined as follows: given an essay with n words X = {x1, x2, ..., xn}, we need to output one score y as a result of measuring the level of this essay.

The Quadratic Weighted Kappa (QWK) (Cohen, 1968) metric is commonly used by researchers to evaluate AES systems; it measures the agreement between the scoring results of two raters.
3.2 Multi-scale Essay Representation

We obtain the multi-scale essay representation from three scales: token-scale, segment-scale and document-scale.

Token-scale and Document-scale Input We apply one pre-trained BERT (Devlin et al., 2019) model for the token-scale and document-scale essay representations. The BERT tokenizer is used to split the essay into a token sequence T1 = [t1, t2, ..., tn], where ti is the ith token and n is the number of tokens in the essay. All tokens mentioned in this paper refer to WordPiece, which is obtained by the subword tokenization algorithm used for BERT. We construct a new sequence T2 from T1 as follows, where L is set to 510, the maximum sequence length supported by BERT excluding the tokens [CLS] and [SEP]:

T2 = [CLS] + [t1, t2, ..., tL] + [SEP]          if n > L
     [CLS] + T1 + [SEP]                         if n = L
     [CLS] + T1 + [PAD] * (L − n) + [SEP]       if n < L

The final input representation is the sum of the token embeddings, the segmentation embeddings and the position embeddings. A detailed description can be found in the work of BERT (Devlin et al., 2019).
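For illustration, the construction of T2 amounts to truncating or padding the WordPiece sequence to L = 510 and wrapping it with the special tokens. A sketch with the HuggingFace tokenizer (our own code, not necessarily the released implementation):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
MAX_LEN = 510  # L: max BERT input length excluding [CLS] and [SEP]

def build_t2(essay):
    """Build the fixed-length input sequence T2 from an essay string."""
    tokens = tokenizer.tokenize(essay)           # T1: WordPiece tokens
    tokens = tokens[:MAX_LEN]                    # truncate when n > L
    pad = ["[PAD]"] * (MAX_LEN - len(tokens))    # pad when n < L
    # Note: following the definition above, padding precedes [SEP]
    t2 = ["[CLS]"] + tokens + pad + ["[SEP]"]
    return tokenizer.convert_tokens_to_ids(t2)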
Document-scale The document-scale representation is obtained from the [CLS] output of the BERT model. As the [CLS] output aggregates the whole sequence representation, it attempts to extract the essay information at the most global granularity.

Token-scale As the BERT model is pre-trained by Masked Language Modeling (Devlin et al., 2019), the sequence outputs can capture the context information to represent each token. An essay often consists of hundreds of tokens, so an RNN is not the proper choice to combine all the token information, due to the vanishing gradient problem. Instead, we apply a max-pooling operation to all the sequence outputs and obtain the combined token-scale essay representation. Specifically, the max-pooling layer generates a d-dimensional vector W = [w1, w2, ..., wj, ..., wd] whose element wj is computed as below:

wj = max{h1,j, h2,j, ..., hn,j}

where d is the hidden size of the BERT model. As we use the pre-trained BERT model bert-base-uncased4, the hidden size d is 768. All the n sequence outputs of the BERT model are annotated as [h1, h2, ..., hi, ..., hn], where hi is a d-dimensional vector [hi,1, hi,2, ..., hi,d] representing the ith sequence output, and hi,j is the jth element of hi.

4 https://huggingface.co/bert-base-uncased
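A PyTorch sketch of these two representations (our illustration of the description above, not necessarily the released implementation):

import torch
from transformers import BertModel

bert = BertModel.from_pretrained("bert-base-uncased")

def doc_and_token_scale(input_ids, attention_mask):
    """Return the document-scale ([CLS]) and token-scale (max-pooled) vectors."""
    out = bert(input_ids=input_ids, attention_mask=attention_mask)
    h = out.last_hidden_state        # (batch, seq_len, 768) sequence outputs
    doc = h[:, 0, :]                 # [CLS] output: document-scale representation
    tok, _ = h[:, 1:, :].max(dim=1)  # element-wise max over tokens: w_j = max_i h_{i,j}
    return doc, tok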
Segment-scale Assume the segment-scale value set is K = [k1, k2, ..., ki, ..., kS], where S is the number of segment scales we want to explore and ki is the ith segment-scale in K. Given a token sequence T1 = [t1, t2, ..., tn] for an essay, we obtain the segment-scale essay representation corresponding to scale ki as follows:

1. We define np as the maximum number of tokens corresponding to each essay prompt p. We truncate the token sequence to np tokens if the essay length is longer than np; otherwise we pad [PAD] to the sequence to reach the length np.

2. Divide the token sequence into m = ceil(np / ki) segments, each of length ki except for the last segment, which is similar to the work of Mulyar et al. (2019).

3. Input each of the m token segments into the BERT model, and get m segment representation vectors from the [CLS] outputs.

4. Use an LSTM model to process the sequence of m segment representations, followed by an attention pooling operation on the hidden states of the LSTM output, to obtain the segment-scale essay representation corresponding to scale ki.

The LSTM cell units process the sequence of segment representations and generate the hidden states as follows:

it = σ(Qi · st + Ui · ht−1 + bi)
ft = σ(Qf · st + Uf · ht−1 + bf)
ĉt = tanh(Qc · st + Uc · ht−1 + bc)
ct = it ◦ ĉt + ft ◦ ct−1
ot = σ(Qo · st + Uo · ht−1 + bo)
ht = ot ◦ tanh(ct)

where st is the tth segment representation from the BERT [CLS] output and ht is the tth hidden state generated by the LSTM. Qi, Qf, Qc, Qo, Ui, Uf, Uc and Uo are weight matrices, and bi, bf, bc and bo are bias vectors.

The attention pooling operation we use is similar to the work of Dong et al. (2017), which is defined as follows:

α̂t = tanh(Qa · ht + ba)
αt = exp(qa · α̂t) / Σj exp(qa · α̂j)
o = Σt αt · ht

o is the segment-scale essay representation corresponding to the scale ki. αt is the attention weight for hidden state ht. Qa, ba and qa are the weight matrix, bias and weight vector respectively.
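A sketch of the encoder for one segment-scale k (our illustration; module names and shapes are our assumptions, not the released implementation):

import torch
import torch.nn as nn

class SegmentScaleEncoder(nn.Module):
    """One segment scale: BERT [CLS] per segment -> LSTM -> attention pooling."""

    def __init__(self, bert, hidden=768):
        super().__init__()
        self.bert = bert
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.att = nn.Linear(hidden, hidden)         # Q_a and b_a
        self.q = nn.Parameter(torch.randn(hidden))   # q_a

    def forward(self, segment_ids):
        # segment_ids: (m, k) token ids, one row per segment of length k
        cls = self.bert(segment_ids).last_hidden_state[:, 0, :]  # s_t: (m, hidden)
        h, _ = self.lstm(cls.unsqueeze(0))                       # h_t: (1, m, hidden)
        a = torch.tanh(self.att(h))                              # alpha_hat_t
        w = torch.softmax(a @ self.q, dim=1)                     # alpha_t
        return (w.unsqueeze(-1) * h).sum(dim=1)                  # o: (1, hidden)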
3.3 Model Architecture

The model architecture is depicted in Figure 1.

[Figure 1: The proposed automated essay scoring architecture based on multi-scale essay representation. The left part illustrates the document-scale and token-scale essay representation and scoring module, and the right part illustrates the S segment-scale essay representations and scoring modules.]

We apply one BERT model to obtain the document-scale and token-scale essay representations. Their concatenation is input into a dense regression layer which predicts the score corresponding to the document-scale and token-scale. For each segment-scale k with number of segments m, we apply another BERT model to get m [CLS] outputs, and apply an LSTM model followed by an attention layer to get the segment-scale representation. We input the segment-scale representation into another dense regression layer to get the score corresponding to segment-scale k. The final score is obtained by adding the scores of all S segment-scales and the score of the document-scale and token-scale, which is illustrated as below:

y = Σk yk + ydoc,tok
yk = Ŵseg · ok + bseg
ydoc,tok = Ŵdoc,tok · Hdoc,tok + bdoc,tok
Hdoc,tok = wdoc ⊕ W

yk is the predicted score corresponding to segment-scale k. ydoc,tok is the predicted score corresponding to the document-scale and token-scale. Ŵseg and bseg are the weight matrix and bias for the segment-scale respectively. Ŵdoc,tok and bdoc,tok are the weight matrix and bias for the document- and token-scales. ok is the segment-scale essay representation with scale k. wdoc is the document-scale essay representation. W is the token-scale essay representation. Hdoc,tok is the concatenation of the document-scale and token-scale essay representations.
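A sketch of the scoring heads described above (our illustration, with assumed module names):

import torch
import torch.nn as nn

class MultiScaleScorer(nn.Module):
    """Final score y = sum_k y_k + y_doc,tok over all S segment-scale heads."""

    def __init__(self, num_segment_scales, hidden=768):
        super().__init__()
        self.doc_tok_head = nn.Linear(2 * hidden, 1)  # on H_doc,tok = w_doc concat W
        self.seg_heads = nn.ModuleList(
            [nn.Linear(hidden, 1) for _ in range(num_segment_scales)]
        )

    def forward(self, doc, tok, seg_reps):
        # doc, tok: (batch, hidden); seg_reps: one (batch, hidden) tensor per scale
        y = self.doc_tok_head(torch.cat([doc, tok], dim=-1))
        for head, o_k in zip(self.seg_heads, seg_reps):
            y = y + head(o_k)
        return y.squeeze(-1)  # one predicted score per essay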
3.4 Loss Function

We use three loss functions to train the model.

MSE measures the average of the squared errors between predicted scores and labels, which is defined as below:

MSE(y, ŷ) = (1/N) Σi (yi − ŷi)²

where yi and ŷi are the predicted score and the label for the ith essay respectively, and N is the number of essays.

Similarity (SIM) measures whether two vectors are similar or dissimilar by using the cosine function. A teacher takes into account the overall level distribution of all the students when rating an essay. Following such intuition, we introduce the SIM loss to the AES task. In each training step, we take the predicted scores of the essays in the batch as the predicted vector y, and the labels as the label vector ŷ. The SIM loss rewards similar vector pairs to make the model think more about the correlation among the batch of essays. The SIM loss is defined as below:

SIM(y, ŷ) = 1 − cos(y, ŷ)
y = [y1, y2, ..., yN]
ŷ = [ŷ1, ŷ2, ..., ŷN]

where yi and ŷi are the predicted score and label for the ith essay respectively, and N is the number of essays.

Margin Ranking (MR) measures the ranking order for each essay pair in the batch. We introduce the MR loss because the sorting property between essays is a key factor in scoring. For each batch of essays, we first enumerate all the essay pairs, and then compute the MR loss as follows. The MR loss attempts to make the model penalize wrong orders.

MR(y, ŷ) = (1/N̂) Σi,j max(0, −ri,j · (yi − yj) + b)

ri,j = 1 if ŷi > ŷj;  ri,j = −1 if ŷi < ŷj;  ri,j = −sgn(yi − yj) if ŷi = ŷj

yi and ŷi are the predicted score and label for the ith essay respectively. N̂ is the number of essay pairs. b is a hyper-parameter, which is set to 0 in our experiments. For each sample pair (i, j), when the label ŷi is larger than ŷj, the predicted result yi should be larger than yj; otherwise, the pair contributes yj − yi to the loss. When ŷi is equal to ŷj, the loss is actually |yi − yj|.

The combined loss is described as below:

Loss_total(y, ŷ) = α · MSE(y, ŷ) + β · MR(y, ŷ) + γ · SIM(y, ŷ)

α, β and γ are weight parameters which are tuned according to the performance on the develop set.
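The three losses can be sketched in PyTorch as follows (our own helper, mirroring the formulas above; each unordered pair is enumerated once, which is equivalent since the MR term is symmetric in (i, j)):

import torch
import torch.nn.functional as F

def combined_loss(y, y_hat, alpha, beta, gamma, b=0.0):
    """alpha * MSE + beta * MR + gamma * SIM over one batch of scores."""
    mse = F.mse_loss(y, y_hat)
    # SIM: 1 - cosine similarity between the predicted and label score vectors
    sim = 1.0 - F.cosine_similarity(y.unsqueeze(0), y_hat.unsqueeze(0)).squeeze()
    # MR: margin ranking over all essay pairs in the batch
    i, j = torch.triu_indices(len(y), len(y), offset=1)
    r = torch.sign(y_hat[i] - y_hat[j])                  # +1 / -1 by label order
    ties = r == 0
    r[ties] = -torch.sign(y[i] - y[j])[ties]             # tie case: -sgn(y_i - y_j)
    mr = torch.clamp(-r * (y[i] - y[j]) + b, min=0.0).mean()
    return alpha * mse + beta * mr + gamma * sim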
4 Experiment

4.1 Data and Evaluation

The ASAP data set is widely used in the AES task and contains eight different prompts. A detailed description can be seen in Table 1. For each prompt, the WordPiece length indicates the smallest number that is bigger than the WordPiece lengths of 90% of the essays.

We evaluate the scoring performance using QWK on the ASAP data set, which is the official metric in the ASAP competition. Following previous work, we adopt 5-fold cross validation with a 60/20/20 split for the train, develop and test sets.

The CRP data set provides 2834 excerpts from several time periods together with reading ease scores ranging from -3.68 to 1.72. The average length of the excerpts is 175 words and the WordPiece length is 252. We also use 5-fold cross validation with a 60/20/20 split for the train, develop and test sets on the CRP data set. As the RMSE metric is used in the CRP competition, we also use it to evaluate our system on the ease score prediction task.

Prompt  Essays  Avg length  Score Range  WordPiece length
1       1783    350         2-12         649
2       1800    350         1-6          704
3       1726    150         0-3          219
4       1772    150         0-3          203
5       1805    150         0-4          258
6       1800    150         0-4          289
7       1569    250         0-30         371
8       723     650         0-60         1077

Table 1: Statistics of the ASAP data set.

4.2 Baseline

The baseline models for comparison are described as follows.

EASE5 is the best open-source system that participated in the ASAP competition, ranking third among 154 participants. EASE uses regression techniques with handcrafted features. Results of EASE with the settings of Support Vector Regression (SVR) and Bayesian Linear Ridge Regression (BLRR) are reported in (Phandi et al., 2015).

5 http://github.com/edx/ease

CNN+RNN Various deep neural networks based on CNN and RNN for AES are studied by Taghipour and Ng (2016). They combine CNN ensembles and LSTM ensembles over 10 runs and get the best result in their experiment.

Hierarchical LSTM-CNN-Attention (Dong et al., 2017) builds a hierarchical sentence-document model, which uses a CNN to encode sentences and an LSTM to encode texts. The attention mechanism is used to automatically determine the relative weights of words and sentences in generating sentence representations and text representations respectively. They obtain the state-of-the-art result among all neural models without pre-training.
SKIPFLOW (Tay et al., 2018) proposes the SKIPFLOW mechanism to model the relationships between snapshots of the hidden representations of an LSTM. The work of Tay et al. (2018) also obtains the state-of-the-art result among all neural models without pre-training.

Dilated LSTM with Reinforcement Learning (Wang et al., 2018) proposes a method using a dilated LSTM network in a reinforcement learning framework. They attempt to directly optimize the model using the QWK metric, which considers the rating schema.

HA-LSTM+SST+DAT and BERT+SST+DAT (Cao et al., 2020) propose to use two self-supervised tasks and a domain adversarial training technique to optimize their training, which is the first work to use a pre-trained language model to outperform LSTM based methods. They experiment with both a hierarchical LSTM model and BERT in their work, which are HA-LSTM+SST+DAT and BERT+SST+DAT respectively.

R²BERT (Yang et al., 2020) combines regression and ranking to fine-tune a BERT model, which also outperforms LSTM based methods and even obtains a new state-of-the-art.

4.3 Settings

To compare with the baseline models and further study the effectiveness of multi-scale essay representations, losses and transfer learning, we conduct the following experiments.

Multi-scale Models. These models are optimized with the MSE loss. BERT-DOC represents essays with document-scale features based on BERT. BERT-TOK represents essays with token-scale features based on BERT. BERT-DOC-TOK represents essays with both document-scale and token-scale features based on BERT. BERT-DOC-TOK-SEG represents essays with document-scale, token-scale and multiple segment-scale features based on BERT. Longformer (Beltagy et al., 2020) is an extension of transformers with an attention mechanism that scales linearly with sequence length, making it easy to process long documents. We conduct experiments to show that our multi-scale features also work with Longformer and can further improve the performance on long-text tasks. Longformer-DOC-TOK-SEG uses document-scale, token-scale and multiple segment-scale features to represent essays, but is based on Longformer instead of BERT. Longformer-DOC represents essays with document-scale features based on Longformer.

Models with Transfer Learning. To transfer learn from the out-of-domain essays6, we additionally employ a pre-training stage, which is similar to the work of Song et al. (2020). In this stage, we scale all the labels of essays from out-of-domain data into the range 0-1 and pre-train the model on them with the MSE loss. After the pre-training stage, we continue to fine-tune the model on in-domain essays. Tran-BERT-MS has the same modules as BERT-DOC-TOK-SEG, with pre-training on out-of-domain data. MS means multi-scale features.

6 For each prompt, we use all the essays from the other prompts in the ASAP data set.

Models with Multiple Losses. Based on the Tran-BERT-MS model, we explore the performance of adding multiple loss functions. Tran-BERT-MS-ML additionally employs the MR loss and the SIM loss. ML means multiple losses. Tran-BERT-MS-ML-R incorporates the R-Drop strategy (Liang et al., 2021) in training, based on the Tran-BERT-MS-ML model.

For the proposed model architecture depicted in Figure 1, the BERT model in the left part is shared by the document-scale and token-scale essay representations, and the other BERT model in the right part is shared by all segment-scale essay representations. We use "bert-base-uncased", which includes 12 transformer layers with a hidden size of 768. In the training stage, we freeze all the layers in the BERT models except the last layer, which is more task related than the other layers. The Longformer model used in our work is "longformer-base-4096". For the MR loss, we set b to 0. The weights α, β and γ are tuned according to the performance on the develop set. We use the Adam optimizer (Kingma and Ba, 2015) to fine-tune the model parameters in an end-to-end fashion with a learning rate of 6e-5, β1 = 0.9, β2 = 0.999 and L2 weight decay of 0.005. The coefficient weight α in R-Drop is 9. We set the batch size to 32. We use dropout in the training stage with a drop rate of 0.1. We train all the models for 80 epochs and select the best model according to the performance on the develop set. We use a greedy search method to find the best combination of segment scales, which is described in detail in Appendix A. Following Cao et al. (2020), we perform significance tests for our models.
ID  Models                                                 P1     P2     P3     P4     P5     P6     P7     P8     Average
1   EASE(SVR) (Phandi et al., 2015)                        0.781  0.621  0.630  0.749  0.782  0.771  0.727  0.534  0.699
2   EASE(BLRR) (Phandi et al., 2015)                       0.761  0.606  0.621  0.742  0.784  0.775  0.730  0.617  0.705
3   CNN(10 runs) + LSTM(10 runs) (Taghipour and Ng, 2016)  0.821  0.688  0.694  0.805  0.807  0.819  0.808  0.644  0.761
4   Hierarchical LSTM-CNN-Attention (Dong et al., 2017)    0.822  0.682  0.672  0.814* 0.803  0.811  0.801  0.705  0.764
5   SKIPFLOW LSTM(Bilinear) (Tay et al., 2018)             0.830  0.678  0.677  0.778  0.795  0.807  0.790  0.670  0.753
6   SKIPFLOW LSTM(Tensor) (Tay et al., 2018)               0.832  0.684  0.695  0.788  0.815  0.810  0.800  0.697  0.764
7   Dilated LSTM With RL (Wang et al., 2018)               0.776  0.659  0.688  0.778  0.805  0.791  0.760  0.545  0.724
8   HA-LSTM+SST+DAT (Cao et al., 2020)                     0.836  0.730  0.732  0.822  0.835  0.832* 0.821  0.718  0.790
9   BERT+SST+DAT (Cao et al., 2020)                        0.824  0.699  0.726  0.859  0.822  0.828  0.840  0.726  0.791*
10  R²BERT (Yang et al., 2020)                             0.817  0.719  0.698  0.845  0.841  0.847  0.839  0.744  0.794*
11  BERT-DOC-TOK-SEG                                       0.836  0.695  0.700  0.815  0.812  0.816  0.838  0.744  0.782
12  Tran-BERT-MS-ML-R                                      0.834  0.716  0.714  0.812  0.813  0.836  0.839  0.766  0.791*

Table 2: Experiment results of all models in terms of QWK on ASAP. The names of our implemented models are in bold. The bold number is the best performance for each prompt. The best 3 average QWK values are annotated with *.

4.4 Results

Table 2 shows the performance of the baseline models and our proposed models with joint learning of multi-scale essay representation. Table 3 shows the results of our model and the state-of-the-art models on the essays of prompts 1, 2 and 8, whose WordPiece lengths are longer than 510. We summarize some findings from the experiment results.

• Our model 12 almost obtains the published state-of-the-art for neural approaches. For prompts 1, 2 and 8, whose WordPiece lengths are longer than 510, we improve the result from 0.761 to 0.772. As Longformer is good at encoding long text, we also use it to encode the essays of prompts 1, 2 and 8 directly, but the performance is poor compared to the methods in Table 3. These results demonstrate the effectiveness of the proposed framework for encoding and scoring essays. We further re-implement R²BERT proposed by Yang et al. (2020), and our implementation of R²BERT does not perform as well as the published result. Though Uto et al. (2020) obtain a much better result (QWK 0.801), our method performs much better than their system with only neural features (QWK 0.730), which demonstrates the strong essay encoding ability of our neural approach.

• Compared to models 4 and 6, our model 11 uses multi-scale features to encode essays instead of LSTM based models, and we use the same regression loss to optimize the model. Our model simply changes the representation and significantly improves the result from 0.764 to 0.782, which demonstrates the strong encoding ability provided by the multi-scale representation for long text. Before that, the conventional way of using BERT could not surpass the performance of models 4 and 6.

ID  Models              P1     P2     P8     Average
8   HA-LSTM+SST+DAT     0.836  0.730  0.718  0.761
9   BERT+SST+DAT        0.824  0.699  0.726  0.750
10  R²BERT              0.817  0.719  0.744  0.760
12  Tran-BERT-MS-ML-R   0.834  0.716  0.766  0.772

Table 3: Experiment results of our model and the state-of-the-art models on ASAP long essays (WordPiece lengths longer than 510). The name of our implemented model is in bold.

4.5 Further Analysis

Multi-scale Representation We further analyze the effectiveness of employing each scale of essay representation in the joint learning process.

Models              Average QWK
BERT-DOC            0.760
BERT-TOK            0.764
BERT-DOC-TOK        0.768
BERT-DOC-TOK-SEG    0.782

Table 4: Performance of different feature scale models on the ASAP data set.

Models              RMSE
BERT-DOC            0.742
BERT-TOK            0.760
BERT-DOC-TOK        0.691
BERT-DOC-TOK-SEG    0.607

Table 5: Performance of different feature scale models on the CRP data set. The evaluation metric is RMSE. Lower numbers are better.

Table 4 and Table 5 show the performance of our models representing essays at different feature scales, trained with the MSE loss and without transfer learning. Table 4 shows the performance on the ASAP data set while Table 5 shows the performance on the CRP data set.
The improvement of BERT-DOC-TOK-SEG over BERT-DOC, BERT-TOK and BERT-DOC-TOK is significant (p<0.0001) on the CRP data set, and significant (p<0.0001) in most cases on the ASAP data set. The results in both tables indicate similar findings.

• Combining the features from the document-scale and token-scale, BERT-DOC-TOK outperforms the models BERT-DOC and BERT-TOK, which use only one feature scale. This demonstrates that our proposed framework can benefit from multi-scale essay representation even with only two scales.

• By additionally incorporating multiple segment-scale features, BERT-DOC-TOK-SEG performs much better than BERT-DOC-TOK. This demonstrates the effectiveness and generalization ability of our multi-scale essay representation on multiple tasks.

Models                    Average QWK
Longformer-DOC            0.746
Longformer-DOC-TOK-SEG    0.771

Table 6: Performance of multi-scale Longformer models on the ASAP data set.

Reasons for the Effectiveness of Multi-scale Representation Though the experiments show the effectiveness of the multi-scale representation, we further explore the reason. One could suspect that the effectiveness comes from supporting long sequences, not from the multi-scale representation itself. As Longformer is good at dealing with long texts, we compare the results of Longformer-DOC and Longformer-DOC-TOK-SEG. The significance tests show that the improvement of Longformer-DOC-TOK-SEG over Longformer-DOC is significant (p<0.0001) in most cases. The performance of the two models is shown in Table 6, and we draw the following findings.

• Though Longformer-DOC supports long sequence encoding, it performs poorly, which indicates that the ability to support long sequences is not enough for a good essay scoring system.

• Longformer-DOC-TOK-SEG outperforms Longformer-DOC significantly, which indicates that the effectiveness of our model comes from encoding essays with multi-scale features, not only from the ability to deal with long texts.

These results are consistent with our intuition that our approach takes into account different levels of essay features and predicts the scores more accurately. We attribute this to the fact that multi-scale features are not effectively constructed in the representation layer of the pre-trained model, due to the lack of data for fine-tuning in the AES task. Therefore, we need to explicitly model the multi-scale information of the essay data and combine it with the powerful linguistic knowledge of the pre-trained model.

Models                Average
BERT-DOC-TOK-SEG      0.782
Tran-BERT-MS          0.788
Tran-BERT-MS-ML       0.790
Tran-BERT-MS-ML-R     0.791

Table 7: Experiment results for transfer learning with multiple loss functions and R-Drop.

Transfer Learning with Multiple Losses and R-Drop We further explore the effectiveness of pre-training combined with adding multiple loss functions and employing R-Drop. As shown in Table 7, by incorporating the pre-training stage, which learns knowledge from out-of-domain data, the Tran-BERT-MS model improves the result from 0.782 to 0.788 compared to the BERT-DOC-TOK-SEG model. The model Tran-BERT-MS-ML, which jointly learns with multiple loss functions, further improves the performance from 0.788 to 0.790. We attribute this to the fact that MR brings ranking information and SIM takes into account the overall score distribution information. Diverse losses bring different but positive influences on the optimization direction and act as an ensembler. By employing R-Drop, Tran-BERT-MS-ML-R improves the QWK slightly, which comes from the fact that R-Drop plays a regularization role.

5 Conclusion and Future Work

In this paper, we propose a novel multi-scale essay representation approach based on a pre-trained language model, and employ multiple losses and transfer learning for the AES task. We almost obtain the state-of-the-art result among deep learning models. In addition, we show that the multi-scale representation has a significant advantage when dealing with long texts.
One of the future directions could be exploring soft multi-scale representation. Introducing linguistic knowledge to segment essays at more reasonable scales may bring further improvement.

References

Dimitrios Alikaniotis, Helen Yannakoudakis, and Marek Rei. 2016. Automatic text scoring using neural networks. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 715–725.

Yigal Attali and Jill Burstein. 2006. Automated essay scoring with e-rater v.2. The Journal of Technology, Learning, and Assessment, 4(3).

Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. In arXiv: Computation and Language.

Yue Cao, Hanqi Jin, Xiaojun Wan, and Zhiwei Yu. 2020. Domain-adaptive neural automated essay scoring. In SIGIR '20: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1011–1020.

Hongbo Chen and Ben He. 2013. Automated essay scoring by maximizing human-machine agreement. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1741–1752.

J. Cohen. 1968. Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit. Psychological Bulletin, 70(4):213–220.

Mădălina Cozma, Andrei M. Butnaru, and Radu Tudor Ionescu. 2018. Automated essay scoring with string kernels and word embeddings. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics.

Tirthankar Dasgupta, Abir Naskar, Lipika Dey, and Rupsa Saha. 2018. Augmenting textual qualitative features in deep convolution recurrent neural network for automatic essay scoring. In Proceedings of the 5th Workshop on Natural Language Processing Techniques for Educational Applications, pages 93–102.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4171–4186.

Fei Dong and Yue Zhang. 2016. Automatic features for essay scoring – an empirical study. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1072–1077.

Fei Dong, Yue Zhang, and Jie Yang. 2017. Attention-based recurrent convolutional neural network for automatic essay scoring. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 153–162.

Youmna Farag, Helen Yannakoudakis, and Ted Briscoe. 2018. Neural automated essay scoring and coherence modeling for adversarially crafted input. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 263–271.

Cancan Jin, Ben He, Kai Hui, and Le Sun. 2018. TDNN: A two-stage deep neural network for prompt-independent automated essay scoring. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pages 1088–1097.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations.

Leah S. Larkey. 1998. Automatic essay grading using text categorization techniques. In SIGIR '98: Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 90–95.

Xiaobo Liang, Lijun Wu, Juntao Li, Yue Wang, Qi Meng, Tao Qin, Wei Chen, Min Zhang, and Tie-Yan Liu. 2021. R-Drop: Regularized dropout for neural networks. In Advances in Neural Information Processing Systems, pages 10890–10905.

Sandeep Mathias, Rudra Murthy, Diptesh Kanojia, Abhijit Mishra, and Pushpak Bhattacharyya. 2020. Happy are those who grade without seeing: A multi-task learning approach to grade essays using gaze behaviour. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, pages 858–872.

Elijah Mayfield and Alan W. Black. 2020. Should you fine-tune BERT for automated essay scoring? In Proceedings of the 15th Workshop on Innovative Use of NLP for Building Educational Applications, pages 151–162.

Panitan Muangkammuen and Fumiyo Fukumoto. 2020. Multi-task learning for automated essay scoring with sentiment analysis. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing: Student Research Workshop, pages 116–123.

Andriy Mulyar, Elliot Schumacher, Masoud Rouhizadeh, and Mark Dredze. 2019. Phenotyping of clinical notes with improved document classification models using contextualized neural language models. In 33rd Conference on Neural Information Processing Systems (NeurIPS).

Peter Phandi, Kian Ming A. Chai, and Hwee Tou Ng. 2015. Flexible domain adaptation for automated essay scoring using correlated linear regression. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 431–439.

Robert Ridley, Liang He, Xinyu Dai, Shujian Huang, and Jiajun Chen. 2021. Automated cross-prompt scoring of essay traits. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 13745–13753.

Pedro Uria Rodriguez, Amir Jafari, and Christopher M. Ormerod. 2019. Language models and automated essay scoring. In arXiv: Computation and Language.

Lawrence M. Rudner and Tahung Liang. 2002. Automated essay scoring using Bayes' theorem. The Journal of Technology, Learning, and Assessment, 1(2):3–21.

Wei Song, Kai Zhang, Ruiji Fu, Lizhen Liu, Ting Liu, and Miaomiao Cheng. 2020. Multi-stage pre-training for automated Chinese essay scoring. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6723–6733.

Kaveh Taghipour and Hwee Tou Ng. 2016. A neural approach to automated essay scoring. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1882–1891.

Yi Tay, Minh C. Phan, Luu Anh Tuan, and Siu Cheung Hui. 2018. SKIPFLOW: Incorporating neural coherence features for end-to-end automatic text scoring. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, pages 5948–5955.

Masaki Uto, Yikuan Xie, and Maomi Ueno. 2020. Neural automated essay scoring incorporating handcrafted features. In Proceedings of the 28th International Conference on Computational Linguistics, pages 6077–6088.

Yucheng Wang, Zhongyu Wei, Yaqian Zhou, and Xuanjing Huang. 2018. Automatic essay scoring incorporating rating schema via reinforcement learning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 791–797.

Ruosong Yang, Jiannong Cao, Zhiyuan Wen, Youzheng Wu, and Xiaodong He. 2020. Enhancing automated essay scoring performance via fine-tuning pre-trained language models with combination of regression and ranking. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1560–1569.

Helen Yannakoudakis, Ted Briscoe, and Ben Medlock. 2011. A new dataset and method for automatically grading ESOL texts. In HLT '11: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 180–189.

A Appendix

All the segment-scales we explore range from 10 to 190, with an interval of 20 between two neighboring scales. As the number of possible segment-scale combinations is exponential, we use a greedy search method to find the best combination.

1. Initialize the segment-scale value set R with the document-scale and token-scale.

2. Experiment with the combination of each single segment-scale with the token-scale and document-scale essay representation, and compute the average QWK on the develop set over all segment-scales, denoted QWKave. Each scale with a QWK higher than QWKave is added to the candidate scale list L, and the scales in L are sorted by their QWK values from large to small.

3. For each i from 1 to |L|, we perform experiments on the combination of the first i segment-scales in L with the token-scale and document-scale. The combination of segment-scales with the best performance on the develop set is added to the segment-scale value set R.
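The greedy search can be sketched as follows (our illustration; evaluate_qwk is a hypothetical helper that trains a model with the document-scale, the token-scale and the given segment-scales, and returns the develop-set QWK):

def greedy_scale_search(scales, evaluate_qwk):
    """Greedy search over segment-scale combinations (Appendix A sketch)."""
    # Step 2: score each single segment-scale on top of the document- and
    # token-scale features, and keep those above the average QWK
    single = {k: evaluate_qwk([k]) for k in scales}
    qwk_ave = sum(single.values()) / len(single)
    candidates = sorted((k for k in scales if single[k] > qwk_ave),
                        key=lambda k: single[k], reverse=True)
    # Step 3: grow the combination one candidate at a time, keep the best prefix
    best_combo, best_qwk = [], float("-inf")
    for i in range(1, len(candidates) + 1):
        qwk = evaluate_qwk(candidates[:i])
        if qwk > best_qwk:
            best_combo, best_qwk = candidates[:i], qwk
    return best_combo

scales = list(range(10, 191, 20))  # the scales explored: 10, 30, ..., 190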
