On The Use of BERT
scale. Ŵseg and bseg are the weight matrix and bias for the segment scale respectively. Wdoc,tok and bdoc,tok are the weight matrix and bias for the document and token scales. ok is the segment-scale essay representation with scale k, wdoc is the document-scale essay representation, and W is the token-scale essay representation. Hdoc,tok is the concatenation of the document-scale and token-scale essay representations.

3.4 Loss Function

We use three loss functions to train the model. MSE measures the average of the squared errors between the predicted scores and the labels, and is defined as below:

MSE(y, ŷ) = (1/N) Σi (yi − ŷi)²

where yi and ŷi are the predicted score and the label for the ith essay respectively, and N is the number of essays.

Similarity (SIM) measures whether two vectors are similar or dissimilar by using the cosine function.

In the margin ranking (MR) loss, yi and ŷi are the predicted score and label for the ith essay respectively, N̂ is the number of essay pairs, and b is a hyperparameter, which is set to 0 in our experiments. For each sample pair (i, j), when the label ŷi is larger than ŷj, the predicted score yi should be larger than yj; otherwise, the pair contributes yj − yi to the loss. When ŷi is equal to ŷj, the loss is |yi − yj|.

The combined loss is described as below:

Losstotal(y, ŷ) = α·MSE(y, ŷ) + β·MR(y, ŷ) + γ·SIM(y, ŷ)

where α, β and γ are weight parameters, tuned according to the performance on the development set.

4 Experiment

4.1 Data and Evaluation

The ASAP data set is widely used in the AES task and contains eight different prompts. A detailed description can be seen in Table 1. For each prompt, the WordPiece length indicates the smallest number that is larger than the lengths of 90% of the essays in that prompt.
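The three losses above can be sketched in plain Python as below. This is a minimal sketch, not the authors' implementation: the MR term follows exactly the pairwise behavior described in the text, while the SIM term is assumed to be one minus the cosine similarity between the predicted-score vector and the label vector (the text only says SIM uses the cosine function), and the default weights `alpha`, `beta`, `gamma` of 1 are for illustration only.

```python
import math

def mse_loss(y, y_hat):
    # Mean squared error between predicted scores y and labels y_hat.
    return sum((yi - ti) ** 2 for yi, ti in zip(y, y_hat)) / len(y)

def mr_loss(y, y_hat, b=0.0):
    # Margin ranking loss over all essay pairs (i, j), as described in the
    # text: a correctly ordered pair costs nothing (with b = 0), a
    # mis-ordered pair contributes its score gap, and an equally labeled
    # pair contributes |y_i - y_j|.
    pair_losses = []
    n = len(y)
    for i in range(n):
        for j in range(i + 1, n):
            if y_hat[i] == y_hat[j]:
                pair_losses.append(abs(y[i] - y[j]))
            else:
                sign = 1.0 if y_hat[i] > y_hat[j] else -1.0
                pair_losses.append(max(0.0, b - sign * (y[i] - y[j])))
    return sum(pair_losses) / len(pair_losses)

def sim_loss(y, y_hat):
    # Assumed instantiation: 1 - cosine similarity of the score and label vectors.
    dot = sum(yi * ti for yi, ti in zip(y, y_hat))
    norm = math.sqrt(sum(yi ** 2 for yi in y)) * math.sqrt(sum(ti ** 2 for ti in y_hat))
    return 1.0 - dot / norm

def total_loss(y, y_hat, alpha=1.0, beta=1.0, gamma=1.0):
    # Loss_total(y, yhat) = alpha * MSE + beta * MR + gamma * SIM.
    return (alpha * mse_loss(y, y_hat)
            + beta * mr_loss(y, y_hat)
            + gamma * sim_loss(y, y_hat))
```

In practice the weights would be tuned on the development set, as the text notes.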
Figure 1: The proposed automated essay scoring architecture based on multi-scale essay representation. The left
part illustrates the document-scale and token-scale essay representation and scoring module, and the right part
illustrates S segment-scale essay representations and scoring modules.
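As a rough illustration of the scoring modules in Figure 1, the sketch below scores an essay from its multi-scale representations. It is a simplified sketch under assumptions not stated in this excerpt: each scale is scored by a single sigmoid-activated linear layer, and the per-scale scores are simply summed into a final score; the parameter names (`W_doc_tok`, `seg_layers`) are illustrative, not from the paper.

```python
import math

def dense(x, w, b):
    # One linear unit followed by a sigmoid: produces a score in (0, 1).
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def score_essay(w_doc, w_tok, segment_reps, params):
    # H_doc,tok: concatenation of the document-scale and token-scale
    # representations, scored with one linear layer (W_doc,tok, b_doc,tok).
    h_doc_tok = list(w_doc) + list(w_tok)
    score = dense(h_doc_tok, params["W_doc_tok"], params["b_doc_tok"])
    # Each segment-scale representation o_k has its own scoring layer;
    # the per-scale scores are summed (assumed combination).
    for o_k, (w, b) in zip(segment_reps, params["seg_layers"]):
        score += dense(o_k, w, b)
    return score
```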
Table 2: Experiment results of all models in terms of QWK on ASAP. The names of our implemented models are in bold. The bold number is the best performance for each prompt. The best 3 average QWKs are annotated with ∗.

Table 3: Experiment results of our model and the state-of-the-art models on ASAP long essays (WordPiece length longer than 510). The name of our implemented model is in bold.

ID  Models              P1     P2     P8     Average
8   HA-LSTM+SST+DAT     0.836  0.730  0.718  0.761
9   HA-BERT+SST+DAT     0.824  0.699  0.726  0.750
10  R²BERT              0.817  0.719  0.744  0.760
12  Tran-BERT-MS-ML-R   0.834  0.716  0.766  0.772

4.4 Results

Table 2 shows the performance of the baseline models and our proposed models with joint learning of multi-scale essay representation. Table 3 shows the results of our model and the state-of-the-art models on the essays in prompts 1, 2 and 8, whose WordPiece lengths are longer than 510. We summarize some findings from the experiment results.

• Our model 12 almost matches the published state of the art for neural approaches. For prompts 1, 2 and 8, whose WordPiece lengths are longer than 510, we improve the result from 0.761 to 0.772. As Longformer is good at encoding long text, we also use it to encode the essays of prompts 1, 2 and 8 directly, but its performance is poor compared to the methods in Table 3. The results demonstrate the effectiveness of the proposed framework for encoding and scoring essays. We further re-implement BERT2 proposed by Yang et al. (2020), and our implementation of BERT2 does not perform as well as the published result. Though Uto et al. (2020) obtain a much better result (QWK 0.801), our method performs much better than their system with only neural features (QWK 0.730), which demonstrates the strong essay encoding ability of our neural approach.

• Compared to models 4 and 6, our model 11 uses multi-scale features instead of LSTM-based models to encode essays, and we use the same regression loss to optimize the model. Our model simply changes the way essays are represented and significantly improves the result from 0.764 to 0.782, which demonstrates the strong encoding ability that multi-scale representation provides for long text. Before that, the conventional way of using BERT could not surpass the performance of models 4 and 6.

4.5 Further analysis

Multi-scale Representation We further analyze the effectiveness of employing each scale of essay representation in the joint learning process.

Models             Average QWK
BERT-DOC           0.760
BERT-TOK           0.764
BERT-DOC-TOK       0.768
BERT-DOC-TOK-SEG   0.782

Table 4: Performance of different feature-scale models on the ASAP data set.

Models             RMSE
BERT-DOC           0.742
BERT-TOK           0.760
BERT-DOC-TOK       0.691
BERT-DOC-TOK-SEG   0.607

Table 5: Performance of different feature-scale models on the CRP data set. The evaluation metric is RMSE. Lower numbers are better.

Table 4 and Table 5 show the performance of our models when essays are represented at different feature scales, trained with MSE loss and without transfer learning. Table 4 shows the performance on the ASAP data set, while Table 5 shows the performance on the CRP data set. The improvements of BERT-DOC-TOK-SEG over BERT-DOC, BERT-TOK and BERT-DOC-TOK are significant (p < 0.0001) on the CRP data set, and are significant (p < 0.0001) in most cases on the ASAP data set. The results in both tables indicate similar findings.

• Combining the features from the document scale and the token scale, BERT-DOC-TOK outperforms BERT-DOC and BERT-TOK, which use only one feature scale. This demonstrates that our proposed framework can benefit from multi-scale essay representation even with only two scales.

• By additionally incorporating multiple segment-scale features, BERT-DOC-TOK-SEG performs much better than BERT-DOC-TOK. This demonstrates the effectiveness and generalization ability of our multi-scale features, which does not only come from the ability to deal with long texts.

These results are consistent with our intuition that our approach takes different levels of essay features into account and predicts the scores more accurately. We attribute this to the fact that multi-scale features are not effectively constructed in the representation layer of the pre-trained model, due to the lack of fine-tuning data in the AES task. Therefore, we need to explicitly model the multi-scale information of the essay data and combine it with the powerful linguistic knowledge of the pre-trained model.

Models              Average
BERT-DOC-TOK-SEG    0.782
Tran-BERT-MS        0.788
Tran-BERT-MS-ML     0.790
Tran-BERT-MS-ML-R   0.791
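The results above are reported in QWK (quadratic weighted kappa), the standard agreement metric for AES on ASAP. As a reference point, a minimal implementation for integer scores, assuming the usual quadratic weight matrix, might look like:

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_score, max_score):
    # Quadratic weighted kappa between two lists of integer scores.
    n = max_score - min_score + 1
    # Observed confusion matrix.
    conf = [[0] * n for _ in range(n)]
    for a, b in zip(rater_a, rater_b):
        conf[a - min_score][b - min_score] += 1
    num_items = len(rater_a)
    # Marginal histograms give the expected (chance) matrix.
    hist_a = [sum(row) for row in conf]
    hist_b = [sum(conf[i][j] for i in range(n)) for j in range(n)]
    numerator = 0.0
    denominator = 0.0
    for i in range(n):
        for j in range(n):
            w = ((i - j) ** 2) / ((n - 1) ** 2)  # quadratic disagreement weight
            expected = hist_a[i] * hist_b[j] / num_items
            numerator += w * conf[i][j]
            denominator += w * expected
    return 1.0 - numerator / denominator
```

Perfect agreement yields 1.0, chance-level agreement yields 0, and systematic disagreement is negative.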
A Appendix
All the segment-scales we explore range from 10 to 190, and the interval between two neighboring scales is 20. As the number of possible segment-scale combinations is exponential, we use a greedy search to find the best combination.

1. Initialize the segment-scale value set R with the document scale and the token scale.

2. Experiment with the combination of each single segment-scale with the token-scale and document-scale essay representations, and compute the average QWK on the development set over all segment-scales, denoted QWKave. Each scale whose QWK is higher than QWKave is added to the candidate scale list L, and the scales in L are sorted by their QWK values from large to small.

3. For each i from 1 to |L|, we perform experiments on the combination of the first i segment-scales in L with the token scale and document scale. The combination of segment-scales with the best performance on the development set is added to the segment-scale value set R.
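The greedy procedure above can be sketched as follows. Here `evaluate` is a hypothetical callback that trains a model on the given segment-scale combination (always together with the token and document scales) and returns its average QWK on the development set; it stands in for the actual training runs and is not part of the paper.

```python
def greedy_scale_search(evaluate, candidate_scales):
    # Step 2: score each single segment-scale combined with the token and
    # document scales, and shortlist the scales that beat the average QWK.
    single = {k: evaluate([k]) for k in candidate_scales}
    qwk_ave = sum(single.values()) / len(single)
    shortlist = sorted(
        (k for k, q in single.items() if q > qwk_ave),
        key=lambda k: single[k],
        reverse=True,  # sorted by QWK from large to small
    )
    # Step 3: evaluate each prefix of the shortlist and keep the best one.
    best_scales, best_qwk = [], float("-inf")
    for i in range(1, len(shortlist) + 1):
        scales = shortlist[:i]
        q = evaluate(scales)
        if q > best_qwk:
            best_scales, best_qwk = scales, q
    return best_scales, best_qwk
```

With scales 10, 30, ..., 190 this needs one run per candidate plus at most one run per shortlist prefix, instead of an exponential number of combinations.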