A Trait-based Deep Learning Automated Essay Scoring System with Adaptive Feedback
Abstract—Numerous Automated Essay Scoring (AES) systems have been developed over the past years. Recent advances in deep learning have shown that applying neural network approaches to AES systems accomplishes state-of-the-art solutions. Most neural-based AES systems assign an overall score to a given essay, even if they depend on analytical rubrics/traits. Trait evaluation/scoring helps to identify learners' levels of performance. Moreover, providing feedback to learners about their writing performance is as important as assessing their level. Producing adaptive feedback for learners requires identifying their strengths/weaknesses and the magnitude of influence of each trait. In this paper, we develop a framework that strengthens the validity and enhances the accuracy of a baseline neural-based AES model with respect to trait evaluation/scoring. We extend the model with a method based on essay trait prediction to give trait-specific adaptive feedback. We explored multiple deep learning models for the automatic essay scoring task, and we performed several analyses to get indicators from these models. The results show that the Long Short-Term Memory (LSTM) based system outperformed the baseline study by 4.6% in terms of Quadratic Weighted Kappa (QWK). Moreover, predicting the trait scores enhances the prediction of the overall score. Our extended model is used in iAssistant, an educational module that provides trait-specific adaptive feedback to learners.

Keywords—AES system; trait evaluation; adaptive feedback; deep learning; neural networks; ASAP

I. INTRODUCTION

"Nothing we do to, or for our students is more important than our assessment of their work and the feedback we give them on it [1]." It is widely acknowledged that feedback is a critical element of learning [2]. Both scores and feedback are fundamental aspects of the learning process. Accurate scoring of learners' answers creates a fair way to assess learners' work, which is very important. Equally, giving feedback to learners about their answers helps them identify their weaknesses and improve their performance.

Rubrics are widely used in evaluating learners' answers to essay questions. Brookhart (2013) defines a rubric as "a coherent set of criteria for learners' work that includes descriptions of levels of performance quality on the criteria [3]." The definition identifies two significant aspects of a good rubric: coherent sets of criteria and descriptions of levels of performance for these criteria. There are two types of rubrics: analytic and holistic. An analytic rubric evaluates each criterion separately, while a holistic rubric evaluates all criteria simultaneously. Each type has its advantages and disadvantages. Analytic rubrics give formative feedback to learners and are easier to link to instruction; nevertheless, they take more time than holistic rubrics to score and to achieve acceptable inter-rater reliability. Holistic rubrics are faster and suitable for summative assessment (assessment of learning). On the other hand, a single overall score does not communicate information about what to do to improve learning and is not useful for formative assessment (assessment for learning) [4]. It is also worth noting that research showed that learners prefer AES feedback over peer feedback [5].

Over the past years, various AES systems have been developed to evaluate learners' responses to a given prompt (essay). AES systems automatically assess the quality of the written text and assign a score to it. The efficiency of these systems depends on the agreement between the human-rater scores and the AES scores [6]. Research in deep learning has led to the development of neural network models for the automatic essay scoring task, moving away from feature engineering, and has found that utilizing neural networks for this task achieves state-of-the-art outcomes [7]. Utilizing automatically learned features has added significant benefits to the efficiency of such systems as well [8], [9].

The vast majority of existing neural-based AES systems were developed for holistic scoring of given essays, even if the essays depend on analytical rubrics/traits [10]. Trait evaluation/scoring helps to identify learners' levels of performance. Moreover, providing feedback to learners about their writing requires identifying their strengths/weaknesses and the magnitude of influence of each trait. Based on that, our goal is to develop a framework that strengthens the validity and enhances the accuracy of neural-based AES approaches with respect to trait evaluation/scoring. Using this framework should also help in providing effective adaptive feedback to learners.

The rest of the paper is organized as follows: Section 2 gives a brief overview of related work. Section 3 describes the materials and methods, including the AES models (baseline and augmented), the dataset, training, and testing, in addition to the evaluation metric. Results are reported and discussed in Section 4. Finally, our conclusion and future improvements are in Section 5.
II. RELATED WORK

PEG, developed by Ellis Page in 1966, is the earliest AES system. PEG was the starting spark for decades of research into AES. Since then, many AES systems have been developed that analyze the quality of a text and assign a score to it. AES systems use various manually tuned shallow and deep linguistic features [5].

AES systems can be classified into two main types: i) the handcrafted discrete features-based type, which is bound to specific domains and usually uses natural language processing, latent semantic analysis, Bayesian networks, etc.; and ii) the automatic feature extraction-based type, which usually uses neural networks [5].

Several AES systems include automated scoring alongside feedback; examples of the first type are Criterion, MY Access, and Writing Pal. Criterion provides an overall score and learner feedback using E-rater and Critique as its AES components: the E-rater module performs the automatic scoring of a given essay, while Critique consists of a set of modules that detect mistakes/errors in mechanics, grammar, and usage, and then identify issues of discourse and style in the writing. MY Access offers an instant score and diagnostic feedback based on the IntelliMetric AES system to stimulate learners to improve their writing ability [8]. Writing Pal is classified as an intelligent tutoring system that is mainly concerned with learning tasks and provides the service of evaluating writing tasks with feedback [11]. It targets learners' writing strategies while providing automated feedback. However, it is classified as a handcrafted discrete features-based system: its automatic essay scoring model is separate from the feedback part, and it uses specific algorithms for each feedback category.

In contrast, only a few systems of the second type consider scoring the traits and providing appropriate feedback for each essay. Woods et al. [12] established a new ordinal essay scoring model, extended to use essay trait prediction to give formative trait-specific feedback to learners. Nevertheless, one concern with their system is that its Ordinal Logistic Regression (OLR) model does not perform accurately on essays with large scoring ranges (like prompts 1 and 7 in the ASAP dataset).

III. MATERIALS AND METHODS

A. Baseline Model

Taghipour and Ng [6] developed an AES system (AEST&N) based on neural networks, which automatically predicts the overall score of a given essay [10]. AEST&N takes the sequence of words in an essay as input; the model first uses a convolution layer to extract n-gram level features. These features, which capture the local textual dependencies among the words in an n-gram, are then passed to a recurrent layer composed of an LSTM network. The system was trained on Kaggle's ASAP dataset and gave state-of-the-art results on it. The evaluation metric used to assess the efficiency of the system is Quadratic Weighted Kappa (QWK) [6], [8]. They used 5-fold cross-validation, and for each fold, they split the dataset into 60%, 20%, and 20% for the training, development, and testing sets, respectively. The AEST&N model architecture is illustrated in Fig. 1.

Fig. 1. AEST&N Model Architecture of Taghipour and Ng [6], where the Output Layer Predicts Only the Overall Score.

The AEST&N results show that all model variations (Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), Gated Recurrent Units (GRU), and LSTM) succeed in learning the task properly, with performance comparable to or better than the baseline (an AES system called 'Enhanced AI Scoring Engine' (EASE)¹). The authors reported that the LSTM-based AEST&N system significantly outperformed the other neural network systems (RNN, GRU, and CNN) and outperformed the baseline by 4.1%.

¹ EASE is an open-source handcrafted features-based AES system. It depends on Bayesian linear ridge regression and support vector regression techniques. It placed third in the ASAP competition (among 154 systems).

The AEST&N system significantly outperformed the other AES systems, yet there is always room for improvement in scoring accuracy. The AEST&N system predicted only the overall scores, although some of the essays have analytical rubrics/traits. Moreover, it did not provide any feedback to learners.

B. Proposed Model

Our model (AESAUG) is inspired by the baseline model AEST&N of Taghipour and Ng [6]. We extend the AEST&N model to predict not only the overall score for essays but also the trait scores. Besides, we aim to utilize the trait scores to provide adaptive feedback to learners. Fig. 2 presents the AESAUG model architecture, which is described below.

Fig. 2. AESAUG Model Architecture, where the Output Layer Predicts both the Overall Score and the Traits' Scores.
1) The Lookup Table Layer: the first layer/step of the model transforms each word into a $d$-dimensional space. Given a sentence $W = (w_1, w_2, \ldots, w_n)$, the output of the lookup table operation $LT(W)$ is represented in Equation 1:

$LT(W) = (E\,w_1, E\,w_2, \ldots, E\,w_n)$  (1)

where $E$ is the word embedding matrix and each $w_i$ is the one-hot vector of the $i$-th word.

[...] where $W$ is the weight matrix (with mini-batch size 32), $b$ is the bias vector, $a$ is the activations of the previous layer, $x$ is the input of the layer (from the MoT layer), and $y$ is the dense layer output.
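To make Equation 1 concrete, here is a toy illustration (our own sketch, not the authors' code); the vocabulary size, embedding dimension, and word indices are made-up values:

import numpy as np

# Toy values; a real system would use the training vocabulary and
# pre-trained word embeddings instead of random ones.
vocab_size, d = 5, 3
E = np.random.rand(d, vocab_size)          # embedding matrix (d x |V|)
sentence = [0, 3, 1]                       # word indices w1..wn of a 3-word sentence
one_hot = np.eye(vocab_size)[sentence].T   # each column is a one-hot word vector
LT = E @ one_hot                           # Equation 1: LT(W) = (E w1, ..., E wn)
print(LT.shape)                            # (3, 3): one d-dimensional vector per word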
² http://www.nltk.org
³ http://ai.stanford.edu/~wzou/mt
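Since only fragments of the layer-by-layer description survive above, the overall shape of such a network is easier to see in code. The following is a minimal Keras sketch under our own assumptions (the vocabulary size, layer widths, and the single 5-way sigmoid head for the overall score plus four traits are illustrative choices, not the paper's exact configuration):

from tensorflow.keras import layers, models

VOCAB_SIZE = 4000   # assumption: vocabulary cut-off for ASAP essays
EMBED_DIM = 50      # assumption: embedding dimensionality
MAX_LEN = 500       # assumption: maximum essay length in tokens

tokens = layers.Input(shape=(MAX_LEN,), dtype="int32")
x = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(tokens)   # lookup table layer
x = layers.Conv1D(100, 3, padding="same",
                  activation="relu")(x)               # n-gram level features
x = layers.LSTM(300, return_sequences=True)(x)        # recurrent layer
x = layers.GlobalAveragePooling1D()(x)                # mean-over-time (MoT) layer
scores = layers.Dense(5, activation="sigmoid")(x)     # overall + 4 trait scores in [0, 1]

model = models.Model(tokens, scores)
model.compile(optimizer="rmsprop", loss="mse")        # regression on normalized scores

Training the five outputs jointly against normalized reference scores is what allows the trait predictions to inform, and as the results below show, improve, the overall-score prediction.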
IV. RESULTS AND DISCUSSION

We describe in this part our experiments and results. For the overall scores, we report the results and then compare our system to the baseline system (AEST&N). For the trait scores, we present only the results of our AESAUG system and its QWK evaluation, as the AEST&N system did not predict trait scores.

We started our experiments by replicating the AEST&N⁴ model results over the ASAP dataset. Taghipour and Ng [6] (using AEST&N) experimented with and explored a variety of neural network architectures, such as CNN, basic RNN, GRU, and LSTM, without using an MoT layer. After replicating the AEST&N systems (CNN, RNN, GRU, and LSTM) and reproducing the same QWK results, we extended the model to the AESAUG architecture. We trained the model with the training data (described in Section 2.3), including the overall score and the four trait reference-scores (given by two human raters, as described in Section 2.3). We started by simulating the human approach to scoring traits, in which every rater gives a score and the trait score is the summation of the two raters' scores; hence, the AESAUG systems predicted two scores for every trait, and we summed them. We got the same QWK (0.805) for the overall score (on Fold 4), QWKs of [0.715, 0.623, 0.581, 0.443] for the first predicted trait scores, and [0.723, 0.656, 0.568, 0.476] for the second predicted trait scores, with an average of [0.598]⁵.

We found that the predicted trait scores have low QWK values, so we analyzed the case by calculating the QWK among the first human rater (H-R1), the second human rater (H-R2), and each of the AESAUG predicted scores (A-R3 & A-R4). Table II shows the QWK for the trait scores of the human raters and the AESAUG system (using the best model, which is LSTM). We noticed that the agreement (QWK) between the human raters (0.64) is lower than the agreement (QWK) between any AESAUG prediction and either of the human raters (0.66, 0.67, 0.68, and 0.68); all QWKs are shown in Table II. In our attempt to understand the logic behind this low agreement, we examined the prompt content and rubrics with the help of two English language specialists. They confirmed that the definitions of the level descriptors in the rubrics are not clear and definite, which may lead to different interpretations between raters and, accordingly, to low inter-rater agreement. They also added that using the summation of the two raters' scores on each trait (as described in the ASAP scoring guide) provides a more accurate and objective indicator of a learner's performance.

In order to enhance the trait QWK scores of the AESAUG systems, we changed our score calculation approach: before training the system, we calculated one score for each trait by summing the two human scores. Then, we calculated the QWK score for each trait between one reference score and one AESAUG predicted score. As a result of that change, we got higher QWKs for the trait scores [0.820, 0.767, 0.767, 0.733], respectively, with an average QWK of [0.771]. We also noticed that trait score prediction within the AESAUG model architecture enhanced the accuracy of predicting the overall score to [0.851] (on Fold 2), outperforming the baseline AEST&N best model (LSTM), which scored [0.805], by 4.6%. It even outperformed the best reported result for prompt no. 7, the LSTM ensembles (10 runs) whose QWK was [0.811], by 4%. As shown in Table III, predicting trait scores always leads to an improvement in the AESAUG overall score.

Table III shows the QWKs of our AESAUG models on prompt no. 7 for the overall score and the four trait scores. It also shows the replicated results of the AEST&N systems for the overall score. The statistical significance of improvements is marked with '*'. We produced the AESAUG systems for all models (CNN, RNN, GRU, and LSTM)⁶; all results are shown in Table III.

Based on Table III, all models predict the overall and trait scores competitively compared to the baseline. However, we agree with Taghipour and Ng's [6] finding that LSTM performed significantly better than the other models; it outperformed the baseline model by 4.6%. The least accurate model is the basic RNN, which does not work as precisely as GRU or LSTM. Such a finding can be attributed to the moderately long sequences of words in the texts: both LSTM and GRU demonstrate efficient learning of long-term dependencies in sequences, so we believe this is one of the causes of the RNN's poor performance. The CNN model is the fastest in training and evaluation compared to the other models.

We further investigated the overall and trait scores predicted by our best model (AESAUG LSTM) against the original scores in the ASAP dataset. We present the results in Fig. 4 ((a) for the overall score; (b), (c), (d), and (e) for the traits). The graphs show that the system predictions are less varied and contribute positively to the performance of our proposed approach.

TABLE II. QWK AMONG HUMAN RATERS (H-R1 & H-R2) AND EACH OF AESAUG PREDICTED SCORES (A-R3 & A-R4)

Raters           Average QWK score
H-R1 vs. H-R2    0.641
A-R3 vs. A-R4    0.906
H-R1 vs. A-R3    0.684
H-R1 vs. A-R4    0.680
H-R2 vs. A-R3    0.669
H-R2 vs. A-R4    0.670

TABLE III. THE QWK OF THE SEVERAL AESAUG NEURAL NETWORK MODELS AND THE AEST&N

Systems   AEST&N QWK   AESAUG QWK   Trait 1   Trait 2   Trait 3   Trait 4   Traits avg.
CNN       0.746        0.822*       0.793     0.717     0.714     0.700     0.731
RNN       0.743        0.760*       0.733     0.656     0.652     0.641     0.671
GRU       0.752        0.827*       0.837     0.749     0.728     0.700     0.754
LSTM      0.805        0.851*       0.820     0.767     0.767     0.733     0.771

* p < .0001.

⁴ https://github.com/nusnlp/nea
⁵ We tried to add L2 regularization and a 256-unit dense layer, but the resulting model was not better than the one reported.
⁶ All the mentioned neural network models are unidirectional and include the MoT layer. The convolution layer is included in the CNN model only.
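For reference, the agreement numbers in Tables II and III can be reproduced with a short QWK routine such as the following (our own minimal sketch, not the official ASAP evaluation script; the example scores are made up):

import numpy as np

def quadratic_weighted_kappa(a, b, min_score, max_score):
    """QWK between two integer score vectors on [min_score, max_score]."""
    a = np.asarray(a) - min_score
    b = np.asarray(b) - min_score
    n = max_score - min_score + 1
    O = np.zeros((n, n))                     # observed score-pair counts
    for i, j in zip(a, b):
        O[i, j] += 1
    # Expected counts under independent rater marginals.
    E = np.outer(np.bincount(a, minlength=n),
                 np.bincount(b, minlength=n)) / len(a)
    # Quadratic disagreement weights.
    W = (np.arange(n)[:, None] - np.arange(n)[None, :]) ** 2 / (n - 1) ** 2
    return 1.0 - (W * O).sum() / (W * E).sum()

# Made-up example: agreement between a reference score (two raters summed,
# as in the revised calculation above) and a model prediction on a 0-30 scale.
reference = [22, 17, 25, 12, 19]
predicted = [21, 18, 25, 14, 19]
print(quadratic_weighted_kappa(reference, predicted, 0, 30))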
Fig. 4. For Prompt no. 7 and its Traits, the System Predictions are Less Varied and Positively Contribute to the Efficiency of our Proposed Approach. (a) Represents the Overall Score, While (b), (c), (d), and (e) Represent the Four Traits' Scores, respectively. The Blue Circles Represent the Original Essay Scores, and the Red Pluses the Predicted Scores. All Predicted Scores are Mapped to their Original Scoring Scale.
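The mapping mentioned in the caption, from normalized model outputs back to a prompt's original scale, can be done as in this small sketch (the 0-30 range is our assumption for prompt no. 7's resolved overall score):

def to_original_scale(y_norm, lo=0, hi=30):
    """Map a sigmoid output in [0, 1] back to [lo, hi] and round to an integer score."""
    return int(round(lo + y_norm * (hi - lo)))

print(to_original_scale(0.62))  # e.g. 0.62 -> 19 on a 0-30 scale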
V. CONCLUSIONS

In this paper, we have proposed a framework based on deep learning models that strengthens the validity and enhances the accuracy of a baseline system with respect to trait evaluation/scoring. Our method relies not only on overall score prediction but also on essay trait prediction to give trait-specific adaptive feedback. We explored multiple deep learning models for the automatic essay scoring task.

Based on our experiments, we can conclude that our proposed AESAUG model outperformed all the previously used AES models (CNN, RNN, GRU, and LSTM). Including traits in training significantly improved the learning process. Thus, our AESAUG system has significantly increased the accuracy of the overall and trait scores for essays scored with analytic rubrics. This point highlights the contribution of our model over all the previous models.

It was also found that the LSTMAUG model, like the AEST&N system, proves to be the best model for predicting scores for essays that include relatively long sequences of words, which is consistent with the nature of LSTM models. However, adding a dense layer between the MoT layer and the output layer did not improve the results of our AESAUG model. We can also assume, based on our experiments, that increasing the training data has a positive effect on the accuracy of AESAUG scores.

Additionally, it is very important to note that the clarity of the definition of the scoring rubrics strongly influences the accuracy of both human and AESAUG scores, which accordingly affects the quality of the adaptive feedback that can be given to the learners. In other words, the clearer and more definite the rubric, the more accurate the AESAUG scores and the more specific the feedback.

Finally, our proposed AESAUG model offers a new methodology that may be interesting to users, and it provides more accurate results without requiring a high hardware configuration.

VI. FUTURE WORK

Future directions of this work may be to highlight the words and sentences that made the AES system give a specific score, for further analysis and adaptive feedback, in addition to training and testing the model on a larger dataset with well-defined rubrics.

REFERENCES
[1] S. Brown, 500 Tips on Assessment. Routledge, 2004.
[2] T. C. Stephen, "Using Automated Essay Scoring to Assess Higher-Level Thinking Skills in Nursing Education," 2019.
[3] S. M. Brookhart, How To Create and Use Rubrics. ASCD, 2013.
[4] A. J. Nitko and S. M. Brookhart, Educational Assessment of Students. Pearson Merrill Prentice Hall, 2007.
[5] M. Lu, Q. Deng, and M. Yang, "EFL writing assessment: Peer assessment vs. automated essay scoring," in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2020.
[6] K. Taghipour and H. T. Ng, "A Neural Approach to Automated Essay Scoring," in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016, pp. 1882–1891.
[7] F. Nadeem, H. Nguyen, Y. Liu, and M. Ostendorf, "Automated Essay Scoring with Discourse-Aware Neural Models," 2019.
[8] M. A. Hussein, H. Hassan, and M. Nassef, "Automated language essay scoring systems: A literature review," PeerJ Comput. Sci., 2019.
[9] N. W. Solomons et al., "[Use of tests based on the analysis of expired air in nutritional studies]," Arch. Latinoam. Nutr., vol. 28, no. 3, pp. 301–317, 1978.
[10] Z. Ke and V. Ng, "Automated Essay Scoring: A Survey of the State of the Art," 2019.
[11] R. D. Roscoe, L. K. Varner, S. A. Crossley, and D. S. McNamara, "Developing pedagogically-guided algorithms for intelligent writing feedback," 2013.
[12] B. Woods, D. Adamson, S. Miel, and E. Mayfield, "Formative Essay Feedback Using Predictive Scoring Models," in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2017, pp. 2071–2080.
[13] S. Dikli, "Automated essay scoring," Turkish Online Journal of Distance Education, vol. 7, no. 1, pp. 45–56, 2006.
[14] M. D. Shermis and B. Hamner, "Contrasting state-of-the-art automated scoring of essays: Analysis," Annual National Council on Measurement in Education Meeting, 2012.
[15] M. D. Shermis, "State-of-the-art automated essay scoring: Competition, results, and future directions from a United States demonstration," Assess. Writ., vol. 20, pp. 53–76, 2014.
[16] T. Dasgupta, A. Naskar, R. Saha, and L. Dey, "Augmenting Textual Qualitative Features in Deep Convolution Recurrent Neural Network for Automatic Essay Scoring," in Proceedings of the 5th Workshop on Natural Language Processing Techniques for Educational Applications, 2018, pp. 93–102.
[17] F. Dong and Y. Zhang, "Automatic Features for Essay Scoring – An Empirical Study," 2016.
[18] P. Phandi, K. M. A. Chai, and H. T. Ng, "Flexible Domain Adaptation for Automated Essay Scoring Using Correlated Linear Regression," Association for Computational Linguistics, 2015.
[19] D. Alikaniotis, H. Yannakoudakis, and M. Rei, "Automatic Text Scoring Using Neural Networks," 2016.
[20] Y. N. Dauphin, H. de Vries, and Y. Bengio, "Equilibrated adaptive learning rates for non-convex optimization," in Advances in Neural Information Processing Systems, 2015, pp. 1504–1512.
[21] W. Y. Zou, R. Socher, D. Cer, and C. D. Manning, "Bilingual Word Embeddings for Phrase-Based Machine Translation," Association for Computational Linguistics, 2013.
[22] H. Yannakoudakis and R. Cummins, "Evaluating the performance of Automated Text Scoring systems," in Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications, 2015.