
(IJACSA) International Journal of Advanced Computer Science and Applications,

Vol. 11, No. 5, 2020

A Trait-based Deep Learning Automated Essay Scoring System with Adaptive Feedback

Mohamed A. Hussein¹, Hesham A. Hassan², Mohammad Nassef³
¹National Center for Examination and Educational Evaluation (NCEEE), Cairo, Egypt
²,³Faculty of Computers and Information, Cairo University, Cairo, Egypt

Abstract—Numerous Automated Essay Scoring (AES) systems have been developed over the past years. Recent advances in deep learning have shown that applying neural network approaches to AES systems has accomplished state-of-the-art solutions. Most neural-based AES systems assign an overall score to given essays, even if they depend on analytical rubrics/traits. Trait evaluation/scoring helps to identify learners' levels of performance. Besides, providing feedback to learners about their writing performance is as important as assessing their level. Producing adaptive feedback to learners requires identifying their strengths/weaknesses and the magnitude of influence of each trait. In this paper, we develop a framework that strengthens the validity and enhances the accuracy of a baseline neural-based AES model with respect to trait evaluation/scoring. We extend the model to present a method based on essay trait prediction to give trait-specific adaptive feedback. We explored multiple deep learning models for the automatic essay scoring task, and we performed several analyses to get some indicators from these models. The results show that the Long Short-Term Memory (LSTM) based system outperformed the baseline study by 4.6% in terms of quadratic weighted Kappa (QWK). Moreover, predicting the trait scores enhances the prediction of the overall score. Our extended model is used in iAssistant, an educational module that provides trait-specific adaptive feedback to learners.

Keywords—AES system; trait evaluation; adaptive feedback; deep learning; neural networks; ASAP

I. INTRODUCTION

"Nothing we do to, or for our students is more important than our assessment of their work and the feedback we give them on it [1]." It is widely acknowledged that feedback is a critical element of learning [2]. Both scores and feedback are fundamental aspects of the learning process. Accurate scoring of learners' answers creates a fair way to assess learners' work. Moreover, giving feedback to learners about their answers helps them identify their weaknesses and improve their performance.

Rubrics are widely used in evaluating learners' answers to essay questions. Brookhart (2013) defines a rubric as "a coherent set of criteria for learners' work that includes descriptions of levels of performance quality on the criteria [3]." The definition identifies two significant aspects of a good rubric: coherent sets of criteria and descriptions of levels of performance for these criteria. There are two types of rubrics: analytic and holistic. An analytic rubric evaluates each criterion separately, while a holistic rubric evaluates all criteria simultaneously. Each type has its advantages and disadvantages. Analytic rubrics give formative feedback to learners and are easier to link to instruction; nevertheless, they take more time to score and to achieve acceptable inter-rater reliability than holistic rubrics. Holistic rubrics are faster and suitable for summative assessment (assessment of learning). On the other hand, a single overall score does not communicate what to do to improve learning and is not useful for formative assessment (assessment for learning) [4]. It is also interesting that research showed learners prefer AES feedback over peer feedback [5].

Over the past years, various AES systems have been developed to evaluate learners' responses to a given prompt (essay). AES systems automatically assess the quality of the written text and assign a score to it. The efficiency of these systems depends on the agreement between the human-rater scores and the AES scores [6]. Research in deep learning has led to neural network models for the automatic essay scoring task that move away from feature engineering, and utilizing neural networks for this task has achieved state-of-the-art outcomes [7]. Utilizing automatically learned features has added significant benefits to the efficiency of such systems as well [8], [9].

The vast majority of existing neural-based AES systems were developed for holistic scoring of given essays, even if they depend on analytical rubrics/traits [10]. Trait evaluation/scoring helps to identify learners' levels of performance. Besides, providing feedback to learners about their writing requires identifying their strengths/weaknesses and the magnitude of influence of each trait. Based on that, our goal is to develop a framework that strengthens the validity and enhances the accuracy of neural-based AES approaches with respect to trait evaluation/scoring. Using this framework should also help in providing effective adaptive feedback to learners.

The rest of the paper is organized as follows: Section 2 gives a brief overview of related work. Section 3 describes the methods and materials, including the AES models (baseline and augmented), the dataset, training, and testing, in addition to the evaluation metric. Results are reported and discussed in Section 4. Our conclusion and future improvements are in Section 5.


II. RELATED WORK

PEG, the earliest AES system, was developed by Ellis Page in 1966 and was the starting spark for decades of research into AES. Since then, many AES systems have been developed that analyze the quality of text and assign a score to it, using various manually tuned shallow and deep linguistic features [5].

AES systems can be classified into two main types: i) handcrafted discrete features-based systems, bounded to specific domains, which usually use natural language processing, latent semantic analysis, or Bayesian networks, etc.; and ii) automatic feature extraction-based systems, which usually use neural networks [5].

Several AES systems include automated scoring alongside feedback. Examples of the first type are Criterion, MY Access, and Writing Pal. Criterion provides an overall score and learner feedback using E-rater and Critique as its AES components: the E-rater module performs the automatic scoring task for a given essay, while Critique consists of a set of modules that detect mistakes/errors in mechanics, grammar, and usage, and then identifies issues of discourse and style in writing. MY Access offers instant scores and diagnostic feedback based on the IntelliMetric AES system to stimulate learners to improve their writing ability [8]. Writing Pal is an intelligent tutoring system that is mainly concerned with learning tasks and provides the service of evaluating writing tasks with feedback [11]. It targets learners' writing strategies while providing automated feedback. However, it is classified as a handcrafted discrete features-based system; its automatic essay scoring model is separate from the feedback part, and it uses specific algorithms for each feedback category.

In contrast, few systems of the other type consider scoring the traits and providing appropriate feedback for each essay. Woods et al. [12] established a new ordinal essay scoring model with an extension that uses essay trait prediction to give formative trait-specific feedback to learners. Nevertheless, one concern with their system is that their Ordinal Logistic Regression (OLR) model does not perform accurately on essays with large scoring ranges (like prompts 1 and 7 in the ASAP dataset).

III. MATERIALS AND METHODS

A. Baseline Model

Taghipour and Ng [6] developed an AES system (AEST&N) based on neural networks, which automatically predicts the overall score of a given essay [10]. AEST&N takes the sequence of words in an essay as input; the model first uses a convolution layer to extract n-gram level features. These features, which capture the local textual dependencies among the words in an n-gram, are then passed to a recurrent layer composed of an LSTM network. It was trained on Kaggle's ASAP dataset and gave state-of-the-art results. The evaluation metric used to evaluate the efficiency of the system is Quadratic Weighted Kappa (QWK) [6], [8]. They used 5-fold cross-validation, and for each fold, they distributed the dataset into 60%, 20%, and 20% training, development, and testing sets, respectively. The AEST&N model architecture is illustrated in Fig. 1.

Fig. 1. AEST&N Model Architecture of Taghipour and Ng [6], where the Output Layer Predicts Only the Overall Score.

The AEST&N results show that all model variations (Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), Gated Recurrent Units (GRU), and LSTM) succeeded in learning the task properly, with performance comparable to or better than the baseline (an AES system called 'Enhanced AI Scoring Engine' (EASE)¹). The authors reported that the LSTM-based AEST&N system significantly outperformed the other neural network systems (RNN, GRU, and CNN) and outperformed the baseline by 4.1%.

The AEST&N system has significantly outperformed the other AES systems, yet there is always room for improvement in scoring accuracy. AEST&N predicts only overall scores, although some of the essays have analytical rubrics/traits. Moreover, it does not provide any feedback to learners.

B. Proposed Model

Our model (AESAUG) is inspired by the baseline model AEST&N of Taghipour and Ng [6]. We extend and utilize the AEST&N model to predict not only the overall score for essays but also the trait scores. Besides, we aim to utilize the trait scores to provide adaptive feedback to learners. Fig. 2 presents the AESAUG model architecture, which is described next.

Fig. 2. AESAUG Model Architecture, where the Output Layer Predicts both the Overall Score and the Traits' Scores.

¹ EASE is an open-source handcrafted features-based AES system. It depends on Bayesian linear ridge regression and vector regression techniques. It placed third in the ASAP competition (among 154 systems).


1) The Lookup Table Layer: the first layer/step of the model transforms each word into a $d$-dimensional space. Given a sentence $W = (w_1, w_2, \dots, w_M)$, the output of the lookup table operation $LT(\cdot)$ is represented in Equation 1:

$LT(W) = (E \cdot w_1, E \cdot w_2, \dots, E \cdot w_M)$  (1)

where $w_i$ is the one-hot representation of the $i$-th word in the sentence, and $E$ is the embedding matrix (learned in the training stage).

2) The Convolution Layer (optional): extracts feature vectors from $LT(W)$. It can capture local contextual dependencies in writing and, therefore, enhance the efficiency of the system. In order to extract local features from the sequence, the convolution layer applies a linear transformation to all $M$ windows in the given sequence of vectors.
3) The Recurrent Layer: processes the input (whether from the convolution layer or directly from the lookup table layer) to generate a representation for the given essay. This representation should encode all the information required for scoring the essay. Since certain essays are quite long, the proposed model preserves all the intermediate states of the recurrent layer to keep track of the important bits of information. We also experimented with basic RNN vs. GRU vs. LSTM.

In order to control the flow of information during the processing of the input sequence, LSTM units use three gates to discard (forget) or pass information through time. The following equations formally describe the LSTM function:

$i_t = \sigma(W_i \cdot x_t + U_i \cdot h_{t-1} + b_i)$  (2)
$f_t = \sigma(W_f \cdot x_t + U_f \cdot h_{t-1} + b_f)$  (3)
$\tilde{c}_t = \tanh(W_c \cdot x_t + U_c \cdot h_{t-1} + b_c)$  (4)
$c_t = i_t \circ \tilde{c}_t + f_t \circ c_{t-1}$  (5)
$o_t = \sigma(W_o \cdot x_t + U_o \cdot h_{t-1} + b_o)$  (6)
$h_t = o_t \circ \tanh(c_t)$  (7)

where $\sigma$ represents the sigmoid function, $\circ$ denotes element-wise multiplication, $x_t$ and $h_t$ are the input and output vectors at time $t$, respectively, $W$ and $U$ are weight matrices, and $b$ are bias vectors.
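To make Equations 2-7 concrete, the following is a minimal NumPy sketch of a single LSTM step. The lstm_step helper, the random initialization, and the toy dimensions (borrowed from Table I) are our illustration under stated assumptions, not the authors' code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step following Equations 2-7.

    W, U, b are dicts holding the input, forget, candidate, and
    output parameters, keyed by 'i', 'f', 'c', 'o'.
    """
    i_t = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])    # Eq. 2: input gate
    f_t = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])    # Eq. 3: forget gate
    c_hat = np.tanh(W['c'] @ x_t + U['c'] @ h_prev + b['c'])  # Eq. 4: candidate state
    c_t = i_t * c_hat + f_t * c_prev                          # Eq. 5: new cell state
    o_t = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])    # Eq. 6: output gate
    h_t = o_t * np.tanh(c_t)                                  # Eq. 7: new hidden state
    return h_t, c_t

# Toy dimensions: 50-d word vectors, 300-d hidden state (as in Table I).
rng = np.random.default_rng(0)
d_in, d_h = 50, 300
W = {k: rng.normal(scale=0.1, size=(d_h, d_in)) for k in 'ifco'}
U = {k: rng.normal(scale=0.1, size=(d_h, d_h)) for k in 'ifco'}
b = {k: np.zeros(d_h) for k in 'ifco'}
h, c = np.zeros(d_h), np.zeros(d_h)
h, c = lstm_step(rng.normal(size=d_in), h, c, W, U, b)
```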
4) The Mean over Time (MoT) Layer: the input of this layer is the variable-length sequence of vectors output by the recurrent layer, $H = (h_1, h_2, \dots, h_M)$. This layer aggregates these inputs into a fixed-length vector and feeds it to the dense layer. Equation 8 describes the function of this layer:

$MoT(H) = \frac{1}{M} \sum_{t=1}^{M} h_t$  (8)
5) The Dense Layer (optional): gives more depth and enhances the efficiency of the model in predicting the trait scores in addition to the overall score in the output layer. The mathematical form of the layer is shown in Equation 9:

$y = f(W \cdot x + b)$  (9)

where $W$ is the weight matrix, $b$ is the bias vector, $x$ is the input of the layer (the activations of the previous, MoT, layer), and $y$ is the dense layer output.

6) The Output Layer (Linear Layer with Sigmoid Activation): maps the vector generated by the dense layer to a scalar value. Equation 10 describes applying the sigmoid activation function to the linear layer mapping:

$s(x) = \sigma(w \cdot x + b)$  (10)

where $x$ is the input vector, $w$ is the weight vector, and $b$ is the bias value. In order to predict the trait scores, we extend the baseline model architecture by adding further linear units to the output layer that perform a linear regression to predict the trait scores.

We minimized the Mean Squared Error (MSE) between the predicted score and the reference score (human raters' scores). The AEST&N MSE loss function is designed only for overall score prediction. To fit with predicting both the overall and trait scores in our AESAUG model, we adjusted the AEST&N MSE loss function (shown in Equation 11) to compute the overall loss as a linear combination of multiple loss functions (shown in Equation 12), back-propagating the error gradients down to the embedding matrix:

$MSE(s, s^*) = \frac{1}{N} \sum_{i=1}^{N} (s_i - s_i^*)^2$  (11)

$L = \frac{1}{N} \left( \sum_{i=1}^{N} (s_i - s_i^*)^2 + \sum_{j=1}^{K} \sum_{i=1}^{N} (t_{ij} - t_{ij}^*)^2 \right)$  (12)

where $K$ is the number of traits of a specific prompt, given $N$ training essays with their corresponding normalized reference overall scores $s^*$ and normalized reference trait scores $t^*$. The model computes the predicted overall scores and trait scores for all training essays.
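To illustrate the extended architecture and the combined loss of Equation 12, here is a minimal Keras sketch using the hyper-parameters of Table I. The vocabulary size, activation choices, and clipping threshold are our assumptions; this is not the authors' released code.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

# Table I values; VOCAB is an assumption (the paper does not state a vocabulary size).
VOCAB, EMB_DIM, HIDDEN, NUM_TRAITS = 4000, 50, 300, 4

words = layers.Input(shape=(None,), dtype='int32')                   # word-index sequence
x = layers.Embedding(VOCAB, EMB_DIM)(words)                          # lookup table layer (Eq. 1)
x = layers.Conv1D(50, 3, padding='same')(x)                          # optional convolution layer
x = layers.LSTM(HIDDEN, return_sequences=True)(x)                    # recurrent layer (Eqs. 2-7)
x = layers.GlobalAveragePooling1D()(x)                               # Mean over Time layer (Eq. 8)
x = layers.Dropout(0.5)(x)                                           # dropout regularization
overall = layers.Dense(1, activation='sigmoid', name='overall')(x)   # output layer (Eq. 10)
traits = layers.Dense(NUM_TRAITS, name='traits')(x)                  # added linear trait units

model = Model(words, [overall, traits])
# Summing the two MSE terms reproduces the combined loss of Eq. 12;
# the clipnorm threshold value is an assumption (the paper does not give it).
model.compile(
    optimizer=tf.keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9, clipnorm=10.0),
    loss={'overall': 'mse', 'traits': 'mse'},
    loss_weights={'overall': 1.0, 'traits': 1.0},
)
# model.fit(train_seqs, {'overall': y_overall, 'traits': y_traits},
#           batch_size=32, epochs=50)  # fixed 50 epochs, as in Section III-D
```

The optional dense layer between the MoT layer and the output layer is omitted here, since Section V reports it did not improve the results.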
C. Dataset

AES research has been dominated for the last eight years by the dataset from the 2012 Automated Student Assessment Prize (ASAP) competition [13]. It was established by Kaggle and funded by the Hewlett Foundation. The ASAP competition provided the data and all the required information (hand-crafted features) that can help to evaluate AES systems that use machine learning algorithms. ASAP consists of 12,976 essays, with an average length of 150-to-550 words per essay, each double scored (Cohen's κ = 0.86) [8]. The dataset consists of eight tasks/prompts; each task is an essay prompt with learners' responses. ASAP provided the scoring guides, raters' exemplars, and practice sets for each task. Five tasks employed a holistic scoring rubric, one was scored with a two-trait analytic rubric, and two were scored with a multi-trait analytic rubric but reported as an overall score [14]. Shermis [15] provides a summary of the competition, and most recent papers report their results using the same public dataset [16], [6], [12], [17], [18], [19].

In this research, we have used the ASAP data, specifically the task 7 data. Task 7 was selected because it has a multi-trait analytic rubric that can be used for formative feedback to learners, and it has the largest dataset among the multi-trait analytic rubric-based tasks (1,569 essays).


The type of writing in task 7 is persuasive/narrative/expository. The prompt asks learners to write a story about patience. The scoring rubric has four traits: ideas, organization, style, and conventions. Each trait score ranges from 0 to 3, and each score in each trait has a description that guides the rater to identify the appropriate score (level) for each text. In ideas, for example, if the ideas are clearly focused on the topic and are thoroughly developed with specific, relevant details, a score of 3 should be assigned. If the ideas are somewhat focused on the topic and are developed with a mix of specific and/or general details, a score of 2 should be assigned. If the ideas are minimally focused on the topic and developed with limited and/or general details, a score of 1 should be assigned. If the ideas are not focused on the task and/or are undeveloped, a score of 0 should be assigned. For objectivity and accuracy, two raters score the response of each learner on each trait. Then, the scores are summed independently for Rater1 and Rater2, and the resolved score (0-30) is formed by adding the sums of the two raters.

D. Training and Testing

We have followed the dataset split of Taghipour and Ng [6], using 5-fold cross-validation to assess our proposed system. The data in each fold is distributed into 60%, 20%, and 20% training, development, and test sets, respectively. For prompt no. 7 and each of its four traits, the fold predictions have been aggregated and evaluated together. In order to evaluate system efficiency, the results are averaged across the four traits. See Fig. 3.

Fig. 3. Prompt no. 7 Dataset Folds Distribution, Green is Training, Yellow is Validation, and Red is Test Set.

The essays have been tokenized by the NLTK² tokenizer, which lowercases the letters, and the reference scores are normalized to the range of [0, 1]. For the system performance evaluation, we rescaled the system-predicted normalized scores to the original range of scores.

In some experimental scenarios, we used a different split ratio in each fold to maximize the training data size: 80% of the data as a training set and 20% as the test set.
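The following is a minimal sketch of the tokenization and score normalization just described. The helper names are ours, and we assume prompt 7's resolved overall range of 0-30; it requires NLTK's tokenizer data to be installed.

```python
import nltk  # assumes the 'punkt' tokenizer data is available

SCORE_MIN, SCORE_MAX = 0, 30  # resolved overall-score range for prompt 7

def tokenize(essay: str) -> list[str]:
    # Lowercase and split into word tokens, as done before training.
    return nltk.word_tokenize(essay.lower())

def normalize(score: float) -> float:
    # Map a reference score into [0, 1] for training against sigmoid outputs.
    return (score - SCORE_MIN) / (SCORE_MAX - SCORE_MIN)

def rescale(pred: float) -> float:
    # Map a predicted score in [0, 1] back to the original scale for evaluation.
    return pred * (SCORE_MAX - SCORE_MIN) + SCORE_MIN

print(normalize(24))   # 0.8
print(rescale(0.85))   # 25.5, rounded to the nearest valid score for QWK
```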
We followed AEST&N in using the RMSProp optimization algorithm [20] to minimize the MSE loss function over the training data. We also used dropout regularization to avoid overfitting, and if the norm of the gradient is larger than a threshold, it is clipped. We did not use any early stopping method; we trained the model for a fixed 50 epochs and, after each epoch, monitored the model's efficiency on the development set.

The system hyper-parameters are several: to train the network, we used the RMSProp optimizer with the decay rate (ρ) set to 0.9. We used pre-trained word embeddings³, released by Zou et al. [21], to initialize the lookup table layer. The hyper-parameter settings are listed in Table I. We used an Nvidia GEFORCE GTX 1050 GPU to perform our experiments in parallel.

TABLE I. AESAUG MODEL HYPER-PARAMETERS

Parameter | Meaning/description | Value
- | Word embedding dimension | 50
- | Output dimension of the recurrent layer | 300
- | Word context window size | 3
- | Word convolution units | 50
drop-rate | Dropout probability | 0.5
batch-size | Mini-batch size (a) | 32
learn-rate | Base learning rate | 0.001
(a) Trained for a fixed 50 epochs.

E. Evaluation

The evaluation of AES systems is always done by comparing the AES scores to the scores assigned by human raters. Various statistical tests of correlation or agreement are used for this purpose, including Pearson's correlation, Spearman's correlation, and QWK [22]. QWK was identified as the official evaluation metric for ASAP. In this paper, we used QWK to compare our system to the well-established baseline (AEST&N) that used the same dataset. QWK is a commonly used measure of the degree of agreement among raters (a.k.a. inter-rater reliability). The following part illustrates how QWK is computed.

A weight matrix $W$ is created based on Equation 13:

$W_{i,j} = \frac{(i - j)^2}{(N - 1)^2}$  (13)

where $i$ and $j$ are the reference scores and the hypothesis scores (AES scores), respectively, and $N$ refers to the number of all possible scores. A matrix $O$ is calculated such that $O_{i,j}$ refers to the number of texts that are given a score $i$ by the human rater and an AES score $j$. An expected count matrix $E$ is computed as the outer product of the histogram vectors of the two scores, and it is normalized so that the sum of its elements equals the sum of the elements in $O$. Lastly, based on matrices $O$ and $E$, the QWK is computed as in Equation 14:

$\kappa = 1 - \frac{\sum_{i,j} W_{i,j} O_{i,j}}{\sum_{i,j} W_{i,j} E_{i,j}}$  (14)

Our comparison between AESAUG and AEST&N always uses the QWK values, and a one-tailed paired t-test is always used to check the significance of the differences between the two systems.
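A minimal NumPy sketch of Equations 13 and 14 follows. The function name and the assumption that scores are integers in [0, N-1] are ours.

```python
import numpy as np

def quadratic_weighted_kappa(human, system, num_scores):
    """QWK per Equations 13-14; scores are integers in [0, num_scores - 1]."""
    human, system = np.asarray(human), np.asarray(system)

    # Eq. 13: quadratic disagreement weights.
    i, j = np.meshgrid(np.arange(num_scores), np.arange(num_scores), indexing='ij')
    W = (i - j) ** 2 / (num_scores - 1) ** 2

    # O: observed count matrix of (human score, system score) pairs.
    O = np.zeros((num_scores, num_scores))
    np.add.at(O, (human, system), 1)

    # E: outer product of the two score histograms, normalized to sum(O).
    E = np.outer(np.bincount(human, minlength=num_scores),
                 np.bincount(system, minlength=num_scores)).astype(float)
    E *= O.sum() / E.sum()

    # Eq. 14.
    return 1.0 - (W * O).sum() / (W * E).sum()

# Example: prompt 7 overall scores lie in 0..30, so num_scores = 31.
print(quadratic_weighted_kappa([10, 20, 30], [12, 20, 28], 31))
```

The paired t-test mentioned above could then be run over per-fold QWK values, e.g. with scipy.stats.ttest_rel.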

² http://www.nltk.org
³ http://ai.stanford.edu/~wzou/mt


IV. RESULTS AND DISCUSSION

We describe in this part our experiments and results. For overall scores, we report the results and then compare our system to the baseline (AEST&N). For trait scores, we present only the results of our AESAUG system and its QWK evaluation, as the AEST&N system did not predict trait scores.

We started our experiments by replicating the AEST&N⁴ model results over the ASAP dataset. Taghipour and Ng [6] (using AEST&N) explored a variety of neural network architectures, namely CNN, basic RNN, GRU, and LSTM, without using an MoT layer. After replicating the AEST&N systems (CNN, RNN, GRU, and LSTM) and producing the same QWK results, we extended the model to the AESAUG architecture. We trained the model with the training data (described in Section III-C), including the overall score and the four traits' reference scores (by two human raters, as described in Section III-C). We started by simulating the human approach to scoring traits, in which every rater gives a score and the trait score is the summation of the two raters' scores: the AESAUG systems predicted two scores for every trait, and we summed them. We got the same QWK (0.805) for the overall score (on Fold 4), and QWKs of [0.715, 0.623, 0.581, 0.443] for the first predicted trait scores and [0.723, 0.656, 0.568, 0.476] for the second predicted trait scores, with an average of [0.598]⁵.

We found that the predicted trait scores have low QWK values, so we analyzed the case by calculating the QWK among the first human rater (H-R1), the second human rater (H-R2), and each of the AESAUG predicted scores (A-R3 & A-R4). Table II shows the QWKs for trait scores of the human raters and the AESAUG system (using the best model, which is LSTM). We noticed that the agreement (QWK) between the human raters (0.64) is lower than the agreement between any AESAUG prediction and either human rater (0.66, 0.67, 0.68, and 0.68); all QWKs are shown in Table II. In our attempt to understand the logic behind this low agreement, we examined the prompt content and rubrics with the help of two English language specialists. They confirmed that the definitions of the level descriptors in the rubrics are not clear and definite, which may lead to different interpretations between raters and, accordingly, to low inter-rater agreement. They also added that using the summation of the two raters on each trait (as described in the ASAP scoring guide) provides a more accurate and objective indicator of a learner's performance.

In order to enhance the trait QWK scores of the AESAUG systems, we changed our score calculation approach: before training the system, we calculated one score for each trait by summing the two human scores. Then, we calculated the QWK score for each trait between one reference score and one AESAUG-predicted score. As a result of that change, we got higher QWKs for the trait scores [0.820, 0.767, 0.767, 0.733], respectively, with an average QWK of [0.771]. We also noticed that trait score prediction within the AESAUG model architecture enhanced the accuracy of predicting the overall score [0.851] (on Fold 2), outperforming the baseline AEST&N best model (LSTM), which was [0.805], by 4.6%. It even outperformed the best reported result for prompt no. 7, LSTM ensembles (10 runs) with a QWK of [0.811], by 4%. As shown in Table III, predicting trait scores always leads to an improvement in the AESAUG overall score.

Table III shows the QWKs of our AESAUG models on the prompt no. 7 overall score and four trait scores. It also shows the replicated AEST&N results for the overall score. The statistical significance of improvements is marked with '*'. We produced AESAUG systems for all models (CNN, RNN, GRU, and LSTM)⁶; all results are shown in Table III. Based on Table III, all models can predict the overall and trait scores competitively compared to the baseline. However, we agree with the findings of Taghipour and Ng [6] that LSTM performed significantly better than the other models, and it outperformed the baseline model by 4.6%. The least accurate model is basic RNN, which does not work as precisely as GRU or LSTM. Such a finding can be attributed to the moderately long sequences of words in the texts: both LSTM and GRU demonstrate efficient learning of long-term dependencies, and we believe this explains the RNN's poor performance. The CNN model is the fastest in training and evaluation compared to the other models.

We further investigated the overall and trait scores predicted by our best model (AESAUG LSTM) against the original scores in the ASAP dataset. We present the results in Fig. 4 ((a) for the overall score; (b), (c), (d), and (e) for the traits). The graphs show the system predictions are less varied and positively contribute to the performance of our proposed approach.

TABLE II. QWK AMONG HUMAN RATERS (H-R1 & H-R2) AND EACH OF AESAUG PREDICTED SCORES (A-R3 & A-R4)

Raters | Average QWK score
H-R1 vs. H-R2 | 0.641
A-R3 vs. A-R4 | 0.906
H-R1 vs. A-R3 | 0.684
H-R1 vs. A-R4 | 0.680
H-R2 vs. A-R3 | 0.669
H-R2 vs. A-R4 | 0.670

TABLE III. THE QWK OF THE SEVERAL AESAUG NEURAL NETWORK MODELS AND THE AEST&N

Systems | AEST&N QWK | AESAUG QWK | Trait 1 | Trait 2 | Trait 3 | Trait 4 | Traits average
CNN | 0.746 | 0.822* | 0.793 | 0.717 | 0.714 | 0.700 | 0.731
RNN | 0.743 | 0.760* | 0.733 | 0.656 | 0.652 | 0.641 | 0.671
GRU | 0.752 | 0.827* | 0.837 | 0.749 | 0.728 | 0.700 | 0.754
LSTM | 0.805 | 0.851* | 0.820 | 0.767 | 0.767 | 0.733 | 0.771
* p < .0001.

⁴ https://github.com/nusnlp/nea
⁵ We tried adding L2 regularization and a 256-unit dense layer, but the resulting model was not better than the one reported.
⁶ All the mentioned neural network models are unidirectional and include the MoT layer. The convolution layer is included in the CNN model only.


Fig. 4. For Prompt no. 7 and its Traits, the System Predictions are Less Varied and Positively Contribute to the Efficiency of our Proposed Approach. (a) Represents the Overall Score, While (b), (c), (d), and (e) Represent the Four Traits' Scores, respectively. The Blue Circles Represent the Original Essay Scores, and the Red Pluses the Predicted Scores. All Predicted Scores are Mapped to their Original Scoring Scale.

In the end, we experimented with a different dataset split from the one described in Section III-D (60% training, 20% validation, and 20% testing): we merged the training set with the validation set to obtain 80% training and 20% testing. This achieved a better QWK score for the overall score, [0.858] instead of [0.851], which indicates that the availability of a bigger training set improves the results.

Finally, we used the above method and its results in iAssistant, an educational module that provides trait-specific adaptive feedback to learners. As shown in Fig. 5, iAssistant provides learners with predicted scores on multiple rubric traits and a level of performance per trait. In addition, it helps learners to evaluate the length of their essay on a scale of three levels (short, good, and long).

Fig. 5. An Example of iAssistant in use: Predicted Scores on Multiple Rubric Traits and Levels of Performance, in Addition to Representing the Overall Score and Length of the Essay.
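As an illustration of how trait predictions could drive such feedback, here is a hypothetical sketch of the kind of mapping iAssistant performs. The level labels, word-count thresholds, and function names are our assumptions, not the published module; the length thresholds are loosely based on the 150-to-550 word range reported for ASAP essays.

```python
# Hypothetical mapping from predicted trait scores (0-3) and essay length
# to learner-facing feedback; labels and thresholds are illustrative only.
TRAITS = ('ideas', 'organization', 'style', 'conventions')
LEVELS = {0: 'needs work', 1: 'developing', 2: 'proficient', 3: 'strong'}

def length_level(num_words: int, lo: int = 150, hi: int = 550) -> str:
    # Three-level essay-length indicator (short / good / long).
    return 'short' if num_words < lo else 'long' if num_words > hi else 'good'

def feedback(trait_scores: dict[str, float], num_words: int) -> dict[str, str]:
    # Round each rescaled trait prediction to its rubric level description.
    report = {t: LEVELS[round(s)] for t, s in trait_scores.items()}
    report['length'] = length_level(num_words)
    return report

print(feedback({'ideas': 3, 'organization': 2, 'style': 2, 'conventions': 1}, 120))
# {'ideas': 'strong', 'organization': 'proficient', 'style': 'proficient',
#  'conventions': 'needs work', 'length': 'short'}
```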


V. CONCLUSIONS

In this paper, we have proposed a framework, based on deep learning models, that strengthens the validity and enhances the accuracy of a baseline system with respect to trait evaluation/scoring. Our method does not rely only on overall score prediction but also on essay trait prediction to give trait-specific adaptive feedback. We explored multiple deep learning models for the automatic essay scoring task.

Based on our experiments, we can conclude that our proposed AESAUG model outperformed all the previously used AES models (CNN, RNN, GRU, and LSTM). Including traits in training significantly improved the learning process. Thus, our AESAUG system has significantly increased the accuracy of the overall and trait scores for essays scored with analytic rubrics. This point highlights the contributions of our model over all the previous models.

It is also found that the LSTMAUG model, like the AEST&N system, proves to be the best model for predicting scores for essays that include relatively long sequences of words, which is consistent with the nature of LSTM models. However, adding a dense layer between the MoT layer and the output layer did not improve the results of our AESAUG model. We can also conclude, based on our experiments, that increasing the training data has a positive effect on the accuracy of AESAUG scores.

Additionally, it is very important to note that the clarity of the definitions in the scoring rubrics strongly influences the accuracy of both human and AESAUG scores, which accordingly affects the quality of the adaptive feedback that can be given to learners. In other words, the clearer and more definite the rubric, the more accurate the AESAUG scores and the more specific the feedback.

Finally, our proposed AESAUG model offers a new methodology that may be interesting to users, and it provides more accurate results without requiring a high hardware configuration.

VI. FUTURE WORK

Future directions of this work may include highlighting the words and sentences that made the AES system give a specific score, for further analysis and adaptive feedback, in addition to training and testing the model on a larger dataset with well-defined rubrics.

REFERENCES
[1] S. Brown, 500 Tips on Assessment. Routledge, 2004.
[2] T. C. Stephen, "Using Automated Essay Scoring to Assess Higher-Level Thinking Skills in Nursing Education," 2019.
[3] S. M. Brookhart, How to Create and Use Rubrics. ASCD, 2013.
[4] A. J. Nitko and S. M. Brookhart, Educational Assessment of Students. Pearson Merrill Prentice Hall, 2007.
[5] M. Lu, Q. Deng, and M. Yang, "EFL writing assessment: Peer assessment vs. automated essay scoring," in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2020.
[6] K. Taghipour and H. T. Ng, "A Neural Approach to Automated Essay Scoring," in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016, pp. 1882-1891.
[7] F. Nadeem, H. Nguyen, Y. Liu, and M. Ostendorf, "Automated Essay Scoring with Discourse-Aware Neural Models," 2019.
[8] M. A. Hussein, H. Hassan, and M. Nassef, "Automated language essay scoring systems: A literature review," PeerJ Comput. Sci., vol. 2019, no. 8, 2019.
[9] N. W. Solomons et al., "[Use of tests based on the analysis of expired air in nutritional studies]," Arch. Latinoam. Nutr., vol. 28, no. 3, pp. 301-317, 1978.
[10] Z. Ke and V. Ng, "Automated Essay Scoring: A Survey of the State of the Art," 2019.
[11] R. D. Varner, L. K. Crossley, and S. A. McNamara, "Developing pedagogically-guided algorithms for intelligent writing feedback," 2013.
[12] B. Woods, D. Adamson, S. Miel, and E. Mayfield, "Formative Essay Feedback Using Predictive Scoring Models," pp. 2071-2080, Aug. 2017.
[13] S. Dikli, "Automated essay scoring," Turkish Online Journal of Distance Education, vol. 7, no. 1, pp. 45-56, 2006.
[14] M. D. Shermis and B. Hamner, "Contrasting state-of-the-art automated scoring of essays: Analysis," Annual National Council on Measurement in Education Meeting, 2012.
[15] M. D. Shermis, "State-of-the-art automated essay scoring: Competition, results, and future directions from a United States demonstration," Assess. Writ., vol. 20, pp. 53-76, 2014.
[16] T. Dasgupta, A. Naskar, R. Saha, and L. Dey, "Augmenting Textual Qualitative Features in Deep Convolution Recurrent Neural Network for Automatic Essay Scoring," pp. 93-102, 2018.
[17] F. Dong and Y. Zhang, "Automatic Features for Essay Scoring - An Empirical Study," 2016.
[18] P. Phandi, K. M. A. Chai, and H. T. Ng, "Flexible Domain Adaptation for Automated Essay Scoring Using Correlated Linear Regression," Association for Computational Linguistics, 2015.
[19] D. Alikaniotis, H. Yannakoudakis, and M. Rei, "Automatic Text Scoring Using Neural Networks," 2016.
[20] Y. N. Dauphin, H. De Vries, and Y. Bengio, "Equilibrated adaptive learning rates for non-convex optimization," in Advances in Neural Information Processing Systems, 2015, pp. 1504-1512.
[21] W. Y. Zou, R. Socher, D. Cer, and C. D. Manning, "Bilingual Word Embeddings for Phrase-Based Machine Translation," Association for Computational Linguistics, 2013.
[22] H. Yannakoudakis and R. Cummins, "Evaluating the performance of Automated Text Scoring systems," in Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications, 2015.
