Bidirectional LSTM with attention mechanism and convolutional layer for text classification
Accepted Manuscript
PII: S0925-2312(19)30106-7
DOI: https://doi.org/10.1016/j.neucom.2019.01.078
Reference: NEUCOM 20383
Please cite this article as: Gang Liu, Jiabao Guo, Bidirectional LSTM with attention
mechanism and convolutional layer for text classification, Neurocomputing (2019), doi:
https://doi.org/10.1016/j.neucom.2019.01.078
Highlights
• For the convolutional layer, the convolution window size and the stride size affect the classification performance.
• BiLSTM and the attention mechanism have greater effects than the convolutional layer on the classification accuracy.
• Compared to the pre-trained word embedding vector, the random word embedding vector requires training more parameters and causes relatively lower classification accuracy within a limited number of iterations.
• Our proposed approach performs better than some state-of-the-art DNNs.
Gang Liu a,*, Jiabao Guo a
a School of Computer Science, Hubei University of Technology, Wuhan 430072, China
Abstract
Neural network models have been widely used in the field of natural language processing (NLP). Recurrent neural networks (RNNs), which have the ability to process sequences of arbitrary length, are common methods for sequence modeling tasks. Long short-term memory (LSTM) is one kind of RNN and has achieved remarkable performance in text classification. However, due to the high dimensionality and sparsity of text data, and to the complex semantics of natural language, text classification presents difficult challenges. In order to solve these problems, a novel and unified architecture which contains a bidirectional LSTM (BiLSTM), an attention mechanism and a convolutional layer, referred to as AC-BiLSTM, is proposed in this paper. In AC-BiLSTM, the convolutional layer extracts features from the word embedding vectors and reduces their dimensionality, BiLSTM accesses both the preceding and succeeding contextual information, and the attention mechanism assigns different focus to the information output by the hidden layers of BiLSTM. Finally, the softmax classifier is used to classify the processed context information. AC-BiLSTM is able to capture both the local features of phrases and the global sentence semantics. Experimental verifications are conducted on seven benchmark datasets.
✩ The work described in this paper was supported by the National Natural Science Foundation of China under Grant No. 61300127. Any conclusions or recommendations stated here are those of the authors and do not necessarily reflect official positions of NSFC.
∗ Corresponding author.
Keywords: long short-term memory, attention mechanism, natural language processing, text classification
1. Introduction

Text classification aims at assigning texts to categories from a predefined set and is an important task in many areas of natural language processing (NLP). It has been applied to recommender systems [1], spam filtering systems [2] and other areas where it is necessary to understand the sentiment of the users. Sentiment analysis, a branch of text classification, is the field of study that analyzes people's opinions, sentiments, appraisals, attitudes, and emotions toward entities and their attributes expressed in written text [3]. At present, there are three main types of methods for text classification: (1) statistics-based classification methods, such as the Bayesian classifier [4]; (2) connected network learning classification methods, such as neural networks [5]; (3) rule-making methods, such as decision tree classification [6].
Most traditional approaches rely on features of the sentences or the target words to solve the text classification problem without considering the relationships between the words. In fact, the classification of a text should be determined based on the entire context. In particular, traditional sentiment analysis focuses on identifying the polarity of a text (e.g. positive, negative, neutral) based on language clues extracted from the textual contents of sentences [7, 8, 9]. Generally, sentiment can be expressed in a subtle or arbitrary manner, making it difficult to identify by simply looking at each sentence or word in isolation. Despite having several striking features and successful applications in various fields, the traditional text classification approaches have been shown to have certain weaknesses.
Deep learning (DL) [10] has achieved remarkable results in many fields in recent years, such as computer vision [11], speech recognition [12] and text classification [13]. For text classification, most studies with deep learning methods can be divided into two parts: (1) learning word vector representations through neural language models [14]; (2) performing composition over the learned word vectors for classification [15]. There are two main kinds of deep learning models in text classification: convolutional neural networks (CNNs) [16] and recurrent neural networks (RNNs) [17]. In recent years, many text classification methods based on CNNs or RNNs have been proposed [18, 19, 20, 21, 22]. CNNs are able to learn the local response from temporal or spatial data but lack the ability to learn sequential correlations. In contrast to CNNs, RNNs are specialized for sequence modeling and are therefore used more frequently in text classification. However, for long data sequences, traditional RNNs suffer from exploding and vanishing gradients. Long short-term memory (LSTM) [23] is a kind of RNN architecture with long short-term memory units as hidden units and effectively alleviates the vanishing and exploding gradient problems.
The high dimensionality of the text vectors increases the number of network parameters, which makes the network difficult to optimize. The convolution operation can extract features while reducing the dimensionality of the data. Therefore, the convolution operation can be used to extract the features of the text vector and reduce the dimensions of the vector. Although BiLSTM can obtain the contextual information of the text, it cannot focus on the important parts of the obtained contextual information. Focusing on the important information improves the accuracy of the classification. The attention mechanism can highlight the important information in the contextual information by assigning different weights. The combination of BiLSTM and the attention mechanism can further improve the classification accuracy.
To further continue the research in this direction, this paper proposes a novel deep learning architecture for text classification. This new architecture is an enhanced BiLSTM using an attention mechanism (AM) [29] and a convolutional layer, referred to as attention-based BiLSTM with a convolutional layer (AC-BiLSTM). The basic idea of the proposed architecture is based on the following considerations. The one-dimensional convolutional filters in the convolutional layer extract n-gram features at different positions of a sentence and reduce the dimensions of the input data. BiLSTM is used to extract the contextual information from the features output by the convolutional layer.
The remainder of this paper is organized as follows. Section 2 introduces LSTM and BiLSTM and gives a short literature review on text classification. The proposed approach is presented in detail in Section 3. Experimental results and discussions are reported in Section 4. Finally, some conclusions and possible paths for future research are provided in Section 5.
2. LSTM, BiLSTM and related work

Unlike feedforward neural networks, RNNs can handle variable-length sequences. LSTM is a kind of RNN architecture and has become the mainstream RNN structure at present. LSTM addresses the vanishing gradient problem of conventional RNNs by introducing memory units. The memory units enable the network to be aware of when to learn new information and when to forget old information. An LSTM unit consists of four components, as illustrated in Fig. 1. The input gate i controls how much of the new memory content is added to the memory. The forget gate f determines the amount of memory that needs to be forgotten. The output gate o modulates the amount of output memory content. The cell activation vector c consists of two components, namely the partially forgotten previous memory c_{t-1} and the modulated new memory c̃_t. The subscript t denotes the t-th time step.
Figure 1: Illustration of the LSTM unit. The weight matrices are represented by lines with arrows.
The mathematical form of the LSTM shown in Fig. 1 is given below. The hidden state h_t given input x_t is computed as follows:

i_t = σ(W_xi x_t + W_hi h_{t-1} + b_i)   (1)
f_t = σ(W_xf x_t + W_hf h_{t-1} + b_f)   (2)
o_t = σ(W_xo x_t + W_ho h_{t-1} + b_o)   (3)
c̃_t = tanh(W_xc x_t + W_hc h_{t-1} + b_c)   (4)
c_t = f_t ⊗ c_{t-1} + i_t ⊗ c̃_t   (5)
h_t = o_t ⊗ tanh(c_t)   (6)
where i_t, f_t, o_t and c_t represent the values of i, f, o and c at time step t, respectively. W denotes the weight matrices of the hidden layer and b denotes the bias vectors. σ(.) and tanh(.) are the sigmoid and hyperbolic tangent functions, respectively. The gate values lie within the range [0, 1], while the hidden outputs lie within [-1, 1]. The operator ⊗ denotes element-wise multiplication.
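To make Eqs. (1)-(6) concrete, the following NumPy sketch computes one LSTM step; the weight and bias names mirror the notation above, and the dictionary layout is an illustrative assumption rather than the paper's implementation.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step following Eqs. (1)-(6).
    W is a dict of weight matrices (e.g. W['xi'] maps the input to the
    input gate, W['hi'] maps the previous hidden state to the input gate)
    and b is a dict of bias vectors."""
    i_t = sigmoid(W['xi'] @ x_t + W['hi'] @ h_prev + b['i'])    # input gate, Eq. (1)
    f_t = sigmoid(W['xf'] @ x_t + W['hf'] @ h_prev + b['f'])    # forget gate, Eq. (2)
    o_t = sigmoid(W['xo'] @ x_t + W['ho'] @ h_prev + b['o'])    # output gate, Eq. (3)
    c_hat = np.tanh(W['xc'] @ x_t + W['hc'] @ h_prev + b['c'])  # candidate memory, Eq. (4)
    c_t = f_t * c_prev + i_t * c_hat                            # cell state, Eq. (5)
    h_t = o_t * np.tanh(c_t)                                    # hidden state, Eq. (6)
    return h_t, c_t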
Figure 2: Illustration of an LSTM model (a) and a BiLSTM model (b).

The graphical illustration of the standard LSTM network can be found in Fig. 2(a). The standard LSTM network can only exploit the historical context. However, the lack of future context may lead to an incomplete understanding of the meaning of a text. Therefore, BiLSTM is used to access both the preceding and succeeding contexts by combining a forward hidden layer and a backward hidden layer, as depicted in Fig. 2(b). The forward and backward passes over the unfolded network over time are carried out in a similar way to regular forward and backward passes, except that BiLSTM needs to unfold the forward hidden states and the backward hidden states for all time steps. BiLSTM networks are trained using backpropagation through time (BPTT) [24].
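For readers who want to see how such a bidirectional layer is typically assembled in practice, here is a minimal PyTorch sketch; the sizes are placeholders and not the configuration used later in the paper. The forward and backward hidden states are simply concatenated along the feature dimension.

import torch
import torch.nn as nn

# Hypothetical sizes for illustration only.
feature_dim, hidden_dim, seq_len, batch = 100, 150, 40, 2

bilstm = nn.LSTM(input_size=feature_dim, hidden_size=hidden_dim,
                 batch_first=True, bidirectional=True)

x = torch.randn(batch, seq_len, feature_dim)   # a batch of feature sequences
outputs, (h_n, c_n) = bilstm(x)

# outputs[:, t, :hidden_dim]  -> forward hidden state at step t
# outputs[:, t, hidden_dim:]  -> backward hidden state at step t
print(outputs.shape)  # (batch, seq_len, 2 * hidden_dim)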
In deep learning, LSTM is mainly used to process sequence data, and the breadth of its applications has expanded rapidly in recent years.
The combination of LSTM and other network structures is an important research direction. Kolawole [31] employed the Child-Sum Tree-LSTM for solving the challenging problem of textual entailment. Their approach is simple and able to generalize well without excessive parameter optimization. The literature [32] demonstrated that LSTM networks can predict the subcellular location of proteins with high accuracy given only the protein sequence, outperforming current state-of-the-art algorithms. They further improved the performance by introducing convolutional filters and experimented with an attention mechanism which lets the LSTM focus on specific parts of the protein. Wang [33] proposed a regional CNN-LSTM model consisting of two parts, a regional CNN and an LSTM, to predict the valence-arousal (VA) ratings of texts. The proposed regional CNN uses an individual sentence as a region, dividing an input text into several regions such that the useful affective information in each region can be extracted and weighted according to its contribution to the VA prediction. Such regional information is sequentially integrated with the LSTM so that long-distance dependencies across regions can be considered in the prediction process. Experimental results showed that the proposed method outperforms lexicon-based, regression-based, and NN-based methods. The P-LSTM [34] introduces a phrase factor mechanism which combines the feature vectors of the phrase embedding layer and the LSTM hidden layer to extract more exact information from the text. The experimental results showed that the P-LSTM achieves excellent performance on sentiment classification tasks. Chen [27] proposed a divide-and-conquer approach which first classifies sentences into different types, then performs sentiment analysis separately on sentences from each type. Their approach, BiLSTM-CRF, is used to classify opinionated sentences into three types according to the number of targets that appear in a sentence. Each group of sentences is then fed into a one-dimensional convolutional neural network separately for sentiment classification. The literature [35] proposed a deep learning-based approach for temporal 3D pose recognition problems based on a combination of a CNN and an LSTM recurrent network. The paper presents a two-stage training strategy which first focuses on CNN training and then adjusts the full method (CNN+LSTM). Le [36] introduced a multi-view recurrent neural network (MV-RNN) approach for 3D mesh segmentation. The architecture combines a CNN and a two-layer LSTM to yield coherent segmentation of 3D shapes. The image-based CNN effectively generates the edge probability feature map, while the LSTM correlates these edge maps across different views and outputs a well-defined per-view edge image.
Currently, the attention mechanism has become an effective method to select significant information in order to obtain superior results, and many studies have applied it, for example to neural image caption generation [37]. Luong [38] examined two simple and effective classes of attentional mechanism: a global approach which always attends to all source words and a local one that only looks at a subset of source words at a time. These classes differ in terms of whether the attention is placed on all source positions or on only a few source positions. The idea of the global attentional model is to consider all the hidden states of the encoder when deriving the context vector. The local attention mechanism selectively focuses on a small window of context and is differentiable. Lin [39] proposed a self-attention mechanism which allows extracting different aspects of the sentence into multiple vector representations. Vaswani [40] proposed scaled dot-product attention and multi-head attention, which allows the model to jointly attend to information from different representation subspaces at different positions. Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality. Shen [41] proposed bidirectional block self-attention (Bi-BloSA) for fast and memory-efficient context fusion. The basic idea is to split a sequence into several length-equal blocks (with padding if necessary) and to apply an intra-block self-attention network (SAN) to each block independently. The outputs of all the blocks are then processed by an inter-block SAN. The intra-block SAN captures the local dependency within each block, while the inter-block SAN captures the long-range/global dependency. Hence, every SAN only needs to process a short sequence.
The combination of LSTM and the attention mechanism can obtain better results, and attention has been used especially for sequence problems. Yang [42] proposed a hierarchical attention network for document classification. The model has two distinctive characteristics: (i) it has a hierarchical structure that mirrors the hierarchical structure of documents; (ii) it has two levels of attention mechanisms applied at the word and sentence level, enabling it to attend differentially to more and less important content when constructing the document representation. Experiments on large-scale text classification tasks demonstrated that the proposed architecture outperforms previous methods by a substantial margin. Cui [43] presented a simple but novel model called attention-over-attention reader for better solving the cloze-style reading comprehension task. The proposed model places another attention mechanism over the document-level attention and induces "attended attention" for final answer predictions. Experimental results show that the proposed method significantly outperforms various state-of-the-art systems by a large margin on public datasets. Li [44] developed a novel model employing attention for additional input depending on different contexts. Paulus [45] introduced a neural network model with a novel intra-attention that attends over the input and the continuously generated output separately, and a new training method that combines standard supervised word prediction and reinforcement learning (RL). The model obtains a 41.16 ROUGE-1 score on the CNN/Daily Mail dataset, an improvement over previous state-of-the-art models. Human evaluation also showed that the model produces higher-quality summaries. Huang
[46] introduced a new neural structure called FusionNet, which extends existing attention approaches from three perspectives. First, it puts forward a novel concept of "history of word" to characterize attention information from the lowest word-level embedding up to the highest semantic-level representation. Second, it introduces an improved attention scoring function that better utilizes the "history of word" concept. Third, it proposes a fully-aware multi-level attention mechanism to capture the complete information in one text (such as a question) and exploit it in its counterpart layer by layer. Seo [47] proposed the bi-directional attention flow (BiDAF) network, which represents the context at different levels of granularity and uses a bi-directional attention flow mechanism to obtain a query-aware context representation without early summarization. Experimental evaluations show that the model achieves state-of-the-art results on the Stanford Question Answering Dataset (SQuAD) and the CNN/DailyMail cloze test. Daniluk [48] proposed a neural language model with a key-value attention mechanism that outputs separate representations for the key and value of a differentiable memory, as well as for encoding the next-word distribution. This model outperforms existing memory-augmented neural language models on two corpora. The literature [49] proposed a simple neural architecture for natural language inference. The approach uses attention to decompose the problem into subproblems that can be solved separately, thus making it trivially parallelizable. On the Stanford Natural Language Inference (SNLI) dataset, it obtains state-of-the-art results with almost an order of magnitude fewer parameters than previous work and without relying on any word-order information. Yang [50] presented an attention-based bidirectional LSTM approach to improve target-dependent sentiment classification. The method learns the alignment between the target entities and the most distinguishing features. The experimental results showed that the model achieves state-of-the-art results.
Some studies have used other methods to improve LSTM and have proposed many new variants of LSTM. Wei [51] proposed a transfer learning framework based on a convolutional neural network and a long short-term memory model, called ConvL, to automatically identify whether a post expresses confusion, determine its urgency and classify the polarity of the sentiment. Luo [52] proposed LSTM-based models for classifying relations from clinical notes. They compared the segment LSTM model with the sentence LSTM model, and demonstrated the benefits of exploring the difference between concept text and context text, and between different contextual parts in the sentence. They also evaluated the impact of word embedding on the performance of LSTM models and showed that medical-domain word embeddings help improve the relation classification. Hu [53] established a keyword vocabulary and proposed an LSTM-based model that is sensitive to the words in the vocabulary. Experimental results demonstrated that their model outperforms the baseline LSTM in terms of accuracy and is effective, with significant performance enhancement over several baselines. Another improved model was shown to outperform the state-of-the-art RNN, RvNN, and LSTM networks in two semantic compositionality tasks by increasing the classification accuracies and sentence correlation while significantly decreasing computational complexity. Tang [56] introduced a neural network model to learn vector-based document representations in a unified, bottom-up fashion. The model first learns sentence representations with a convolutional neural network or LSTM. Afterwards, the semantics of sentences and their relations are adaptively encoded into a document representation with a gated recurrent neural network. Wang [57] regarded the microblog conversation as a sequence and leveraged BiLSTM models to incorporate preceding tweets for context-aware sentiment classification. Their proposed method could not only alleviate the sparsity problem in the feature space, but also capture the long-distance sentiment dependency in the microblog conversations. Extensive experiments on a benchmark dataset showed that the BiLSTM models with context information could outperform other strong baseline algorithms.
It can be seen that much research has been done on the basic structure of LSTM to enhance its performance, and LSTM has also achieved outstanding results in text classification. These methods form the basis of AC-BiLSTM. However, LSTM lacks the ability to extract the local contextual information. Furthermore, not all parts of the document are equally relevant, but LSTM cannot recognize the different relevance of each part of the document. These problems affect the classification performance.

3. The proposed AC-BiLSTM

In AC-BiLSTM, the convolutional layer first extracts feature sequences from the word embedding vectors of the text. BiLSTM then obtains the preceding and succeeding contextual features, which are fed into two attention mechanism (AM) layers respectively. The features processed by the AM layers are concatenated together and then are fed into the softmax classifier. The architecture of the AC-BiLSTM model is shown in Fig. 3.
The entire learning algorithm of AC-BiLSTM is summarized as Algorithm 1, where Fc denotes the future context representation and Hc denotes the historical context representation.

Algorithm 1: The learning algorithm of AC-BiLSTM
1: Map each word of the input text into a word embedding vector (Section 3.1);
2: Employ the convolutional layer to extract the feature sequences Lc = [Lc_1, Lc_2, ..., Lc_100] from the word embedding vectors, using Eq. 8;
3: Employ BiLSTM to obtain the preceding contextual features h_f and the succeeding contextual features h_b from the feature sequences, using Eq. 9 and Eq. 10;
4: Employ two attention layers to obtain the future context representation Fc and the historical context representation Hc from the preceding and succeeding contextual features, using Eq. 13 and Eq. 14;
5: Combine the future and historical context representations to obtain the comprehensive context representation S = [Fc, Hc];
6: Feed the comprehensive context representation into the softmax classifier to get the class labels;
7: Update the parameters of the model using the loss function Eq. 15 with the Adam method.
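As one possible concrete reading of Algorithm 1, the PyTorch sketch below wires the described components together (embedding, one-dimensional convolution, BiLSTM, two word-level attention branches, dropout and a softmax classifier). The WordAttention module and all hyperparameter values are simplifying assumptions, not the authors' exact implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class WordAttention(nn.Module):
    """Word-level attention over one direction of the BiLSTM (Eqs. 11-13)."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, hidden_dim)          # w, b in Eq. 11
        self.context = nn.Parameter(torch.randn(hidden_dim))   # context vector in Eq. 12

    def forward(self, h):                       # h: (batch, seq_len, hidden_dim)
        u = torch.tanh(self.proj(h))            # Eq. 11
        a = F.softmax(u @ self.context, dim=1)  # Eq. 12, shape (batch, seq_len)
        return (a.unsqueeze(-1) * h).sum(dim=1) # Eq. 13/14, weighted sum

class ACBiLSTM(nn.Module):
    """Simplified AC-BiLSTM sketch: conv -> BiLSTM -> two attention branches -> softmax."""
    def __init__(self, vocab_size, num_classes, embed_dim=300,
                 num_filters=100, window=3, hidden_dim=150, dropout=0.7):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Conv1d(embed_dim, num_filters, kernel_size=window, padding=1)
        self.bilstm = nn.LSTM(num_filters, hidden_dim, batch_first=True, bidirectional=True)
        self.att_fwd = WordAttention(hidden_dim)
        self.att_bwd = WordAttention(hidden_dim)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, tokens):                              # tokens: (batch, seq_len)
        x = self.embed(tokens).transpose(1, 2)              # (batch, embed_dim, seq_len)
        feats = torch.relu(self.conv(x)).transpose(1, 2)    # (batch, seq_len, num_filters)
        out, _ = self.bilstm(feats)                         # (batch, seq_len, 2*hidden_dim)
        half = out.size(-1) // 2
        h_f, h_b = out[..., :half], out[..., half:]         # forward / backward states
        s = torch.cat([self.att_fwd(h_f), self.att_bwd(h_b)], dim=-1)  # S = [Fc, Hc]
        return self.fc(self.dropout(s))                     # logits for the softmax classifier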
3.1. Word embedding

Traditional word representations, such as one-hot vectors, face two main problems: the loss of word order and excessively high dimensionality. Compared to one-hot representations, distributed word embeddings are more suitable and more powerful. This work focuses on text-level classification. Assume that a text has M words, where wr_m (m ∈ [1, M]) denotes the m-th word of the text; each word is mapped to its word embedding vector.
Off-the-shelf word embedding matrices that are available online can be easily employed. In this paper, our approach uses the word2vec method proposed by Mikolov [58] for word embedding. The skip-gram model of word2vec is used for this task. The model is trained by maximizing the average log probability of all the words. The skip-gram model trains semantic embeddings by predicting the target word in accordance with its context, and it can also capture semantic relations between words. In this paper, the dimensionality of each word vector is 300.
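For illustration, one common way to produce the M x 300 input matrix described above is shown below using gensim and pre-trained Google News word2vec vectors; the file name and the random fallback for out-of-vocabulary words are assumptions, not details taken from the paper.

import numpy as np
from gensim.models import KeyedVectors

# Hypothetical path to the pre-trained Google News word2vec binary.
w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

def embed_text(tokens, dim=300):
    """Map a list of M tokens to an M x 300 embedding matrix."""
    rows = []
    for tok in tokens:
        if tok in w2v:
            rows.append(w2v[tok])
        else:
            # Unknown words get a small random vector (one possible convention).
            rows.append(np.random.uniform(-0.25, 0.25, dim).astype(np.float32))
    return np.stack(rows)   # shape: (M, 300)

matrix = embed_text("the movie was surprisingly good".split())
print(matrix.shape)   # (5, 300)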
3.2. One-dimension convolutional layer

In AC-BiLSTM, a single convolutional layer is used to capture the sequence information and reduce the dimensions of the input data. The convolution operation in this layer is conducted in one dimension. In the convolutional layer, 100 filters with a window size of 3 move over the textual representation to extract features. As the filters move on, many feature sequences, which capture syntactic and semantic features, are generated. The convolutional layer is illustrated in Fig. 4. Blocks of the same pattern in the feature sequences layer and the filter windows layer correspond to features for the same window. The dashed lines connect the feature of a window with the source feature sequences. For the convolutional layer, the dimension of the input data is 300 x M and the dimension of the output data is 100 x M, where M is the number of words in the text. Hence, the convolutional layer is an effective way to reduce dimensionality.
The word embedding vector x_{i:i+m-1}, which represents a window of m words starting from the i-th word, is used to obtain the features for the window of words in the corresponding feature sequences. Multiple filters with differently initialized weights are used to improve the learning capability of the model. The n-th feature Lc_n of the window is generated as

Lc_n = g(W_n · x_{i:i+m-1} + b)   (7)

where b is a bias vector, W_n is the weight matrix of the n-th filter, and g(.) represents the nonlinear activation function of the convolution operation, the rectified linear unit (ReLU). In AC-BiLSTM, ReLU is used as the nonlinear activation function because it can improve the learning dynamics of the network and significantly reduce the number of iterations required for convergence in deep networks. Because the number of filters is 100, the feature sequences of the words x_{i:i+m-1} are

Lc = [Lc_1, Lc_2, ..., Lc_100].   (8)
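A minimal NumPy sketch of this operation is given below: 100 filters of width 3 slide over a 300-dimensional embedding sequence (padded so that the output length stays M) and produce the 100 x M feature map of Eq. (7); the random filters stand in for learned weights.

import numpy as np

def conv1d_features(embeddings, filters, bias):
    """embeddings: (M, 300); filters: (100, window, 300); bias: (100,).
    Returns an (M, 100) feature map, i.e. 100 feature sequences of length M,
    following Lc_n = ReLU(W_n . x_{i:i+m-1} + b) from Eq. (7)."""
    M, dim = embeddings.shape
    n_filters, window, _ = filters.shape
    pad = window // 2
    padded = np.pad(embeddings, ((pad, pad), (0, 0)))   # keep the output length M
    out = np.zeros((M, n_filters))
    for i in range(M):
        patch = padded[i:i + window]                    # window of m word vectors
        out[i] = np.maximum(0, np.tensordot(filters, patch, axes=([1, 2], [0, 1])) + bias)
    return out

M = 12
emb = np.random.randn(M, 300)
feats = conv1d_features(emb, np.random.randn(100, 3, 300) * 0.01, np.zeros(100))
print(feats.shape)   # (12, 100) -> the 100 x M feature sequences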
3.3. BiLSTM and attention mechanism layers

In AC-BiLSTM, BiLSTM extracts the contextual information from the feature sequences, and the attention mechanism assigns different weights to words to enhance the understanding of the sentiment of the entire text. Hence, BiLSTM and the attention mechanism can improve classification efficiency.

BiLSTM obtains the annotations of words by summarizing information from both directions (forward and backward), and hence the annotations incorporate the contextual information. BiLSTM contains the forward LSTM (denoted LSTM_f) which reads the feature sequences from Lc_1 to Lc_100 and the backward LSTM (denoted LSTM_b) which reads from Lc_100 to Lc_1. Formally, the outputs of BiLSTM are stated as follows:

h_f = LSTM_f(Lc_n), n ∈ [1, 100]   (9)
h_b = LSTM_b(Lc_n), n ∈ [100, 1]   (10)

An annotation for a given feature Lc_n is obtained from the forward hidden state h_f and the backward hidden state h_b. These states summarize the information of the entire text centered around Lc_n and implement the word encoding.
The attention mechanism layer can be implemented with a fully-connected layer and a softmax function. The working process of the attention mechanism in AC-BiLSTM is detailed as follows.

The word annotation h_f is first fed into a one-layer perceptron to obtain u_f as a hidden representation of h_f:

u_f = tanh(w h_f + b)   (11)

where w and b are the weight and bias of the neuron, and tanh(.) is the hyperbolic tangent function. The model uses the similarity between u_f and a word-level context vector v_f to measure the importance of each word, and then uses the softmax function to obtain the normalized weight a_f of each word:

a_f = exp(u_f ∗ v_f) / Σ_{i=1}^{M} exp(u_f ∗ v_f)   (12)

where M is the number of words in the text and exp(.) is the exponential function. The word-level context vector v_f can be seen as a high-level representation of the informative words; it is randomly initialized and jointly learned during the training process.

After that, a weighted sum of the forward word annotations based on the weights a_f is computed as the forward context representation Fc, which is part of the output of the attention layer:

Fc = Σ (a_f ∗ h_f)   (13)

Similar to a_f, the backward weights a_b can be calculated from the backward hidden state h_b. Likewise, the backward context representation Hc is computed as

Hc = Σ (a_b ∗ h_b)   (14)
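To make Eqs. (11)-(14) concrete, the following NumPy sketch computes the attention weights and the forward context representation Fc for one direction; all dimensions and parameters are illustrative.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

M, hidden = 6, 150                      # 6 words, hidden size 150 (illustrative)
h_f = np.random.randn(M, hidden)        # forward annotations from BiLSTM
w = np.random.randn(hidden, hidden)     # perceptron weight (Eq. 11)
b = np.zeros(hidden)
v_f = np.random.randn(hidden)           # word-level context vector (Eq. 12)

u_f = np.tanh(h_f @ w + b)              # Eq. 11: hidden representation of each word
a_f = softmax(u_f @ v_f)                # Eq. 12: normalized importance weights, shape (M,)
Fc = (a_f[:, None] * h_f).sum(axis=0)   # Eq. 13: forward context representation

print(a_f.round(3), Fc.shape)           # weights sum to 1, Fc has shape (150,)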
The comprehensive context representation of the text is obtained by concatenating the forward context representation Fc and the backward context representation Hc. Finally, the comprehensive context representation S = [Fc, Hc] is passed through a dropout layer and fed into the softmax classifier to achieve classification. The purpose of the dropout layer is to avoid overfitting.

Currently, cross entropy is a commonly used loss function to evaluate the classification performance of models. It is often better than the classification error rate or the mean square error. In our approach, the Adam optimizer [59] is chosen to optimize the loss function of the network. The model parameters are fine-tuned by the Adam optimizer, which has been shown to be an effective and efficient backpropagation algorithm. Cross entropy as the loss function can reduce the risk of gradient disappearance during stochastic gradient descent. The loss function is denoted as follows in Eq. 15:

L_total = -(1/num) Σ_{Sp} [y ln o + (1 - y) ln(1 - o)]   (15)

where num is the number of training samples, Sp represents a training sample, y is the label of the sample, and o is the output of AC-BiLSTM.
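A hedged PyTorch sketch of one training step with this loss is shown below; it uses binary cross-entropy and the Adam optimizer with the learning rate of 0.001 reported in Section 4, while the model and the mini-batch are placeholders.

import torch
import torch.nn as nn

# Placeholders: a stand-in model with one logit output and one mini-batch of size 50.
model = nn.Sequential(nn.Linear(300, 1))
tokens = torch.randn(50, 300)
labels = torch.randint(0, 2, (50, 1)).float()

criterion = nn.BCEWithLogitsLoss()                 # cross-entropy of Eq. (15) for binary labels
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

optimizer.zero_grad()
logits = model(tokens)
loss = criterion(logits, labels)                   # L_total averaged over the num samples
loss.backward()                                    # backpropagate the loss
optimizer.step()                                   # update parameters with Adam
print(float(loss))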
The main contributions and originality of AC-BiLSTM are as follows:

1) The convolutional layer extracts the low-level semantic features from the raw text and is used for dimensionality reduction. For text classification, the vector representation of the entire document is generally a high-dimensional vector. The parameters of BiLSTM will increase significantly when BiLSTM is used to capture the semantics of the entire document, and too many network parameters increase the difficulty of network optimization. Directly reducing the dimensionality of the text vector would lose a lot of information, whereas the convolution operation can extract features while the convolutional layer also reduces the dimension of the input data. Therefore, the one-dimension convolutional layer can extract the feature information of the text vector while reducing its dimensionality;

2) BiLSTM extracts the contextual information from the low-level semantic features. In a unidirectional LSTM, the extracted representation tends to be dominated by the later words and to neglect the earlier words [60]. Therefore, the information extracted by LSTM cannot effectively represent the actual semantics of the text. Since BiLSTM can access both the preceding and succeeding contextual features, the features extracted by BiLSTM can more realistically represent the actual semantics of the text. Compared with extracting the contextual information directly from the text, extracting the contextual information from the low-level semantic features can improve the efficiency of the information extraction of BiLSTM;

3) The forward hidden layer and the backward hidden layer in BiLSTM use their respective attention mechanism layers. Since BiLSTM can access both the preceding and succeeding contexts, the information obtained by BiLSTM can be considered as two different representations of the text. The same information may have different representations in the information obtained by BiLSTM. Therefore, using an attention mechanism for each representation of the text can better focus on the respective important information and avoid mutual interference of the important information in the different representations. Moreover, the attention mechanism layers in AC-BiLSTM make the understanding of text semantics more accurate.

Hence, our approach effectively improves the classification accuracy.
4. Experiments

Experiments are conducted to evaluate the performance of the proposed approach for text classification on various benchmark datasets. In this section, the experimental setup and baseline methods are described, followed by a discussion of the results.

4.1. Experimental settings

a) Datasets

Our model is evaluated on text classification tasks (including sentiment and question classification) using the following datasets. Summary statistics of these datasets are listed in Table 1.

MR: The movie review dataset consists of 5331 positive snippets and 5331 negative snippets extracted from Rotten Tomatoes web site pages, where reviews marked with "fresh" are labeled as positive and reviews marked with "rotten" are labeled as negative.

IMDB: The large movie review dataset [61] contains 50000 full-length movie reviews, half of which are used for testing. Classification involves detecting positive/negative reviews.
RT-2k: The standard 2000 full-length movie review dataset [7]. Classifica-
tion involves detecting positive/negative reviews.
SST-1: Stanford Sentiment Treebank, an extension of MR but with train/dev/test splits provided and fine-grained labels (very positive, positive, neutral, negative, very negative), re-labeled by Socher [62].

SST-2: Binary labeled version of the Stanford Sentiment Treebank, in which neutral reviews are removed, very positive and positive reviews are labeled as positive, and negative and very negative reviews are labeled as negative [62].

Subj: The subjectivity dataset consists of subjective reviews and objective plot summaries [8]. The task is to classify a text as being subjective or objective.

TREC: TREC question dataset, in which the task involves classifying a question into 6 question types [52]. TREC divides all questions into 6 categories, including location, human, entity, abbreviation, description and numeric. The training dataset contains 5452 labelled questions while the testing dataset contains 500 questions.

In Table 1, c is the number of target classes, l is the average sentence length, N is the dataset size and |V| is the vocabulary size. "Test" is the test set size and "CV" means that there is no standard train/test split and thus 10-fold cross-validation is used.
b) Parameter settings

Our experiments use accuracy as the evaluation metric to measure the overall classification performance. The word2vec vectors trained on Google News are used as pre-trained word embeddings. The size of these embeddings is 300. The memory dimension of BiLSTM is set to 150 and the number of filters of length 3 is set to 100 in the convolutional layer. The training batch size for all datasets is set to 50. The dropout rate is 0.7. A back-propagation algorithm with the Adam stochastic optimization method is used to train the network through time with a learning rate of 0.001. After each training epoch, the network is tested on the validation data, and the log-likelihood of the validation data is computed for convergence detection.
Table 1: Summary statistics of the datasets.

Data    c   l    N      |V|     Test
MR      2   20   10662  18765   CV
IMDB    2   231  50000  392000  25000
SST-1   5   18   11855  17836   2210
SST-2   2   19   9613   16185   1821
Subj    2   23   10000  21323   CV
RT-2k   2   787  2000   51000   CV
TREC    6   10   5952   9592    CV
4.2. Baseline methods

This paper benchmarks the following baseline methods for text classification; they are effective methods that have achieved good results in text classification:

SVM: Support Vector Machine [64].

MNB: Multinomial naive Bayes with uni-bigrams [65].

NBSVM: SVM variant using naive Bayes log-count ratios as feature values, proposed by Wang [65].

CNN-static: 1d-CNN with pre-trained word embedding vectors from word2vec, proposed by Kim [18].

CNN-non-static: 1d-CNN with pre-trained word embeddings and a fine-tuning optimizing strategy, proposed by Kim [18].

CNN-multichannel: 1d-CNN with two sets of pre-trained word embeddings, proposed by Kim [18].
4.3. Results

In this section, our evaluation results on the sentiment classification and question type classification tasks are shown. Moreover, some analyses of the approach are given.
a) Sentiment classification

The comparison results for long reviews (RT-2k and IMDB) and short reviews (MR, SST-1, SST-2 and Subj) are presented in Table 2. The experimental results are evaluated by the classification accuracy, and the best results are shown in boldface. The top 3 approaches are conventional machine learning approaches with hand-crafted features. The other 16 approaches, including our approach, are deep neural network (DNN) approaches, which can automatically extract features from the input data for classifier training without feature engineering. The results of the baseline approaches are taken from [18, 27, 34, 72].
From Table 2, AC-BiLSTM achieves better results than the other methods on the majority of the benchmark datasets. Among the 19 approaches mentioned above, our approach outperforms the other baselines on all datasets except SST-1. The results of AC-BiLSTM are 83.2%, 91.8%, 88.3%, 93.0% and 94.0% for the MR, IMDB, SST-2, RT-2k and Subj datasets. AC-BiLSTM gives relative improvements of 2.09%, 0.33%, 0.23%, 3.91% and 0.21% compared to CNN-non-static on the MR dataset, P-LSTM on the IMDB dataset, CNN-multichannel on the SST-2 dataset, NBSVM on the RT-2k dataset and P-LSTM on the Subj dataset, respectively.
It is observed that, compared with the four CNN-based methods (DCNN, CNN-static, CNN-non-static and CNN-multichannel), AC-BiLSTM gives better results. AC-BiLSTM also outperforms the hand-crafted feature based methods (SVM, MNB and NBSVM) and the other methods (RAE, MV-RNN, RNTN, Paragraph-Vec and DRNN) on all datasets. For the SST-1 dataset, where the data is divided into 5 classes, Tree-LSTM is the only method to reach above 50%, but our approach does not differ significantly from the result of Tree-LSTM. This demonstrates that the results of AC-BiLSTM, as an end-to-end model, are still promising and comparable with those of models that heavily rely on linguistic annotations and knowledge. This indicates that AC-BiLSTM is more feasible for various scenarios. At the same time, it can be seen that the performance of the DNN-based methods is better than that of the conventional machine learning approaches.
b) Question type classification

The results on the TREC question classification dataset are presented in Table 3. For this task, two additional baselines are included: SVM [74], which relies on a large number of hand-crafted features and rules, and Ada-CNN [73], a self-adaptive hierarchical sentence model with gating networks. The other baseline models have been introduced in Section 4.2.

From Table 3, it can be seen that our approach achieves better results than the other baselines on the TREC dataset, and the result of AC-BiLSTM is 97.0%.
Our approach gives relative improvements of 1.57% and 3.63% compared to BiLSTM and CNN-non-static on the TREC dataset, respectively. Compared with the CNN-based methods and the LSTM-based methods, AC-BiLSTM gives superior performance on the question type classification dataset. For the TREC dataset, the results of the LSTM-based methods are better than those of the CNN-based methods, which shows that the LSTM-based methods are more suitable than the CNN-based methods for this task. As shown above, AC-BiLSTM captures the semantic information of the text effectively. For both sentiment classification and question type classification, our results consistently outperform most of the published baseline models. In view of the above discussion, it can be concluded that the overall performance of AC-BiLSTM is better than that of the state-of-the-art methods in most cases.
c) Approach analysis

AC-BiLSTM combines the convolutional layer, BiLSTM and the attention mechanism layers, so it is necessary to verify that all components are useful for the final results. In this section, a set of experiments is conducted to investigate the effect of each component on the performance of AC-BiLSTM. AC-BiLSTM without the convolutional layer, AC-BiLSTM with BiLSTM replaced by LSTM, AC-BiLSTM without the attention mechanism layers and the full AC-BiLSTM are compared. The relative improvement ratio ∆ and the classification accuracy are used as the evaluation metrics. The relative improvement ratio ∆ is calculated as follows:

∆ = (ACC_AC-BiLSTM - ACC_var) / ACC_var × 100%

where ACC_AC-BiLSTM is the classification accuracy of our approach and ACC_var is the classification accuracy of each AC-BiLSTM variant. A-BiLSTM denotes AC-BiLSTM without the convolutional layer, AC-LSTM denotes AC-BiLSTM with BiLSTM replaced by LSTM, and C-BiLSTM denotes AC-BiLSTM without the attention mechanism layers. The results are presented in Table 4.
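A tiny sketch of this ratio, checked against the A1C-BiLSTM numbers that appear later in Table 5:

def relative_improvement(acc_ac_bilstm, acc_variant):
    """Relative improvement ratio Delta (in %) of AC-BiLSTM over a variant."""
    return (acc_ac_bilstm - acc_variant) / acc_variant * 100.0

# Reproduces the 0.85 reported for A1C-BiLSTM on MR in Table 5 (83.2 vs. 82.5).
print(round(relative_improvement(83.2, 82.5), 2))   # 0.85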
From Table 4, it can be seen that the attention mechanism layers and BiLSTM have a powerful influence on the performance of AC-BiLSTM. Among all the architectures mentioned above, AC-BiLSTM obtains the best results. Compared with AC-LSTM and C-BiLSTM, AC-BiLSTM brings relative improvements of 0.21% to 2.09%. It is observed that the performance of AC-BiLSTM decreases considerably when the attention mechanism layers or BiLSTM are removed. In AC-BiLSTM, the attention mechanism layers can identify the effect of each word on the text and BiLSTM can obtain both the preceding and succeeding contextual information.

The input of the attention mechanism layers also has an important influence on the classification results. In AC-BiLSTM, the preceding and succeeding contexts are fed to different attention mechanism layers. In this section, a set of experiments is conducted to investigate the effect of the different inputs of the attention mechanism layers on the performance of AC-BiLSTM. AC-BiLSTM with a single attention mechanism layer and the full AC-BiLSTM are compared. The relative improvement ratio ∆ and the classification accuracy are used as the evaluation metrics. The results are presented in Table 5, where A1C-BiLSTM denotes AC-BiLSTM with the single attention mechanism layer.
From Table 5, it can be seen that using two attention mechanism layers to process the forward and backward information separately is better than using a single attention mechanism layer to process the concatenation of the forward and backward information. AC-BiLSTM gives relative improvements of 1.09% and 0.94% compared to A1C-BiLSTM on the RT-2k dataset and the TREC dataset, respectively. Except for the IMDB dataset, AC-BiLSTM achieves better results on all the other datasets. This proves that using two attention mechanism layers to process the forward and backward information separately is more effective.

Since the features extracted by the convolutional layer are affected by the window size of the convolution filters applied in the one-dimension convolutional layer, a set of experiments is conducted to investigate the effect of the window size. The window size m is varied while all other parameters are kept unchanged. The results are presented in Fig. 5.

Figure 5: Accuracy with different window sizes.

From Fig. 5, it can be seen that the window size m = 3 has excellent or comparable performance compared to the other window sizes. The results show that when the window size is 3, the classification accuracy is better.
The stride size of the convolutional sliding windows also affects the features extracted by the convolutional layer. In this section, a set of experiments is conducted to investigate the effect of the stride size of the convolutional sliding windows. The stride size s is set to 1, 2, 3 and 4. Except for the stride size, all other parameters are kept unchanged. The results are presented in Fig. 6.

Figure 6: Accuracy with different stride sizes.

From Fig. 6, it can be seen that the classification accuracy of our method decreases for all datasets when the stride size increases. For the long sentence datasets (IMDB and RT-2k), the performance of our method is less affected by the stride size, but for the short sentence datasets (MR and SST-2) the performance declines dramatically. The reason is that increasing the stride size means that the convolutional layer will lose more semantic information. In short sentences, each word carries relatively more semantic information, while in long sentences each word carries relatively less. Therefore, when the stride size increases, short sentences lose more semantic information than long sentences. In short,
reducing the stride size can achieve better results.
The commonly used methods to generate the word embedding vectors are the pre-training method and the random method. Pre-trained word embedding vector means that the model uses the pre-trained vectors from word2vec. Random word embedding vector means that all words are randomly initialized and then modified during training. Generally, the different methods to generate the word embedding vectors have different effects on the classification performance. In this section, a set of experiments is conducted to investigate the effect of the different methods to generate the word embedding vectors on the performance of AC-BiLSTM. Recall, precision and F1-score are used to measure binary classification performance. Recall is the fraction of relevant instances that have been retrieved over the total number of relevant instances. Precision is the fraction of relevant instances among the retrieved instances. For a classifier dedicated to binary classification, the F1-score combines precision and recall as

F1-score = (2 × Precision × Recall) / (Precision + Recall)   (17)

The higher the three indicators are, the better the binary classification performance and robustness should be. Table 6 shows the performance comparison results on the binary classification datasets.
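For reference, the three indicators can be computed from confusion-matrix counts as in the sketch below; the counts are made-up values for illustration.

def binary_metrics(tp, fp, fn):
    """Precision, recall and F1-score (Eq. 17) from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Illustrative counts: 900 true positives, 80 false positives, 120 false negatives.
p, r, f1 = binary_metrics(tp=900, fp=80, fn=120)
print(round(p, 3), round(r, 3), round(f1, 3))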
From Table 6, AC-BiLSTM using the pre-trained word embedding vectors achieves better results than AC-BiLSTM using random word embedding vectors on all datasets. For the F1-score, AC-BiLSTM using pre-trained word embedding vectors obtains higher results than AC-BiLSTM using random word embedding vectors. For the MR and Subj datasets, the pre-trained word embedding vectors are superior to the random word embedding vectors in all indicators. For the RT-2k and SST-2 datasets, the pre-trained word embedding vectors are superior in precision and F1-score while the random word embedding vectors perform better in recall. For IMDB, the random word embedding vectors only perform better in recall, while the other indicators of the pre-trained word embedding vectors are higher than those of the random word embedding vectors. The results
suggest that the pre-trained word vectors are good universal feature extractors
and can be utilized across datasets.
4.4. Discussions

Figure 7: Statistics and comparison of word embedding vector variants.
Because text has local spatial or temporal structures, the convolutional layer performs excellently in extracting n-gram features at different positions of the text through the convolutional filters applied to the word vectors. In addition, the convolutional layer reduces the parameters of the network. Compared to LSTM, BiLSTM can access both the preceding and succeeding contextual information; hence, BiLSTM can more effectively learn the context of each word in the text. The attention mechanism is mainly used to identify the influence of each word on the sentence. It assigns attention weights to each word and can capture the important components of the sentence semantics. The combination of these methods makes the understanding of sentence semantics more accurate and improves the classification ability of AC-BiLSTM. The experiments show that BiLSTM and the attention mechanism have greater effects than the convolutional layer on the classification accuracy. For the convolutional layer, the convolution window size and the stride size also affect the performance of AC-BiLSTM. Experiments also show that when the window size is 3 and the stride size is 1, AC-BiLSTM can achieve the best
results.
For text classification, the method used to generate the word embedding vectors can affect the classification accuracy. Compared to pre-trained word embedding vectors, random word embedding vectors require training more parameters, which causes relatively lower classification accuracy within a limited number of iterations. The experiments show that pre-trained word embedding vectors achieve better results than random word embedding vectors. Hence, pre-trained word embedding vectors are more suitable for AC-BiLSTM.

All experimental results indicate that the combination of the convolutional layer, BiLSTM and the attention mechanism remarkably improves text classification accuracy. For most of the benchmark datasets, AC-BiLSTM obtains better results than the other baseline models. This shows that AC-BiLSTM has better classification ability and that our proposed AC-BiLSTM performs better than some state-of-the-art DNNs.
Table 2: Experimental results of sentiment classification accuracy. % is omitted and "-" indicates that the dataset is not used by the method.

Model           MR    IMDB  SST-1  SST-2  RT-2k  Subj
SVM             -     89.2  40.7   79.4   87.4   91.7
MNB             79.0  86.6  -      -      85.9   93.6
NBSVM           79.4  91.2  -      -      89.5   93.2
RAE             77.7  -     43.2   82.4   -      -
MV-RNN          79.0  -     44.4   82.9   -      -
RNTN            -     -     45.7   85.4   -      -
Paragraph-Vec   -     -     48.7   87.8   -      -
DCNN            -     -     48.5   86.8   -      -
Table 3: Experimental results of question type classification accuracy on the TREC dataset. % is omitted.

Model              ACC   Reported in
SVM                95.0  Silva [74]
Paragraph-Vec      91.8  Zhao [73]
Ada-CNN            92.4  Zhao [73]
CNN-non-static     93.6  Kim [18]
CNN-multichannel   92.2  Kim [18]
DCNN               93.0  Kalchbrenner [69]
C-LSTM             94.6  Zhou [72]
LSTM               95.3  Our implementation
BiLSTM             95.5  Our implementation
AC-BiLSTM          97.0  Our implementation
Table 5: Experimental results of A1C-BiLSTM and AC-BiLSTM. ACC is the classification accuracy (%) and ∆ is the relative improvement (%) of AC-BiLSTM over A1C-BiLSTM.

Dataset   A1C-BiLSTM ACC   ∆      AC-BiLSTM ACC   ∆
MR        82.5             0.85   83.2            -
IMDB      91.8             0      91.8            -
SST-1     48.7             0.41   48.9            -
SST-2     88.2             0.11   88.3            -
RT-2k     92.0             1.09   93.0            -
Subj      93.4             0.64   94.0            -
TREC      96.1             0.94   97.0            -
5. Conclusions

For text classification, feature extraction and the design of the classifier are very important. LSTM has shown good performance on many real-world and benchmark text classification problems. However, it is still difficult for LSTM to capture the local features of the text and to focus on its important parts. To solve these problems, this paper presents an improved LSTM method, namely AC-BiLSTM, in which the convolutional layer, BiLSTM and the attention mechanism are used to enhance semantic understanding and improve the classification accuracy. Experiments are conducted on seven benchmark datasets to evaluate the performance of our presented approach. The experimental results indicate that AC-BiLSTM can understand semantics more accurately and enhance the performance of LSTM in terms of the quality of the final results.

Compared with some state-of-the-art baseline methods, the new method is demonstrated to be more effective and efficient in terms of classification quality in most cases.

Future work focuses on the research of attention mechanisms and the design of network architectures. In addition, the new methods will also be applied to the field of machine reading comprehension. Future work mainly includes the following parts: 1) using other attention mechanisms to further improve our approach; 2) investigating the effect of the attention mechanism on the performance of our approach; 3) designing new attention mechanisms and network architectures; 4) applying our approach to practical applications; 5) applying the designed attention mechanism and network architecture to machine reading comprehension.
Acknowledgement

The work described in this paper was supported by the National Natural Science Foundation of China under Grant No. 61300127. Any conclusions or recommendations stated here are those of the authors and do not necessarily reflect official positions of NSFC.
References

alized snippets for web page recommender systems, Transactions of the Japanese Society for Artificial Intelligence 31 (5).

malization and semantic indexing to enhance instant messaging and sms spam filtering, Knowledge-Based Systems 108 (2016) 25-32.

[3] B. Liu, Sentiment analysis: Mining opinions, sentiments, and emotions, Cambridge University Press, Cambridge, England, 2015.

[4] L. H. Lee, D. Isa, W. O. Choo, W. Y. Chue, High relevance keyword extraction facility for bayesian text classification on different domains of varying

[5] J. Lei, T. Jin, Hierarchical text classification based on bp neural network,

[8] B. Pang, L. Lee, A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts, in: Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL),
[9] B. Liu, Sentiment analysis and opinion mining, Synthesis Lectures on Human Language Technologies, Morgan and Claypool Publishers, San Rafael, CA, United States, 2012.

[10] J. Schmidhuber, Deep learning in neural networks: an overview, Neural Networks 61 (2015) 85-117.

[11] V. Campos, B. Jou, X. Giro-i Nieto, From pixels to sentiment: Fine-tuning cnns for visual sentiment prediction, Image and Vision Computing 65 (2017) 15-22.

[12] L. Brocki, K. Marasek, Deep belief neural networks and bidirectional long-short term memory hybrid for speech recognition, Archives of Acoustics 40 (2) (2015) 191-195.

[14] B. Pang, L. Lee, Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales, in: Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL), Ann Arbor, MI, United States, 2005, pp. 115-124.
2012, pp. 1097-1105.

[17] K.-i. Funahashi, Y. Nakamura, Approximation of dynamical systems by continuous time recurrent neural networks, Neural Networks 6 (6) (1993) 801-806.

[18] Y. Kim, Convolutional neural networks for sentence classification, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics (ACL), Doha, Qatar, 2014, pp. 1746-1751.

[19] S. Liao, J. Wang, R. Yu, K. Sato, Z. Cheng, Cnn for situations understanding based on sentiment analysis of twitter data, in: Proceedings of the 8th

[20] W. Cao, A. Song, J. Hu, Stacked residual recurrent neural network with word weight for text classification, IAENG International Journal of Computer Science 44 (3) (2017) 277-284.
18 (5-6) (2005) 602-610.

[25] W. Liu, P. Liu, Y. Yang, Y. Gao, J. Yi, An attention-based syntax-tree and tree-lstm model for sentence summarization, International Journal of Performability Engineering 13 (5) (2017) 775-782.

[26] J. Nowak, A. Taspinar, R. Scherer, Lstm recurrent neural networks for short text and sentiment classification, in: Proceedings of the 16th International Conference on Artificial Intelligence and Soft Computing, Springer Verlag, Zakopane, Poland, 2017, pp. 553-562.

[27] T. Chen, R. Xu, Y. He, X. Wang, Improving sentiment analysis via sentence type classification using bilstm-crf and cnn, Expert Systems with Applications

[30] Z. Zhang, Y. Zou, C. Gan, Textual sentiment analysis via three different attention convolutional neural networks and cross-modality consistent regression, Neurocomputing 275 (2018) 1407-1415.
the 2nd International Conference on Algorithms for Computational Biology, Springer Verlag, Mexico City, Mexico, 2015, pp. 68-80.

[33] J. Wang, L.-C. Yu, K. R. Lai, X. Zhang, Dimensional sentiment analysis using a regional cnn-lstm model, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016 - Short Papers.

[36] T. Le, G. Bui, Y. Duan, A multi-view recurrent neural network for 3d mesh segmentation, Computers and Graphics (Pergamon) 66 (2017) 103-112.

[37] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, Y. Bengio, Show, attend and tell: Neural image caption generation with visual attention, in: Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, International Machine Learning Society (IMLS), Lille, France, 2015, pp. 2048-2057.
1421.

[39] Z. Lin, M. Feng, C. N. dos Santos, M. Yu, B. Xiang, B. Zhou, Y. Bengio, A structured self-attentive sentence embedding, CoRR abs/1703.03130. arXiv:1703.03130. URL http://arxiv.org/abs/1703.03130

[40] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in: Proceedings of the 31st Annual Conference on Neural Information Processing Systems, NIPS 2017, Neural Information Processing Systems Foundation, Long Beach, CA, United States, 2017, pp. 5999-6009.

Association for Computational Linguistics (ACL), San Diego, CA, United States, 2016, pp. 1480-1489.
pp. 927-935.

[45] R. Paulus, C. Xiong, R. Socher, A deep reinforced model for abstractive summarization, CoRR abs/1705.04304. arXiv:1705.04304. URL http://arxiv.org/abs/1705.04304

[46] H. Huang, C. Zhu, Y. Shen, W. Chen, Fusionnet: Fusing via fully-aware attention with application to machine comprehension, CoRR abs/1711.07341. arXiv:1711.07341. URL http://arxiv.org/abs/1711.07341

[47] M. J. Seo, A. Kembhavi, A. Farhadi, H. Hajishirzi, Bidirectional attention flow for machine comprehension, CoRR abs/1611.01603. arXiv:1611.01603. URL http://arxiv.org/abs/1611.01603

URL http://arxiv.org/abs/1702.04521

[50] M. Yang, W. Tu, J. Wang, F. Xu, X. Chen, Attention-based lstm for target-dependent sentiment classification, in: Proceedings of the 31st AAAI Conference on Artificial Intelligence, AAAI Press, San Francisco, CA, United States, 2017, pp. 5013-5014.
[52] Y. Luo, Recurrent neural networks for classifying relations in clinical notes, Journal of Biomedical Informatics 72 (2017) 85-95.

[53] F. Hu, L. Li, Z.-L. Zhang, J.-Y. Wang, X.-F. Xu, Emphasizing essential words for sentiment classification based on recurrent neural networks, Journal of Computer Science and Technology 32 (4) (2017) 785-795.

[56] D. Tang, B. Qin, T. Liu, Document modeling with gated recurrent neural network for sentiment classification, in: Proceedings of Conference on
Austin, TX, United States, 2015, pp. 2267-2273.

[61] A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, C. Potts, Learning word vectors for sentiment analysis, in: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics (ACL), Portland, OR, United States, 2011, pp. 142-150.

[62] R. Socher, A. Perelygin, J. Y. Wu, J. Chuang, C. D. Manning, A. Y. Ng, C. Potts, Recursive deep models for semantic compositionality over a sentiment treebank, in: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics (ACL), Seattle, WA, United States, 2013, pp. 1631-1642.

1155.

[64] Y. Liu, J.-W. Bi, Z.-P. Fan, A method for multi-class sentiment classification based on an improved one-vs-one (ovo) strategy and the support vector machine (svm) algorithm, Information Sciences 394-395 (2017) 38-52.
ity through recursive matrix-vector spaces, in: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Association for Computational Linguistics (ACL), Jeju Island, Korea, Republic of, 2012, pp. 1201-1211.

in: Proceedings of the 31st International Conference on Machine Learning, International Machine Learning Society (IMLS), Beijing, China, 2014, pp. 2931-2939.

[69] N. Kalchbrenner, E. Grefenstette, P. Blunsom, A convolutional neural network for modelling sentences, in: Proceedings of the 52nd Annual Meeting

[71] P. Liu, X. Qiu, H. Xuanjing, Recurrent neural network for text classification

[72] C. Zhou, C. Sun, Z. Liu, F. C. M. Lau, A c-lstm neural network for text classification, Computer Science 1 (4) (2015) 39-44.

view 35 (2) (2011) 137-154.
Biography

Gang Liu received the Ph.D. degree in computer software and theory from the State Key Laboratory of Software Engineering, Wuhan University, Wuhan, China, in 2012. He is currently an associate professor with the School of Computer Science, Hubei University of Technology, Wuhan, China. He has published more than 20 international journal/conference papers. His current research interests include evolutionary computation, deep learning technology, image processing and natural language processing.