
Communicated by Dr Shenglan Liu

Accepted Manuscript

Bidirectional LSTM with attention mechanism and convolutional layer


for text classification

Gang Liu, Jiabao Guo

PII: S0925-2312(19)30106-7
DOI: https://doi.org/10.1016/j.neucom.2019.01.078
Reference: NEUCOM 20383

To appear in: Neurocomputing

Received date: 14 March 2018


Revised date: 7 January 2019
Accepted date: 26 January 2019

Please cite this article as: Gang Liu, Jiabao Guo, Bidirectional LSTM with attention
mechanism and convolutional layer for text classification, Neurocomputing (2019), doi:
https://doi.org/10.1016/j.neucom.2019.01.078

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service
to our customers we are providing this early version of the manuscript. The manuscript will undergo
copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please
note that during the production process errors may be discovered which could affect the content, and
all legal disclaimers that apply to the journal pertain.

Highlights

• For the convolutional layer, the convolution window size and the stride size affect the classification performance.

• BiLSTM and the attention mechanism have greater effects on the classification accuracy than the convolutional layer.

• Compared to pre-trained word embedding vectors, random word embedding vectors require training more parameters, which leads to relatively lower classification accuracy within a limited number of iterations.

• Our proposed approach performs better than some state-of-the-art DNNs.

Bidirectional LSTM with attention mechanism and convolutional layer for text classification✩

Gang Liu a, Jiabao Guo a,∗

a School of Computer Science, Hubei University of Technology, Wuhan 430072, China
Abstract

Neural network models have been widely used in the field of natural language processing (NLP). Recurrent neural networks (RNNs), which have the ability to process sequences of arbitrary length, are common methods for sequence modeling tasks. Long short-term memory (LSTM) is one kind of RNN and has achieved remarkable performance in text classification. However, due to the high dimensionality and sparsity of text data, and to the complex semantics of natural language, text classification presents difficult challenges. In order to solve these problems, this paper proposes a novel and unified architecture that contains a bidirectional LSTM (BiLSTM), an attention mechanism and a convolutional layer, called attention-based bidirectional long short-term memory with convolution layer (AC-BiLSTM). In AC-BiLSTM, the convolutional layer extracts higher-level phrase representations from the word embedding vectors, and BiLSTM is used to access both the preceding and succeeding context representations. The attention mechanism is employed to give different focus to the information output by the hidden layers of BiLSTM. Finally, a softmax classifier is used to classify the processed context information. AC-BiLSTM is able to capture both the local features of phrases and global sentence semantics. Experimental verifications are conducted on six sentiment classification datasets and a question classification dataset, including detailed analysis of AC-BiLSTM. The results clearly show that AC-BiLSTM outperforms other state-of-the-art text classification methods in terms of classification accuracy.

Keywords: long short-term memory, attention mechanism, natural language processing, text classification

✩ The work described in this paper was supported by National Natural Science Foundation of China Foundation No. 61300127. Any conclusions or recommendations stated here are those of the authors and do not necessarily reflect official positions of NSFC.
∗ Corresponding author.
Email addresses: [email protected] (Gang Liu), [email protected] (Jiabao Guo)
1. Introduction

Text classification is the task of automatically classifying a set of documents

into categories from a predefined set and is an important task in many areas of
natural language processing (NLP). It has been applied to recommender systems
[1], spam filtering system [2] and other areas where it is necessary to understand
the sentiment of the users. Sentiment analysis is a branch of text classification;
it is the field of study that analyzes people's opinions, sentiments, appraisals,
attitudes, and emotions toward entities and their attributes expressed in written
text [3]. At present, there are three main types of the methods for text clas-
10 sification. These methods are: (1) the statistics-based classification methods,
such as Bayesian classifier [4]; (2) the connected network learning classification
methods, such as neural networks [5]; (3) the rule-making methods, such as
decision tree classification [6].
Text classification mainly includes topic classification, question classification


15 and sentiment analysis. Hence, it is hard to find a universal approach, which
will consistently perform for all classes of text classification problems. In ad-
dition, most of the existing research on traditional text classification centers on


one special type of phrase or sentences. These methods only rely on the target
sentences or the target words to solve the text classification problem without
20 considering the relationship between each word. In fact, the classification of the
text should be determined based on all the contexts. Especially, the traditional
sentiment analysis focuses on identifying the polarity of a text (e.g. positive,
negative, neutral) based on the language clues extracted from the textual con-


tents of sentences [7, 8, 9]. Generally, the sentiment text can often be expressed
25 in a more subtle or arbitrary manner, making it difficult to be identified by
simply looking at each sentence or word in isolation. Despite having several strik-
ing features and successful applications in various fields, the traditional text

classification approaches have been shown to have certain weaknesses.

Deep learning technology (DL) [10] has achieved remarkable results in many
30 fields, such as computer vision [11], speech recognition [12] and text classifica-

tion [13] in recent years. For text classification, most of the studies with the deep
learning methods can be divided into two parts: (1) learning word vector rep-
resentations through neural language models [14]; (2) performing composition

over the learned word vectors for classification [15]. There are two kinds of deep
learning models in text classification: convolutional neural networks (CNNs)[16]
and recurrent neural networks (RNNs) [17]. In recent years, many text classifi-
cation methods based on CNNs or RNNs have been proposed [18, 19, 20, 21, 22].
CNNs are able to learn the local response from the temporal or spatial data but
lack the ability to learn sequential correlations. In contrast to CNNs, RNNs are
specialized for sequential modelling but unable to extract features in a parallel


way. In fact, text classification can be considered as the sequential modelling
task. Due to the characteristics of RNNs, RNNs are used more frequently in
text classification. However, for long data sequences, traditional RNNs suffer from
exploding and vanishing gradients. Long short term memory
(LSTM) [23] is a kind of RNN architecture with long short term memory units
as hidden units and effectively solves vanishing gradient and gradient explo-
sion problems. Moreover, it can capture long-term dependencies. In terms of


the great power of LSTM to extract the high-level text information, it plays
a pivotal role in NLP. Bidirectional long short term memory (BiLSTM) [24]
50 is a further development of LSTM and BiLSTM combines the forward hidden


layer and the backward hidden layer, which can access both the preceding and
succeeding contexts. Compared to BiLSTM, LSTM only exploits the historical
context. Hence, BiLSTM can solve the sequential modelling task better than
LSTM. Currently, LSTM and BiLSTM have been applied to text classification


55 and made some achievements [25, 26, 27, 28].


For text classification, the vector representation of the text is generally the
high-dimensional vector. The high-dimensional vector as the input of LSTM
will cause a sharp increase in the network parameters and make the network

difficult to optimize. The convolution operation can extract the features while

60 reducing dimensionality of data. Therefore, the convolution operation can be
used to extract the features of the text vector and reduce the dimensions of the

vector. Although BiLSTM can obtain the contextual information of the text,
it is not possible to focus on the important information in the obtained con-
textual information. Focusing on the important information will improve the
accuracy of the classification. Attention mechanism can highlight the impor-
tant information from the contextual information by setting different weights.
The combination of BiLSTM and attention mechanism can further improve the
classification accuracy.
To further continue the research in this direction, this paper proposes a
70 novel deep learning architecture for text classification. This new architecture is
enhanced BiLSTM using attention mechanism (AM) [29] and the convolutional
layer, referred to as attention-based BiLSTM with the convolutional layer (AC-
BiLSTM). The basic idea of the proposed architecture is based on the following
consideration. The one-dimensional convolutional filters in the convolutional
layer are effective in extracting n-gram features at different positions of a sentence
and reduce the dimensions of the input data. BiLSTM is used to extract the
contextual information from the features outputted by the convolutional layer.
Attention mechanism has also been successfully applied to text classification


[30]. In AC-BiLSTM, attention mechanism is respectively employed to give
80 different focus to the information extracted from the forward hidden layer and
the backward hidden layer in BiLSTM. Attention mechanism strengthens the


distribution of weights to the variable-length sequences. There are two attention
mechanism layers in AC-BiLSTM. The features extracted from the attention
mechanism layers are concatenated together and will be classified by the softmax
85 classifier. In order to verify the performance of the proposed approach, seven


comprehensive labeled datasets of experiments (including 6 sentiment analysis


datasets and 1 text classification dataset) are conducted. Compared with other
state-of-the-art text classification methods, our approach performs better, or at
least comparably, in terms of the classification accuracy and the robustness.

90 The remainder of this paper is organized as follows. Section 2 introduces

LSTM, BiLSTM and gives a short literature review on text classification. The
proposed approach is presented in detail in Section 3. Experimental results and

discussions are reported in Section 4. Finally, some conclusions and possible
paths for future research are provided in Section 5.

2. Long short-term memory and related work

2.1. Long short-term memory

RNNs are a kind of neural network with a recurrent hidden
state, where the hidden state is activated by the previous states at a certain
time. Therefore, RNNs can model the contextual information dynamically and
100 can handle the variable-length sequences. LSTM is a kind of RNNs architecture
and has become the mainstream structure of RNNs at present. LSTM addresses
the problem of vanishing gradient by replacing the self-connected hidden units


with memory blocks. The memory block uses a purpose-built memory cell to store
information, and it is better at finding and exploiting long range context. The
105 memory units enable the network to be aware of when to learn new information
and when to forget old information. An LSTM unit consists of four compo-
nents, as illustrated in Fig.1. The i is an input gate and it controls
the size of the new memory content added to the memory. The f is a forget
gate and it determines the amount of the memory that needs to be forgotten.
110 The o is an output gate and it modulates the amount of the output memory
content. The c is the cell activation vector and it consists of two components,
namely the partially forgotten previous memory ct−1 and the modulated new memory
c̃t. The subscript t denotes the t-th time step.

Figure 1: Illustration of the LSTM unit. The weight matrices are represented by lines with
arrows

The mathematical form of LSTM shown in Fig.1 is given below. The hidden state
ht given input xt is computed as follows:

it = σ(Wxi xt + Whi ht−1 + bi )    (1)

ft = σ(Wxf xt + Whf ht−1 + bf )    (2)

ot = σ(Wxo xt + Who ht−1 + bo )    (3)

c̃t = tanh(Wxc xt + Whc ht−1 + bc )    (4)

ct = ft ⊗ ct−1 + it ⊗ c̃t    (5)

ht = ot ⊗ tanh(ct )    (6)

Figure 2: Illustration of a LSTM model (a) and a BiLSTM model (b).
where it , ft , ot and ct represent the value of i, f , o and c at the moment
t, respectively. The W denotes the self-updating weights of the hidden layer
and the term b denotes the bias vector. The σ(.) and tanh(.) are sigmoid and
hyperbolic tangent function respectively. All the gate values and hidden layer
120 outputs lie within the range of [0, 1]. The operator ⊗ denotes element-wise
multiplication.
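To make Eqs. (1)-(6) concrete, the sketch below implements a single LSTM step with NumPy. The weight names mirror the notation above; the sizes and random initialization are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step following Eqs. (1)-(6); W and b hold the gate parameters."""
    i_t = sigmoid(W["xi"] @ x_t + W["hi"] @ h_prev + b["i"])    # input gate, Eq. (1)
    f_t = sigmoid(W["xf"] @ x_t + W["hf"] @ h_prev + b["f"])    # forget gate, Eq. (2)
    o_t = sigmoid(W["xo"] @ x_t + W["ho"] @ h_prev + b["o"])    # output gate, Eq. (3)
    c_hat = np.tanh(W["xc"] @ x_t + W["hc"] @ h_prev + b["c"])  # candidate memory, Eq. (4)
    c_t = f_t * c_prev + i_t * c_hat                            # cell state, Eq. (5)
    h_t = o_t * np.tanh(c_t)                                    # hidden state, Eq. (6)
    return h_t, c_t

# Illustrative sizes: 300-dimensional input vector, 150-dimensional hidden state.
d_in, d_h = 300, 150
rng = np.random.default_rng(0)
W = {k: rng.standard_normal((d_h, d_in if k[0] == "x" else d_h)) * 0.01
     for k in ["xi", "hi", "xf", "hf", "xo", "ho", "xc", "hc"]}
b = {k: np.zeros(d_h) for k in ["i", "f", "o", "c"]}
h, c = np.zeros(d_h), np.zeros(d_h)
h, c = lstm_step(rng.standard_normal(d_in), h, c, W, b)
```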
The graphical illustration of the standard LSTM network can be found in the
(a) of Fig.2. The standard LSTM network can only exploit the historical con-
text. However, the lack of future context may lead to incomplete understanding
125 of the meaning of the problem. Therefore, BiLSTM is proposed to access both
the preceding and succeeding contexts by combining a forward hidden layer and
a backward hidden layer as depicted in the (b) of Fig.2. The forward and back-
ward pass over the unfolded network over time are carried out in a similar way
to regular network forward and backward passes, except that BiLSTM needs to
130 unfold the forward hidden states and the backward hidden states for all time
steps. The BiLSTM networks are trained using backpropagation through time
(BPTT) [24].
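As a sketch of the bidirectional reading described above, the snippet below uses PyTorch's built-in LSTM with bidirectional=True; the per-step forward and backward hidden states come out concatenated along the feature dimension. The batch size, sequence length and dimensions are arbitrary illustrative choices, not settings from this paper.

```python
import torch
import torch.nn as nn

# A BiLSTM over a batch of feature sequences of shape (batch, seq_len, feature_dim).
bilstm = nn.LSTM(input_size=100, hidden_size=150, batch_first=True, bidirectional=True)

x = torch.randn(2, 40, 100)              # 2 sequences, 40 time steps, 100 features each
outputs, (h_n, c_n) = bilstm(x)

# outputs: (2, 40, 300) -- forward and backward hidden states concatenated per step.
forward_states = outputs[:, :, :150]     # preceding-context representation
backward_states = outputs[:, :, 150:]    # succeeding-context representation
print(outputs.shape, forward_states.shape, backward_states.shape)
```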

2.2. Related work

In deep learning, LSTM is mainly used to process the sequence data. The
135 breadth of applications for LSTM has expanded rapidly in recent years. In or-


der to further improve the performance of LSTM to handle the variable-length


sequential information for the requirements of various tasks, many researchers
have proposed many methods to improve LSTM. Currently, LSTM and its vari-
ants have been employed to produce the promising results on a variety of tasks.

140 The combination of LSTM and other network structures is an important re-

search direction. Kolawole [31] employed the Child-Sum Tree-LSTM for solving
the challenging problem of textual entailment. Their approach is simple and able

to generalize well without excessive parameter optimization. The literature [32]
demonstrated that LSTM networks predict the subcellular location of proteins
145 given only the protein sequence with high accuracy outperforming current state-

of-the-art algorithms. They further improved the performance by introducing
convolutional filters and experiment with an attention mechanism which lets the
LSTM focus on specific parts of the protein. Jin [33] proposed a regional CNN-
LSTM model consisting of two parts: regional CNN and LSTM to predict the
150 valence-arousal (VA) ratings of texts. The proposed regional CNN uses an indi-
vidual sentence as a region, dividing an input text into several regions such that

the useful affective information in each region can be extracted and weighted
according to their contribution to the VA prediction. Such regional informa-

tion is sequentially integrated across regions using LSTM for VA prediction.


155 By combining the regional CNN and LSTM, both local (regional) information
within sentences and long-distance dependency across sentences can be consid-

ered in the prediction process. Experimental results showed that the proposed
method outperforms lexicon-based, regression-based, and NN-based methods

proposed in previous studies. Lu [34] proposed a novel model based on LSTM


160 called P-LSTM for sentiment classification. In P-LSTM, three-words phrase
embedding is used instead of single word embedding as is often done. P-LSTM

introduces the phrase factor mechanism which combines the feature vectors of
the phrase embedding layer and the LSTM hidden layer to extract more exact
information from the text. The experimental results showed that the P-LSTM
165 achieves excellent performance on the sentiment classification tasks. Chen [27]
proposed a divide-and-conquer approach which first classifies sentences into dif-


ferent types, then performs sentiment analysis separately on sentences from each
type. Their approach, BiLSTM-CRF, is used to classify opinionated sentences
into three types according to the number of targets appeared in a sentence. Each
170 group of sentences is then fed into a one-dimensional convolutional neural net-

work separately for sentiment classification. The literature [35] proposed a deep

learning-based approach for temporal 3D pose recognition problems based on
a combination of a CNN and a LSTM recurrent network. The paper presents

a two-stage training strategy which firstly focuses on CNN training and sec-
175 ondly, adjusts the full method (CNN+LSTM). Le [36] introduced a multi-view
recurrent neural network (MV-RNN) approach for 3D mesh segmentation. The

architecture combines CNN and a two-layer LSTM to yield coherent segmenta-
tion of 3D shapes. The image-based CNN is useful for effectively generating
the edge probability feature map while the LSTM correlates these edge maps
180 across different views and output a well-defined per-view edge image.
Currently, attention mechanism has become an effective method to select the
significant information to obtain the superior results. Many studies have been

conducted on the architecture of attention mechanism and many novel atten-


tion mechanisms are proposed. Xu [37] proposed a stochastic ”hard” attention

185 mechanism and a deterministic ”soft” attention mechanism. The determinis-


tic attention model is an approximation to the marginal likelihood over the
attention locations and it is the most widely used attention mechanism. Lu-

ong [38] examined two simple and effective classes of attentional mechanism:
a global approach which always attends to all source words and a local one

190 that only looks at a subset of source words at a time. These classes differ in
terms of whether the attention is placed on all source positions or on only a
few source positions. The idea of a global attentional model is to consider all

the hidden states of the encoder when deriving the context vector. The lo-
cal attention mechanism selectively focuses on a small window of context and
195 is differentiable. Lin [39] proposed a self-attention mechanism. The proposed
self-attention mechanism allows extracting different aspects of the sentence into
multiple vector representations. Vaswani [40] proposed scaled dot-product at-


tention and multi-head attention. Scaled dot-product attention computes the


dot products of the input data, divide each by the scaling factor, and apply
200 a softmax function to obtain the weights on the values. Multi-head attention
allows the model to jointly attend to information from different representation

subspaces at different positions. Due to the reduced dimension of each head,

the total computational cost is similar to that of single-head attention with full
dimensionality. Shen [41] proposed bidirectional block self-attention (Bi-BloSA)

205 for fast and memory-efficient context fusion. The basic idea is to split a sequence
into several length-equal blocks (with padding if necessary), and apply an intra-
block self-attention networks (SAN) to each block independently. The outputs

for all the blocks are then processed by an inter-block SAN. The intra-block
SAN captures the local dependency within each block, while the inter-block
SAN captures the long-range/global dependency. Hence, every SAN only needs
to process a short sequence.
The combination of LSTM and attention mechanism can obtain better re-
sults. Especially, for sequence problems, attention mechanism has been used

successfully in a variety of tasks including text classification, reading compre-


215 hension and so on. Yang [42] proposed a hierarchical attention network for doc-

ument classification. The model has two distinctive characteristics: (i) it has a
hierarchical structure that mirrors the hierarchical structure of documents; (ii)
it has two levels of attention mechanisms applied at the word- and sentence-level,

enabling it to attend differentially to more and less important content when


220 constructing the document representation. Experiments conducted on six large

scale text classification tasks demonstrate that the proposed architecture out-
perform previous methods by a substantial margin. Cui [43] presented a simple
but novel model called attention-over-attention reader for better solving cloze-

style reading comprehension task. The proposed model aims to place another
225 attention mechanism over the document-level attention and induces ”attended
attention” for final answer predictions. Experimental results show that the pro-
posed methods significantly outperform various state-of-the-art systems by a
large margin in public datasets. Li [44] developed a novel model, employing


context-dependent word-level attention for more accurate statement represen-


tations and question-guided sentence-level attention for better context modeling.
Employing these attention mechanisms, the model accurately understands when
it can output an answer or when it requires generating a supplementary ques-

tion for additional input depending on different contexts. Paulus [45] introduced

a neural network model with a novel intra-attention that attends over the in-
235 put and continuously generated output separately, and a new training method

that combines standard supervised word prediction and reinforcement learn-
ing (RL). The model obtains a 41.16 ROUGE-1 score on the CNN/Daily Mail
dataset, an improvement over previous state-of-the-art models. Human eval-

uation also shows that their model produces higher quality summaries. Huang
[46] introduced a new neural structure called FusionNet, which extends existing
attention approaches from three perspectives. First, it puts forward a novel
concept of ”history of word” to characterize attention information from the
lowest word-level embedding up to the highest semantic-level representation.
Second, it introduces an improved attention scoring function that better uti-

245 lizes the ”history of word” concept. Third, it proposes a fully-aware multi-level
attention mechanism to capture the complete information in one text (such as a

question) and exploit it in its counterpart (such as context or passage) layer by


layer. Seo [47] introduced the Bi-Directional Attention Flow (BIDAF) network,
a multi-stage hierarchical process that represents the context at different lev-

250 els of granularity and uses bi-directional attention flow mechanism to obtain a
query-aware context representation without early summarization. Experimental

evaluations show that the model achieves the state-of-the-art results in Stanford
Question Answering Dataset (SQuAD) and CNN/DailyMail cloze test. Daniluk
[48] proposed a neural language model with a key-value attention mechanism

255 that outputs separate representations for the key and value of a differentiable
memory, as well as for encoding the next-word distribution. This model outper-
forms existing memory-augmented neural language models on two corpora. The
literature [49] proposed a simple neural architecture for natural language infer-
ence. The approach uses attention to decompose the problem into subproblems


260 that can be solved separately, thus making it trivially parallelizable. On the
Stanford Natural Language Inference (SNLI) dataset, it obtains state-of-the-
art results with almost an order of magnitude fewer parameters than previous
work and without relying on any word-order information. Min [50] presented an

attention-based bidirectional LSTM approach to improve the target-dependent

265 sentiment classification. The method learns the alignment between the target
entities and the most distinguishing features. The experimental results showed

that the model achieves state-of-the-art results.
Some studies have used other methods to improve LSTM and proposed many
new variants of LSTM. Wei [51] proposed a transfer learning framework based
on a convolutional neural network and a long short-term memory model, called
ConvL, to automatically identify whether a post expresses confusion, determine
the urgency and classify the polarity of the sentiment. Luo [52] proposed the
models based on LSTM for classifying relations from clinical notes. They com-
pared the segment LSTM model with the sentence LSTM model, and demon-
275 strated the benefits of exploring the difference between concept text and context

text, and between different contextual parts in the sentence. They also evalu-
ated the impact of word embedding on the performance of LSTM models and

showed that medical domain word embedding help improve the relation classifi-
cation. Hu [53] established a keyword vocabulary and proposed an LSTM-based
280 model that is sensitive to the words in the vocabulary. Experimental results

demonstrated that their model outperforms the baseline LSTM in terms of ac-
curacy and is effective with significant performance enhancement over several

non-recurrent neural network latent semantic models. Huang [54] discovered


that encoding syntactic knowledge (part-of-speech tag) in neural networks can
285 enhance sentence/phrase representation. Specifically, they proposed to learn

tag-specific composition functions and tag embeddings in recursive neural net-


works, and proposed to utilize POS tags to control the gates of tree-structured
LSTM networks. The literature [55] proposed a quadratic connections of the
LSTM model in terms of RvNNs (abbreviated as qLSTM-RvNN) in order to
290 attack the problem of representing compositional semantics. Empirical results


showed that it outperforms the state-of-the-art RNN, RvNN, and LSTM net-
works in two semantic compositionality tasks by increasing the classification ac-
curacies and sentence correlation while significantly decreasing computational
complexities. Tang [56] introduced a neural network model to learn vector-

295 based document representation in a unified, bottom-up fashion. The model first

learns sentence representation with convolutional neural network or LSTM. Af-
terwards, semantics of sentences and their relations are adaptively encoded in

document representation with gated recurrent neural network. Wang [57] re-
garded the microblog conversation as sequence, and leveraged BiLSTM models
300 to incorporate preceding tweets for context-aware sentiment classification. Their

proposed method could not only alleviate the sparsity problem in the feature
space, but also capture the long distance sentiment dependency in the microblog
conversations. Extensive experiments on a benchmark dataset showed that the
BiLSTM models with context information could outperform other strong base-
305 line algorithms.
It can be seen that much research has been done in the basic structure of

LSTM to enhance its performance and LSTM has also achieved outstanding
results in text classification. These methods are the basis of AC-BiLSTM and

therefore they are elaborated.

3. Attention-based BiLSTM with convolutional layer

LSTM is good at handling the variable-length sequences; however, LSTM


can not utilize the contextual information from the future tokens and it lacks

the ability to extract the local contextual information. Furthermore, not all
parts of the document are equally relevant but LSTM can not recognize the dif-
ferent relevance between each part of the document. These problems affect the
text classification accuracy of LSTM. In order to improve the performance of


LSTM in text classification, this paper attempts to design the novel architecture
which helps to address the drawbacks mentioned above by integrating BiLSTM,
attention mechanism and the convolutional layer. The proposed architecture

Figure 3: The architecture of the AC-BiLSTM

is named attention-based BiLSTM with convolutional layer (AC-BiLSTM). In


AC-BiLSTM, the convolutional layer extracts n-gram features from the text for
sentence modeling. And then BiLSTM accesses both the preceding and succeed-

ing contextual features by combining a forward hidden layer and a backward


hidden layer. The attention mechanism (AM) for the single word representation
pays more attention to the words related to the sentiment of the text and it can
help to understand the sentence semantics. Two attention mechanism layers in


AC-BiLSTM process the preceding and succeeding contextual features, respec-
tively. The features processed by the AM layers are concatenated together and
then are fed into the softmax classifier. The architecture of the AC-BiLSTM
330 model is shown in Fig.3.
The entire learning algorithm of AC-BiLSTM is summarized as Algorithm
1, in which the future context representation and the history context
representation are concatenated together.

Algorithm 1 Pseudo-code for AC-BiLSTM

1: Construct the word embedding table using pre-trained word vectors with Eq.7;
2: Employ the convolutional layer to obtain the feature sequences Lc = [Lc1, Lc2, . . . , Lc100], using Eq.8;
3: Employ BiLSTM to obtain the preceding contextual features hf and the succeeding contextual features hb from the feature sequences, using Eq.9 and Eq.10;
4: Employ two attention layers to obtain the future context representation Fc and the historical context representation Hc from the preceding and succeeding contextual features, using Eq.13 and Eq.14;
5: Combine the future and historical context representations to obtain the comprehensive context representations S = [Fc, Hc];
6: Feed the comprehensive context representations into the softmax classifier to get the class labels;
7: Update the parameters of the model using the loss function Eq.15 with the Adam method.
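The following PyTorch sketch is one minimal reading of the pipeline in Algorithm 1 (convolution over word embeddings, BiLSTM, one attention layer per direction, concatenation, dropout and a softmax classifier). It is not the authors' implementation; the layer sizes follow values quoted later in the paper (300-dimensional embeddings, 100 filters of width 3, BiLSTM hidden size 150, dropout 0.7), and all other details are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Attention(nn.Module):
    """Word-level attention over one direction's hidden states (Eqs. 11-14)."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, hidden_dim)          # u = tanh(w h + b)
        self.context = nn.Parameter(torch.randn(hidden_dim))   # word-level context vector v

    def forward(self, h):                                      # h: (batch, seq_len, hidden_dim)
        u = torch.tanh(self.proj(h))
        a = F.softmax(u @ self.context, dim=1)                 # normalized weights a: (batch, seq_len)
        return (a.unsqueeze(-1) * h).sum(dim=1)                # weighted sum -> context representation

class ACBiLSTM(nn.Module):
    def __init__(self, embed_weights, num_classes, n_filters=100, window=3, hidden=150):
        super().__init__()
        self.embedding = nn.Embedding.from_pretrained(embed_weights, freeze=False)
        self.conv = nn.Conv1d(embed_weights.size(1), n_filters, kernel_size=window, padding=1)
        self.bilstm = nn.LSTM(n_filters, hidden, batch_first=True, bidirectional=True)
        self.att_fwd = Attention(hidden)
        self.att_bwd = Attention(hidden)
        self.dropout = nn.Dropout(0.7)
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, token_ids):                               # token_ids: (batch, seq_len)
        x = self.embedding(token_ids)                           # (batch, seq_len, 300)
        x = F.relu(self.conv(x.transpose(1, 2))).transpose(1, 2)  # (batch, seq_len, 100)
        h, _ = self.bilstm(x)                                   # (batch, seq_len, 300)
        fc = self.att_fwd(h[:, :, :h.size(2) // 2])             # forward context representation Fc
        hc = self.att_bwd(h[:, :, h.size(2) // 2:])             # backward context representation Hc
        s = self.dropout(torch.cat([fc, hc], dim=-1))           # S = [Fc, Hc]
        return self.classifier(s)                               # logits for the softmax classifier
```

As a usage sketch, `ACBiLSTM(torch.randn(vocab_size, 300), num_classes=2)` builds a binary classifier whose output logits can be passed to a cross-entropy loss as in Eq. 15.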
The key points of our approach are described in detail as follows.

3.1. Word embedding

Traditional word representations, such as one-hot vectors, face two main
problems: they lose word order and have excessive dimensionality. Compared to one-
hot representations, distributed representations of word
embedding are more suitable and more powerful. This paper focuses on text-
level classification. Assume that a text has M words, and let wrm with
m ∈ [1, M ] represent the m − th word in the text. Given a


text with words wrm , AC-BiLSTM embeds the words to vectors through an
embedding matrix We . The xm is the vector representation of wrm , which is
formulated by Eq.7.
xm = We wrm (7)

16
ACCEPTED MANUSCRIPT

345 The off-the-shelf word embedding matrices that are already on line can be
easily employed. In this paper our approach uses the word2vec method pro-
posed by Mikolve [58] for word embedding. The skip-gram model is used in
the word2vec method for the task. The model is trained by using the skip-

T
gram method by max-mizing the average log probability of all the words. The

IP
350 skip-gram model trains semantic embeddings by predicting the target word in
accordance with its context and the skip-gram model can also capture semantic

CR
relations between words. In this paper, the dimensionality of each word vector
is 300.

355
3.2. One dimension convolutional layer

US
In AC-BiLSTM, the single convolutional layer is used to capture the se-
quence information and reduce the dimensions of the input data. The convo-
AN
lution operation in the convolutional layer is conducted in one dimension. In
the convolutional layer, 100 filters with windows size of 3 move on the textual
representation to extract the features. As the filter moves on, many sequences,
M

360 which capture the syntactic and semantic features, are generated. The illustra-
tion of the convolutional layer is shown as Fig.4. Blocks of the same pattern in
ED

the feature sequences layer and the filter windows layer corresponds to features
for the same window. The dashed lines connect the feature of a window with
the source feature sequences. For the convolutional layer, the dimension of the
PT

365 input data is 300 ∗ M and the dimension of the output data is 100 ∗ M . M is
the number of words in the text. Hence, the convolutional layer is an effective
way for dimensionality reduction.
CE

The convolutional layer, which is between k filters Wc ∈ Rmd×k (R is real


number system and the term d is the dimension of word embedding.) and a
AC

370 word embedding vector xi:i+m−1 which represents a window of m words starting
from the i − th word, is used to obtain the features for the window of words in
the corresponding feature sequences. Multiple filters with differently initialized
weights are used to improve learning capability of the model. The n − th feature

17
ACCEPTED MANUSCRIPT

T
IP
CR
US
AN

Figure 4: The architecture for the convolution operation


M

sequence Lcn is generated from a window of words xi:i+m−1 by

Lcn = g(Wc T xi:i+m−1 + b) ∈ Rk (8)


ED

375 where b is a bias vector and g(.) represents the nonlinear activation function
of the convolutional operation, rectified linear units (ReLU). In AC-BiLSTM,
PT

ReLU is used as the nonlinear activation function because it can improve the
learning dynamics of the networks and significantly reduce the number of itera-
CE

tions required for convergence in deep networks. Because the number of filters is
380 100, the feature sequences Lc of words xi:i+m−1 are Lc = [Lc1 , Lc2 , . . . , Lc100 ].
AC

3.3. BiLSTM and attention mechanism

Intuitively, text classification is the processing of sequential information.


However, the feature sequences obtained in a parallel way from the convolutional
layer do not contain sequence information. BiLSTM is specilized for sequential
385 modelling and can further extract the contextual information from the feature

18
ACCEPTED MANUSCRIPT

sequences obtained by the convolutional layer. The effect of BiLSTM is to build


the text-level word vector representation. Because all the words have different
contributions to the sentiment of the context, assigning different weights to
words is a common way of solving the problem. Attention mechanism is to assign

T
390 different weights to words to enhance understanding of the sentiment of the

IP
entire text. Hence, BiLSTM and attention mechanism can improve classification
efficiency.

CR
BiLSTM obtains the annotations of words by summarizing information from
both directions (forward and backward) for words, and hence the annotations
incorporate the contextual information. BiLSTM contains the forward LSTM
(represented as $\overrightarrow{LSTM}$) which reads the feature sequences from Lc1 to Lc100
and the backward LSTM (represented as $\overleftarrow{LSTM}$) which reads from Lc100 to
Lc1. Formally, the outputs of BiLSTM are stated as follows:

\overrightarrow{h_f} = \overrightarrow{LSTM}(Lc_n), \quad n \in [1, 100]    (9)

\overleftarrow{h_b} = \overleftarrow{LSTM}(Lc_n), \quad n \in [100, 1]    (10)

An annotation for a given feature sequence Lcn is obtained by the forward
hidden state $\overrightarrow{h_f}$ and the backward hidden state $\overleftarrow{h_b}$. These states summarize
the information of the entire text centered around Lcn and implement the word
encoding.

Attention mechanism can focus on the features of the keywords to reduce
the impact of non-keywords on the text sentiment, and it is considered as a
fully-connected layer and a softmax function. The working process of the attention
mechanism in AC-BiLSTM is detailed in the following.
The word annotation $\overrightarrow{h_f}$ is first fed to a one-layer perceptron to get $\overrightarrow{u_f}$ as a
hidden representation of $\overrightarrow{h_f}$. The $\overrightarrow{u_f}$ is formulated as follows:

\overrightarrow{u_f} = \tanh(w \overrightarrow{h_f} + b)    (11)

where w and b represent the weight and bias of the neuron, and tanh(.) is the
hyperbolic tangent function. The model uses the similarity between $\overrightarrow{u_f}$ and a
word-level context vector $\overrightarrow{v_f}$ to measure the importance of each word. It then
uses the softmax function to get the normalized weight $\overrightarrow{a_f}$ of each word. $\overrightarrow{a_f}$
is formulated as follows:

\overrightarrow{a_f} = \frac{\exp(\overrightarrow{u_f} * \overrightarrow{v_f})}{\sum_{i=1}^{M} \exp(\overrightarrow{u_f} * \overrightarrow{v_f})}    (12)

where M is the number of words in the text and exp(.) is the exponential function.
The word-level context vector $\overrightarrow{v_f}$ can be seen as a high-level representation
of the informative words and is randomly initialized and jointly
learned during the training process.

After that, a weighted sum of the forward-read word annotations based on
the weight $\overrightarrow{a_f}$ is computed as the forward context representation F c. The F c is
part of the output of the attention layer, and it can be expressed as:

F c = \sum (\overrightarrow{a_f} * \overrightarrow{h_f})    (13)



Similar to $\overrightarrow{a_f}$, $\overleftarrow{a_b}$ can be calculated using the backward hidden state $\overleftarrow{h_b}$. Like
F c, the backward context representation Hc is also part of the output of
the attention layer, and it can be expressed as:

Hc = \sum (\overleftarrow{a_b} * \overleftarrow{h_b})    (14)
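A step-by-step NumPy sketch of Eqs. (11)-(14) for the forward direction is given below (the backward direction is identical with the backward hidden states). The shapes and the random parameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
M, d = 40, 150                      # M words, 150-dimensional forward hidden states
h_f = rng.standard_normal((M, d))   # forward hidden states from BiLSTM

w = rng.standard_normal((d, d))     # perceptron weight
b = np.zeros(d)                     # perceptron bias
v_f = rng.standard_normal(d)        # word-level context vector (randomly initialized, learned in training)

u_f = np.tanh(h_f @ w + b)                     # Eq. (11): hidden representation of h_f
scores = u_f @ v_f                             # similarity with the context vector
a_f = np.exp(scores) / np.exp(scores).sum()    # Eq. (12): softmax-normalized weights
F_c = (a_f[:, None] * h_f).sum(axis=0)         # Eq. (13): forward context representation
print(a_f.shape, F_c.shape)                    # (40,) (150,)
```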

AC-BiLSTM obtains an annotation for a given feature sequence Lcn by con-



425 catenating the forward context representation F c and backward context repre-
sentation Hc. Finally, the comprehensive context representations S = [F c, Hc]

are obtained. The comprehensive context representations are considered as the


features for text classification. In AC-BiLSTM, the dropout layer and the soft-
max layer are used to generate the conditional probabilities over the class space
AC

430 to achieve classification. The purpose of the dropout layer is to avoid overfitting.
Currently, the cross entropy is a commonly used loss function to evaluate the
classification performance of the models. It is often better than the classification
error rate or the mean square error. In our approach, Adam optimizer [59] is
chosen to optimize the loss function of the network. The model parameters are

20
ACCEPTED MANUSCRIPT

435 fine-tuned by Adam optimizer which has been shown as an effective and efficient
backpropagation algorithm. The cross entropy as the loss function can reduce
the risk of a gradient disappearance during the process of stochastic gradient
descent. The loss function can be denoted as follows in Eq.15:

L_{total} = -\frac{1}{num} \sum_{Sp} [y \ln o + (1 - y) \ln(1 - o)]    (15)

where num is the number of training samples, Sp represents the training sample,
y is the label of the sample, and o is the output of AC-BiLSTM.
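A sketch of the training objective and optimizer choice described here: binary cross-entropy as in Eq. (15) together with the Adam optimizer, using PyTorch's built-in components. The model and data below are placeholders; for the multi-class datasets the same idea applies with categorical cross-entropy (nn.CrossEntropyLoss).

```python
import torch
import torch.nn as nn

model = nn.Linear(300, 1)                        # placeholder standing in for the AC-BiLSTM output layer
criterion = nn.BCEWithLogitsLoss()               # Eq. (15): binary cross-entropy averaged over samples
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

features = torch.randn(50, 300)                  # a batch of 50 training samples (placeholder features)
labels = torch.randint(0, 2, (50, 1)).float()    # labels y

optimizer.zero_grad()
outputs = model(features)                        # outputs o
loss = criterion(outputs, labels)                # L_total
loss.backward()
optimizer.step()
```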
The main contributions and originality of AC-BiLSTM are as follows:
1) the convolutional layer extracts the low-level semantic features from the

445
US
raw text and is used for dimensionality reduction. For text classification, the
vector representation of the entire document is generally the high-dimensional
vector. The parameters of BiLSTM will increase significantly when BiLSTM
AN
is used to capture the semantics of the entire document. However, too many
network parameters will increase the difficulty of network optimization. Directly
reducing the dimensionality of the text vector will lose a lot of information and
M

reduce the accuracy of classification. The convolutional layer is supposed to be


450 good at extracting robust and abstract features of the input. In addition, the
ED

convolutional layer can also reduce the dimension of the input data. Therefore,
one dimension convolutional layer can extract the feature information of the
text vector while reducing the dimensionality of the text vector;
PT

2) BiLSTM extracts the contextual information from the low-level semantic


455 features. RNNs is a biased model, where later words are more dominant than
CE

earlier words [60]. Therefore, the information extracted by LSTM cannot ef-
fectively represents the actual semantics of the text. Since BiLSTM can access
both the preceding and succeeding contextual features, the features extracted
AC

by BiLSTM can more realistically represent the actual semantics of the text.
460 Compared with extracting the contextual information directly from the text,
extracting the contextual information from the low-level semantic features can
improve the efficiency of extracting information of BiLSTM;
3) The forward hidden layer and backward hidden layer in BiLSTM use

21
ACCEPTED MANUSCRIPT

their respective attention mechanism layers. Since BiLSTM can access both the
465 preceding and succeeding contexts, the information obtained by BiLSTM can
be considered as two different representations of the text. The same informa-
tion may use different representations in the information obtained by BiLSTM.

T
Therefore, using attention mechanism for each representation of the text can

IP
better focus on the respective important information and avoid mutual interfer-
470 ence of the important information in the different representations. Moreover,

CR
the attention mechanism layers in AC-BiLSTM make the understanding of text
semantics more accurate.
Hence, our approach effectively improves the classification accuracy.

4. Experiments US
AN
475 Experiments are conducted to evaluate the performance of the proposed
approach for text classification on various benchmarking datasets. In this sec-
tion, the experimental setup and baseline methods followed by the discussion of
M

results are described.

4.1. Experimental setup


ED

480 a) Datasets
Our model is evaluated on text classification task (including sentiment and
question classification) using the following datasets. Summary statistics of these
PT

datasets are as follow:


MR: Movie review sentence polarity dataset v1.0 [14]. It contains 5331
CE

485 positive snippets and 5331 negative snippets extracted from Rotten Tomatoes
web site pages where reviews marked with ”fresh” are labeled as positive, and
reviews marked with ”rotten” are labeled as negative.
AC

IMDB: A benchmark dataset for sentiment classification [61]. It is a large


movie review dataset with full-length reviews. The task is to determine if the
490 movie reviews are positive or negative. Both the training and test set have 25K
reviews.

22
ACCEPTED MANUSCRIPT

RT-2k: The standard 2000 full-length movie review dataset [7]. Classifica-
tion involves detecting positive/negative reviews.
SST-1: Stanford Sentiment Treebank, an extension of MR but with train/dev/test
495 splits provided and fine-grained labels (very positive, positive, neutral, negative,

T
very negative), re-labeled by Socher [62].

IP
SST-2: Binary labeled version of Stanford sentiment treebank, in which
neutral reviews are removed, very positive and positive reviews are labeled as

CR
positive, negative and very negative reviews are labeled as negative [62].
500 Subj: The subjectivity dataset consists of subjective reviews and objective
plot summaries [8]. The task of subjectivity dataset is to classify the text as
being subjective or objective.
US
TREC: TREC question dataset; the task involves classifying a question into 6
question types [52]. TREC divides all questions into 6 categories, including
AN
505 location, human, entity, abbreviation, description and numeric. The training
dataset contains 5452 labelled questions while the testing dataset contains 500
questions.
M

We test our model on various benchmarks. Summary statistics of the datasets


are in Table 1. The c is the number of the target classes, l means the average
ED

510 sentence length, N is the dataset size and |V | is the vocabulary size. ”Test” is
the test set size and ”CV” means that there is no standard train/test split and
thus 10-fold CV is used.
PT

b) Parameter settings
Our experiments use accuracy as the evaluation metric to measure the overall
CE

515 classification performance. During training AC-BiLSTM for feature extraction


in the text, the input sequence xm is set to the m − th word embedding (a dis-
tributed representation for a word [63]) in a input sentence. Publicly available
AC

word vectors trained from Google News are used as pre-trained word embed-
dings. The size of these embeddings is 300. The memory dimension of BiLSTM
520 is set to be 150 and the number of filters of length 3 is set to be 100 in the convo-
lutional layer. The training batch size for all datasets is set as 50. The dropout
rate is 0.7. A back-propagation algorithm with Adam stochastic optimization

23
ACCEPTED MANUSCRIPT

Table 1: Summary statistics for the datasets after tokenization.

Data    c    l    N      |V |     Test
MR      2    20   10662  18765    CV
IMDB    2    231  50000  392000   25000
SST-1   5    18   11855  17836    2210
SST-2   2    19   9613   16185    1821
Subj    2    23   10000  21323    CV
RT-2K   2    787  2000   51000    CV
TREC    6    10   5952   9592     CV

US
method is used to train the network through time with the learning rate of
AN
0.001. After each training epoch, the network is tested on validation data. The
525 log-likelihood of validation data is computed for convergence detection.
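For reference, the hyperparameters listed in this subsection can be collected in a single configuration sketch; the values below are taken from the text above, and the dictionary form itself is only an illustrative convention.

```python
# Hyperparameters reported in Section 4.1 b), gathered in one place.
config = {
    "embedding_dim": 300,        # pre-trained Google News word2vec vectors
    "bilstm_hidden_dim": 150,    # memory dimension of BiLSTM
    "conv_filters": 100,         # number of convolution filters
    "conv_window": 3,            # filter length
    "batch_size": 50,
    "dropout_rate": 0.7,
    "optimizer": "Adam",
    "learning_rate": 0.001,
}
```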

4.2. Baseline methods


M

This paper benchmarks the following baseline methods for text classifica-
tion, they are effective methods and have achieved some good results in text
ED

classification:
530 SVM: Support Vector Machine [64].
MNB: Multinomial naive Bayes with uni-bigrams [65].
PT

NBSVM: SVM variant using naive Bayes log-count ratios as feature values
proposed by Wang [65].
CE

RAE: Semi-supervised recursive auto-encoders with pre-trained word vectors


535 from Wikipedia proposed by Socher [66].
MV-RNN: Recursive neural network using a vector and a matrix on every
AC

node in a parse tree for semantic compositionality proposed by Socher [67].


RNTN: Recursive deep neural network for semantic compositionality over a
sentiment treebank using tensor-based feature function proposed by Socher [62].

24
ACCEPTED MANUSCRIPT

540 Paragraph-Vec: An unsupervised algorithm learning distributed feature rep-


resentations from sentences and documents proposed by Le [68].
DCNN: Dynamic convolutional neural network with dynamic k-max pooling
operation proposed by Kalchbrenner [69].

T
CNN-static: 1d-CNN with pre-trained word embedding vector from word2vec

IP
545 proposed by Kim [18].
CNN-non-static: 1d-CNN with pre-trained word embedding and fine-tuning

CR
optimizing strategy proposed by Kim [18].
CNN-multichannel: 1d-CNN with two sets of pre-trained word embeddings
proposed by Kim [18].
550

proposed by Irsoy [70]. US


DRNN: Deep recursive neural networks with stacked multiple recursive layers

Multi-task LSTM: A multi-task learning framework using LSTM to jointly


AN
learn across multiple related tasks proposed by Liu [71].
Tree LSTM: A generalization of LSTM to tree structured network topologies
555 proposed by Tai [13].
M

P-LSTM: A model introduces the phrase factor mechanism which combines


the feature vectors of the phrase embedding layer and the LSTM hidden layer
ED

to extract more exact information from the text proposed by Lu [34].


C-LSTM: A model combining with the strengths of CNN and RNN for sen-
560 tence representation and text classification proposed by Zhou [72].
PT

LSTM: Long short term memory.


BiLSTM: Bidirectional long short term memory.
CE

4.3. Results

4.3.1. Overall comparison


AC

565 In this section, our evaluation results are shown on the sentiment classifica-
tion and question type classification tasks. Moreover, some approach analysis
are given.
a) Sentiment classification

25
ACCEPTED MANUSCRIPT

The comparison results for long reviews (RT-2k and IMDB) and short re-
570 views (MR, SST-1, SST-2 and Subj) are presented in Table 2. The experimental
results are evaluated by the classification accuracy. The best results are shown in
boldface. The top 3 approaches are conventional machine learning approaches

T
with hand-crafted features. Other 16 approaches, including our approach, are

IP
deep neural network (DNN) approaches, which can automatically extract fea-
575 tures from the input data for classifier training without feature engineering. The

CR
results of the top 16 approaches are taken from [18, 27, 34, 72].
From Table 2, AC-BiLSTM achieves better results than other methods on
the majority of the benchmark datasets. Among the 19 approaches mentioned

580 US
above, our approach outperforms other baselines on all datasets except SST-
1. The results of AC-BiLSTM are 83.2%, 91.8%, 88.3%,93.0% and 94.0% for
MR, IMDB, SST-2, RT-2k and Subj datasets. AC-BiLSTM gives the relative
AN
improvements of 2.09%, 0.33%, 0.23%, 3.91% and 0.21% compared to CNN-non-
static on MR dataset, P-LSTM on IMDB dataset, CNN-multichannel on SST-2
dataset, NBSVM on RT-2k dataset and P-LSTM on Subj dataset, respectively.
M

585 It is observed that comparing with four CNN-based methods (DCNN, CNN-
static, CNN-non-static and CNN-multichannel), AC-BiLSTM gives better re-
ED

sults on the four datasets. Compared to six LSTM-based methods (Multi-task


LSTM, Tree-LSTM, P-LSTM, C-LSTM, LSTM and BiLSTM), AC-BiLSTM
gives superior performance on the five datasets. AC-BiLSTM outperforms three
PT

590 hand-crafted features based methods (SVM, MNB and NBSVM) and other
methods (RAE, MV-RNN, RNTN, Paragraph-Vec and DRNN) on all datasets.
CE

For the dataset SST-1, where the data is divided into 5 classes, Tree-LSTM is
the only method to arrive at above 50%. But our approach do not differ sig-
nificantly from the result of Tree-LSTM. It demonstrates that AC-BiLSTM, as
AC

595 an end-to-end model, the results are still promising and comparable with those
models that heavily rely on linguistic annotations and knowledge. This indicates
that AC-BiLSTM will be more feasible for various scenarios. Simultaneously, it
can be seen that the performance of DNN-based methods is better than that of
the conventional machine learning approaches.

26
ACCEPTED MANUSCRIPT

600 b) Question type classification


The prediction accuracy on TREC question classification is reported in Table
3. The SVM classifier uses unigrams, bigrams, wh-word, head word, POS tags,
parser, hypernyms, WordNet synsets as engineered features and 60 hand-coded

T
rules. Ada-CNN [73] is a self-adaptive hierarchical sentence model with gating

IP
605 networks and it is added to the baseline models. Other baseline models have
been introduced in the section 4.2.

CR
From Table 3, it can be seen that our approach achieves better results than
other baselines on the TREC dataset and the result of AC-BiLSTM is 97.0%.
Our approach gives the relative improvements of 1.57% and 3.63% compared
610

US
to BiLSTM and CNN-non-static on TERC dataset, respectively. Comparing
with the CNN-based methods and the LSTM-based methods, AC-BiLSTM gives
superior performance on the question type classification dataset. For the TREC
AN
dataset, the results of LSTM-based methods are better than these of the CNN-
based methods. It shows that the LSTM-based methods are more suitable than
615 the CNN-based methods for this task. As shown above, AC-BiLSTM captures
M

intentions of TREC questions well.


Combined with the results in sentiment classification and question classifi-
ED

cation, our results consistently outperform the most of the published baseline
models. In view of the above discussion it can be concluded that the overall
620 performance of AC-BiLSTM is better than that of the state-of-the-art methods
PT

in terms of the classification accuracy.

4.3.2. Effect of each component of AC-BiLSTM


CE

AC-BiLSTM contains three components, namely, the convolutional layer,


BiLSTM, the attention mechanism layers. For AC-BiLSTM, it should be proven
AC

625 that all components are useful for the final results. In this section, a set of ex-
periments are to investigate the effect of each component on the performance
of AC-BiLSTM. AC-BiLSTM without the convolutional layer, AC-BiLSTM re-
placing BiLSTM with LSTM, AC-BiLSTM without the attention mechanism
layers and AC-BiLSTM are compared in this section. The relative improve-

27
ACCEPTED MANUSCRIPT

630 ment ratio ∆ and the classification accuracy are used as the evaluation metric.
The relative improvement ratio ∆ is calculated as follows:

\Delta = (ACC_{AC-BiLSTM} - ACC_{var}) \div ACC_{var}    (16)

T
where ACCAC−BiLST M is the classification accuracy of our approach and ACCvar

IP
is the classification accuracy of each AC-BiLSTM variant. A-BiLSTM is the
classification accuracy of AC-BiLSTM without the convolutional layer. AC-

CR
635 LSTM is the classification accuracy of AC-BiLSTM replacing BiLSTM with
LSTM. C-BiLSTM is the classification accuracy of AC-BiLSTM without the
attention mechanism layers. The results are presented in Table 4.

640
US
From Table 4, it can be seen that the attention mechanism layers and BiLSTM have a powerful influence on the performance of AC-BiLSTM. Among all the architectures mentioned above, AC-BiLSTM obtains the best results. Compared with AC-LSTM and C-BiLSTM, AC-BiLSTM brings relative improvements of 0.21% to 2.09%. It is observed that the performance of AC-BiLSTM decreases considerably when the attention mechanism layers or BiLSTM are removed. In AC-BiLSTM, the attention mechanism layers can identify the contribution of each word to the text and BiLSTM can obtain both preceding and succeeding information. These components effectively improve the classification accuracy of AC-BiLSTM. Compared with A-BiLSTM, AC-BiLSTM brings relative improvements of 0.21% to 0.88%, which means that the influence of the convolutional layer on our approach is lower than that of the other components. Nevertheless, the convolutional layer still helps to improve the classification accuracy. For AC-BiLSTM, the importance of the attention mechanism layers and BiLSTM is higher than that of the convolutional layer. This confirms that all components are useful for the final results of AC-BiLSTM.
The input of the attention mechanism layers also has an important influence on the classification results. In AC-BiLSTM, the preceding and succeeding contexts are fed to different attention mechanism layers. In this section, a set of experiments is conducted to investigate the effect of the different inputs of the attention mechanism layers on the performance of AC-BiLSTM. AC-BiLSTM with a single attention mechanism layer and AC-BiLSTM are compared. In AC-BiLSTM with the single attention mechanism layer, the preceding and succeeding contextual features are concatenated together to form the output of BiLSTM, and this output is then used as the input of the single attention mechanism layer. The relative improvement ratio ∆ and the classification accuracy are used as the evaluation metrics. The results are presented in Table 5, where A1C-BiLSTM denotes AC-BiLSTM with the single attention mechanism layer.

From Table 5, it can be seen that using two attention mechanism layers to process the forward and backward information separately is better than using a single attention mechanism layer to process the concatenation of the forward and backward information. AC-BiLSTM gives relative improvements of 1.09% and 0.94% compared to A1C-BiLSTM on the RT-2k dataset and the TREC dataset, respectively. Except for the IMDB dataset, AC-BiLSTM achieves better results on all other datasets. This confirms that processing the forward and backward information with two separate attention mechanism layers further improves the performance of AC-BiLSTM.
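The contrast between the two settings can be sketched in a few lines of PyTorch. The attention layer below is a generic soft attention used only for illustration; the layer sizes and the scoring function are assumptions and do not necessarily match the exact configuration of AC-BiLSTM.

```python
import torch
import torch.nn as nn


class SoftAttention(nn.Module):
    """Generic soft attention: scores every time step and returns a weighted sum."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, 1))

    def forward(self, h):                               # h: (batch, seq_len, dim)
        weights = torch.softmax(self.score(h), dim=1)   # (batch, seq_len, 1)
        return (weights * h).sum(dim=1)                 # (batch, dim)


hidden = 100
bilstm = nn.LSTM(input_size=300, hidden_size=hidden,
                 bidirectional=True, batch_first=True)
x = torch.randn(8, 50, 300)          # a toy batch of 8 embedded sentences
out, _ = bilstm(x)                   # (8, 50, 2 * hidden): forward and backward states

# A1C-BiLSTM-style: a single attention layer over the concatenated directions.
sent_single = SoftAttention(2 * hidden)(out)                      # (8, 200)

# AC-BiLSTM-style: separate attention layers for the two directions.
fwd, bwd = out[..., :hidden], out[..., hidden:]
sent_dual = torch.cat([SoftAttention(hidden)(fwd),
                       SoftAttention(hidden)(bwd)], dim=-1)       # (8, 200)
```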


4.3.3. Tuning of hyperparameters in the one-dimensional convolutional layer


The convolutional layer usually uses fixed-size convolution filters. This means that a fixed-size window slides from the beginning to the end of a text to produce feature maps, which is equivalent to extracting fixed-size n-gram features. Therefore, it is especially important to choose an appropriate convolution window size. In order to verify the impact of the window size of the convolution filters applied in the one-dimensional convolutional layer, a set of experiments is conducted to investigate the effect of the window size. The window size m is set as follows: m = 2, 3, 5, 7. Except for the window size, all other parameters are kept unchanged. The results are presented in Fig. 5.

Figure 5: Accuracy with different window sizes

As shown in Fig. 5, the accuracy is not significantly influenced by the window size on most datasets. For the IMDB dataset, the window size m = 2 gives a relative improvement of 0.98% compared to m = 3. For the other datasets, the window size m = 3 has excellent or comparable performance compared to the other window sizes. The results show that a window size of 3 yields better classification accuracy.
The stride size of the convolutional sliding window also affects the features extracted by the convolutional layer. In this section, a set of experiments is conducted to investigate the effect of the stride size of the convolutional sliding window. The stride size s is set as follows: s = 1, 2, 3, 4. Except for the stride size, all other parameters are kept unchanged. The results are presented in Fig. 6.

Figure 6: Accuracy with different stride sizes

From Fig. 6, it can be seen that the classification accuracy of our method is significantly reduced on all datasets when the stride size increases. For the long-sentence datasets (IMDB and RT-2k), the performance of our method is less affected by the stride size, but for the short-sentence datasets (MR and SST-2), the performance declines dramatically. The reason is that increasing the stride size means that the convolutional layer loses more semantic information. In short sentences each word carries relatively more semantic information, whereas in long sentences each word carries relatively less, so when the stride size increases, short sentences lose more semantic information than long sentences. In short, reducing the stride size achieves better results.
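The effect of the two hyperparameters can be seen directly from the output length of a one-dimensional convolution over the embedded text, as in the following PyTorch sketch (the batch size, embedding dimension and number of filters are illustrative assumptions, not the configuration of the original experiments):

```python
import torch
import torch.nn as nn

embed_dim, n_filters, seq_len = 300, 128, 60
x = torch.randn(4, embed_dim, seq_len)   # (batch, embedding dim, sentence length)

# Window size m = 3 with stride s = 1: one feature per 3-gram at every position.
conv_s1 = nn.Conv1d(embed_dim, n_filters, kernel_size=3, stride=1)
print(conv_s1(x).shape)                  # torch.Size([4, 128, 58])

# The same window with stride s = 3: positions are skipped, so fewer n-grams
# are kept and more local semantic information is discarded.
conv_s3 = nn.Conv1d(embed_dim, n_filters, kernel_size=3, stride=3)
print(conv_s3(x).shape)                  # torch.Size([4, 128, 20])
```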

4.3.4. Comparison of word embedding vector variants


In word2vec, there are four ways to generate word embedding vectors. The commonly used strategies for the model input are mainly the pre-training method and the random method. Pre-trained word embedding vector means that the model is initialized with the pre-trained vectors from word2vec. Random word embedding vector means that all words are randomly initialized and then modified during training. Generally, the different methods to generate word embedding vectors have different effects on the classification performance. In this section, a set of experiments is conducted to investigate the effect of the different methods to generate word embedding vectors on the performance of AC-BiLSTM. In order to disentangle the effect of the above word embedding variations from other random factors, all parameters are kept unchanged. In this section, recall, precision and F1-score are used to measure binary classification performance. Recall is the fraction of relevant instances that have been retrieved over the total amount of relevant instances. Precision is the fraction of relevant instances among the retrieved instances. For a classifier dedicated to binary classification, F1-score is an important indicator and it is the combination of recall and precision. F1-score is calculated as follows:
F1\text{-}score = \frac{2 \times Precision \times Recall}{Precision + Recall} \tag{17}
The higher the three indicators are, the better the binary classification performance and robustness should be. Table 6 shows the performance comparison results on the binary classification datasets.
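For reference, the three indicators can be computed from the binary confusion counts as in the short Python sketch below (standard definitions matching Eq. (17), not code from the original experiments; the counts are a made-up example):

```python
def binary_metrics(tp: int, fp: int, fn: int):
    """Recall, precision and F1-score from the binary confusion counts."""
    recall = tp / (tp + fn)        # retrieved relevant / all relevant instances
    precision = tp / (tp + fp)     # retrieved relevant / all retrieved instances
    f1 = 2 * precision * recall / (precision + recall)   # Eq. (17)
    return recall, precision, f1

# Toy example: 90 true positives, 10 false positives, 15 false negatives.
print(binary_metrics(90, 10, 15))  # (0.857..., 0.9, 0.878...)
```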
From Table 6, AC-BiLSTM using pre-trained word embedding vectors achieves better results than AC-BiLSTM using random word embedding vectors on all datasets. For F1-score, AC-BiLSTM using pre-trained word embedding vectors obtains higher results than AC-BiLSTM using random word embedding vectors. For the MR and Subj datasets, the pre-trained word embedding vectors are superior to the random word embedding vectors in all indicators. For the RT-2k and SST-2 datasets, the pre-trained word embedding vectors are superior in precision and F1-score, while the random word embedding vectors perform better in recall. For IMDB, the random word embedding vectors only perform better in precision, while the pre-trained word embedding vectors perform better in the other indicators. Compared to random word embedding vectors, pre-trained word embedding vectors have obvious advantages.

The F1-score comparison results on the five binary classification datasets are depicted in Fig. 7. From Fig. 7, it can be seen that the pre-trained word embedding vectors obtain better classification performance than the random word embedding vectors. For all datasets, the F1-scores of the pre-trained word embedding vectors are higher than those of the random word embedding vectors. The results suggest that the pre-trained word vectors are good universal feature extractors and can be utilized across datasets.

Figure 7: Statistics and comparison of word embedding vector variants
4.4. Discussions

For AC-BiLSTM, the purpose of the convolutional layer is to preprocess the input text data. Owing to its capability of capturing local correlations of spatial or temporal structures, the convolutional layer performs excellently in extracting n-gram features at different positions of the text through the convolutional filters applied to the word vectors. In addition, the convolutional layer reduces the parameters of the network. Compared to LSTM, BiLSTM can access both the preceding and succeeding contextual information, so BiLSTM can more effectively learn the context of each word in the text. The attention mechanism mainly identifies the influence of each word on the sentence; it assigns attention weights to each word and can capture the important components of the sentence semantics. The combination of these methods makes the understanding of sentence semantics more accurate and improves the classification ability of AC-BiLSTM. Experiments show that the convolutional layer, BiLSTM and the attention mechanism all have an important influence on the performance of AC-BiLSTM. It is worthwhile to note that BiLSTM and the attention mechanism have greater effects than the convolutional layer on the classification accuracy. For the convolutional layer, the convolution window size and the stride size also affect the performance of AC-BiLSTM. Experiments also show that AC-BiLSTM achieves the best results when the window size is 3 and the stride size is 1.
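Putting the three components together, the overall data flow can be sketched as follows. This is a minimal illustrative PyTorch model, assuming arbitrary layer sizes and a simple linear attention scorer; it is not the exact implementation evaluated in this paper.

```python
import torch
import torch.nn as nn


class ACBiLSTMSketch(nn.Module):
    """Illustrative pipeline: embedding -> 1D convolution -> BiLSTM ->
    two attention branches -> softmax classifier."""

    def __init__(self, vocab_size, n_classes, embed_dim=300, n_filters=128, hidden=100):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Conv1d(embed_dim, n_filters, kernel_size=3, stride=1)
        self.bilstm = nn.LSTM(n_filters, hidden, bidirectional=True, batch_first=True)
        self.score_f = nn.Linear(hidden, 1)   # attention scorer, forward states
        self.score_b = nn.Linear(hidden, 1)   # attention scorer, backward states
        self.fc = nn.Linear(2 * hidden, n_classes)
        self.hidden = hidden

    def attend(self, h, score):
        w = torch.softmax(score(h), dim=1)    # weights over the time steps
        return (w * h).sum(dim=1)

    def forward(self, tokens):                # tokens: (batch, seq_len) of word ids
        e = self.embed(tokens).transpose(1, 2)          # (batch, embed_dim, seq_len)
        p = torch.relu(self.conv(e)).transpose(1, 2)    # phrase-level features
        h, _ = self.bilstm(p)                           # (batch, seq_len', 2 * hidden)
        fwd, bwd = h[..., :self.hidden], h[..., self.hidden:]
        s = torch.cat([self.attend(fwd, self.score_f),
                       self.attend(bwd, self.score_b)], dim=-1)
        return torch.log_softmax(self.fc(s), dim=-1)


model = ACBiLSTMSketch(vocab_size=20000, n_classes=2)
print(model(torch.randint(0, 20000, (4, 50))).shape)    # torch.Size([4, 2])
```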
For text classification, the method used to generate the word embedding vectors can affect the classification accuracy. Compared to pre-trained word embedding vectors, random word embedding vectors require training more parameters, which causes relatively lower classification accuracy within a limited number of iterations. The experiments show that pre-trained word embedding vectors achieve better results than random word embedding vectors. Hence, the pre-trained word embedding vectors are more suitable for AC-BiLSTM.
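The two strategies differ only in how the embedding matrix is initialized before training, as in the following sketch (the uniform range, the 300-dimensional size and the `pretrained` lookup are illustrative assumptions; the pre-trained vectors would be loaded from a word2vec file by any convenient means):

```python
import numpy as np
import torch
import torch.nn as nn


def build_embedding(vocab, pretrained=None, dim=300):
    """Random initialization, optionally overwritten with pre-trained word2vec vectors.

    `vocab` maps word -> row index; `pretrained` maps word -> numpy vector of size `dim`.
    """
    weights = np.random.uniform(-0.25, 0.25, (len(vocab), dim)).astype("float32")
    if pretrained is not None:
        for word, idx in vocab.items():
            if word in pretrained:          # words missing from word2vec stay random
                weights[idx] = pretrained[word]
    # freeze=False lets both variants be fine-tuned during training
    return nn.Embedding.from_pretrained(torch.from_numpy(weights), freeze=False)

# Random variant:       build_embedding(vocab)
# Pre-trained variant:  build_embedding(vocab, pretrained=word2vec_vectors)
```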
All experimental results indicate that the combination of the convolutional layer, BiLSTM and the attention mechanism remarkably improves text classification accuracy. For most of the benchmark datasets, AC-BiLSTM obtains better results than the other baseline models. This shows that AC-BiLSTM has better classification ability and that our proposed AC-BiLSTM performs better than some state-of-the-art DNNs.
Table 2: Experimental results of sentiment classification accuracy. % is omitted and "-" indicates no data, i.e. this dataset is not used by the method.

Model              MR    IMDB  SST-1  SST-2  RT-2k  Subj
SVM                -     89.2  40.7   79.4   87.4   91.7
MNB                79.0  86.6  -      -      85.9   93.6
NBSVM              79.4  91.2  -      -      89.5   93.2
RAE                77.7  -     43.2   82.4   -      -
MV-RNN             79.0  -     44.4   82.9   -      -
RNTN               -     -     45.7   85.4   -      -
Paragraph-Vec      -     -     48.7   87.8   -      -
DCNN               -     -     48.5   86.8   -      -
CNN-static         81.0  -     45.5   86.8   -      93.0
CNN-non-static     81.5  -     48.0   87.2   -      93.4
CNN-multichannel   81.1  -     47.4   88.1   -      93.2
DRNN               -     -     49.8   86.6   -      -
Multi-task LSTM    -     -     49.6   87.9   -      -
Tree-LSTM          -     -     50.6   86.9   -      -
P-LSTM             -     91.5  -      -      89.3   93.8
C-LSTM             -     -     49.2   87.8   -      -
LSTM               80.1  87.0  48.0   86.4   86.7   91.3
BiLSTM             80.3  87.9  48.4   88.0   87.2   92.3
AC-BiLSTM          83.2  91.8  48.9   88.3   93.0   94.0

Table 3: The 6-way question type classification accuracy on TREC. % is omitted and "ACC" means the classification accuracy.

Model              ACC   Reported in
SVM                95.0  Silva [74]
Paragraph-Vec      91.8  Zhao [73]
Ada-CNN            92.4  Zhao [73]
CNN-non-static     93.6  Kim [18]
CNN-multichannel   92.2  Kim [18]
DCNN               93.0  Kalchbrenner [69]
C-LSTM             94.6  Zhou [72]
LSTM               95.3  Our implementation
BiLSTM             95.5  Our implementation
AC-BiLSTM          97.0  Our implementation
Table 4: Effect of each component on the performance of AC-BiLSTM. % is omitted, "ACC" means the classification accuracy and "-" indicates no data.

          A-BiLSTM      AC-LSTM       C-BiLSTM      AC-BiLSTM
Dataset   ACC    ∆      ACC    ∆      ACC    ∆      ACC    ∆
MR        82.7   0.60   81.5   2.09   81.5   2.09   83.2   -
IMDB      91.0   0.88   90.8   1.10   90.0   2.00   91.8   -
SST-1     48.8   0.20   48.2   1.45   48.5   0.82   48.9   -
SST-2     88.0   0.34   87.6   0.80   87.9   0.46   88.3   -
RT-2k     90.9   2.31   91.0   2.20   89.7   3.68   93.0   -
Subj      93.8   0.21   93.4   0.64   92.5   1.62   94.0   -
TREC      96.7   0.31   96.3   0.73   96.8   0.21   97.0   -

Table 5: Effect of the different input of the attention mechanism layer on the performance of AC-BiLSTM. % is omitted, "ACC" means the classification accuracy and "-" indicates no data.

          A1C-BiLSTM    AC-BiLSTM
Dataset   ACC    ∆      ACC    ∆
MR        82.5   0.85   83.2   -
IMDB      91.8   0      91.8   -
SST-1     48.7   0.41   48.9   -
SST-2     88.2   0.11   88.3   -
RT-2k     92.0   1.09   93.0   -
Subj      93.4   0.64   94.0   -
TREC      96.1   0.94   97.0   -

Table 6: Experimental results of different word embedding vector generation methods on binary classification datasets. % is omitted.

Dataset   Embedding variant   Recall   Precision   F1-score
MR        Pre-trained         83.48    82.86       83.17
          Random              78.80    77.20       78.00
IMDB      Pre-trained         92.09    88.91       90.47
          Random              86.45    89.17       87.78
RT-2k     Pre-trained         89.90    95.70       92.71
          Random              90.91    85.71       88.24
SST-2     Pre-trained         87.62    87.26       87.44
          Random              87.77    85.25       86.49
Subj      Pre-trained         93.76    94.14       93.95
          Random              91.65    92.62       92.13

5. Conclusions

For text classification, feature extraction and the design of the classifier are very important. LSTM has shown strong performance on many real-world and benchmark text classification problems. However, it is still difficult to understand the semantics, and the classification accuracy still needs to be improved. In order to solve these problems, this paper presents an improved LSTM method, namely AC-BiLSTM, in which the convolutional layer, BiLSTM and the attention mechanism are used to enhance semantic understanding and improve the classification accuracy. Experiments are conducted on seven benchmark datasets to evaluate the performance of our presented approach. The experimental results indicate that AC-BiLSTM can understand semantics more accurately and enhances the performance of LSTM in terms of the quality of the final results.

Compared with some state-of-the-art baseline methods, the new method is demonstrated to be more effective and efficient in terms of classification quality in most cases.

Future work focuses on the research of attention mechanisms and the design of network architectures. In addition, the new methods will also be applied to the field of machine reading comprehension. Future work mainly includes the following parts: 1) using other attention mechanisms to further improve our approach; 2) investigating the effect of the attention mechanism on the performance of our approach; 3) designing new attention mechanisms and network architectures; 4) applying our approach to practical applications; 5) applying the designed attention mechanisms and network architectures to machine reading comprehension.

Acknowledgement
The work described in this paper was supported by the National Natural Science Foundation of China under Grant No. 61300127. Any conclusions or recommendations stated here are those of the authors and do not necessarily reflect the official positions of NSFC.

References

[1] A. Watanabe, R. Sasano, H. Takamura, M. Okumura, Generating personalized snippets for web page recommender systems, Transactions of the Japanese Society for Artificial Intelligence 31 (5).

[2] T. A. Almeida, T. P. Silva, I. Santos, J. M. Gomez Hidalgo, Text normalization and semantic indexing to enhance instant messaging and sms spam filtering, Knowledge-Based Systems 108 (September 15, 2016) (2016) 25–32.

[3] B. Liu, Sentiment analysis: Mining opinions, sentiments, and emotions, Cambridge University Press, Cambridge, England, 2015.

[4] L. H. Lee, D. Isa, W. O. Choo, W. Y. Chue, High relevance keyword extraction facility for bayesian text classification on different domains of varying characteristic, Expert Systems with Applications 39 (1) (2012) 1147–1155.

[5] J. Lei, T. Jin, Hierarchical text classification based on bp neural network, Journal of Computational Information Systems 5 (2) (2009) 581–590.

[6] V. N. Phu, V. T. N. Tran, V. T. N. Chau, N. D. Dat, K. L. D. Duy, A decision tree using id3 algorithm for english semantic analysis, International Journal of Speech Technology 20 (3) (2017) 593–613.

[7] P. D. Turney, Thumbs up or thumbs down? semantic orientation applied to unsupervised classification of reviews, in: Proceedings of the 40th annual meeting on association for computational linguistics (ACL), Association for Computational Linguistics (ACL), Philadelphia, Pennsylvania, United states, 2002, pp. 417–424.

[8] B. Pang, L. Lee, A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts, in: Proceedings of the 42nd annual meeting on association for computational linguistics (ACL), Association for Computational Linguistics (ACL), Barcelona, Spain, 2004, pp. 271–278.

[9] B. Liu, Sentiment analysis and opinion mining. Synthesis Lectures on Human Language Technologies, Morgan and Claypool Publishers, San Rafael, CA, United states, 2012.

[10] J. Schmidhuber, Deep learning in neural networks: an overview, Neural Networks 61 (January 01, 2015) (2015) 85–117.

[11] V. Campos, B. Jou, X. Giro-i Nieto, From pixels to sentiment: Fine-tuning cnns for visual sentiment prediction, Image and Vision Computing 65 (September 2017) (2017) 15–22.

[12] L. Brocki, K. Marasek, Deep belief neural networks and bidirectional long-short term memory hybrid for speech recognition, Archives of Acoustics 40 (2) (2015) 191–195.

[13] K. S. Tai, R. Socher, C. D. Manning, Improved semantic representations from tree-structured long short-term memory networks, in: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing ACL-IJCNLP 2015, Association for Computational Linguistics (ACL), Beijing, China, 2015, pp. 1556–1566.

[14] B. Pang, L. Lee, Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales, in: Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics (ACL), Ann Arbor, MI, United states, 2005, pp. 115–124.

[15] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, P. Kuksa, Natural language processing (almost) from scratch, Journal of Machine Learning Research 12 (August 2011) (2011) 2493–2537.

[16] A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional neural networks, in: Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012 (NIPS'2012), Lake Tahoe, NV, United states, 2012, pp. 1097–1105.

[17] K.-i. Funahashi, Y. Nakamura, Approximation of dynamical systems by continuous time recurrent neural networks, Neural Networks 6 (6) (1993) 801–806.

[18] Y. Kim, Convolutional neural networks for sentence classification, in: Proceedings of 2014 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics (ACL), Doha, Qatar, 2014, pp. 1746–1751.

[19] S. Liao, J. Wang, R. Yu, K. Sato, Z. Cheng, Cnn for situations understanding based on sentiment analysis of twitter data, in: Proceedings of the 8th International Conference on Advances in Information Technology, Elsevier B.V., Macau, China, 2016, pp. 376–381.

[20] W. Cao, A. Song, J. Hu, Stacked residual recurrent neural network with word weight for text classification, IAENG International Journal of Computer Science 44 (3) (2017) 277–284.

[21] Y. Zhang, M. J. Er, R. Venkatesan, N. Wang, M. Pratama, Sentiment classification using comprehensive attention recurrent models, in: Proceedings of 2016 International Joint Conference on Neural Networks, IEEE, Vancouver, BC, Canada, 2016, pp. 1562–1569.

[22] L. Wang, Z. Wang, S. Liu, An effective multivariate time series classification approach using echo state network and adaptive differential evolution algorithm, Expert Systems with Applications 43 (January 1, 2016) (2016) 237–249.
[23] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Computation 9 (8) (1997) 1735–1780.

[24] A. Graves, J. Schmidhuber, Framewise phoneme classification with bidirectional lstm and other neural network architectures, Neural Networks 18 (5-6) (2005) 602–610.

[25] W. Liu, P. Liu, Y. Yang, Y. Gao, J. Yi, An attention-based syntax-tree and tree-lstm model for sentence summarization, International Journal of Performability Engineering 13 (5) (2017) 775–782.

[26] J. Nowak, A. Taspinar, R. Scherer, Lstm recurrent neural networks for short text and sentiment classification, in: Proceedings of the 16th International Conference on Artificial Intelligence and Soft Computing, Springer Verlag, Zakopane, Poland, 2017, pp. 553–562.

[27] T. Chen, R. Xu, Y. He, X. Wang, Improving sentiment analysis via sentence type classification using bilstm-crf and cnn, Expert Systems with Applications 72 (April 15, 2017) (2017) 221–230.

[28] X. Niu, Y. Hou, P. Wang, Bi-directional lstm with quantum attention mechanism for sentence modeling, in: Proceedings of the 24th International Conference on Neural Information Processing, Springer Verlag, Guangzhou, China, 2017, pp. 178–188.

[29] M.-T. Luong, H. Pham, C. D. Manning, Effective approaches to attention-based neural machine translation, in: Proceedings of Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Lisbon, Portugal, 2015, pp. 1412–1421.

[30] Z. Zhang, Y. Zou, C. Gan, Textual sentiment analysis via three different attention convolutional neural networks and cross-modality consistent regression, Neurocomputing 275 (31 January 2018) (2018) 1407–1415.

[31] A. Kolawole John, L. Di Caro, L. Robaldo, G. Boella, Textual inference with tree-structured lstm, in: Proceedings of the 28th Benelux Conference on Artificial Intelligence, Revised Selected Papers, Springer Verlag, Amsterdam, Netherlands, 2016, pp. 17–31.

[32] S. K. Sonderby, C. K. Sonderby, H. Nielsen, O. Winther, Convolutional lstm networks for subcellular localization of proteins, in: Proceedings of the 2nd International Conference on Algorithms for Computational Biology, Springer Verlag, Mexico City, Mexico, 2015, pp. 68–80.

[33] J. Wang, L.-C. Yu, K. R. Lai, X. Zhang, Dimensional sentiment analysis using a regional cnn-lstm model, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016 - Short Papers, Association for Computational Linguistics (ACL), Berlin, Germany, 2016, pp. 225–230.

[34] C. Lu, H. Huang, P. Jian, D. Wang, Y.-D. Guo, A p-lstm neural network for sentiment classification, in: Proceedings of 21st Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer Verlag, Jeju, Korea, Republic of, 2017, pp. 524–533.

[35] J. C. Nunez, R. Cabido, J. J. Pantrigo, A. S. Montemayor, J. F. Velez, Convolutional neural networks and long short-term memory for skeleton-based human activity and hand gesture recognition, Pattern Recognition 76 (April 2018) (2018) 80–94.

[36] T. Le, G. Bui, Y. Duan, A multi-view recurrent neural network for 3d mesh segmentation, Computers and Graphics (Pergamon) 66 (August 2017) (2017) 103–112.

[37] K. Xu, J. L. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. S. Zemel, Y. Bengio, Show, attend and tell: Neural image caption generation with visual attention, in: Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, International Machine Learning Society (IMLS), Lille, France, 2015, pp. 2048–2057.

[38] M.-T. Luong, H. Pham, C. D. Manning, Effective approaches to attention-based neural machine translation, in: Proceedings of Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Association for Computational Linguistics (ACL), Lisbon, Portugal, 2015, pp. 1412–1421.

[39] Z. Lin, M. Feng, C. N. dos Santos, M. Yu, B. Xiang, B. Zhou, Y. Bengio, A structured self-attentive sentence embedding, CoRR abs/1703.03130. arXiv:1703.03130. URL http://arxiv.org/abs/1703.03130

[40] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in: Proceedings of the 31st Annual Conference on Neural Information Processing Systems, NIPS 2017, Neural information processing systems foundation, Long Beach, CA, United states, 2017, pp. 5999–6009.

[41] T. Shen, T. Zhou, G. Long, J. Jiang, C. Zhang, Bi-directional block self-attention for fast and memory-efficient sequence modeling, CoRR abs/1804.00857. arXiv:1804.00857. URL http://arxiv.org/abs/1804.00857

[42] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, E. Hovy, Hierarchical attention networks for document classification, in: Proceedings of the 15th Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL HLT 2016, Association for Computational Linguistics (ACL), San Diego, CA, United states, 2016, pp. 1480–1489.

[43] Y. Cui, Z. Chen, S. Wei, S. Wang, T. Liu, G. Hu, Attention-over-attention neural networks for reading comprehension, in: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Association for Computational Linguistics (ACL), Vancouver, BC, Canada, 2017, pp. 593–602.
[44] H. Li, M. R. Min, Y. Ge, A. Kadav, A context-aware attention network for interactive question answering, in: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2017, Association for Computing Machinery, Halifax, NS, Canada, 2017, pp. 927–935.

[45] R. Paulus, C. Xiong, R. Socher, A deep reinforced model for abstractive summarization, CoRR abs/1705.04304. arXiv:1705.04304. URL http://arxiv.org/abs/1705.04304

[46] H. Huang, C. Zhu, Y. Shen, W. Chen, Fusionnet: Fusing via fully-aware attention with application to machine comprehension, CoRR abs/1711.07341. arXiv:1711.07341. URL http://arxiv.org/abs/1711.07341

[47] M. J. Seo, A. Kembhavi, A. Farhadi, H. Hajishirzi, Bidirectional attention flow for machine comprehension, CoRR abs/1611.01603. arXiv:1611.01603. URL http://arxiv.org/abs/1611.01603

[48] M. Daniluk, T. Rocktäschel, J. Welbl, S. Riedel, Frustratingly short attention spans in neural language modeling, CoRR abs/1702.04521. arXiv:1702.04521. URL http://arxiv.org/abs/1702.04521

[49] A. P. Parikh, O. Täckström, D. Das, J. Uszkoreit, A decomposable attention model for natural language inference, CoRR abs/1606.01933. arXiv:1606.01933. URL http://arxiv.org/abs/1606.01933

[50] M. Yang, W. Tu, J. Wang, F. Xu, X. Chen, Attention-based lstm for target-dependent sentiment classification, in: Proceedings of the 31st AAAI Conference on Artificial Intelligence, AAAI press, San Francisco, CA, United states, 2017, pp. 5013–5014.
[51] X. Wei, H. Lin, L. Yang, Y. Yu, A convolution-lstm-based deep neural network for cross-domain mooc forum post classification, Information (Switzerland) 8 (3).

[52] Y. Luo, Recurrent neural networks for classifying relations in clinical notes, Journal of Biomedical Informatics 72 (August 2017) (2017) 85–95.

[53] F. Hu, L. Li, Z.-L. Zhang, J.-Y. Wang, X.-F. Xu, Emphasizing essential words for sentiment classification based on recurrent neural networks, Journal of Computer Science and Technology 32 (4) (2017) 785–795.

[54] M. Huang, Q. Qian, X. Zhu, Encoding syntactic knowledge in neural networks for sentiment classification, ACM Transactions on Information Systems 35 (3).

[55] D. Wu, M. Chi, Long short-term memory with quadratic connections in recursive neural networks for representing compositional semantics, IEEE Access 5 (2017) (2017) 16077–16083.

[56] D. Tang, B. Qin, T. Liu, Document modeling with gated recurrent neural network for sentiment classification, in: Proceedings of Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics (ACL), Lisbon, Portugal, 2015, pp. 1422–1432.

[57] Y. Wang, S. Feng, D. Wang, Y. Zhang, G. Yu, Context-aware chinese microblog sentiment classification with bidirectional lstm, in: Proceedings of the 18th Asia-Pacific Web Conference on Web Technologies and Applications, Springer Verlag, Suzhou, China, 2016, pp. 594–606.

[58] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, CoRR abs/1301.3781. arXiv:1301.3781.

[59] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, in: Proceedings of the 3rd International Conference for Learning Representations, Springer Verlag, San Diego, CA, United states, 2015, pp. 1–15.
[60] S. Lai, L. Xu, K. Liu, J. Zhao, Recurrent convolutional neural networks for text classification, in: Proceedings of the 29th AAAI Conference on Artificial Intelligence, AAAI 2015 and the 27th Innovative Applications of Artificial Intelligence Conference, IAAI 2015, AI Access Foundation, Austin, TX, United states, 2015, pp. 2267–2273.

[61] A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, C. Potts, Learning word vectors for sentiment analysis, in: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics (ACL), Portland, OR, United states, 2011, pp. 142–150.

[62] R. Socher, A. Perelygin, J. Y. Wu, J. Chuang, C. D. Manning, A. Y. Ng, C. Potts, Recursive deep models for semantic compositionality over a sentiment treebank, in: Proceedings of 2013 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics (ACL), Seattle, WA, United states, 2013, pp. 1631–1642.

[63] Y. Bengio, R. Ducharme, P. Vincent, C. Jauvin, A neural probabilistic language model, Journal of Machine Learning Research 3 (6) (2003) 1137–1155.

[64] Y. Liu, J.-W. Bi, Z.-P. Fan, A method for multi-class sentiment classification based on an improved one-vs-one (ovo) strategy and the support vector machine (svm) algorithm, Information Sciences 394-395 (July 1, 2017) (2017) 38–52.

[65] S. Wang, C. D. Manning, Baselines and bigrams: Simple, good sentiment and topic classification, in: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics (ACL), Jeju Island, Korea, Republic of, 2012, pp. 90–94.

[66] R. Socher, J. Pennington, E. H. Huang, A. Y. Ng, C. D. Manning, Semi-supervised recursive autoencoders for predicting sentiment distributions, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics (ACL), Edinburgh, United kingdom, 2011, pp. 151–161.
[67] R. Socher, B. Huval, C. D. Manning, A. Y. Ng, Semantic compositionality through recursive matrix-vector spaces, in: Proceedings of 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Association for Computational Linguistics (ACL), Jeju Island, Korea, Republic of, 2012, pp. 1201–1211.

[68] Q. Le, T. Mikolov, Distributed representations of sentences and documents, in: Proceedings of the 31st International Conference on Machine Learning, International Machine Learning Society (IMLS), Beijing, China, 2014, pp. 2931–2939.

[69] N. Kalchbrenner, E. Grefenstette, P. Blunsom, A convolutional neural network for modelling sentences, in: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics (ACL), Baltimore, MD, United states, 2014, pp. 655–665.

[70] O. Irsoy, C. Cardie, Deep recursive neural networks for compositionality in language, in: Proceedings of the 28th Annual Conference on Neural Information Processing Systems 2014, Neural information processing systems foundation, Montreal, QC, Canada, 2014, pp. 2096–2104.

[71] P. Liu, X. Qiu, H. Xuanjing, Recurrent neural network for text classification with multi-task learning, in: Proceedings of the 25th International Joint Conference on Artificial Intelligence, International Joint Conferences on Artificial Intelligence, New York, NY, United states, 2016, pp. 2873–2879.

[72] C. Zhou, C. Sun, Z. Liu, F. C. M. Lau, A c-lstm neural network for text classification, Computer Science 1 (4) (2015) 39–44.

[73] H. Zhao, Z. Lu, P. Poupart, Self-adaptive hierarchical sentence model, in: Proceedings of the 24th International Joint Conference on Artificial Intelligence, International Joint Conferences on Artificial Intelligence, Buenos Aires, Argentina, 2015, pp. 4069–4076.

[74] J. Silva, L. Coheur, A. C. Mendes, A. Wichert, From symbolic to sub-symbolic information in question classification, Artificial Intelligence Review 35 (2) (2011) 137–154.


Biography

Gang Liu received the Ph.D. degree in computer software and theory from the State Key Laboratory of Software Engineering, Wuhan University, Wuhan, China, in 2012. He is currently an associate professor with the School of Computer Science, Hubei University of Technology, Wuhan, China. He has published more than 20 international journal/conference papers. His current research interests include evolutionary computation, deep learning technology, image processing and natural language processing.

Jiabao Guo is currently a postgraduate student in the School of Computer Science, Hubei University of Technology, Wuhan, China. Her current research interests include evolutionary computation, deep learning technology and natural language processing.